clickhouse

Month: 2023-12

2023-12-03

chewei 23:47:16
租屋資料
https://www.facebook.com/share/p/QrHWnwTgBJDEmsRL/

Facebook

Log into Facebook

Log into Facebook to start sharing and connecting with your friends, family, and people you know.

2023-12-04

pm5 11:11:29
來 cue 一下 @ddio
pm5 11:11:29
來 cue 一下 @ddio
😆 1
ronnywang 11:16:22
所以要來開 clickhouse 分享會了嗎 XD
ronnywang 11:16:22
所以要來開 clickhouse 分享會了嗎 XD
ddio 13:20:15
什麼什麼?歐歐歐我看到了,總之先給程式碼 XD

https://github.com/g0v/tw-rental-house-data/blob/master/csv-aggregator/merge.sql

其實就是一堆 groupBy ,之後會再解釋一下怎麼從 psql migrate 過來~
ddio 13:20:15
什麼什麼?歐歐歐我看到了,總之先給程式碼 XD

https://github.com/g0v/tw-rental-house-data/blob/master/csv-aggregator/merge.sql

其實就是一堆 groupBy ,之後會再解釋一下怎麼從 psql migrate 過來~

<https://github.com/g0v/tw-rental-house-data/blob/master/csv-aggregator/merge.sql | merge.sql>

``` SET format_csv_null_representation = '-'; -- generate raw csv from multiple monthly file select "物件編號", anyHeavy("租屋平台") "租屋平台", anyHeavy("物件網址") "物件網址", anyHeavy("物件首次發現時間") "物件首次發現時間", anyHeavy("物件最後更新時間") "物件最後更新時間", anyHeavy("縣市") "縣市", anyHeavy("鄉鎮市區") "鄉鎮市區", anyHeavy("約略地點_x") "約略地點_x", anyHeavy("約略地點_y") "約略地點_y", anyHeavy("房屋出租狀態") "房屋出租狀態", anyHeavy("出租大約時間") "出租大約時間", anyHeavy("出租所費天數") "出租所費天數", anyHeavy("月租金") "月租金", anyHeavy("押金類型") "押金類型", anyHeavy("押金月數") "押金月數", anyHeavy("押金金額") "押金金額", anyHeavy("需要管理費?") "需要管理費?", anyHeavy("月管理費") "月管理費", anyHeavy("提供車位?") "提供車位?", anyHeavy("需要停車費?") "需要停車費?", anyHeavy("月停車費") "月停車費", anyHeavy("每坪租金(含管理費與停車費)") "每坪租金(含管理費與停車費)", anyHeavy("建築類型") "建築類型", anyHeavy("物件類型") "物件類型", anyHeavy("自報頂加?") "自報頂加?", anyHeavy("所在樓層") "所在樓層", anyHeavy("建物樓高") "建物樓高", anyHeavy("距頂樓層數") "距頂樓層數", anyHeavy("坪數") "坪數", anyHeavy("陽台數") "陽台數", anyHeavy("衛浴數") "衛浴數", anyHeavy("房數") "房數", anyHeavy("客廳數") "客廳數", anyHeavy("格局編碼(陽台/衛浴/房/廳)") "格局編碼(陽台/衛浴/房/廳)", anyHeavy("額外費用_電費?") "額外費用_電費?", anyHeavy("額外費用_水費?") "額外費用_水費?", anyHeavy("額外費用_瓦斯?") "額外費用_瓦斯?", anyHeavy("額外費用_網路?") "額外費用_網路?", anyHeavy("額外費用_第四台?") "額外費用_第四台?", anyHeavy("附近有_學校?") "附近有_學校?", anyHeavy("附近有_公園?") "附近有_公園?", anyHeavy("附近有_百貨公司?") "附近有_百貨公司?", anyHeavy("附近有_超商?") "附近有_超商?", anyHeavy("附近有_傳統市場?") "附近有_傳統市場?", anyHeavy("附近有_夜市?") "附近有_夜市?", anyHeavy("附近有_醫療機構?") "附近有_醫療機構?", anyHeavy("附近的捷運站數") "附近的捷運站數", anyHeavy("附近的公車站數") "附近的公車站數", anyHeavy("附近的火車站數") "附近的火車站數", anyHeavy("附近的高鐵站數") "附近的高鐵站數", anyHeavy("附近的公共自行車數(實驗中)") "附近的公共自行車數(實驗中)", anyHeavy("有身份限制?") "有身份限制?", anyHeavy("有性別限制?") "有性別限制?", anyHeavy("性別限制") "性別限制", anyHeavy("可炊?") "可炊?", anyHeavy("可寵?") "可寵?", anyHeavy("有產權登記?") "有產權登記?", anyHeavy("刊登者類型") "刊登者類型", anyHeavy("刊登者編碼") "刊登者編碼", anyHeavy("仲介資訊") "仲介資訊", anyHeavy("提供家具_床?") "提供家具_床?", anyHeavy("提供家具_桌子?") "提供家具_桌子?", anyHeavy("提供家具_椅子?") "提供家具_椅子?", anyHeavy("提供家具_電視?") "提供家具_電視?", anyHeavy("提供家具_熱水器?") "提供家具_熱水器?", anyHeavy("提供家具_冷氣?") "提供家具_冷氣?", anyHeavy("提供家具_沙發?") "提供家具_沙發?", anyHeavy("提供家具_洗衣機?") "提供家具_洗衣機?", anyHeavy("提供家具_衣櫃?") "提供家具_衣櫃?", anyHeavy("提供家具_冰箱?") "提供家具_冰箱?", anyHeavy("提供家具_網路?") "提供家具_網路?", anyHeavy("提供家具_第四台?") "提供家具_第四台?", anyHeavy("提供家具_天然瓦斯?") "提供家具_天然瓦斯?" from file("raw/*.csv") group by "物件編號" order by "物件編號" asc into outfile 'result/raw.csv' format CSVWithNames; select "租屋平台", count(*) "物件數" from file("result/raw.csv") group by "租屋平台" format Pretty; select count("物件編號") "重複物件數", max("物件編號") "最大物件編號", min("物件編號") "最小物件編號", max("物件首次發現時間") "最大物件首次發現時間", min("物件首次發現時間") "最小物件首次發現時間", max("物件最後更新時間") "物件最後更新時間", "租屋平台", "縣市", "鄉鎮市區", anyHeavy("約略地點_x") "常見約略地點_x", anyHeavy("約略地點_y") "常見約略地點_y", max("房屋出租狀態") "房屋曾出租過", max("出租大約時間") "最後出租時間", max("出租所費天數") "最大出租所費天數", "月租金", "押金類型", "押金月數", "押金金額", "需要管理費?", "月管理費", "提供車位?", "需要停車費?", "月停車費", "每坪租金(含管理費與停車費)", "建築類型", "物件類型", "自報頂加?", "所在樓層", "建物樓高", "距頂樓層數", "坪數", "陽台數", "衛浴數", "房數", "客廳數", "格局編碼(陽台/衛浴/房/廳)", "額外費用_電費?", "額外費用_水費?", "額外費用_瓦斯?", "額外費用_網路?", "額外費用_第四台?", "附近有_學校?", "附近有_公園?", "附近有_百貨公司?", "附近有_超商?", "附近有_傳統市場?", "附近有_夜市?", "附近有_醫療機構?", "附近的捷運站數", "附近的公車站數", "附近的火車站數", "附近的高鐵站數", "附近的公共自行車數(實驗中)", "有身份限制?", "有性別限制?", "性別限制", "可炊?", "可寵?", "有產權登記?", "刊登者類型", anyHeavy("刊登者編碼") "常見刊登者編碼", "仲介資訊", "提供家具_床?","提供家具_桌子?","提供家具_椅子?","提供家具_電視?","提供家具_熱水器?", "提供家具_冷氣?","提供家具_沙發?","提供家具_洗衣機?","提供家具_衣櫃?","提供家具_冰箱?", "提供家具_網路?","提供家具_第四台?","提供家具_天然瓦斯?" from file("result/raw.csv") where "建築類型" in (0, 1, 2) and "物件類型" in (0, 1, 2, 3, 4) and "建物樓高" < 90 and "所在樓層" < 90 and "坪數" < 500 and "每坪租金(含管理費與停車費)" < 15000 group by "租屋平台", "縣市", "鄉鎮市區", "月租金", "押金類型", "押金月數", "押金金額", "需要管理費?", "月管理費", "提供車位?", "需要停車費?", "月停車費", "每坪租金(含管理費與停車費)", "建築類型", "物件類型", "自報頂加?", "所在樓層", "建物樓高", "距頂樓層數", "坪數", "陽台數", "衛浴數", "房數", "客廳數", "格局編碼(陽台/衛浴/房/廳)", "額外費用_電費?", "額外費用_水費?", "額外費用_瓦斯?", "額外費用_網路?", "額外費用_第四台?", "附近有_學校?", "附近有_公園?", "附近有_百貨公司?", "附近有_超商?", "附近有_傳統市場?", "附近有_夜市?", "附近有_醫療機構?", "附近的捷運站數", "附近的公車站數", "附近的火車站數", "附近的高鐵站數", "附近的公共自行車數(實驗中)", "有身份限制?", "有性別限制?", "性別限制", "可炊?", "可寵?", "有產權登記?", "刊登者類型", "仲介資訊", "提供家具_床?","提供家具_桌子?","提供家具_椅子?","提供家具_電視?","提供家具_熱水器?", "提供家具_冷氣?","提供家具_沙發?","提供家具_洗衣機?","提供家具_衣櫃?","提供家具_冰箱?", "提供家具_網路?","提供家具_第四台?","提供家具_天然瓦斯?" order by "重複物件數" desc into outfile 'result/deduplicated.csv' format CSVWithNames; select "租屋平台", count(*) "物件數" from file("result/deduplicated.csv") group by "租屋平台" format Pretty; ```

pm5 13:24:13
這是跑 time-series 還是全部的資料呀?
input 是逐月的資料集,每個月一份,一次會拿 3 ~ 12 個月來跑,應該比較接近 time series ?因為都在處理 aggregation 。
pm5 13:24:13
這是跑 time-series 還是全部的資料呀?
input 是逐月的資料集,每個月一份,一次會拿 3 ~ 12 個月來跑,應該比較接近 time series ?因為都在處理 aggregation 。

2023-12-06

pm5 12:03:54
不知道有沒有人用過 ClickHouse 的 vector search?https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/annindexes
pm5 12:03:54
不知道有沒有人用過 ClickHouse 的 vector search?https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/annindexes

clickhouse.com

Approximate Nearest Neighbor Search Indexes [experimental] | ClickHouse Docs

Nearest neighborhood search is the problem of finding the M closest points for a given point in an N-dimensional vector space. The most

2023-12-09

ddio 21:40:56
input 是逐月的資料集,每個月一份,一次會拿 3 ~ 12 個月來跑,應該比較接近 time series ?因為都在處理 aggregation 。

2023-12-13

Laurence Chen 22:12:59
@humorless has joined the channel