#disinfo | g0v.tw | slack archive

bruce 16:47:20

gogogo

pm5 19:03:28

`batch_discover.py` now runs daily on middle2.

pm5 19:04:22

There are some problems with duplicated article url hash. Here is the log: https://middle2.com/project/cronlog/tainan-sun-500796

wenyi 05:00:19

wenyi 05:20:39

I've noticed these errors show up for certain websites (e.g. 中國台灣網 and udn), but I don't know why yet QQ. At least the duplicates don't get written to the db.

wenyi 05:21:23

The `is_active` column update is done; `batch_discover` should work well on its next run on middle2!
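A minimal sketch of how a discovery run can skip URLs whose hash is already known instead of failing on a duplicate key. The hash function (SHA-256) and the names `url_hash` and `discover` are assumptions for illustration; the log does not say how `batch_discover.py` actually hashes or stores URLs.

```python
import hashlib

def url_hash(url: str) -> str:
    """Hypothetical helper: hex SHA-256 digest of the URL string."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def discover(urls, known_hashes):
    """Return URLs not seen before and record their hashes in known_hashes."""
    new_urls = []
    for url in urls:
        h = url_hash(url)
        if h in known_hashes:
            continue  # duplicate: skip it rather than hit a unique-key error
        known_hashes.add(h)
        new_urls.append(url)
    return new_urls

seen = set()
first = discover(
    ["https://example.com/a", "https://example.com/b", "https://example.com/a"], seen
)
second = discover(["https://example.com/b", "https://example.com/c"], seen)
```

In a real run the `known_hashes` set would be loaded from the Article table before discovery starts.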

🎉 1

pm5 17:32:29

We can start discussing what kind of dataset we can put together before the hackathon. Data visualization would probably be the easiest way for people to contribute.

🙌 1

chihao 20:01:30

@fockerlee there is a link to hangout in your google calendar 🙂

bruce 20:03:02

oh

chihao 20:05:38

@fockerlee do you need the link?

bruce 20:06:34

I clicked it, but it shows an error. I'm trying another browser.

chihao 20:06:41

OK

bruce 20:18:37

posts-10.csv

chihao 20:21:42

meeting notes https://g0v.hackmd.io/BChGrPg-TkWsvks2ey5q6A?edit

g0v.hackmd.io

disinfoRG & 0archive meeting notes - HackMD

chihao 20:50:12

tech spec https://g0v.hackmd.io/lMQO37z6SbWNWo3R4-X_EA

g0v.hackmd.io

零時檔案系統技術文件 0archive Technical Spec - HackMD

pm5 11:38:31

@fockerlee @ayw255 @chihao updated db schema for FB posts & comments https://g0v.hackmd.io/lMQO37z6SbWNWo3R4-X_EA?view#%E8%B3%87%E6%96%99%E5%BA%AB%E6%9E%B6%E6%A7%8B

g0v.hackmd.io

零時檔案局技術文件 0archive Technical Spec - HackMD

pm5 2019-12-04 11:42:43

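The main arrangements: FB posts and comments are both Articles (they all have a url), distinguished by `article_type`; the scraped content goes into FacebookPostSnapshot or FacebookCommentSnapshot; like counts and similar data are stored in a `reactions` JSON column.

pm5 2019-12-04 11:43:05

Take a look and tell me what you think?

chihao 2019-12-04 11:54:22

Looking.

chihao 2019-12-04 11:59:39

1. FacebookPostSnapshot → FbPostSnapshot: I'd like the name shorter.
2. Looks fine to me, but I have a vague feeling that comment snapshots could be attached to the article snapshot instead of standing alone. Not sure; I'd like to discuss it.

wenyi 2019-12-05 00:14:33

Comment snapshots have a structure that differs from articles (which article they reply to, reactions, etc.), so keeping them separate is probably better.
This reminds me that we could also collect the comments under mainstream news sites and store them in a similar structure, and later the comments under YouTube videos too.

bruce 2019-12-05 02:12:20

I also think putting comments in a separate table is more convenient. It makes it easy to see which article (post) a comment belongs to: each comment carries an `article_id` that says which article it sits under. If comments and articles share one table, the parent-child relationship seems hard to tell apart.

pm5 2019-12-05 09:08:15

I also think separate comment snapshots are better (of course). But the `article_id` of a comment snapshot is the id of the comment itself; `reply_to` is what points to the item it replies to. And an FB comment can reply to another comment, so the Article that `reply_to` points to may itself be an FB comment; check `Article.article_type` to tell.

pm5 2019-12-06 11:17:18

It probably still depends on what we are collecting data about. Does disinformation mainly appear in posts or in comments? If comments are also a common channel for spreading disinformation, then separating them is better. Of course, if we want posts and comments bundled together, that can be done later in the pipeline.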
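The arrangement discussed in this thread can be sketched with an in-memory SQLite database. The column names `article_id`, `article_type`, `reply_to` and `reactions` come from the discussion; the table names, the `raw_data` column, the type values `FBPost` / `FBComment` and the example URLs are assumptions, and the real schema lives in the tech spec rather than here.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    article_id   INTEGER PRIMARY KEY,
    url          TEXT NOT NULL,
    article_type TEXT NOT NULL   -- FBPost or FBComment (assumed values)
);
CREATE TABLE fb_comment_snapshot (
    article_id INTEGER NOT NULL REFERENCES article(article_id),  -- the comment itself
    reply_to   INTEGER NOT NULL REFERENCES article(article_id),  -- what it replies to
    raw_data   TEXT,                                             -- hypothetical column
    reactions  TEXT                                              -- JSON-encoded counts
);
""")
conn.executemany("INSERT INTO article VALUES (?, ?, ?)", [
    (1, "https://facebook.example/posts/1", "FBPost"),
    (2, "https://facebook.example/posts/1?comment_id=2", "FBComment"),
    (3, "https://facebook.example/posts/1?comment_id=3", "FBComment"),
])
conn.executemany("INSERT INTO fb_comment_snapshot VALUES (?, ?, ?, ?)", [
    (2, 1, "first comment", json.dumps({"like": 12})),
    (3, 2, "reply to the comment", json.dumps({"like": 0})),
])

# For every comment, look up what kind of Article it replies to.
rows = conn.execute("""
    SELECT s.article_id, parent.article_type
    FROM fb_comment_snapshot AS s
    JOIN article AS parent ON parent.article_id = s.reply_to
    ORDER BY s.article_id
""").fetchall()

likes = json.loads(conn.execute(
    "SELECT reactions FROM fb_comment_snapshot WHERE article_id = 2"
).fetchone()[0])["like"]
```

Comment 2 replies to a post and comment 3 replies to another comment, which is the case pm5 describes: the join on `reply_to` is what reveals the parent's `article_type`.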

❤️ 2

poga 16:10:05

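Is 0archive going to crawl mainstream news media?

poga 16:10:29

Someone at AILabs is already crawling them, so let's see whether we want to collaborate.

👍 1

poga 16:15:07

I've seen the airtable now; looks like you are crawling them.

wenyi 2019-12-04 23:58:22

Yep, we're crawling them~

wenyi 2019-12-04 23:59:14

We currently crawl the sites listed on the airtable, except 蘋果新聞網 (we have 蘋果即時, but not 新聞網 yet).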

wenyi 23:57:31

`drop_unused` merged into `master`; `init_site.py` is dropped from the repo

wenyi 00:24:47

Because I keep being unable to find the relevant HackMD documents, I added [NOTES.md] on `master` to collect them. @chihao feel free to add other documents.

🙌 1

pm5 11:24:27

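@ronnywang how much disk space is our db using now?

ronnywang 11:25:03

I'll check around noon. It shouldn't be much yet… I get a warning email if disk usage goes above 70%.

ronnywang 12:51:32

Currently 20G used.

ronnywang 12:51:41

Mostly ArticleSnapshot.

ronnywang 12:52:51

I'll try turning on compression for ArticleSnapshot.

ronnywang 12:59:53

In newsdiff I only kept raw HTML for three months, because the volume of original HTML really gets too large, but the extracted title and body are kept permanently.

pm5 13:01:29

Snapshots are a headache too. I looked earlier, and it seems many articles get re-snapshotted even though the article content has not changed; what changed is data like the latest-articles list in the sidebar.

ronnywang 13:02:35

newsdiff extracts title and body, and stores a new snapshot only when the title or body has changed.

ronnywang 13:02:48

But getting title and body means writing a parser for every site.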
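The "store a new snapshot only when title or body changed" idea can be sketched by keying each snapshot on a hash of the extracted text. This is an illustration, not newsdiff's actual code: `content_key`, `SnapshotStore` and the whitespace normalization are assumptions.

```python
import hashlib

def content_key(title: str, body: str) -> str:
    """Hash of title and body with whitespace runs collapsed."""
    normalized = "\n".join(" ".join(part.split()) for part in (title, body))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class SnapshotStore:
    """In-memory stand-in for a snapshot table."""

    def __init__(self):
        self.latest_key = {}  # article_id -> key of the last stored snapshot
        self.snapshots = []   # list of (article_id, title, body)

    def save_if_changed(self, article_id, title, body) -> bool:
        key = content_key(title, body)
        if self.latest_key.get(article_id) == key:
            return False  # same title and body: nothing new to store
        self.latest_key[article_id] = key
        self.snapshots.append((article_id, title, body))
        return True

store = SnapshotStore()
r1 = store.save_if_changed(1, "Title", "Body text")
r2 = store.save_if_changed(1, "Title ", "Body   text")  # only whitespace differs
r3 = store.save_if_changed(1, "Title", "Body text, updated")
```

Because the key is computed from the extracted title and body, changes elsewhere on the page (a sidebar list, for instance) never reach it, but the extraction step itself still has to exist for each site.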

ronnywang 13:03:18

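We've collected quite a few snapshots by now, so maybe it's time to experiment with how well readability extracts title and body?

pm5 13:05:30

@ayw255 will test it this week.

🙌 1

ronnywang 13:05:35

For snapshots in newsdiff, I store one table per month and drop tables older than three months.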
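A sketch of the one-table-per-month retention scheme described here. The table naming pattern `article_snapshot_YYYYMM` and both function names are assumptions; the log only states the policy (monthly tables, dropped after three months).

```python
from datetime import date

def snapshot_table(day: date) -> str:
    """Name of the monthly table that holds snapshots taken on `day`."""
    return f"article_snapshot_{day.year:04d}{day.month:02d}"

def tables_to_drop(existing, today: date, keep_months: int = 3):
    """Return tables whose month is more than keep_months before today's month."""
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    drop = []
    for name in existing:
        ym = name.rsplit("_", 1)[1]                 # the "YYYYMM" suffix
        index = int(ym[:4]) * 12 + (int(ym[4:]) - 1)
        if index < cutoff:
            drop.append(name)
    return drop

current = snapshot_table(date(2019, 12, 9))
stale = tables_to_drop(
    ["article_snapshot_201908", "article_snapshot_201909", "article_snapshot_201912"],
    today=date(2019, 12, 9),
)
```

Dropping a whole table is cheap compared with deleting old rows one by one, which is presumably the appeal of the scheme.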

ronnywang 13:05:47

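Compression finished; it's down to 5.6G.

ronnywang 13:06:03

From now on the data is stored compressed.

2

ronnywang 13:15:20

With the current data volume, can we roughly estimate how much disk growth we'll need?

ronnywang 13:17:01

The mysql disk on middle2 that the disinfo project uses has 40G left, but I'll step in to enlarge it once it hits 80%, so disinfo can use roughly another 25G before I need to do anything.

ronnywang 13:17:32

Once that 25G is used up, we should discuss moving disinfo to its own database server.

pm5 13:22:31

Rough estimate: that lasts until about... the end of December?

pm5 13:23:33

If readability is good enough, we can hold out much longer.

ronnywang 13:24:12

For snapshots we could also experiment with pinging archive.org and letting them store the pages.

pm5 13:24:19

Otherwise we could stop snapshotting Articles published more than two weeks ago.

ronnywang 13:24:44

So right now Articles get snapshotted every day with no stop date?

pm5 13:26:31

I don't think there is one. We only just started running it daily, about a week ago.

ronnywang 13:27:23

Without a snapshot stop date, won't usage grow O(N^2)? XD

pm5 13:29:01

Oh, currently an article is snapshotted at most 7 times.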
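The growth argument can be checked with a small count. Assuming a constant number of new articles per day and one snapshot per article per day (both simplifications, and the figure of 100 articles a day is made up), snapshotting forever gives a quadratic total, while a cap of 7 snapshots per article keeps it linear.

```python
def total_snapshots(days, articles_per_day, max_per_article=None):
    """Total snapshots stored after `days` days of daily snapshotting."""
    total = 0
    for publish_day in range(1, days + 1):
        per_article = days - publish_day + 1  # one snapshot per day since publication
        if max_per_article is not None:
            per_article = min(per_article, max_per_article)
        total += per_article * articles_per_day
    return total

unbounded = total_snapshots(30, 100)                  # 100 * 30 * 31 / 2
capped = total_snapshots(30, 100, max_per_article=7)  # at most 7 per article
```

With these numbers the uncapped run stores 46,500 snapshots in a month and the capped run 18,900, and the gap widens every day after that.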

wenyi 13:29:27

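I've already written a version that extracts title and body, but not with readability, so it needs a CSS selector for the main text body stored per site. For readability I looked at the Python API (https://github.com/buriy/python-readability), but it basically can't extract the body.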

GitHub

buriy/python-readability

fast python port of arc90's readability tool, updated to match latest readability.js! - buriy/python-readability

pm5 2019-12-08 22:25:17

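@ayw255 what does it look like when the body can't be extracted? I tried it today; it has a small unicode problem, and after fixing that I tried a few of the scraped articles and it seems to extract a pretty good body.

pm5 2019-12-08 22:25:35

One unicode problem needs fixing: https://github.com/pm5/python-readability/commit/8e1f4ceb6af9dc7794ec7169fae8923565f73998

wenyi 2019-12-09 08:01:15

@pm5 👍👍 It works now. I've updated the parser to use your fork of readability and the results look good. Let's discuss tomorrow where to store the data.

wenyi 13:30:19

The new code is on branch `parse_article`.

ronnywang 13:31:17

Some sites are annoying: they put ad slots inside the main text body, so every fetch yields a different main text because the ads differ,

ronnywang 13:31:23

even though the article itself hasn't changed.

ronnywang 13:40:09

Also, the table names seem to start with a capital letter, like Article, but by convention table names are usually lowercase.
For example, ArticleSnapshot would be article_snapshot as a table name, and ArticleSnapshot only as the class name in code.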
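The naming convention described here (CamelCase class name, snake_case table name) is mechanical enough to derive in code. A minimal sketch, with `table_name` as a hypothetical helper:

```python
import re

def table_name(class_name: str) -> str:
    """Convert a CamelCase class name to a snake_case table name."""
    # Insert "_" before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()

names = [table_name(n) for n in ("Article", "ArticleSnapshot", "FbPostSnapshot")]
```

This simple rule splits before every capital, so a name with an acronym such as `FBPost` would come out as `f_b_post`; the shorter `FbPostSnapshot` spelling suggested earlier avoids that.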

😆 1

chihao 14:15:09

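Tonight 8pm-9pm there's a disinfo community hangout (sounds like some disinformation-manufacturing community). I forgot to announce it on #general beforehand 😆

chihao 14:15:22

Let me prepare and post the announcement in a bit.

chihao 14:16:03

As for the video tool, I wonder whether picking jitsi would set the barrier to entry too high?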

hcchien 14:47:23

@hcchien407 has joined the channel

deeper 15:14:00

@cstsai has joined the channel

yukai 17:13:25

@yukai has joined the channel

ael 17:39:48

@aelcenganda has joined the channel

kiwi 19:10:52

@auroral.13king518 has joined the channel

loooffy 19:42:45

@loooffy has joined the channel

cwkung2016 20:08:02

@cwkung2016 has joined the channel

Anping 02:19:04

@zhaoanping has joined the channel

fly 15:41:28

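> Unlike real users, fake accounts may be controlled by one person or one group. They usually act together: some post propaganda or false news, others agree and repost, forming a message trend that draws real users into the online discussion.
> BotSlayer collects all tweets that match these criteria and stores them in a database for later investigation. It is also linked to the Hoaxy system, which shows how Twitter accounts interact over a period of time and identifies the most influential accounts and those most likely to spread false news.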
<https://www.cup.com.hk/2019/12/03/botslayer-the-system-aims-to-find-and-kill-internet-bots/>

<https://osome.iuni.iu.edu/tools/botslayer/>

*CUP

如何挖出並殲滅假帳號 - *CUP

由電腦系統操縱的假帳號（bot）干擾言論自由、帶動政治宣傳風向的普遍現象，日益引起美國互聯網研究員的關注，美國印第安納大學兩位博士生研發出 Botslayer 軟件系統，專門為媒體記者和公眾討論區服務，幫助他們找出網絡假帳號大軍，清理門戶。

1

LoraC 17:36:18

@shadowcrow594 has joined the channel

chihao 19:50:23

@pm5 @ayw255 @fockerlee 再十分鐘 dev meeting 😄

kiang 21:40:04

@kiang has joined the channel

gugod 23:06:53

https://metacpan.org/source/GUGOD/NewsExtractor-v0.0.3/lib/NewsExtractor/Extractor.pm#L21

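I've started gradually adding CSS rules to NewsExtractor.pm, the crawler I wrote.
The plan is to key them by URL first, one set of CSS rules per site.
They're for the sites that the Readability algorithm can't handle.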
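The per-site rule idea can be sketched as a lookup keyed by hostname, with a generic extractor as the fallback when no rule exists. This is an illustration in Python, not NewsExtractor's Perl code; the host `www.example-news.tw` and both selectors are made up.

```python
from urllib.parse import urlsplit

# Hypothetical rules: one set of CSS selectors per site, keyed by hostname.
SITE_RULES = {
    "www.example-news.tw": {"title": "h1.title", "body": "div.article-body"},
}

def rules_for(url: str):
    """Return the CSS rule set for the URL's host, or None if there is none."""
    host = urlsplit(url).hostname or ""
    return SITE_RULES.get(host)

known = rules_for("https://www.example-news.tw/politics/123")
unknown = rules_for("https://unknown.example/x")  # None: fall back to readability
```

Keeping the rules in a plain table like this means supporting a new site is a data change rather than a code change.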

gugod 2019-12-11 08:43:58

https://github.com/perltaiwan/NewsExtractor/blob/master/lib/NewsExtractor/Extractor.pm#L23

gugod 23:08:11

I've only made one set of rules so far 😛 but… feel free to use it (or send a PR).

2

gugod 23:09:48

https://github.com/g0v/people-in-news/blob/master/etc/news-sites.txt

The goal is to cover every news source listed here (so many, orz).

GitHub

g0v/people-in-news

公眾人物新聞的追蹤. Contribute to g0v/people-in-news development by creating an account on GitHub.

pei 15:11:11

@stronghead.wu has joined the channel

pofeng 16:34:39

@pofeng has joined the channel

Can 17:12:48

@can has joined the channel

gugod 20:11:33

Just found that the DOM of chinatimes.com news pages contains JSON-LD data we can use.

gugod 2019-12-11 20:12:19

It looks like this:

```
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "NewsArticle",
"articleSection": "生活",
"mainEntityOfPage": {
"@type": "WebPage",
"@id": "https://www.chinatimes.com/realtimenews/20191211004429-260405"
},
"headline": "作曲、繪畫難不倒 AI也能玩藝術",
"image": {"@type": "ImageObject","url":"https://images.chinatimes.com/newsphoto/2019-12-11/900/20191211004430.jpg","height":369,"width":656},"author": {"@type": "Person","name": "王寶兒" }, "url":"https://www.chinatimes.com/realtimenews/20191211004429-260405",
"description":"王寶兒／新北報導紀錄片《AI新紀元》揭示，預計203...
....
</script>
```

gugod 2019-12-11 20:12:38

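And the "description" field is basically the full text of the article….

gugod 2019-12-12 08:20:30

自由時報 (Liberty Times) has it too. But there the description is not the full text, and "author.name" is not the reporter's name but the four characters 「自由時報」.

view-source:https://news.ltn.com.tw/news/society/breakingnews/3006341

gugod 2019-12-12 08:21:50

Hmm. Looks like that's exactly the text Slack picks up for its link preview. (Maybe Slack just reads `script[type="application/ld+json"]`.)

gugod 2019-12-12 08:22:21

Oh, it could also be og:title + og:description.

pm5 2019-12-22 08:37:47

ctwant.com has it too; same company indeed.

pm5 2019-12-22 08:42:11

Actually Nooho has it as well. Pulling datePublished from here might be a good approach.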
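Reading JSON-LD out of a page needs only the standard library. The sketch below collects every `application/ld+json` script block and picks the `NewsArticle` one; the class and function names are assumptions, and the sample HTML is a cut-down stand-in whose `datePublished` value is invented for the example.

```python
import json
from html.parser import HTMLParser

class JsonLdParser(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # malformed block: skip it

def extract_jsonld(html: str):
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items

sample = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "NewsArticle",
 "headline": "作曲、繪畫難不倒 AI也能玩藝術",
 "author": {"@type": "Person", "name": "王寶兒"},
 "datePublished": "2019-12-11T20:00:00+08:00"}
</script>
</head><body><p>...</p></body></html>
"""
news = next(i for i in extract_jsonld(sample) if i.get("@type") == "NewsArticle")
published = news.get("datePublished")
```

As the thread notes, sites fill these fields differently (a site name in `author.name`, a partial `description`), so each field should be treated as a hint rather than ground truth.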

㊙️ 3

tmonk 20:35:40

@felixtypingmonkey has joined the channel

gugod 16:14:11

https://docs.rsshub.app/traditional-media.html

Seems like we could do a little business with this project 😛

docs.rsshub.app

传统媒体 | RSSHub

:cake: 万物皆可 RSS

chihao 2019-12-12 19:40:31

The author's github id is `DIYgod` 😆

2

chihao 19:26:19

"Everything can be RSS" (萬物傑克 RSS, a pun on the RSSHub tagline)

chihao 19:26:26

@gugod ++++

chihao 19:28:48

The disinfoRG community hangout starts at 8:00, in a little while. Come join!
Link: https://meet.jit.si/disinfo
Shared notes: https://g0v.hackmd.io/BChGrPg-TkWsvks2ey5q6A

meet.jit.si

Jitsi Meet

Join a WebRTC video conference powered by the Jitsi Videobridge

g0v.hackmd.io

disinfoRG meeting notes - HackMD

poga 20:07:09

https://islander.ailabs.tw/2019-12-12/19/

1

reason1130 23:59:47

@reason1130 has joined the channel

pm5 12:14:02

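Last time @fockerlee mentioned we could use airtable to release some data. Since the airtable API really is quite handy, I think releasing partial data could solve the data-volume problem. How to select this "partial data" deserves some thought, in ways that are meaningful for data analysis: for example, for one news event, release all articles published by every site within 3 days after the event, or release all posts whose body contains specific keywords.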

gugod 2019-12-13 12:36:46

Maybe we could try computing the other keywords related to a given keyword.

gugod 2019-12-13 12:37:52

Build something like a word-to-word pagerank table.
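One way to read the "word-to-word pagerank table" idea: count how often two keywords appear in the same article, treat the counts as edge weights, and run PageRank over that graph. The sketch below is an interpretation under that assumption; the function names, the toy keyword lists and the damping value of 0.85 are illustrative choices.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(docs):
    """weights[a][b] = number of documents in which both a and b appear."""
    weights = defaultdict(lambda: defaultdict(int))
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            weights[a][b] += 1
            weights[b][a] += 1
    return weights

def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over the co-occurrence graph."""
    nodes = sorted(weights)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            out_total = sum(weights[src].values())
            for dst, w in weights[src].items():
                new_rank[dst] += damping * rank[src] * w / out_total
        rank = new_rank
    return rank

def related(weights, term, k=3):
    """Top-k terms that co-occur with `term`, most frequent first."""
    return sorted(weights[term].items(), key=lambda kv: (-kv[1], kv[0]))[:k]

docs = [["election", "poll", "candidate"], ["election", "poll"], ["election", "weather"]]
weights = cooccurrence(docs)
rank = pagerank(weights)
top_related = related(weights, "election", k=2)
```

For Chinese text the keyword lists would first have to come from a word segmenter, which is outside this sketch.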

❤️ 3

1

gugod 18:54:15

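Let me throw out a question: leaving aside how factual it is and questions of stance, how do you think the "production quality" of a news article should be defined? (thread)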

gugod 2019-12-14 18:55:20

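I've been thinking about this while writing crawlers (NewsExtractor / people-in-news), but I've only come up with some ad hoc criteria.

gugod 2019-12-14 18:58:19

For example:

1. It must have these four fields: title, date and time, reporter name, and body.
2. The body and the title should be about the same topic (otherwise the title doesn't match the content, which leans toward the disinformation style of writing).

gugod 2019-12-14 18:59:41

3. The time of the events described in the body should not be too far from the present (old news).

gugod 2019-12-14 19:07:43

4. The body and title should not contain the word 「網友」 ("netizens").

From rough observation, news articles that use 「網友」 mostly carry some element of falsehood. Or "netizens said X" is the supporting argument for the article's main point, yet a search turns up nowhere that netizens actually said X. It becomes a kind of "hollow reference".

gugod 2019-12-14 19:08:42

(It depends, though. If "netizens" themselves are the subject of the article, it probably doesn't count.)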
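Criteria 1 and 4 above are mechanical enough to sketch as flags on a parsed article. The `ParsedArticle` type, its field names and the flag strings are assumptions; criteria 2 and 3 need topic and time analysis and are left out.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ParsedArticle:
    title: Optional[str]
    published_at: Optional[str]
    author: Optional[str]
    body: Optional[str]

def quality_flags(article: ParsedArticle) -> List[str]:
    """Return warning flags; an empty list means no heuristic fired."""
    flags = []
    for name in ("title", "published_at", "author", "body"):
        if not getattr(article, name):
            flags.append(f"missing_{name}")  # criterion 1: four required fields
    text = (article.title or "") + (article.body or "")
    if "網友" in text:
        flags.append("mentions_netizens")    # criterion 4
    return flags

complete = ParsedArticle("標題", "2019-12-14T18:58:00+08:00", "記者", "內文")
weak = ParsedArticle("標題", "2019-12-14T18:58:00+08:00", None, "網友表示……")
flags_complete = quality_flags(complete)
flags_weak = quality_flags(weak)
```

A flag is a prompt for a closer look rather than a verdict, in line with the caveat that articles about netizens themselves should not count.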

tkirby 2019-12-14 20:06:29

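I think this book is worth consulting.

tkirby 2019-12-14 20:06:53

Image from iOS

tkirby 2019-12-14 20:13:11

It also discusses false information:

tkirby 2019-12-14 20:13:31

Image from iOS

pm5 2019-12-15 08:57:25

If we limit it to news, maybe the writing format? The length and sentence count of the first 3 paragraphs, or the number of exclamation marks or certain special symbols in the full text.

pm5 2019-12-15 08:58:15

Certain wording, like https://oops.udn.com/oops/story/6698/2616762

pm5 2019-12-15 09:09:45

The number and distribution of "quotations" and "transitions"? https://www.prlass.com/3737/%e7%b6%b2%e8%b7%af%e6%96%b0%e8%81%9e%e7%a8%bf%e5%af%ab%e4%bd%9c/

pm5 2019-12-15 09:14:24

I think production quality is a technical matter, with little to do with whether something is true or whether there is intent to manipulate. There can be well-produced false news, and there can be rumors that reveal facts. Also, production quality in itself is not good or bad; it is only suitable or unsuitable, or rather, suited to circulating in one kind of channel or another.

pm5 2019-12-15 09:25:06

But for content within "news", a special kind of communication system, there are some criteria for good and bad production quality.

gugod 2019-12-15 10:31:31

> I think production quality is a technical matter, with little to do with whether something is true or whether there is intent to manipulate.
My own intuition is slightly different. Things that are "not well made" usually take little time and effort to produce. And news text with "low factual content" also takes less time and effort to produce[*].

Fact-checking is laborious, but if there were a way to quickly and roughly filter and separate "refined" from "crude" news text, that method could to some extent serve as a tool for going from defense to offense.

[*]: This is my personal assumption.

gugod 2019-12-15 10:33:00

To some extent it's like food being required to have specific packaging, an ingredient list, and a production history.

gugod 2019-12-15 12:18:10

https://medium.com/@tunaBR/%E5%88%A4%E5%AE%9A%E6%96%B0%E8%81%9E%E5%85%A7%E5%AE%B9%E5%93%81%E8%B3%AA%E7%9A%84%E5%8F%A6%E9%A1%9E%E6%96%B9%E6%B3%95-%E7%B6%B2%E9%A0%81%E7%B5%90%E6%A7%8B%E5%88%86%E6%9E%90-bfcf4e990df8

My idea seems very similar to the "news quality scoring project" mentioned in this article.

gugod 2019-12-15 12:20:01

The point about "article images" is a good one:

> If a long article's images are of unknown origin (possibly stolen) or are stock content bought for a little money (easy to recognize), the article most likely has problems;

gugod 2019-12-15 12:21:03

At least it can be checked quickly and simply with an image search engine.

tkirby 2019-12-15 12:58:33

What do you all think is wrong with the quality of a news item like this? https://tw.appledaily.com/new/realtime/20131022/278921/

gugod 2019-12-15 22:36:52

Apart from "what it describes basically has nothing to do with me", it's probably that it is very short and carries very little information. I'd guess it also took little time and effort to produce.

The reporter's name is given, which is okay.

Also, I have some objection to "this morning" as a time format; I think times in the body should be as precise as possible too. If it's just copying a Weibo post, it could at least include the original post's timestamp.

chihao 2019-12-18 23:30:52

「網友」 sometimes gets further abbreviated to 「網」, which is really annoying.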

🙌 2

pm5 09:56:27

@ayw255 how is it going? After some thoughts I believe it'd be better if we put parsers in the same git repo, at least for the current stage of the project.

pm5 2019-12-15 09:59:57

We can probably add an `ArticleParser` dir under the project root for the parsers. They can have their own db settings and migrations there. We can also just rename the `codes` dir to `NewsScrapper` to better reflect its function, and move the news db settings and migrations to some subdir.

pm5 2019-12-15 10:04:55

This "semi-gigantic repo" project structure would make it easier to deploy to middle2.

wenyi 2019-12-15 10:13:31

@pm5 parser is almost done now in branch ‘parser’ in codes/parser.py. waiting for @chihao for new db and/or elasticsearch for it to be fully functional.

wenyi 2019-12-15 10:13:51

Umm the code structure sounds fine to me

wenyi 2019-12-15 10:15:42

But, I wonder why parser needs a sub dir, why don’t we just put env settings in the same .env folder and maybe named them as, for example, ‘PARSER_DB_URL’

wenyi 2019-12-15 10:16:24

Since the parser only needs one script? Unless we are looking to develop more code for post-processing the parsed results (and decide to put it in the same repo)

pm5 2019-12-15 10:29:12

oh we can use `PARSER_DB_URL` and shared `.env` and Python dependencies. It's just that I am anticipating more post-processing code for FB, Twitter, maybe YouTube etc.
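A sketch of the shared-`.env` idea: the scraper and the parser read their database URLs from the same environment, and the parser falls back to the scraper's database when `PARSER_DB_URL` is not set. `PARSER_DB_URL` is the name proposed in this thread; `DB_URL`, the function name and the example URLs are assumptions.

```python
import os

def load_db_urls(env=None):
    """Read both database URLs from one environment mapping."""
    env = os.environ if env is None else env
    scraper_url = env["DB_URL"]  # assumed name of the existing scraper setting
    # Without a separate parser db, both sets of data go to the same place.
    parser_url = env.get("PARSER_DB_URL", scraper_url)
    return {"scraper": scraper_url, "parser": parser_url}

shared = load_db_urls({"DB_URL": "mysql://host/news"})
split = load_db_urls({
    "DB_URL": "mysql://host/news",
    "PARSER_DB_URL": "mysql://host/parsed",
})
```

The fallback lets the same code run before and after a second database exists, so the switch becomes a configuration change.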

pm5 2019-12-15 10:39:42

Parser db schema proposal https://g0v.hackmd.io/lMQO37z6SbWNWo3R4-X_EA?view#Parser-database-schema

pm5 2019-12-15 10:41:11

...and a revised project structure https://github.com/disinfoRG/NewsScraping/tree/parserdb

chihao 2019-12-15 17:03:46

@ayw255 sorry I’m lagging behind on the middle2 front. Will pick that up!

pm5 2019-12-16 11:11:36

I think it will be some time before we have 2 mysql db? Guess we will have to put scrapper and parser data in the same place for now.

chihao 2019-12-16 11:12:33

I’m not sure if we can have 2 mysql dbs in the same middle2 project yet. (I should have started this process earlier. My apologies.) I could just start a new middle2 project and a new db in that project?

chihao 2019-12-16 11:15:01

OK starting one now

pm5 2019-12-16 11:17:37

That was fast.

pm5 2019-12-16 11:19:01

We could also do that. Then we need a new repo for parsers.

pm5 2019-12-16 11:19:51

Here we go.

chihao 2019-12-16 11:20:40

Done 🙂 new db for parsed content in new middle2 project `taoyuan-chu-975484`

chihao 2019-12-16 11:21:02

I’ve added @pm5 @ayw255 @fockerlee to the new middle2 project as well

chihao 2019-12-16 11:21:36

Turns out you can switch between mysql dbs in phpmyadmin across middle2 projects 🤔

chihao 2019-12-16 11:22:11

Again, I should have done this earlier. Sorry 😞

pm5 2019-12-16 11:29:45

Well, for a project like this that builds a data pipeline, in the end it is better if we own the infrastructure to some extent. But we probably are not going to *really* need that in the first 6 months.

pm5 2019-12-16 11:33:20

Because a data pipeline consists of a set of data stores (files, dbs) connected by many processing lines in between, each db might be accessed by more than one codebase with different purposes in mind. So it's usually better if a db is not *owned* by any one of the codebases but is maintained on its own.

pm5 2019-12-16 11:35:47

And so deploying that kind of projects is a bit more involved, usually employs Puppet or Ansible or some infrastructure-as-code tool.

pm5 2019-12-16 11:39:01

And when the data volume is getting really high or when the pipeline topology gets complicated, dbs would start to restrict their access with APIs, so that changes to their internal schemas don't propagate through the project so quickly that it breaks.

pm5 2019-12-16 11:42:05

But we are not at that stage yet. We can let the scrapper project own one db and the parser project own another. Either a gigantic repo or two separate projects would do, since middle2 projects can share dbs among projects.

gugod 10:18:25

https://newsapi.org/s/taiwan-news-api

Just noticed that this paid crawler API supports some Taiwanese news media

News API

Taiwan news API - Live top headlines from Taiwan

Get live top and breaking news headlines from Taiwan in JSON format.

1

Ping Lin 18:14:57

@ping.lin has joined the channel

cookie_dog 17:21:01

@iverylikedog has joined the channel

chihao 23:29:39

Just finished tonight’s live fact-checking session for the presidential policy presentations.

chihao 23:30:03

Note for the data set: remember to decide on the license.

Chen 01:28:20

@iping.ch has joined the channel

wenyi 06:12:38

@chihao @pm5 @fockerlee first draft of the slides for the big hackathon
1. [English] https://docs.google.com/presentation/d/10dlv6ouTGGKbJSR6xWAendaCP_D3UFRK2fowz5F_V0o/edit?usp=sharing
2. [Chinese] https://docs.google.com/presentation/d/1G8LN9jXWYjEQc95XAXkIE4ynXow65ahxwK5GtvL6Gjg/edit?usp=sharing

wenyi 2019-12-21 01:41:15

@chihao @pm5 @fockerlee any comments or things to add? Should we use the chinese or english version?

chihao 2019-12-21 01:43:42

@ayw255 it’s been a hectic day here in Tainan. Hold on let me see!

chihao 2019-12-21 01:48:16

@ayw255 ++ I think it looks really good :) I’ll update some text tomorrow. I’m not sure if we should use Mandarin or English version... slides aside, would you like to speak in Mandarin or English for the 3-minute pitch?

chihao 2019-12-21 01:48:40

Gotta get some sleep now... I’ll come back to this early morning :)

wenyi 2019-12-21 01:48:52

sure sure! night night

wenyi 2019-12-21 06:01:20

I can do the pitch in English. Also, if about 20 people in the audience don't understand Mandarin, I think it's better to use English?

wenyi 2019-12-21 09:12:45

Anyways, I’m on the dev meeting google hangout

pm5 2019-12-21 09:33:25

one moment

wenyi 2019-12-21 09:34:19

Np! I’ll log in ~ 10 am

pm5 2019-12-21 09:36:24

Let's pitch in English, since we registered as one of the FtO projects.

pm5 2019-12-21 09:56:25

Filled in the links to the datasets and minihackathon schedule.

1 ❤️ 1

Relax 19:17:15

@healthya2776 has joined the channel

chewei 哲瑋 03:43:30

@chewei has joined the channel

paulpengtw 08:53:21

@paulpengtw has joined the channel

chihao 13:11:23

@ayw255 ++ for the pitch!

hkazami 13:15:33

@mail.hkazami has joined the channel

ronnywang 13:47:08

Yesterday I mentioned to Bruce that I'd open-source my program for crawling the ad library: https://github.com/ronnywang/fb-ad-lib

GitHub

ronnywang/fb-ad-lib

Contribute to ronnywang/fb-ad-lib development by creating an account on GitHub.

2

ronnywang 13:47:50

It mainly uses casperjs on nodejs to fetch the HTML, then parses it with PHP.

pm5 13:56:38

Content farm dataset in CSV https://drive.google.com/open?id=1oVA-4W3HPbtnVoudlwfuLCtCYoiUSp80

❤️ 1

pm5 13:57:14

News content dataset in CSV https://drive.google.com/open?id=1efDrL8Q3guQ6yjT6E1bdndBJqZ9P6YYs

fly 15:04:20

looked cool https://github.com/festum/anti-bias-media/blob/master/README-zh_TW.md

GitHub

festum/anti-bias-media

A block list for ublock and other adblocker / 純公益阻擋共媒跟內容農場 - festum/anti-bias-media

gugod 2019-12-21 15:12:56

Since this list exists…. I'd suggest also producing it in a format that can be fed to hohser.

https://github.com/pistom/hohser/

It can remove pages from specified domains from the results pages of all search engines.

gugod 2019-12-21 15:13:20

(hohser is also a browser extension.)

gugod 2019-12-21 15:17:42

https://gist.githubusercontent.com/gugod/3091ef98a6628ef842d488a435731bd2/raw/55c9bb67d7c0826730388ede236d15fd2468c32a/hohser-domains.json

An example.

fly 2019-12-25 13:53:44

Cool, so they can be blocked right in the search engine.

pm5 15:58:25

We don't crawl 上報 (Up Media), do we?

pm5 2019-12-21 16:00:51

Looks like the same article on 怒吼 (Nooho) https://nooho.net/2019/11/DPPfraud27471/

pm5 2019-12-21 16:00:58

and on 上報 https://www.upmedia.mg/news_info.php?SerialNo=75396

pm5 2019-12-21 16:01:23

We should be able to link them with a text diff.

pm5 2019-12-21 16:02:11

Oh... Nooho reposted it from Yahoo, and Yahoo reposted it from 上報.
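Linking reposts by text diff can be sketched with the standard library's `difflib`. The 0.8 threshold, the function names and the sample sentences are assumptions; real article bodies would be compared after extraction, and a pairwise comparison like this gets slow on a large corpus.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two texts are identical."""
    return SequenceMatcher(None, a, b).ratio()

def likely_same_article(a: str, b: str, threshold: float = 0.8) -> bool:
    return similarity(a, b) >= threshold

original = "The party released its draft list of at-large legislators on the 13th."
repost = "[Repost] " + original
unrelated = "Heavy rain is expected in the north over the weekend."

same = likely_same_article(original, repost)
different = likely_same_article(original, unrelated)
```

Pairs that pass the threshold can then be ordered by publication time to recover a chain like the one above, from the original outlet through the aggregator to the content farm.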

chihao 2019-12-21 16:07:13

Oh dear.

pm5 07:35:17

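Our crawl of 中時電子報 (China Times) may be incomplete. Judging from the politics section front page, I estimate about 160~180 articles a day, but the scrapper db has only 4206.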

pm5 09:22:07

qiqi.today has not been scraped since around the end of November.

pm5 2019-12-22 09:24:24

The last article we scraped is http://www.qiqi.today/show/882067

pm5 2019-12-22 09:24:36

The site is now at about http://www.qiqi.today/show/888361

😮 2

chihao 19:29:02

@pm5 @ayw255 @fockerlee dev meeting in 30 mins 🙂

gugod 21:23:46

https://zht.globalvoices.org/2019/11/29/30049/ // seems relevant

Global Voices 繁體中文

推特揭發中國對香港反送中運動大打資訊戰 · Global Voices 繁體中文

推特在聲明中表示，由20萬推特帳號共同組成的龐大網軍正試圖削弱香港運動的合法性。

chihao 2019-12-23 22:04:29

🤔 1

chihao 22:03:16

We just finished this week’s dev meeting 🙂

👍 1

chihao 22:04:29

Readr

Twitter 大戰中國網軍——解密被刪帳號資料集 - 讀+ Readr

8 月 19 日，Twitter 公佈了刪除疑似中國網軍帳號的震撼消息。Twitter 將 936 個「源自中國」的帳號停權，並表示這些帳號近期發表大量與香港「反送中」運動有關的假訊息。READr 分析 Twitter 官方釋出資料的幾個發現

pm5 22:11:26

@fockerlee if you want an easy way to view the JSON-LD metadata in web pages day to day, you can also install this addon http://osds.openlinksw.com/

chihao 09:28:49

\@mail.hkazami/ The 0archive Airtable is currently editable only by developers who have confirmed they are joining development 🙇 There is a read-only version, though: https://airtable.com/shrKvjXMO7GaUg1vd 😄

Airtable

零時檔案局 0archive - Airtable

Explore the "零時檔案局 0archive" base on Airtable.

hkazami 2019-12-24 15:37:28

No worries, I just saw it ask whether I wanted to request access, so I clicked first and figured I'd sort it out later XD

chihao 2019-12-24 15:37:42

OK 😆

👍 1

Victoria Welborn 17:46:53

@vwelborn has joined the channel

chihao 17:47:18

Hi @vwelborn 🙂 welcome to #disinfo. This is where most discussions about 0archive happen 🙂

Victoria Welborn 2019-12-27 09:01:29

Thank you @chihao and everyone else!!

I wanted to ask the group two things:

1. I am hoping to spend some time with one of you for an interview on g0v and more specifically information on 0archive (as an example of g0v's work) https://g0v.hackmd.io/@kVqWpZq8R12VCa4i1N68UA/SJiTyGyJ8/edit. I will be in town the 6-12, and wanted to see if one person might be available and willing for an on camera interview for the video (who is comfortable speaking primarily in English)? I hope it will only be about an hour of your time (we are also happy to conceal identity if folks would prefer).
2. I wanted to see how this group would feel about a camera attending the mini-hackathon on the 4th to take some b-roll for the video.
Happy to provide more information, thank you all for your time and help :)) We are SO excited to tell the amazing story of the amazing work you all are doing!

pm5 2019-12-27 20:27:27

Hi Victoria! There are four of us (@chihao @ayw255 @fockerlee and me) working on 0archive now, with help from others in this channel. I for one would be happy to talk about the project and g0v. As for filming the mini-hackathon, personally I am fine with that, but I imagine we should see if most of the people coming to the hackathon are okay to be filmed, since we haven't mentioned that in registration?

chihao 2019-12-28 10:27:38

@pm5++ I could join as well 🙌 also +1 on asking participants before filming during hackathon.

chihao 2019-12-28 10:29:08

@vwelborn do you mind registering to the hackathon on 1/4? This way I can send you the location just like other participants 🙂

chihao 2019-12-28 10:29:23

Registration is here → https://forms.gle/kqffyonCYWTdeUgk8

Victoria Welborn 2019-12-28 11:14:37

Of course! The only thing is that I will unfortunately be arriving on the 5th, so the camera crew (which will only be two other people) will attend without me on the 4th, but I will still register! How would you suggest checking whether people going to the hackathon would mind being filmed?

That would be great about the two of you being filmed and interviewed! Will there also be another hackathon on the 8th? If so, we could hold your interviews there (before, during, or after) at the event.

chihao 2019-12-28 11:20:42

@vwelborn
(1) Either you register or the camera crew register. Either way, might be best to link me, pm5, you, and the camera crew together in the same email thread.

(2) Ask the participants on site. If they refuse, don’t film them.

(3) Events on the 8th are hosted by other projects. I personally am not sure if I’m attending yet. I usually decide on that night if I’m swinging by and bringing beer 😛

Victoria Welborn 2019-12-28 11:39:07

Perfect - that all sounds great! I will register them and link you all in the same email thread.

As for your interview and @pm5's interview - would either of you be open to perhaps us meeting up separately from the hackathon to do the interview? Maybe just let me know a few dates/times that work (and a preferred location - I can find one if you prefer).

Victoria Welborn 2019-12-28 11:41:02

And just to clarify - on the 8th the only events that are happening are these two?: (I'm still working on this list - https://g0v.hackmd.io/HaIQVQpLQ8i-ixUDyXtBCQ)

• 1/8 (Wed) vTaiwan mini-hackathon
• Location: Social Innovation Experiment Center, Room A9, 99, Section 3, Ren-ai Road, Taipei
• 1/8 (Wed) disfactory mini-hackathon
• Location: Citizen of the Earth (Taipei Office), 9F 28 Beiping East Road, Taipei
Thank you!

chihao 2019-12-28 12:03:38

Wow this seems like a lot of work! Are there alternative ways to do the interview on 1/4?

Victoria Welborn 2019-12-30 12:03:20

Unfortunately I won't be in Taiwan on the fourth and am the only person conducting the interviews, I'm sorry! However, I'm happy to come to a location that is convenient for you during the 5-12!

Victoria Welborn 2019-12-30 12:06:57

Otherwise, maybe I can just interview one of you, if that would be less stress :)

chihao 2019-12-31 00:10:31

Not sure what’s a good solution here, but I think missing the 1/4 event is indeed unfortunate 😩 I’m a bit worried about the camera crew on 1/4. Would they be self-guided? 🤔

Victoria Welborn 2019-12-31 01:49:31

It would just be one or two people with one camera taking some b-roll video, which they know how to do in a non-intrusive way, but if it's a problem we can just get b-roll footage on the 8th instead!

Victoria Welborn 2019-12-31 01:51:24

What do you think about me interviewing you and/or pm5 on a separate day and location?

Victoria Welborn 2020-01-02 02:34:11

Hi @chihao and @pm5 happy new year! I spoke to the production company and Luke - my POC there - said he’d happily interview you both on the 4th at the hackathon, without me (if that still works for you both)! I will introduce you both to him via email shortly. Thank you for your patience and understanding!

chihao 2020-01-03 15:25:47

📧 sent 😉

🙂 1

fly 09:46:40

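https://github.com/danny0838/content-farm-terminator Is there a service similar to this one that also lets the crowd block content farm sites by consensus? This one is managed centrally through a Google Form, and I think that is too much work for the maintainer.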

GitHub

danny0838/content-farm-terminator

「終結內容農場」瀏覽器套件 / Content Farm Terminator browser extension - danny0838/content-farm-terminator

chihao 2019-12-25 13:29:47

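Then we'll need to design a consensus mechanism 😁

fly 2019-12-25 14:01:43

I realized that if we only label content farms, we could use color intensity to represent the number of people who reported a site as a farm, hinting at how likely it is to be one. That might feel less "absolute".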

👍 1
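
A minimal sketch of the color-intensity idea: map the number of reports to an opacity so that a few reports show a faint warning and many reports show a strong one. The saturation point of 20 reports and the function names are assumptions made for illustration only.

```python
def report_opacity(report_count: int, saturation_count: int = 20) -> float:
    """Scale the report count linearly to [0, 1], capping at saturation_count."""
    if report_count <= 0:
        return 0.0
    return min(report_count, saturation_count) / saturation_count


def warning_color(report_count: int) -> str:
    """Return a CSS color whose alpha channel reflects how many people reported the site."""
    return f"rgba(255, 0, 0, {report_opacity(report_count):.2f})"
```

A linear scale is the simplest choice; a logarithmic scale might be a better fit if report counts vary widely between sites.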

chihao 10:24:03

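@pm5 @ayw255 @fockerlee the disinf0thon registration form is ready. What do you think? https://docs.google.com/forms/d/e/1FAIpQLSeo4vsBzAXGUPYW2_grAuf_Gr9hCe3eIcaaL3tDP_asgVnJ9g/viewform

pm5 2019-12-25 10:44:47

How about adding "The information you submit is used only by g0v disinfoRG for hosting the disinformation hackathon"?

chihao 2019-12-25 10:49:18

\Added/

wenyi 2019-12-25 11:10:04

Looks good!! A small suggestion: change “exact location will be emailed to you once registration info is verified” to “We’d email you the exact location once the registration is confirmed.”

chihao 2019-12-25 11:11:10

I was trying to avoid `We` 😆

chihao 2019-12-25 11:11:30

Not sure whether we need to add any other fields 🤔

pm5 2019-12-25 11:17:44

"What can you do"?

chihao 2019-12-25 11:18:39

Like areas of expertise?

pm5 2019-12-25 11:30:23

yes

chihao 2019-12-25 11:32:19

And @bil has already jumped in and registered 😆

chihao 2019-12-25 11:32:36

Added: "Your skills" (你的技能)

bil 2019-12-25 11:36:35

Damn, @mrorz DMed me "there's a hackathon here" and I happily registered XDDDD
鄰鄰's skill: a walking database of Chinese-language internet rumors

chihao 2019-12-25 11:37:09

OK, I'll add that for you 😆 @mrorz no dodging (?)

chihao 2019-12-25 11:38:42

No time like the present, and the community plans better than any one person, so registration is now open 😆 I'll write the announcement copy in a bit.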

isabelhou 2019-12-26 08:52:21

XD

🙌 1 ❤️ 1

chihao 10:36:02

Also, here is the 0archive HackMD book: https://g0v.hackmd.io/@chihao/0archive/ 😄

HackMD

0archive 零時檔案局 :closed_book: - HackMD

# 0archive 零時檔案局 :closed_book: ## 想跳坑嗎 - [跳坑指南](<https://g0v.hackmd.io/cdctnMJWQpKWQYhSxB8sCw>) - [Roadmap](/L1

👍 1

pm5 11:07:43

The first dataset has been released, so shall we take stock of our expectations and roadmap again? I put some initial thoughts here: https://g0v.hackmd.io/@chihao/0archive/%2FL1l9m6joRhCWhmGFZYAO5A

HackMD

0archive 零時檔案局 :closed_book: - HackMD

# 0archive 零時檔案局 :closed_book: ## 想跳坑嗎 - [跳坑指南](<https://g0v.hackmd.io/cdctnMJWQpKWQYhSxB8sCw>) - [Roadmap](/L1

chihao 2019-12-26 20:41:54

@mail.hkazami @tkirby tagging Steven and kirby, who said they wanted to jump in 😛

hkazami 2019-12-26 23:31:12

Reading through it now XD

🙌 4

gugod 13:04:32

[知識好好玩] 【哲學S03E06】民調都是假的啦！？　談歸納論證幾個需注意的推論 #
https://podplayer.net/?id=89999317 via @PodcastAddict

----

This episode covers several logic concepts that seem useful for helping define news quality.

fly 13:53:44

Cool, so it can block them in search engine results.

fly 14:03:51

The current labeling is only yes/no, and newly created domains cannot be reflected right away, which makes it rather inefficient.

image.png

fly 14:06:10

https://github.com/danny0838/content-farm-terminator/issues/34

GitHub

建議建立去中心化地回報列表及標識 · Issue #34 · danny0838/content-farm-terminator

有沒有一個類似這個又能讓群眾共識決擋下農場網站的服務呢，這個用 G form 中心化管理，我覺得這樣管理員太辛苦若只是標示農場，那麼以顏色濃度來表示回報是農場的人數，來提示農場等級，或許就沒有這麼「絕對」目前的標示方式，只有是非，新生出來的 domain 無法馬上反應顯得較沒效率

gugod 14:19:50

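As I recall, this browser extension can fetch a URL blocklist directly from a specified address. It is an approach somewhat like adblock.

fly 2019-12-25 15:43:07

Is there currently a g0v content farm list API? Something like https://danny0838.github.io/content-farm-terminator/files/blocklist/content-farms.txt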

👍 1

fly 15:42:24

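The project owner replied really fast
> As long as someone is willing to set up a server and write the code, it only needs to output the results as a web blacklist at a specific URL. But building and maintaining it both have costs, and it is hard to make money from this, so it depends on whether anyone is willing to do it...

I realized that if we only label content farms, we could use color intensity to represent the number of people who reported a site as a farm, hinting at how likely it is to be one. That might feel less "absolute".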

chihao 16:14:13

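What does disinformation actually look like? 🤔 Next Saturday, 1/4, is disinf0thon, the zeroth disinformation hackathon. Come and investigate it together.
Register here → https://forms.gle/kqffyonCYWTdeUgk8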

Google Docs

disinf0thon 第零次不實訊息松

不實訊息到底長什麼樣子呢？一起來研究吧。 :alarm_clock: 時間：2020/1/4 10:30-17:30 :round_pushpin: 地點：台北市（報名確認後以 email 通知） :wave: 主辦單位：g0v disinfoRG 不實資訊研究組、0archive 《零時檔案局》團隊 :alarm_clock: Time: 2020/1/4 10:30-17:30 :round_pushpin: Location: Taipei City (exact location will be emailed to you once registration info is verified) :wave: Host: g0v disinfoRG & 0archive 你所填寫的資料僅做為 g0v disinfoRG 舉辦不實訊息松使用。Info you submitted is only used by g0v disinfoRG for the purpose of hosting disinfo hackathons.

❤️ 2

wenyi 06:49:25

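@pm5 kknews & 壹讀 get blocked when running on middle2 (these are the two sites mentioned at Monday's dev meeting that we crawl with selenium but get nothing from). It looks like an IP problem. I tried a few proxies but got blank HTML back. Are there any possible solutions? @ronnywang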

Untitled

ronnywang 2019-12-26 09:55:49

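Are they behind Cloudflare, so it was Cloudflare that blocked you?

ronnywang 2019-12-26 09:58:01

It looks like kknews blocks AWS and Linode IPs?

ronnywang 2019-12-26 10:00:39

image.png

ronnywang 2019-12-26 10:01:22

Maybe rent the smallest machine on hicloud and set up a proxy as a relay XD hicloud probably won't be blocked.

ronnywang 2019-12-26 10:10:43

The smallest hicloud instance is 360 NTD per month.

ronnywang 2019-12-26 10:12:46

If even hicloud gets blocked, the safest option is to pay for a physical Chunghwa Telecom ADSL line and use a Raspberry Pi as the proxy; whenever the IP is blocked, it redials automatically to get a new IP.

ronnywang 2019-12-26 11:13:29

I figured I'd spin up a hicloud machine to try it, but it turns out hicloud provisions machines manually...

ronnywang 2019-12-26 11:14:20

Half an hour in and it still isn't ready... With nothing else to do I read their terms of service and found that it says "for services billed by the second, hour, day, or month, the minimum rental period is one continuous month", so I can't just use it for an hour and shut it down...

chihao 2019-12-26 11:40:30

ronny arrives with all kinds of solutions 😆

ronnywang 2019-12-26 11:53:28

An hour in and hicloud still hasn't finished provisioning the machine...

hcchien 2019-12-26 12:03:50

“Blocks AWS and Linode IPs?” What about gcs?

ronnywang 2019-12-26 12:04:11

I haven't tried gcs yet; I figured I'd go straight for the one least likely to be blocked XD

ronnywang 2019-12-26 12:04:46

Does anyone have a gcs server on hand to try it directly?

ronnywang 2019-12-26 12:06:30

Just run
```curl https://kknews.cc/story/ov3pljp.html```
on it.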

chihao 2019-12-26 12:08:32

Got a 302

chihao 2019-12-26 12:08:40

→ https://kknews.cc/challenge.html?ctx=MTRmZGFiMWY0NzAyNDAxNDg3ZjQzZjE1MWM4YjY3YjdkYTg5N2M0ZVsxNTc3MzMzMjM5Ljc3MTA3OCwiMzUuMjI5LjE1MC4xNjEiLCJcL3N0b3J5XC9vdjNwbGpwLmh0bWwiXQ--

chihao 2019-12-26 12:09:10

And then you get `正在驗證您的訪問請求` (Verifying your access request) `系統偵測到您的電腦網路發出了異常流量，您需經驗證才能訪問我們的網站。` (The system detected abnormal traffic from your network; you must pass verification to access our site.)

chihao 2019-12-26 12:09:21

That's from my GCP VM.
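
A minimal sketch of how a crawler could recognize this kind of block from the response alone, assuming the site keeps redirecting blocked clients to a path ending in `/challenge.html` as seen above. `is_challenge_redirect` is a hypothetical helper, not part of the existing scraper.

```python
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_challenge_redirect(status_code: int, location: str) -> bool:
    """True when a response redirects to a verification ("challenge") page."""
    if status_code not in REDIRECT_CODES:
        return False
    # Only the path is inspected, so the ctx query string does not matter.
    return urlsplit(location).path.endswith("/challenge.html")
```

With a check like this, a blocked fetch can be logged and retried through a different IP instead of being stored as an empty article.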

pm5 2019-12-26 12:47:13

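DigitalOcean is blocked too.

ronnywang 2019-12-26 12:55:07

I give up on HiCloud... I'll contribute my home HiNet fiber (光世代) dynamic IP instead.

chihao 2019-12-26 12:55:28

HiCloud → ByeCloud

pm5 2019-12-26 12:58:00

Pfft

ronnywang 2019-12-26 14:13:01

hicloud is finally up XD

ronnywang 2019-12-26 14:13:08

hicloud is not blocked.

ronnywang 2019-12-26 16:14:32

ok, I can offer two relays:
1. HiCloud
2. My home HiNet fiber line + Raspberry Pi
The former costs about 300 NTD / month and is somewhat faster (definitely faster than a Raspberry Pi), but the downside is that changing IPs is inconvenient: if the IP gets blocked, someone has to go to the web console manually to change the IP or switch to a new machine.

The latter has no extra cost (maybe a tiny bit of electricity XD). If the IP gets blocked, reconnecting PPPoE is enough, and that part can be exposed as an API. The downsides are that the Raspberry Pi itself is slow, and that it uses my personal HiNet account, which I'm not sure will cause any problems.

chihao 2019-12-26 16:15:05

I was just about to rope you in and you jumped in first!? 😆

pm5 2019-12-26 16:57:15

@ronnywang if we go with the Raspberry Pi, do you have time to work on it soon? Roughly how long would it take to set up?

pm5 2019-12-26 18:59:56

My thinking is that if the Raspberry Pi setup can be packaged for easy installation, we could get several more people to install it and round robin across them.

chihao 2019-12-26 20:42:39

Truly distributed

chihao 2019-12-26 20:44:22

Adding the backend architecture to the roadmap \o/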
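
A minimal sketch of round robin across several volunteer-run proxies. The `ProxyPool` class and the proxy URLs in the usage note are hypothetical; the returned dictionary has the shape that `requests` accepts for its `proxies=` argument.

```python
from itertools import cycle


class ProxyPool:
    """Hand out proxies one after another, wrapping around at the end."""

    def __init__(self, proxy_urls):
        proxy_urls = list(proxy_urls)
        if not proxy_urls:
            raise ValueError("at least one proxy URL is required")
        self._cycle = cycle(proxy_urls)

    def next_proxies(self):
        """Return a proxies mapping that routes both schemes through the next proxy."""
        url = next(self._cycle)
        return {"http": url, "https": url}
```

Usage would look like `requests.get(url, proxies=pool.next_proxies())` with a pool built from, say, `["http://pi-a.example:3128", "http://pi-b.example:3128"]`. This does not skip proxies that are down or blocked; that would need health checks on top.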

isabelhou 11:28:39

If the disinfo mini-hackathon wants to use the NPO HUB venue, I can be the key holder.

🙌 1

chihao 11:30:31

@isabelhou ++ For now we have tentatively booked a secret venue (?)

🖖 1

isabelhou 11:30:44

OK!

chihao 15:33:27

Here it is ⚠️ The Reporter's (《報導者》) coverage: https://www.twreporter.org/topics/information-warfare-business

twreporter.org

風向戰爭與它的推手：揭開台星馬中「資訊戰商人」的神祕面紗 - 報導者 The Reporter

《報導者》追蹤數月，揭發藏身在網路背後帶風向的「資訊戰商人」真面目。他們之中有些是握有數據和技術的政治行銷公司，也有操弄Facebook和Google演算法的工程師，政治和企業主向他們下單，讓他們在言論市場裡，操作不平等的訊息戰。

❤️ 5

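fly 09:54:41

> Like a typical content farm, Mission (密訊) carries all sorts of content, including the pet stories and personality quizzes that attract the most clicks; but open the article list and you find that nearly 90% of it consists of articles categorized as "news".
> The Reporter cross-checked against the database of the fact-checking platform Cofacts and found that, among all reported content, 39 verification requests were filed against Mission alone, of which only 4 turned out to be accurate.
>
> To trace where Mission's traffic goes, The Reporter also analyzed it with CrowdTangle and found that its articles mainly flow into pan-Blue fan pages and groups with more than ten thousand members, including 「2020韓國瑜總統後援會（總會）」, 「青天白日正義力量」, 「監督年金改革行動聯盟」, 「靠北民進黨」, and 「反蔡英文聯盟」; among them, 「青天白日正義力量」, a page with 100,000 fans, even lists the Mission URL directly in its own About field.
>
> Mission, which keeps fighting after every defeat, did not go down easily. Less than a week later, a new address, "pplomo.com", appeared (note); as of December 24, users could still share this site on Facebook. Through layer after layer of blocking, Mission remains standing.
>
> In the eyes of 沈伯洋 (Shen Po-yang), an assistant professor at National Taipei University's Graduate School of Criminology who follows information warfare, Mission is like "a propaganda machine promoting a particular political ideology". Everyone has their own stance, but if behind the scenes a particular country is reaching in and using particular information to interfere with perception, then Taiwan will be the loser in this war to "change perceptions".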
https://www.twreporter.org/a/information-warfare-business-content-farm-mission

twreporter.org

打不死的內容農場──揭開「密訊」背後操盤手和中國因素 - 報導者 The Reporter

曾創「一週被Facebook分享次數最多」紀錄，內容農場密訊在選舉相關社群廣傳。為何它能夠數度下架又重生？背後又是誰在操作？

chihao 10:20:00

Hmm, someone just emailed g0v-talks: there is an event in February called the US-TW tech challenge against disinfo. FYI: https://disinfocloud.com/taiwan-tech-challenge

hkazami 2019-12-27 22:08:55

Who is behind this event...?

chihao 2019-12-28 09:11:17

@mail.hkazami It's the U.S. Department of State. I posted the email here: https://g0v.hackmd.io/AHY4VqrkSbWrl9-zjX4Mdg

mengting 14:01:15

@mengting has joined the channel

bruce 17:24:46

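Can the zeroth hackathon on 1/4 be publicly promoted beyond disinfo, for example on fan pages or in personal stories?

chihao 17:39:43

I think that's fine!

chihao 17:40:16

But as we discussed in the meeting, participants will be decided based on what people fill in on the registration form 😛

👍 1 🙌 2

chihao 17:58:47

Another thing I'd like to discuss openly with @pm5 @ayw255 @fockerlee is 1/18: do you feel we need a post-election mini-hackathon, or would you rather rest? It is right around the first deadline, so a cowork session might be good too. Also, jothon plans to hold an infrastructure hackathon (基礎松) on 1/18; I talked with @besslee about merging the two, but that feels a bit crowded 😛

bruce 2019-12-27 23:04:12

For now I'm +1 on cowork

pm5 2019-12-27 23:32:31

cowork

wenyi 2019-12-28 06:12:20

Cowork ++

chihao 2019-12-28 11:50:38

Oh, OK \o/ In that case, I asked @besslee to reserve a separate meeting room for us at NPO Hub on 1/18, which we can just use (bess++). During the day the 4F common area will host the g0v infrastructure hackathon (基礎松), and in the evening there will be a sharing session (no spoilers yet). What does everyone think? Finding another venue is also fine. @ayw255 @fockerlee @pm5

bruce 2019-12-28 13:41:23

Yay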

bess 17:58:51

@besslee has joined the channel

Sofia 20:51:34

@sofia has joined the channel

pm5 15:35:10

@ayw255 https://2.python-requests.org//en/latest/user/advanced/#proxies

pm5 16:59:31

@fockerlee regarding the comment reply_to value you asked about: you should be able to get it with MySQL's `LAST_INSERT_ID()` right when you INSERT the post, so there is no need to query again https://dev.mysql.com/doc/refman/5.7/en/getting-unique-id.html
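
A minimal sketch of the same idea from Python. It uses the standard library's `sqlite3` with an in-memory database purely as a runnable stand-in; MySQL drivers such as PyMySQL expose the generated id through the same DB-API `cursor.lastrowid` attribute. The table and column names here are simplified assumptions, not the real 0archive schema.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with simplified post and comment tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)")
    conn.execute(
        "CREATE TABLE comment ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, reply_to INTEGER, body TEXT)"
    )
    return conn


def insert_post_with_comment(conn, post_body, comment_body):
    """Insert a post, then a comment that points at it, without a second query."""
    cur = conn.cursor()
    cur.execute("INSERT INTO post (body) VALUES (?)", (post_body,))
    post_id = cur.lastrowid  # id generated by the INSERT just above
    cur.execute(
        "INSERT INTO comment (reply_to, body) VALUES (?, ?)",
        (post_id, comment_body),
    )
    comment_id = cur.lastrowid
    conn.commit()
    return post_id, comment_id
```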

pm5 08:18:48

@fockerlee I think I ate something bad; I can't make it this afternoon.

isabelhou 2019-12-30 11:13:34

Take care

bruce 2019-12-30 12:07:13

No problem! Take care!

allenlinli 16:18:41

@allenlinli has joined the channel

pm5 18:33:21

@ayw255 the proxy problem looks like an issue in urllib3, which requests uses https://github.com/urllib3/urllib3/issues/1520

GitHub

Accessing HTTPS sites through proxy · Issue #1520 · urllib3/urllib3

Why is it that urllib3 doesn't allow proxy access to HTTPS sites? See poolmanager.py#L416. This limitation seems to be also reflected in the requests library as well. poolmanager.py#L416 def co...

pm5 2019-12-30 18:35:12

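scrapy doesn't use urllib3; does it have the same problem?

wenyi 2019-12-30 18:45:36

@pm5 I haven't tried Scrapy; I've been using selenium throughout, because kknews & 壹讀 are both sites that have to be opened with selenium.
selenium needs an extension (because username/password authentication is required). It worked with Chrome, but extensions can't be used in headless mode (which is a dead end, since on Linux it crashes without headless). It looks like Firefox can use extensions in headless mode, but I haven't gotten that to work yet Q

pm5 2019-12-30 19:25:14

@ronnywang could the proxy use token authentication instead of basic auth?

pm5 2019-12-30 19:25:32

Oh right, what we need is selenium. I completely forgot.

ronnywang 2019-12-30 19:34:06

My proxy currently runs on Apache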
```<VirtualHost *:443>
ProxyRequests On
ProxyVia On
<Proxy *>
AuthType basic
AuthName "private area"

AuthUserFile /srv/passwd
Require valid-user

Order allow,deny
Allow from all
</Proxy>
</VirtualHost>```
The configuration is as above.

lexifdev 2019-12-31 00:25:27

(I read this through Google Translate, so this assumes the translation is correct~)
If you want to access web resources that are behind ‘Basic Authentication’, you can do it easily with requests (without complex Selenium configuration).
```import requests.auth
requests.get('https://api.github.com/user', auth=requests.auth.HTTPBasicAuth('user', 'pass'))```
https://requests.readthedocs.io/en/master/user/authentication/

pm5 2019-12-31 11:55:48

@sl thanks ❤️

pm5 2019-12-31 11:56:59

We are trying to crawl a few web sites, plus Facebook, that require selenium, though. Otherwise they do not respond with meaningful content.

lexifdev 2020-01-01 10:44:57

Yeah, many websites cannot be crawled without a real browser-based tool,
but I always try this once first.

Every time I crawl Facebook, I use the mobile website (https://iphone.facebook.com/). It also has a ‘noscript’ version.

Sites that block by ‘User-Agent’ or ‘Referer’ can sometimes be worked around with
```import requests

session = requests.Session()
session.headers.update({'Referer': 'https://~~~'})  # or
session.headers.update({'User-Agent': 'Mozilla/5.0 ~~~~'})```
(maybe you already know)

ronnywang 09:41:20

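@ayw255 starting about two hours ago the host has been running out of memory. Are you deploying some kind of update?

wenyi 2019-12-31 09:42:45

@ronnywang other than adding a few sites to the run last night, I haven't changed anything. Could the new sites be the cause? 🤔

ronnywang 2019-12-31 09:43:48

Does each site run in its own process?

wenyi 2019-12-31 09:45:44

One program starts a spider for each site using multiprocessing, so several spiders run at the same time.

ronnywang 2019-12-31 09:45:51

I'll set up a dedicated host for disinfo XD Looks like memory usage is going to be higher now.

wenyi 2019-12-31 09:54:28

It may be that the new sites have so much content that the crawl never ends. I'll set a depth limit.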
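
A minimal sketch of what a depth limit does, written as a plain breadth-first crawl so it runs without any crawler framework. `crawl_with_depth_limit` and the `get_links` callback are hypothetical; if the spiders are Scrapy spiders, Scrapy's own `DEPTH_LIMIT` setting would be the usual way to get this behavior.

```python
from collections import deque


def crawl_with_depth_limit(start_url, get_links, depth_limit):
    """Visit pages breadth-first, never following links deeper than depth_limit."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= depth_limit:
            continue  # page is visited, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

The limit bounds how far the crawl wanders from the start page, which also bounds the size of the queue held in memory.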

ronnywang 10:15:48

image.png

ronnywang 10:15:59

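It looks like the hourly cron runs never finish and keep piling up.

ronnywang 10:17:56

How about adding a check so that if an hourly cron run finds it has been running for more than 60 minutes, it terminates itself immediately?

ronnywang 10:18:33

Meanwhile I'm manually killing the cron runs that have been going for over an hour.

mrorz 10:21:05

The link in this image seems to be broken QQ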

128885

chihao 10:21:52

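Is it on fire? 😢 Thanks for reporting, mrorz. I'll fix it later.

mrorz 2019-12-31 11:41:43

Seems to be a permissions issue.

chihao 2019-12-31 11:51:29

Fixed. The Airtable is currently editable only by project members; everyone else gets read-only access https://airtable.com/shrd0utGHlTWmQsYt

ronnywang 10:23:17

I've set up a dedicated Linode for disinfo.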

1

1 😂 1

ronnywang 10:35:47

Nothing needs to change on your side; from now on middle2 will automatically handle all new disinfo requests on that dedicated Linode.

pm5 15:13:05

What does everyone think: should we crawl dcard?

🙌 1

chihao 15:13:39

Crawl, crawl, crawl 🐛

pm5 15:14:33

We can crawl per user: https://www.dcard.tw/@young1632054129

Dcard

微生物學小渣渣的公開頁面 | Dcard

廣受年輕人喜愛的 Dcard 是台灣最大的匿名交流平台，舉凡時事話題、感情心情、吃喝玩樂、學習工作等，都有卡友陪你聊！

ronnywang 15:33:53

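batch_discover.py still has the problem of not finishing.

pm5 2019-12-31 16:33:27

Then let's add a --run-time-limit check.

pm5 2019-12-31 17:01:00

Is there a way for us to kill unfinished cronjobs from the shell ourselves right now?

ronnywang 2019-12-31 17:01:19

Nope XD

ronnywang 2019-12-31 17:01:31

I'll kill them then.

pm5 2019-12-31 17:02:03

XD Then please kill the ones currently running for us, and I'll set it to a daily run for now.

pm5 2019-12-31 17:03:52

Wait, no, if I change it to daily now it will just start running again XD

pm5 2019-12-31 17:04:12

I'll just delete that cronjob for now.

pm5 2019-12-31 18:53:21

For now I crudely added a killall after 58 minutes and put it back as an hourly cronjob.

pm5 2020-01-01 12:13:35

batch_discover has been running hourly again since last night. From the logs it looks like the runs are being killed properly. How is memory looking?

pm5 2020-01-01 12:14:01

In particular, is selenium being killed properly?

ronnywang 2020-01-01 13:23:12

They seem to be getting killed normally; it doesn't look like anything is piling up anymore.

pm5 2020-01-01 13:54:13

Thanks 🙏 Happy New Year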
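
A minimal sketch of how a `--run-time-limit` check could work inside the script itself, as an alternative to the external killall. The flag name comes from the discussion above; the function names, the 58-minute default, and the idea of checking between per-site tasks are assumptions, and this is not the actual `batch_discover.py` code.

```python
import argparse
import time


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--run-time-limit",
        type=float,
        default=58 * 60,
        help="stop starting new work after this many seconds",
    )
    return parser.parse_args(argv)


def run_with_time_limit(tasks, limit_seconds, clock=time.monotonic):
    """Run tasks in order until the deadline passes; return how many completed."""
    deadline = clock() + limit_seconds
    completed = 0
    for task in tasks:
        if clock() >= deadline:
            break  # out of time: leave the remaining tasks for the next run
        task()
        completed += 1
    return completed
```

The deadline is only checked between tasks, so a single task that hangs (for example a stuck selenium session) would still need an external kill like the one already in place.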

pm5 16:30:13

@fockerlee the fb columns are set up: https://g0v.hackmd.io/@chihao/0archive/https%3A%2F%2Fg0v.hackmd.io%2FlMQO37z6SbWNWo3R4-X_EA

HackMD

0archive 零時檔案局 :closed_book: - HackMD

# 0archive 零時檔案局 :closed_book: ## 想跳坑嗎 - [跳坑指南](<https://g0v.hackmd.io/cdctnMJWQpKWQYhSxB8sCw>) - [Roadmap](/L1

pm5 2019-12-31 16:31:45

`Site` gained `site_info`; `FBPostSnapshot` gained `fb_post_info` and `author_info` and lost `fb_post_id`; `FBCommentSnapshot` gained `fb_comment_info` and `author_info` and lost `fb_comment_id`. All the new ones are JSON columns.

bruce 2019-12-31 16:36:56

Great! Thanks!

ronnywang 17:37:03

Ah, sorry, I accidentally killed someone's bash.

chihao 18:32:02

https://news.ltn.com.tw/news/politics/breakingnews/3025948

自由電子報

反滲透法》「大師鏈」批政府濫用法令急宣布放棄台灣市場 - 政治 - 自由時報電子報

立法院今天三讀通過「反滲透法」，民進黨團、國民黨團、時代力量黨團都提出修正動議，最後民進黨以人數優勢下午成功闖關。首家獲中國政府同意落地的台灣媒體《大師鏈》在反滲透法三讀通過後急忙發出聲明，怒批反滲透法為惡法，並宣布將暫時「放棄台灣市場」。落地北京的網路平台《大師鏈》是首家獲中國同意進駐的台灣媒體，藍委曾明宗、國民黨榮譽主席連戰等國民黨人士都表達支持，不過因《大師鏈》延攬前國安局長丶軍情局長加入，更被外界懷疑背後有「紅色力量」，引發熱議。

chihao 18:32:21

Shall we crawl Master Chain (《大師鏈》)? 😆 https://www.masterchain.media/

masterchain.media

大師鏈 - 全球華人價值內容平台

大師鏈旨在構建一個以AI技術為基礎的，全球華人價值內容平台。匯聚全球各地頂尖大師與專家輸出高質量內容，打造全球華人深度媒體新能量。基於AI技術的「大師孵化器平台」，人人都是自己領域的「大師」。這是一個相互賦能的AI價值生態！

tkirby 18:38:50

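The Master Chain homepage source has comments in Simplified Chinese. Was this site built by people from China? Wasn't it supposed to be a Taiwanese company?

🧐 1

tkirby 18:40:21

They use GA, Hotjar, and kissmetrics. Maybe we can dig up some connections from the IDs? XD

🙏 1

hcchien 18:41:39

Is everyone starting to look at trace ids now? lol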

3
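
A minimal sketch of the ID-digging idea for Google Analytics: pull `UA-…` tracking IDs out of page source and see which sites share one. The regular expression assumes the classic `UA-<account>-<property>` format; the function names and example hostnames are hypothetical, and Hotjar or kissmetrics IDs would need their own patterns.

```python
import re

# Classic Universal Analytics IDs look like UA-12345678-1.
GA_ID_PATTERN = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")


def find_ga_ids(html: str):
    """Return the distinct GA tracking IDs found in a page's source, sorted."""
    return sorted(set(GA_ID_PATTERN.findall(html)))


def group_sites_by_ga_id(pages: dict) -> dict:
    """Map each GA ID that appears on more than one site to the sites using it."""
    groups = {}
    for site, html in pages.items():
        for ga_id in find_ga_ids(html):
            groups.setdefault(ga_id, []).append(site)
    return {ga_id: sites for ga_id, sites in groups.items() if len(sites) > 1}
```

A shared ID is only a hint that two sites may be run by the same operator, so any match would still need manual confirmation.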

ronnywang 19:59:08

Master Chain even hosts its website on Alibaba Cloud (阿里雲).

3

disinfo

Month: 2019-12
