The Big Data Paradox

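If the data are defective, a larger sample does not remove the bias; it only makes the wrong answer look more precise ("garbage in, garbage out").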

The map is not the territory (a model is not the reality it describes). - Alfred Korzybski

"All models are wrong, but some are useful." (George Box)

Myths of Big Data (大數據的迷思)

https://medium.com/math-and-statistics/%E5%A4%A7%E6%95%B8%E6%93%9A%E7%9A%84%E8%BF%B7%E6%80%9D-ebbba8df5517

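1. Law of large numbers: as long as the sample size n is large enough, we can estimate a population parameter (for example, the mean) accurately.

2. Central limit theorem: whatever the distribution of the population (provided its variance is finite), the mean of n independent random draws is approximately normally distributed across repeated samples (a common rule of thumb is n > 30), and the error of the estimated mean is proportional to 1/√n, regardless of the population size N.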
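The 1/√n scaling in the two results above can be checked with a short simulation. This is a minimal sketch, assuming a skewed Exp(1) population (mean 1, standard deviation 1) chosen only for illustration; the function name `sd_of_sample_means` and the number of repetitions are hypothetical choices, not part of the original notes.

```python
import random
import statistics

def sd_of_sample_means(n, reps=2000, seed=0):
    """Empirical SD of the mean of n random draws from a skewed Exp(1) population."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        draws = [rng.expovariate(1.0) for _ in range(n)]
        means.append(sum(draws) / n)
    return statistics.stdev(means)

# Compare the simulated spread of the sample mean with the theoretical 1/sqrt(n).
for n in (25, 100, 400):
    print(n, round(sd_of_sample_means(n), 4), round(1 / n ** 0.5, 4))
```

Quadrupling n should roughly halve the spread of the sample mean, and nothing in the calculation depends on how large the population is.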

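Both results assume probability sampling. Without it, the error, measured against the standard error of a simple random sample of the same size, grows in proportion to √N. This is the Law of Large Populations: the larger the population, the more a given selection bias dominates, even though the sampling variance keeps shrinking as n grows. The actual error shrinks only as the sampling fraction n/N approaches 1, not merely because n is large.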
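The √N scaling can be made concrete with a small calculation. In Meng (2018), listed among the readings below as "Statistical paradises and paradoxes in big data", the error of a non-probability sample divided by the standard error of a simple random sample of the same size equals ρ × √(N − 1), where ρ is the correlation between being included in the data and the outcome. The sketch below only evaluates that ratio; the value ρ = 0.005 and the population sizes are assumptions chosen for illustration, and `relative_error` is a hypothetical helper name.

```python
import math

def relative_error(rho, N):
    """Error of a biased sample in units of the SRS standard error: rho * sqrt(N - 1)."""
    return rho * math.sqrt(N - 1)

# Hypothetical values: the same tiny selection correlation, three population sizes.
for N in (10_001, 1_000_001, 100_000_001):
    print(N, round(relative_error(0.005, N), 2))
```

A ratio of 5 means the biased sample misses by five standard errors of a same-sized random sample, and the ratio keeps growing with N.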

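Error = data defect correlation (data quality) × √((1 − f)/f) (data quantity) × standard deviation of the data (problem difficulty), where f = n/N is the sampling fraction. In squared form: mean squared error = data defect index × (1 − f)/f × variance of the data. When f = 1 the error is 0; as f → 0 the factor √((1 − f)/f) grows without bound.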
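The decomposition above is an exact algebraic identity for any finite population, which a few lines of code can confirm. This is a minimal sketch under stated assumptions: the 0/1 population, the inclusion rates of 18% and 22%, and the helper `identity_terms` are all hypothetical and serve only to produce a self-selected sample.

```python
import math
import random

def identity_terms(y, r):
    """Return (actual error, rho, f, sigma) for outcomes y and 0/1 inclusion flags r."""
    N = len(y)
    n = sum(r)
    f = n / N
    mean_y = sum(y) / N
    sigma = math.sqrt(sum((v - mean_y) ** 2 for v in y) / N)  # population SD
    cov = sum((v - mean_y) * (flag - f) for v, flag in zip(y, r)) / N
    rho = cov / (sigma * math.sqrt(f * (1 - f)))  # corr(inclusion, outcome)
    sample_mean = sum(v for v, flag in zip(y, r) if flag) / n
    return sample_mean - mean_y, rho, f, sigma

rng = random.Random(1)
N = 100_000
# Hypothetical population: y = 1 for supporters of some option, else 0.
y = [1 if rng.random() < 0.5 else 0 for _ in range(N)]
# Hypothetical self-selection: supporters end up in the data slightly less often.
r = [1 if rng.random() < (0.18 if v == 1 else 0.22) else 0 for v in y]

error, rho, f, sigma = identity_terms(y, r)
predicted = rho * math.sqrt((1 - f) / f) * sigma
print(round(error, 4), round(predicted, 4), round(rho, 4), round(f, 3))
```

The actual error of the sample mean and the product of the three terms agree up to floating-point rounding. In real data ρ cannot be computed this way, because it requires the outcomes of the people who are not in the sample.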

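Data defect correlation (ddc; the data defect index, ddi, is its expected square): it cannot be estimated from the sample itself and can only be learned from historical data. For example, in the 2016 US presidential election polls, the ddc was -0.45% for Trump voters and -0.021% for Hillary Clinton voters.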

https://github.com/kuriwaki/ddi

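Figure 1: suppose ddc = 0.05. The effective sample size is approximately f/((1 − f) × ddc²). At f = 0.2 the effective sample size is 100, meaning a simple random sample of only n = 100 has the same error as a non-random sample of n = 0.2 × N (for example, a random sample of 100 people is as accurate as a non-random sample of 4.78 million people in Taiwan). At f = 0.5 the effective sample size is 400; at f = 0.7 the effective sample size (of the equivalent simple random sample) is roughly 1000 (about 933 by the same formula).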
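The effective sample sizes quoted for Figure 1 can be reproduced with the approximation n_eff ≈ f/(1 − f) × 1/ρ² from Meng (2018), which holds when n_eff is much smaller than N. The sketch below assumes that formula and treats 0.05 as the data defect correlation ρ; `effective_sample_size` is a hypothetical helper name.

```python
def effective_sample_size(rho, f):
    """Approximate size of the simple random sample with the same mean squared error."""
    return f / (1 - f) / rho ** 2

# Sampling fractions from the text, with rho = 0.05.
for f in (0.2, 0.5, 0.7):
    print(f, round(effective_sample_size(0.05, f)))
```

For f = 0.2, 0.5 and 0.7 this gives about 100, 400 and 933.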

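Suppose you want to know how salty a bowl of soup is. As long as the soup is well stirred, one small sip is enough, however large the bowl. Likewise, however large the population N, a probability sample only needs n to be large enough (a poll of 1,000 people can estimate the views of all Americans). Unfortunately, most studies rely on non-probability samples (convenience samples) that carry selection bias. Only complex or stratified surveys and opinion polls use probability sampling, and even those are affected by nonresponse bias.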

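In the 2016 US presidential election, almost every poll predicted that Hillary Clinton would win, yet Trump won. One contributing factor was a high nonresponse rate, especially among Trump supporters.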
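A toy simulation shows how differential nonresponse can make a very large poll worse than a small random one. This is a sketch under stated assumptions, not a model of the 2016 polls: the electorate size, the 50% true support, and the response rates of 35% and 45% are invented for illustration, and `poll_demo` is a hypothetical function.

```python
import math
import random

def poll_demo(N=200_000, seed=2016):
    rng = random.Random(seed)
    # Hypothetical electorate: 1 = supports candidate A, 0 = does not.
    votes = [1 if rng.random() < 0.5 else 0 for _ in range(N)]
    truth = sum(votes) / N

    # Small probability sample: 1,000 voters drawn at random, all of whom answer.
    srs = rng.sample(votes, 1000)

    # Large self-selected sample: supporters of A answer less often
    # (assumed response rates of 35% vs 45%, chosen only for illustration).
    big = [v for v in votes if rng.random() < (0.35 if v == 1 else 0.45)]

    def summary(sample):
        p = sum(sample) / len(sample)
        margin = 1.96 * math.sqrt(p * (1 - p) / len(sample))  # naive 95% margin
        return len(sample), p - truth, margin

    return summary(srs), summary(big)

(srs_n, srs_err, srs_moe), (big_n, big_err, big_moe) = poll_demo()
print("random sample:", srs_n, round(srs_err, 4), round(srs_moe, 4))
print("self-selected:", big_n, round(big_err, 4), round(big_moe, 4))
```

The self-selected sample is dozens of times larger and reports a much tighter margin of error, yet its estimate sits several points away from the truth, well outside its own naive interval.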

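Bias-variance trade-off: the more complex the model, the smaller the bias, the larger the variance (smaller effective sample size), and the better suited it is to explanation (causal inference); the simpler the model, the larger the bias, the smaller the variance (larger effective sample size), and the better suited it is to prediction.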
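The bias and variance halves of this trade-off can be illustrated numerically; the claims about explanation and prediction are not something a simulation of this kind can show. The sketch below assumes a sine curve as the true relationship and uses a piecewise-constant fit whose number of steps (`bins`) stands in for model complexity. The function name, noise level and sample size are all hypothetical choices.

```python
import math
import random

def bias_variance(bins, n=40, noise_sd=0.3, reps=300, seed=0):
    """Average squared bias and variance of a piecewise-constant fit with `bins` steps."""
    rng = random.Random(seed)
    xs = [(i + 0.5) / n for i in range(n)]  # fixed design points on (0, 1)
    truth = [math.sin(2 * math.pi * x) for x in xs]
    which = [min(int(x * bins), bins - 1) for x in xs]  # bin index of each point
    fits = []
    for _ in range(reps):
        ys = [t + rng.gauss(0, noise_sd) for t in truth]  # fresh noisy data set
        sums = [0.0] * bins
        counts = [0] * bins
        for b, value in zip(which, ys):
            sums[b] += value
            counts[b] += 1
        fits.append([sums[b] / counts[b] for b in which])  # fitted value = bin mean
    mean_fit = [sum(fit[i] for fit in fits) / reps for i in range(n)]
    bias2 = sum((m - t) ** 2 for m, t in zip(mean_fit, truth)) / n
    var = sum(
        sum((fit[i] - mean_fit[i]) ** 2 for fit in fits) / reps for i in range(n)
    ) / n
    return bias2, var

# bins = 1 is the simplest model (a single mean); bins = 20 is the most flexible here.
for bins in (1, 4, 20):
    b2, v = bias_variance(bins)
    print(bins, round(b2, 4), round(v, 4))
```

With one step the fit ignores the curve entirely (large squared bias, tiny variance); with twenty steps each fitted value averages only two observations (small bias, much larger variance).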

The Big Data Paradox in Clinical Practice (2022)

https://www.tandfonline.com/doi/full/10.1080/07357907.2022.2084621

Generalization Bias in Science (2022)

https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.13188

Observations on Big Data, Precision Health, and Machine Learning (2022)

https://discourse.datamethods.org/t/observations-on-big-data-precision-health-and-machine-learning/6061

Unrepresentative big surveys significantly overestimated US vaccine uptake (2021)

https://www.nature.com/articles/s41586-021-04198-4

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GKBUUK

https://github.com/vcbradley/ddc-vaccine-US

Towards Principled Unskewing: Viewing 2020 Election Polls Through a Corrective Lens From 2016 (2020)

https://hdsr.mitpress.mit.edu/pub/cnxbwum6/release/5

Addressing selection bias and measurement error in COVID-19 case count data using auxiliary information (2020)

https://arxiv.org/abs/2005.10425

What surveys really say (2021)

https://www.nature.com/articles/d41586-021-03604-1

What is your question in a RCT? Causation or prediction?

https://www.slideshare.net/StephenSenn1/what-is-your-question-250659957

The paradox of big data (2020)

https://link.springer.com/article/10.1007/s42452-020-2862-5

Building Deep Statistical Thinking for Data Science (2020)

https://youtu.be/9WDtuCVZUp8

Meng’s “Big Data Paradox” and an extreme example

https://civilstat.com/2018/10/mengs-big-data-paradox-and-an-extreme-example/

Statistical paradises and paradoxes in big data (2018)

https://statistics.fas.harvard.edu/files/statistics-2/files/statistical_paradises_and_paradoxes.pdf

https://www.nbi.dk/~koskinen/Teaching/AdvancedMethodsInAppliedStatistics2019/StudentPresentations/WriteUp_TheaQuistgaard.pdf

Statistical paradises and paradoxes in Big Data (2018)

https://youtu.be/8YLdIDOMEZs

https://youtu.be/yz3jOIHLYhU

https://youtu.be/54T-guLVGdQ

https://www.nbi.dk/~koskinen/Teaching/AdvancedMethodsInAppliedStatistics2019/StudentPresentations/Presentation_TheaQuistgaard.pdf

More Data Means More Problems (a Paradox)

https://youtu.be/evRp7IYGyeg

Real-World Evidence for Regulatory Decision-Making: Guidance From Around the World (2022)

https://www.sciencedirect.com/science/article/pii/S0149291822000170

The challenges and opportunities in using real-world data to drive advances in healthcare in East Asia: expert panel recommendations (2022)

https://www.tandfonline.com/doi/full/10.1080/03007995.2022.2096354

Reproducibility of real-world evidence studies using clinical practice data to inform regulatory and coverage decisions (2022)

https://www.nature.com/articles/s41467-022-32310-3

Real-World Evidence - Where Are We Now?

https://www.fda.gov/media/158581/download

https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence

Current Status, Challenges, and Future Perspectives of Real-World Data and Real-World Evidence in Japan (2021)

https://link.springer.com/article/10.1007/s40801-021-00266-3

The Real-World Data Challenges Radar: A Review on the Challenges and Risks regarding the Use of Real-World Data (2021)

https://www.karger.com/Article/FullText/516178
