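The Big Data Paradox
If the data are defective, a larger sample makes matters worse rather than better: the bias does not shrink, while the shrinking variance makes the wrong answer look ever more certain ("garbage in, garbage out").
"The map is not the territory" (a model is not the reality it describes). (Alfred Korzybski)
"All models are wrong, but some are useful." (George Box)
The Myth of Big Data (大數據的迷思)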
https://medium.com/math-and-statistics/%E5%A4%A7%E6%95%B8%E6%93%9A%E7%9A%84%E8%BF%B7%E6%80%9D-ebbba8df5517
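1. Law of large numbers: with a large enough sample size n, a random sample estimates the population parameter (for example, the mean) accurately.
2. Central limit theorem: whatever the distribution of the population (provided its variance is finite), the means of repeated independent random samples are approximately normally distributed once n is large enough (n > 30 is a common rule of thumb), and the error of the estimated mean is proportional to 1/√n, regardless of the population size N.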
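The 1/√n behaviour can be checked with a short simulation. The sketch below is illustrative only: the exponential population, its size, the sample sizes, the number of repetitions and the seed are all assumptions chosen for the example, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical, strongly skewed population (exponential with mean 1).
population = rng.exponential(scale=1.0, size=200_000)
sigma = population.std()

def sd_of_sample_means(n, reps=4000):
    """Empirical standard deviation of the mean over `reps` random samples of size n."""
    # Drawing with replacement keeps the draws independent; with n far below N
    # this is a close stand-in for simple random sampling.
    idx = rng.integers(0, population.size, size=(reps, n))
    return population[idx].mean(axis=1).std()

results = {n: sd_of_sample_means(n) for n in (25, 100, 400)}
for n, sd in results.items():
    print(f"n={n:4d}  empirical SD={sd:.4f}  sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
```

Quadrupling n should roughly halve the spread of the sample means, and the population size never enters the calculation.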
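These laws hold only under probability sampling. Without it, for a fixed sampling fraction f = n/N and a fixed data defect, the error relative to that of a simple random sample of the same size grows in proportion to √N. This is the law of large populations: the larger the population, the worse a non-random sample performs against its random benchmark, even though its sampling variance keeps shrinking.
Error = data defect index (ddi; data quality) × √((1 − f)/f) (data quantity) × data variability (problem difficulty), where f = n/N. In Meng's identity the first factor is the correlation between being included in the sample and the value being measured, and the last is the population standard deviation. When f = 1 the error is 0; as f approaches 0 the error grows without bound.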
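The decomposition of the error can be verified numerically on a simulated finite population. The sketch below is a minimal illustration under stated assumptions: the normal outcome G, the logistic selection rule and every constant in it are hypothetical, and the names `rho`, `sigma_G` and `f` follow Meng's notation rather than anything defined in this post.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 100_000
G = rng.normal(loc=50.0, scale=10.0, size=N)   # hypothetical outcome for every unit

# Hypothetical selection mechanism: units with larger G are slightly more
# likely to be recorded (R is True for recorded units).
p_select = 0.4 / (1.0 + np.exp(-(G - 50.0) / 50.0))
R = rng.random(N) < p_select

n = int(R.sum())
f = n / N                                      # sampling fraction
rho = np.corrcoef(R.astype(float), G)[0, 1]    # correlation between inclusion and outcome
sigma_G = G.std()                              # population SD (divides by N)

actual_error = G[R].mean() - G.mean()
identity_error = rho * np.sqrt((1.0 - f) / f) * sigma_G
srs_standard_error = sigma_G / np.sqrt(n)      # benchmark: random sample of the same size

print(f"f={f:.3f}  rho={rho:.4f}")
print(f"actual error   = {actual_error:.4f}")
print(f"identity error = {identity_error:.4f}")
print(f"SRS standard error at the same n = {srs_standard_error:.4f}")
```

The two error values agree up to floating-point rounding because the identity is algebraic, not an approximation. In this setup a correlation of only a few percent produces an error many times larger than the standard error a random sample of the same size would have.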
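Data defect index (ddi): it cannot be estimated from the sample itself and can only be learned from historical data in which the truth later became known. For example, in polls for the 2016 US presidential election the ddi was -0.45% for Trump voters and -0.021% for Clinton voters.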
https://github.com/kuriwaki/ddi
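Figure 1: suppose ddi = 0.05. At f = 0.2 the effective sample size is 100; that is, a simple random sample of only n = 100 has the same error as a non-random sample of n = N × 0.2 (for example, a random sample of 100 people is as accurate as a non-random sample of 4.78 million people in Taiwan). At f = 0.5 the effective sample size is 400; at f = 0.7 the effective sample size (in simple-random-sample terms) is about 1,000.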
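The numbers in the figure can be reproduced with Meng's approximation for the effective sample size, n_eff ≈ f/(1 − f) × 1/ρ², where ρ is the data defect value (0.05 here). The function name `effective_sample_size` and its argument names are hypothetical, chosen for this sketch.

```python
def effective_sample_size(f, ddc):
    """Approximate size of a simple random sample with the same mean squared error.

    f   : sampling fraction n/N, strictly between 0 and 1
    ddc : data defect correlation (non-zero)
    """
    if not 0.0 < f < 1.0:
        raise ValueError("f must lie strictly between 0 and 1")
    if ddc == 0:
        raise ValueError("ddc must be non-zero")
    return f / (1.0 - f) / ddc ** 2

for f in (0.2, 0.5, 0.7):
    print(f"f={f}: effective sample size is about {effective_sample_size(f, 0.05):.0f}")
```

With a data defect of 0.05 this gives about 100, 400 and 933 for f = 0.2, 0.5 and 0.7; the last value is what the text rounds to 1,000.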
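Suppose you want to know how salty a bowl of soup is. As long as the soup is well stirred, one small sip is enough, however large the bowl. In the same way, however large the population (N), a probability sample only needs n to be large enough (a poll of 1,000 people is enough to estimate the views of the entire US population). Unfortunately, most studies rely on non-probability samples (convenience samples) that suffer from selection bias. Only complex or stratified surveys and opinion polls use probability sampling, and even they are affected by non-response bias.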
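The contrast between a small probability sample and a large convenience sample can be simulated. The sketch below is illustrative only: the population size, the 50% true share, and the response rates of 22% and 18% are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

N = 1_000_000
votes = rng.random(N) < 0.50                   # True = supports candidate A (hypothetical)
pop_share = votes.mean()

# Probability sample: every unit has the same chance of being drawn.
srs = rng.choice(votes, size=1000, replace=False)

# Convenience sample: supporters of A respond a little more often (assumed rates).
respond_prob = np.where(votes, 0.22, 0.18)
convenience = votes[rng.random(N) < respond_prob]

srs_error = abs(srs.mean() - pop_share)
conv_error = abs(convenience.mean() - pop_share)

# Naive 95% margin of error that treats the convenience sample as if it were random.
p_hat = convenience.mean()
conv_margin = 1.96 * np.sqrt(p_hat * (1.0 - p_hat) / convenience.size)

print(f"random sample      n={srs.size:7d}  error={srs_error:.4f}")
print(f"convenience sample n={convenience.size:7d}  error={conv_error:.4f}  naive margin={conv_margin:.4f}")
```

The convenience sample is roughly two hundred times larger, yet its error stays near five percentage points while its naive margin of error is a fraction of one point, so the interval confidently excludes the truth.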
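In the 2016 US presidential election, almost every poll predicted that Hillary Clinton would win, yet Donald Trump won. One factor was the high refusal rate, especially among Trump supporters.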
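Bias-variance trade-off: the more complex the model, the smaller the bias, the larger the variance (the effective sample size falls), and the better it explains (causal inference); the simpler the model, the larger the bias, the smaller the variance (the effective sample size rises), and the better it predicts.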
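The bias and variance halves of this trade-off can be seen by fitting polynomials of increasing degree to repeated noisy data sets. This is a minimal sketch under assumptions: the sine-shaped truth, the noise level, the sample of 30 points and the degrees 1, 3 and 7 are hypothetical choices, and the example does not address the claims about effective sample size or about explanation versus prediction.

```python
import numpy as np

rng = np.random.default_rng(3)

x = np.linspace(0.0, 1.0, 30)
truth = np.sin(2.0 * np.pi * x)       # hypothetical "true" relationship
noise_sd = 0.3

def bias2_and_variance(degree, reps=500):
    """Mean squared bias and mean variance of polynomial fits over repeated noisy data sets."""
    fits = np.empty((reps, x.size))
    for r in range(reps):
        y = truth + rng.normal(0.0, noise_sd, size=x.size)
        coefs = np.polyfit(x, y, degree)
        fits[r] = np.polyval(coefs, x)
    bias2 = np.mean((fits.mean(axis=0) - truth) ** 2)
    variance = np.mean(fits.var(axis=0))
    return bias2, variance

summary = {d: bias2_and_variance(d) for d in (1, 3, 7)}
for d, (b2, var) in summary.items():
    print(f"degree {d}: bias^2={b2:.4f}  variance={var:.4f}")
```

The straight line has the largest squared bias and the smallest variance; the degree-7 polynomial reverses the two.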
The Big Data Paradox in Clinical Practice (2022)
https://www.tandfonline.com/doi/full/10.1080/07357907.2022.2084621
Generalization Bias in Science (2022)
https://onlinelibrary.wiley.com/doi/full/10.1111/cogs.13188
Observations on Big Data, Precision Health, and Machine Learning (2022)
https://discourse.datamethods.org/t/observations-on-big-data-precision-health-and-machine-learning/6061
Unrepresentative big surveys significantly overestimated US vaccine uptake (2021)
https://www.nature.com/articles/s41586-021-04198-4
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GKBUUK
https://github.com/vcbradley/ddc-vaccine-US
Towards Principled Unskewing: Viewing 2020 Election Polls Through a Corrective Lens From 2016 (2020)
https://hdsr.mitpress.mit.edu/pub/cnxbwum6/release/5
Addressing selection bias and measurement error in COVID-19 case count data using auxiliary information (2020)
https://arxiv.org/abs/2005.10425
What surveys really say (2021)
https://www.nature.com/articles/d41586-021-03604-1
What is your question in a RCT? Causation or prediction?
https://www.slideshare.net/StephenSenn1/what-is-your-question-250659957
The paradox of big data (2020)
https://link.springer.com/article/10.1007/s42452-020-2862-5
Building Deep Statistical Thinking for Data Science (2020)
https://youtu.be/9WDtuCVZUp8
Meng’s “Big Data Paradox” and an extreme example
https://civilstat.com/2018/10/mengs-big-data-paradox-and-an-extreme-example/
Statistical paradises and paradoxes in big data (2018)
https://statistics.fas.harvard.edu/files/statistics-2/files/statistical_paradises_and_paradoxes.pdf
https://www.nbi.dk/~koskinen/Teaching/AdvancedMethodsInAppliedStatistics2019/StudentPresentations/WriteUp_TheaQuistgaard.pdf
Statistical paradises and paradoxes in Big Data (2018)
https://youtu.be/8YLdIDOMEZs
https://youtu.be/yz3jOIHLYhU
https://youtu.be/54T-guLVGdQ
https://www.nbi.dk/~koskinen/Teaching/AdvancedMethodsInAppliedStatistics2019/StudentPresentations/Presentation_TheaQuistgaard.pdf
More Data Means More Problems (a Paradox)
https://youtu.be/evRp7IYGyeg
Real-World Evidence for Regulatory Decision-Making: Guidance From Around the World (2022)
https://www.sciencedirect.com/science/article/pii/S0149291822000170
The challenges and opportunities in using real-world data to drive advances in healthcare in East Asia: expert panel recommendations (2022)
https://www.tandfonline.com/doi/full/10.1080/03007995.2022.2096354
Reproducibility of real-world evidence studies using clinical practice data to inform regulatory and coverage decisions (2022)
https://www.nature.com/articles/s41467-022-32310-3
Real-World Evidence - Where Are We Now?
https://www.fda.gov/media/158581/download
https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence
Current Status, Challenges, and Future Perspectives of Real-World Data and Real-World Evidence in Japan (2021)
https://link.springer.com/article/10.1007/s40801-021-00266-3
The Real-World Data Challenges Radar: A Review on the Challenges and Risks regarding the Use of Real-World Data (2021)
https://www.karger.com/Article/FullText/516178