發表文章

目前顯示的是 4月, 2024的文章

Priors in bayesians

Noninformative priors: 1-tailed p value equals post prob for normal dist • Uniform (flat): exploratory data analysis, no prior knowledge, simple models with few parameters, improper posteriors that do not integrate to 1, no regularization • Jeffreys: some regularization to prevent overfitting with many parameters Weakly informative priors: • Normal Priors with Large Variance: L2 regularization (ridge), sensitive to the choice of variance, improper posteriors if the variance is too large, Bayesian neural network • Cauchy: default neutral(0, 0.707), pessimistic cauchy(0, 1), optimistic cauchy(0, 0.5); heavy-tailed, outliers and extreme values, somw regularization to prevent overfitting • t dist: heavy-tailed • Unit information: mean 0 sd 2, may lead to improper posteriors • Laplace: double exponential, L1 regularization (lasso), heavy-tailed • Lognormal: positive numbers, heavy-tailed • Pareto: positive numbers, heavy-tailed • Half cauchy: default for var, positive numbers, residual var,...

Winner’s curse for gaussian dist

# Simulate winner’s curse of choosing only those with p of less than 0.05 of gaussian dist with t tests in clinical trials using python import numpy as np from scipy import stats import matplotlib.pyplot as plt # Setting the seed for reproducibility np.random.seed(42) # Parameters true_effect = 0.5  # True mean difference between treatment and placebo std_dev = 3  # Standard deviation for both groups num_trials = 1000  # Number of trials to simulate sample_size = 300  # Number of participants in each group per trial # Store results p_values = [] effect_sizes = [] for _ in range(num_trials):     # Generate data for treatment and placebo groups     treatment_group = np.random.normal(true_effect, std_dev, sample_size)     placebo_group = np.random.normal(0, std_dev, sample_size)     # Calculate the effect size (mean difference)     effect_size = np.mean(treatment_group) - np.mean(placebo_group)     effect_sizes...

Winner’s curse for binary events

# Simulate winner’s curse of choosing only those with p of less than 0.05 of binary dist with logistic reg in 1000 clinical trials using python import numpy as np import pandas as pd from scipy.stats import bernoulli, norm from sklearn.linear_model import LogisticRegression import matplotlib.pyplot as plt import statsmodels.api as sm # Set random seed for reproducibility np.random.seed(42) # Define simulation parameters n_trials = 1000  # Number of trials to simulate n_patients = 500  # Number of patients per trial true_odds_ratio = 1.5  # True odds ratio for the treatment effect def simulate_trial(n_patients, true_odds_ratio):     # Randomly assign patients to treatment (1) or control (0)     treatment = np.random.randint(2, size=n_patients)     # Calculate probability of success for each patient based on treatment and true odds ratio     p_success = 1 / (1 + np.exp(-np.log(true_odds_ratio) * treatment))        ...

大數據悖論

若資料有缺陷,則樣本數愈大,偏差愈大(「垃圾進,垃圾出」) 地圖不是領土(模型不是現實的東西)。-Alfred Korzybski 「所有的模型都是錯誤的,有一些是有用的」(George Box) 大數據的迷思 https://medium.com/math-and-statistics/%E5%A4%A7%E6%95%B8%E6%93%9A%E7%9A%84%E8%BF%B7%E6%80%9D-ebbba8df5517 1. 大數法則:只要 n(樣本數)夠大,我們就能正確估計母數(例如:平均值)。 2. 中央極限定理:無論母族群的分布是什麼,無數次隨機抽樣之獨立隨機數平均值的分布都是常態分布(只要樣本數大於 30),而平均值估計的錯誤跟 1/√n 成正比(跟母族群的數目 N 無關)。 以上定律成立的條件是機率抽樣,如果沒有這些條件,那麼估計的錯誤跟 √N 成正比,而且適用大母族群法則:n 愈接近 N,偏差愈大(雖然抽樣變異數愈小)。 偏差 = 資料缺陷率(資料的品質) x (1-f)/f x 資料的變異性(問題的困難度),f = n/N,f = 1 時偏差 = 0,f = 0 時偏差 = 無限大 資料缺陷率(Data defect index,ddi):不能由樣本推估,只能由歷史資料知道,例如:2016 年美國總統大選民調中,投川普的 ddi 是 -0.45%,投希拉蕊的 ddi 是 -0.021%。 https://github.com/kuriwaki/ddi 圖一:若 ddi = 0.05。f = 0.2 時,有效樣本數 = 100,亦即簡單隨機取樣只要 n = 100,那麼錯誤率就跟 n = N x 0.2(例如:隨機取樣 100 人跟非隨機取樣 478 萬台灣人的錯誤率一樣);f = 0.5 時,有效樣本數 = 400;f = 0.7 時,(簡單隨機取樣之)有效樣本數 = 1000。 假設你想要知道一碗湯有多鹹,只要湯有搖勻,那麼無論碗有多大,你只要嚐一小口就可以了。無論母族群有多大(N),只要你是機率取樣,那麼 n 只要夠大就好了(1000 人的民調數據就能推估所有的美國人)。可惜大部分的研究都是具有選擇性偏差的非機率取樣(方便樣本),只有複雜/分層調查研究、民調是機率取樣,但是即使後者也會受到不回應偏差的影響。 2016 年的美國總統大選,幾乎所有的...

貝氏分析

貝氏定理:P(A|B) = P(A,B) / P(B) = P(B|A) * P(A)/P(B), p(θ|D)= p(θ, D)/p(D) = p(θ) x p(D|θ)/p(D), theta: 母數, D:資料或變數 x, 條件機率 = 聯合機率/邊際機率 邊際機率:P(A), P(B), p(D), 不管其他的變數 條件機率:P(A|B), P(B|A), p(θ|D), p(D|θ) 聯合機率:P(A,B) = P(A|B) * P(B) = P(B|A) * P(A) 如果兩個事件A和B是獨立的,那麼條件機率等於邊際機率 似然函數 L(θ) ∝ ∏p(xi|θ), L(θ|x) ∝ p(x|θ):看見 x 時,θ 的似然率等比於 θ 的分布(例如平均 0、標準差 1 的標準常態 Z 分布)固定時 x 的密度、 機率 vs 似然率:單一宇宙 vs 多維宇宙 機率密度函數 f(x) or pdf:dnorm(x, mean, sd), p(x|θ), 積分等於 1 累積分布函數 cdf:pnorm(x, mean, sd) 分位數函數(pnorm 的 反函數):qnorm(0.975) 頻率學派 vs 貝氏學派:機率是重複事件發生的頻率 vs 主觀上某事件發生的不確定性、似然率 L(θ|x) 等比於 p(D|θ) vs p(D|θ)、樣本是隨機的變數 vs 已經發生的固定數、母數(參數)是未知的常數(沒有機率可言) vs 未知的隨機變數(有機率)、θ 固定 vs p(θ)、客觀(只依據本次試驗的結果)vs 主觀(依據先驗信念及本次試驗的結果而更新成後驗信念)、傳統 vs 新、簡單vs 複雜、人工計算可行 vs 不可行、最大似然率估計 vs 邊際似然率 p(D) = ∫p(θ’)p(D|θ’)dθ’ 離散機率的總和等於 1、連續機率的積分等於 1、似然率的總和/積分不等於 1。 邊際似然率是所有可能 θ 的積分而非所有似然率的平均值。 H0: θ=0, H1: θ ≠ 0 In fact, p(θ|D) is p(θ|D, Mi), p(θ) is p(θ|Mi), p(D|θ) is p(D|θ, Mi), p(D) is p(D|Mi) Under H1 or f1(x|θ1): p(H1|x) = p(H1) x p(x|H1)/p(x), p(x) i...

Convert t to SMD

# Standardized mean diff SMD (Cohen’s d):  SMD = 0.2 small, 0.5 medium, 0.8 large t_value <- 2.5  # example t-value n1 <- 30        # sample size in group 1 n2 <- 45        # sample size in group 2 SD1 <- 15       # standard deviation in group 1 SD2 <- 20       # standard deviation in group 2 # Compute pooled standard deviation SD_pooled <- sqrt(((n1 - 1) * SD1^2 + (n2 - 1) * SD2^2) / (n1 + n2 - 2)) # Calculate the standard error of the difference between means SE_diff <- sqrt(SD1^2 / n1 + SD2^2 / n2) # Convert t-value to SMD SMD <- t_value * SE_diff / SD_pooled # Print the SMD SMD

One-tailed p-value

onetailed_p <- function(estimate, lower, upper) { se <- (log(upper) - log(lower)) / 3.92 z <- log(estimate)/se p <- pnorm(z) return(p)   } "One-tailed p-value (less than):" one_tailed_p_value <- function(mean, lower, upper, alternative) { se <- (upper - lower) / 3.92   if (alternative == "greater") {     z <- (upper - mean)/se     p_value <- 1 - pnorm(z)   } else if (alternative == "less") {     z <- (lower - mean) / se     p_value <- pnorm(z)   } else {     stop("Invalid alternative hypothesis. Use 'greater' or 'less'.")   }   return(p_value) } # Example data lower <- 2.5 upper <- 4.8 mean <- 3.6 # Calculate p-values for both alternatives p_value_greater <- one_tailed_p_value(lower, upper, mean, "greater") p_value_less <- one_tailed_p_value(lower, upper, mean, "less") # Print the results cat("One-tailed p-value (greater than):", p_value_greater, "\n...