Statistics


Showing posts from April 2024

Priors in Bayesian analysis

April 30, 2024

Noninformative priors (with a normal likelihood and a flat prior, the one-tailed p-value equals the posterior probability that the effect lies on the other side of the null):
• Uniform (flat): exploratory data analysis, no prior knowledge, simple models with few parameters; can give improper posteriors that do not integrate to 1; no regularization
• Jeffreys: some regularization to prevent overfitting with many parameters

Weakly informative priors:
• Normal priors with large variance: L2 regularization (ridge); sensitive to the choice of variance; approaches a flat prior (and its risk of improper posteriors) if the variance is too large; used in Bayesian neural networks
• Cauchy: default neutral cauchy(0, 0.707), pessimistic cauchy(0, 1), optimistic cauchy(0, 0.5); heavy-tailed, tolerates outliers and extreme values, some regularization to prevent overfitting
• t distribution: heavy-tailed
• Unit information: mean 0, SD 2; may lead to improper posteriors
• Laplace: double exponential, L1 regularization (lasso), heavy-tailed
• Lognormal: positive numbers, heavy-tailed
• Pareto: positive numbers, heavy-tailed
• Half-Cauchy: default for variance parameters, positive numbers, residual variance,...
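The link between a flat prior and the one-tailed p-value can be checked numerically. The sketch below assumes a normal likelihood with a known standard error; the function names and the example numbers are hypothetical and only illustrate the identity.

```python
from statistics import NormalDist


def one_tailed_p(estimate, se):
    """One-tailed p-value for H1: effect > 0, with H0: effect = 0."""
    return 1.0 - NormalDist().cdf(estimate / se)


def posterior_prob_at_or_below_zero(estimate, se):
    """Flat prior + normal likelihood: the posterior is Normal(estimate, se)."""
    return NormalDist(mu=estimate, sigma=se).cdf(0.0)


# Hypothetical estimate and standard error, for illustration only
estimate, se = 0.4, 0.25
print(one_tailed_p(estimate, se))
print(posterior_prob_at_or_below_zero(estimate, se))
```

Both calls print the same number, because under a flat prior the posterior is centered on the estimate with the same spread as the sampling distribution.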


Winner’s curse for the Gaussian distribution

April 24, 2024

# Simulate the winner’s curse in clinical trials with Python:
# keep only the trials with p < 0.05 (Gaussian outcomes, t tests)
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Setting the seed for reproducibility
np.random.seed(42)

# Parameters
true_effect = 0.5   # True mean difference between treatment and placebo
std_dev = 3         # Standard deviation for both groups
num_trials = 1000   # Number of trials to simulate
sample_size = 300   # Number of participants in each group per trial

# Store results
p_values = []
effect_sizes = []

for _ in range(num_trials):
    # Generate data for treatment and placebo groups
    treatment_group = np.random.normal(true_effect, std_dev, sample_size)
    placebo_group = np.random.normal(0, std_dev, sample_size)

    # Calculate the effect size (mean difference)
    effect_size = np.mean(treatment_group) - np.mean(placebo_group)
    effect_sizes...
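Since the excerpt above is cut off, here is a compact, standard-library-only sketch of the same idea. It is not the post's code: as a stated shortcut it draws each trial's mean difference directly from its sampling distribution and uses a z test with known standard deviation instead of a t test. The function name is hypothetical; the parameter values are taken from the excerpt.

```python
import random
from statistics import NormalDist, mean


def simulate_winners_curse(true_effect=0.5, std_dev=3.0, sample_size=300,
                           num_trials=1000, alpha=0.05, seed=42):
    """Return (mean of all estimates, mean of significant estimates, power)."""
    rng = random.Random(seed)
    # Standard error of a difference between two group means
    se = std_dev * (2.0 / sample_size) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Shortcut (assumption): sample the estimated difference directly
    estimates = [rng.gauss(true_effect, se) for _ in range(num_trials)]
    # Keep only the "winners": two-sided p < alpha
    significant = [d for d in estimates if abs(d) / se > z_crit]
    return mean(estimates), mean(significant), len(significant) / num_trials


all_mean, sig_mean, power = simulate_winners_curse()
print(f"mean estimate, all trials:         {all_mean:.3f}")
print(f"mean estimate, significant trials: {sig_mean:.3f}")
print(f"share of significant trials:       {power:.2f}")
```

The average over all trials stays close to the true effect, while the average over the significant trials is inflated; that gap is the winner's curse.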


Winner’s curse for binary events

April 24, 2024

# Simulate the winner’s curse in 1000 clinical trials with Python:
# keep only the trials with p < 0.05 (binary outcomes, logistic regression)
import numpy as np
import pandas as pd
from scipy.stats import bernoulli, norm
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Set random seed for reproducibility
np.random.seed(42)

# Define simulation parameters
n_trials = 1000        # Number of trials to simulate
n_patients = 500       # Number of patients per trial
true_odds_ratio = 1.5  # True odds ratio for the treatment effect

def simulate_trial(n_patients, true_odds_ratio):
    # Randomly assign patients to treatment (1) or control (0)
    treatment = np.random.randint(2, size=n_patients)

    # Calculate probability of success for each patient based on treatment and true odds ratio
    p_success = 1 / (1 + np.exp(-np.log(true_odds_ratio) * treatment))
    ...
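The excerpt stops before the model is fitted, so the following is only a sketch of how the rest might look, using the standard library alone. As a stated substitution, it does not fit a logistic regression: with a single binary covariate the fitted log odds ratio equals the log odds ratio of the 2x2 table, so a Wald test on that table is used instead. The control-arm success probability of 0.5 follows from the intercept-free formula in the excerpt; the function name is hypothetical.

```python
import math
import random


def simulate_or_winners_curse(true_odds_ratio=1.5, n_patients=500,
                              n_trials=1000, seed=42):
    """Return geometric-mean odds ratios: (all trials, significant trials)."""
    rng = random.Random(seed)
    p_control = 0.5  # log-odds intercept of 0, as in the excerpt's formula
    p_treated = true_odds_ratio / (1.0 + true_odds_ratio)
    all_log_or, sig_log_or = [], []
    for _trial in range(n_trials):
        cells = [[0, 0], [0, 0]]  # cells[arm][outcome]
        for _patient in range(n_patients):
            arm = rng.randint(0, 1)  # 1 = treatment, 0 = control
            p = p_treated if arm else p_control
            outcome = 1 if rng.random() < p else 0
            cells[arm][outcome] += 1
        (c0, c1), (t0, t1) = cells
        if min(c0, c1, t0, t1) == 0:
            continue  # odds ratio undefined for an empty cell
        log_or = math.log((t1 * c0) / (t0 * c1))
        se = math.sqrt(1 / c0 + 1 / c1 + 1 / t0 + 1 / t1)
        all_log_or.append(log_or)
        if abs(log_or / se) > 1.96:  # two-sided Wald test, p < 0.05
            sig_log_or.append(log_or)

    def geometric_mean(log_values):
        return math.exp(sum(log_values) / len(log_values))

    return geometric_mean(all_log_or), geometric_mean(sig_log_or)


or_all, or_sig = simulate_or_winners_curse()
print(f"odds ratio, all trials:         {or_all:.2f}")
print(f"odds ratio, significant trials: {or_sig:.2f}")
```

Averaging on the log scale keeps the comparison with the true odds ratio fair; the significant-only average overshoots it.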


The big data paradox

April 23, 2024

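If the data are flawed, a larger sample makes the answer more misleading, not less (“garbage in, garbage out”).

The map is not the territory (a model is not the reality it describes). - Alfred Korzybski
“All models are wrong, but some are useful.” (George Box)

The myth of big data: https://medium.com/math-and-statistics/%E5%A4%A7%E6%95%B8%E6%93%9A%E7%9A%84%E8%BF%B7%E6%80%9D-ebbba8df5517

1. Law of large numbers: as long as n (the sample size) is large enough, we can estimate the population parameter (for example, the mean) accurately.
2. Central limit theorem: whatever the distribution of the population, the means of repeated independent random samples are approximately normally distributed (a common rule of thumb is a sample size above 30), and the error of the estimated mean is proportional to 1/√n (independent of the population size N).

These laws hold only under probability sampling. Without it, the error relative to that of a simple random sample grows in proportion to √N, and the law of large populations applies: as n approaches N the sampling variance shrinks but the bias does not, so the estimate becomes ever more confidently wrong.

Error = data defect index (data quality) × √((1 − f)/f) × variability of the data (difficulty of the problem), where f = n/N. When f = 1 the error is 0; as f approaches 0 the error grows without bound.

Data defect index (ddi): it cannot be estimated from the sample itself and can only be learned from historical data. For example, in the polls for the 2016 US presidential election the ddi was -0.45% for Trump voters and -0.021% for Clinton voters. https://github.com/kuriwaki/ddi

Figure 1: suppose ddi = 0.05. At f = 0.2 the effective sample size is 100; that is, a simple random sample of only n = 100 has the same error as a non-random sample of n = N × 0.2 (for example, a random sample of 100 people is as accurate as a non-random sample of 4.78 million Taiwanese). At f = 0.5 the effective sample size is 400; at f = 0.7 the effective sample size (in simple-random-sample terms) is about 1000.

Suppose you want to know how salty a bowl of soup is. As long as the soup is well stirred, one small sip is enough, however large the bowl. In the same way, however large the population (N), a probability sample only needs a large enough n (a poll of 1000 people can describe all Americans). Unfortunately, most studies use non-probability (convenience) samples with selection bias. Only complex/stratified surveys and opinion polls use probability sampling, and even those are affected by nonresponse bias.

In the 2016 US presidential election, almost all...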
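The Figure 1 numbers can be reproduced with a few lines of Python. This sketch treats the ddi as a correlation between being in the sample and the outcome, and assumes the effective sample size is f/(1 − f) divided by ddi squared; that formula is an assumption chosen because it matches the quoted values, and the function name is hypothetical.

```python
def effective_sample_size(ddi, f):
    """Size of the simple random sample with the same error as a
    non-probability sample covering a fraction f of the population.

    Assumed formula: n_eff = f / (1 - f) / ddi**2
    """
    if not 0 < f < 1:
        raise ValueError("f must lie strictly between 0 and 1")
    return f / (1.0 - f) / ddi ** 2


for f in (0.2, 0.5, 0.7):
    print(f, round(effective_sample_size(0.05, f)))
```

Under this assumption the population size N never enters: only the sampled fraction f and the data quality matter, which is why millions of non-random records can be worth only a few hundred random ones.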


Bayesian analysis

April 20, 2024


Convert t to SMD

April 19, 2024

# Standardized mean difference, SMD (Cohen’s d): 0.2 small, 0.5 medium, 0.8 large
t_value <- 2.5  # example t-value
n1 <- 30        # sample size in group 1
n2 <- 45        # sample size in group 2
SD1 <- 15       # standard deviation in group 1
SD2 <- 20       # standard deviation in group 2

# Compute pooled standard deviation
SD_pooled <- sqrt(((n1 - 1) * SD1^2 + (n2 - 1) * SD2^2) / (n1 + n2 - 2))

# Calculate the standard error of the difference between means
SE_diff <- sqrt(SD1^2 / n1 + SD2^2 / n2)

# Convert t-value to SMD
SMD <- t_value * SE_diff / SD_pooled

# Print the SMD
SMD
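For comparison, the same conversion can be sketched in Python. Two variants are shown, both with hypothetical function names: the first mirrors the R code above, whose standard error is the unpooled one used by Welch's t test; the second is the common shortcut for a pooled-variance t test, which needs only the group sizes. Which one applies depends on how the reported t-value was computed.

```python
import math


def smd_from_t_unpooled_se(t_value, n1, n2, sd1, sd2):
    """Mirror of the R code: recover the mean difference from t and the
    unpooled standard error, then divide by the pooled SD."""
    sd_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return t_value * se_diff / sd_pooled


def smd_from_t_pooled(t_value, n1, n2):
    """Shortcut when t comes from a pooled-variance (Student) t test."""
    return t_value * math.sqrt(1 / n1 + 1 / n2)


print(smd_from_t_unpooled_se(2.5, 30, 45, 15, 20))
print(smd_from_t_pooled(2.5, 30, 45))
```

The two results differ slightly because the group standard deviations and sizes are unequal; with equal standard deviations they coincide.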


One-tailed p-value

April 15, 2024

# One-tailed p-value (less than) for a ratio estimate, from its 95% CI
onetailed_p <- function(estimate, lower, upper) {
  # Work on the log scale; the 95% CI spans 2 * 1.96 = 3.92 standard errors
  se <- (log(upper) - log(lower)) / 3.92
  z <- log(estimate) / se
  p <- pnorm(z)
  return(p)
}

# One-tailed p-value for a mean (or mean difference), from its 95% CI
one_tailed_p_value <- function(mean, lower, upper, alternative) {
  se <- (upper - lower) / 3.92
  # z statistic against a null value of 0 (assumed)
  z <- mean / se
  if (alternative == "greater") {
    p_value <- 1 - pnorm(z)
  } else if (alternative == "less") {
    p_value <- pnorm(z)
  } else {
    stop("Invalid alternative hypothesis. Use 'greater' or 'less'.")
  }
  return(p_value)
}

# Example data
lower <- 2.5
upper <- 4.8
mean <- 3.6

# Calculate p-values for both alternatives (arguments in the order the function expects)
p_value_greater <- one_tailed_p_value(mean, lower, upper, "greater")
p_value_less <- one_tailed_p_value(mean, lower, upper, "less")

# Print the results
cat("One-tailed p-value (greater than):", p_value_greater, "\n...
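A Python sketch of the same calculation follows, using only the standard library. It assumes a 95% confidence interval that is symmetric on the analysis scale and a null value of 0 (or a ratio of 1 when `log_scale=True`); the function name and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def one_tailed_p_from_ci(estimate, lower, upper, alternative="greater",
                         log_scale=False):
    """One-tailed p-value recovered from an estimate and its 95% CI.

    log_scale=True is for ratios (OR, RR, HR), tested against a ratio of 1;
    otherwise the estimate is tested against 0.
    """
    if log_scale:
        estimate, lower, upper = (math.log(v) for v in (estimate, lower, upper))
    se = (upper - lower) / 3.92  # 95% CI spans 2 * 1.96 standard errors
    z = estimate / se
    if alternative == "greater":
        return NormalDist().cdf(-z)  # P(Z >= z)
    if alternative == "less":
        return NormalDist().cdf(z)   # P(Z <= z)
    raise ValueError("alternative must be 'greater' or 'less'")


print(one_tailed_p_from_ci(3.6, 2.5, 4.8, "greater"))
print(one_tailed_p_from_ci(3.6, 2.5, 4.8, "less"))
print(one_tailed_p_from_ci(0.8, 0.64, 1.0, "less", log_scale=True))
```

The last call uses a made-up ratio whose upper confidence limit sits exactly on 1, so its one-tailed p-value lands at about 0.025, as expected for a 95% interval that just touches the null.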
