Statistics

Posts

Showing posts from August 2020

Stats terms

August 29, 2020

AIC: for prediction
Assumptions
Asymptotic: as n → ∞
Bayesian
Bias
BIC: consistent, for explanation
BLUE (best linear unbiased estimator)
Calibration: rms
CLT: as n → ∞, the distribution of the sample mean of iid data (with finite variance) approaches a Gaussian
Conditional
Confidence interval
Consistency: the estimator converges in probability to the true value as n → ∞
Crossvalidation: caret, rms
Data reduction: pca
Descriptive
Distribution: exponential, gamma, lognormal, normal
Efficiency: precision
Estimator
Finite sample
Gauss-Markov theorem
GEE
GLM
Hyperparameter: ML tuning
Inferential: unbiased point estimate, correct CI
Law of large numbers: as n → ∞, x̄ → μ
Linear
Marginal
Multivariate (≥ 2 Ys): canonical correlation, factor analysis, mvreg, pca
Nonlinear: kernel, krls, mars (earth), splines, npreg
Nonparametric
OLS: if ε is Gaussian, then it is the MLE
Parameter: a fixed unknown population value, e.g. mean, var
Population
Post-hoc
Prediction interval
Random variable
Regression
Sample: a random realization of infinite ways of sampling an infini...


Mixed model vs GEE

August 29, 2020

Both mixed (panel, longitudinal, multilevel, hierarchical linear) model and GEE:
  Account for clusters or levels
  No misspecification of the model: no unmeasured confounders, no outliers
Mixed model: conditional model, assumes MAR
GEE with robust se: population average, marginal model, unbiased with misspecified icc if large sample of units, assumes MCAR
Rx for multiple comparison
R: lme4, merlin, nlme
Stata: mixed, gllamm
Fixed effects
  Assumptions of residuals: normality, homoscedasticity, linearity, additivity, no misspecification, MAR, no me, no uc, icc is correct
Random effects
  Robustness: fe is unbiased and ci is correct even if nonnormal/heteroscedastic/missingness, but ci is not correct if icc is not correct.
ML: for large samples or LR comparisons among nested models with different fixed effects; variance components (re) are biased downward.
REML: for small samples; less biased variance components, but likelihoods are not comparable across models with different fe.


Outliers

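August 29, 2020

Outliers (extreme values, anomalous values)
Definition: for a normal distribution, a value more than three standard deviations from the mean (99.7% of values lie within that range); for a non-normal distribution, a value in a box plot below the first quartile − 1.5 × IQR or above the third quartile + 1.5 × IQR. In industry, "Six Sigma" quality control keeps the defect rate below 3.4 per million.
Causes: systematic bias, technical error, data entry error, genuine extreme values, other
Handling: deletion (technical error, data entry error), robust statistics (genuine extreme values, other)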
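
The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names, the sample data, and the choice of the "inclusive" quartile method are assumptions made here, not something the post specifies, and different quartile conventions can shift the fences slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (box-plot rule)."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def sd_outliers(values, n_sd=3.0):
    """Return values more than n_sd sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > n_sd * sd]

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]
print("IQR rule:", iqr_outliers(data))
print("3 SD rule:", sd_outliers(data))
```

Because a single extreme value inflates the standard deviation, the 3 SD rule can miss outliers in small samples, which is one reason the box-plot rule is preferred for non-normal data.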


Mixed model

August 29, 2020

Fixed effects: dummy vars, chosen by the PI, each level is of interest, levels will be reused, not extrapolating to other levels
Random effects: want to make inferences beyond the sampled levels
Model selection: estat ic, cAIC4, glmmLasso (std vars), gglasso (group lasso), AIC, DIC, MuMIn::dredge
Longitudinal data: i or t is level 1 (time), j is level 2 (subject)
Time can be continuous (growth curve model) or discrete, can be imbalanced or different among subjects, but only continuous time can have a random slope and it should be consistent in fe and re.
Cross-level interaction should always include a random slope for the level 1 entity.
Fixed effect: within subject (micro) mean (population average in a single time period), higher-level entities treated as a dummy var in longitudinal or panel data, ignoring between subject variations: yij = b0 + b1xij + eij, estimated by BLUE (least squares). The only source of the variability is residual variance. It removes time-invariant heterogeneity. It is consis...


Stats symbols

August 29, 2020

Symbols
y = Xβ + u, ŷ = Xβ̂, X is a design matrix
β̂ = (XᵀX)⁻¹Xᵀy
x̄ ȳ
→ ← ⇄ ↑ ↓ ≥ ≤ ≠ ≈ ± ∫ ∈ ∪ ∩ Λ ∝ (proportional to), ~ (tilde), √ ♥ 😀 😁 😂 😱【】℃ ∞
α (type 1 error), β (coef, type 2 error), γ/Γ, δ/∆ (difference), ε (error), ζ (zeta), η/H (eta), θ (parameter), ι (iota), λ (hyperparameter), ℒ (likelihood), μ (population mean), ξ (xi), π (pi), ∏ (product), ρ (rho, correlation), σ (sigma: standard deviation), σ²: variance, Σ: summation, (sem), τ (tau), φ (phi), χ² (chi), ψ (psi), ω (omega), Ω (omega)
E(x): expected value
N: normal distribution
Matrix: A', Aᵀ: transpose, A⁻¹: inverse, I: identity
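
As a worked instance of the estimator β̂ = (XᵀX)⁻¹Xᵀy, here is a minimal Python sketch for the simplest case, a design matrix with an intercept column and one predictor, where the 2 × 2 inverse can be written out by hand. The function name and the toy data are assumptions made for illustration; real analyses would use a statistics package rather than explicit inversion.

```python
def ols_simple(x, y):
    """Solve beta_hat = (X'X)^-1 X'y for X = [1, x] (intercept plus one slope)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]] and X'y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        # X'X cannot be inverted when x has no variability.
        raise ValueError("X'X is singular: x has no variance")
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

# Hypothetical data lying exactly on y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = ols_simple(x, y)
print("intercept:", b0, "slope:", b1)
```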


ANOVA

August 29, 2020

Balanced data: types 1–3 are the same
Unbalanced data
  type 1 SS: anova, sequential, not useful
  type 2 SS (if no interaction): car::Anova
  type 3 SS (if interaction): car::Anova(lm(..., contrasts = list(A = contr.sum, B = contr.sum)), type = 3)
  anova.lme: anova(m, type = "marginal")
contrasts(A) <- "contr.treatment"
  contr.treatment: dummy coding, contrast with the omitted (reference) level
  contr.sum: contrast with the grand mean (average of all levels)
  contr.helmert: contrast with the average of preceding levels
  reverse helmert: manually contrast with the average of subsequent levels
  contr.poly: orthogonal polynomial contrasts
  unequally spaced levels: contr.poly(4, c(15, 30, 60, 90))
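
To make the coding schemes concrete, the sketch below builds the contrast matrices by hand in Python. These helper functions are hypothetical stand-ins written for this note; they are intended to reproduce the layout that R's contr.treatment, contr.sum, and contr.helmert return (rows are factor levels, columns are contrasts), not R's own implementation.

```python
def contr_treatment(n):
    """Dummy coding: level 1 is the reference (all zeros)."""
    return [[1 if i == j + 1 else 0 for j in range(n - 1)] for i in range(n)]

def contr_sum(n):
    """Deviation coding: the last level is coded -1 in every column."""
    return [
        [-1] * (n - 1) if i == n - 1 else [1 if i == j else 0 for j in range(n - 1)]
        for i in range(n)
    ]

def contr_helmert(n):
    """Helmert coding: column j compares level j+2 with the levels before it."""
    return [
        [-1 if i <= j else (j + 1 if i == j + 1 else 0) for j in range(n - 1)]
        for i in range(n)
    ]

for name, fn in [("treatment", contr_treatment), ("sum", contr_sum), ("helmert", contr_helmert)]:
    print(name)
    for row in fn(3):
        print("  ", row)
```

The columns of the sum and Helmert matrices each add up to zero, which is the property that makes type 3 tests interpretable when interactions are present; treatment coding does not have it.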


glm: assumptions

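August 29, 2020

Assumptions of linear regression (Y = a + bX + e)
The effects of the Xs are additive (no interactions)
The Xs are fixed (nonrandom)
The Xs have no measurement error
The Xs have no perfect collinearity
Y is continuous (no upper/lower bounds, no truncation)
The residuals are independent and identically distributed (iid), i.e. the residuals are uncorrelated with the Xs and have no autocorrelation (serial correlation, as in repeated measures or longitudinal data)
The mean of e (the residuals) is zero
The residuals are normally distributed
X has variability
The model is correct (the relationship between the dependent variable Y and the independent variable X is linear) and prespecified, not post hoc (after model selection or variable selection)
No missing data
No extreme values
No unmeasured X confounders

Gauss-Markov theorem (https://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem): mean of e is zero, e is homoscedastic, errors are uncorrelated. If e is normally distributed, then OLS equals MLE.
Model
  Pre-specified: no model or variable selection
  Linearity (Rx: gamsel, krls, splines, fracpoly, svm)
  No misspecification (Rx: krls)
  No missing data (Rx: imputation)
  No outliers (Rx: robust methods)
Variables
  Not bounded: Rx by logit
  Not censored: Rx by survival analysis, tobit models
  Not truncated
Predictors
  Fixed: nonrandom
  Additive: no interactions, Rx by krls
  No measurement errors: Rx by eivreg, SEM
  No unmeasured confounders (Rx: sensitivity analysis, E values)
  No collinearity: Rx by pcareg, pls, ridge
  Having variance
Resid...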


Pubmed

August 29, 2020

(((((((((((((((("Kidney international"[Journal]) OR "American journal of kidney diseases : the official journal of the National Kidney Foundation"[Journal]) OR "Clinical kidney journal"[Journal]) OR "Advances in chronic kidney disease"[Journal]) OR ("Canadian journal of kidney health and disease"[Journal])) OR ("Nephrology, dialysis, transplantation : official publication of the European Dialysis and Transplant Association - European Renal Association"[Journal])) OR "Journal of the American Society of Nephrology : JASN"[Journal]) OR "Clinical journal of the American Society of Nephrology : CJASN"[Journal]) OR ("Current opinion in nephrology and hypertension"[Journal])) OR "Seminars in nephrology"[Journal]) OR "BMC nephrology"[Journal]) OR "Nature reviews. Nephrology"[Journal]) OR "Lancet (London, England)"[Journal]) OR "The New England journal of medicine"[Journal]) OR "JAMA"[Journal]) OR "Peritoneal dialysis international : journal of the International Society for Peritoneal Dia...


GEE

August 29, 2020

Consistent
Marginal mean, robust (sandwich, heteroscedasticity-consistent) SEs, semiparametric (only specify mean and var)
Estimation: IRLS, not likelihood-based
Population average
Missing completely at random, wgeesel for MAR
GOF: qic
Assumptions: linearity
Consistent even if the working (intraclass) correlation is wrong, provided the mean model is not misspecified
Clusters: panel, only one
Trajectories: average
Assumptions:
1. The responses Y1, Y2, …, Yn are correlated or clustered
2. There is a linear relationship between the covariates and a transformation of the response, described by the link function g
3. Within-subject covariance has some structure ("working covariance"):
   independence (observations over time are independent)
   exchangeable (all observations over time have the same correlation)
   AR(1) (correlation decreases as a power of how many timepoints apart two observations are)
   unstructured (correlation between all timepoints may be diffe...
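
The working correlation structures named in assumption 3 can be written out explicitly. The Python sketch below is an illustration under assumptions: the function name, its arguments, and the value of rho are invented for this note, and the unstructured case is left out because it needs one estimated correlation per pair of timepoints.

```python
def working_correlation(structure, n_times, rho=0.0):
    """Build an n_times x n_times working correlation matrix."""
    if structure == "independence":
        # No correlation between timepoints.
        return [[1.0 if i == j else 0.0 for j in range(n_times)] for i in range(n_times)]
    if structure == "exchangeable":
        # Every pair of timepoints shares the same correlation rho.
        return [[1.0 if i == j else rho for j in range(n_times)] for i in range(n_times)]
    if structure == "ar1":
        # Correlation decays as rho raised to the distance between timepoints.
        return [[rho ** abs(i - j) for j in range(n_times)] for i in range(n_times)]
    raise ValueError("unsupported structure: " + structure)

for name in ("independence", "exchangeable", "ar1"):
    print(name)
    for row in working_correlation(name, 3, rho=0.5):
        print("  ", row)
```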


Joint model

August 29, 2020

Joint lmm and survival model (account for measurement errors and unbalanced data): merlin, JMbayes
Stata: stjm


Mixed models vs repeated measures ANOVA

August 29, 2020

Mixed models:
  Estimate the average and each individual (random effects)
  Random effects are unobserved latent factors (similar to SEM)
  MAR
  Compound symmetry, AR1, unstructured
  Time can be discrete or continuous (growth model)
RMANOVA:
  Estimate the average
  No missing data
  Balanced
  Sphericity (compound symmetry, equally spaced times, same times)
  Time can only be discrete


nlme

August 29, 2020

nlme ANOVA = lme with no intercepts
Var-cov matrix (positive-definite) for random intercept and slope (G matrix): random = pdSymm (default, unstructured, MANOVA), pdDiag (diagonal, independent, variance components), pdBlocked (block-diagonal), pdCompSymm (compound symmetry, all cov are equal, exchangeable, sphericity, ANOVA), pdIdent (identical)
sigma matrix for 1st-level residuals: correlation = corAR1, corARMA, corCAR1, corCompSymm, corSymm (unstructured) (form = ~ 1 | Subject)
weights = varIdent, varFixed (form = ~ 1 | age)
gls is lme without the argument "random"
Likelihood ratio comparisons: for comparing random effects, not meaningful for objects fit using REML and with different fixed effects (compared by anova(model), summary(model)).
Defaults: REML, random = list(id = pdSymm), correlation = NULL


Distributions

August 29, 2020

Bernoulli distribution: Binary regression. One trial (coin toss) with success probability p, failure 1 − p.
Binomial distribution: Logistic regression. n Bernoulli trials (n coin tosses). An aggregation of the individual data.
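
The relationship between the two distributions can be checked numerically: a binomial with n = 1 is a Bernoulli, and a binomial with larger n counts successes across n independent Bernoulli trials. The function names and the example probabilities in this Python sketch are assumptions chosen for illustration.

```python
import math

def bernoulli_pmf(k, p):
    """P(K = k) for a single trial: k is 1 (success) or 0 (failure)."""
    if k not in (0, 1):
        raise ValueError("k must be 0 or 1")
    return p if k == 1 else 1.0 - p

def binomial_pmf(k, n, p):
    """P(K = k) successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

p = 0.3
# With n = 1 the binomial reduces to the Bernoulli.
print(binomial_pmf(1, 1, p), bernoulli_pmf(1, p))
# Probability of exactly 2 heads in 3 fair coin tosses.
print(binomial_pmf(2, 3, 0.5))
```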


nlme vs lme4

August 29, 2020

nlme
  residual cov: correlation = NULL (default), corCompSymm (sphericity), corAR1, corSymm (unstructured), suitable for repeated measures
  rf (intercept, slope) cov: positive definite pdDiag (var components), pdCompSymm, pdSymm (unstructured)
  Nested rf
  Splines: ns, lmeSplines, mgcv::gamm
  P values: yes
  glmm: no
  gof: lmmfit, RLRsim
  default: reml
  fs: MuMIn::dredge
  summary (conditional t tests), anova (F tests)
  heteroscedasticity (weights = varIdent(form = ~ 1 | group))
lme4
  residual cov: cannot be specified, suitable for clustered data
  rf cov: default unstructured (1 + x1 + x2 | grp), var components ((1 | grp) + (0 + x | grp))
  Crossed rf
  Splines: gamm4
  P values: No, lmerTest, pbkrtest
  glmm: yes
  gof: performance, r2glmm, RLRsim
  default: reml
  fs: buildmer


R

August 29, 2020

stats::anova (type I, sequential, not good)
car::Anova (type II if no interaction, type III if interaction)


Mixed models for longitudinal data

August 29, 2020

Time is level-1. https://link.springer.com/article/10.1007/s10869-017-9491-z Mixed-effects models can be used with nested data, even if no group-level effects are expected or evident. Failing to use mixed effects to model nested data can increase the risk of type I error (for group-level effects) and type II error (for lower-level effects). Centering only applies to level-1 vars. Group-mean centering of a level-1 variable fundamentally changes the interpretation of the level-2 parameter estimate for the analogue of the same variable. In group-mean centered models, level-2 parameter estimates represent overall group effects; in raw or grand-mean models, level-2 parameter estimates represent differences in slopes between individual-level and group-level relationships. When level-1 variables are group-mean centered, one can mistakenly conclude the level-2 group-mean analogue of the lower-level variable represents a test of an emergent effect when it actually represent...


Multiple comparison or post-hoc tests

August 29, 2020

FWER: normality, homoscedasticity
Balanced data
  Bonferroni and Dunn-Šidák (nonp) after t-tests: input P values, pairwise, most conservative, least powerful
  Tukey HSD: pairwise, good
  Holm-Šidák: stepdown, input P values, not good
  Hochberg (NK: least conservative, not good)
  Dunnett: compared to a control
Unbalanced data
  Tukey-Kramer
  Games-Howell tests: nonp, heteroscedasticity
  Scheffé
FDR: for high-dimensional data
  BH (fdr): stepup, for independent or positive dependence tests, most commonly used
  BY: stepup, for dependent tests, very conservative
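
Three of the adjustments that take P values as input can be sketched directly. The Python function below is a hand-written illustration, not a library API: its name and the example P values are assumptions, and the step-down method shown is the plain Holm (Bonferroni-based) procedure rather than the Holm-Šidák variant listed above.

```python
def p_adjust(pvalues, method):
    """Return adjusted P values for 'bonferroni', 'holm', or 'BH'."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest P first
    adjusted = [0.0] * m
    if method == "bonferroni":
        # Single step: multiply every P value by the number of tests.
        return [min(1.0, p * m) for p in pvalues]
    if method == "holm":
        # Step-down: multipliers m, m-1, ..., 1, kept monotone with a running max.
        running = 0.0
        for rank, i in enumerate(order):
            running = max(running, (m - rank) * pvalues[i])
            adjusted[i] = min(1.0, running)
        return adjusted
    if method == "BH":
        # Step-up: start from the largest P value, keep a running min of p * m / rank.
        running = 1.0
        for rank in range(m - 1, -1, -1):
            i = order[rank]
            running = min(running, pvalues[i] * m / (rank + 1))
            adjusted[i] = running
        return adjusted
    raise ValueError("unknown method: " + method)

# Hypothetical P values from four pairwise tests.
pvals = [0.01, 0.04, 0.03, 0.005]
for method in ("bonferroni", "holm", "BH"):
    print(method, [round(p, 4) for p in p_adjust(pvals, method)])
```

On the same input the adjusted values shrink from Bonferroni to Holm to BH, which mirrors the ordering from most conservative (FWER, single step) to least conservative (FDR) in the list above.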


Stata Code for Chapter One: evaluate_disorganized.do

August 29, 2020

trueparam_disorganized, delta(0.5) canna01(0) predorig(3) h(1) numpat(1000) reseed(24)
display "canna01: "$canna01
display "Prediction origin: t="$PredOrig
display "Prediction horizon: h="$h
display "delta= " $delta
display "Median of relative biases: " r(MedianRelBias)
display "Minimum of relative biases: " r(MinRelBias)
display "Maximum of relative biases: " r(MaxRelBias)
display "Median of correlations between predicted and true transformed benefits: " r(MedianCorr)
display "Minimum of correlations: " r(MinCorr)
display "Maximum of correlations: " r(MaxCorr)

trueparam_disorganized.ado (Stata ado program that performs the Monte Carlo simulations. This program is called by evaluate_disorganized.do)

program trueparam_disorganized, rclass
version 15.1
syntax, delta(numlist max=1 >=0) canna01(numlist integer >=0 <=1) predorig(integer) h(integer) ///
[ numpat(integer 1000) resee...
