REMI
REMI: Regression with marginal information with applications in genome-wide association studies
Jin Liu
2018-05-11
Introduction
This vignette provides an introduction to the REMI
package. R package REMI
implements REMI, an efficient statistical approach to conduct RE]gression with Marginal Information with an application in genome-wide association studies (GWAS).
library(devtools)
install_github("gordonliu810822/REMI")
The package can be loaded with the command:
library("REMI") #> Loading required package: Matrix #> Loading required package: foreach
This vignette is organized as follows. Section 1 shows how we use the real genotype data to generate phenotype data. Section 2 shows how we fit two versions of REMI (REMI-C and REMI-R). If you have any problems running the demo, please contact Jin Liu at jin.liu@duke.nus.edu.sg. Before running this demo, please download all the demo data sets at [googledrive folder] (https://drive.google.com/drive/folders/1ic9Q7Onq0iDNSkex-S4V_YRv-_aL4UxL?usp=sharing}).
1. Data
In this vignette, we use the 3,200 individuals’s real genotype data from GERA in chromosome 16 to chromosome 18 (19,865 SNPs) to generate a phenotype. First, we use an function embeded within REMI
package to read data from PLINK bed, bim, fam files directly as
We first generate genotype data using function genRawGeno:
setwd("/Users/nus/GoogleDrive/Work/GPA_matlab/OtherWork/LowRank/Rpackage/ssLasso_Simulation/REMI/data") n1 = 3000; dat = ReadPlinkFile('eur_chr16.17.18_3000'); x = dat$X;
We use 3,000 of all samples as training set and rest 200 samples as testing set by
keeplist = read.table(paste("keeplist_",n1,sep=""),sep="\t",header=F) fam = read.table(paste("eur_chr16.17.18_",n1,".fam",sep=""),sep=" ",header=F) test.set = match(keeplist[1:200,1],fam[,1]); train.set = match(keeplist[201:dim(x)[1],1],fam[,1]); train.lasso.set = match(keeplist[201:3200,1],fam[,1]);
To conduct REMI, we need a reference panel to estimate the correlation or covariance structure among covariates. Here, we use 1000 Genome Project (1000G) data as reference panel data. Thus, we need match reference allele in GERA with 1000G for each SNP. Here we load reference allele information from GERA first as
frq = read.table('eur_chr16.17.18_3000.frq',header=T); rsname = as.character(unlist(frq[,2])) A1_raw = as.character(unlist(frq[,3]))
The phenotype data is generated using linear model and 0.1\(\%\) of SNPs have non-zero effects following Gaussian distribution. We simulate the phenotype data such that the heritability is controlled at 0.5. The codes are
h2 = 0.001; p = dim(x)[2]; #n =dim(x)[1]; alpha= 0.005; m = as.integer(alpha*p); h2 = 0.5; eps = 0.2; set.seed(10000) b = numeric(p); indx = sample(1:p,m); b[indx] = rnorm(m); n = dim(x)[1] y0 = x%*%b; y = y0 + rnorm(n)*sd(y0)*sqrt((1-h2)/h2); var(y0)/var(y) #> [,1] #> [1,] 0.5034505 ytrain.lasso = y[train.lasso.set]; xtrain.lasso = x[train.lasso.set,]; ytrain = y[train.set]; xtrain = x[train.set,]; ytest = y[test.set]; xtest = x[test.set,]; x_var = apply(xtrain.lasso*xtrain.lasso,2,sum); yc.lasso = ytrain.lasso - mean(ytrain.lasso); Xmean.lasso = apply(xtrain.lasso, 2, mean); xc.lasso= sweep(xtrain.lasso,2,Xmean.lasso)
Then, summary statistics can be generated as
Xmean = apply(xtrain, 2, mean); xc= sweep(xtrain,2,Xmean); yc = ytrain - mean(ytrain); ytilde = numeric(p); betah = numeric(p); s2 = numeric(p); for (i in 1:p){ fm = lm(yc~1+xtrain[,i]); betah[i] = summary(fm)$coefficients[2,1]; s2[i] = summary(fm)$coefficients[2,2]^2; ytilde[i] = sum(yc*xc[,i])/length(yc); }
Then, the files with summary statistics are created as
# rs, A1, A2, betah, se2, n # summary statistics for REMI-R sumdat = data.frame(rsname,frq[,3:4],betah, s2, rep(n,p)) write.table(sumdat, "sumdat_sim1.txt",sep=" ",quote=F, row.names=F,col.names=F) # summary statistics for REMI-C sumdat2 = data.frame(rsname[1:p],frq[1:p,3:4],ytilde, rep(var(y),p), rep(n,p)) write.table(sumdat2, "sumdat_sim2.txt",sep=" ",quote=F, row.names=F,col.names=F)
2. Fit REMI
The REMI-R and REMI-C are fitted as
plink_file = "all_chr_1000G_sim_chr16.17.18"; summarystat_file1 = "sumdat_sim1.txt" summarystat_file2 = "sumdat_sim2.txt" out = REMI_R(summarystat_file1, plink_file,epsilon=0.2) out2= REMI_C(summarystat_file2, plink_file,epsilon=0.2);
where plink_file is 1000G reference penale data.
3. Plot Solution Paths
Fit Lasso using genotype data first by
sp= lasso(xtrain.lasso,yc.lasso,x_var,100,eps)
The solution path for Lasso, REMI-R and REMI-C are produced by
par(mfrow=c(1,3),mar=c(4,4,5,1), oma=c(0.5,0.5,0,0.5),mgp = c(2.5, 0.5, 0)) plotCoef3(abs(sp$beta_sparse),lambda=sp$lamseq,df=sp$df,bic=sp$bic, label=F,xvar="lambda", cex.main=1.5,cex.lab=1.5,main = paste("Lasso")); plotCoef3(abs(out$beta),lambda=out$diagnostics[,3],df=out$diagnostics[,1], bic=out$diagnostics[,4],label=F,xvar="lambda", cex.main=1.5,cex.lab=1.5,main = paste("REMI-R")); plotCoef3(abs(out2$beta),lambda=out2$diagnostics[,3],df=out2$diagnostics[,1], bic=out2$diagnostics[,4],label=F,xvar="lambda", cex.main=1.5,cex.lab=1.5,main = paste("REMI-C"));
// add bootstrap table styles to pandoc tables function bootstrapStylePandocTables() { $('tr.header').parent('thead').parent('table').addClass('table table-condensed'); } $(document).ready(function () { bootstrapStylePandocTables(); });