CoMM



CoMM: a collaborative mixed model to dissecting genetic contributions to complex traits by leveraging regulatory information






Introduction

This vignette provides an introduction to the CoMM package. R package CoMM implements CoMM, a collaborative mixed model to dissecting genetic contributions to complex traits by leveraging regulatory information. The package can be installed with the command:

library(devtools)

install_github("gordonliu810822/CoMM")

The package can be loaded with the command:

library("CoMM")

Fit CoMM using simulated data

We first generate genotype data using function genRawGeno:

library(mvtnorm) L = 1; M = 100; rho =0.5 n1 = 350; n2 = 5000; maf = runif(M,0.05,0.5) X = genRawGeno(maf, L, M, rho, n1 + n2);

Then, effect sizes are generated from standard Gaussian distribution with sparse structure:

beta_prop = 0.2; b = numeric(M); m = M * beta_prop; b[sample(M,m)] = rnorm(m);

Subsequently, the gene expression y is generated by controlling cellular heritability at prespecified level (h2y):

h2y = 0.05; b0 = 6; y0 <- X%*%b + b0; y <- y0 + (as.vector(var(y0)*(1-h2y)/h2y))^0.5*rnorm(n1+n2);

Finally, the phenotype data is generated as the generative model of CoMM with a prespecified trait heritability (h2) as:

h2 = 0.001; y1 <- y[1:n1] X1 <- X[1:n1,] y2 <- y0[(n1+1):(n1+n2)] X2 <- X[(n1+1):(n1+n2),] alpha0 <- 3 alpha <- 0.3 sz2 <- var(y2*alpha) * ((1-h2)/h2) z <- alpha0 + y2*alpha + rnorm(n2,0,sqrt(sz2))

The genotype data X1 and X2 are normalized as

y = y1; mean.x1 = apply(X1,2,mean); x1m = sweep(X1,2,mean.x1); std.x1 = apply(x1m,2,sd) x1p = sweep(x1m,2,std.x1,"/"); x1p = x1p/sqrt(dim(x1p)[2]) mean.x2 = apply(X2,2,mean); x2m = sweep(X2,2,mean.x2); std.x2 = apply(x2m,2,sd) x2p = sweep(x2m,2,std.x2,"/"); x2p = x2p/sqrt(dim(x2p)[2]) w2 = matrix(rep(1,n2),ncol=1); w1 = matrix(rep(1,n1),ncol=1);

Initilize the parameters by using linear mixed model (function lmm_pxem, LMM implemented using PX-EM algorithm):

fm0 = lmm_pxem(y, w1,x1p, 100) sigma2beta =fm0$sigma2beta; sigma2y =fm0$sigma2y; beta0 = fm0$beta0;

Fit CoMM w/ and w/o constraint that alpha = 0 as

fmHa = CoMM_covar_pxem(y, z, x1p, x2p, w1, w2,constr = 0); fmH0 = CoMM_covar_pxem(y, z, x1p, x2p, w1, w2,constr = 1); loglikHa = max(fmHa$loglik,na.rm=T) loglikH0 = max(fmH0$loglik,na.rm=T) tstat = 2 * (loglikHa - loglikH0); pval = pchisq(tstat,1,lower.tail=F) alpha_hat = fmHa$alpha 

Fit CoMM using GWAS and eQTL data

The example of running CoMM using GWAS and eQTL data in plink binary format

file1 = "1000G.EUR.QC.1"; file2 = "NFBC_filter_mph10"; file3 = "Geuvadis_gene_expression_qn.txt"; file4 = ""; file5 = "pc5_NFBC_filter_mph10.txt"; whichPheno = 1; bw = 500000;

Here, file1 is the prefix for eQTL genotype data in plink binary format, file2 is the GWAS data in plink binary format, file3 is the gene expression file with extended name, file4 and file5 are covariates file for eQTL and GWAS data, respectively. Then run fm = CoMM_testing_run(file1,file2,file3, file4,file5, whichPheno, bw);. For gene expresion file, it must have the following format (rows for genes and columns for individuais):

lower up genetype1 genetype2 TargetID Chr HG00105 HG00115
59783540 59843484 lincRNA PART1 ENSG00000152931.6 5 0.5126086 0.7089508
48128225 48148330 protein_coding UPP1 ENSG00000183696.9 7 1.4118007 -0.0135644
57846106 57853063 protein_coding INHBE ENSG00000139269.2 12 0.5755268 -1.0162217
116054583 116164515 protein_coding AFAP1L2 ENSG00000169129.8 10 1.1117776 0.0407033
22157909 22396763 protein_coding RAPGEF5 ENSG00000136237.12 7 0.2831573 -0.1772559
11700964 11743303 lincRNA RP11-434C1.1 ENSG00000247157.2 12 0.2550282 -0.2831573

To make ‘CoMM’ further speeding, we implement multiple thread version of ‘CoMM’ by just run fm = CoMM_testing_run_mt(file1,file2,file3, file4,file5, whichPheno, bw, coreNum); where coreNum = 24 is the number of cores in your CPU.

Figures

The following data and codes are used to produce one of the figures in the Yang et al. (2018).