- Measuring the extent of accessible genome
- SNP discovery
- Quality control for Omni1-Quad genotyping
- Transition and transversion SNPs
- Functional annotation of SNPs
- Indels discovery and functional annotation
- Deletion discovery
- Validation sequence and variant calling
- SNP discovery as a function of coverage
- Haplotype phasing of the SSMP genotype data
1.1 Measuring the extent of accessible genome
The extent of accessible genome refers to the amount of the reference genome (NCBI build37) that has been successfully mapped by the properly paired sequence reads that pass quality control. Sequence read filtering was performed with SAMTOOLS 0.1.17 to remove PCR duplicates, and subsequently the BEDTools version 2.14.31 was used to calculate the accessible genome for each sample with the following command line:
% samtools view -f 0x2 -b [sample].bam | samtools view -b -F 0x200 – | samtools view -b -F 0x400 – | genomeCoverageBed -d -ibam stdin | gzip -cv
Customized scripts were then used for calculating statistical summaries such as quartile information on coverage, total coverage of mapped bases, as well as total number of bases with at least 1 read coverage for mapped bases in each sample from the output of genomeCoverageBed. These metrics are subsequently summarized to calculate the average percentage of accessible genome across all 96 SSMP samples.
1.2 SNP discovery
We used two approaches to perform the SNP discovery process: (i) single-sample SNP calling using CASAVA with the small variant caller module; (ii) multi-sample SNP calling using SAMTOOLS 0.1.17 downloaded from http://sourceforge.net/projects/samtools/files/samtools/0.1.17/.
Briefly, the callSmallVariants module in CASAVA adopts two stages: the first stage calls alleles based on the sequence base calls, alignment and quality scores; while the second stage calls SNPs on the basis of the allele calls and the read depth. Allele calling considers the paired-end reads that satisfy all of the following criteria:
- Properly aligned and mapped to the reference genome;
- Pass the quality control filtering;
- Alignment score of at least 6;
- Read-pairs must map with the expected size and orientation.
QC filters adopted by CASAVA remove reads that fail primary analysis quality checks (such as PCR or optical duplicates, or reads with a paired-end mapping quality score of less than 90). The probabilistic model adopted by CASAVA calculates the likelihood of the possible genotypes for every site in the genome, and translates this to two quality scores: Q(snp) and Q(max_gt). Q(snp) expresses the probability that the site will contain a variant allele while Q(max_gt) shows the probability of the most likely genotype at this site. The mean sequencing depth of each chromosome is calculated, and any candidate SNP that possesses a Q(snp) > 0 but with call depth greater than 3 times the mean sequencing depth of the chromosome is excluded. This criterion typically removes SNPs in regions close to the centromeres and possessing high copy numbers. We additionally included a filter on Q(snp), retaining SNPs only when the Q(snp) is at least 20. Heterozygous SNPs on non-autosomal haploid chromosomes are also removed.
For the multi-sample calling with SAMTOOLS, the BAM file for each sample is split into the individual chromosomes before mpileup in SAMTOOLS is executed. An initial round of read filtering is performed to remove reads marked as PCR or optical duplicates. Reads with paired-end mapping quality less than 90 were also excluded. The following steps and command lines were executed:
Step 1: filter raw bam file
% samtools view -f 0x2 -b [sample].bam chr[*] | samtools view -b -F 0x200 – | samtools view -b -F 0x400 – > [sample]_chr[*].bam
Step 2: SAMTOOLS mpileup
% samtools mpileup -q 90 -uf [ref.fa] [list of bam files] > chr[*]_mpileup.out
Step 3: BCF file generation
% bcftools view -p 0.99 -bvcg chr[*]_mpileup.out > chr[*]_mpileup.raw.bcf 2> bcf.log
Step 4: variants filtering using BCFTOOLS
% bcftools view chr[*]_mpileup.raw.bcf | awk ‘$6>3’ | vcfutils.pl varFilter -D[*] -10 -20 -30 -40 > chr[*]_D[*]Qonly.out
After variant discovery, the SNPs are filtered to minimize false discovery using bcftools in SAMTOOLS with the following filtering criteria:
- Variant quality > 3
- Maximum read depth = mean read depth + 3 * standard deviation of read depth
- Minimum read depth = 3
- SNP must not be within 10bp of a gap.
We additionally removed all heterozygous SNPs for SNPs discovered in chromosome Y and mitochondria. To reduce the likelihood of false positives in the SNP calling, the final set of SNPs that are reported in the main text only included those that have been discovered by both CASAVA and SAMTOOLS.
1.3 Quality control for Omni1-Quad genotyping
Each of the 100 SSMP samples has also been genotyped on the Illumina Human Omni1Quad genotyping array. Genotype calling was performed using Genome Studio with version D of the manifest (1,138,747 markers. We remapped the SNPs to their updated chromosomal positions on NCBI Build37. Samples were assessed for the following criteria:
- High rates of missingness (> 5%)
- Excess heterozygosity, which can be indicative of sample contamination
- Excess identity-by-state, which suggests related or duplicated samples
- Discordant ethnic membership from the reported ethnicity
- Gender discrepancy.
One sample (SSM048) was identified to be distinctively Asian Indian in origin, and was excluded from subsequent analysis.
We excluded SNPs if at least one of the following conditions were satisfied:
- Invalid chromosomal positions or strand;
- Duplicated chromosomal positions
- High rates of missingness (call rate < 95%)
- Hardy-Weinberg equilibrium p-value < 10-4.
The SNPs on the Illumina Omni1Quad that passed the QC checks were subsequently used to assess the accuracy of the genotype calls made by CASAVA and SAMTOOLS mpileup on the sequence data.
1.4 Transition and transversion SNPs
Transition and transversion ratio (Ti/Tv) refers to the ratio of the number of transition SNPs to the number of transversion SNPs. This is a metric that is calculated across the biallelic SNPs discovered throughout the genome. A transition SNP refers to a polymorphism with a point mutation that changes a purine nucleotide to another purine (A <-> G), or a pyrimidine nucleotide to another pyrimidine (C <-> T). A transversion SNP refers to a polymorphism with a point mutation that changes a purine to a pyrimidine, or vice versa (A <-> C, A <-> T, C <-> G, G <-> T).
1.5 Functional annotation of SNPs
Functional annotation of biallelic SNPs is performed using ANNOVAR release 2011Nov202. We adopted a gene-based annotation of ANNOVAR in which ENSEMBL is used as a reference annotation database to determine whether a particular SNP will result in protein-coding changes by affecting the amino acids. From the outcome of the gene-based annotation, the functionality of nonsynonymous variants are further predicted with two algorithms: (i) Sorting Intolerant from Tolerant (SIFT); (ii) Polymorphism Phenotyping (PolyPhen version 2).
Both SIFT and PolyPhen aim to predict whether an amino acid substitution affects protein function. SIFT prediction is based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences3, while PolyPhen prediction is based on defined empirical rules which are applied to the sequence, phylogenetic and structural information characterizing the substitution4. SIFT annotation assigns a score (SIFT score) to each SNP which ranges from 0 to 1, where a SNP possessing a score higher than 0.05 is regarded as benign whereas the amino acid substitution is predicted to be damaging if the score is less than or equal to 0.05. Unlike SIFT, PolyPhen generates a prediction score with a minimum of 0, and score of at least 2 indicates an intolerant substitution that potentially leads to damaging impact.
The workflow is depicted below, and consists of four steps:
Step 1: Input file preparation
Step 2(a): Annotation database preparation
% perl annotate_variation.pl -buildver hg19 -downdb ensGene /hg19db/
% perl annotate_variation.pl -downdb snp132 -buildver hg19 /hg19db/
% perl annotate_variation.pl -downdb avsift -buildver hg19 /hg19db/
% perl annotate_variation.pl -downdb ljb_pp2 -buildver hg19 -webfrom annovar /hg19db/
Step 2(b): ANNOVAR gene-based annotation
% perl annotate_variation.pl -geneanno -buildver hg19 -dbtype ensGene [input file] /hg19db
Step 3: ANNOVAR filter based annotation for exonic variants
% perl annotate_variation.pl -filter -dbtype snp132 -buildver hg19 [input file] /hg19db/
% perl annotate_variation.pl -filter -dbtype avsift -buildver hg19 [input file] /hg19db/
% perl annotate_variation.pl -filter -dbtype ljb_pp2 -buildver hg19 [input file] /hg19db/
Step 4: Summarization of functional annotation output file
The annotation results were binned into 8 categories as described in the table below (as described from the ANNOVAR website), with another 5 categories for exonic SNPs.
|exonic||variant overlaps a coding exon|
|splicing||variant is within 2-bp of a splicing junction|
|ncRNA||variant overlaps a transcript without coding annotation in the gene definition|
|UTR|| variant overlaps a 5′ untranslated region
variant overlaps a 3′ untranslated region
|intronic||variant overlaps an intron|
|upstream||variant overlaps 1-kb region upstream of transcription start site|
|downstream||variant overlaps 1-kb region downtream of transcription end site|
|intergenic||variant is in intergenic region|
|stopgain||a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate creation of stop codon at the variant site|
|stoploss||a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate elimination of stop codon at the variant site|
|unclassified||unknown function (due to various errors in the gene structure definition in the database file)|
|nonsynonymous||a single nucleotide change that cause an amino acid change|
|synonymous||a single nucleotide change that does not cause an amino acid change|
1.6 Indels discovery and functional annotation
As per SNP discovery, indels detection is performed with the same two methods CASAVA and SAMTOOLS mpileup. For indel discovery with CASAVA callSmallVariants module, candidate indels are identified by realigning singleton reads with gapped alignment and to use the read positions from the singleton read pairs as a distance metric to yield clusters of non-aligned singleton reads and semi-aligned reads. Within each cluster, the reads are assembled into contigs which are then aligned to the genome using the positions of the associated singleton reads. The diploid genotype call for each indel is then made using both the reads from the indel contig assembly and any reads from the reference assembly which overlap with the position of the putative indel. Genotype calls are subsequently made for each indel according to the associated quality scores to identify the most likely call. Functional annotations of the indels are similarly performed with ANNOVAR release 2011Nov20.
1.7 Deletion discovery
Two methods were used for the detection of large deletions in each sample: (i) BreakDancer-1.1._2011_02_21; (ii) VariationHunter Release_v0.3. BreakDancer adopts the paired-end mapping approach that utilizes paired-end sequencing reads to discover structural variants by relating the genomic coordinates of both ends of a paired-end read against the distribution of the insert sizes5. VariationHunter similarly adopts a paired-end mapping strategy based on maximuim parsimony in identifying structural variants6.
The BAM file for each sample is pre-processed to remove duplicated reads and those that fail QC, defined as reads with mapping quality > 35. The bam2cgf.pl perl script was used to generate the input file for BreakDancer, where we adopted the default option of 4 for the parameter on the standard deviation of the insert sizes. VariationHunter requires an input divet file which is converted from BAM with sam2divet.awk, where we specified the min parameter to be the mean insert size minus four times the standard deviation of the insert sizes. The max parameter is set to the mean insert size plus four times the standard deviation of the insert sizes.
The raw output from both software was subsequently filtered to minimize false discovery. For BreakDancer, we only retain a call if all of the following criteria are satisfied:
- Minimum confidence score of 25;
- Supported by at least 3 reads;
- Both paired-end reads are mapped onto the same chromosome;
- Not located in the centromere.
For VariationHunter, only deletions supported by at least 3 distinct reads and are not located in the centromere are retained. Overlapping calls are subsequently merged into a single call. To reduce the likelihood of false positives in the calling of deletions, the final set of deletions that are reported in the main text only included those that have been discovered by both BreakDancer and VariationHunter, and are between 50bp and 10Mb in size. To evaluate the intersection of the deletions by both methods for evaluating concordance, we used PennCNV version 2011Jun16 available from http://www.openbioinformatics.org/penncnv/penncnv_download.html#_Toc2148172567.
The scripts for the workflow are:
Step 1: Pre-processing raw bam files
% samtools view -q 35 -F 0x200 -b [sample].bam | samtools view -F 0x400 -b – > [sample].q35.bam
% samtools view [sample].q35.bam | awk -f sam2divet.awk > [sample].q35.vh.divet
Step 2: Structural deletion calling
% breakdancer_max [sample].q35.cfg > [sample].q35.BD.SV
% echo [$min-200] [$max+200] 0 [sample].q35.vh.divet 2 | ./VariationHunter_SC > log &
Step 3: filtration
% more [sample].q35.BD.SV | awk ‘$9>25 && $10>2 && $7 == “DEL” && $1 == $4’ > [sample].q35.BD.DEL
% ./remove_centromere.pl [sample].q35.BD.DEL > [sample].q35.BD.DEL.nogap
% ./remove_centromere_vh.pl [sample].q35.vh.divet.SV.Deletion > [sample].q35.vh.divet.SV.Deletion.nogap
Step 4: Concordance check on deletions discovered by BreakDancer and VariationHunter.
A detected deletion is defined to be known if at least 50% of the detected deletion overlaps with known findings from either the Database of Genomic Variants (DGV, November 2010 release) or by the 1KGP. Otherwise, the deletion is considered to be novel.
1.8 Validation sequence and variant calling
To investigate the reliability of the discovered variants, validation sequencing was performed on chromosome 20 for eleven of the samples. Each sample was captured on a NimbleGen SeqCap EZ Choice XL library that was designed for the targeted re-sequencing of chromosome 20, covering all 63,025,520 bases. Samples were indexed with a unique 6-base Illumina barcode and pooled for sequencing on two lanes of an Illumina HiSeq 1000. Reads were processed through the CASAVA 1.8.2 pipeline and mapped to Hg19 with ELANDv2e. We performed local realignment around indels and base quality score recalibration using the Genome Analysis Toolkit (GATK version 1.6-5-g557da77), and called SNPs and indels using the GATK Unified Genotyper. SNP and indel calls were filtered with the GATK VariantFiltrationWalker following the recommended parameters (for SNPs “QD < 2.0", "MQ < 40.0", "FS > 60.0″, “HaplotypeScore > 13.0”, “MQRankSum < -12.5", "ReadPosRankSum < -8.0" and for Indels "QD < 2.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", "FS > 200.0″). We also flagged and removed dense SNP clusters (–clusterWindowSize 8) or SNPs that occur at bases with Indel calls (–mask).
|QD||< 2||< 2.0|
|FS||> 60||> 200|
|ReadPosRankSum||< -8||< -20|
1.9 SNP discovery as a function of coverage
We vary the sequence coverage on chromosome 20 for all 96 samples from 5X to 30X in increment of 5X. For each coverage, we re-evaluated the number of SNPs that are detected across all 96 samples. The thinning of the sequence coverage is performed with DownsampleSam tool from PICARD version 1.55 (http://picard.sourceforge.net/) using the following command line for each sample:
% java -Xmx4g -jar DownsampleSam.jar INPUT= [sample]_chr20.bam OUTPUT=[sample]_chr20_5x.bam P=0.1667
SNP calling is subsequently performed on the thinned data using the multi-sample SAMTOOLS mpileup, including the full-coverage data in order to investigate the impact of sequence coverage on SNP discovery.
1.10 Haplotype phasing of the SSMP genotype data
The release of the SSMP data at http://www.statgen.nus.edu.sg/~SSMP is accompanied by phased haplotypes for the same samples that passed data quality control, as we expect the phased haplotypes will be a useful accompaniment for genotype imputation.
The set of consensus biallelic SNPs detected by both CASAVA and SAMTOOLS mpileup is considered for phasing, where we use the genotype calls made by CASAVA. Only SNPs with call rates of at least 0.95 and that are not monomorphic are phased. Phasing was performed using BEAGLE 3.3.2 (http://faculty.washington.edu/browning/beagle/beagle.html) with the following command line:
% java -Xmx 10000m -jar beagle.jar unphased=chr*.bgl missing=0 out=chr*
The raw output from the phasing is in a format where the allele calls are represented by either “1” or “2”, and these are subsequently converted to the ACGT format.
- Quinlan, A.R., and Hall, I.M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842.
- Wang, K., Li, M., and Hakonarson, H. (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38, e164.
- Ng, P.C., and Henikoff, S. (2001). Predicting deleterious amino acid substitutions. Genome Res 11, 863-874.
- Adzhubei, I.A., Schmidt, S., Peshkin, L., Ramensky, V.E., Gerasimova, A., Bork, P., Kondrashov, A.S., and Sunyaev, S.R. (2010). A method and server for predicting damaging missense mutations. Nat Methods 7, 248-249.
- Chen, K., Wallis, J.W., McLellan, M.D., Larson, D.E., Kalicki, J.M., Pohl, C.S., McGrath, S.D., Wendl, M.C., Zhang, Q., Locke, D.P., et al. (2009). BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6, 677-681.
- Hormozdiari, F., Hajirasouliha, I., Dao, P., Hach, F., Yorukoglu, D., Alkan, C., Eichler, E.E., and Sahinalp, S.C. (2010). Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics 26, i350-357.
- Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S.F., Hakonarson, H., and Bucan, M. (2007). PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 17, 1665-1674.