HaploPS: Identifying recent positive selection signals in the human population

This page contains the program HaploPS, HaploPS_search and the scripts to analysis output of HaploPS. The detailed usage of HaploPS and the scripts to analyze output could be found in the README accompany the software. The HaploPS.zip package contains all the necessary files to perfom HaploPS, and downstream analylsis. The final output is the significant regions on the genome carrying positive selection signals, and each region is given a haploPS score. Further, if one wants to add genes to the identified region, the add_genes.perl script can be used.

HaploPS_search is a slight modification to HaploPS. HaploPS_search functions as Hapfinder to locate the founder haplotype carrying a selected allele. Instead of performing genome wide search for selection signal, a specific position and search frequency should be given to run haploPS_search. It performs searches of the founder haplotype carrying the SNP at the derived allele frequency. The output is the start and end position coordinates of the founder haplotype and dominant haplotype from. When a selection signal is found, HaploPS_search or hapfinder could be used to locate the founder haplotype.

HaploPS_search can be downloaded here. One example use of haploPS_search is the simulation of convergent and single origin events, which is illustrated below.

The source code for haploPS is available: HaploPS_alpha.cpp.Eigen library is required for the compilation. Example g++ compilation command is : g++ -L ~/library_path/Eigen/ -Wall HaploPS_version2.cpp .out HaploPS.out.

Please contact Xuanyao.liu@nus.edu.sg for any question or comments.

In addition, the scripts and datasets for power calculation of HaploPS are included. In future, if fellow researchers want to perform power calculation and comparison of related methods detecting positive selection, they can use the dataset as reference.

The selection model simulation is performed using Selsim, The configuration file for generating selection reaching current day frequency of 10%, 20% up to 100% can be found here. The command is ./Selsim configuration.file. The recombination file is selsim_recomb.txt, which is generated by cosi and made usable to selsim.

The example dataset for 2000 iterations of cosi simulation and selsim simulation and the processed input file for haploPS can be found below:

Once all input files are generated, run haploPS by running the script run_haploPS_africa_80f.sh which can be found here.
Once the output has been generated, please run significant_rank.r to generate identified selection region at each frequency. And run power_significant.r to get the iterations with positive finding and the predicted frequency of the selected allele.
If one wants to perform the simulation by themselves, one can follow the following procedure to generate input for haploPS.

The output from the selsim simulations needs to be processed to match the input file format requirement. The scripts used can be find here.

For example, to generate selection model with derived allele frequency of 80% for the functional allele, run:
./SelSim SelSimCON_80.txtTo exact the legend and haplotype information, use the following command found in create_geno_pos.sh:
for i in {0..199}; do s=$((128*$i+5));e=$((128*$i+5));j=$(($i+100)); sed -n “${s},${e}p” seq_80_2.txt > ./selsim_80f/80f_pos_$j.txt; done &
for i in {0..199}; do s=$((128*$i+7));e=$((128*$i+126)); j=$(($i+100)); sed -n “${s},${e}p” seq_80_2.txt > ./selsim_80f/80f_geno_$j.txt; done &
and perl create_postions.perl

The SNPs were thinned randomly to get around 200-300 SNPs on the entire region with Selsim_thin_snps.r

Finally add in genetic distance to the input file by running selsim_create_legend.r. The genetic map is generated by cosi based on genome average of 1cM/Mb.

All the scripst uploaded are made for this example at 80% frequency. To get input at other frequencies, please modify the script accordingly.

The neutral model without selection is performed using cosi using the best fit model. The input files can be found here (params, recParams). To run the simulation, please run the perl script 2000_run.pl.

Similar to selsim simulation, the output need to be modified to match haploPS format requirement. First running create_positions.perl and create_genotype.perl to extract the legend and haplotype from the output. Then run the selsim_thin_snps.r and selsim_create_legend.r to get the haploPS input.

Other than being able to identify signals of positive selection, haploPS is able to locate the founder haplotypes carrying the actual functional allele. With such information, one could infer the whether a selection signal that is shared across different populations is the consequence of a single mutation process followed subsequently by gene flow between populations, or convergent evolution due to the occurrence of multiple independent mutation events. Simulations were done to evaluable sensitivity and specificity of HaploPS’s ability to identify the origin selection events. Msms was used to simulate two population mimicking European and Asian, who split from the non-African population

The example command line for generating 100 iterations of convergent event is:

java -jar msms1.3rc9.jar -N 10000 -ms 240 100 -I 2 120 120 -t 100 -r 100 1000 -ej 0.05 2 1 -SI 0.025 2 0 0 -SAA 1000 -SaA 800 -Smu 0.1 -Sp 0.5 -Smark > convergent_test.txt &
In the above setting, the two populations split 2000 generations ago, and the selection was placed to the two populations separately 1000 generations ago.

The example command line for generating single origin event is:

java -jar msms1.3rc9.jar -N 10000 -ms 240 100 -I 2 120 120 -t 100 -r 100 1000 -ej 0.05 2 1 -SI 0.1 2 0 0 -SAA 400 -SAa 200 -Smu 0.1 -Sp 0.5 -Smark > single_origin_test.txt &
The above command line generate two population split 2000 generation ago, the selection was placed to their common ancestor 4000 generations ago.

To exact the sequences for pop1 and pop2 and the legend from the output, run the following:

for i in {0..99}; do s=$((244*$i+6));e=$((244*$i+6)); sed -n “${s},${e}p” single_origin_test.txt > single_origin_pos_$i.txt; done &
for i in {0..99}; do s=$((244*$i+7));e=$((244*$i+126)); sed -n “${s},${e}p” single_origin_test.txt > single_origin_geno_${i}_pop1.txt; done &
for i in {0..99}; do s=$((244*$i+127));e=$((244*$i+246)); sed -n “${s},${e}p” single_origin_test.txt > single_origin_geno_${i}_pop2.txt; done &

for i in {0..99}; do s=$((244*$i+6));e=$((244*$i+6)); sed -n “${s},${e}p” convergent_test.txt > convergent_pos_$i.txt; done
for i in {0..99}; do s=$((244*$i+7));e=$((244*$i+126)); sed -n “${s},${e}p” convergent_test.txt > convergent_geno_${i}_pop1.txt; done
for i in {0..99}; do s=$((244*$i+127));e=$((244*$i+246)); sed -n “${s},${e}p” convergent_test.txt > convergent_geno_${i}_pop2.txt; done

followed by running perl scripts. The following perl scripts can be found here.

  • perl msms_create_positions.perl
  • perl msms_create_genotype.perl single_origin pop1
  • perl msms_create_genotype.perl single_origin pop2

To create input of haploPS, one can use the following R script: msms_create_input.r which can be found here.

Since the advantageous allele is placed at position 0.5, the derived allele frequency can be calculated in each simulation. HaploPS_search is used to locate the founder haplotype carrying the advantageous allele at the derived allele frequency.

The output of haploPS is then analyzed with “single_origin_haploPS_search_similarity.r”. The haplotype similary index is calculated for each iteration of simulation