Last time I mentioned that Hardy-Weinberg Equilibrium (HWE) made me think about the relationship between population genetics and quantitative genetics. HWE actually deserves its own post, so here we are.
Background Info on HWE
HWE is a theoretical relationship between allele frequencies and genotype frequencies. For a locus with two alleles, A and a, if the allele frequency of A is p, then the allele frequency of a is q = 1 - p, and the expected genotype frequencies are p^2 for AA, 2pq for Aa and q^2 for aa. This holds if the species is diploid, like humans or the plant species I work on (luckily!).
By Johnuniq - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6045237
What if the observation does not match this prediction? For example, for locus A in my dataset of 100 individuals, I have 50 individuals genotyped as AA, 30 as Aa and 20 as aa, so the allele frequency of A is (50 x 2 + 30)/200 = 0.65, and that of a is 0.35. In theory, the expected counts are 0.65^2 x 100 = 42.25 for AA, 2 x 0.65 x 0.35 x 100 = 45.5 for Aa, and 0.35^2 x 100 = 12.25 for aa. The difference between the observation (50, 30, 20) and the expectation (42.25, 45.5, 12.25) is the deviation.
There are currently two ways to test whether this deviation is significant. Chronologically, the first is a simple Chi-squared goodness-of-fit test (read more about goodness of fit). In our example, the test statistic is (50-42.25)^2/42.25 + (30-45.5)^2/45.5 + (20-12.25)^2/12.25 = 11.60. The degrees of freedom for a test of Hardy–Weinberg proportions are # genotypes − # alleles, so here 3 - 2 = 1. The 5% critical value of the Chi-squared distribution with 1 degree of freedom is 3.84. Since our χ2 value is greater than this, the locus deviates significantly from HWE.
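To make the arithmetic easy to re-check, here is a minimal Python sketch of the goodness-of-fit calculation for a biallelic locus. The function name `hwe_chi_square` is my own, not from any library, and the p-value uses the standard identity for a Chi-squared distribution with 1 degree of freedom, so it only applies to the two-allele case. It assumes both alleles are present (otherwise an expected count is zero).

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Return (expected_counts, chi2, p_value) for a biallelic locus.

    Hypothetical helper for illustration; assumes both alleles are observed.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    q = 1 - p                         # frequency of allele a
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a Chi-squared variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value

expected, chi2, p_value = hwe_chi_square(50, 30, 20)
print([round(e, 2) for e in expected])  # [42.25, 45.5, 12.25]
print(round(chi2, 2))                   # 11.6
print(p_value < 0.05)                   # True
```

The statistic is computed from the unrounded expected counts, since rounding them first shifts the result noticeably in a sample of only 100.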
Later, in the 2005 Wigginton paper (PMID:15789306), the authors proposed an exact test of HWE, which is the second way.
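For intuition, here is a minimal sketch of an exact test of this kind. It conditions on the observed allele counts, builds the probability of every possible heterozygote count with the usual recurrence, and sums the probabilities that are no larger than that of the observed count. The function name `hwe_exact_pvalue` is hypothetical, and this is my own simplified version, not the code that bcftools or any other tool actually runs.

```python
def hwe_exact_pvalue(n_het, n_hom1, n_hom2):
    """Two-sided exact HWE p-value for a biallelic locus (illustrative sketch)."""
    n = n_het + n_hom1 + n_hom2
    if n == 0:
        return 1.0
    n_rare = 2 * min(n_hom1, n_hom2) + n_het   # copies of the rarer allele

    # Start near the most likely heterozygote count, with the same parity as n_rare.
    mid = n_rare * (2 * n - n_rare) // (2 * n)
    if mid % 2 != n_rare % 2:
        mid += 1
    probs = {mid: 1.0}   # unnormalised probabilities, relative to probs[mid]

    # Walk down to fewer heterozygotes (two fewer hets = one more of each homozygote).
    het, hom_r = mid, (n_rare - mid) // 2
    hom_c = n - het - hom_r
    while het >= 2:
        probs[het - 2] = probs[het] * het * (het - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        het, hom_r, hom_c = het - 2, hom_r + 1, hom_c + 1

    # Walk up to more heterozygotes.
    het, hom_r = mid, (n_rare - mid) // 2
    hom_c = n - het - hom_r
    while het <= n_rare - 2:
        probs[het + 2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2) * (het + 1))
        het, hom_r, hom_c = het + 2, hom_r - 1, hom_c - 1

    total = sum(probs.values())
    p_obs = probs[n_het]
    # Small tolerance so that ties are not lost to floating-point noise.
    tail = sum(v for v in probs.values() if v <= p_obs * (1 + 1e-9))
    return min(1.0, tail / total)

print(hwe_exact_pvalue(30, 50, 20) < 0.05)   # True: same conclusion as the Chi-squared test
```

Unlike the Chi-squared approximation, this does not rely on large expected counts, which matters for rare alleles and small subpopulations.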
For more on what HWE is and how to test for deviations from it, see its wiki page. In this post we mainly talk about why, whether and how to filter genetic variants based on HWE.
Reasons for Deviation from HWE
Deviations from HWE could mean several things, mainly:
- inbreeding
- population stratification
- problems in genotyping.
Here we only want to filter out SNPs that deviate from HWE for the third reason: problems in genotyping. If a SNP deviates from HWE for either of the first two reasons, it would be wrong to filter it out.
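To see why population stratification alone can produce a deviation, here is a small Python sketch with made-up numbers: two subpopulations of 100 individuals, each exactly in HWE but with very different allele frequencies (0.9 and 0.1, chosen only for illustration). Pooling them gives far fewer heterozygotes than HWE predicts for the pooled allele frequency, even though nothing is wrong with the genotyping.

```python
def hwe_counts(p, n):
    """Genotype counts (AA, Aa, aa) for n individuals exactly in HWE (rounded)."""
    q = 1 - p
    return (round(p * p * n), round(2 * p * q * n), round(q * q * n))

# Hypothetical subpopulations, each in HWE on its own.
pop1 = hwe_counts(0.9, 100)   # (81, 18, 1)
pop2 = hwe_counts(0.1, 100)   # (1, 18, 81)

# Pool the two subpopulations as if they were one sample.
pooled = tuple(a + b for a, b in zip(pop1, pop2))
n = sum(pooled)
p_pooled = (2 * pooled[0] + pooled[1]) / (2 * n)
expected_het = 2 * p_pooled * (1 - p_pooled) * n

print(pooled)             # (82, 36, 82)
print(expected_het)       # 100.0
print(pooled[1] < expected_het)   # True: heterozygote deficit in the pooled sample
```

A test run on the pooled counts would flag this locus, while a test run within each subpopulation would not.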
In the beginning I did the H-W evaluation for the whole dataset. This is done through bcftools +fill-tags. Here is the explanation for all the tags in the manual.
The most relevant tag is HWE:
INFO/HWE Number:A Type:Float .. HWE test (PMID:15789306); 1=good, 0=bad
There is another related tag, ExcHet, which uses the same input information, namely the number of heterozygous individuals:
INFO/ExcHet Number:A Type:Float .. Test excess heterozygosity; 1=good, 0=bad
After that post, I decided to do the HW evaluation separately for each subpopulation.
For example, for the first SNP in my dataset the tag reads HWE=0.0022464. After filling the tags separately for each subpopulation, which is achieved via -S sample_population.mapping_file, the tags read:
HWE_TS1=1;ExcHet_TS1=0.996109;HWE_AGO=0.560936;ExcHet_AGO=0.837149;HWE_Nigeria=1;ExcHet_Nigeria=1;HWE_Evolution=1;ExcHet_Evolution=1;HWE_AVROS=1;ExcHet_AVROS=1;HWE_Ghana=1;ExcHet_Ghana=1;HWE_Compacta=1;ExcHet_Compacta=0.903226;HWE_Ekona=0.428571;ExcHet_Ekona=1;HWE_Deli=1;ExcHet_Deli=1;HWE_Tanzania=1;ExcHet_Tanzania=1;HWE_TS3=1;ExcHet_TS3=1;HWE_L2T=1;ExcHet_L2T=1;HWE_Ni=1;ExcHet_Ni=0.952381;HWE_TR=1;ExcHet_TR=0.947368;HWE=0.0022464;ExcHet=0.999891
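Reading such a line by eye gets tedious, so here is a minimal Python sketch that pulls the per-population HWE values out of an INFO string and lists the populations below a threshold. The helper names and the 0.05 threshold are my own assumptions, the INFO string is a shortened copy of the one above, and the parsing assumes a biallelic site (one value per tag).

```python
def parse_info(info):
    """Split a VCF INFO string into a dict of key -> raw string value."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return fields

def hwe_by_population(info):
    """Return {population: HWE value} for tags named HWE_<population>."""
    prefix = "HWE_"
    return {
        key[len(prefix):]: float(value)   # assumes a single value per tag
        for key, value in parse_info(info).items()
        if key.startswith(prefix)
    }

# Shortened copy of the INFO string shown above.
info = ("HWE_TS1=1;ExcHet_TS1=0.996109;HWE_AGO=0.560936;ExcHet_AGO=0.837149;"
        "HWE_Ekona=0.428571;ExcHet_Ekona=1;HWE=0.0022464;ExcHet=0.999891")

alpha = 0.05   # assumed threshold, not a recommendation
per_pop = hwe_by_population(info)
failing = [pop for pop, value in per_pop.items() if value < alpha]

print(per_pop)   # {'TS1': 1.0, 'AGO': 0.560936, 'Ekona': 0.428571}
print(failing)   # []
```

For this SNP no subpopulation falls below the threshold even though the pooled value is 0.0022464, which is the pattern one would expect if the pooled deviation comes from differences between populations rather than from genotyping problems.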
Today, when I was trying to filter out some SNPs so that I do not need to deal with tens of millions of them, I gave the Hardy-Weinberg evaluation another thought. It is often desirable to filter out loci based on statistically significant (for a given α-value or P value) deviations from H-W proportions. Everybody knows about Hardy-Weinberg, the iconic p + q = 1 -> p^2 + 2pq + q^2 = 1: simple, elegant, yet too good to be true, just like effective population size (it assumes random mating in an infinitely large population, in the absence of selection, migration, or new mutation). Therefore the concern is that a deviation from it might not be a consequence of poor SNP quality. If we filter too aggressively, we might lose the SNPs that are actually interesting, since their allele frequencies might differ between populations.
Hardy–Weinberg proportions. It is often desirable to filter out loci based on statistically significant (for a given α-value or P value) deviations from HWP. HWP are a common assumption of many downstream analytical tools (for example, STRUCTURE) [81], and removing loci that violate HWP can help to ensure unbiased results for downstream analyses in randomly mating populations [82]. Deviations from HWP often reflect sequencing, assembly or alignment errors (such as a heterozygote deficit caused by allelic dropout or a heterozygote excess caused by paralogous regions) [47,60,83,84]. However, loci out of HWP can also indicate real biological phenomena, such as cryptic population substructuring (Fig. 2c) or balancing selection [85]. As a result, it is crucial to filter HWP within sample-groups (for example, within populations) rather than study wide (for example, globally on all samples) [86] (discussed below) and to do so with a low stringency if the loci under selection or those that differ between populations are of interest. That said, some metrics, such as FST, can be biased upward by the careless removal of loci that are not in HWP within populations [86], which is potentially problematic if population delineations