Last time I mentioned that Hardy-Weinberg Equilibrium (HWE) made me think about the relationship between population genetics and quantitative genetics. HWE actually deserves its own post, so here we are.

Background Info on HWE

HWE is the fundamental law of population genetics. It is named after two people, Hardy and Weinberg, who developed the theory independently around the same time. It is a theoretical relationship between allele frequency and genotype frequency. For a locus with two alleles, A and a, if the allele frequency of A is p, then the allele frequency of a is q = 1 - p, and the frequency of AA is p^2, Aa is 2pq and aa is q^2, provided the species is diploid, like humans or the plant species I work on (luckily!). Under random mating, you can recalculate the allele frequencies at this locus in the next generation and find that they are still p and q.
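To make the "still p and q" claim concrete, here is a minimal Python sketch. The function names and the value p = 0.65 are my own choices for illustration. It builds the genotype frequencies from p and then counts the A alleles back out of them.

```python
def genotype_freqs(p):
    """Expected genotype frequencies (AA, Aa, aa) for a diploid, biallelic locus."""
    q = 1 - p
    return p * p, 2 * p * q, q * q


def allele_freq_A(f_AA, f_Aa, f_aa):
    """Frequency of allele A: every AA carries two copies, every Aa carries one."""
    return f_AA + f_Aa / 2


p = 0.65  # illustrative value, any p between 0 and 1 works
f_AA, f_Aa, f_aa = genotype_freqs(p)
p_next = allele_freq_A(f_AA, f_Aa, f_aa)

print(f"genotype frequencies: AA={f_AA:.4f}, Aa={f_Aa:.4f}, aa={f_aa:.4f}")
print(f"allele frequency of A recovered from the genotypes: {p_next:.4f}")
```

Algebraically this is just p^2 + (2pq)/2 = p(p + q) = p, so the allele frequency does not move from one generation to the next.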

By Johnuniq - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6045237

Deviation from HWE

What if the observation does not match this prediction? For example, for locus A in my dataset of 100 individuals, I have 50 people genotyped as AA, 30 as Aa and 20 as aa, so the allele frequency of A is (50 x 2 + 30)/200 = 0.65, and that of a is 0.35. In theory, the expected count for genotype AA is 0.65^2 x 100 = 42.25, for Aa it is 2 x 0.65 x 0.35 x 100 = 45.5, and for aa it is 0.35^2 x 100 = 12.25. The difference between observation (50, 30, 20) and expectation (42.25, 45.5, 12.25) is the deviation. This is how it works for a biallelic locus. For multiallelic loci and polyploids, see this post on multinomial expansion.

Currently there are two ways to test whether this deviation is significant. Chronologically, the first is a simple Pearson’s Chi-squared goodness-of-fit test (read more about goodness of fit). In our example, the test statistic is (50-42.25)^2/42.25 + (30-45.5)^2/45.5 + (20-12.25)^2/12.25 ≈ 11.60. The degrees of freedom for a test of Hardy–Weinberg proportions are # genotypes − # alleles, so here 3 − 2 = 1. The 5% critical value of the Chi-squared distribution with 1 degree of freedom is 3.84. Since our χ2 value is greater than this, the locus deviates significantly from HWE.
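The arithmetic is easy to get wrong by hand, so here is a small Python sketch that recomputes the expected counts and the Chi-squared statistic from the observed counts (50, 30, 20). The function name is hypothetical, and the p-value uses the fact that a Chi-squared variable with 1 degree of freedom is a squared standard normal, so only the standard library is needed.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Pearson goodness-of-fit test against HWE for one biallelic locus.

    Returns (p, expected_counts, chi2, p_value). Assumes both alleles are
    present; a monomorphic locus would give expected counts of zero.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a Chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, expected, chi2, p_value


p, expected, chi2, p_value = hwe_chi_square(50, 30, 20)
print(f"p(A) = {p:.2f}")
print("expected counts:", [round(e, 2) for e in expected])
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.2g}")
print("deviates at the 5% level:", chi2 > 3.84)
```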

However, since the Chi-squared test is asymptotic, it does not perform very well when the sample size is small (consider multiallelic loci). It also “can have inflated type I error rates, even in relatively large samples” (Wigginton et al. 2005). Therefore a form of Fisher’s exact test is needed. For example, exact tests and Markov chain Monte Carlo approximations to them have been proposed for testing deviations from Hardy-Weinberg proportions (HWP) (Guo & Thompson, 1992; Wigginton et al. 2005).
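To show what an exact test looks like for a biallelic locus, here is a Python sketch in the spirit of Wigginton et al. (2005). It is my own illustration, not the bcftools implementation. It conditions on the sample size and the allele counts, enumerates every possible number of heterozygotes, and takes the p-value to be the total probability of all heterozygote counts that are no more likely than the observed one. The function name is hypothetical.

```python
import math


def hwe_exact_p(n_AA, n_Aa, n_aa):
    """Exact HWE test p-value for one biallelic locus (illustrative sketch)."""
    n = n_AA + n_Aa + n_aa
    n_minor = 2 * min(n_AA, n_aa) + n_Aa  # copies of the rarer allele
    n_major = 2 * n - n_minor

    def log_weight(het):
        # Log of the probability of `het` heterozygotes, up to a constant
        # that is the same for every het count: 2^het / (homs! * het! * homs!)
        hom_minor = (n_minor - het) // 2
        hom_major = (n_major - het) // 2
        return (het * math.log(2)
                - math.lgamma(hom_minor + 1)
                - math.lgamma(het + 1)
                - math.lgamma(hom_major + 1))

    # The het count must have the same parity as the minor allele count
    hets = list(range(n_minor % 2, n_minor + 1, 2))
    logs = [log_weight(h) for h in hets]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]  # rescaled to avoid underflow
    total = sum(weights)
    probs = {h: w / total for h, w in zip(hets, weights)}

    p_obs = probs[n_Aa]
    # Sum every outcome that is no more probable than the observed one
    # (the small tolerance treats floating-point ties as equal)
    p_value = sum(pr for pr in probs.values() if pr <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)


print(f"exact p-value for (50, 30, 20): {hwe_exact_p(50, 30, 20):.2g}")
print(f"exact p-value for (25, 50, 25): {hwe_exact_p(25, 50, 25):.2g}")
```

Enumerating all heterozygote counts is cheap for a single SNP, which is why an exact test is practical even on large datasets.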

There is also something called “equivalence tests”. For more on what H-W equilibrium is and how to test for deviation from it, see its wiki page. In this post we mainly talk about why, whether and how to filter genetic variants based on HWE.

Reasons for Deviation from HWE

Deviations from HWE could mean several things, mainly:

inbreeding (defined as mating between individuals sharing a common parent in their ancestry)
population stratification
problems in genotyping.

Here we only want to filter out SNPs that deviate from HWE for the third reason: problems in genotyping. If a SNP deviates from HWE for one of the first two reasons, it would be wrong to filter it out.

In the beginning I did the H-W evaluation for the whole dataset. This is done through bcftools +fill-tags. Here is the explanation for all the tags in the manual.

The main relevant tag is HWE

INFO/HWE Number:A Type:Float .. HWE test (PMID:15789306); 1=good, 0=bad

There is another related tag, which uses the same input information, namely the number of heterozygous individuals:

INFO/ExcHet Number:A Type:Float .. Test excess heterozygosity; 1=good, 0=bad

After that post, I decided to do the HW evaluation separately for each subpopulation.

For example, for the first SNP in my dataset the tag reads HWE=0.0022464. After filling the tags separately for each population, which is achieved via -S sample_population.mapping_file, the tags read:

HWE_Pop1=1;ExcHet_Pop1=0.996109;HWE_Pop2=0.560936;ExcHet_Pop2=0.837149;HWE_Pop3=1;ExcHet_Pop3=1
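Here is a rough Python sketch of how such an INFO string could be turned into a per-population decision. The function names and the threshold alpha = 0.01 are hypothetical choices for illustration, not a recommendation. Since the tag is declared Number:A, a multiallelic site can carry several comma-separated values, and the sketch assumes we keep the smallest one.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    tags = {}
    for field in info.split(";"):
        if not field:
            continue
        key, sep, value = field.partition("=")
        tags[key] = value if sep else True
    return tags


def per_population_hwe(info):
    """Return {population: smallest HWE p-value} from HWE_<pop> tags."""
    result = {}
    for key, value in parse_info(info).items():
        if key.startswith("HWE_") and value is not True:
            # Number:A tags may hold one value per ALT allele; "." means missing
            values = [float(v) for v in value.split(",") if v != "."]
            if values:
                result[key[len("HWE_"):]] = min(values)
    return result


def failing_populations(info, alpha):
    """Populations whose HWE p-value falls below the (hypothetical) cutoff."""
    return sorted(pop for pop, p in per_population_hwe(info).items() if p < alpha)


info = ("HWE_Pop1=1;ExcHet_Pop1=0.996109;HWE_Pop2=0.560936;"
        "ExcHet_Pop2=0.837149;HWE_Pop3=1;ExcHet_Pop3=1")
alpha = 0.01  # illustrative threshold only

print("per-population HWE p-values:", per_population_hwe(info))
print("populations failing:", failing_populations(info, alpha))
print("global HWE=0.0022464 fails at the same cutoff:", 0.0022464 < alpha)
```

With this cutoff the SNP would be flagged by the global test but passes in every subpopulation, which is the pattern you would expect if the global deviation comes from population structure rather than from a genotyping problem.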

Today, when I was trying to filter out some SNPs so that I do not need to deal with tens of millions of them, I gave the Hardy-Weinberg evaluation another thought. It is often desirable to filter out loci based on statistically significant (for a given α-value or P value) deviations from H-W proportions. Everybody knows about Hardy-Weinberg, the iconic p + q = 1 -> p^2 + 2pq + q^2 = 1: simple, elegant, yet too good to be true, just like effective population size, because of its assumptions (random mating and an infinitely large population in the absence of selection, migration or new mutation). Therefore the concern is that a deviation from it might not be a consequence of poor SNP quality. If we filter too aggressively, we might lose the SNPs that are actually interesting, since their allele frequencies might differ between populations.

Hardy–Weinberg proportions. It is often desirable to filter out loci based on statistically significant (for a given α-value or P value) deviations from HWP. HWP are a common assumption of many downstream analytical tools (for example, STRUCTURE) [81], and removing loci that violate HWP can help to ensure unbiased results for downstream analyses in randomly mating populations [82]. Deviations from HWP often reflect sequencing, assembly or alignment errors (such as a heterozygote deficit caused by allelic dropout or a heterozygote excess caused by paralogous regions) [47,60,83,84]. However, loci out of HWP can also indicate real biological phenomena, such as cryptic population substructuring (Fig. 2c) or balancing selection [85]. As a result, it is crucial to filter HWP within sample-groups (for example, within populations) rather than study wide (for example, globally on all samples) [86] (discussed below) and to do so with a low stringency if the loci under selection or those that differ between populations are of interest. That said, some metrics, such as FST, can be biased upward by the careless removal of loci that are not in HWP within populations [86], which is potentially problematic if population delineations

Reference

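Wigginton JE, Cutler DJ, Abecasis GR (2005). A note on exact tests of Hardy-Weinberg equilibrium. American Journal of Human Genetics 76(5): 887-893. PMID:15789306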

Huan Fan / 2024-12-02
Published under (CC) BY-NC-SA in categories notes tagged with stats