Huan Fan http://fanhuan.github.io 2026-05-18T09:10:37+00:00 huan.fan@wisc.edu The Hidden Importance of Founders in PLINK Analysis http://fanhuan.github.io/en/2026/05/15/Plink-Founders/ 2026-05-15T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2026/05/15/Plink-Founders Recently I started doing family-based GWAS using SNIPAR. This means I need to know the relationships between the samples in my analysis. Previously I only had info on two families, which make up the majority of the data I am working on, and I just treated the rest as unrelated. But I know that is not true. In order to increase the sample size, I used KING, a kinship inference tool, to predict the possible relationships based on SNP data. Then I checked with the breeders to see whether they agreed with those relationships. So now in my dataset, a lot of individuals have a derived, hypothetical PID or MID (paternal or maternal ID), just to indicate full- or half-sibling relationships.

Then I just went ahead with my usual data preparation using PLINK until I ran into some problems, and they all center around a concept called founder.

  • Anyone with 0 0 in columns 3–4 of the .fam file (no parents listed)
  • Not a biological concept — purely a pedigree bookkeeping artifact
  • Population datasets with no pedigree: everyone is a founder (fine)
  • Breeding/family datasets with pedigrees filled in: only the top generation are founders (can be very few)
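To make the definition above concrete, here is a minimal sketch that counts founders in .fam-style records. The function name and the toy records are made up for illustration; the only rule it encodes is the one stated above (columns 3 and 4 both equal to 0).

```python
def count_founders(fam_lines):
    """Count founders and non-founders in PLINK .fam-style lines.

    A founder is a sample whose paternal ID (column 3) and maternal ID
    (column 4) are both "0".
    """
    founders = 0
    nonfounders = 0
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        pid, mid = fields[2], fields[3]
        if pid == "0" and mid == "0":
            founders += 1
        else:
            nonfounders += 1
    return founders, nonfounders


# Toy records (hypothetical IDs): FID IID PID MID SEX PHENO
example_fam = [
    "F1 dad 0 0 1 -9",
    "F1 mom 0 0 2 -9",
    "F1 kid1 dad mom 1 -9",
    "F1 kid2 dad mom 2 -9",
    "F2 loner 0 0 1 -9",
]
print(count_founders(example_fam))
```

To run it on a real file, pass an open file handle instead of the list, since iterating over a file yields its lines.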

This is all because by default, PLINK calculates allele frequencies based on founders only. Related individuals share alleles IBD — counting them equally inflates the effective sample size and biases allele frequency estimates. Using only founders approximates sampling independent chromosomes from the base population.

We talked about base population before. At that point, I thought it only affected certain PLINK functions such as --maf or --hwe. Not until today did I realize that, by default, any PLINK feature that needs allele frequencies computes them from the founders. OK so the first conclusion of today is: in PLINK, founders are the base population.

3. Analyses silently affected by founder status

Basically any analysis. You need to be very careful about whether you want to use just the founders (if the pedigree in your .fam file is correct) or all the individuals (turn on --nonfounders). Sometimes you do not want the latter either, for example if your dataset is heavily biased by a few families, like mine is. Here is a limited summary table for the features I usually use. But again, by default only founders are used for allele frequency calculation in any feature, any!

Flag | What uses founders | Consequence if few founders
--freq | Frequency computed from founders only | Inaccurate MAF
--maf | Filters based on founder frequencies | Wrong variants removed/retained
--hwe | HWE test on founders only | Underpowered or wrong results
--pca (PLINK 1.9) | GRM built from founders only | Fails if N_founders < 20 or has duplicates
--pca approx (PLINK 2) | Allele freqs from founders | Hard error if N_founders < 50
--indep-pairwise / LD pruning | r² computed from founders only | Over-pruning when few founders (spurious LD from small N)
--genome / IBD | Uses founder allele frequencies | Biased IBD estimates

4. How did I discover this silent scary behavior?

  1. Like I said in the beginning, after adding all those PID and MID, there are very few founders left in my dataset, and I noticed that a lot more SNPs were filtered out under the same --maf.

  2. Then I realized that it also affects LD pruning: under the same parameters (--indep-pairwise 500 50 0.8), a higher percentage of SNPs were flagged as being in LD, so the pruning was heavier.

  3. Eventually, PCA failed:

  • PLINK 1.9 --pca: fails with a cryptic GRM error (“Failed to extract eigenvector(s) from GRM”), probably a singularity problem.
  • PLINK 2 --pca approx: explicit error (“less than 50 founders available to impute allele frequencies”)

Both errors have the same root cause: the GRM and allele frequency estimation are operating on fewer than 50 individuals for a dataset with thousands of samples.

5. Solutions and tradeoffs

--nonfounders: Usually the easy fix is to turn on this option and use all individuals in the dataset. With all the thousands of individuals this did retain slightly more SNPs (less than 10%), but still significantly fewer than the previous batch with only hundreds of individuals.

  • --freq + --read-freq: pre-compute frequencies from a representative subset, then feed them in — most principled for mixed datasets. Four-step workflow:
    1. Pre-filter without --maf (apply --geno and --mind only)
    2. Define a representative subset: include all true founders (PID=0, MID=0) plus one individual per unique (PID, MID) pair among non-founders. This ensures every independent lineage contributes exactly once — full siblings collapse to one representative, but half-siblings (who share only one parent and thus have different (PID, MID) combinations) each get their own representative.
    3. Compute frequencies from that subset: plink --bfile ... --keep <subset> --freq --nonfounders --out ... Here --nonfounders is required because the subset includes non-founders (e.g., the half-sib representatives); without it PLINK falls back to the 27 true founders.
    4. Apply MAF filter using the pre-computed frequencies: plink --bfile ... --read-freq <freq_file> --maf 0.005 --make-bed --out ...
  • Remove relatives first for LD pruning: use --rel-cutoff (can try third degree: 0.125 or second degree: 0.25) + --make-founders (required when parents are absent from the kept subset) + --indep-pairwise; apply the resulting prune list to the full dataset. For highly structured multi-population datasets, population structure will still inflate LD — per-population pruning followed by taking the union of kept variants is the most principled approach.

    A note on consistency between MAF and LD representative selection: It is natural — and correct — to use different criteria at the two stages. For MAF estimation, the pedigree-based approach (one per unique PID/MID pair) is optimal because it uses known family structure to ensure independent lineage representation; half-siblings are included because their distinct (PID, MID) pairs represent genuinely different crosses. For LD estimation, a kinship cutoff (e.g., 0.125) uses empirical relatedness to prevent shared haplotype blocks from inflating apparent LD; half-siblings (IBD ≈ 0.25) are excluded by this threshold. The LD stage being stricter about relatedness than the MAF stage is the safe direction and is not a methodological inconsistency.

  • --bad-freqs: override (not recommended — hides the problem)
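The representative-subset rule from step 2 of the --freq + --read-freq workflow can be sketched as follows. This is a minimal illustration, not the script I actually ran: the function name and toy IDs are hypothetical, and it assumes parent IDs are unique across families so that a (PID, MID) pair identifies one cross.

```python
def representative_subset(fam_records):
    """Pick all founders plus one individual per unique (PID, MID) cross.

    fam_records: iterable of (fid, iid, pid, mid) tuples taken from a .fam file.
    Returns a list of (fid, iid) pairs suitable for a PLINK --keep file.
    """
    keep = []
    seen_crosses = set()
    for fid, iid, pid, mid in fam_records:
        if pid == "0" and mid == "0":
            keep.append((fid, iid))       # true founder: always kept
        elif (pid, mid) not in seen_crosses:
            seen_crosses.add((pid, mid))  # first offspring seen for this cross
            keep.append((fid, iid))
    return keep


# Toy pedigree with hypothetical IDs
records = [
    ("F1", "dad", "0", "0"),
    ("F1", "mom", "0", "0"),
    ("F1", "sib1", "dad", "mom"),
    ("F1", "sib2", "dad", "mom"),    # full sib of sib1: same cross, skipped
    ("F1", "half1", "dad", "mom2"),  # half sib: different cross, kept
]
subset = representative_subset(records)
print("\n".join(f"{fid}\t{iid}" for fid, iid in subset))
```

The printed two-column FID/IID lines are in the layout that --keep reads. Which sibling represents a cross is simply the first one encountered; picking the one with the lowest missingness would be a reasonable refinement.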

Here is what happened when I applied these options to LD pruning.

Attempt 1 (default, a couple dozen founders): retained only ~4.5% of variants vs ~12.4% for a previous version where we assigned hundreds of founders. Noisy r² from small N causes spurious high-LD calls and over-pruning.

Attempt 2 (--nonfounders, all individuals, in the thousands): retained even fewer variants (~4.2%). This is counterintuitive: more individuals, yet worse results. The explanation requires understanding two distinct sources of r² inflation:

  • Attempt 1 suffers from small-N noise: with only ~27 individuals, r² estimates are imprecise and systematically upward-biased (r² is bounded at 0, so random errors can only push it higher, never lower). Some truly unlinked variants get flagged as in LD by chance.

  • Attempt 2 suffers from kinship-induced pseudo-LD: related individuals share long IBD haplotype blocks. Two variants sitting on the same shared haplotype will co-occur systematically across all members of a family — not because of actual LD in the population, but because of shared ancestry. Within a pruning window, PLINK cannot distinguish this from real LD and prunes accordingly. This is especially bad when you have a lot of related samples in your dataset.
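The small-N effect in the first bullet is easy to see in a simulation. The sketch below is an illustration under simplifying assumptions (unrelated individuals, two unlinked variants with the same allele frequency, Hardy-Weinberg genotypes); it does not model the kinship effect in the second bullet. It computes the squared correlation of genotype allele counts, which for unlinked variants averages out to roughly 1/N.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation of two genotype vectors (allele counts 0/1/2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # monomorphic in this sample: r2 undefined
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def mean_r2_unlinked(n_individuals, n_pairs, freq=0.3, seed=1):
    """Average r2 between pairs of truly independent variants."""
    rng = random.Random(seed)

    def genotypes():
        # Two independent allele draws per individual
        return [int(rng.random() < freq) + int(rng.random() < freq)
                for _ in range(n_individuals)]

    values = []
    for _ in range(n_pairs):
        r2 = r_squared(genotypes(), genotypes())
        if r2 is not None:
            values.append(r2)
    return sum(values) / len(values)


small = mean_r2_unlinked(27, n_pairs=500)     # about the size of my founder set
large = mean_r2_unlinked(2000, n_pairs=100)   # thousands of unrelated samples
print(f"mean r2 with N=27:   {small:.4f}")
print(f"mean r2 with N=2000: {large:.4f}")
```

Even though every pair here is unlinked by construction, the N=27 average sits well above zero, and the tail of that distribution is what gets pruned by chance.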

In my case, the kinship inflation turns out to be larger than the small-N noise inflation, so going from a couple dozen founders to thousands of related individuals makes things worse. Therefore we need to remove relatives first: you need a dataset where r² reflects actual population LD, not shared ancestry.

At first I tried to get an unrelated subset using --rel-cutoff 0.125, but that left only 14 individuals, worse than the original 27 founders. This is because 2nd-degree relatedness is pretty common in my dataset. Then I tried a more lenient cutoff, --rel-cutoff 0.25 (remove only 1st-degree relatives and duplicates), which left a few hundred individuals. You then need to make all of them founders (--make-founders).

Attempt 5 — add --make-founders: promotes all individuals with absent parents to founder status. This is necessary whenever you use --keep to subset a pedigree dataset. Still retained fewer variants than expected (~3.1%), because population structure (many divergent populations) inflates within-window r² regardless of relatedness.

Validation: despite all this, PCA eigenvectors computed before and after LD pruning showed >0.99 correlation — confirming that for PCA, the exact pruning strategy matters little in practice.

6. Key takeaway

Always check your founder count before running any frequency-dependent analysis:

grep "founders" your.log

If you have a pedigree-filled .fam file and few founders, every downstream result is quietly wrong unless you intervene. The --hwe case is worth special attention: HWE violations are expected in related samples, so filtering on HWE in a pedigree dataset silently removes valid markers.

]]>
GBLUP Overfitting http://fanhuan.github.io/en/2025/10/06/GBLUP-Overfitting/ 2025-10-06T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/10/06/GBLUP-Overfitting A while ago we explained the math behind BLUP. Recently I was doing GBLUP for a bunch of traits on the same individuals and, for some traits, the accuracy of the prediction was greater than the narrow-sense heritability (h2)! This should not be happening, and this post documents my debugging process.

Firstly some background on BLUP vs GBLUP. Simply put, the model is exactly the same, y = Xb + Za + e. The difference lies in the relationship matrix that defines the covariance of the random effect a. In BLUP, it is usually the A matrix, which is based on pedigree; in GBLUP, it is usually a genetic relationship matrix (GRM) calculated from molecular markers such as SNPs. In my case it is a GRM calculated from millions of WGS markers.

The tool I was using is an R package called rrBLUP. Let’s first define how I calculate h2 and the accuracy, then look at why there might be overfitting. In this post we will not talk about fixed effects; maybe in another post.

For h2, it is calculated with the full data:

# Fit the mixed model with the GRM as the covariance of the random effect
model <- rrBLUP::mixed.solve(y = y, K = GRM)
# Heritability: genetic variance over total variance
h2 <- model$Vu / (model$Vu + model$Ve)

For accuracy: cor(y, model$u)^2. Note that in rrBLUP::mixed.solve(), model$u is Za, not a. The correct form should actually be cor(Za + e, Za)^2, but since there is no fixed effect in this post, this is equivalent to cor(y, Za)^2.

Hypothesis 1: rescale of GRM. Did not help.

# Check if K is properly scaled
mean(diag(GRM))  # Should be close to 1
# If not, try:
G_scaled <- GRM / mean(diag(GRM))
model <- mixed.solve(y = y, K = G_scaled)

Hypothesis 2: population structure can create “information leakage”

From the BLUP calculation, we know that in the linear combination used for prediction, close relatives get much higher weights than everyone else. Therefore when there is strong population structure (which is true in my dataset), individual i from population A is predicted mainly by its close relatives. The genetic correlation (represented by their pairwise similarity in the GRM) can be confounded with population-specific effects, such as environmental factors or other cryptic relatedness. Since we only have one random effect term, everything is lumped into this term.

Now the real problem is, why this inflates r2, but not h2?

captures both true genetic effects and This is because some of the

]]>
IntroBlocker with its Output Explained. http://fanhuan.github.io/en/2025/09/04/IntroBlocker-Output-Explained/ 2025-09-04T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/09/04/IntroBlocker-Output-Explained I have been playing with this tool called IntroBlocker recently and would like to document what I understand, especially on the output files.

This tool was published along with a population genomic study on wheat (Wang 2022).

The whole program breaks into 5 major parts as suggested both in the code and the output folders.

]]>
GWAS and Its Peaks http://fanhuan.github.io/en/2025/08/08/GWAS-Peaks/ 2025-08-08T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/08/08/GWAS-Peaks When we draw a Manhattan plot for GWAS results, we expect to see sharp peaks, the sharper the better. But how about those isolated points with very low p-values, even after adjustment/punishment? Why are they less trustworthy? It is something that I know for a fact, but I always have trouble explaining it to people who do not do GWAS. Today I’d like to solve this problem once and for all (wow ambitious)!

At the heart of the problem is something called Linkage Disequilibrium (LD). This word has been the center of my universe for the last couple of years. It all dates back to 2010 in Okinawa, where LD and the coalescent were at the center of every theory and every lecture, together with all kinds of selection.

Linkage Disequilibrium and Signal Coherence

When you see a sharp peak with multiple SNPs showing strong associations, it typically reflects the underlying linkage disequilibrium (LD) structure of the genome. SNPs in close proximity tend to be inherited together, so a true causal variant should create a signal that extends across nearby correlated SNPs. An isolated significant SNP surrounded by non-significant variants suggests the signal might not be reflecting a genuine biological effect in that genomic region.

Technical Artifacts and Genotyping Errors

Isolated significant SNPs are more likely to represent technical problems like genotyping errors, batch effects, or platform-specific artifacts. These issues typically affect individual SNPs rather than entire LD blocks. Quality control procedures can miss some of these problems, especially if they’re systematic across cases and controls.

Population Stratification Issues

Inadequately corrected population structure can create spurious associations at individual SNPs, particularly those with unusual allele frequency patterns across ancestral groups. Well-designed studies use principal components or other methods to control for this, but isolated signals might indicate residual stratification.

Multiple Testing Considerations

While you mention adjusted p-values, the genomic context matters for interpretation. A single SNP reaching genome-wide significance (typically 5×10⁻⁸) in isolation is statistically significant but lacks the biological plausibility that comes with seeing the expected LD pattern around a true association.

Biological Plausibility

Clustered signals often coincide with known genes, regulatory elements, or functional annotations, providing biological context. Isolated SNPs in gene deserts or without obvious functional relevance require more scrutiny.

However, isolated SNPs aren’t automatically false positives - they could represent rare variants with large effects, structural variants not well-captured by standard arrays, or associations in regions of low LD. The key is to evaluate them with additional evidence like replication studies, functional annotation, and deeper sequencing.

]]>
Breakpoints vs Breakends http://fanhuan.github.io/en/2025/07/03/Breakpoint-vs-Breakend/ 2025-07-03T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/07/03/Breakpoint-vs-Breakend I have been working on structural variations recently and came across some new concepts.

Before that, a brief recap on the alternative allele field format (section 1.2.5). If the ALT column starts with a left angle bracket (<), it suggests an IMPRECISE structural variant. Being imprecise means that the values in the INFO column (END, SVLEN etc.) are estimated to the best of the mapping info.

Among these concepts I found two confusing ones: breakpoints and breakends.

Breakpoint is a general term. It refers to the positions in the genome where the DNA is broken and rearranged. For a particular SV, it is the start or end coordinate, either precise or imprecise but estimated as well as possible.

Then what are breakends?

I was first introduced to the idea of “breakends” in the manta user guide (btw it still uses python 2.7 and has not been and will not be updated since 2019; I guess everyone is doing long reads now).

Manta divides the SV and indel discovery process into two primary steps: (1) scanning the genome to find SV associated regions and (2) analysis, scoring and output of SVs found in such regions.

Build __breakend__ association graph In this step the entire genome is scanned to discover evidence of possible SVs and large indels. This evidence is enumerated into a graph with edges connecting all regions of the genome which have a possible __breakend__ association. 

According to the VCF specification (v4.1) there are only 6 types of structural variants (SVTYPE) and they are:

  • DEL: deletion
  • INS: insertion
  • DUP: duplication
  • INV: inversion
  • CNV: copy number variation
  • BND: Breakend

So breakend is one of them.

The first five are pretty self-explanatory. There is a whole section dedicated to breakends in the VCF specification: 5.4 Specifying complex rearrangements with breakends.

An arbitrary rearrangement event can be summarized as a set of novel adjacencies. Each adjacency ties together 2 breakends. The two breakends at either end of a novel adjacency are called mates.

Here we first need to understand what is a novel adjacency, which is a new connection between two genomic positions that are not adjacent in the reference genome, suggesting a structural variant in the sample genome.

]]>
Lost In Translation http://fanhuan.github.io/en/2025/05/28/Lost-In-Translation/ 2025-05-28T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/05/28/Lost-In-Translation While working with a vcf file, I noticed that one of the variant looked like this:

ID REF ALT

chr1_254_A_T T A

I was pretty confused. The ID suggested that A is the REF call and T is the alternative. However the REF and ALT columns suggest the opposite. I was immediately alarmed, since this could have caused problematic genotype calls where 0/0 and 1/1 are switched.

How could this be? I checked the original vcf file, of which the current one is a subset, and things are OK there: the ID is still chr1_254_A_T, and the REF is A and ALT is T. So where did things go wrong?

Looking through my notes, I realized that I had converted my vcf to plink format, did some pruning there, and then converted the plink files back to vcf. Could this be the problem?

The variant information is stored in the .bim file, and here is its definition:

.bim (PLINK extended MAP file)
Extended variant information file accompanying a .bed binary genotype table. (--make-just-bim can be used to update just this file.)

A text file with no header line, and one line per variant with the following six fields:

Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
Variant identifier
Position in morgans or centimorgans (safe to use dummy value of '0')
Base-pair coordinate (1-based; limited to 2^31-2)
Allele 1 (corresponding to clear bits in .bed; usually minor)
Allele 2 (corresponding to set bits in .bed; usually major)

As you can see, column 5 is the minor allele and column 6 is the major. This means we have lost the info on which one is REF and which one is ALT. When you use plink --recode vcf to convert your .bim back to vcf, it will just assume that the major is REF, which is not always true.

So what can you do? When converting your vcf to the plink format via plink --vcf input.vcf --make-bed, make sure to add either --keep-allele-order or --real-ref-alleles. Then the .bim file will be correct, and when you convert it back to vcf later there should not be any problem. It is said that plink 2.0 does not have this problem and will always respect the original REF/ALT order.
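If you suspect a file has already been through this round trip, a quick check is to compare each ID against its REF/ALT columns. The sketch below is only an illustration: it assumes variant IDs follow the CHROM_POS_REF_ALT pattern used in my file, and the function name and the second toy record are made up.

```python
def find_swapped_alleles(records):
    """Return IDs whose REF/ALT columns are the reverse of what the ID encodes.

    records: iterable of (variant_id, ref, alt) tuples.
    Assumes IDs look like CHROM_POS_REF_ALT, e.g. chr1_254_A_T.
    """
    swapped = []
    for variant_id, ref, alt in records:
        parts = variant_id.rsplit("_", 2)  # split off the last two fields only
        if len(parts) != 3:
            continue  # ID does not follow the assumed pattern
        _, id_ref, id_alt = parts
        if id_ref != id_alt and (ref, alt) == (id_alt, id_ref):
            swapped.append(variant_id)
    return swapped


records = [
    ("chr1_254_A_T", "T", "A"),  # the variant from this post: swapped
    ("chr1_300_G_C", "G", "C"),  # hypothetical variant: consistent
]
print(find_swapped_alleles(records))
```

Using rsplit from the right keeps the check working even if the chromosome name itself contains underscores.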

Happy genotyping!

]]>
Building a Linkage Map - Part I http://fanhuan.github.io/en/2025/05/05/Lep-MAP3-I/ 2025-05-05T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/05/05/Lep-MAP3-I A linkage map is required for a lot of quantitative genetic (QTL mapping) and population genetic (gene flow) analyses. In this series of posts I will be talking about how to build one based on whole genome sequencing data from a family design.

If you are dealing with millions of markers or even more, the only option that I found appropriate is Lep-MAP3. Btw please let me know if you have a better option as I am really not perfectly happy about it.

Sample size and pedigree structure

As for sample size, I will just give you a rule of thumb.

> 200: reasonable

100 - 200: questionable

< 100: unreliable.

Lep-MAP3 should allow grandparents and half-siblings besides full-sibling and parent-offspring relationships. It can also deal with selfing. You can include everyone in the same analysis to make your sample size larger.

Input prep

Only two files are required. One is the genotype likelihood, usually a vcf file, and one is a pedigree file.

The first one is pretty straightforward. Just make sure the vcf you provide contains more than just genotype calls. If your vcf was recoded from a plink format file, you have most likely lost the GL or PL info.

The pedigree file looks very confusing. But it is actually just a transpose of the .fam file in the plink format, plus two extra columns in the front as placeholders for chromosome and position in the later output. I wrote a python script to help you with the conversion. Note that only two parents are allowed in one family, but there can be multiple families. Now let’s talk about some of the more complicated cases.

  1. Multiple families sharing some family members.

In this case you need to list all relevant individuals in each family, under the same ID. For example, if GP1 is one of the grandparents for both Family_1 and Family_2, then in the pedigree file you will have two columns for GP1, one under Family_1 and one under Family_2. This is also how the program detects half-siblings: it sees individuals that appear as parents in multiple families. ParentCall2 will actually prompt you to turn on the halfSibs=1 option in this case. Note that one needs to turn on grandparentPhase=1 in the OrderMarkers2 step when grandparents are identified.

  2. What about selfing ones?

Selfing families are pretty common in plant breeding. This is what the author suggested on the wiki page of Lep-MAP3:

“As Lep-MAP3 assumes two parents for each family, a selfing crosses cannot be directly analysed with it. However, it is possible to add two dummy parents (one male and another female) to the pedigree.”

Note that you also need to add some dummy columns in your vcf files in this case.

The author also mentioned that “Data for the single parent is not really needed, but the grandparents (say two individuals from different lines crossed to form the parent) can be used.” I don’t know what he means here. Could be, the grandparents are the important info; or if you do not have the single parent info, just use the grandparents as their parents? I wanted to ask this in the forum but kept getting “Spambot protection engaged” from Sourceforge…

Also someone in the forum asked whether one can turn on both grandparentPhase=1 and selfingPhase=1 in the OrderMarkers2 step. The author says that the latter is used when there are no grandparents, and that he does not know how the program will behave when both are turned on.

Overall pipeline

img

Step 1: ParentCall2

Recall what is PL: sample-level annotation calculated by HaplotypeCaller and GenotypeGVCFs, recorded in the sample-level columns of variant records in VCF files. This annotation represents the normalized Phred-scaled likelihoods of the genotypes considered in the variant record for each sample.

PL = -10 * log10 P(Genotype | Data).

P(G|D) is calculated as P(D|G)P(G)/P(D). See more details on this HaplotypeCaller page of GATK. Taking -10 * log10 P(G|D) puts PL on the Phred score scale (Q = -10 * log10(error rate)). PL is then normalized across all genotypes by subtracting the lowest PL from all the values, so that the PL of the most likely genotype is 0.

e.g. this is the genotype and their PL value for three samples.

0/1:38,0,59 0/0:0,69,689 0/0:0,57,569

The order goes like: 0/0, 0/1 and 1/1. In the first sample, the most likely genotype is 0/1 (PL=0), and the second most likely is 0/0 (PL=38). The second and third samples are both called as 0/0, but we have more confidence in the second sample since the difference between the most and second most likely genotypes is larger (69 > 57).
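The reading of those three samples can be reproduced with a few lines of Python. This is a minimal sketch that assumes a bi-allelic site and a sample field of the form GT:PL, as in the example above; the function name is made up.

```python
def call_from_pl(sample_field):
    """Return (most likely genotype, PL gap to the runner-up) for a 'GT:PL' field.

    Assumes a bi-allelic site, so the three PL values are ordered 0/0, 0/1, 1/1.
    """
    _gt, pl_text = sample_field.split(":")
    pls = [int(value) for value in pl_text.split(",")]
    genotype_order = ["0/0", "0/1", "1/1"]
    ranked = sorted(range(len(pls)), key=lambda i: pls[i])  # smallest PL first
    best, second = ranked[0], ranked[1]
    return genotype_order[best], pls[second] - pls[best]


for field in ["0/1:38,0,59", "0/0:0,69,689", "0/0:0,57,569"]:
    print(field, "->", call_from_pl(field))
```

The gap is the PL of the runner-up because the best genotype is normalized to 0; a larger gap means a more confident call.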

Note that each variant for each sample will get a ten-column string of posterior probabilities. Why ten? This is the number of combinations with replacement when drawing 2 alleles from 4 types of nucleotides: CR(n, r) = C(n+r-1, r) = C(5, 2) = 5 x 4 / 2 = 10. However, the 10 posterior columns are not ordered lexicographically by nucleotide (AA, AC, AG, AT, CC, … TT) but by genotype indices (like VCF’s GT field). It looks like this:

1.REF/REF (0/0)

2.REF/ALT1 (0/1)

3.REF/ALT2 (0/2)

4.REF/ALT3 (0/3)

5.ALT1/ALT1 (1/1)

6.ALT1/ALT2 (1/2)

7.ALT1/ALT3 (1/3)

8.ALT2/ALT2 (2/2)

9.ALT2/ALT3 (2/3)

10.ALT3/ALT3 (3/3)

However, since I already prefiltered my data so that only bi-allelic variants are kept, you will see that only the 1st, 2nd and 5th columns are used.
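The count of ten and the column order listed above can both be reproduced with itertools. This is just a sketch that regenerates the list shown in this post; it makes no claim about how Lep-MAP3 builds the columns internally.

```python
from itertools import combinations_with_replacement


def genotype_order(n_alleles=4):
    """Unordered allele pairs by index: 0 = REF, 1..3 = ALT1..ALT3."""
    return [f"{a}/{b}" for a, b in combinations_with_replacement(range(n_alleles), 2)]


order = genotype_order()
print(len(order))  # number of posterior columns
print(order)

# 1-based column positions used by a bi-allelic site
biallelic_columns = [order.index(g) + 1 for g in ("0/0", "0/1", "1/1")]
print(biallelic_columns)
```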

Step 2: Filtering2

“The Filtering2 module handles filtering of the data, i.e. filtering markers based on, e.g. high segregation distortion (dataTolerance) and excess number of missing genotypes (missingLimit). This module outputs the filtered data in the same format to be used with other modules and for further analysis (e.g. QTL mapping).”

Here segregation distortion refers to deviation from the expected Mendelian inheritance ratios in genetic crosses. Since I already did a Mendelian check (via bcftools +mendelian2) and excluded those SNPs (mode -E) before ParentCall2, in theory this step won’t filter out any SNPs.

Step 3: SeparateChromosomes2

The most crucial parameter in this step is lodLimit. This can be understood as how strongly two SNPs have to be linked to be considered part of the same linkage group (LG). The higher it is, the more LGs you will end up with.

Usually there will be a lot of very small LGs which are not relevant; at least they will not correspond to chromosomes. One can use sizeLimit to reset them to 0 (singletons or unassigned).

Testing for the optimal lodLimit can be time consuming. samplePairs=NUM helps by reducing the computing time to roughly 1/NUM. Once you’ve decided on the lodLimit, rerun without samplePairs, since it will be assigned as 0.

]]>
Genotype Likelihood http://fanhuan.github.io/en/2025/04/17/Genotype-Likelihood/ 2025-04-17T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/04/17/Genotype-Likelihood Yesterday when we were talking about sequencing depth we ran into a word: genotype likelihood. I know what is genotype; I also know what is likelihood; but what is genotype likelihood?

Before we start, let’s do a quick recap on the difference between probability and likelihood, since the two usually get mixed up in my head unless I really try to focus on their differences.

  • Probability = What is the chance of this outcome, given a model or known parameters? P(D∣θ)

  • Likelihood = How plausible is this model (or parameter value), given the data I observed? L(θ∣D)

Therefore probabilities are used in simulation from known models, and likelihood is used in inferring model parameters from observed data (what we are doing most of the time).

OK, so from our understanding of genotype and likelihood, genotype likelihood should be L(AA/Aa/aa | mapping results).

Or is it?

So, is it some sort of quality score, like the QUAL column in a vcf? That one is based on P(data∣no variant), put on the Phred scale: Phred Quality Score (Q) = −10 × log10(P)

So:

QUAL = 30 → 1 in 1000 chance the site is not a real variant

QUAL = 50 → 1 in 100,000 chance it's a false positive
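The conversion between a Phred-scaled score and the underlying probability is a one-liner in each direction. A minimal sketch (the function names are made up):

```python
import math


def phred_to_probability(q):
    """Invert Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def probability_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for qual in (30, 50):
    p = phred_to_probability(qual)
    print(f"QUAL = {qual} -> P = {p:g} (about 1 in {round(1 / p):,})")
```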

Reference chain: Kardos 2024 Molecular Ecology -> 2019 Bertrand MEE -> 2016 Vieira Bioinformatics -> 2009 The sequence alignment/map format and SAMtools

]]>
How Low is Low? http://fanhuan.github.io/en/2025/04/16/Sequencing-Coverage/ 2025-04-16T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/04/16/Sequencing-Coverage You have decided to do whole-genome sequencing (WGS) for your research project. You contacted your sequencing service provider. The first question you will get is: how much data do you need?

What they are actually asking is: what sequencing depth, sometimes referred to as coverage, are you expecting?

We all know that coverage limits the kind of analysis we could carry out. But how much coverage is enough?

In Hemstrom 2024, they tried to define what is Low-coverage WGS. They really tried; they put it into the glossary part:

Low-coverage whole-genome sequencing: 
Whole-genome sequencing (WGS)with small numbers of reads covering most genomic loci (low coverage);
the number of reads constituting low coverage varies widely depending on the discipline, 
methodology and research question. Low-coverage WGS often requires genotype likelihood-based methods.

OK. So what have we got from these sentences? That “the number constituting low coverage varies widely depending on the discipline, methodology and research question”. This means no matter which discipline, which methodology and what kind of research questions you have, you still do not know what is considered low-coverage! But once you’ve decided that your coverage is indeed low for your particular circumstance, you should use “genotype likelihood-based methods”.

Wow. Where do we start. Maybe let’s understand more about this “genotype likelihood-based methods” and it might help us understand when we need to use it and back calculate what is considered low-coverage. Here is a post on genotype likelihood if you are not sure what it is.

Here they cited an attack, sorry, no, a comment on a pretty famous paper on the inbreeding of North American wolves. In the wolf paper, the sequencing coverage is 7X. Wow OK that actually sounds low. Imagine you have a heterozygous site: you won’t have five reads to support either allele, and that is before accounting for PCR duplication, which can actually be very high (5% to 50% in my current dataset). OK I would say anything below 10X is a no-brainer low. Later I also discovered that this paper used RAD-seq. 7X coverage RAD-seq for 437 individuals (ok the sample size is pretty good). Man we need more funding for conservation.

OK back to the main topic. How low is considered low? The comment paper actually investigated on this matter and showed us some data.

img

This is the meat of the paper. Let’s take a look at some of the relevant subplots.

Figure 1c: This is saying that the probability of seeing both alleles at a heterozygous locus reaches almost 1 when the read depth is 10. However this assumes that sequence reads are independent (no PCR duplicates) and that each allele is equally likely to be sequenced. So 10 is the absolute lower threshold. You should at least do better than 10.
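Under the same two assumptions (independent reads, each allele sampled with probability 0.5), the curve in Figure 1c has a simple closed form: the chance of missing one allele entirely is 2 × 0.5^d at depth d. The sketch below is my own back-of-the-envelope version, not code from the paper, and it only asks for at least one read per allele, which is more lenient than what a genotype caller needs.

```python
def prob_both_alleles_seen(depth):
    """P(at least one read from each allele) at a heterozygous site.

    Assumes independent reads and a 50/50 chance of sampling either allele.
    """
    if depth < 2:
        return 0.0  # one read can only show one allele
    return 1 - 2 * 0.5 ** depth


for depth in (2, 5, 7, 10, 20):
    print(f"depth {depth:>2}: {prob_both_alleles_seen(depth):.4f}")
```

With PCR duplicates the effective number of independent reads is lower than the nominal depth, so real data will sit below these values.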

Figure 1d: F-ROH (ROH: runs of homozygosity), a finer way of estimating the inbreeding coefficient (F; see this paper on the New Zealand hihi, a friendly bird, for more details on ROH), stabilizes after the read depth reaches 5. You may say OK, this is no problem since the coverage is 7. No. The mean coverage is 7, meaning a lot of the loci might have <5 coverage.

Figure 1e: H-obs is the percentage of heterozygous sites observed, and it just kept on rising even after 20X.

Figure 1f: H-exp is the percentage of heterozygous sites calculated based on Hardy-Weinberg Equilibrium. It stabilizes after 10X. But as the authors pointed out, the pattern is clearer than in Figure 1e, since nobody with an H-exp higher than 0.22 had a read depth lower than 10X. This is to say the H-exp is capped by the read depth.

Figure 1g: Here missingness means missing genotype calls at a site for an individual. You can see that the trend only stabilizes once the read depth reaches 15.

OK, based on this one study, I will just say that 10X is the bare-minimum, and only >20X can be considered safe for a diploid genome.

Please take note of the ‘>’ before 20X. Let me emphasize. This is not the mean, but the min! If you tell your sequencing service provider that you want 20X, you might end up with lots of samples or loci under 20X, even under 10X. I took a brief look at the dataset that I am working on right now. There is indeed a strong correlation between the mean depth of the variants called and the mean depth of the sequencing effort (r close to 0.9). However the ratio between the two is between 0.5 and 0.75. That is to say, in the worst case only half of the reads were useful in calling the variants. To get 20X at the variants, that translates to 27X (ratio 0.75) to 40X (ratio 0.5) of sequencing effort. This ratio is negatively correlated with the duplication rate (r close to -0.8). Maybe you can go for 30X, and resequence the ones with low variant coverage later.
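The 27X and 40X figures come from dividing the depth you want at the variants by the usable ratio. A minimal sketch of that arithmetic (the function name is made up; the 0.5 to 0.75 range is what I observed in my own dataset and may not transfer to yours):

```python
import math


def required_sequencing_depth(target_variant_depth, usable_ratio):
    """Sequencing effort needed so that variant depth reaches the target.

    usable_ratio = mean depth at called variants / mean sequencing depth.
    Rounded up to a whole X.
    """
    return math.ceil(target_variant_depth / usable_ratio)


for ratio in (0.75, 0.5):
    print(f"ratio {ratio}: order {required_sequencing_depth(20, ratio)}X to get 20X at variants")
```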

Good luck to everyone on securing a bigger funding!

]]>
Candidate Genes, what's next? http://fanhuan.github.io/en/2025/04/15/Candidate-Gene/ 2025-04-15T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/04/15/Candidate-Gene You’ve done GWAS and there are some peaks, and some of them seem to lie within or next to some important genes. Now what do you do? How do you validate what you find?

]]>