A linkage map is usually required for a lot of quantitative genetic (QTL mapping) and population genetic analysis (gene flow). In this series of post I will be talking about how to build one based on whole genome sequencing data from a family design.

If you are dealing with millions of markers or even more, the only option that I found appropriate is Lep-MAP3. Btw please let me know if you have a better option as I am really not perfectly happy about it.

Sample size and pedigree structure

As for sample size, I will just give you a rule of thumb.

200: reasonable

100 - 200: questionable

< 100: unreliable.

Lep-MAP3 should allow grandparents and half-siblings besides full-sibling and parent-offsprings. It can also deal with selfings. You can include everyone in the same analysis to make your sample size larger.

Input prep

Only two files are required. One is the genotype likelihood, usually a vcf file, and one is a pedigree file.

The first one is pretty straight forward. Just make sure the vcf you provided is not just some genotype calls. If your vcf is recorded from a plink format file, you most-likely have lost the GL or PL info.

The pedigree file looks very confusing. But it is actually just a transpose of the .fam file in the plink format, plus two extra columns in the front as place holder for Chromosome aod positions in the later output. I wrote a python script to help you with the conversion. Note that only two parents are allow in one family, but there can be multiple families. Now let’s talk about some of the more complicated cases.

Multiple families sharing some family members.

In this case you need to list all relevent individuals in each family, under the same ID. For example, GP1 is one of the grandparents for Family_1 and Family_2, then in the pedigree file you will have two columns for GP1, one under Family_1 and one under Family_2. This is also how the program detects half-siblings if they see individuals appeared as parents in multiple families. ParentCall2 will actually prompt you to turn on the halfSibs=1 option in this case. Note that one need to turn on the grandparentPhase=1 in the OrderMarkers2 step when grandparents are identified.

What about selfing ones?

If there are selfing families, which is pretty common in plant breeding. This is what the author suggested on the wiki page of Lep-MAP3:

“As Lep-MAP3 assumes two parents for each family, a selfing crosses cannot be directly analysed with it. However, it is possible to add two dummy parents (one male and another female) to the pedigree.”

Note that you also need to add some dummy columns in your vcf files in this case.

The author also mentioned that “Data for the single parent is not really needed, but the grandparents (say two individuals from different lines crossed to form the parent) can be used.” I don’t know what he means here. Could be, the grandparents are the important info; or if you do not have the single parent info, just use the grandparents as their parents? I wanted to ask this in the forum but kept getting “Spambot protection engaged” from Sourceforge…

Also someone in the forum asked whether one can turn on both grandparentPhase=1 and selfingPhase=1 in the OrderMarkers2 step. The author says that the later is used when there is no grandparents and do not know how the program will function when both are turned on.

Overall pipeline

Step 1: ParentCall2

Recall what is PL: sample-level annotation calculated by HaplotypeCaller and GenotypeGVCFs, recorded in the sample-level columns of variant records in VCF files. This annotation represents the normalized Phred-scaled likelihoods of the genotypes considered in the variant record for each sample.

PL=−10∗logP(Genotype | Data).

P(G D) is calculated by P(D G)P(G)/P(D). See more details on this HaplotypeCaller page of gatk. -10log(P(G|D) will put PL into Pred score scale (Q=-10*logErrorRate). Then PL is normalize across all genotypes by subtracting the value of the lowest PL from all the values, then the PL value of the most likely genotype is 0.

e.g. this is the genotype and their PL value for three samples.

0/1:38,0,59 0/0:0,69,689 0/0:0,57,569

The order goes like: 0/0, 0/1 and 1/1. In the first sample, the most likely genotype is 0/1 (PL=0), and second likely is 0/0 (PL=38). The second and third sample both are called as 0/0, but we have more confidence in the second sample since the different between the most and second most likely genotype is larger (69 > 57).

Note that each variant for each sample will get a ten column string of posterio probabilities. Why ten? This is the number of combinations with replacement with 4 types of nucleotides. CR(n,r) = C(n+r-1, 2) = C(5,2) = 5 x 4 / 2 = 10. However it the 10-number posterior columns are not ordered lexicographically (AA, AC, AG, AT, CC, … TT) but fixed by genotype indices (like VCF’s GT field) rather than nucleotide combinations. It looks like this:

1.REF/REF (0/0)

2.REF/ALT1 (0/1)

3.REF/ALT2 (0/2)

4.REF/ALT3 (0/3)

5.ALT1/ALT1 (1/1)

6.ALT1/ALT2 (1/2)

7.ALT1/ALT3 (1/3)

8.ALT2/ALT2 (2/2)

9.ALT2/ALT3 (2/3)

10.ALT3/ALT3 (3/3)

However since I already prefitered my data so only bi-allelic variants are kept, you will only see the 1st, 2nd and 5th columns are used.

Step 2: Filtering2

“The Filtering2 module handles filtering of the data, i.e. filtering markers based on, e.g. high segregation distortion (dataTolerance) and excess number of missing genotypes (missingLimit). This module outputs the filtered data in the same format to be used with other modules and for further analysis (e.g. QTL mapping).”

Here segregation distortion refers to deviation from the expected Mendelian inheritance ratios in genetic crosses. Since I already did mendelian check (via bcftools +mendelian2) and excluded(mode -E) those SNPs before the ParentCall2 , in theory this step won’t be filtering out any SNPs.

Step 3: SeparateChromosomes2

The most crucial parameter in this step is lodLimit. This can be understand as how correlated two SNPs has to be to be considered from one linkage group (LG). The higher the more LG you will end up with.

Usually there will be a lot of very small LGs which are not relevant, at least they will not be corresponding to chromosomes. One can use sizeLimit to reset them to 0 (singletons or unassigned).

While you are testing the optimal lodLimit, it can be time consuming. samplePairs=NUM helps to reduce the computing time by 1/NUM times. Once you’ve decided on the lodLimit, then rerun without samplePairs, since it will assigned as 0.

Huan Fan / 2025-05-05
Published under (CC) BY-NC-SA in categories notes tagged with Stats