Huan Fan http://fanhuan.github.io 2025-03-27T04:05:11+00:00 huan.fan@wisc.edu HISAT2 Alignment Statistics http://fanhuan.github.io/en/2025/03/25/hisat2-stats/ 2025-03-25T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/03/25/hisat2-stats I’m trying to tidy the mapping statistics given by HISAT2 into a table. First let’s get on the same page with the statistics it spits out.

For example:

50964542 reads; of these:
  50964542 (100.00%) were paired; of these:
    2909022 (5.71%) aligned concordantly 0 times
    45584085 (89.44%) aligned concordantly exactly 1 time
    2471435 (4.85%) aligned concordantly >1 times
    ----
    2909022 pairs aligned concordantly 0 times; of these:
      205308 (7.06%) aligned discordantly 1 time
    ----
    2703714 pairs aligned 0 times concordantly or discordantly; of these:
      5407428 mates make up the pairs; of these:
        2967368 (54.88%) aligned 0 times   
        2131328 (39.41%) aligned exactly 1 time
        308732 (5.71%) aligned >1 times
97.09% overall alignment rate

Line by line:

50964542 reads; of these:                                                      # Total read pairs
  50964542 (100.00%) were paired; of these:                                    
    2909022 (5.71%) aligned concordantly 0 times                               # Unaligned Concordant Pairs
    45584085 (89.44%) aligned concordantly exactly 1 time                      # Concordant Unique Pairs
    2471435 (4.85%) aligned concordantly >1 times                              # Concordant Multi-mapped
    ----
    2909022 pairs aligned concordantly 0 times; of these:
      205308 (7.06%) aligned discordantly 1 time                               # Discordant Alignments
    ----
    2703714 pairs aligned 0 times concordantly or discordantly; of these:      # Unaligned Pairs Total
      5407428 mates make up the pairs; of these:
        2967368 (54.88%) aligned 0 times                                       # Single-end Unaligned
        2131328 (39.41%) aligned exactly 1 time                                # Single-end Unique
        308732 (5.71%) aligned >1 times                                        # Single-end Multi
97.09% overall alignment rate                                                  # Overall Alignment Rate

These stats tell us the underlying logic of HISAT2.

  1. Check whether reads are paired. In this case all of them are (100%)

  2. Try to align reads in pairs. This resulted in three groups:
    • Concordant Unique Pairs (45584085). This means both mates aligned uniquely in the correct orientation and at the expected distance. These are the high-quality alignments.
    • Concordant Multi-mapped (2471435). Both mates aligned to multiple locations in correct orientation/distance. Likely to repetitive regions.
    • The rest (2909022). For lack of a better term, let’s call this group Non-Concordant Pairs.
  3. Now let’s look at the different situations within the Non-Concordant Pairs.
    • Discordant alignment (205308): this means both mates aligned uniquely, but not concordantly (not in correct orientation or distance). These alignments are interesting since they could suggest potential structural variation or mis-assemblies.
    • Single-end (2703714 pairs, or 5407428 mates). For the remaining reads, HISAT2 failed to align them as pairs and now tries to salvage them individually as single-end reads. For these mates there are three possible outcomes: Single-end Unaligned (2967368), Single-end Unique (2131328) and Single-end Multi (308732). Among those, the only group that could be informative is the Single-end Unique.
  4. Now you might have guessed how the overall alignment rate is calculated: 1 - Single-end Unaligned / (Total read pairs * 2), since those are the only reads that failed to map anywhere in the reference. This could be due to divergence between the reference and your sample, or a reference that is incomplete for the purpose of short-read mapping.
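
A quick sanity check with the numbers above: 1 - 2967368 / (50964542 * 2) = 1 - 2967368 / 101929084 ≈ 1 - 0.0291 = 0.9709, which matches the reported 97.09%.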

I wrote a short Python script that takes the stderr of HISAT2 and tidies it up using the terminology defined in this post.
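
That script is not included here, but a minimal sketch along those lines might look like the following. It assumes the paired-end summary layout shown above (twelve leading counts plus the overall rate), reads the summary from stdin, and names the fields with the terminology defined in this post.

import re
import sys

# Order of the leading counts as they appear in a paired-end HISAT2 summary.
FIELDS = [
    "total_read_pairs",
    "paired",
    "unaligned_concordant_pairs",
    "concordant_unique_pairs",
    "concordant_multi_mapped",
    "non_concordant_pairs",
    "discordant_alignments",
    "unaligned_pairs_total",
    "mates_total",
    "single_end_unaligned",
    "single_end_unique",
    "single_end_multi",
]

def parse_hisat2_summary(text):
    """Map each field name above to its integer count, plus the overall rate."""
    counts = [int(c) for c in re.findall(r"^\s*(\d+) ", text, flags=re.M)]
    if len(counts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} counts, found {len(counts)}")
    stats = dict(zip(FIELDS, counts))
    rate = re.search(r"([\d.]+)% overall alignment rate", text)
    stats["overall_alignment_rate"] = float(rate.group(1)) if rate else None
    return stats

if __name__ == "__main__":
    # Usage: hisat2 ... 2> summary.txt; python tidy_hisat2.py < summary.txt
    for name, value in parse_hisat2_summary(sys.stdin.read()).items():
        print(f"{name}\t{value}")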

]]>
Probabilities and Log Likelihood http://fanhuan.github.io/en/2025/03/10/Probabilities-And-Log-Likelihood/ 2025-03-10T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/03/10/Probabilities-And-Log-Likelihood log(probability) = log likelihood, with range (-∞, 0]. exp(log_likelihood) = p, with range [0, 1].
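
For example, p = 0.25 gives log(0.25) ≈ -1.386, and exp(-1.386) ≈ 0.25 recovers the probability.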

]]>
Tensors, Tokens and Embeddings http://fanhuan.github.io/en/2025/02/28/Tensors-Tokens-Embeddings/ 2025-02-28T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/02/28/Tensors-Tokens-Embeddings Don’t let the jargon scare you away; that is something you need to remind yourself of constantly. The terms are there for lack of better words.

Tensors (张量)

Tensor is actually a pretty fundamental concept.

]]>
Scaling Law and Power Law http://fanhuan.github.io/en/2025/02/24/Scaling-Law-and-Power-Law/ 2025-02-24T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/02/24/Scaling-Law-and-Power-Law Scaling Law

Scaling laws in the context of natural language processing (NLP) and computer vision refer to the predictable relationships between the size of a model (e.g., number of parameters), the amount of training data, and the model’s performance (e.g., accuracy, loss, or other metrics). These laws describe how performance improves as you scale up key factors like model size, dataset size, and computational resources. But before we can understand their importance, we first need to understand power laws.

Power Law

A power-law relationship is a mathematical relationship between two quantities where one quantity varies as a power of the other. In other words, one quantity is proportional to the other raised to an exponent. Mathematically, it is expressed as:

[ y = k \cdot x^n ]

Where:

  • ( y ) is the dependent variable (e.g., model performance),
  • ( x ) is the independent variable (e.g., model size, dataset size, or compute),
  • ( k ) is a constant (proportionality factor),
  • ( n ) is the exponent (a constant that determines the shape of the relationship).

Key Characteristics of Power-Law Relationships:

  1. Non-linear: Unlike linear relationships (( y = mx + b )), power-law relationships are non-linear. This means that changes in ( x ) lead to disproportionate changes in ( y ).
  2. Scale-invariant: Power-law relationships appear the same at all scales. If you zoom in or out on the data, the relationship retains its shape.
  3. Heavy-tailed distribution: In many real-world systems, power-law relationships describe phenomena where small events are common, but large events are rare (e.g., word frequency in language, city sizes, or income distribution).

Examples of Power-Law Relationships:

  1. Natural Language Processing (NLP):
    • Model performance (e.g., perplexity or accuracy) often improves as a power-law function of model size, dataset size, or compute. For example: [ \text{Performance} \propto (\text{Model Size})^n ]
    • This means doubling the model size might lead to a less-than-doubling improvement in performance, depending on the exponent ( n ).
  2. Computer Vision:
    • Image recognition accuracy often scales as a power-law function of the number of training images or model parameters.
  3. Real-World Phenomena:
    • Zipf’s Law: In linguistics, the frequency of a word is inversely proportional to its rank in the frequency table (e.g., the most common word appears twice as often as the second most common word).
    • Pareto Principle (80/20 Rule): 80% of outcomes often come from 20% of causes (e.g., 80% of wealth is owned by 20% of the population).
    • Network Science: The distribution of connections in many networks (e.g., social networks, the internet) follows a power law, where a few nodes have many connections, and most nodes have few.

Visualizing a Power-Law Relationship:

When plotted on a log-log scale (where both axes are logarithmic), a power-law relationship appears as a straight line. This is because taking the logarithm of both sides of the equation ( y = k \cdot x^n ) gives:

[ \log(y) = \log(k) + n \cdot \log(x) ]

This is the equation of a straight line with slope ( n ) and intercept ( \log(k) ).
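
To make the log-log picture concrete, here is a minimal sketch (not from the original post) that simulates noisy power-law data and recovers k and n with a straight-line fit in log space; the values of k, n and the noise level are made up for illustration.

import numpy as np

# Simulate y = k * x^n with multiplicative noise, then fit a line in log-log space.
rng = np.random.default_rng(0)
k_true, n_true = 3.0, 0.5
x = np.logspace(0, 4, 50)                      # x from 1 to 10,000
y = k_true * x**n_true * rng.lognormal(0.0, 0.05, size=x.size)

# log(y) = log(k) + n * log(x): the slope is n, the intercept is log(k).
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print("estimated n:", slope)                   # close to 0.5
print("estimated k:", np.exp(intercept))       # close to 3.0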


Why Power-Law Relationships Matter in AI:

  1. Predictability: Power-law relationships allow researchers to predict how performance will improve as they scale up resources (e.g., model size, data, compute).
  2. Optimization: Understanding power-law scaling helps allocate resources efficiently. For example, if performance improves slowly with larger models, it might be better to invest in more data or better algorithms.
  3. Benchmarking: Power-law relationships provide a framework for comparing different models and architectures.

Example in AI Scaling:

In OpenAI’s research on scaling laws for language models, they found that: [ \text{Test Loss} \propto (\text{Model Size})^{-\alpha} \cdot (\text{Dataset Size})^{-\beta} \cdot (\text{Compute})^{-\gamma} ] Here, ( \alpha ), ( \beta ), and ( \gamma ) are exponents that describe how performance improves with scaling.


In summary, a power-law relationship describes how one quantity changes as a power of another. It is a fundamental concept in AI scaling, as well as in many natural and social phenomena.

]]>
Foundation Model http://fanhuan.github.io/en/2025/02/24/Foundation-Model/ 2025-02-24T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/02/24/Foundation-Model I was reading the Evo paper, and it referred to the Evo model as a foundation model (基础模型). I had to look up what that means.

In the context of this paper, a foundation model refers to a large, general-purpose machine learning model that is trained on vast amounts of data and can be adapted (or fine-tuned) for a wide range of downstream tasks. Foundation models are designed to capture broad patterns and relationships in the data, making them highly versatile and powerful tools for various applications.


Key Characteristics of Foundation Models:

  1. Large-Scale Training: Foundation models are trained on massive datasets, often using unsupervised or self-supervised learning techniques.
  2. General-Purpose: They are not task-specific but are designed to learn general representations of the data (e.g., language, images, or biological sequences).
  3. Transfer Learning: Once trained, foundation models can be fine-tuned or adapted to specific tasks with relatively little additional data.
  4. Versatility: They can be applied across multiple domains and tasks, often outperforming specialized models.

Examples of Foundation Models:

  1. Natural Language Processing (NLP):
    • GPT (Generative Pre-trained Transformer):
      • Developed by OpenAI, GPT models (e.g., GPT-3, GPT-4) are trained on vast amounts of text data and can perform tasks like text generation, translation, summarization, and question answering.
    • BERT (Bidirectional Encoder Representations from Transformers):
      • Developed by Google, BERT is trained to understand the context of words in a sentence and is used for tasks like sentiment analysis, named entity recognition, and question answering.
    • T5 (Text-To-Text Transfer Transformer):
      • Developed by Google, T5 treats all NLP tasks as a text-to-text problem, making it highly flexible for tasks like translation, summarization, and classification.
  2. Computer Vision:
    • CLIP (Contrastive Language–Image Pretraining):
      • Developed by OpenAI, CLIP connects images and text, enabling tasks like zero-shot image classification and image-text retrieval.
    • DALL·E:
      • Also developed by OpenAI, DALL·E generates images from textual descriptions, demonstrating the ability to combine vision and language understanding.
  3. Biology and Bioinformatics:
    • Protein Models:
      • AlphaFold: Developed by DeepMind, AlphaFold predicts protein structures from amino acid sequences, revolutionizing structural biology.
      • ESM (Evolutionary Scale Modeling): Developed by Meta AI, ESM models are trained on protein sequences to predict structure, function, and evolutionary relationships.
    • DNA Models:
      • DNABERT
      • NT (Nucleotide Transformer)
      • Evo: Evo is a foundation model designed to capture the multimodality of the central dogma (DNA → RNA → protein) and the multiscale nature of evolution. It can likely be applied to tasks like gene function prediction, protein design, and evolutionary analysis. Evo 2 was just released, and eukaryotic genomes are included in the training this time.
  4. Multimodal Models:
    • Flamingo:
      • Developed by DeepMind, Flamingo combines vision and language understanding, enabling tasks like image captioning and visual question answering.
    • Gato:
      • Also developed by DeepMind, Gato is a general-purpose model capable of performing tasks across multiple domains, including text, images, and robotics.

Why Foundation Models Are Important:

  1. Efficiency: Instead of training a new model from scratch for every task, foundation models can be fine-tuned with minimal additional data and computation.
  2. Performance: Foundation models often achieve state-of-the-art performance on a wide range of tasks due to their large-scale training and generalization capabilities.
  3. Innovation: They enable new applications and discoveries by providing a powerful base for further research and development.

Evo as a Foundation Model:

In the case of Evo, it is designed to capture two key aspects of biology:

  1. Multimodality of the Central Dogma: Evo can handle the flow of genetic information from DNA to RNA to proteins, integrating multiple biological modalities.
  2. Multiscale Nature of Evolution: Evo can analyze evolutionary patterns at different scales, from the molecular scale to the systems scale (interactions between different modalities of molecules) and entire genomes (see their figure from the paper below).

[Figure from the Evo paper: the multiscale nature of evolution, from molecules to systems to genomes]

As a foundation model, Evo can be fine-tuned for various biological tasks, such as predicting gene functions, designing proteins, or analyzing evolutionary relationships, making it a versatile tool for computational biology.

Update from Evo2

  1. Evo is trained on prokaryotic and phage genomes. Evo 2 is trained on “a highly curated genomic atlas spanning all domains of life”.

  2. Pretraining data set: 300 billion nt (from 2.7 million genomes) vs. 9.3 (abstract) or 8.84 (methods) trillion nt. OpenGenome2 (the one Evo 2 was trained on) included a 33% expansion of representative prokaryotic genomes from 85,205 to 113,379 (357 billion nucleotides), a total of 6.98 trillion nucleotides from eukaryotic genomes, 854 billion nucleotides of non-redundant metagenomic sequencing data, 2.82 billion nucleotides of organelle genomes, and 602 billion nucleotides of subsets of eukaryotic sequence data chosen to focus on likely functional regions of the genomes by taking different windows around coding genes. This means eukaryotic genomes take up about 80% of the pretraining dataset.

Let’s take a closer look at how the eukaryotic genomes were chosen:


  1. Model parameter size: 7B (Evo) vs. 7B and 40B (Evo 2). The difference between the two Evo 2 models is that the 7B version was trained on 2.4 trillion tokens while the full 40B version was trained on 9.3 trillion tokens. Note that GPT-3 has 175B parameters.
  2. Token context window: 131 kb (Evo) vs. 1 Mb (Evo 2), both at single-nucleotide resolution. This number is 2,048 tokens for GPT-3.
  3. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation.
]]>
Coalescence Theory http://fanhuan.github.io/en/2025/02/20/Coalescence-Theory/ 2025-02-20T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/02/20/Coalescence-Theory When I was doing some reading on IBD, I ran into this passage in Powell 2010.

“Traditionally, the probability that two alleles are IBD was most often calculated from a known pedigree and so the individuals at the top of the pedigree (the founders) form a natural base population. However, it is becoming common to use data on genetic markers such as SNPs to estimate the probability of being IBD without reference to a known pedigree and, in this case, there is no obvious base population. Moreover, the concept of IBD seems to conflict with the well-established coalescence theory in which all alleles are descended from a common ancestor but at different times in the past.”

Then I was a bit lost. Why are they in conflict? I asked ChatGPT and this is the answer:

The seeming conflict comes from how “identity by descent” (IBD) and coalescence theory define “common ancestry” over different time scales and with different reference points.

IBD in Pedigree Context

  • IBD Definition:
    In a pedigree, IBD typically means that two alleles are identical because they were inherited from the same recent common ancestor.
  • Base Population:
    When you work with a known pedigree, you often define a set of founders as your “base population.” Alleles are considered IBD if they are identical by descent from those founders.

Coalescence Theory

  • Coalescence Concept:
    Coalescence theory tells us that if you trace the ancestry of any pair of alleles far enough back in time, they will eventually converge on a single common ancestral allele.
  • Universal Common Ancestor:
    This means that, theoretically, all alleles in a population are IBD if you go far enough back—because they all share an ancestor (for example, the most recent common ancestor of the species).

The Conflict

  • Time Scale and Practical Use:
    • IBD: In practice, IBD is used to assess recent shared ancestry. It’s meaningful when you have a defined base population or cutoff time, so you can say, “These two alleles come from the same recent ancestor.”
    • Coalescence Theory: Coalescence does not have a built-in cutoff; it implies that, given infinite time, all alleles are IBD. This makes the concept trivial if you ignore time scales.
  • No Natural Base Population:
    When using genetic markers like SNPs without a known pedigree, there’s no obvious set of founders. In coalescence theory, every allele is descended from a common ancestor if you go back far enough. This contrasts with the practical use of IBD, where we’re interested in whether two alleles share a recent common ancestor, not just any ancestor in the distant past.

In Summary

  • IBD (Traditional/Practical Use):
    It’s about recent common ancestry relative to a defined base population. Two alleles are considered IBD if they are inherited from the same recent ancestor.

  • Coalescence Theory:
    It shows that all alleles coalesce into a single ancestral allele over time, meaning that if you extend the time horizon indefinitely, every pair of alleles is IBD.

Thus, the conflict arises because coalescence theory makes the notion of IBD universal (all alleles are IBD eventually), while in practice, we want to quantify recent relatedness using IBD. This is why when using genetic markers without a known pedigree, researchers must define an arbitrary or biologically meaningful base population or time threshold for IBD to be useful.

Then the article went on about how this conflict is dealt with:

“In practice, this conflict has been ignored by using IBD concepts for recent common ancestors and coalescence analysis for distant common ancestors; however, the two categories of ancestor merge, especially when using dense SNP or DNA sequence data.”

Sound familiar? This is exactly the problem we discussed in the quantitative genetics vs. population genetics post. In quantitative genetics, we usually work with populations that have recent common ancestors, sometimes even with a clear pedigree, while in population genetics we usually have little to no knowledge of the pedigree and rely solely on molecular markers. Just as this sentence describes, now that we have whole-genome information on various types of populations combined, the approaches that used to be applied separately need to be unified.

So, is coalescence theory still relevant?

]]>
IBD and IBS http://fanhuan.github.io/en/2025/02/19/IBD-And-IBS/ 2025-02-19T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/02/19/IBD-And-IBS IBD at allele level

Identity-by-descent, also known as identical-by-descent, is used to describe two homologous alleles that have descended from a common ancestor. Homologous here means that they are the same (identical). You can compare two alleles from different individuals or from the same diploid individual.

IBS

Identity-by-state, also known as identical-by-state, is a relatively simple concept. It just means that two alleles are homologous, regardless of whether they are IBD. This sounds familiar, right? The relationship between IBD and IBS is like the one between orthologs and homologs. IBS is what we see in the current dataset.

Here we borrow an illustration from [Powell 2010](https://www.nature.com/articles/nrg2865) to demonstrate the difference between IBD and IBS.

[Figure from Powell 2010: alleles that are IBS vs. alleles that are also IBD]

So in this figure, as long as the letter is the same, they are IBS, so all the Gs and all the Ts are IBS respectively. However, you also need to have the same background color to be IBD. For example, C1 and C2 are IBD, B3 and B4 are not IBD, C4 and C5 are not IBD either.

prob(IBD) = F = Inbreeding coefficient

The probability of IBD between two alleles is denoted as F.

How do we calculate F for a specific locus? It can be achieved by comparing the observed heterozygosity rate with the one expected under Hardy-Weinberg Equilibrium. We already know that inbreeding, defined as mating between individuals sharing a common parent in their ancestry, is one of the reasons for deviation from HWE. If there is only random mating (one of the assumptions of HWE), meaning no mating between individuals sharing common ancestry, i.e. no inbreeding, there would be no IBD (F=0), and we all know that the genotype frequencies would look like: GG = p^2, GT = 2pq, TT = q^2. In this case, the homologous alleles (GG or TT) are actually from different ancestors; they just happen to be the same. OK, now you can see how this might contradict the coalescence theory, where there is only one common ancestor (they did not just HAPPEN to be the same!), but that is covered in the other post.

Now let’s add inbreeding into the picture. Because of inbreeding, some of the homologous alleles are actually from the same ancestor (IBD), and we already denoted this probability as F. For this portion (F), since they are IBD, they can only be GG or TT, so their relative frequencies for GG/GT/TT are p/0/q. For the rest (1-F), which is still under HWE, the relative frequencies for GG/GT/TT are still p^2/2pq/q^2. So in the current population, the genotype frequencies would be:

  • GG: F * p + (1-F) * p^2 = p^2 + pqF
  • GT: (1-F) * 2pq = 2pq - 2pqF
  • TT: F * q + (1-F) * q^2 = q^2 + pqF

Notice how pqF is effectively “taken” from the heterozygotes and added to each homozygote class.

If the two alleles are in the same diploid individual then F is also called the inbreeding coefficient of the individual at this locus.

If the two alleles are in different individuals, F can be used to calculate the numerator relationship between them, as in an A matrix. The co-ancestry of two diploid individuals is the average of the four F values from the 4 possible comparisons between each pair of alleles (2 choose 1 * 2 choose 1); their numerator relationship, i.e. the off-diagonal entry of their A matrix, is twice their co-ancestry. Note that this is slightly different from what we did in the A matrix post, where it was defined as a_{ind1,ind2} = 0.5 * (a_{ind1,sire2} + a_{ind1,dam2}), i.e. the numerator relationship is the average of ind1’s relationships with ind2’s parents. Using the same rationale, the numerator relationship of a diploid individual with itself (the diagonal of the A matrix) would be (1 + 1 + F + F)/4 * 2 = 1 + F, given that the probability of IBD of an allele with itself is 1.

In Speed and Balding 2015, IBD is defined as the “phenomenon whereby two individuals share a genomic region as a result of inheritance from a recent common ancestor, where ‘recent’ can mean from an ancestor in a given pedigree, or with no intervening mutation event or with no intervening recombination event.”

This is tightly linked to their point that “Traditional measures of relatedness, which are based on probabilities of IBD from common ancestors within a pedigree, depend on the choice of pedigree”. If the pedigree is known, the expected IBD is given by the A matrix.

IBD at segment level

“In this definition of ‘chromosome segment IBD’ there is no need for a base population.”

However, when the pedigree is unknown, IBD relationships can only be inferred from the population data, and unfortunately there is no consistent definition of IBD probabilities without a pedigree.

In another review paper, Powell 2010 defined it as “alleles that are descended from a common ancestor in a base population”. You can see the two definitions are slightly different. The former uses “genomic region” as the unit whereas the latter uses “alleles”. Alleles are versions of genes, whereas a “genomic region” can be non-genic and of any length, so the former is more generic. Also, the latter emphasizes the concept of a “base population”. The probability of IBD is sometimes referred to as F, and it “has to be defined with respect to a base (reference) population; that is, the two alleles are descended from the same ancestral allele in the base population.” Why so? As you can imagine, if an allele is very rare in the base population, then two individuals sharing it are very likely to be IBD. On the contrary, if an allele is very common in the base population, two individuals having the same allele could simply be due to chance. See another post on how to determine the base population.

“If the two alleles are in the same diploid individual then F is the inbreeding coefficient of the individual at this locus.” See more on how IC is calculated in this post.

IBS is usually used to calculate the G matrix when the pedigree is unknown. As you can see, this can lead to erroneous inference because a consistent base population is not used. Note that this relationship is usually considered within the same generation, not across generations. Another thing to note in this figure is that the base population used for the estimation of IBD coefficients should be B1, B2, B3 and B4, not the current C1 to C5. This is why you need to specify the founders or any known pedigree info in the .fam file. I wonder whether gcta takes this info? I tried but it does not :(

What plink offers

PLINK provides tools to calculate genetic similarity between individuals using IBS and Hamming distance. IBS measures the proportion of alleles shared between two individuals across all markers. It ranges from 0 (no alleles shared) to 1 (all alleles shared). Hamming distance measures the mismatches between two individuals, therefore they are inversely related, and it is specified as [‘1-ibs’]. You can choose based on whether you’d like a similarity matrix (ibs) or a distance matrix (1-ibs).

There is an option called flat-missing. The manual reads:

“Missingness correction When missing calls are present, PLINK 1.9 defaults to dividing each observed genomic distance by (1-<sum of missing variants’ average contribution to distance>). If MAF is nearly independent of missingness, this treatment is more accurate than the usual flat (1-<missing call frequency>) denominator. However, if independence is a poor assumption, you can use the 'flat-missing' modifier to force PLINK 1.9 to apply the flat missingness correction."

But how do I know if MAF is dependent on missingness or not in my data? In this case you can investigate their relationship in your own data by generating those two stats.

plink --bfile data --missing --out stats
plink --bfile data --freq --out stats

Then you can calculate the correlation between the F_MISS column in the .lmiss file and the MAF column in the .frq file. If it is significantly greater than 0, there might be a correlation. In my case it is almost 0.15, so I should turn on the flat-missing option. Then the cmd looks like:

plink --bfile plink_data --distance ibs flat-missing --out ibs_distance
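
The correlation check described above can be scripted. Here is a minimal pandas sketch, assuming the default PLINK 1.9 report columns (SNP and F_MISS in .lmiss, SNP and MAF in .frq) and the stats prefix used above:

import pandas as pd

# Per-variant missingness and allele-frequency reports from --missing and --freq.
lmiss = pd.read_csv("stats.lmiss", sep=r"\s+")
frq = pd.read_csv("stats.frq", sep=r"\s+")

# Join on the variant ID and correlate missingness with minor allele frequency.
merged = lmiss.merge(frq, on="SNP")
r = merged["F_MISS"].corr(merged["MAF"])
print(f"Pearson correlation between F_MISS and MAF: {r:.3f}")
# A correlation clearly above 0 argues for the flat-missing modifier.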

These metrics are useful for understanding relatedness, population structure, and data quality.

from HWE to F

Equation 1 shows how to adjust the usual Hardy–Weinberg genotype frequencies to account for inbreeding. Suppose you have two alleles, G and T, with frequencies (q) and (p = 1 - q) in the base population. Then the genotype probabilities in a population with inbreeding coefficient (F) become:

[ \begin{aligned} P(\text{GG}) &= q^2 + pqF,
P(\text{GT}) &= 2pq(1 - F),
P(\text{TT}) &= p^2 + pqF. \end{aligned} ]

Below is a step-by-step explanation of why this formula makes sense.


1. No Inbreeding ((F = 0)): Hardy–Weinberg

Recall that in a random-mating (non-inbred) population under Hardy–Weinberg equilibrium, the genotype frequencies are:

[ P(\text{GG}) = q^2, \quad P(\text{GT}) = 2pq, \quad P(\text{TT}) = p^2. ]

If we set (F = 0) in Equation 1, it simplifies back to the usual (q^2,\, 2pq,\, p^2). This confirms that the formula is consistent with standard Hardy–Weinberg when there is no inbreeding.


2. Complete Inbreeding ((F = 1))

If (F = 1), every individual is completely autozygous (homozygous by descent). Then the probability of being homozygous for G is (q), and for T is (p). The formula becomes:

[ P(\text{GG}) = q, \quad P(\text{GT}) = 0, \quad P(\text{TT}) = p, ]

meaning no heterozygotes at all (all individuals are homozygous). This matches the extreme case of complete inbreeding.


3. Partial Inbreeding ((0 < F < 1))

When (F) is between 0 and 1, a fraction (F) of the population is “autozygous” (forced to be homozygous), while the remaining fraction ((1-F)) follows the usual Hardy–Weinberg proportions. Algebraically, you can think of it as:

  1. With probability ((1-F)), an individual has genotype frequencies (q^2 : 2pq : p^2).
  2. With probability (F), the individual is homozygous, and the chance of G vs. T is (q) vs. (p).

Putting these together:

  • GG: ((1-F)\,q^2 + F\,q = q^2 + pqF)
  • GT: ((1-F)\,(2pq) = 2pq(1-F))
  • TT: ((1-F)\,p^2 + F\,p = p^2 + pqF)

Notice how (pqF) is effectively “taken” from the heterozygotes and added to each homozygote class.
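
As a quick numeric check of Equation 1 (a small sketch, not part of the original derivation), the values of q and F below are arbitrary:

# Genotype frequencies under inbreeding, with allele G at frequency q and T at p = 1 - q.
def genotype_freqs(q, F):
    p = 1.0 - q
    return (q**2 + p*q*F, 2*p*q*(1 - F), p**2 + p*q*F)   # (GG, GT, TT)

q, F = 0.3, 0.25
gg, gt, tt = genotype_freqs(q, F)
assert abs(gg + gt + tt - 1.0) < 1e-12        # frequencies always sum to 1
gg0, gt0, tt0 = genotype_freqs(q, 0.0)        # F = 0 recovers Hardy-Weinberg
assert abs(gg0 - q**2) < 1e-12 and abs(gt0 - 2*q*(1-q)) < 1e-12
print(genotype_freqs(q, 1.0))                 # F = 1 gives (q, 0.0, p): no heterozygotes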


Why This Matters

  • Increased Homozygosity:
    As (F) increases from 0 to 1, you see fewer heterozygotes ( (2pq(1-F))) and more homozygotes ( (q^2 + pqF \text{ or } p^2 + pqF)).
  • Interpreting (F):
    The inbreeding coefficient (F) quantifies the probability that the two alleles an individual carries are identical by descent. Higher (F) means a greater chance of inheriting the same ancestral allele on both chromosomes, hence more homozygosity.

In Summary

Equation 1: [ \text{GG: } q^2 + pqF, \quad \text{GT: } 2pq(1-F), \quad \text{TT: } p^2 + pqF ] is the standard way to incorporate inbreeding into genotype frequencies. It smoothly transitions between:

  • Hardy–Weinberg proportions when (F = 0).
  • Complete homozygosity when (F = 1).
  • Intermediate levels of homozygosity for (0 < F < 1).

Thus, the formula neatly captures how inbreeding inflates homozygote frequencies and deflates heterozygotes relative to the baseline Hardy–Weinberg expectation.

]]>
Base Population and Why It Matters http://fanhuan.github.io/en/2025/02/19/Base-Population/ 2025-02-19T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/02/19/Base-Population In the .fam file prepared for plink , there are two columns for you to specify one’s father (PID) and mother (MID) in this dataset, 0 if unknown. Those with both PID and MID as 0 are considered as founders. Note that “By default, if parental IDs are provided for a sample, they are not treated as a founder even if neither parent is in the dataset.” In that case you need to manually make them founders via --make-founders.

Why do we need founders? Because only they are included in some calculations such as minor allele frequencies/counts or Hardy-Weinberg equilibrium tests, both related to the concept of base population.

Traditionally, the probability that two alleles are IBD was most often calculated from a known pedigree and so the individuals at the top of the pedigree (the founders) form a natural base population, where the founders themselves are unrelated.

The probability that two alleles are IBD has to be defined with respect to a base (reference) population; that is, the two alleles are descended from the same ancestral allele in the base population.

The point of coalescence is the most recent common ancestor. The alleles there are in the ancestral state.

]]>
BLUP http://fanhuan.github.io/en/2025/02/19/BLUP/ 2025-02-19T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/02/19/BLUP This post is mainly based on: https://rpubs.com/amputz/BLUP

This is the equation that we need to understand: Y = Xβ + Zu + ε. An alternative notation is y = Xb + Za + e.

Now let’s expand it into matrix form.

  • Y: Response vector (n×1).
  • X: Fixed-effects design matrix (n×p), where p is the number of levels of the fixed effect. For example, p would be 2 if X is sex (female and male). How would this work if X is continuous?
  • β: Fixed-effects coefficients (p×1). var(β) = 0 because it is fixed; the coefficients are constant for each level.
  • Z: Random-effects design matrix (n×q), where q is the dimension of the A/G matrix (q×q). A design matrix holds yes/no (0/1) indicators for particular levels/individuals.
  • u: Random effects (q×1), with u ∼ N(0, G). Note that q can be greater than n, and this is why we can predict for more individuals than we have records of. G is the variance-covariance (vcv) matrix of the random effects.
  • ε: Residual errors (n×1), with ε ∼ N(0, R). R is the vcv matrix of the residuals (recall GLS models).

In sum, var(y) = ZGZ’ + R (Z is just an incidence matrix of 0s and 1s, so ZGZ’ spreads the elements of G onto the observations).

Now we need to solve for β and u.

Recall how we used ordinary least squares (OLS) for fixed effects models. In OLS, we solve (X’X)β = X’y to estimate β. For mixed models, the equations are extended to include the random effects and their variances. How do we estimate β and u at the same time?

Henderson’s Mixed Model Equations

Henderson’s equations combine both fixed and random effects into a single set of equations. If you are familiar with ordinary least squares (OLS) for fixed-effects models, comparing the two is helpful.

The joint distribution of Y and u would then be multivariate normal. To find the estimates of β and u, we can maximize the joint likelihood, which leads to solving the equations derived from setting the derivatives of the log-likelihood with respect to β and u to zero.

Alternatively, since maximizing the likelihood directly can be complicated due to the random effects, Henderson proposed using Best Linear Unbiased Prediction (BLUP) for the random effects. The BLUP approach combines the information from the fixed and random parts.

So, the Mixed Model Equations (MME) are typically written in matrix form as:

[ X’R⁻¹X X’R⁻¹Z ] [ β ] = [ X’R⁻¹Y ]

[ Z’R⁻¹X Z’R⁻¹Z + G⁻¹ ] [ u ] [ Z’R⁻¹Y ]

This looks like a block matrix system. The left-hand side matrix has blocks X’R⁻¹X, X’R⁻¹Z, Z’R⁻¹X, and Z’R⁻¹Z + G⁻¹. The right-hand side vectors are X’R⁻¹Y and Z’R⁻¹Y.

But why does the lower right block include G⁻¹? Because the random effects u have their own variance-covariance structure G, so including G⁻¹ adds a penalty term to the random effects, shrinking them towards zero depending on the variance components. This is similar to ridge regression (L2 regularization, lasso is L1), where adding a penalty term stabilizes the estimates.

Wait, if G is the variance of u, then G⁻¹ would be the precision matrix. So, in the MME, we’re essentially combining the information from the data (through X and Z) and the prior information (through G⁻¹) in a Bayesian-like manner, although Henderson’s derivation is frequentist.

Let me check the dimensions. Suppose X is n x p (n observations, p fixed effects), Z is n x q (q random effects), then β is p x 1 and u is q x 1. The matrix on the left would be (p+q) x (p+q), and the right-hand side vectors are (p+q) x 1. That makes sense.

Solving this system gives the estimates for β and u. The solution can be written using block matrix inversion, but it’s often computationally intensive, especially for large datasets. Modern software uses iterative methods or restricted maximum likelihood (REML) to estimate the variance components (G and R) first and then solve the MME.

But how do we actually use these equations? Let’s consider a simple example. Suppose we have a model with one fixed effect (intercept) and one random effect (e.g., subjects). Then X would be a column of ones, Z would be a matrix indicating the subject for each observation, G is the variance of the random intercepts, and R is the residual variance times the identity matrix.

In this case, the MME would involve inverting matrices that combine the fixed and random effects. The solution would give the fixed intercept and the best linear unbiased predictions for each subject’s random intercept.

But I’m still a bit fuzzy on the derivation. Let me try to sketch it out. Starting from the model Y = Xβ + Zu + ε, with u ~ N(0, G) and ε ~ N(0, R). The goal is to find β and u that maximize the joint density of Y and u. The joint density is proportional to exp(-0.5[(Y - Xβ - Zu)’ R⁻¹ (Y - Xβ - Zu) + u’ G⁻¹ u]).

Taking derivatives with respect to β and u and setting them to zero gives the estimating equations. The derivative with respect to β would be X’ R⁻¹ (Y - Xβ - Zu) = 0, and the derivative with respect to u would be Z’ R⁻¹ (Y - Xβ - Zu) - G⁻¹ u = 0. Rearranging these gives the MME:

X’ R⁻¹ X β + X’ R⁻¹ Z u = X’ R⁻¹ Y

Z’ R⁻¹ X β + (Z’ R⁻¹ Z + G⁻¹) u = Z’ R⁻¹ Y

Which is exactly the block matrix system mentioned earlier. So, the MME are derived from setting the gradients of the joint log-likelihood (or equivalently, the penalized least squares) to zero.

Understanding this, the matrix form is a way to compactly represent these two equations. The inclusion of G⁻¹ in the lower right block accounts for the penalty term from the random effects’ prior distribution. This ensures that the estimates of u are shrunk towards zero depending on their variance, preventing overfitting.

Another key point is that if G is diagonal (i.e., random effects are independent), then G⁻¹ is also diagonal, simplifying computations. But in cases where there’s a more complex covariance structure (e.g., spatial or temporal correlations), G would have off-diagonal elements, making the inversion more challenging.

In practice, when the variance components (G and R) are unknown, they need to be estimated first, often using methods like REML or ML, and then plugged into the MME to solve for β and u. This iterative process is why mixed models can be computationally intensive.

So, summarizing my understanding: The matrix form of the Mixed Model Equations is a system that combines the fixed and random effects into a single equation. The structure of the matrix reflects the contributions of both the fixed effects design matrix (X) and the random effects design matrix (Z), along with the variance components (G and R). Solving this system provides estimates for β and predictions for u, taking into account the variability in both the fixed and random parts of the model.

I think I need to solidify this by going through a small example or maybe looking at how the matrices are structured in a simple case. Let’s say we have three observations from two subjects. Suppose the fixed effect is a single intercept, and the random effect is the subject-specific intercept. Then:

  • X would be a 3x1 matrix of ones.

  • Z would be a 3x2 matrix where each row has a 1 in the column corresponding to the subject.

  • G is a 2x2 matrix with the variance of the subject intercepts on the diagonal.

  • R is σ²I₃, the residual variance times the identity matrix.

Plugging these into the MME:

[ X’R⁻¹X X’R⁻¹Z ] [ β ] = [ X’R⁻¹Y ]

[ Z’R⁻¹X Z’R⁻¹Z + G⁻¹ ] [ u ] [ Z’R⁻¹Y ]

Calculating each block:

X’R⁻¹X would be (1/σ²) * 3 (since X is a column of ones, and X’X is 3).

X’R⁻¹Z would be (1/σ²) * [2 1] if the first subject has two observations and the second has one.

Z’R⁻¹X would be the transpose of X’R⁻¹Z, so (1/σ²) * [2; 1].

Z’R⁻¹Z would be (1/σ²) * diag([2, 1]) since each subject’s column in Z has 2 and 1 ones respectively.

Adding G⁻¹ to Z’R⁻¹Z gives diag([2/σ² + 1/g, 1/σ² + 1/g]), assuming G is diag(g, g).

So the full left-hand side matrix would be:

[ 3/σ² 2/σ² 1/σ² ]

[ 2/σ² (2/σ² + 1/g) 0 ]

[ 1/σ² 0 (1/σ² + 1/g) ]

Wait, but the block structure is:

[ X’R⁻¹X X’R⁻¹Z ]

[ Z’R⁻¹X Z’R⁻¹Z + G⁻¹ ]

So if X is 3x1 and Z is 3x2, then X’R⁻¹X is 1x1, X’R⁻¹Z is 1x2, Z’R⁻¹X is 2x1, and Z’R⁻¹Z + G⁻¹ is 2x2. So the full matrix is 3x3. Let me double-check the blocks in my example above against this structure.

X’R⁻¹X is (1/σ²) * 3 (since X is a column of ones, X’X is 3).

X’R⁻¹Z is (1/σ²) * [2, 1], since the first subject has two observations (sum of Z’s first column) and the second subject has one (sum of Z’s second column).

Z’R⁻¹X is the transpose of X’R⁻¹Z, which is [2/σ²; 1/σ²].

Z’R⁻¹Z is (1/σ²) * [[2, 0], [0, 1]], since each subject’s column has 2 and 1 ones respectively.

G⁻¹ is (1/g) * I₂, since G is diag(g, g).

So Z’R⁻¹Z + G⁻¹ becomes [[2/σ² + 1/g, 0], [0, 1/σ² + 1/g]].

Therefore, the full left-hand side matrix is:

[ 3/σ² 2/σ² 1/σ² ]

[ 2/σ² 2/σ² + 1/g 0 ]

[ 1/σ² 0 1/σ² + 1/g ]

But actually, the block structure is:

Top-left: 1x1 (3/σ²)

Top-right: 1x2 [2/σ², 1/σ²]

Bottom-left: 2x1 [2/σ²; 1/σ²]

Bottom-right: 2x2 diag(2/σ² + 1/g, 1/σ² + 1/g)

And the right-hand side vector is:

[ X’R⁻¹Y ] = (1/σ²) * sum(Y)

[ Z’R⁻¹Y ] = (1/σ²) * [sum(Y for subject 1), sum(Y for subject 2)]

Solving this system would give the estimate for β (the fixed intercept) and the predictions for u₁ and u₂ (the random intercepts for each subject).
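
To make this concrete, here is a minimal NumPy sketch of the toy example (one fixed intercept, two subjects with random intercepts); the phenotype values, σ² and g are made up for illustration:

import numpy as np

# 3 observations: the first two from subject 1, the third from subject 2.
y = np.array([10.0, 12.0, 9.0])
X = np.ones((3, 1))                      # fixed intercept
Z = np.array([[1, 0],
              [1, 0],
              [0, 1]], dtype=float)      # subject incidence matrix
sigma2, g = 1.0, 2.0                     # residual and random-intercept variances
R_inv = np.eye(3) / sigma2
G_inv = np.eye(2) / g

# Henderson's mixed model equations as a block system.
lhs = np.block([[X.T @ R_inv @ X, X.T @ R_inv @ Z],
                [Z.T @ R_inv @ X, Z.T @ R_inv @ Z + G_inv]])
rhs = np.concatenate([X.T @ R_inv @ y, Z.T @ R_inv @ y])

sol = np.linalg.solve(lhs, rhs)
print("fixed intercept (beta):", sol[0])
print("random intercepts (BLUPs, u):", sol[1:])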

This example helps to see how the matrix blocks are formed. The key takeaway is that the MME combines information from both the fixed and random effects, weighted by their respective variances (σ² and g). The inclusion of G⁻¹ in the lower right block penalizes the random effects, preventing them from being too large, which is especially important when the number of random effects is large relative to the number of observations.

Another thing to note is that if the random effects have large variance (g is large), then G⁻¹ is small, and the penalty term is less influential, allowing the random effects to vary more. Conversely, if g is small, the penalty is stronger, shrinking the random effects towards zero.

Understanding the matrix form also helps in seeing how computational challenges arise. Inverting the large matrix becomes difficult as the number of fixed and random effects grows. This is why methods like the Henderson’s iterative approach or using sparse matrix techniques are employed in practice.

In summary, the matrix form of the Mixed Model Equations is a systematic way to estimate both fixed and random effects by setting up a block matrix that incorporates the design matrices (X and Z) and the variance components (R and G). Solving this system provides the estimates while accounting for the correlation structure in the data.

]]>
Stars and Bars http://fanhuan.github.io/en/2025/02/17/Stars-And-Bars/ 2025-02-17T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2025/02/17/Stars-And-Bars While checking on the generalization of HWE, I had to refresh my memory on multinomial expansion. For any positive integer m and any non-negative integer n, the multinomial theorem describes how a sum with m terms expands when raised to the nth power. I do remember that the sum of exponents in each term needs to be the same as the original nth power, but I forgot how to calculate the coefficient.

Then I came across this method called stars and bars, which is used to determine how many terms a multinomial expansion has. I don’t remember when or whether I’ve learnt this method in school, but in Chinese it is called “隔板法”(https://zh.wikipedia.org/zh-sg/%E9%9A%94%E6%9D%BF%E6%B3%95). It solves for the number of combinations of nonnegative integer indices k1 through km such that the sum of all ki is n. Let’s consider the case where we have 3 terms, a, b and c, and we want to expand to the power of 4, (a+b+c)^4. So in this case, n=4 and m=3, and we need to split 4 stars into 3 groups, with 0-4 stars in each group. How? We only need 3-1=2 bars to put amongst those stars, and they will be separated into 3 groups.

It could be something like:


**|**| (a^2 * b^2 * c^0)

or

|*|*** (a^0 * b^1 * c^3)

As you can see, the number of combinations would be (n + m - 1) choose (m - 1), i.e., there are altogether n+m-1 positions, and we need to choose (m-1) of them to place the bars. Simple. In our example, it would be 6 choose 2, which is 6!/(4! * 2!) = 15, and there are indeed 15 different combinations such as a^4 or b * c^3.

A harder question would be how to understand the multinomial coefficients. Let’s keep thinking along the lines of stars and bars.

When we were thinking about the number of terms, we were thinking about where to put the bars, while treating the stars anonymously (they are just stars!). Now imagine that they are not. We actually have 4 distinct stars, 1, 2, 3, 4. Then for the first example,

**|**| (a^2 * b^2 * c^0)

there would be 6 different groupings:

12|34|
13|24|
14|23|
23|14|
24|13|
34|12|

meaning the arrangements within each group should be cancelled out (divided away), therefore the formula is 4!/(2! * 2! * 0!) = 6, or in general terms, n!/(k1! k2! … km!) for each term.

Now let’s think about a special case where m = 2, meaning there is always just 2-1=1 bar. The bar can be placed at n+1 different positions. When it is placed at the kth position (counting the position in front of all the stars as 0), the coefficient would be n!/(k! * (n-k)!), which is exactly (n choose k), the binomial coefficient.

Now thinking back on the generalization of HWE, there can be more than two alleles at one locus (more bars), or more than two sets of chromosomes (more stars), i.e. polyploidy, or a combination of both. Either way, we now have no problem doing the expansion (a quick computational check is sketched below).
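
As a quick check of these counts (a small sketch, not part of the original post), Python’s math module agrees with the numbers above:

from math import comb, factorial

n, m = 4, 3                               # (a + b + c)^4
print(comb(n + m - 1, m - 1))             # number of terms: 15

ks = (2, 2, 0)                            # exponents in a^2 * b^2 * c^0
coef = factorial(n)
for k in ks:
    coef //= factorial(k)
print(coef)                               # multinomial coefficient: 6

# Brute-force the term count: pairs (k1, k2) with k1 + k2 <= n, and k3 = n - k1 - k2.
print(sum(1 for k1 in range(n + 1) for k2 in range(n + 1 - k1)))   # 15 again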

I hope this post helps you to understand the multinomial expansion and generalization of HWE.

]]>