Huan Fan http://fanhuan.github.io 2024-12-20T09:56:37+00:00 huan.fan@wisc.edu Kinship matrix (II) -- GRM http://fanhuan.github.io/en/2024/12/12/GRM/ 2024-12-12T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2024/12/12/GRM In a previous post we talked about the first way to estimate a kinship matrix, which is through pedigree. In this post we will cover how to do it with genomic data.

This process is described in detail in the methods section of Professor Yang Jian’s landmark 2010 Nature Genetics paper, and you can use GCTA to generate one. Here I am trying to understand it by recreating the process.

Step 1: genotype matrix or SNP matrix.

This part is simple. It’s just converting the VCF to a coded matrix where ref/ref is 0, ref/alt is 1 and alt/alt is 2. Note that this is different from rrBLUP, where the convention is {-1, 0, 1}. See more about genotype coding in a previous post. The matrix usually has n rows (the number of samples) and m columns (the number of variants or SNPs).
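As a concrete illustration, here is one way to get from a VCF to that n x m dosage matrix in R. This is a minimal sketch assuming the vcfR package, a biallelic diploid VCF, and a hypothetical file name:

library(vcfR)

vcf <- read.vcfR("toy.vcf.gz")             # hypothetical file name
gt  <- extract.gt(vcf, element = "GT")     # strings like "0/0", "0/1", "1/1"

# count ALT alleles per genotype: ref/ref = 0, ref/alt = 1, alt/alt = 2
dosage <- apply(gt, c(1, 2), function(g) {
  if (is.na(g)) return(NA_integer_)
  sum(as.integer(strsplit(g, "[/|]")[[1]]))
})

X <- t(dosage)   # n individuals (rows) x m SNPs (columns)
dim(X)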

Step 2: scaling of the genotype matrix.

This genotype matrix is then scaled column-wise, meaning each variant is scaled on its own based on its allele frequency in this dataset, with the assumption that each locus/SNP is under Hardy-Weinberg Equilibrium (HWE). You see, two assumptions are made here: 1. this dataset is representative of the base population, therefore the allele frequencies are also representative; 2. whatever assumptions HWE carries, such as an infinite random-mating population.

The scaling of the genotype matrix ensures that each variant contributes equally to the total genetic effect regardless of its allele frequency. Mathematically, each column is brought to a mean of 0 and a variance of 1. If the allele frequency of the alternative allele at variant i is pi, then we subtract the mean, which is 2*pi^2 + 1*2*pi*(1-pi) + 0*(1-pi)^2 = 2*pi, and divide by the sd (the variance 2*pi*(1-pi) is derived from Var(xij) = E(xij^2) - E(xij)^2). Then {0,1,2} (denoted xij) becomes (xij - 2*pi)/sqrt(2*pi*(1-pi)).

To demonstrate, let’s use a toy dataset with 5 individuals and 3 causal SNPs for a certain trait.

xij SNP1 SNP2 SNP3
Individual 1 1 1 2
Individual 2 0 1 0
Individual 3 1 0 1
Individual 4 2 0 0
Individual 5 1 0 0
SUM 5 2 3
pi 0.5 0.2 0.3

First let’s calculate the pi: p1 = (1+0+1+2+1)/10 = 0.5, p2 = 0.2 and p3 = 0.3. Then use the equation (xij - 2*pi)/sqrt(2*pi*(1-pi)). Here is a snippet of code for this calculation:

scale_geno <- function(fi, geno){
  qq <- -2*fi/sqrt(2*fi*(1-fi))
  Qq <- (1-2*fi)/sqrt(2*fi*(1-fi))
  QQ <- (2-2*fi)/sqrt(2*fi*(1-fi))
  a <- ifelse(geno == 0, qq, ifelse(geno == 1, Qq, QQ))
  return(a)
}

In a table it looks like:

genotype\pi 0.5 0.2 0.3
qq/0 -1.41 -0.71 -0.93
Qq/1 0 1.06 0.62
QQ/2 1.41 2.82 2.16

and the scaled genotype matrix (zij) looks like the following. The sum of each column is 0 so the mean is also 0, and the variance of the whole matrix is 1.1, close to the expectation.

zij SNP1 SNP2 SNP3
Indi 1 0 1.06 2.16
Indi 2 -1.41 1.06 -0.93
Indi 3 0 -0.71 0.62
Indi 4 1.41 -0.71 -0.93
Indi 5 0 -0.71 -0.93
SUM 0 0 0

Step 3: turning a genotype matrix (n x m) into a relationship matrix (n x n)

By now, we can obtain a relationship matrix between these five individuals by computing ZZ’/m (m = 3 here). Let’s call it the G matrix; it is 5 x 5 in this case (a short R sketch follows the table).

G Ind1 Ind2 Ind3 Ind4 Ind5
Ind1 1.93 -0.29 0.19 -0.92 -0.92
Ind2 -0.29 1.33 -0.44 -0.63 0.04
Ind3 0.19 -0.44 0.29 -0.02 -0.02
Ind4 -0.92 -0.63 -0.02 1.12 0.45
Ind5 -0.92 0.04 -0.02 0.45 0.45
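To check the arithmetic, here is a minimal R sketch (base R only) that rebuilds Z from the toy genotype matrix and forms G = ZZ'/m; it reproduces the table above up to rounding:

M <- matrix(c(1, 1, 2,
              0, 1, 0,
              1, 0, 1,
              2, 0, 0,
              1, 0, 0), nrow = 5, byrow = TRUE)   # individuals x SNPs
p <- colSums(M) / (2 * nrow(M))                   # 0.5, 0.2, 0.3
Z <- sweep(M, 2, 2 * p, "-")                      # subtract the mean 2*pi
Z <- sweep(Z, 2, sqrt(2 * p * (1 - p)), "/")      # divide by sqrt(2*pi*(1-pi))
G <- tcrossprod(Z) / ncol(M)                      # ZZ'/m
round(G, 2)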

In reality, we know little about where the causal SNPs are. Instead, we just use all the SNPs, hoping that some of them are tightly linked to the actual QTLs. However, this assumes that the allele frequencies we see in our dataset hold true for the population, which ignores the sampling error associated with each SNP. In order to improve the estimate of G, Yang 2010 proposed a weighted average across all SNPs. In this method, the values on the off-diagonals are the same as ZZ’/m. The only adjustment happens on the diagonals (equation 5 in the image below). Specifically, they are defined as 1 + F, where F is called the inbreeding coefficient. I know how to calculate the inbreeding coefficient for a single locus, which is 1 - H(O)/H(E), with H(O) the observed number of heterozygotes and H(E) the expected number of heterozygotes under HWE. However, I do not know how to calculate F when there are multiple loci. The j = k part of Equation 6 averages Eq. 5 over all the SNPs and returns a 1 + F for each individual, but I do not understand why Eq. 5 is true. They lost me at the “when j = k, var(Aijj)” part. But let’s just apply it to our small toy dataset for now.

img

# Per-SNP diagonal term of Yang et al. (2010) Eq. 5, i.e. 1 + F for one SNP.
# p is the alt allele frequency, x the genotype coded 0/1/2.
Aijj <- function(p, x){
  f <- (x^2 - (1+2*p)*x + 2*p^2)/(2*p*(1-p))
  a <- 1 + f
  return(a)
}
Aijj SNP1 SNP2 SNP3 Ajj(mean)
Indi 1 0.0000 0.0000 3.3333 1.1111
Indi 2 2.0000 0.0000 1.4286 1.1429
Indi 3 0.0000 1.2500 0.0000 0.4167
Indi 4 2.0000 1.2500 1.4286 1.5595
Indi 5 0.0000 1.2500 1.4286 0.8929

Note that for heterozygotes (1) the value is smaller than 1 (f is negative), and for homozygotes it is greater than 1 (f is positive). This matches our understanding that homozygosity increases when there is inbreeding. The last column is the average across the SNPs. Now we can see that among these 5 individuals, numbers 3 and 5 are outbred and the rest are inbred. But this does not quite match intuition: Indi 1 has two heterozygous sites out of 3, so why is it above 1? Need to check further. Assuming this is correct, we can now replace the diagonals with the new calculations.

A Ind1 Ind2 Ind3 Ind4 Ind5
Ind1 1.11 -0.29 0.19 -0.92 -0.92
Ind2 -0.29 1.14 -0.44 -0.63 0.04
Ind3 0.19 -0.44 0.42 -0.02 -0.02
Ind4 -0.92 -0.63 -0.02 1.56 0.45
Ind5 -0.92 0.04 -0.02 0.45 0.89

We can see that the order changed. Previously, from low to high, it was 3,5,4,2,1; now it is 3,5,1,2,4. Still something odd going on with Indi 1. But most of the diagonal values are closer to 1 than in G.
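A minimal sketch that assembles this adjusted matrix from the pieces above; it reuses M, p and G from the earlier sketch plus the Aijj() function, so it is not standalone:

Ajj_per_snp <- sapply(seq_along(p), function(i) Aijj(p[i], M[, i]))  # 5 x 3 matrix of per-SNP values
A <- G                            # off-diagonals stay at ZZ'/m
diag(A) <- rowMeans(Ajj_per_snp)  # diagonals replaced by the per-individual 1 + F
round(A, 2)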

Currently I am using GCTA and then converting the GRM from the binary format to the txt format. GCTA offers a snippet of code for the conversion, but it did not work for me. Here is my snippet:

read_grm <- function(prefix) {
  # Construct file paths based on the prefix
  grm_bin <- paste0(prefix, ".grm.bin")  # Binary file containing the GRM
  grm_id <- paste0(prefix, ".grm.id")    # File containing IDs (individuals)
  
  # Step 1: Read the ID file
  grm_ids <- read.table(grm_id, header = FALSE, stringsAsFactors = FALSE)
  colnames(grm_ids) <- c("FID", "IID")
  
  # Step 2: Read the binary GRM file
  # Read the size of the binary file (each element is stored as a 4-byte float)
  n <- nrow(grm_ids)  # Number of individuals
  grm_data <- readBin(grm_bin, what = "numeric", size = 4, n = n * (n + 1) / 2)
  
  # Step 3: Reshape the GRM data into a matrix
  # Initialize an empty matrix to store the GRM values
  grm_matrix <- matrix(0, n, n)
  
  # Fill the matrix from the GRM binary file (which is in lower triangular format)
  index <- 1
  for (i in 1:n) {
    for (j in 1:i) {
      grm_matrix[i, j] <- grm_data[index]
      grm_matrix[j, i] <- grm_data[index]  # Symmetrize the matrix
      index <- index + 1
    }
  }
  
  # Add row and column names using the IDs
  rownames(grm_matrix) <- grm_ids$IID
  colnames(grm_matrix) <- grm_ids$IID
  
  # Return the GRM matrix
  return(grm_matrix)
}

Is there R code that takes the genotype matrix and gives the GRM? I am sure I am reinventing the wheel here. Let’s try A.mat from rrBLUP.

library(rrBLUP)

M <- matrix(c(1,1,2,
              0,1,0,
              1,0,1,
              2,0,0,
              1,0,0), nrow = 5, ncol = 3, byrow = TRUE)
# rrBLUP codes genotypes as {-1,0,1}
M <- M - 1
A <- A.mat(M)

OK we are not done yet!

Now let’s hand-calculate the right-hand side of the methods section of Yang 2010, which is “Unbiased estimate of the relationship at the causal variants and the genetic variance”, or A*.

The results are very different… OK, let’s explore more tomorrow: hand calculation vs. GCTA bin-to-txt vs. A.mat.

Goodness of Fit http://fanhuan.github.io/en/2024/12/03/Goodness-of-fit/ 2024-12-03T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2024/12/03/Goodness-of-fit We keep hearing this phrase, goodness of fit, sometimes hyphenated, but I never paused to think about what it is. I just searched my papers, where I have several (close to 10!) statistical ebooks, and none of them mention anything about it. Oh well. Wiki it is. Btw, what would you do? ChatGPT? I wish I could fix the comment section of my blog…

OK, here is what Wiki says about goodness of fit.

“The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. Such measures can be used in statistical hypothesis testing, e.g. to test for normality of residuals, to test whether two samples are drawn from identical distributions (see Kolmogorov–Smirnov test), or whether outcome frequencies follow a specified distribution (see Pearson’s chi-square test). In the analysis of variance, one of the components into which the variance is partitioned may be a lack-of-fit sum of squares.”

So to paraphrase, goodness of fit is a way to evaluate statistical models, and it focuses on how well the model (the expectation) fits the observations. For example, R2 is a goodness-of-fit measure. This led me to think about what other ways of evaluating statistical models there might be. Recalling the steps we take after constructing a linear model, there are diagnostic tests (residual checks), model comparison, significance of coefficients, etc. Here is a summary table from ChatGPT:

img

However, as you can see, nothing was mentioned about the significance of coefficients. When I asked ChatGPT, it said: “Testing whether a coefficient in a linear regression model is significant is not typically classified as a type of model evaluation. Instead, it is considered part of inference or hypothesis testing about the relationships between variables in the model.” Oh my. Inference.

Brian’s understanding of inference

When I was taking JHU’s Data Science Specialization on Coursera, one of the courses was Statistical Inference. It comes after Reproducible Research and before Regression Models. In the beginning of the course, Brian Caffo defined inference as “the process of drawing formal conclusions from data”, which is further qualified as settings where one wants to infer facts about a population using noisy statistical data where uncertainty must be accounted for. Not very conclusive. Later in the course we talked about probability, conditional probability, expectations, variance, common distributions, asymptopia (the law of large numbers and the central limit theorem), t confidence intervals, hypothesis testing, p-values, power, multiple testing and resampling. So, some basic statistical concepts.

Kyle’s understanding of inference

In Kyle’s advanced statistics course, where I co-teach, he did mention inference, and back then I did pause to contemplate this word. On the slide for inference he says:

  1. How to evaluate whether our model fits the data well? This includes goodness-of-fit measures such as R2 and diagnostic tests that evaluate residuals.
  2. How to evaluate whether all our predictors are useful for the model? This includes t-tests or ANOVA that evaluate model parameters. This is usually referred to as hypothesis testing, where we assume a null hypothesis (e.g. that a coefficient is zero) and ask whether the data are consistent with it.

Wiki’s inference

When I looked on Wiki, statistical inference is contrasted with descriptive statistics, which mainly includes “measures of central tendency and measures of variability or dispersion. Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation (or variance), the minimum and maximum values of the variables, kurtosis and skewness”.

“Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution.[1] Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population. “

Before we can evaluate whether the predictors are useful for the model, we first need to estimate the parameters/coefficients. How are parameters found in models? I can think of four ways (a small R illustration of the first two follows the list):

  1. LSE: least squares estimation
  2. MLE: maximum likelihood estimation
  3. Bayesian: summarizing the posterior
  4. Loss function: machine learning.
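As a tiny illustration of the first two, here is a sketch on simulated data showing that, for a linear model with Gaussian errors, least squares and maximum likelihood give essentially the same coefficients:

set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

# 1. LSE: lm() minimizes the residual sum of squares
coef(lm(y ~ x))

# 2. MLE: maximize the normal log-likelihood numerically
negll <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))  # par[3] = log(sigma)
}
optim(c(0, 0, 0), negll)$par[1:2]   # intercept and slope, close to lm()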

Then I asked ChatGPT to give me a more comprehensive table:

img

LD Pruning http://fanhuan.github.io/en/2024/12/02/LD-Prunning/ 2024-12-02T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2024/12/02/LD-Prunning In the era of whole-genome sequencing of thousands of individuals, we are facing the problem of not too few genetic variants, but too many. A major task is to filter those variants. Recently there was a very good review paper on this topic by Hemstrom et al.

The relationship between recombination rate and linkage disequilibrium (LD) is a key concept in population genetics. LD describes the non-random association of alleles at two or more loci, while recombination rate determines how frequently genetic material is exchanged between loci during meiosis. Here’s how they are related:


1. What is Linkage Disequilibrium (LD)?

  • LD measures the statistical association between alleles at different loci.
  • If two loci are in LD, the allele combinations at these loci occur more or less frequently than expected based on their individual allele frequencies.
  • LD can be quantified using metrics like D’, r^2, or D:
    • r^2: Measures the correlation between alleles at two loci, ranging from 0 (no LD) to 1 (complete LD).
    • D: Measures the deviation of observed haplotype frequencies from those expected under linkage equilibrium.

2. How Does Recombination Affect LD?

Recombination reduces LD by reshuffling alleles at different loci during meiosis. The relationship between recombination rate and LD can be summarized as:

  1. High Recombination Rate:
    • Loci with high recombination rates tend to have low LD because frequent recombination breaks the association between alleles.
    • Alleles at these loci assort more independently, leading to linkage equilibrium.
  2. Low Recombination Rate:
    • Loci with low recombination rates tend to have high LD because recombination events are rare, preserving the non-random association of alleles.
    • This is common for loci that are physically close on the same chromosome.
  3. Recombination Hotspots:
    • Regions of the genome with high recombination activity can lead to sharp decreases in LD between loci on either side of the hotspot, even if they are physically close.

3. Factors Influencing the Relationship Between Recombination and LD

While recombination plays a central role in shaping LD, other factors also affect this relationship:

  1. Genetic Distance:
    • Loci that are closer together on a chromosome typically have lower recombination rates and higher LD.
    • Loci further apart are more likely to recombine, resulting in lower LD.
  2. Population Size:
    • Smaller populations tend to have higher LD because fewer recombination events occur across generations.
  3. Mutation Rate:
    • Higher mutation rates introduce new alleles that can increase or decrease LD.
  4. Selection:
    • Natural selection can maintain LD by favoring specific allele combinations (e.g., epistatic selection or selective sweeps).
  5. Population History:
    • Bottlenecks, founder effects, and admixture events can lead to elevated LD in regions with low recombination rates.

4. Mathematical Description

LD decay due to recombination can be described by the equation D_{t+1} = (1 - r) D_t (a short R sketch of this decay follows the bullets below), where:

  • D_{t+1}: LD at the next generation.
  • r: Recombination rate between the two loci.
  • D_t: LD in the current generation.

This shows that:

  • Higher recombination rates (r) reduce LD faster across generations.
  • Lower recombination rates (r) allow LD to persist for longer periods.
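Here is the short R sketch of that decay (the starting LD value and the rates are arbitrary):

D0 <- 0.25                       # initial LD
t  <- 0:100                      # generations
r_values <- c(0.001, 0.01, 0.1)  # recombination rates
decay <- sapply(r_values, function(r) D0 * (1 - r)^t)   # D_t = (1 - r)^t * D_0
matplot(t, decay, type = "l", lty = 1, xlab = "Generation", ylab = "D")
legend("topright", legend = paste("r =", r_values), col = 1:3, lty = 1)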

5. Real-World Implications

  1. Mapping Genes:
    • LD is used in genome-wide association studies (GWAS) to link genetic markers to traits.
    • High LD regions may indicate physical proximity between a marker and a causal variant.
  2. Population Genomics:
    • LD patterns provide insights into recombination landscapes, population structure, and demographic history.
  3. Selective Sweeps:
    • Strong positive selection can maintain high LD around a beneficial allele, even in regions with moderate recombination rates.

6. Summary

  • Recombination rate inversely affects LD: High recombination reduces LD, while low recombination maintains it.
  • LD patterns reflect the interplay of recombination, selection, mutation, and demographic factors.
  • Understanding the relationship between recombination and LD is crucial for genetic mapping, evolutionary studies, and understanding population structure.

Recombination rate is typically calculated or estimated using genetic data, and it represents the frequency at which recombination occurs between two loci. This rate can be determined in different ways depending on the type of data and methods used. Below are the key approaches:


1. Using Genetic Maps

A genetic map provides recombination rates in centiMorgans (cM) per physical distance (e.g., per megabase, Mb).

  • Definition of 1 cM:
    • 1 centiMorgan corresponds to a 1% chance of recombination occurring between two loci during meiosis.
  • Recombination Rate (cM/Mb) = Genetic Distance (cM) / Physical Distance (Mb)

How Genetic Maps Are Built:

  1. Linkage Analysis:
    • Use observed genetic markers (e.g., SNPs) from pedigree data or experimental crosses.
    • Recombination frequencies (r) between markers are measured.
    • The genetic distance is inferred using the Haldane or Kosambi mapping functions:
      • Haldane (no interference): d = -(1/2) * ln(1 - 2r)
      • Kosambi (with interference): d = (1/4) * ln((1 + 2r)/(1 - 2r))
    • Genetic distances are summed to build the map.
  2. High-Density SNP Data:
    • Use population-based genetic data and haplotypes to infer recombination hotspots and recombination rates.

2. Using Population Genetic Data

Recombination rates can also be inferred directly from population genetic data using linkage disequilibrium (LD).

Concept:

Recombination breaks down LD over time, so patterns of LD between markers can be used to estimate recombination rates.

  1. Statistical Models:
    • LD-based methods estimate r by fitting population genetic models.
    • Software tools such as LDhat and LDhelmet are widely used for this purpose.
  2. Coalescent Framework:
    • Recombination rates are estimated by modeling how haplotypes coalesce back in time under specific demographic and genetic scenarios.
  3. Formula Linking LD and Recombination: LD decay due to recombination is modeled as r^2 = 1/(1 + 4*N_e*r), where:
    • r^2: Linkage disequilibrium between loci.
    • N_e: Effective population size.
    • r: Recombination rate between loci.

This relationship allows estimation of r using LD patterns in population data.


3. Experimental Crosses

In experimental populations (e.g., plants or animals), recombination rates can be measured directly by analyzing offspring genotypes from controlled crosses.

Steps:

  1. Cross two genetically distinct parents to produce offspring.
  2. Genotype markers (e.g., SNPs, microsatellites) in the offspring.
  3. Count recombination events between adjacent markers.
  4. Calculate the recombination frequency: r = (number of recombinant offspring) / (total number of offspring) (see the R sketch after this list).
  5. Use mapping functions (e.g., Haldane or Kosambi) to convert recombination frequencies into genetic distances.
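Here is the R sketch of steps 4-5 with made-up counts (the numbers are hypothetical):

recombinant <- 18
total       <- 200
r <- recombinant / total          # recombination frequency

# convert to genetic distance (in Morgans) with the two mapping functions
haldane <- -0.5  * log(1 - 2 * r)
kosambi <-  0.25 * log((1 + 2 * r) / (1 - 2 * r))
c(r = r, haldane_cM = 100 * haldane, kosambi_cM = 100 * kosambi)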

4. Using Molecular Data

With advancements in sequencing, recombination rates can also be estimated using:

  1. Recombination Hotspots:
    • High-resolution sequencing data reveals recombination hotspots (regions with very high recombination rates).
    • Tools like PRDM9 motif analysis can identify hotspots based on sequence patterns.
  2. Double-Strand Break (DSB) Mapping:
    • Experimental methods (e.g., ChIP-seq for DSB proteins like Spo11) directly measure recombination activity at specific genomic regions.

5. Using Existing Recombination Maps

For well-studied organisms like humans, mice, and certain crops, recombination maps are already available:

  • Human recombination maps (e.g., HapMap or 1000 Genomes Project) provide rates in cM/Mb across the genome.
  • These maps are often derived from large-scale genotyping and haplotype-based LD analyses.

Example in Humans

In humans, the average recombination rate is ~1.2 cM/Mb, but it varies across the genome:

  • Recombination hotspots: Regions with recombination rates >10 cM/Mb.
  • Recombination coldspots: Regions with recombination rates <0.1 cM/Mb.

R Implementation Example

If you have genetic distances (cM) and physical distances (Mb), you can calculate recombination rates like this:

# Example data
genetic_distance <- c(1.5, 2.0, 0.5)  # in cM
physical_distance <- c(0.1, 0.2, 0.05)  # in Mb

# Calculate recombination rate in cM/Mb
recombination_rate <- genetic_distance / physical_distance

# Print results
print(recombination_rate)

Output:

[1] 15 10 10  # cM/Mb

Summary:

  • Recombination rates can be calculated from genetic maps, LD patterns, or experimental crosses.
  • They are influenced by physical distance, recombination hotspots, and population genetics.
  • Tools like LDhat, LDhelmet, and existing recombination maps are useful for estimation.
Hardy-Weinberg Equilibrium http://fanhuan.github.io/en/2024/12/02/HWE/ 2024-12-02T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2024/12/02/HWE Last time I mentioned that Hardy-Weinberg Equilibrium (HWE) made me think about the relationship between population genetics and quantitative genetics. HWE actually deserves its own post, so here we are.

Background Info on HWE

HWE is a theoretical relationship between allele frequencies and genotype frequencies. For a locus with two alleles, A and a, if the allele frequency of A is p, then the allele frequency of a is q = 1 - p, and the frequency of AA would be p^2, Aa 2pq and aa q^2, provided the species is diploid, like humans, or the plant species I work on (luckily!).

img By Johnuniq - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6045237

What if the observation does not match this prediction? E.g. for locus A, in my dataset with 100 individuals, I have 50 people genotyped as AA, 30 as Aa and 20 as aa, so the allele frequency of A is (50x2 + 30)/200 = 0.65, and 0.35 for a. So in theory, the expectation for genotype AA should be 0.65^2 x 100 = 42.25, Aa = 2 x 0.65 x 0.35 x 100 = 45.5, and aa = 0.35 x 0.35 x 100 = 12.25. The difference between the observation (50, 30, 20) and the expectation (42.25, 45.5, 12.25) is the deviation.

Currently there are two ways to test whether this deviation is significant. Chronologically, the first is a simple Chi-squared goodness-of-fit test (read more about goodness of fit in a previous post). In our example, the statistic would be (50-42.25)^2/42.25 + (30-45.5)^2/45.5 + (20-12.25)^2/12.25 ≈ 11.6. The degrees of freedom for a test of Hardy–Weinberg proportions are # genotypes − # alleles, so here 3 − 2 = 1. The 5% critical value for 1 degree of freedom of the Chi-squared distribution is 3.84. Since our χ2 value is greater than this, the locus deviates from HWE.
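For the record, here is the same calculation in a few lines of base R (hand-rolled rather than chisq.test(), so that the degrees of freedom are # genotypes − # alleles = 1):

obs <- c(AA = 50, Aa = 30, aa = 20)
n   <- sum(obs)
p   <- (2 * obs[["AA"]] + obs[["Aa"]]) / (2 * n)       # frequency of A = 0.65
expected <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2)     # HWE expectations
chisq <- sum((obs - expected)^2 / expected)
pval  <- pchisq(chisq, df = 1, lower.tail = FALSE)
c(chisq = chisq, p.value = pval)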

Later, in the 2005 Wigginton paper, the authors proposed an exact test of HWE, which behaves better than the chi-squared test when genotype counts are small.

For more on what H-W Equilibrium is and how to test whether a locus has deviated from it, see its wiki page. In this post we mainly talk about why, whether and how to filter genetic variants based on HWE.

Reasons for Deviation from HWE

Deviations from HWE could mean several things, mainly:

  1. inbreeding
  2. population stratification
  3. problems in genotyping.

Here we only want to filter out SNPs that deviate from HWE due to the third reason: problems in genotyping. If a SNP deviates from HWE due to the first two reasons, it would be wrong to filter it out.

In the beginning I did the H-W evaluation for the whole dataset. This is done through bcftools +fill-tags. Here is the explanation of all the tags in the manual.

The main relevant tag is HWE

INFO/HWE Number:A Type:Float .. HWE test (PMID:15789306); 1=good, 0=bad

There is another tag that is related, or using the same input info which is the number of heterozygote individuals:

INFO/ExcHet Number:A Type:Float .. Test excess heterozygosity; 1=good, 0=bad

After that post, I decided to do the HW evaluation separately for each subpopulation.

E.g. for the first SNP in my dataset it reads HWE=0.0022464. After filling the tags separately, which is achieved via -S sample_population.mapping_file, the tags read:

HWE_TS1=1;ExcHet_TS1=0.996109;HWE_AGO=0.560936;ExcHet_AGO=0.837149;HWE_Nigeria=1;ExcHet_Nigeria=1;HWE_Evolution=1;ExcHet_Evolution=1;HWE_AVROS=1;ExcHet_AVROS=1;HWE_Ghana=1;ExcHet_Ghana=1;HWE_Compacta=1;ExcHet_Compacta=0.903226;HWE_Ekona=0.428571;ExcHet_Ekona=1;HWE_Deli=1;ExcHet_Deli=1;HWE_Tanzania=1;ExcHet_Tanzania=1;HWE_TS3=1;ExcHet_TS3=1;HWE_L2T=1;ExcHet_L2T=1;HWE_Ni=1;ExcHet_Ni=0.952381;HWE_TR=1;ExcHet_TR=0.947368;HWE=0.0022464;ExcHet=0.999891

Today, when I was trying to filter out some SNPs so I do not need to deal with dozens of millions of them, I gave the Hardy-Weinberg evaluation another thought. It is often desirable to filter out loci based on statistically significant (for a given α-value or P value) deviations from H-W proportions. Everybody knows about Hardy-Weinberg, the iconic p + q = 1 -> p^2 + 2pq + q^2 = 1: simple, elegant, yet too good to be true, just like effective population size (random mating, an infinitely large population, in the absence of selection, migration or new mutation). Therefore the concern is that a deviation from it might not be a consequence of poor SNP quality. If we filter too aggressively, we might lose the loci that are actually interesting, since their allele frequencies might differ between populations.

Hardy–Weinberg proportions. It is often desirable to filter out loci based on statistically significant (for a given α-value or P value) deviations from HWP. HWP are a common assumption of many downstream analytical tools (for example, STRUCTURE)81, and removing loci that violate HWP can help to ensure unbiased results for downstream analyses in randomly mating populations82. Deviations from HWP often reflect sequencing, assembly or alignment errors (such as a heterozygote deficit caused by allelic dropout or a heterozygote excess caused by paralogous regions)47,60,83,84. However, loci out of HWP can also indicate real biological phenomena, such as cryptic population substructuring (Fig. 2c) or balancing selection85. As a result, it is crucial to filter HWP within sample-groups (for example, within populations) rather than study wide (for example, globally on all samples)86 (discussed below) and to do so with a low stringency if the loci under selection or those that differ between populations are of interest. That said, some metrics, such as FST, can be biased upward by the careless removal of loci that are not in HWP within populations86, which is potentially problematic if population delineations

Variable, Covariables and Covariates. http://fanhuan.github.io/en/2024/12/02/Covariant/ 2024-12-02T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2024/12/02/Covariant 12.10 Partial RDA(redundancy analysis) and variance partitioning

Covariables: what they are and how to partial them out

When you are not interested in the influence of specific explanatory variables, it is possible to partial out their effect using techniques such as partial linear regression.

Chapter 5.3 Partial Linear Regression.

5.2 Multiple (linear) regression analysis: A multiple regression analysis is a statistical technique used to examine the relationship between one dependent variable (or outcome variable) and two or more independent variables (or predictors).

Backwards selection:

Why do we do model selection? We know that we cannot include all the parameters. Some are not significant; some will not pass the test (the addition of a parameter is penalized). Another way to think about it is that we should not use the full model. First of all, we have the problem of overfitting. Why is overfitting a problem? The model is too tailored to the current dataset, so it is very hard to use it to predict unseen data or future events. Another way to understand the necessity of using only a subset of the parameters: (i) interpretation of the model: the fewer the parameters, the easier it is to interpret; this is very similar to the predictability of models. (ii) precision: prediction intervals and confidence bands will be narrower. This is because each parameter comes with its own standard error, and the more parameters, the bigger the standard error of the prediction.

Therefore there are many ways of evaluating how optimal a model is for different subsets of the same set of parameters (a small R comparison follows the list).

  1. The very famous AIC (Akaike Information Criterion). Who is Akaike? Hirotugu Akaike (赤池 弘次) was a Japanese statistician. The formula looks like this:

AIC = n log(SS_residual) + 2(p+1) - n log(n).

As you can see there are three quantities: n (# of observations), SS_residual (goodness of fit) and p (# of parameters). In model selection, n is fixed, therefore we are basically comparing how well the model fits the data while penalizing the number of parameters. Also note that if the scale of y is very big, the effect of p will be relatively small, even after the log.

  2. Adjusted R2. Usually R2 = 1 - SS_residual/SS_total. Adjusted R2 penalizes the number of parameters:

Adjusted R2 = 1 - (SS_residual/SS_total) * (n-1)/(n-1-p). Why (n-1)/(n-1-p)? Because the sums of squares are compared through their mean squares, i.e. SS_residual/(n-1-p) against SS_total/(n-1).

  3. BIC (Bayesian Information Criterion). The formula:

BIC = -2 ln(L) + p ln(n), where L is the maximized likelihood.
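Here is the small R comparison of the three criteria on two nested models (simulated data; x3 is pure noise):

set.seed(2)
n  <- 80
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

m_small <- lm(y ~ x1 + x2)
m_full  <- lm(y ~ x1 + x2 + x3)

data.frame(
  model  = c("x1 + x2", "x1 + x2 + x3"),
  AIC    = c(AIC(m_small), AIC(m_full)),
  BIC    = c(BIC(m_small), BIC(m_full)),
  adj_R2 = c(summary(m_small)$adj.r.squared, summary(m_full)$adj.r.squared)
)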

Heritability and how to estimate http://fanhuan.github.io/en/2024/11/18/Heritability/ 2024-11-18T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2024/11/18/Heritability In this post I am going to talk about what heritability is and how to estimate it using WGS data for non-family (i.e. unrelated) individuals.

“Statistics prior to animal breeding was not very concerned with predicting random effects. These were somewhat seen as nuisance parameters.” Here, nuisance parameters means parameters that are not of primary interest but still affect the model and its estimates, which is true of the random effects in a normal mixed model.

Reference

  1. Population Genomics, Concepts, Approaches and Applications. (Springer Cham, 2019). doi:10.1007/978-3-030-04589-0.
  2. Textbook Animal Breeding and Genetics (second edition, 2024) https://wiki.groenkennisnet.nl/space/TAB
One Liners for StatQuest Fundamentals http://fanhuan.github.io/en/2024/10/21/StatQuest-Fundementals/ 2024-10-21T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2024/10/21/StatQuest-Fundementals I love StatQuest and have watched quite a few of its videos. After a while, I lose track of which ones I have watched, and as you know, Josh usually asks you to make sure you have watched some prerequisites. Here I am documenting what I have watched and trying to summarize each with just one (OK, a few) sentence. Today we will start with the fundamentals.

  1. Histograms They are geneticists! They are statisticians with (my) domain knowledge!

  2. The Main Ideas behind Probability Distributions Dag? Not Bam?

  3. The Normal Distribution The normal curves are drawn such that 95% of the measurements fall between +/- 2 sd around the mean.

  4. The Mean, the Median and the Mode If the distribution is normal, they are the same.

  5. The Exponential Distribution Used when estimating the time between two events. I never thought about the exponential distribution in this sense. For me it was about the relationship between time (x) and population size (y): if a population starts at 100 individuals (y0 = 100) and has a growth rate of λ = 0.05 (5% per unit of time), the population at time t would be y(t) = 100e^(0.05t). But if you think about it, whenever they give examples of the exponential distribution, it goes down along the x-axis, instead of up (time vs. population)!

Variance, Covariance, Correlation, Variation and Covariation http://fanhuan.github.io/en/2024/10/15/Some-Concepts/ 2024-10-15T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2024/10/15/Some-Concepts Variance is the most basic one. It measures the spread or dispersion of a single variable’s values around its mean: Var(X) = (1/n) * sum((x_i - mean)^2). The standard deviation is sqrt(Var(X)).

Covariance measures the degree to which two variables change together. It indicates whether two variables tend to increase or decrease in tandem: Cov(X,Y) = (1/n) * sum((x_i - x_mean)(y_i - y_mean)). You can see that if x_i and y_i are both greater or both smaller than their means, the product will be positive. If they move in opposite directions, the product will be negative and Cov(X,Y) will be smaller.

Correlation is standardized covariance. The formula for correlation is cor(y1, y2) = cov(y1, y2)/sqrt(var(y1)*var(y2)). If the variables (y1 and y2) are already standardized (mean = 0, sd = 1), then cor(y1, y2) = cov(y1, y2). Note that in simple linear regression, R2 = cor(y, y_hat)^2. If you have a vcv (variance-covariance) matrix, you can turn it into a correlation matrix via stats::cov2cor.
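A quick sanity check in R (simulated data):

set.seed(3)
y1 <- rnorm(100)
y2 <- 0.6 * y1 + rnorm(100)

cov(y1, y2) / sqrt(var(y1) * var(y2))  # same as cor(y1, y2)
cor(y1, y2)

V <- cov(cbind(y1, y2))   # variance-covariance (vcv) matrix
stats::cov2cor(V)         # standardized into a correlation matrix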

Both variation and covariation are broader terms compared with variance and covariance, which are precise statistical terms with defined equations.

Genotype Decode http://fanhuan.github.io/en/2024/10/15/Genotype-Decode/ 2024-10-15T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2024/10/15/Genotype-Decode As you may have noticed, there are many ways of denoting genotypes, even though it is usually a pretty straightforward thing for diploids: R/R, R/A and A/A.

  1. VCF original: 0/0: R/R; 0/1: R/A; 1/1: A/A. ./.: missing call
  2. Dosage of ALT: 0/1/2/NA.
    1. Rule: 0/0: 0; 0/1: 1; 1/1: 2; ./.: NA. Note that heterozygotes are always represented as 0/1, no 1/0 would be called.
    2. Tools: you can use plink --recode A to turn the vcf format or .bed into this format.
  3. {-1,0,1}
    1. Rule: I am not sure about this one. I have come across the claim that this should be the dosage of the major allele, not the ref allele. But currently I am using it as if -1: R/R; 0: R/A; 1: A/A.
    2. Tools: both {sommer} and {rrBLUP} uses this format. Or to be more accurate, they can deal with any scaled (mean = 0) SNP matrix.
  4. 0/1. I actually do not know what this means. Homo and Hetero?

There are also tools that do not care about the format, e.g {BGLR}. You can either give it in the 0/1/2 format or -1/0/1 format. You just need to scale it before using so that each column has a mean of 0 and a standard deviation of 1. This ensures that all SNPs are on a comparable scale. This standardization helps ensure that each SNP contributes equally to the model, which is particularly important when you have variables with different ranges (like 0, 1, 2 in the SNP matrix). Without scaling, SNPs with larger variance could disproportionately influence the model compared to those with smaller variance. In Bayesian ridge regression, this step ensures that the prior regularization is applied more evenly across all SNPs.

If you are going to calculate h2 based on Vu, you also need to further scale the SNP matrix by the number of SNPs, so that the random effect takes the appropriate portion of the total variance, via scale(X)/sqrt(ncol(X)). y also needs to be standardized in order to calculate h2 so that it is comparable across traits. The A matrix also needs to be standardized, which can be done with A/mean(diag(A)).
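Put together, the scaling described above looks roughly like this. This is a sketch with simulated stand-ins for the SNP matrix X, the phenotype y and the kinship matrix A:

set.seed(4)
X <- matrix(sample(0:2, 20 * 50, replace = TRUE), 20, 50)  # toy 0/1/2 SNP matrix
y <- rnorm(20)
A <- tcrossprod(scale(X)) / ncol(X)                        # a simple GRM for illustration

X_scaled <- scale(X) / sqrt(ncol(X))  # columns: mean 0, sd 1, then shrink by sqrt(m)
y_scaled <- scale(y)                  # standardize the phenotype so h2 is comparable
A_scaled <- A / mean(diag(A))         # rescale the kinship matrix so the mean diagonal is 1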

For methods that involve regularization (ridge regression, lasso, etc.), scaling also helps maintain stability and interpretability of the posterior distributions for the effect sizes of the SNPs. Here regularization means favoring shrunken (flatter) estimates in order to prevent overfitting. Ridge regression is usually referred to as L2 regularization and lasso as L1 regularization. The YouTube channel StatQuest with Josh Starmer has excellent videos on this topic that you can check out if you are not familiar with them yet.

Kinship matrix (I) -- A Matrix http://fanhuan.github.io/en/2024/10/14/A-Matrix/ 2024-10-14T00:00:00+00:00 Huan Fan http://fanhuan.github.io/en/2024/10/14/A-Matrix Kinship matrix

A kinship matrix (K), sometimes also referred to as the G matrix or genetic relationship matrix, is a variance-covariance (vcv) matrix, the kind we usually use in the random-effects part of mixed models. Therefore, the diagonals are the within-individual variance, which is driven by the degree of homozygosity or inbreeding level. Actually, the diagonals are 1 + F, where F is the inbreeding coefficient as described in a previous post about HWE. The off-diagonals are covariances between pairs of individuals. If you are not familiar with covariance, check out this previous post.

Kinship matrices can be calculated based on different kinds of information: pedigree, DNA markers and/or sequence data.

A matrix

If a kinship matrix is estimated using pedigree info, it is called an A matrix, A for additive. Sometimes it is also referred to as the numerator relationship matrix (numerator/denominator), and I have a theory of where that name comes from later in the post. Traditionally, before we had molecular data, kinship matrices could only be estimated using pedigree information, so the A matrix gives the expected genetic relationships between individuals in a population. It is supposed to capture the probability that two alleles in two individuals are identical by descent (IBD).

GRM

If a kinship matrix is estimated using genome sequencing, it is called a Genomic Relationship Matrix (GRM). See their differences summarized below: img

Computation of A matrix

Here we will follow the R code in Austin Putz’s post for the A matrix computation. I have a separate post on the GRM.

# create original pedigree



ped <- matrix(cbind(c(3:6), c(1,1,4,5), c(2,0,3,2)), ncol=3)

# change row/col names
rownames(ped) <- 3:6
colnames(ped) <- c("Animal", "Sire", "Dam")

# print ped
print(ped)

This won’t work because, in a pedigree, everyone in the 2nd and 3rd columns (the parents) needs to also appear in the first column (we need a row for every parent). Right now we have 1,2,3,4,5 in the 2nd and 3rd columns, but only 3,4,5 in the first column, which means we need to add 1 and 2 to the first column. If we do not know their parents, just use 0. Note that the pedigree needs to be sorted from oldest (top) to youngest (bottom), meaning parents go before offspring.

ped <- matrix(cbind(c(1:6), c(0,0,1,1,4,5), c(0,0,2,0,3,2)), ncol=3)

# change row/column names
rownames(ped) <- 1:6
colnames(ped) <- c("Animal", "Sire", "Dam")

# print matrix
print(ped)

Then it gives the logic for generating the A matrix. Basically, you need to generate the off-diagonals first. The relationship between individual 1 and individual 2 is defined as the average relationship between individual 1 and the parents of individual 2:

a_ind1,ind2 = 0.5 * (a_ind1,sire2 + a_ind1,dam2).

After you have the off-diagonals, the diagonals are easy:

a_ii = 1 + 0.5 * a_sire(i),dam(i), and since every sire and dam also appears in column 1, a_sire(i),dam(i) is one of the off-diagonals.

The only complication is the 0s, i.e. when the parent information is unknown.

Let’s look at the code.

createA <-function(ped){
    
    if (nargs() > 1 ) {
      stop("Only the pedigree is required (Animal, Sire, Dam)")
    }
    
    # This is changed from Gota's function
    # Extract the sire and dam vectors
    s = ped[, 2]
    d = ped[, 3]
    
    # Stop if they are different lengths
    if (length(s) != length(d)){
      stop("size of the sire vector and dam vector are different!")
    }
    
    # set number of animals and empty vector
    n <- length(s)
    N <- n + 1
    A <- matrix(0, ncol=N, nrow=N)
    
    # set sires and dams (use n+1 if parents are unknown: 0)
    s <- (s == 0)*(N) + s
    d <- (d == 0)*N + d
    
    start_time <- Sys.time()
    # Begin for loop
    for(i in 1:n){
      
      # equation for diagonals
      A[i,i] <- 1 + A[s[i], d[i]]/2
      
      for(j in (i+1):n){    # only do half of the matrix (symmetric)
        if (j > n) break
        A[i,j] <- ( A[i, s[j]] + A[i, d[j]] ) / 2  # half relationship to parents
        A[j,i] <- A[i,j]    # symmetric matrix, so copy to other off-diag
      }           
    }
    
    # print the time it took to complete
    cat("\t", sprintf("%-30s:%f", "Time it took (sec)", as.numeric(Sys.time() - start_time)), "\n")
    
    # return the A matrix
    return(A[1:n, 1:n])
    
  }
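To run it on the completed pedigree above (a small usage example; the values it prints are the ones discussed below):

A <- createA(ped)
rownames(A) <- colnames(A) <- ped[, "Animal"]
print(round(A, 3))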

The first thing to notice is that it starts as an (n+1) x (n+1) matrix filled with 0s. This means unknown relationships are by default 0 unless changed later.

Secondly, it fills in the order a11, a12, a13 all the way to a16, then a22, a23 to a26, etc. This is why ordering from old to young is so important: the eldest ones are the ones with unknown parents, therefore a11 is always 1 (a77 is 0, with 7 standing for unknown).

Now that we have the A matrix, how do we understand those numbers intuitively?

On the off-diagonal:

  1. “0.5”: a13, a14 and a23 are parent-offspring pairs, so 50% inheritance. For a15, 1 is the father of 3 and 4, and 5 is the child of 3 and 4, so still 50%.
  2. “0.25”: a16 = (a1,5 + a1,2)/2 = ((a1,4 + a1,3)/2 + a1,2)/2 = ((0.5 + 0.5)/2 + 0)/2 = 0.25.
  3. “0”: a12, a24: a12 = (a1,7 + a1,7)/2 = 0. a24=(a2,1 + a2,7)/2 = (0+0)/2 = 0

On the diagonal:

  1. “1”: a11, a22 and a44: they all have one or two 0s in their parent info, therefore a_sire,dam is always 0.
  2. “1”: a33 = 1 + 0.5 * a12 = 1 + 0 = 1.
  3. “1.125”: a55 and a66. These are still on the diagonal: a55 = 1 + 0.5 * a34. a34 = (a13 + a37)/2 = (0.5 + 0)/2 = 0.25, so a55 = 1 + 0.5 * 0.25 = 1.125. Same with a66. But what does it mean for a diagonal to exceed 1? You can think about this in terms of unequal variances in the generalized least squares case.

Since we know the A matrix is a vcv matrix where the diagonals are not always 1, you can turn it into relationships (similarities) by dividing the off-diagonal elements by the square roots of the product of the corresponding diagonals. This is called taking a covariance matrix and reducing it to a correlation matrix. Recall that correlation is standardized covariance.

There is a function for it:

# convert A to actual relationships
A_Rel <- stats::cov2cor(A)

# print matrix
print(round(A_Rel, 4))

How it works is that eventually everything on the diagonal becomes 1. The off-diagonal entries are scaled by the related variances; for example, a12 will be scaled by a11 and a22: a12_new = a12_old/sqrt(a11 * a22). I think this is why the A matrix is called the numerator relationship matrix: its elements go above the division bar.

Indeed, when you use rrBLUP::A.mat(SNP_matrix), the diagonals of the returned matrix are not just ones. You can scale it by A = A/mean(diag(A)). This function basically does a (centered) transposed cross product of the SNP matrix; therefore a GBLUP model with Z = SNP_matrix is equivalent to one with K = A.mat(SNP_matrix).
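A quick way to convince yourself of that equivalence is to fit both parameterizations with rrBLUP::mixed.solve on simulated data and compare the implied breeding values; this is a sketch, not part of the original post:

library(rrBLUP)

set.seed(42)
n <- 100; m <- 500
M <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE), n, m)  # markers in {-1,0,1}
u <- rnorm(m, 0, 0.1)                                          # true marker effects
y <- as.vector(M %*% u) + rnorm(n)                             # phenotype

fit_rr <- mixed.solve(y, Z = M)          # RR-BLUP: estimates marker effects
fit_gb <- mixed.solve(y, K = A.mat(M))   # GBLUP: estimates breeding values

# breeding values implied by the two models should be (almost) perfectly correlated
cor(as.vector(M %*% fit_rr$u), as.vector(fit_gb$u))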

OK, I hope now you understand what is happening. The original post is much better than mine!
