Coalescence Theory http://fanhuan.github.io/en/2025/02/20/Coalescence-Theory/ 2025-02-20

When I was doing some reading on IBD, I ran into this passage in Powell 2010:

“Traditionally, the probability that two alleles are IBD was most often calculated from a known pedigree and so the individuals at the top of the pedigree (the founders) form a natural base population. However, it is becoming common to use data on genetic markers such as SNPs to estimate the probability of being IBD without reference to a known pedigree and, in this case, there is no obvious base population. Moreover, the concept of IBD seems to conflict with the well-established coalescence theory in which all alleles are descended from a common ancestor but at different times in the past.”

I was a bit lost. Why are they in conflict? So I asked ChatGPT, and this is the answer:

The seeming conflict comes from how “identity by descent” (IBD) and coalescence theory define “common ancestry” over different time scales and with different reference points.

IBD in Pedigree Context

  • IBD Definition:
    In a pedigree, IBD typically means that two alleles are identical because they were inherited from the same recent common ancestor.
  • Base Population:
    When you work with a known pedigree, you often define a set of founders as your “base population.” Alleles are considered IBD if they are identical by descent from those founders.

Coalescence Theory

  • Coalescence Concept:
    Coalescence theory tells us that if you trace the ancestry of any pair of alleles far enough back in time, they will eventually converge on a single common ancestral allele.
  • Universal Common Ancestor:
    This means that, theoretically, all alleles in a population are IBD if you go far enough back—because they all share an ancestor (for example, the most recent common ancestor of the species).

The Conflict

  • Time Scale and Practical Use:
    • IBD: In practice, IBD is used to assess recent shared ancestry. It’s meaningful when you have a defined base population or cutoff time, so you can say, “These two alleles come from the same recent ancestor.”
    • Coalescence Theory: Coalescence does not have a built-in cutoff; it implies that, given infinite time, all alleles are IBD. This makes the concept trivial if you ignore time scales.
  • No Natural Base Population:
    When using genetic markers like SNPs without a known pedigree, there’s no obvious set of founders. In coalescence theory, every allele is descended from a common ancestor if you go back far enough. This contrasts with the practical use of IBD, where we’re interested in whether two alleles share a recent common ancestor, not just any ancestor in the distant past.

In Summary

  • IBD (Traditional/Practical Use):
    It’s about recent common ancestry relative to a defined base population. Two alleles are considered IBD if they are inherited from the same recent ancestor.

  • Coalescence Theory:
    It shows that all alleles coalesce into a single ancestral allele over time, meaning that if you extend the time horizon indefinitely, every pair of alleles is IBD.

Thus, the conflict arises because coalescence theory makes the notion of IBD universal (all alleles are IBD eventually), while in practice, we want to quantify recent relatedness using IBD. This is why when using genetic markers without a known pedigree, researchers must define an arbitrary or biologically meaningful base population or time threshold for IBD to be useful.

Then the article went on to describe how this conflict is dealt with:

“In practice, this conflict has been ignored by using IBD concepts for recent common ancestors and coalescence analysis for distant common ancestors; however, the two categories of ancestor merge, especially when using dense SNP or DNA sequence data.”

Sounds familiar? This is exactly the problem we discussed in the quantitative genetics vs. population genetics post. In quantitative genetics, we usually work with populations with recent common ancestors, sometimes even with a clear pedigree; in population genetics, we usually have little to no knowledge of the pedigree and rely solely on molecular markers. Just as this sentence describes, now that we have whole-genome information on various types of populations combined, the approaches that used to differ need to be unified.

So, is coalescence theory still relevant?

IBD and IBS http://fanhuan.github.io/en/2025/02/19/IBD-And-IBS/ 2025-02-19

IBD

Identity-by-descent, also known as identical-by-descent. In Speed and Balding 2015, it is defined as the “phenomenon whereby two individuals share a genomic region as a result of inheritance from a recent common ancestor, where ‘recent’ can mean from an ancestor in a given pedigree, or with no intervening mutation event, or with no intervening recombination event.”

The probability of IBD, or F, is tightly linked to relatedness: “Traditional measures of relatedness, which are based on probabilities of IBD from common ancestors within a pedigree, depend on the choice of pedigree”. If the pedigree is known, the expected IBD is the A matrix. However, when the pedigree is unknown, IBD relationships can only be inferred from the sampled population, and unfortunately there is no consistent definition of IBD probabilities without a pedigree.

In another review paper, Powell 2010 defined it as “alleles that are descended from a common ancestor in a base population”. You can see the two definitions are slightly different. The former uses “genomic region” as the unit whereas the latter uses “alleles”. Alleles are versions of genes, whereas a “genomic region” can be non-genic and of any length, so the former is more generic. Also, the latter emphasizes the concept of a “base population”. The probability of IBD is sometimes referred to as F, and it “has to be defined with respect to a base (reference) population; that is, the two alleles are descended from the same ancestral allele in the base population.” Why so? As you can imagine, if an allele is very rare in the base population, two individuals carrying it are very likely to be IBD. On the contrary, if an allele is very common in the base population, two individuals sharing it could easily be due to chance. See another post on how to determine the base population.

“If the two alleles are in the same diploid individual then F is the inbreeding coefficient of the individual at this locus.” See more on how IC is calculated in this post.

IBS

Identity-by-state, also known as identical-by-state. This concept is relatively simple. It just means that two things, be they alleles or genomic regions, are the same in two different individuals, regardless of whether they are IBD. This sounds familiar, right? The relationship between IBD and IBS is like the one between orthologs and homologs.

IBS is what we see in the current dataset, and it is usually used to calculate the G matrix when the pedigree is unknown. As you can see, this can lead to erroneous inference because a consistent base population is not used.

Here we borrow an illustration from [Powell 2010](https://www.nature.com/articles/nrg2865) to demonstrate the difference between IBD and IBS.

[Figure: illustration from Powell 2010 of IBD vs. IBS across two generations]

So in this figure, as long as the letter is the same, the alleles are IBS, so all the Gs and all the Ts are IBS, respectively. However, they also need to have the same background color to be IBD. For example, C1 and C2 are IBD, B3 and B4 are not IBD, and C4 and C5 are not IBD either. Note that this relationship is usually considered within the same generation, not across generations. Another thing to note in this figure is that the base population used for the estimation of IBD coefficients should be B1, B2, B3 and B4, not the current C1 to C5. This is why you need to specify the founders or any known pedigree info in the .fam file. I wonder whether gcta takes this info? I tried but it does not :(

What plink offers

PLINK provides tools to calculate genetic similarity between individuals using IBS and Hamming distance. IBS measures the proportion of alleles shared between two individuals across all markers. It ranges from 0 (no alleles shared) to 1 (all alleles shared). Hamming distance measures the mismatches between two individuals, so the two are inversely related; it is specified as ‘1-ibs’. You can choose based on whether you’d like a similarity matrix (ibs) or a distance matrix (1-ibs).

There is an option called flat-missing. The manual reads:

“Missingness correction: When missing calls are present, PLINK 1.9 defaults to dividing each observed genomic distance by (1 - <sum of missing variants’ average contribution to distance>). If MAF is nearly independent of missingness, this treatment is more accurate than the usual flat (1 - <missing call frequency>) denominator. However, if independence is a poor assumption, you can use the ‘flat-missing’ modifier to force PLINK 1.9 to apply the flat missingness correction.”

But how do I know whether MAF is dependent on missingness in my data? You can investigate their relationship in your own data by generating those two stats:

plink --bfile data --missing --out stats
plink --bfile data --freq --out stats

Then you can calculate the correlation between the F_MISS column in the .lmiss file and the MAF column in the .frq file. If it is significantly greater than 0, missingness and MAF are not independent. In my case it is almost 0.15, therefore I should turn on the flat-missing option.
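A minimal R sketch of that check (assuming the output prefix stats from the two commands above and PLINK’s default column names F_MISS and MAF):

lmiss <- read.table("stats.lmiss", header = TRUE)  # per-SNP missingness
frq <- read.table("stats.frq", header = TRUE)      # per-SNP minor allele frequency
m <- merge(lmiss, frq, by = "SNP")
cor.test(m$F_MISS, m$MAF)                          # are missingness and MAF correlated?

Then the command looks like: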

plink --bfile plink_data --distance ibs flat-missing --out ibs_distance

These metrics are useful for understanding relatedness, population structure, and data quality.

IBD at allele vs chromosome segment.

“In this definition of ‘chromosome segment IBD’ there is no need for a base population.”

Base Population and Why It Matters http://fanhuan.github.io/en/2025/02/19/Base-Population/ 2025-02-19

In the .fam file prepared for plink, there are two columns for you to specify each sample’s father (PID) and mother (MID) in this dataset, 0 if unknown. Those with both PID and MID as 0 are considered founders. Note that “By default, if parental IDs are provided for a sample, they are not treated as a founder even if neither parent is in the dataset.” In that case you need to manually make them founders via --make-founders.
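For reference, here is a minimal sketch of a .fam entry (columns: family ID, individual ID, father, mother, sex, phenotype; the IDs are made up). The first two samples have both parents set to 0 and are therefore founders; the third is not:

FAM1 ind1 0 0 1 -9
FAM1 ind2 0 0 2 -9
FAM1 ind3 ind1 ind2 1 -9

If some samples list parents that are not actually present in the dataset, something like plink --bfile data --make-founders --make-bed --out data_founders should reset them to founder status (this relies on the --make-founders flag quoted above; treat the exact invocation as an assumption and check it against the PLINK documentation).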

Why do we need founders? Because only they are included in some calculations such as minor allele frequencies/counts or Hardy-Weinberg equilibrium tests, both related to the concept of base population.

Traditionally, the probability that two alleles are IBD was most often calculated from a known pedigree and so the individuals at the top of the pedigree (the founders) form a natural base population, where the founders themselves are unrelated.

The probability that two alleles are IBD has to be defined with respect to a base (reference) population; that is, the two alleles are descended from the same ancestral allele in the base population.

The point of coalescence is the most recent common ancestor. The alleles there are in the ancestral state.

BLUP http://fanhuan.github.io/en/2025/02/19/BLUP/ 2025-02-19

https://rpubs.com/amputz/BLUP

Stars and Bars http://fanhuan.github.io/en/2025/02/17/Stars-And-Bars/ 2025-02-17

While checking on the generalization of HWE, I had to refresh my memory on multinomial expansion. For any positive integer m and any non-negative integer n, the multinomial theorem describes how a sum with m terms expands when raised to the nth power. I do remember that the sum of exponents in each term needs to equal the original power n, but I forgot how to calculate the coefficients.

Then I came across this method called stars and bars, which is used to determine how many terms a multinomial expansion has. I don’t remember when or whether I learnt this method in school, but in Chinese it is called “隔板法” (literally, the divider method; https://zh.wikipedia.org/zh-sg/%E9%9A%94%E6%9D%BF%E6%B3%95). It solves for the number of combinations of nonnegative integer indices k1 through km such that the sum of all ki is n. Let’s consider the case where we have 3 terms, a, b and c, and we want to expand to the power of 4, (a+b+c)^4. In this case, n=4 and m=3, and we need to split 4 stars into 3 groups, with 0–4 stars in each group. How? We only need 3-1=2 bars to place amongst those stars, and they will separate the stars into 3 groups.

It could be something like:


**|**| (a^2 * b^2 * c^0)

or

|*|*** (a^0 * b^1 * c^3)

As you can see, the number of combinations is (n + m - 1) choose (m - 1): there are altogether n+m-1 positions, and we need to choose (m-1) of them to place the bars. In our example, it would be 6 choose 2, which is 6!/(4! * 2!) = 15, and there are indeed 15 different combinations such as a^4 or b * c^3.
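A quick sanity check in R:

choose(4 + 3 - 1, 3 - 1)   # number of terms in (a+b+c)^4: 15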

A harder question would be how to understand the multinomial coefficients. Let’s keep thinking along the lines of stars and bars.

When we were thinking about the number of terms, we were thinking about where to put the bars, while treating the stars anonymously (they are just stars!). Now imagine that they are not. We actually have 4 distinct stars, 1, 2, 3, 4; then for the first example,

**|**| (a^2 * b^2 * c^0)

there would be 6 different groupings:

12|34|
13|24|
14|23|
23|14|
24|13|
34|12|

meaning the arrangements within each group should be divided out; therefore the coefficient is 4!/(2! * 2! * 0!) = 6, or in general terms, n!/(k1!k2!…km!) for each term.

Now let’s think about a special case where m = 2, meaning there is always just 2-1=1 bar. The bar can be placed at n+1 different positions; when it is placed at the kth position (with the bar in front of all the stars counting as k = 0), the coefficient is n!/(k! * (n-k)!), which is exactly (n choose k), the binomial coefficient.
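Again easy to check in R:

factorial(4) / (factorial(2) * factorial(2) * factorial(0))   # multinomial coefficient of a^2 * b^2 * c^0: 6
choose(4, 2)                                                  # binomial special case (m = 2): also 6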

Now thinking back on the generalization of HWE, there can be more than two alleles at one locus (more bars), or more than two sets of chromosomes, i.e. polyploidy (more stars), or a combination of both. But now we have no problem doing the expansion in any case.

I hope this post helps you to understand the multinomial expansion and generalization of HWE.

Inbreeding Coefficient http://fanhuan.github.io/en/2025/02/17/Inbreeding-Coefficient/ 2025-02-17

Inbreeding coefficient

The inbreeding coefficient is usually referred to as F. As we explained in the IBD vs. IBS post, F is actually the probability of identity-by-descent (IBD) of two alleles. If the two alleles are in the same diploid individual, then F is the inbreeding coefficient of the individual at this locus.

Inbreeding coefficient at locus level

It is defined as 1 - O(f(Aa))/E(f(Aa)), the observed frequency of heterozygotes over the expected frequency. The expectation is based on Hardy–Weinberg equilibrium; for a biallelic locus with allele frequency p, E(f(Aa)) = 2p(1-p). It can be generalized to multi-allelic loci and to polyploidy. See more in this post.

Inbreeding coefficient at individual level

But as you can see in the output of plink --het, each sample gets an F. It uses O(HOM), the number of observed homozygous loci (3rd column), E(HOM), the number of expected homozygous loci (4th column), and N(NM), the number of non-missing loci (5th column), to calculate F in the 6th column using the equation (O(HOM) - E(HOM)) / (N(NM) - E(HOM)). The higher the F, the more inbred, i.e. more homozygous than expected. O(HOM) and N(NM) are easy to count. For a locus with MAF of p, its expected homozygosity under HWE is 1-2p(1-p); we sum this over all the loci where the individual has a genotype (no missing data) to get E(HOM). If there is no missing data, E(HOM) is the same for all individuals.
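As a quick check, one can recompute F from those columns in R (a sketch assuming the default output name plink.het; read.table turns the headers O(HOM), E(HOM) and N(NM) into O.HOM., E.HOM. and N.NM.):

het <- read.table("plink.het", header = TRUE)
F_manual <- (het$O.HOM. - het$E.HOM.) / (het$N.NM. - het$E.HOM.)
all.equal(F_manual, het$F)   # should match the reported F column, up to rounding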

GRM

I was using gcta --make-grm and the values on the diagonals are supposed to be 1+F. However, I found a huge discrepancy between this value and the F reported by plink --het, and I was wondering why. I tried to follow the Methods on a toy dataset as described in Yang 2011 NG, and I understand they are doing very different things, but the general trend should be the same given the same dataset.

DeepSeek R1 Deployment http://fanhuan.github.io/en/2025/02/04/DSR1-Deployment/ 2025-02-04

OK, everybody is talking about DeepSeek and I wanted to see for myself.

I am following the note from Xihan Li.

Specs of my desktop.

  • RAM (free -h): 128G
  • GPU(lspci | grep -i vga): two RTX A4500, each has 20 GB of GDDR6 memory.

I guess I will try the smallest one (1.58-bit, 131GB).

GRM for Family Data http://fanhuan.github.io/en/2025/01/20/GRM-Papers/ 2025-01-20

0: Unrelated individuals

In a previous post we talked about how the GRM is calculated in Professor Yang Jian’s landmark 2010 NG paper. It is for unrelated individuals, where it is assumed that the average relationship between all pairs of individuals is 0 and the average relationship of an individual with itself is 1 (see the last paragraph of the Statistical framework part of the ONLINE METHODS section). The relationship of an individual with itself provides an unbiased estimate of the inbreeding coefficient (F), with a mean of 1+F and variance of 1 when F=0. F for each locus is one minus the observed frequency of heterozygotes over that expected from Hardy–Weinberg equilibrium. When more heterozygosity is observed than expected, F<0, outbreeding; when less heterozygosity is observed than expected, 0<F<1, inbreeding; when the observed equals the expected, F=0, thus E(1+F)=1. Therefore, the higher the GRM diagonal value for an individual, the more homozygous it is relative to the HWE expectation, i.e. the lower its degree of heterozygosity.

There is a slightly different flavor of this version:

--make-grm-alg 0: The default value is 0, and the GRM is calculated using the equation sum{[(xij - 2pi)(xik - 2pi)] / [2pi(1-pi)]} as described in Yang et al. 2010 Nat Genet. If the value = 1, the GRM will be calculated using the equation sum[(xij - 2pi)(xik - 2pi)] / sum[2pi(1-pi)].

For my data it does not make a big difference.

1: Inbred data

On the same page as the usual --make-grm, there is an option called --make-grm-inbred: “Make a GRM for an inbred population such as inbred mice or inbred crops.” Note the difference between inbred data and family data. Inbred data usually refers to a very low degree of heterozygosity, which is usually the result of generations of inbreeding, whereas family data still have a good amount of heterozygosity; this definition focuses on the pedigree. In the Citation part, two papers were mentioned, Yang 2010 NG and Yang 2011 AJHG (the GCTA paper). However, when I searched for the word inbred in both papers, there were no hits. Therefore, theoretically, I do not know what happens when you use this option. However, I did compare the two GRMs resulting from the two options with the same input data; let’s call them GRM and GRM_inbred, and at least for my data, GRM = GRM_inbred * 2.

2: Family data

GCTA offers an implementation of the method proposed by Zaitlen et al. 2013 PLoS Genetics. This is their description of the method:

…estimate pedigree-based and SNP-based h2 simultaneously in one model using family data. The main advantage of this method is that it allows us to estimate SNP-based h2 in family data without having to remove related individuals.

Their documentation is great. Basically you make a GRM with --make-grm, and then you create another GRM based on it using --make-bK. This sets the off-diagonals (relationships) that are lower than a user-defined threshold (i.e. unrelated pairs) to 0, so this second GRM captures the related individuals only. You can include both matrices in your model as random effects if you have a mix of unrelated (the first GRM) and related (the second GRM) individuals. For family data where everyone is related, you could just use the second GRM.
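For reference, the commands look roughly like this (a sketch based on the GCTA documentation; the 0.05 threshold, file names and the multi-GRM list file are placeholders to adapt):

gcta64 --bfile test --make-grm --out test_grm
gcta64 --grm test_grm --make-bK 0.05 --out test_grm_bK
gcta64 --reml --mgrm multi_grm.txt --pheno test.phen --out test_reml

where multi_grm.txt simply lists the two GRM prefixes (test_grm and test_grm_bK), one per line.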

Kinship matrix (II) -- GRM http://fanhuan.github.io/en/2024/12/12/GRM/ 2024-12-12

In a previous post we talked about the first way to estimate a kinship matrix, which is through pedigree. In this post we will cover how to do it with genomic data.

This process is described in detail in the methods section of Professor Yang Jian’s landmark 2010 NG paper, and you can use gcta to generate one. Here I am trying to understand it by recreating the process. See this post for more on different flavors of GRM.

Step 1: genotype matrix or SNP matrix.

This part is simple. It’s just converting the vcf to a coded matrix where ref/ref is 0, ref/alt is 1 and alt/alt is 2. Note that this is different from rrBLUP, where the convention is {-1/0/1}. See more about genotype coding in a previous post. The matrix usually has n rows (number of samples) and m columns (m variants or SNPs).

Step 2: scaling of the genotype matrix.

This genotype matrix is then scaled column-wise, meaning each variant is scaled on its own based on its allele frequency in this dataset, with the assumption that each locus/SNP is under Hardy–Weinberg equilibrium (HWE). You see two assumptions are made here. 1. This dataset is representative of the base population, therefore the allele frequencies are also representative. 2. Any assumptions that HWE carries, such as an infinite random-mating population.

The scaling of the genotype matrix is to ensure that each variant contributes equally to the total genetic variance regardless of its allele frequency. Mathematically, it means reaching a mean of 0 and a variance of 1. If the allele frequency of the alternative allele at variant i is pi, then we scale by subtracting the mean, which is 2*pi^2 + 1*2*pi*(1-pi) + 0*(1-pi)^2 = 2pi, and dividing by the standard deviation; the variance 2*pi*(1-pi) follows from Var(xij) = E(xij^2) - E(xij)^2, as shown below. Then {0,1,2} (denoted xij) becomes (xij - 2pi)/sqrt(2*pi*(1-pi)).
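Writing the variance derivation out explicitly (a quick check under HWE):

$$
\begin{aligned}
E(x_{ij}) &= 2\,p_i^2 + 1\cdot 2p_i(1-p_i) + 0\cdot(1-p_i)^2 = 2p_i,\\
E(x_{ij}^2) &= 4\,p_i^2 + 1\cdot 2p_i(1-p_i) = 4p_i^2 + 2p_i(1-p_i),\\
\mathrm{Var}(x_{ij}) &= E(x_{ij}^2) - E(x_{ij})^2 = 2p_i(1-p_i).
\end{aligned}
$$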

To demonstrate, let’s have a toy dataset with 5 individuals and 3 SNPs that are causal for a certain trait.

| xij | SNP1 | SNP2 | SNP3 |
|---|---|---|---|
| Individual 1 | 1 | 1 | 2 |
| Individual 2 | 0 | 1 | 0 |
| Individual 3 | 1 | 0 | 1 |
| Individual 4 | 2 | 0 | 0 |
| Individual 5 | 1 | 0 | 0 |
| SUM | 5 | 2 | 3 |
| pi | 0.5 | 0.2 | 0.3 |

First let’s calculate the pi: p1 = (1+0+1+2+1)/10 = 0.5, p2 = 0.2 and p3 = 0.3. Then use the equation (xij - 2pi)/sqrt(2*pi*(1-pi)). Here is a snippet of code for this calculation:

scale_geno <- function(fi, geno){
  # fi: alternative allele frequency of this SNP; geno: vector of 0/1/2 genotypes
  qq <- -2*fi/sqrt(2*fi*(1-fi))       # scaled value for genotype 0
  Qq <- (1-2*fi)/sqrt(2*fi*(1-fi))    # scaled value for genotype 1
  QQ <- (2-2*fi)/sqrt(2*fi*(1-fi))    # scaled value for genotype 2
  a <- ifelse(geno == 0, qq, ifelse(geno == 1, Qq, QQ))
  return(a)
}

In a table it looks like:

| genotype\pi | 0.5 | 0.2 | 0.3 |
|---|---|---|---|
| qq/0 | -1.41 | -0.71 | -0.93 |
| Qq/1 | 0 | 1.06 | 0.62 |
| QQ/2 | 1.41 | 2.82 | 2.16 |

and the scaled genotype matrix (zij) looks like the following. The sum of each column is 0 so the mean is also 0, and the variance of the whole matrix is 1.1, close to the expectation.

| zij | SNP1 | SNP2 | SNP3 |
|---|---|---|---|
| Indi 1 | 0 | 1.06 | 2.16 |
| Indi 2 | -1.41 | 1.06 | -0.93 |
| Indi 3 | 0 | -0.71 | 0.62 |
| Indi 4 | 1.41 | -0.71 | -0.93 |
| Indi 5 | 0 | -0.71 | -0.93 |
| SUM | 0 | 0 | 0 |

Step 3: turning a genotype matrix (n x m) into a relationship matrix (n x n)

By now, we should be able to obtain a relationship matrix between these five individuals by computing ZZ’/m (m = 3). Let’s call it the G matrix; it is 5 x 5 in this case.

| G | Ind1 | Ind2 | Ind3 | Ind4 | Ind5 |
|---|---|---|---|---|---|
| Ind1 | 1.93 | 0.30 | 0.20 | -0.92 | -0.92 |
| Ind2 | 0.30 | 0.70 | -0.44 | -0.63 | 0.04 |
| Ind3 | 0.20 | -0.44 | 0.30 | -0.02 | -0.02 |
| Ind4 | -0.92 | -0.63 | -0.02 | 1.12 | 0.46 |
| Ind5 | -0.92 | 0.04 | -0.02 | 0.46 | 0.46 |
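Putting steps 1 to 3 together, here is a minimal R sketch of the same computation on the toy data (base R only):

# toy genotype matrix: 5 individuals x 3 SNPs, coded as alt-allele counts 0/1/2
X <- matrix(c(1,1,2,
              0,1,0,
              1,0,1,
              2,0,0,
              1,0,0), nrow = 5, byrow = TRUE)
p <- colMeans(X) / 2                             # alt-allele frequencies: 0.5, 0.2, 0.3
Z <- sweep(X, 2, 2 * p, "-")                     # centre each SNP at its mean 2p
Z <- sweep(Z, 2, sqrt(2 * p * (1 - p)), "/")     # scale by the HWE standard deviation
G <- Z %*% t(Z) / ncol(X)                        # n x n relationship matrix, ZZ'/m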

In reality, we know little about where the causal SNPs are. Instead, we will just use all the SNPs, as long as some of them are tightly linked to the actual QTLs. However, this assumes that the allele frequency we see in our dataset holds true for the population, which ignores the sampling error associated with each SNP. In order to improve the estimate of G, Yang 2010 proposed a weighted average across all SNPs. In this method, the values on the off-diagonals are the same as ZZ’/m. The only adjustment happens on the diagonals (equation 5 in the image below). Specifically, they are defined as 1+F, and F is called the inbreeding coefficient. I know how to calculate the inbreeding coefficient when considering one locus, which is 1 - H(O)/H(E), H(O) for the observed number of heterozygotes and H(E) for the expected number of heterozygotes under HWE. However, I do not know how to calculate F when there are multiple loci. The j=k part in Equation 6 sums over all the SNPs and returns a 1+F for each individual based on Eq. 5, but I do not understand why Eq. 5 is true. They lost me at the “when j=k, var(Aijj)” part. But let’s just apply it to our small toy dataset for now.

[Image: equations 5 and 6 from Yang et al. 2010]

Aijj <- function(p, x){
  # diagonal element for one SNP (Yang et al. 2010, eq. 5): 1 + F at this locus;
  # note the parentheses around the denominator 2p(1-p)
  f <- (x^2 - (1+2*p)*x + 2*p^2)/(2*p*(1-p))
  a <- 1 + f
  return(a)
}
| Aijj | SNP1 | SNP2 | SNP3 | Ajj (mean) |
|---|---|---|---|---|
| Indi 1 | 0.9375 | 0.9744 | 1.1029 | 1.0049 |
| Indi 2 | 1.0625 | 0.9744 | 1.0189 | 1.0186 |
| Indi 3 | 0.9375 | 1.0064 | 0.9559 | 0.9666 |
| Indi 4 | 1.0625 | 1.0064 | 1.0189 | 1.0293 |
| Indi 5 | 0.9375 | 1.0064 | 1.0189 | 0.9876 |

Note that for heterozygotes (1) the number is smaller than 1 (f is negative), and for homozygotes the number is greater than 1 (f is positive). This matches our understanding that homozygosity increases with inbreeding. The last column is the average over i, i.e. across the SNPs. Now we can see that among those 5 individuals, numbers 3 and 5 are outbred and the rest are inbred. But this does not match intuition: Indi 1 has two heterozygous sites out of 3, so why is it above 1? Need to check further. Assuming this is correct, we can now replace the diagonals with the new calculations.

| A | Ind1 | Ind2 | Ind3 | Ind4 | Ind5 |
|---|---|---|---|---|---|
| Ind1 | 1.00 | 0.30 | 0.20 | -0.92 | -0.92 |
| Ind2 | 0.30 | 1.02 | -0.44 | -0.63 | 0.04 |
| Ind3 | 0.20 | -0.44 | 0.97 | -0.02 | -0.02 |
| Ind4 | -0.92 | -0.63 | -0.02 | 1.03 | 0.46 |
| Ind5 | -0.92 | 0.04 | -0.02 | 0.46 | 0.99 |

We can see that the order changed. Previously, from low to high, it was 3,5,2,4,1; now it is 3,5,1,2,4. Still something odd with Indi 1. But the values on the diagonals are closer to 1 compared to G.

Currently I am using gcta and then converting the output from binary format to text format. gcta offers a snippet of code for the conversion but it did not work for me. Here is my snippet:

read_grm <- function(prefix) {
  # Construct file paths based on the prefix
  grm_bin <- paste0(prefix, ".grm.bin")  # Binary file containing the GRM
  grm_id <- paste0(prefix, ".grm.id")    # File containing IDs (individuals)
  
  # Step 1: Read the ID file
  grm_ids <- read.table(grm_id, header = FALSE, stringsAsFactors = FALSE)
  colnames(grm_ids) <- c("FID", "IID")
  
  # Step 2: Read the binary GRM file
  # Read the size of the binary file (each element is stored as a 4-byte float)
  n <- nrow(grm_ids)  # Number of individuals
  grm_data <- readBin(grm_bin, what = "numeric", size = 4, n = n * (n + 1) / 2)
  
  # Step 3: Reshape the GRM data into a matrix
  # Initialize an empty matrix to store the GRM values
  grm_matrix <- matrix(0, n, n)
  
  # Fill the matrix from the GRM binary file (which is in lower triangular format)
  index <- 1
  for (i in 1:n) {
    for (j in 1:i) {
      grm_matrix[i, j] <- grm_data[index]
      grm_matrix[j, i] <- grm_data[index]  # Symmetrize the matrix
      index <- index + 1
    }
  }
  
  # Add row and column names using the IDs
  rownames(grm_matrix) <- grm_ids$IID
  colnames(grm_matrix) <- grm_ids$IID
  
  # Return the GRM matrix
  return(grm_matrix)
}
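Usage is then just (assuming the prefix passed to gcta --make-grm was test):

grm <- read_grm("test")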

Is there R code that takes the genotype matrix and gives the GRM? I am sure I am reinventing the wheel here. Let’s try A.mat from rrBLUP.

library(rrBLUP)

# fill by row so that each row is an individual (byrow = TRUE)
M <- matrix(c(1,1,2,
              0,1,0,
              1,0,1,
              2,0,0,
              1,0,0), 5, 3, byrow = TRUE)
# rrBLUP codes genotypes as {-1,0,1}
M <- M-1
A <- A.mat(M)

OK we are not done yet!

Now let’s hand-calculate the right side of the methods section of Yang 2010, “Unbiased estimate of the relationship at the causal variants and the genetic variance”, or A*.

The result is very different… OK let’s explore more tomorrow. hand calculation vs. gcta bin to txt vs. A.mat.

Goodness of Fit http://fanhuan.github.io/en/2024/12/03/Goodness-of-fit/ 2024-12-03

We keep hearing this phrase, goodness of fit, sometimes hyphenated. But I never paused to think about what it is. I just searched my papers, where I have several (close to 10!) statistical ebooks, and none of them mentions anything about it. Oh well. Wiki it is. Btw, what would you do? ChatGPT? I wish I could fix the comment part of my blog…

OK, here is what Wiki says about goodness of fit.

“The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question. Such measures can be used in statistical hypothesis testing, e.g. to test for normality of residuals, to test whether two samples are drawn from identical distributions (see Kolmogorov–Smirnov test), or whether outcome frequencies follow a specified distribution (see Pearson’s chi-square test). In the analysis of variance, one of the components into which the variance is partitioned may be a lack-of-fit sum of squares.”

So to paraphrase, goodness of fit is a way to evaluate statistical models, and it focuses on how well the model (the expectations) fits the observations. For example, R2 is a goodness-of-fit measure. This led me to think about what other ways of evaluating statistical models there could be. Recalling the steps we take after constructing a linear model, there are diagnostic tests (residual checks), model comparison, significance of coefficients, etc. Here is a summary table from ChatGPT:

[Image: summary table from ChatGPT of ways to evaluate statistical models]

However, as you can see, nothing was mentioned about the significance of coefficients. When I asked ChatGPT, it said: “Testing whether a coefficient in a linear regression model is significant is not typically classified as a type of model evaluation. Instead, it is considered part of inference or hypothesis testing about the relationships between variables in the model.” Oh my. Inference.

Brian’s understanding of inference

When I was taking JHU’s Data Science Specialization on Coursera, one of the courses was Statistical Inference. It comes after Reproducible Research and before Regression Models. In the beginning of the course, Brian Caffo defined inference as the process of drawing formal conclusions from data, which is further defined as settings where one wants to infer facts about a population using noisy statistical data where uncertainty must be accounted for. Not very conclusive. Later in the course we talked about probability, conditional probability, expectations, variance, common distributions, asymptopia (the law of large numbers and the central limit theorem), t confidence intervals, hypothesis testing, p-values, power, multiple testing and resampling. So, some basic statistical concepts.

Kyle’s understanding of inference

In Kyle’s advanced statistics course, which I co-teach, he did mention inference, and back then I did pause to contemplate this word. On the slide for Inference he says:

  1. How to evaluate whether our model fits the data well? This includes goodness-of-fit measures such as R2 and diagnostic tests that evaluate residuals.
  2. How to evaluate whether all our predictors are useful for the model? This includes t-tests or ANOVA that evaluate model parameters. This is usually referred to as hypothesis testing, where we assume a null hypothesis (for example, that a coefficient is zero) and test it against the data (see the small example after this list).
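A small R example of the two kinds of evaluation side by side (using the built-in mtcars data, purely for illustration):

fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$r.squared      # goodness of fit: R^2
summary(fit)$coefficients   # inference: estimates, standard errors and t-tests per coefficient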

Wiki’s inference

When I looked on Wiki, statistical inference is mentioned as opposed to descriptive statistics, which mainly includes “measures of central tendency and measures of variability or dispersion. Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation (or variance), the minimum and maximum values of the variables, kurtosis and skewness”.

“Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.”

Before we can evaluate whether the predictors are useful for the model, we first need to find/solve for the parameters/coefficients. How are parameters found in models? I can think of four ways:

  1. LSE: least squares estimation
  2. MLE: maximum likelihood estimation
  3. Bayesian: summarizing the posterior
  4. Loss function: machine learning.

Then I asked ChatGPT to give me a more comprehensive table:

[Image: ChatGPT’s more comprehensive table of parameter-estimation methods]
