Huan Fan, huan.fan@wisc.edu, http://fanhuan.github.io
http://fanhuan.github.io/en/2024/10/16/2022-09-02-Tomlinson-2016/ 2024-10-16
Link to the article

Title: Defence against vertebrate herbivores trades off into architectural and low nutrient strategies amongst savanna Fabaceae species. In other words, to resist browsing by vertebrate herbivores (insects excluded), some savanna Fabaceae species have evolved either a physical (architectural) defence strategy or a low-nutrient strategy.

Abstract

Herbivory contributes substantially to plant functional diversity, in ways that move far beyond direct defence trait patterns, as effective growth strategies under herbivory require modification of multiple functional traits that are indirectly related to defence. In order to understand how herbivory has shaped plant functional diversity, we need to consider the physiology and architecture of the herbivores and how this constrains effective defence strategies (in short, an arms race between plant and herbivore). Here we consider herbivory by mammals in savanna communities that range from semi-arid to humid conditions.

We posited that the saplings of savanna trees can be grouped into two contrasting defence strategies against mammals, namely architectural defence versus low nutrient defence (or, jokingly, the armoured type and the not-worth-eating type).

We provide a mechanistic explanation for these different strategies based on the fact that plants are under competing selection pressures to limit herbivore damage and outcompete neighbouring plants.

Plant competitiveness depends on growth rate, itself a function of leaf mass fraction (LMF) and leaf nitrogen per unit mass (Nm); growth rate increases with both (their Equation 1).

Architectural defence against vertebrates (which includes spinescence) limits herbivore access to plant leaf materials, and partly depends on leaf-size reduction, thereby compromising LMF. Low nutrient defence requires that leaf material is of insufficient nutrient value to support vertebrate metabolic requirements, which depends on low Nm. Thus there is an enforced tradeoff between LMF and Nm, leading to distinct trait suites for each defence strategy. (Can't a plant do both? It can, but once resources are split between the two strategies it will be outcompeted by plants that commit everything to one; choosing well beats working hard.) We demonstrate this tradeoff by showing that numerous traits can be distinguished between 28 spinescent (architectural defenders) and non-spinescent (low nutrient defenders) Fabaceae tree species from savannas, where mammalian herbivory is an important constraint on plant growth. (Still no mention of water up to this point...)
Distributions of the strategies along an LMF-Nm tradeoff further provides a predictive and parsimonious explanation for the uneven distribution of spinescent and non-spinescent species across water and nutrient gradients. (Since species fall along this LMF-Nm tradeoff, spinescent plants have in effect chosen low LMF and hence high Nm, so presumably they need to live where nutrient conditions are better. But what does water have to do with it? Photosynthesis? Hence the uneven distribution of spinescent versus non-spinescent species across water and nutrient gradients.)

That last sentence really comes out of nowhere, because the experimental design section does not mention any water or nutrient treatment. Of course, not mentioning it in the abstract does not mean it was not done. Let's just say my supervisor really does not care much about abstracts.

http://fanhuan.github.io/en/2024/10/16/2020-12-05-kraken2/ 2024-10-16 Huan Fan

I am dealing with a plant transcriptome dataset in which I am mining for microbial signals. As you might have guessed, I have tried HuMANN3-alfa (please see the previous post). With its

http://fanhuan.github.io/en/2024/10/16/2019-07-17-The-IT-Crowd/ 2024-10-16 Huan Fan
"Oh you are a data scientist! Can you fix my computer?"
"No."

Today I was asked to install a Linux system on a laptop that had Windows on it. This is not the type of work I'd like to take at all, but I needed to go to a long meeting, so I thought I could do this in the background since most of the time you just wait. But when I asked what type of Linux this person wanted (I assumed Ubuntu), I was told CentOS. I had never installed it before, so here is a note to remind myself not to take dirty jobs like this.

Step 1: Making a USB stick as your installer

I found this helpful post on setting up a USB key to install CentOS. The only problem is that I have a Mac, so I also needed this post to help with the use of dd. In short:

  1. Download a CentOS ISO.
  2. Plug in your USB stick and find out where it is mounted with diskutil list. In my case it is mounted at /dev/disk2.
  3. Unmount the disk: diskutil unmountDisk /dev/disk2
  4. Write the image to it: sudo dd if=CentOS-7-x86_64-DVD-1810.iso of=/dev/disk2. Note that the CentOS image is much bigger than Ubuntu's, so this step takes a while (3-4 h).

Step 2: Installation

From now on follow the actual post.

On a Lenovo, press F12 to be able to select the USB drive for booting.

The first warning came with disk space. I deleted one of the partitions; about 100 GB was needed.

Everything else is intuitive, except that the post used the minimal install, which is equivalent to the server version of Ubuntu. I selected the GNOME version since this was a laptop for personal use only. This also explains why the ISO was so big: it contains all the variants of CentOS.

http://fanhuan.github.io/en/2024/10/16/2019-07-17-Metagenome-Sequencing-Effort/ 2024-10-16 Huan Fan
"How much sequencing should I get for each sample?" asked the experimental scientist.
"Depends on your sample," answers the bioinformatician.

You got some money for a sequencing project. You have done your experimental design to make sure each treatment has a fair number of replicates, and then comes the million-dollar question: how much sequencing should you get?

Variance, Covariance, Correlation, Variation and Covariation
http://fanhuan.github.io/en/2024/10/15/Some-Concepts/ 2024-10-15 Huan Fan

Variance is the most basic one. It measures the spread or dispersion of a single variable's values around its mean: Var(X) = (1/n) * Σ(x_i - mean)^2. Standard deviation is sqrt(Var(X)).

Covariance measures the degree to which two variables change together. It indicates whether two variables tend to increase or decrease in tandem: Cov(X,Y) = (1/n) * Σ(x_i - x_mean)(y_i - y_mean). You can see that if x_i and y_i are both greater or both smaller than their means, the product will be positive. If they move in opposite directions, the product is negative and Cov(X,Y) will be smaller.

Correlation is standardized covariance. The formula is cor(y1, y2) = cov(y1, y2) / sqrt(var(y1) * var(y2)). If the variables y1 and y2 are already standardized (mean = 0, sd = 1), then cor(y1, y2) = cov(y1, y2).
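These definitions are easy to check with a few lines of Python (using the population 1/n versions written above; the function names are mine):

```python
import math

def variance(xs):
    # population variance: Var(X) = (1/n) * sum((x_i - mean)^2)
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def covariance(xs, ys):
    # Cov(X,Y) = (1/n) * sum((x_i - x_mean) * (y_i - y_mean))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    # correlation is covariance standardized by the two standard deviations
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # y = 2x, so cor(x, y) is exactly 1
```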

Both variation and covariation are broader terms compared to variance and covariance, which are precise statistical terms with defined equations.

Genotype Decode
http://fanhuan.github.io/en/2024/10/15/Genotype-Decode/ 2024-10-15 Huan Fan

As you may have noticed, there are many ways of denoting genotypes, even though it is usually a pretty straightforward thing for diploids: R/R, R/A and A/A.

  1. VCF original: 0/0: R/R; 0/1: R/A; 1/1: A/A. ./.: missing call
  2. Dosage of ALT: 0/1/2/NA.
    1. Rule: 0/0: 0; 0/1: 1; 1/1: 2; ./.: NA. Note that heterozygotes are always represented as 0/1; no 1/0 would be called.
    2. Tools: you can use plink --recode A to turn the VCF (or .bed) into this format.
  3. {-1,0,1}
    1. Rule: I am not sure about this one. I have read that this should be the dosage of the major allele, not the REF allele, but currently I am using it as if -1: R/R; 0: R/A; 1: A/A.
    2. Tools: both {sommer} and {rrBLUP} use this format. Or to be more accurate, they can deal with any scaled (mean = 0) SNP matrix.
  4. 0/1. I actually do not know what this means. Homo and Hetero?
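As a minimal sketch, formats 1-3 can be interconverted like this (the helper names are mine, not from plink or any package):

```python
# Hypothetical helpers: map VCF GT strings to ALT-allele dosage (0/1/2/None)
# and to the {-1, 0, 1} coding used by {sommer} and {rrBLUP}.
GT_TO_DOSAGE = {"0/0": 0, "0/1": 1, "1/1": 2, "./.": None}

def to_dosage(gt):
    return GT_TO_DOSAGE[gt]

def to_centered(gt):
    # dosage - 1 gives -1 (R/R), 0 (R/A), 1 (A/A); missing stays None
    d = to_dosage(gt)
    return None if d is None else d - 1
```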

There are also tools that do not care about the format, e.g. {BGLR}. You can give it either the 0/1/2 format or the -1/0/1 format. You just need to scale the matrix before use so that each column has a mean of 0 and a standard deviation of 1. This ensures that all SNPs are on a comparable scale and contribute equally to the model, which is particularly important when you have variables with different ranges (like 0, 1, 2 in the SNP matrix). Without scaling, SNPs with larger variance could disproportionately influence the model compared to those with smaller variance. In Bayesian ridge regression, this step ensures that the prior regularization is applied more evenly across all SNPs.

If you are going to calculate h2 based on Vu, you also need to further scale the matrix by the number of SNPs, so the random effect takes the appropriate portion of the total variance: scale(X)/sqrt(ncol(X)). y also needs to be standardized in order to calculate h2, so it is comparable across traits. The A matrix also needs to be standardized, which you can do with A/mean(diag(A)).
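These scaling steps can be sketched with numpy (function names are mine; this mirrors R's scale() followed by division by sqrt(ncol(X)), and assumes no monomorphic columns, which would give sd = 0):

```python
import numpy as np

def scale_snps(X):
    # column-standardize: each SNP column gets mean 0 and sd 1 (like R's scale())
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)  # sample sd (n - 1), matching R
    return (X - mu) / sd

def scale_for_h2(X):
    # further divide by sqrt(number of SNPs) so the random effect
    # accounts for the appropriate share of the total variance
    return scale_snps(X) / np.sqrt(X.shape[1])
```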

For methods that involve regularization (ridge regression, lasso, etc.), scaling also helps maintain stability and interpretability of the posterior distributions for the SNP effect sizes. Here regularization means favoring flatter (shrunken) estimates in order to prevent overfitting. Ridge regression is usually referred to as L2 regularization and lasso as L1 regularization. The YouTube channel StatQuest with Josh Starmer has excellent videos on this topic that you can check out if you are not familiar with them yet.

A Matrix
http://fanhuan.github.io/en/2024/10/14/A-Matrix/ 2024-10-14 Huan Fan

What is an A matrix? A stands for additive, so it is the additive relationship matrix derived from pedigree. It is also known as the numerator relationship matrix, though I did not know why at first (it will become clear below). It is a matrix used in quantitative genetics and animal breeding that represents the expected genetic relationships between individuals in a population. It is calculated using pedigree information and captures the probability that two alleles in two individuals are identical by descent (IBD).

The A matrix usually appears in the random effect part as the variance-covariance (vcv) structure in mixed models. "Statistics prior to animal breeding was not very concerned with predicting random effects. These were somewhat seen as nuisance parameters." Here nuisance parameters are parameters that are not of primary interest but still affect the model and its estimates, which is true of the random effects in a normal mixed model.

Here we will follow the R code in Austin Putz’s post.

# create original pedigree
ped <- matrix(cbind(c(3:6), c(1,1,4,5), c(2,0,3,2)), ncol=3)

# change row/col names
rownames(ped) <- 3:6
colnames(ped) <- c("Animal", "Sire", "Dam")

# print ped
print(ped)

This won't work because, in a pedigree, everyone in the 2nd and 3rd columns (the parents) needs to also be in the first column (we need to know the parents of everyone). Now we have 1,2,3,4,5 in the 2nd and 3rd columns, but only 3,4,5 in the first column, which means we need to add 1 and 2 to the first column. If we do not know their parents, just use 0. Note that a pedigree needs to be sorted from oldest (top) to youngest (bottom), meaning parents go before offspring.

ped <- matrix(cbind(c(1:6), c(0,0,1,1,4,5), c(0,0,2,0,3,2)), ncol=3)

# change row/column names
rownames(ped) <- 1:6
colnames(ped) <- c("Animal", "Sire", "Dam")

# print matrix
print(ped)

Then the post gives the logic for generating the A matrix. Basically you need to generate the off-diagonals first. The relationship between individual 1 and individual 2 is defined as the average relationship between individual 1 and the parents of individual 2:

a(ind1, ind2) = 0.5 * (a(ind1, sire2) + a(ind1, dam2))

After you have the off-diagonals, the diagonals are easy:

a(i, i) = 1 + 0.5 * a(sire_i, dam_i), and since every sire and dam also appears in column 1, a(sire_i, dam_i) is one of the off-diagonals.

The only thing that gets in the way is the 0s, where the parent info is unknown.

Let’s look at the code.

createA <-function(ped){
    
    if (nargs() > 1 ) {
      stop("Only the pedigree is required (Animal, Sire, Dam)")
    }
    
    # This is changed from Gota's function
    # Extract the sire and dam vectors
    s = ped[, 2]
    d = ped[, 3]
    
    # Stop if they are different lengths
    if (length(s) != length(d)){
      stop("size of the sire vector and dam vector are different!")
    }
    
    # set number of animals and empty vector
    n <- length(s)
    N <- n + 1
    A <- matrix(0, ncol=N, nrow=N)
    
    # set sires and dams (use n+1 if parents are unknown: 0)
    s <- (s == 0)*(N) + s
    d <- (d == 0)*N + d
    
    start_time <- Sys.time()
    # Begin for loop
    for(i in 1:n){
      
      # equation for diagonals
      A[i,i] <- 1 + A[s[i], d[i]]/2
      
      for(j in (i+1):n){    # only do half of the matrix (symmetric)
        if (j > n) break
        A[i,j] <- ( A[i, s[j]] + A[i, d[j]] ) / 2  # half relationship to parents
        A[j,i] <- A[i,j]    # symmetric matrix, so copy to other off-diag
      }           
    }
    
    # print the time it took to complete
    cat("\t", sprintf("%-30s:%f", "Time it took (sec)", as.numeric(Sys.time() - start_time)), "\n")
    
    # return the A matrix
    return(A[1:n, 1:n])
    
  }

The first thing to notice is that it starts as a n+1 by n+1 matrix filled with 0. This means unknown relationships are by default 0 unless changed later.

Secondly, it fills by the order of a11, a12, a13 all the way to a16, then a22, a23 to a26, etc. This is why ordering from old to young is so important: the eldest ones are the ones with unknown parents, therefore a11 is always 1 (a77 is 0, with 7 standing for the unknown parent).

Now that we have the A matrix, how do we understand those numbers intuitively?

On the off-diagonal:

  1. "0.5": a13, a14 and a23 are parent-offspring pairs, i.e. 50% inheritance. For a15, 1 is the father of 3 and 4, and 5 is the child of 3 and 4, so still 50%.
  2. “0.25”: a16 = (a1,5 + a1,2)/2 = ((a1,4 + a1,3)/2 + a1,2)/2 = ((0.5 + 0.5)/2 + 0)/2 = 0.25.
  3. “0”: a12, a24: a12 = (a1,7 + a1,7)/2 = 0. a24=(a2,1 + a2,7)/2 = (0+0)/2 = 0

On the diagonal:

  1. "1": a11, a22 and a44: they all have one or two 0s (unknown parents) in their parent info, therefore a(sire, dam) is always 0.
  2. "1": a33 = 1 + 0.5 * a12 = 1, since a12 = 0.
  3. "1.125": a55 and a66. These are still on the diagonal: a55 = 1 + 0.5 * a34, and a34 = (a13 + a37)/2 = (0.5 + 0)/2 = 0.25, so a55 = 1 + 0.5 * 0.25 = 1.125. Same with a66. But what does it mean for a diagonal to exceed 1? You can think about it in terms of uneven variances in the generalized least squares case.
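To double-check these numbers, here is the same tabular method re-implemented as a Python sketch (the function name is mine; this is not Austin Putz's code, just the same two rules applied to the 6-animal pedigree above):

```python
import numpy as np

def make_A(sire, dam):
    # tabular method: animals numbered 1..n, parents before offspring, 0 = unknown
    n = len(sire)
    A = np.zeros((n + 1, n + 1))               # extra row/col = "unknown" parent, stays 0
    s = [x - 1 if x > 0 else n for x in sire]  # 0-based indices; unknown -> n
    d = [x - 1 if x > 0 else n for x in dam]
    for i in range(n):
        A[i, i] = 1 + A[s[i], d[i]] / 2        # diagonal: 1 + half the parents' relationship
        for j in range(i + 1, n):
            A[i, j] = (A[i, s[j]] + A[i, d[j]]) / 2  # half relationship to j's parents
            A[j, i] = A[i, j]                  # symmetric
    return A[:n, :n]

# the 6-animal pedigree from the post
A = make_A(sire=[0, 0, 1, 1, 4, 5], dam=[0, 0, 2, 0, 3, 2])
```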

Now that we know the A matrix is a vcv matrix whose diagonals are not always 1, you can turn it into relationships (similarities) by dividing each off-diagonal element by the square root of the product of the corresponding diagonals. This is called reducing a covariance matrix to a correlation matrix. Recall that correlation is standardized covariance. Thus, we call A the numerator (above the line) relationship matrix.

There is an R function for it:

# convert A to actual relationships
A_Rel <- cov2cor(A)

# print matrix
print(round(A_Rel, 4))

How it works is that eventually everything on the diagonal becomes 1. The off-diagonal entries are scaled by their related variances: for example, a12 will be scaled by a11 and a22: a12_new = a12_old / sqrt(a11 * a22). Apparently this is why the A matrix is called the numerator relationship matrix (a12_old is the numerator).
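For readers without R at hand, a numpy equivalent of cov2cor (assuming a symmetric matrix with a positive diagonal; the function name mirrors R's):

```python
import numpy as np

def cov2cor(A):
    # a_ij / sqrt(a_ii * a_jj), like R's cov2cor()
    d = np.sqrt(np.diag(A))
    return A / np.outer(d, d)

# small covariance matrix: variances 4 and 9, covariance 2
A = np.array([[4.0, 2.0],
              [2.0, 9.0]])
R = cov2cor(A)  # diagonal becomes 1, off-diagonal becomes 2 / (2 * 3)
```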

Indeed, when you use rrBLUP::A.mat(SNP_matrix), the diagonals of the returned matrix are not just ones. You can scale it by A = A/mean(diag(A)). This function basically takes the transposed cross product (tcrossprod) of the SNP matrix; therefore a GBLUP model where Z = SNP_matrix is equivalent to one where K = A.mat(SNP_matrix).
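A toy numpy illustration of the tcrossprod idea (this is not rrBLUP's actual implementation, which also handles allele frequencies and missing data; it just shows the X %*% t(X) step and the mean-diagonal rescaling):

```python
import numpy as np

# a small centered "SNP matrix": 3 individuals x 3 markers, columns sum to 0
X = np.array([[ 1.0, -1.0,  0.0],
              [ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0]])

A = X @ X.T                  # tcrossprod(X) in R: X %*% t(X)
A = A / np.mean(np.diag(A))  # rescale so the diagonal averages 1
```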

OK, I hope now you understand what is happening. The original post is much better than mine!

What does MONOALLELIC mean in a GLnexus generated VCF file
http://fanhuan.github.io/en/2024/09/30/Monoallelic/ 2024-09-30 Huan Fan

I received a VCF generated by DeepVariant + GLnexus. In the FILTER column, there are only two types: . and MONOALLELIC. I was a bit confused about this monoallelic idea.

In the VCF file it was defined as:

##FILTER=<ID=MONOALLELIC,Description="Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites">

This is quite a loaded sentence. It reads more like a description than a definition.

A better description was given in the GLnexus paper.

Figure 1 gives a very detailed illustration of what it means to be MONOALLELIC.

(Figure 1 from the GLnexus paper)

The caption goes like this:

“Figure 1. Allele unification in joint variant calling. (A) Example abbreviated gVCF records for four participants, giving genome position, reference and alternate alleles, and initial called genotypes. Gray records indicate sequencing coverage for regions with no apparent variation. (B) Schematic view of the alternate alleles seen across the four gVCF inputs; they cluster into two sites except for a spanning deletion allele. (C) Example pVCF representation for these variants, with two multiallelic sites and a third “monoallelic” site representing the deletion allele which could not be unified into the multiallelic sites without introducing phase constraints artificially. The input alleles, genotype calls and (not shown) QC measures from the input gVCF records must be “projected” onto the pVCF site representation.”

To understand this, we first need to understand the g(enome)VCF format. This idea was first introduced in the Isaac paper (Raczy 2013, Bioinformatics). The major difference between the gVCF format and the VCF format is that gVCF provides output not only for all variant sites but also for non-variant genomic loci.

This is a bit counter-intuitive. Isn't it that, if a site is not listed, we would assume it is the same as the reference? What further information does it provide?

One apparent distinction is the gray records: "Gray records indicate sequencing coverage for regions with no apparent variation." Take the first gray line in Alice's gVCF: 22 is the chromosome, 101 is the position, C is REF, and the place for ALT holds <NR>, which stands for <NON_REF>. In the GATK gVCF format doc, <NON_REF> is explained as "This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way." It's like a placeholder. There is no mutation at this site in this sample, but there might be in other samples, a.k.a. in this population, and this line holds the place from pos 101 to 105, as specified by END=105.

There is only one problem: how do we understand the first gray line in Carol's gVCF? It says END=111, yet there is a mutation at 106. This did not happen in the examples given in the GATK gVCF format doc. I checked one of my gVCF files and it did not happen there either. Let's treat it as a typo for now and come back later if it becomes relevant. Here, let's look at a chunk of a gVCF file with complete records:

gVCF example

As you can see in the mapping below, no variant was called, but every location with possible mutations is documented; e.g. line 2, chr1_27, is where the red 'T' is.

(screenshot of the read mapping)

In my merged pVCF file, there are also MONOALLELIC sites. One of them is at chr1_10758. It looks like:

chr1 10758 chr1_10758_C_CTA C CTA 37 . AF=0.041252;AQ=37 GT:DP:AD:GQ:PL:RNC 0/0:17:17,0:50:0,51,509:..
chr1 10758 chr1_10758_CCG_C CCG C 30 MONOALLELIC AF=0.029324;AQ=30 GT:DP:AD:GQ:PL:RNC ./.:17:.,0:50:0,0,0:11

Let's look at the call in the first sample.

We sampled three samples' gVCFs. At chr1_10758 their calls look like this:

AGO001_08 chr1 10758 . CCG C,<*> 14.7 PASS . GT:GQ:DP:AD:VAF:PL 0/1:5:22:7,15,0:0.681818,0:12,0,3,990,990,990

AGO003_08 chr1 10758 . C CTA,<*> 11.1 PASS . GT:GQ:DP:AD:VAF:PL 1/1:7:10:1,9,0:0.9,0:10,7,0,990,990,990

AGO090_13 chr1 10758 . CCG C,<*> 15.6 PASS . GT:GQ:DP:AD:VAF:PL 1/1:10:10:1,9,0:0.9,0:15,10,0,990,990,990

In the merged pVCF their genotypes are:

./0:22:7,0:3:0,0,0:-. 1/1:10:1,9:1:10,7,0:.. ./.:10:1,0:6:0,0,0:-- ./1:22:.,15:5:0,0,0:1. ./.:10:.,0:3:0,0,0:11 1/1:10:.,9:3:0,0,0:..

Some explanation, using ./0:22:7,0:3:0,0,0:-. as the example:

  • GT: genotype. ./0 is a half call. It means there is some evidence for REF, but not enough to make it a 0/0 call (7 reads is not big enough).
  • DP: read depth, 22.
  • AD: allelic depths for the REF and ALT alleles in the order listed: 7,0.
  • GQ: conditional genotype quality, encoded as a phred quality: -10 * log10 P(genotype call is wrong | the site is variant). The higher the better; 3 is pretty low.
  • PL: the phred-scaled genotype likelihoods rounded to the closest integer. The lower the better. Here 0,0,0: all genotypes equally likely.
  • RNC: Reason for No Call in GT: . = n/a, M = Missing data, P = Partial data, I = gVCF input site is non-called, D = insufficient Depth of coverage, - = unrepresentable overlapping deletion, L = Lost/unrepresentable allele (other than deletion), U = multiple Unphased variants present, O = multiple Overlapping variants present, 1 = site is Monoallelic, no assertion about presence of REF or ALT allele. In our example there are five types of RNC among those 6 calls:
    • '--': both haplotypes are an "unrepresentable overlapping deletion".
    • '..': n/a (there are calls, therefore no "Reason for No Call").
    • '-.': there is one call (n/a); the other half is due to an "unrepresentable overlapping deletion".
    • '1.': there is one call (n/a); the other half is due to "site is Monoallelic, no assertion about presence of REF or ALT allele".
    • '11': both halves are "site is Monoallelic, no assertion about presence of REF or ALT allele".
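A hypothetical parser for one sample entry under this FORMAT may make the field layout clearer (the function is mine, not from any VCF library; it assumes exactly the GT:DP:AD:GQ:PL:RNC fields shown above):

```python
def parse_sample(entry):
    # split one sample column with FORMAT GT:DP:AD:GQ:PL:RNC
    gt, dp, ad, gq, pl, rnc = entry.split(":")
    return {
        "GT": gt,
        "DP": int(dp),
        "AD": [None if v == "." else int(v) for v in ad.split(",")],
        "GQ": int(gq),
        "PL": [int(v) for v in pl.split(",")],
        "RNC": tuple(rnc),  # one reason code per haplotype
    }

call = parse_sample("./0:22:7,0:3:0,0,0:-.")
```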

OK, so in this case the insertion called in AGO003_08 was not included in the merged call. Maybe it was only found in this sample; 9 reads, actually not bad. Both AGO001_08 and AGO090_13 had the same call, CCG_C, while the former got an R/A call (0/1) and the latter an A/A call. Why? The former had 7 reads for R and 15 reads for A, while the latter had 1 read for R and 9 reads for A. Fair enough.

In sum, a MONOALLELIC row in a pVCF file generated by GLnexus results from an inability to merge all called genotypes at the same site into a single multiallelic variant. In the case of position 105 on chromosome 22 in Figure 1, they were able to do it by extending into positions 106 and 107. But at position 100, we cannot extend chr22_100_T_C to 105 to accommodate chr22_100_TCGTCA_T, otherwise we would be merging an indel with a SNV. By the way, I just learnt that SNP is a narrower concept than SNV: only an SNV found in >1% of the population can be called a SNP.

Apparently this is not a new concept. In a 1995 Nature Genetics paper, Papadopoulos et al. used it in the most literal way: monoallelic mutations are "mutations (that) occur in only one allele", which could be represented as 0/1 in a diploid genome (please note that I work on diploid genomes, so everything I write on this website is under this assumption unless specified otherwise). Here, by contrast, it refers to the fact that the REF part of the call at this position is already represented in the variant above, and this row only asserts the presence of the ALT allele, hence the "mono".

I checked the same position called via freebayes. It combined it with an adjacent variant (chr1_10760) into a multiallelic site. It looks like:

chr1_10758 CCGTAG CTACGG,CCATAG

I would say I prefer this version, since I do not need to deal with the MONOALLELIC situation, where I would have filtered the site out due to the high proportion of missing calls. But I would have filtered out this freebayes version as well, since it is not bi-allelic. What would you do? Leave me a comment.

Deterministic vs Stochastic
http://fanhuan.github.io/en/2024/09/26/Deterministic-vs-Stochastic/ 2024-09-26 Huan Fan

As you can easily see, English is not my native language. There are many words that I have encountered numerous times in my life, and usually I have a vague idea of what they mean. But, usually because there is no exactly matching word in my mother tongue, each stays a cloudy, murky blob. One such word is 'orthogonal'. Recently I finally made the effort to understand what it means and where it came from, and life actually became slightly better! Therefore I decided to take more time to understand some recurring words, so that I like this world a bit more.

Today let’s try deterministic and stochastic.

We all know what "determine" and "determined" mean. When a model is deterministic, it means that given the same input it will always produce the same prediction; no stochastic (random) processes are involved.
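A tiny Python illustration of the difference (the function names are mine):

```python
import random

def deterministic_model(x):
    # same input always gives the same output
    return 3 * x + 1

def stochastic_model(x, rng):
    # the output also depends on a random draw,
    # so it is reproducible only if you fix the seed
    return 3 * x + rng.gauss(0, 1)
```

Two calls to deterministic_model(2) always agree; two calls to stochastic_model(2, ...) agree only when given generators with the same seed.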

SNP Filtering - Basic Principles
http://fanhuan.github.io/en/2024/09/25/SNP-Filtering/ 2024-09-25 Huan Fan

With high sequencing coverage, big genomes and big sample sizes, you can easily get tens of millions of SNPs. Of course you are not going to work with all of them, so filtering becomes very important. In this article we will talk about some basic principles of SNP filtering, how to do it using plink (which is fast and versatile), and discuss some examples and tutorials on how people do it in the field right now.

Part 1: Hard filtering

Removing low quality ones:

  1. Remove SNPs with high missing call rates.

    Missing calls are shown as ./. There are also half missing calls, which look like 1/. or ./0. However, they would have been taken care of when converting your VCF to .bed using plink with the option --vcf-half-call m, which treats half-calls as missing. Other modes include 'haploid'/'h' (treat half-calls as haploid/homozygous; the PLINK 1 file format does not distinguish between the two, and this maximizes similarity between the VCF and BCF2 parsers) and 'reference'/'r' (treat the missing part as reference).

    After all half missing calls are taken care of, you can use --geno to filter out all variants with missing call rates exceeding the provided value (the default is 0.1, which is what you get if you specify --geno without a number). The lower the threshold, the more SNPs are filtered out.
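To make the --geno idea concrete, here is a toy Python version of the missing-rate filter (not plink's code, which works on its binary formats; the function name and data layout are mine):

```python
def geno_filter(variants, threshold=0.1):
    # keep a variant only if its missing-call rate is <= threshold
    # (mirroring plink's --geno, whose default threshold is 0.1)
    kept = []
    for name, calls in variants:
        missing_rate = sum(c == "./." for c in calls) / len(calls)
        if missing_rate <= threshold:
            kept.append(name)
    return kept
```

For example, a variant missing in 1 of 3 samples (rate 0.33) is dropped at the default threshold but kept at 0.5.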

Tools.

  1. Non-ML based
  2. ML based
    1. DeepVariant is a caller, so it does not do the filtering per se. However, being a highly accurate caller, it already sets a high bar for artefacts to get into the results. Btw, it is usually used together with GLnexus, a merging tool that can scale to hundreds or thousands of samples with big genomes, at the cost of heavy disk space usage.