Huan Fan

Lost In Translation

Huan Fan — 2025-05-28T00:00:00+00:00

While working with a vcf file, I noticed that one of the variant looked like this:

ID REF ALT

chr1_254_A_T T A

I was pretty confused. The ID suggested that A is the REF call and T is the Alternative. However the REF and ALT columns suggest the opposite. I was immediate alarmed since this could have cause problematic genotype calls where 0/0 and 1/1 are switched.

How could this be? I checked my original vcf file with which the current one is a subset of, things are OK. the ID is still chr1_254_A_T, and the REF is A and ALT is T. So where did thing go wrong?

Looking through my notes, I realized that I have converted my vcf to plink format, did some prunning there, and then converted the plink files back to vcf. Could this be the problem?

The variant information is stored in the .bim file, and here is its definition

.bim (PLINK extended MAP file)
Extended variant information file accompanying a .bed binary genotype table. (--make-just-bim can be used to update just this file.)

A text file with no header line, and one line per variant with the following six fields:

Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
Variant identifier
Position in morgans or centimorgans (safe to use dummy value of '0')
Base-pair coordinate (1-based; limited to 231-2)
Allele 1 (corresponding to clear bits in .bed; usually minor)
Allele 2 (corresponding to set bits in .bed; usually major)

As you can see, column 5 is the minor allele and column 6 is the major. This means we have lost which one is REF and which one is ALT. When you use plink --recode vcf to convert your .bim back to vcf, it will just assume that the major is REF, which is not always true.

So what can you do? When converting your vcf to the plink format via plint --vcf input.vcf --make-bed, make sure to add either --keep-allele-order or --real-ref-alleles. Then the .bim file will be correct and when you convert it back to vcf later, there should be any problem. It is said that from plink 2.0 does not have this problem and will always respect the original REF/ALT order.

Happy genotyping!

Building a Linkage Map - Part I

Huan Fan — 2025-05-05T00:00:00+00:00

A linkage map is usually required for a lot of quantitative genetic (QTL mapping) and population genetic analysis (gene flow). In this series of post I will be talking about how to build one based on whole genome sequencing data from a family design.

If you are dealing with millions of markers or even more, the only option that I found appropriate is Lep-MAP3. Btw please let me know if you have a better option as I am really not perfectly happy about it.

Sample size and pedigree structure

As for sample size, I will just give you a rule of thumb.

200: reasonable

100 - 200: questionable

< 100: unreliable.

Lep-MAP3 should allow grandparents and half-siblings besides full-sibling and parent-offsprings. It can also deal with selfings. You can include everyone in the same analysis to make your sample size larger.

Input prep

Only two files are required. One is the genotype likelihood, usually a vcf file, and one is a pedigree file.

The first one is pretty straight forward. Just make sure the vcf you provided is not just some genotype calls. If your vcf is recorded from a plink format file, you most-likely have lost the GL or PL info.

The pedigree file looks very confusing. But it is actually just a transpose of the .fam file in the plink format, plus two extra columns in the front as place holder for Chromosome aod positions in the later output. I wrote a python script to help you with the conversion. Note that only two parents are allow in one family, but there can be multiple families. Now let’s talk about some of the more complicated cases.

Multiple families sharing some family members.

In this case you need to list all relevent individuals in each family, under the same ID. For example, GP1 is one of the grandparents for Family_1 and Family_2, then in the pedigree file you will have two columns for GP1, one under Family_1 and one under Family_2. This is also how the program detects half-siblings if they see individuals appeared as parents in multiple families. ParentCall2 will actually prompt you to turn on the halfSibs=1 option in this case. Note that one need to turn on the grandparentPhase=1 in the OrderMarkers2 step when grandparents are identified.

What about selfing ones?

If there are selfing families, which is pretty common in plant breeding. This is what the author suggested on the wiki page of Lep-MAP3:

“As Lep-MAP3 assumes two parents for each family, a selfing crosses cannot be directly analysed with it. However, it is possible to add two dummy parents (one male and another female) to the pedigree.”

Note that you also need to add some dummy columns in your vcf files in this case.

The author also mentioned that “Data for the single parent is not really needed, but the grandparents (say two individuals from different lines crossed to form the parent) can be used.” I don’t know what he means here. Could be, the grandparents are the important info; or if you do not have the single parent info, just use the grandparents as their parents? I wanted to ask this in the forum but kept getting “Spambot protection engaged” from Sourceforge…

Also someone in the forum asked whether one can turn on both grandparentPhase=1 and selfingPhase=1 in the OrderMarkers2 step. The author says that the later is used when there is no grandparents and do not know how the program will function when both are turned on.

Overall pipeline

Step 1: ParentCall2

Recall what is PL: sample-level annotation calculated by HaplotypeCaller and GenotypeGVCFs, recorded in the sample-level columns of variant records in VCF files. This annotation represents the normalized Phred-scaled likelihoods of the genotypes considered in the variant record for each sample.

PL=−10∗logP(Genotype | Data).

P(G D) is calculated by P(D G)P(G)/P(D). See more details on this HaplotypeCaller page of gatk. -10log(P(G|D) will put PL into Pred score scale (Q=-10*logErrorRate). Then PL is normalize across all genotypes by subtracting the value of the lowest PL from all the values, then the PL value of the most likely genotype is 0.

e.g. this is the genotype and their PL value for three samples.

0/1:38,0,59 0/0:0,69,689 0/0:0,57,569

The order goes like: 0/0, 0/1 and 1/1. In the first sample, the most likely genotype is 0/1 (PL=0), and second likely is 0/0 (PL=38). The second and third sample both are called as 0/0, but we have more confidence in the second sample since the different between the most and second most likely genotype is larger (69 > 57).

Note that each variant for each sample will get a ten column string of posterio probabilities. Why ten? This is the number of combinations with replacement with 4 types of nucleotides. CR(n,r) = C(n+r-1, 2) = C(5,2) = 5 x 4 / 2 = 10. However it the 10-number posterior columns are not ordered lexicographically (AA, AC, AG, AT, CC, … TT) but fixed by genotype indices (like VCF’s GT field) rather than nucleotide combinations. It looks like this:

1.REF/REF (0/0)

2.REF/ALT1 (0/1)

3.REF/ALT2 (0/2)

4.REF/ALT3 (0/3)

5.ALT1/ALT1 (1/1)

6.ALT1/ALT2 (1/2)

7.ALT1/ALT3 (1/3)

8.ALT2/ALT2 (2/2)

9.ALT2/ALT3 (2/3)

10.ALT3/ALT3 (3/3)

However since I already prefitered my data so only bi-allelic variants are kept, you will only see the 1st, 2nd and 5th columns are used.

Genotype Likelihood

Huan Fan — 2025-04-17T00:00:00+00:00

Yesterday when we were talking about sequencing depth we ran into a word: genotype likelihood. I know what is genotype; I also know what is likelihood; but what is genotype likelihood?

Before we start, let’s do a quick recap on the difference between probability and likelihood, since they are usually a mixtures in my head unless I really try to focus on their differences.

Probability = What is the chance of this outcome, given a model or known parameters? P(D∣θ)
Likelihood = How plausible is this model (or parameter value), given the data I observed? L(θ∣D)

Therefore probabilies are used in simulation from known models, and likelihood is used in inferring model parameters from observed data (what we are doing most of the time).

OK, so from our understanding of genotype and likelihood, genotype likelihood should be L(AA/Aa/aa

mapping results).

Or is it?

so, some sort of quality score, like the QUAL column in a vcf? P(data∣no variant). Phred Quality Score (Q)=−10×log10(P)

So:

QUAL = 30 → 1 in 1000 chance the site is not a real variant

QUAL = 50 → 1 in 100,000 chance it's a false positive

Reference chain: Kardos 2024 Molecular Ecology -> 2019 Bertrand MEE -> 2016 Vieira Bioinformatics -> 2009 The sequence alignment/map format and SAMtools

How Low is Low?

Huan Fan — 2025-04-16T00:00:00+00:00

You have decided to do whole genomic sequencing (WGS) for your research project. You contacted your sequencing service provider. The first question you will get is: how much data do you need.

What they are actually asking is: what is the sequencing depth, or sometimes referred to as coverage are you expecting.

We all know that coverage limits the kind of analysis we could carry out. But how much coverage is enough?

In Hemstrom 2024, they tried to define what is Low-coverage WGS. They really tried; they put it into the glossary part:

Low-coverage whole-genome sequencing: 
Whole-genome sequencing (WGS)with small numbers of reads covering most genomic loci (low coverage);
the number of reads constituting low coverage varies widely depending on the discipline, 
methodology and research question. Low-coverage WGS often requires genotype likelihood-based methods.

OK. So what have we got from these sentences? That “the number constituting low coverage varies widely depending on the discipline, methodology and research question”. This means no matter which discipline, which methodology and what kind of research questions you have, you still do not know what is considered low-coverage! But once you’ve decided that your coverage is indeed low for your perticular circumstance, you should use “genotype likelihood-based methods”.

Wow. Where do we start. Maybe let’s understand more about this “genotype likelihood-based methods” and it might help us understand when we need to use it and back calculate what is considered low-coverage. Here is a post on genotype likelihood if you are not sure what it is.

Here they cited an attack, sorry, no, a comment on a pretty famous paper on the inbreeding of North American wolves. In the wolf paper, the sequencing coverage is 7X. Wow OK that actually sounds low. Imaging if you have a heterozygous site, you won’t have five reads to support either, let alone the PCR duplication, which can be actually very high (5% to 50% in my current dataset). OK I would say anything below 10X is a no-brainer low. Later I also discovered this paper used RAD-seq. 7X coverage RAD-seq for 437 individuals (ok the sample size is pretty good). Man we need more funding on conservation.

OK back to the main topic. How low is considered low? The comment paper actually investigated on this matter and showed us some data.

This is the meat of the paper. Let’s take a look at some of the relevant subplots.

Figure 1c: This is saying the probability of seeing both alleles in a heterozygous locus will reach amost 1 when the read depth is 10. However this is assuming sequence reads are independent (no PCR duplicates) and that each allele is equally likely to be sequenced. So 10 is the absolute low threshold. You should at least do better than 10.

Figure 1d: F-ROH(run of homozygousity), a finer way of estimating inbreed coefficient (F), see this paper on Newzealand hihi (a friendly bird) on more details of ROH, stabalizes after the read depth reaches 5. You may say ok this is no problem since the coverage is 7. No. Then mean coverage is 7, meaning a lot of the loci might have <5 coverage.

Figure 1e: H-obs is the percentage of heterozygous sites observed, and it just kept on rising even after 20X.

Figure 1f: H-exp is the percentage of heterozygous sites calculated based on Hardy-Weinberg Equilibrium. It stabalizes after 10X. But as the authors pointed out, the pattern is clearer than in Figure 1e, since nobody with an H-exp higher than 0.22 had a read depth lower than 10X. This is to say the H-exp is capped by the read depth.

Figure 1g: Here missingness means missing calls of genotypes at a site for an individual. You can see that only when the read depth reached 15 when the trend stablizes.

OK, based on this one study, I will just say that 10X is the bare-minimum, and only >20X can be considered safe for a diploid genome.

Please take note on the ‘>’ before 20X. Let me emphasize. This is not the mean, but the min! If you tell your sequence service provider that you want 20X, you might end up with lots of samples or loci under 20X, even under 10X. I took a brief look on the dataset that I am working on right now. There is indeed a strong correlation between the mean depth of the variants called, and the mean depth of the sequencing effort (r close to 0.9). However the ratio between the two is between 0.5 to 0.75. That is to say in the worse case, only half of the reads were useful in calling the variants. That translate to 27X(0.75) to 40X(0.5) of sequencing effort. This ratio is negatively correlated with the duplication rate (r close to -0.8). Maybe you can go for 30X, and resequence the ones with low variant coverage later.

Good luck to everyone on securing a bigger funding!

Candidate Genes, what's next?

Huan Fan — 2025-04-15T00:00:00+00:00

You’ve done GWAS and there are some peaks, and some of them seem to lie within or next to some important genes. Now what do you do? How do you validate what you find?

Transcription Factors, What Are They and How to Find Them

Huan Fan — 2025-04-02T00:00:00+00:00

what I did before.

I downloaded all the TF for oil plam from PlantTF and did orthologger to see which one maps to it.

Now I want to use iTAK.

HISAT2 Alignment Statistics

Huan Fan — 2025-03-25T00:00:00+00:00

I’m trying to tidy the mapping statistics given by HISAT2 into a table. First let’s get on the same page with the statistics it spits out.

For example:

50964542 reads; of these:
  50964542 (100.00%) were paired; of these:
    2909022 (5.71%) aligned concordantly 0 times
    45584085 (89.44%) aligned concordantly exactly 1 time
    2471435 (4.85%) aligned concordantly >1 times
    ----
    2909022 pairs aligned concordantly 0 times; of these:
      205308 (7.06%) aligned discordantly 1 time
    ----
    2703714 pairs aligned 0 times concordantly or discordantly; of these:
      5407428 mates make up the pairs; of these:
        2967368 (54.88%) aligned 0 times   
        2131328 (39.41%) aligned exactly 1 time
        308732 (5.71%) aligned >1 times
97.09% overall alignment rate

Line by line:

50964542 reads; of these:                                                      # Total read pairs
  50964542 (100.00%) were paired; of these:                                    
    2909022 (5.71%) aligned concordantly 0 times                               # Unaligned Concordant Pairs
    45584085 (89.44%) aligned concordantly exactly 1 time                      # Concordant Unique Pairs
    2471435 (4.85%) aligned concordantly >1 times                              # Concordant Multi-mapped
    ----
    2909022 pairs aligned concordantly 0 times; of these:
      205308 (7.06%) aligned discordantly 1 time                               # Discordant Alignments
    ----
    2703714 pairs aligned 0 times concordantly or discordantly; of these:      # Unaligned Pairs Total
      5407428 mates make up the pairs; of these:
        2967368 (54.88%) aligned 0 times                                       # Single-end Unaligned
        2131328 (39.41%) aligned exactly 1 time                                # Single-end Unique
        308732 (5.71%) aligned >1 times                                        # Single-end Multi
97.09% overall alignment rate                                                  # Overall Alignment Rate

This stats tells us the underlying logic of HISAT2.

Check whether reads are paired. In this case all of them are (100%)
Try to align reads in pairs. This resulted in three groups:
- Concordant Unique Pairs (45584085). This means both mates aligned uniquely in correct orientation/distance. These are the high quality alignment.
- Concordant Multi-mapped (2471435). Both mates aligned to multiple locations in correct orientation/distance. Likely to repetitive regions.
- The rest (2909022). For a lack of better words let’s call this group Non-Concordant Pairs.
Now it is about different situations within the Non-Concordant Pairs.
- Discordant alignment (205308): this means both mates aligned uniquely, but not concordantly (not in correct orientation or distance). These alignments are interesting since they could suggest potential structural variation or mis-assemblies.
- Single-end (2703714 pairs or 5407428 mates). For the rest of reads, basically HISAT2 failed to treat them as pairs, and now is trying to salvaged them individually as single-end. For these mates, there are obviously three situations: Single-end Unaligned (2967368), Single-end Unique (2131328) and Single-end Multi (308732). Among those, the only thing that could be informative is the Single-end Unique.
Now you might have guessed how the overall alignment rate is calcualted: 1- Single-end Unaligned/(Total read pairs * 2), since those are the only reads that faied to map to anywhere in the reference. This could be due to distances between the ref and your sample, or an incomplete ref in terms of short read mapping.

I wrote a short python script that takes the stderr of HISAT2 and tidy it up in the terminology defined in this post.

Tensors, Tokens and Embeddings

Huan Fan — 2025-02-28T00:00:00+00:00

Don’t let the jargons scare you away is the thing that you need to remind yourself constantly. They are here for a lack of better words.

Tensors (张量)

Tensor is actually a pretty fundamental concept.

Scaling laws in the context of natural language processing (NLP) and computer vision refer to the predictable relationships between the size of a model (e.g., number of parameters), the amount of training data, and the model’s performance (e.g., accuracy, loss, or other metrics). These laws describe how performance improves as you scale up key factors like model size, dataset size, and computational resources. But before we can understand its importance, we need to first understand power law.

Power Law

A power-law relationship is a mathematical relationship between two quantities where one quantity varies as a power of the other. In other words, one quantity is proportional to the other raised to an exponent. Mathematically, it is expressed as:

[ y = k \cdot x^n ]

Where:

( y ) is the dependent variable (e.g., model performance),
( x ) is the independent variable (e.g., model size, dataset size, or compute),
( k ) is a constant (proportionality factor),
( n ) is the exponent (a constant that determines the shape of the relationship).

Key Characteristics of Power-Law Relationships:

Non-linear: Unlike linear relationships (( y = mx + b )), power-law relationships are non-linear. This means that changes in ( x ) lead to disproportionate changes in ( y ).
Scale-invariant: Power-law relationships appear the same at all scales. If you zoom in or out on the data, the relationship retains its shape.
Heavy-tailed distribution: In many real-world systems, power-law relationships describe phenomena where small events are common, but large events are rare (e.g., word frequency in language, city sizes, or income distribution).

Examples of Power-Law Relationships:

Natural Language Processing (NLP):
- Model performance (e.g., perplexity or accuracy) often improves as a power-law function of model size, dataset size, or compute. For example: [ \text{Performance} \propto (\text{Model Size})^n ]
- This means doubling the model size might lead to a less-than-doubling improvement in performance, depending on the exponent ( n ).
Computer Vision:
- Image recognition accuracy often scales as a power-law function of the number of training images or model parameters.
Real-World Phenomena:
- Zipf’s Law: In linguistics, the frequency of a word is inversely proportional to its rank in the frequency table (e.g., the most common word appears twice as often as the second most common word).
- Pareto Principle (80/20 Rule): 80% of outcomes often come from 20% of causes (e.g., 80% of wealth is owned by 20% of the population).
- Network Science: The distribution of connections in many networks (e.g., social networks, the internet) follows a power law, where a few nodes have many connections, and most nodes have few.

Visualizing a Power-Law Relationship:

When plotted on a log-log scale (where both axes are logarithmic), a power-law relationship appears as a straight line. This is because taking the logarithm of both sides of the equation ( y = k \cdot x^n ) gives:

[ \log(y) = \log(k) + n \cdot \log(x) ]

This is the equation of a straight line with slope ( n ) and intercept ( \log(k) ).

Why Power-Law Relationships Matter in AI:

Predictability: Power-law relationships allow researchers to predict how performance will improve as they scale up resources (e.g., model size, data, compute).
Optimization: Understanding power-law scaling helps allocate resources efficiently. For example, if performance improves slowly with larger models, it might be better to invest in more data or better algorithms.
Benchmarking: Power-law relationships provide a framework for comparing different models and architectures.

Example in AI Scaling:

In OpenAI’s research on scaling laws for language models, they found that: [ \text{Test Loss} \propto (\text{Model Size})^{-\alpha} \cdot (\text{Dataset Size})^{-\beta} \cdot (\text{Compute})^{-\gamma} ] Here, ( \alpha ), ( \beta ), and ( \gamma ) are exponents that describe how performance improves with scaling.

In summary, a power-law relationship describes how one quantity changes as a power of another. It is a fundamental concept in AI scaling, as well as in many natural and social phenomena.

Scaling Law and Power Law

Huan Fan — 2025-02-24T00:00:00+00:00

Scaling Law

Power Law

[ y = k \cdot x^n ]

Where:

( y ) is the dependent variable (e.g., model performance),
( x ) is the independent variable (e.g., model size, dataset size, or compute),
( k ) is a constant (proportionality factor),
( n ) is the exponent (a constant that determines the shape of the relationship).

Key Characteristics of Power-Law Relationships:

Non-linear: Unlike linear relationships (( y = mx + b )), power-law relationships are non-linear. This means that changes in ( x ) lead to disproportionate changes in ( y ).
Scale-invariant: Power-law relationships appear the same at all scales. If you zoom in or out on the data, the relationship retains its shape.
Heavy-tailed distribution: In many real-world systems, power-law relationships describe phenomena where small events are common, but large events are rare (e.g., word frequency in language, city sizes, or income distribution).

Examples of Power-Law Relationships:

Natural Language Processing (NLP):
- Model performance (e.g., perplexity or accuracy) often improves as a power-law function of model size, dataset size, or compute. For example: [ \text{Performance} \propto (\text{Model Size})^n ]
- This means doubling the model size might lead to a less-than-doubling improvement in performance, depending on the exponent ( n ).
Computer Vision:
- Image recognition accuracy often scales as a power-law function of the number of training images or model parameters.
Real-World Phenomena:
- Zipf’s Law: In linguistics, the frequency of a word is inversely proportional to its rank in the frequency table (e.g., the most common word appears twice as often as the second most common word).
- Pareto Principle (80/20 Rule): 80% of outcomes often come from 20% of causes (e.g., 80% of wealth is owned by 20% of the population).
- Network Science: The distribution of connections in many networks (e.g., social networks, the internet) follows a power law, where a few nodes have many connections, and most nodes have few.

Visualizing a Power-Law Relationship:

[ \log(y) = \log(k) + n \cdot \log(x) ]

This is the equation of a straight line with slope ( n ) and intercept ( \log(k) ).

Why Power-Law Relationships Matter in AI:

Predictability: Power-law relationships allow researchers to predict how performance will improve as they scale up resources (e.g., model size, data, compute).
Optimization: Understanding power-law scaling helps allocate resources efficiently. For example, if performance improves slowly with larger models, it might be better to invest in more data or better algorithms.
Benchmarking: Power-law relationships provide a framework for comparing different models and architectures.

Example in AI Scaling:

In summary, a power-law relationship describes how one quantity changes as a power of another. It is a fundamental concept in AI scaling, as well as in many natural and social phenomena.

Foundation Model

Huan Fan — 2025-02-24T00:00:00+00:00

I was reading the Evo paper, and it referred the Evo model as a foundation model (基础模型). I had to look up what that means.

In the context of this paper, a foundation model refers to a large, general-purpose machine learning model that is trained on vast amounts of data and can be adapted (or fine-tuned) for a wide range of downstream tasks. Foundation models are designed to capture broad patterns and relationships in the data, making them highly versatile and powerful tools for various applications.

Key Characteristics of Foundation Models:

Large-Scale Training: Foundation models are trained on massive datasets, often using unsupervised or self-supervised learning techniques.
General-Purpose: They are not task-specific but are designed to learn general representations of the data (e.g., language, images, or biological sequences).
Transfer Learning: Once trained, foundation models can be fine-tuned or adapted to specific tasks with relatively little additional data.
Versatility: They can be applied across multiple domains and tasks, often outperforming specialized models.

Examples of Foundation Models:

Natural Language Processing (NLP):
- GPT (Generative Pre-trained Transformer):
  - Developed by OpenAI, GPT models (e.g., GPT-3, GPT-4) are trained on vast amounts of text data and can perform tasks like text generation, translation, summarization, and question answering.
- BERT (Bidirectional Encoder Representations from Transformers):
  - Developed by Google, BERT is trained to understand the context of words in a sentence and is used for tasks like sentiment analysis, named entity recognition, and question answering.
- T5 (Text-To-Text Transfer Transformer):
  - Developed by Google, T5 treats all NLP tasks as a text-to-text problem, making it highly flexible for tasks like translation, summarization, and classification.
Computer Vision:
- CLIP (Contrastive Language–Image Pretraining):
  - Developed by OpenAI, CLIP connects images and text, enabling tasks like zero-shot image classification and image-text retrieval.
- DALL·E:
  - Also developed by OpenAI, DALL·E generates images from textual descriptions, demonstrating the ability to combine vision and language understanding.
Biology and Bioinformatics:
- Protein Models:
  - AlphaFold: Developed by DeepMind, AlphaFold predicts protein structures from amino acid sequences, revolutionizing structural biology.
  - ESM (Evolutionary Scale Modeling): Developed by Meta AI, ESM models are trained on protein sequences to predict structure, function, and evolutionary relationships.
- DNA Models:
  - DNABERT
  - NT (Nucleotide Transfoermer)
  - Evo: Evo is a foundation model designed to capture the multimodality of the central dogma (DNA → RNA → protein) and the multiscale nature of evolution. It can likely be applied to tasks like gene function prediction, protein design, and evolutionary analysis. Evo2 is just released and eukaryotic genomes are included in the training this time.
Multimodal Models:
- Flamingo:
  - Developed by DeepMind, Flamingo combines vision and language understanding, enabling tasks like image captioning and visual question answering.
- Gato:
  - Also developed by DeepMind, Gato is a general-purpose model capable of performing tasks across multiple domains, including text, images, and robotics.

Why Foundation Models Are Important:

Efficiency: Instead of training a new model from scratch for every task, foundation models can be fine-tuned with minimal additional data and computation.
Performance: Foundation models often achieve state-of-the-art performance on a wide range of tasks due to their large-scale training and generalization capabilities.
Innovation: They enable new applications and discoveries by providing a powerful base for further research and development.

Evo as a Foundation Model:

In the case of Evo, it is designed to capture two key aspects of biology:

Multimodality of the Central Dogma: Evo can handle the flow of genetic information from DNA to RNA to proteins, integrating multiple biological modalities.
Multiscale Nature of Evolution: Evo can analyze evolutionary patterns at different scales, from molecular scales to systems scales(interaction between different modality of molecules) and entire genomes (see their figure from the paper below).

As a foundation model, Evo can be fine-tuned for various biological tasks, such as predicting gene functions, designing proteins, or analyzing evolutionary relationships, making it a versatile tool for computational biology.

Update from Evo2

Evo is trained on prokaryotes and phage genomes. Evo2 is trained on “a highly curated genomic atlas spanning all domains of life”.
Pretraining data set: 300 billion nt (from 2.7 million genomes) vs. 9.3 (abstract) or 8.84 (methods) trillion nt. openGenome2 (the one Evo 2 was trained on) included a 33% expansion of representative prokaryotic genomes from 85,205 to 113,379 (357 billion nucleotides), a total of 6.98 trillion nucleotides from eukaryotic genomes, 854 billion nucleotides of non-redundant metagenomic sequencing data, 2.82 billion nucleotides of organelle genomes, and 602 billion nucleotides of subsets of eukaryotic sequence data to focus on likely functional regions of the genomes by focusing on different windows around coding genes. This means eukaryotic genomes takes about 80% of the pretraining dataset.

Let’s take a closer look on how the eukaryotic genomes were chosen:

vs. 9.3 trillion DNA base pairs.

Model parameter size: 7B(Evo) vs 7B and 40B (Evo2). The differences between the two Evo2 model is that the 7B parameters trained on 2.4 trillion tokens and a full version at 40B parameters trained on 9.3 trillion tokens. Note that GPT3 has 175B.
Token context window1 million: 131Kb vs 1Mb, both at single-nucleotide resolution. This number is 2048 for GPT3.
Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation.