Large genome model: Open source AI trained on trillions of bases


Late in 2025, we covered the development of an AI system called Evo that was trained on massive numbers of bacterial genomes, so many that, when prompted with sequences from a cluster of related genes, it could correctly identify the next one or suggest a completely novel protein.

That system worked because bacteria tend to cluster related genes together—something that’s not true in organisms with complex cells, which tend to have equally complex genome structures. Given that, our coverage noted, “It’s not clear that this approach will work with more complex genomes.”

Apparently, the team behind Evo viewed that as a challenge, because today it is describing Evo 2, an open source AI that has been trained on genomes from all three domains of life (bacteria, archaea, and eukaryotes). After training on trillions of base pairs of DNA, Evo 2 developed internal representations of key features in even complex genomes like ours, including things like regulatory DNA and splice sites, which can be challenging for humans to spot.

Genome features

Bacterial genomes are organized along relatively straightforward principles. Any genes that encode proteins or RNAs are contiguous, with no interruptions in the coding sequence. Genes that perform related functions, like metabolizing a sugar or producing an amino acid, tend to be clustered together, allowing them to be controlled by a single, compact regulatory system. It’s all straightforward and efficient.

Eukaryotes are not like that. The coding sections of genes are interrupted by introns, which don’t encode anything. Genes are regulated by sequences that can be scattered across hundreds of thousands of base pairs. The sequences that define the edges of introns or the binding sites of regulatory proteins are all weakly defined: while a few bases are absolutely required, many positions merely have an above-average tendency toward a specific base (something like “45 percent of the time it’s a T”). Surrounding all of this in most eukaryotic genomes is a huge amount of DNA that has been termed junk: inactive viruses, terminally damaged genes, and so on.
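
To make the idea of a "weakly defined" sequence concrete, here is a minimal Python sketch of how such a signal might be scored. It is not how Evo 2 works and is not taken from the research described here. The four-position motif, its probabilities, and the function names are all hypothetical, chosen only to show two near-required bases followed by two positions with mild preferences, including one where T appears 45 percent of the time.

```python
import math

# Hypothetical 4-position motif (illustrative numbers, not real data).
# Each entry gives the probability of seeing each base at that position.
# Positions 0 and 1 are close to "absolutely required"; positions 2 and 3
# only show an above-average tendency toward one base.
MOTIF = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.25, "C": 0.15, "G": 0.15, "T": 0.45},
    {"A": 0.40, "C": 0.20, "G": 0.20, "T": 0.20},
]

# Assumed background: every base equally likely.
BACKGROUND = 0.25


def log_odds(window):
    """Sum of per-position log2(motif probability / background probability)."""
    return sum(
        math.log2(column[base] / BACKGROUND)
        for column, base in zip(MOTIF, window)
    )


def best_match(sequence):
    """Return (score, start index) of the highest-scoring window in sequence."""
    width = len(MOTIF)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        (log_odds(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    score, start = best_match("CCGTTACC")
    print(f"best window starts at {start} with score {score:.2f}")
```

In this toy setup, a window that misses either of the first two bases scores far below zero, while a mismatch at the last two positions costs only a little. That gap is why such signals are hard to spot by eye: many unrelated stretches of DNA can score moderately well by chance.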


