Synthetic Signatures: Detecting and Classifying AI-Generated Genomes
Aaron Maiwald is a DPhil student in statistics at the University of Oxford, focusing on the intersection of AI and biology, with the ultimate goal of pandemic prevention. His work is supported by the Berkeley Existential Risk Initiative.
AI biodesign tools can now generate novel organisms at a pace that outstrips natural evolution. For example, the genomic language model EVO-2 has been used to generate entire, functional bacteriophage genomes with low similarity to their natural counterparts. While bacteriophages are harmless to humans, it is plausible that models like EVO will soon be able to generate human-infecting viral genomes that are highly dissimilar from known pathogens. For DNA-synthesis screening and metagenomic surveillance to detect such threats, we need tools that can identify AI-designed sequences. In this project, we will aim to build a dataset of AI-generated sequences and train classifiers that distinguish them from natural and traditionally engineered sequences. We may extend this beyond detecting whether AI design is present to attributing which AI tool was used.
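To give a flavour of what a first baseline might look like, here is a minimal sketch (not the project's actual pipeline): sequences are featurised as k-mer frequency vectors and a simple scikit-learn logistic-regression classifier is trained to separate "natural" from "AI-generated" labels. The k-mer length, the function names, and the use of logistic regression are all illustrative assumptions; the real project would involve curated datasets, stronger models, and evaluation against traditionally engineered sequences.

```python
# Illustrative baseline sketch (hypothetical, not the project's pipeline):
# featurise DNA sequences as k-mer frequency vectors and fit a simple
# logistic-regression detector for "AI-generated" vs "natural" sequences.

from collections import Counter
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

K = 4  # k-mer length; a hyperparameter to tune in practice
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(seq: str) -> np.ndarray:
    """Return a normalised vector of overlapping k-mer counts for one sequence."""
    counts = Counter(seq[i : i + K] for i in range(len(seq) - K + 1))
    vec = np.zeros(len(KMERS))
    for kmer, count in counts.items():
        if kmer in KMER_INDEX:  # skip k-mers with ambiguous bases (e.g. N)
            vec[KMER_INDEX[kmer]] = count
    total = vec.sum()
    return vec / total if total > 0 else vec


def train_detector(natural_seqs: list[str], generated_seqs: list[str]) -> LogisticRegression:
    """Fit a baseline classifier: label 0 = natural, label 1 = AI-generated."""
    X = np.array([kmer_frequencies(s) for s in natural_seqs + generated_seqs])
    y = np.array([0] * len(natural_seqs) + [1] * len(generated_seqs))
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"held-out ROC AUC: {auc:.3f}")
    return clf
```

The same setup extends naturally to the attribution question: replacing the binary labels with one class per design tool turns the detector into a multi-class classifier, which scikit-learn's LogisticRegression supports out of the box.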
Skills you could learn: how to pretrain a language model; working with large biological datasets; machine learning/coding skills.
Ideal candidate: at least moderate coding ability; experience with deep-learning libraries such as PyTorch, or a strong interest in learning them; a sceptical mindset; excited and driven to get projects over the finish line.