Selective Data Exclusion for Safer Biological Language Models
Aaron Maiwald is a DPhil student in statistics at the University of Oxford, focusing on the intersection of AI and biology, with the ultimate goal of pandemic prevention. His work is supported by the Berkeley Existential Risk Initiative.
Biological language models promise to transform biological design. Trained to predict the next nucleotide or amino acid, these systems have been used to generate new regulatory elements and antibodies. Like any tool for biological design, they are dual-use. To reduce the risk of misuse, various strategies aim to selectively reduce performance on dangerous viral sequences. One strategy is simply to exclude some viral sequences from the model's training data. The recently developed genomic language model EVO 2 excluded all eukaryote-infecting viruses, a superset of those dangerous to humans, from its training data. While this appears to effectively destroy model performance on human-infecting viral genomes, it also degrades performance on entirely benign viral sequences. This makes the strategy less attractive to future model developers and may also reduce beneficial model capabilities.
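In practice, host-based exclusion amounts to filtering the training corpus by annotated host taxonomy before pretraining. The sketch below illustrates the idea under an assumed record format (a sequence paired with a set of host lineage labels); the field names, placeholder sequences, and exclusion sets are hypothetical and do not describe EVO 2's actual pipeline.

```python
from typing import Iterable, Iterator, Tuple, Set

# Hypothetical record format: (nucleotide sequence, set of host lineage labels).
Record = Tuple[str, Set[str]]

def exclude_by_host(records: Iterable[Record],
                    excluded_taxa: Set[str]) -> Iterator[Record]:
    """Yield only records whose annotated host lineage avoids excluded_taxa."""
    for seq, host_lineage in records:
        if host_lineage & excluded_taxa:
            continue  # drop sequences from excluded hosts
        yield seq, host_lineage

# Broad exclusion (all eukaryote-infecting viruses) vs. a narrower set
# (human-infecting viruses only); the labels are placeholders.
broad_exclusion = {"Eukaryota"}
narrow_exclusion = {"Homo sapiens"}

corpus = [
    ("ATGACC...", {"Bacteria"}),                   # phage: kept under both policies
    ("GGTACA...", {"Eukaryota", "Homo sapiens"}),  # human-infecting: dropped under both
    ("TTGCAG...", {"Eukaryota", "Plantae"}),       # plant virus: dropped only under the broad policy
]
kept_narrow = list(exclude_by_host(corpus, narrow_exclusion))  # keeps 2 of the 3 records
```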
Can we do better? How selectively can we degrade model performance on human-infecting viruses? Would it be possible to exclude a much narrower set of viruses and thereby leave model capabilities on benign viruses intact? This project aims to train a set of genomic and protein language models with different levels of data exclusion to assess whether this is possible.
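One natural way to assess selectivity is to compare each trained model's next-token perplexity on held-out human-infecting versus benign viral genomes: a selective exclusion should inflate the former while leaving the latter close to an unfiltered baseline. The sketch below assumes a hypothetical autoregressive model interface that returns per-position logits; it is illustrative, not the project's fixed evaluation protocol.

```python
import math
import torch
import torch.nn.functional as F

def per_nucleotide_perplexity(model, token_ids: torch.Tensor) -> float:
    """Average next-token perplexity of an autoregressive model on one sequence.

    Assumes model(input_ids) returns logits of shape (batch, length, vocab);
    this interface is a placeholder, not a specific library's API.
    """
    model.eval()
    with torch.no_grad():
        logits = model(token_ids[:-1].unsqueeze(0))        # predict positions 1..L-1
        loss = F.cross_entropy(logits.squeeze(0), token_ids[1:])
    return math.exp(loss.item())

def evaluate_selectivity(models_by_exclusion: dict, eval_sets: dict) -> dict:
    """Mean perplexity for every (exclusion level, evaluation set) pair."""
    results = {}
    for level, model in models_by_exclusion.items():
        for set_name, sequences in eval_sets.items():  # e.g. "human_infecting", "benign"
            ppls = [per_nucleotide_perplexity(model, s) for s in sequences]
            results[(level, set_name)] = sum(ppls) / len(ppls)
    return results
```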
Skills you could learn: how to pretrain a language model; working with large biological datasets; machine learning/coding skills.
Ideal candidate: at least moderate coding abilities; experience with deep learning libraries like PyTorch, or a strong interest in learning them; sceptical mindset; excited and driven to get projects over the finish line.