
DPLM-2: a multimodal protein language model integrating sequence and structural data

Proteins, vital macromolecules, are characterized by their amino acid sequences, which determine their three-dimensional structures and, in turn, their functions in living organisms. Effective generative modeling of proteins therefore requires a multimodal approach that can simultaneously understand and generate both sequences and structures. Current methods often rely on separate models for each modality, which limits their effectiveness. Although advances such as diffusion models and protein language models have shown promise, there is a pressing need for models that integrate both modalities. Recent efforts such as Multiflow illustrate this challenge: its limitations in sequence understanding point to the potential of infusing structure generation with the evolutionary knowledge captured by sequence-based generative models.

There is growing interest in protein language models trained at evolutionary scale, including ESM, TAPE, ProtTrans, and others that excel at a variety of downstream tasks by harvesting evolutionary information from sequences. These models have shown promise in predicting protein structures and the effects of sequence variation. Simultaneously, diffusion models have become widespread in structural biology for protein generation, with different approaches focusing on aspects such as the orientation of the protein backbone and residues. Models such as RFDiffusion and ProteinSGM demonstrate the ability to engineer proteins with specific functions, and Multiflow integrates structure and sequence co-generation.

Researchers from Nanjing University and ByteDance Research present DPLM-2, a multimodal protein foundation model that extends the discrete diffusion protein language model (DPLM) to cover both sequences and structures. DPLM-2 learns the joint distribution of sequences and structures from experimental and synthetic data, using a lookup-free quantization tokenizer to represent structures as discrete tokens. The model addresses challenges such as aligning structure and sequence learning and mitigating exposure bias in generation. DPLM-2 efficiently co-generates compatible amino acid sequences and 3D structures, outperforming existing methods on a variety of conditional generation tasks while providing structure-aware representations useful for predictive applications.
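
To make the co-generation idea concrete, the sketch below shows how a discrete-diffusion language model of this kind can iteratively unmask a joint canvas of structure and sequence tokens. This is a minimal illustration under stated assumptions, not DPLM-2's released code: the denoiser `model`, the `MASK_ID`, the vocabulary size, and the interleaved token layout are all hypothetical.

```python
# Minimal sketch of masked discrete-diffusion co-generation, assuming a
# DPLM-2-style denoiser `model(tokens)` that returns per-position logits over
# a joint vocabulary of amino-acid and structure tokens. All names here
# (model, MASK_ID, vocabulary layout) are illustrative, not the released API.
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token

@torch.no_grad()
def cogenerate(model, length, steps=100, device="cpu"):
    # Start from a fully masked canvas: one structure token and one
    # sequence token per residue (interleaved here for simplicity).
    tokens = torch.full((1, 2 * length), MASK_ID, dtype=torch.long, device=device)
    for t in range(steps):
        logits = model(tokens)                 # (1, 2L, vocab_size)
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)             # most likely token + its confidence
        still_masked = tokens == MASK_ID
        if not still_masked.any():
            break
        # Unmask a fraction of the remaining positions each step,
        # highest-confidence positions first.
        n_unmask = max(1, int(still_masked.sum() * (1 / (steps - t))))
        conf = conf.masked_fill(~still_masked, -1.0)
        idx = conf.topk(n_unmask, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens  # interleaved structure/sequence token ids
```

This confidence-ordered unmasking loop is one common way to sample from masked discrete diffusion models; the paper's exact noise schedule and decoding order may differ.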

DPLM-2 is a multimodal protein diffusion language model that integrates protein sequences and their 3D structures within a discrete diffusion probabilistic framework. It uses a token-based representation that converts 3D protein backbone coordinates into discrete structure tokens, aligning them with the corresponding amino acid sequences. Training relies on a high-quality dataset and a denoising objective applied at various noise levels, so the model learns to generate protein structures and sequences simultaneously. In addition, DPLM-2 uses a lookup-free quantizer (LFQ) to tokenize structures efficiently, achieving high reconstruction accuracy and a strong correspondence with secondary-structure elements such as alpha helices and beta sheets.
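
The lookup-free quantizer is the piece that turns continuous structure latents into the discrete tokens the language model consumes. Below is a minimal sketch of the general LFQ idea, in which each latent dimension is binarized and the resulting bit pattern is read directly as a token index; the latent dimensionality, the example vocabulary size, and the encoder are placeholders rather than DPLM-2's actual configuration.

```python
# Hedged sketch of lookup-free quantization (LFQ): each latent dimension is
# binarized to +/-1 and the sign pattern itself is the token index, so no
# codebook lookup is needed. Dimensions below are assumptions for illustration.
import torch

def lfq_quantize(z):
    """z: (..., d) continuous latents from a structure encoder.
    Returns quantized latents and integer token ids in [0, 2**d)."""
    codes = torch.where(z > 0, 1.0, -1.0)            # elementwise binarization
    bits = (codes > 0).long()                        # +/-1 -> {0, 1}
    weights = 2 ** torch.arange(z.shape[-1], device=z.device)
    token_ids = (bits * weights).sum(-1)             # bit pattern -> integer id
    # Straight-through estimator so gradients reach the encoder during training.
    quantized = z + (codes - z).detach()
    return quantized, token_ids

# Example: d = 13 latent dims yields a 2**13 = 8192-token structure vocabulary.
z = torch.randn(1, 120, 13)                          # (batch, residues, d)
quantized, ids = lfq_quantize(z)
```

Because the token index is computed directly from the sign pattern, LFQ avoids the nearest-neighbor codebook search of standard vector quantization, which is what makes large structure vocabularies practical.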

The study evaluates DPLM-2 on a range of generation and understanding tasks, focusing on unconditional protein generation (structure, sequence, and co-generation) and several conditional tasks such as folding, inverse folding, and motif scaffolding. For unconditional generation, the authors assess the model's ability to simultaneously produce 3D structures and amino acid sequences. The quality, novelty, and diversity of the generated proteins are analyzed with metrics such as designability and foldability, alongside comparisons with existing models. DPLM-2 proves highly effective at generating diverse, high-quality proteins and shows clear advantages over baseline models.
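
As a rough illustration of how such designability metrics are typically computed, the sketch below scores generated samples by refolding each generated sequence and comparing the prediction to the generated backbone (the self-consistency TM-score, scTM). The callables `fold_sequence` and `tm_score` are hypothetical stand-ins for tools such as ESMFold and TM-align, not a fixed API from the paper.

```python
# Illustrative self-consistency check for generated proteins, assuming
# user-supplied wrappers around a folding model and a structural aligner.
def self_consistency(generated, fold_sequence, tm_score, threshold=0.5):
    """generated: list of (sequence, backbone_coords) pairs from the model.
    fold_sequence / tm_score: hypothetical callables wrapping, e.g.,
    ESMFold and TM-align."""
    scores = []
    for seq, coords in generated:
        refolded = fold_sequence(seq)              # predict a structure from the sequence
        scores.append(tm_score(refolded, coords))  # compare to the generated backbone
    # A sample is commonly counted as designable when scTM exceeds ~0.5.
    frac_designable = sum(s > threshold for s in scores) / max(1, len(scores))
    return scores, frac_designable
```

Higher scTM indicates that the generated sequence and structure are mutually compatible, which is exactly what a co-generation model is supposed to guarantee.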

DPLM-2 is a multimodal diffusion protein language model designed to understand, generate, and reason over both protein sequences and structures. Although it performs well on protein co-generation, folding, inverse folding, and motif scaffolding, several limitations remain. Limited structural data hinders DPLM-2's ability to learn robust representations, especially for longer protein chains. Additionally, although converting structures into discrete tokens aids multimodal modeling, it may discard fine-grained structural detail. Future research should combine the strengths of sequence-based and structure-based models to further expand protein generation capabilities.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter. Don't forget to join our 50k+ ML SubReddit.



Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and artificial intelligence to solve real-world problems. With a keen interest in practical applications, he brings a fresh perspective to the intersection of artificial intelligence and real-life solutions.
