A dataset of quantum chemical calculations of representative protein folds using the fragment molecular orbital method

The three-dimensional structures of biological macromolecules such as proteins and nucleic acids are critical to understanding their functions. These structures can be determined experimentally using X-ray crystallography, nuclear magnetic resonance spectroscopy and cryo-electron microscopy. As a result of such research, more than 200,000 structures are available from the Protein Data Bank (PDB) through the websites of the wwPDB member organizations1,2,3. Recently, AlphaFold2 (ref. 4) made it possible to build accurate model structures of proteins even in the absence of experimental information. UniProt5 provides a database of AlphaFold2 model structures called the AlphaFold Protein Structure Database (AlphaFold DB)6. Because such robust resources yield useful new insights, the accumulation of computational data from simulations is expected to become increasingly important.

There are two main methodologies for computing biomacromolecules: molecular dynamics (MD) simulations7, which probe dynamic behavior, and quantum mechanical (QM) calculations, which yield precise electronic states. MD simulations are used to study loop flexibility, molecular conformation in solvents, and especially interactions with ligand molecules. Although MD simulations take dynamic structural changes into account, they typically use fixed atomic charges. Biological macromolecules, however, perform their functions by forming specific atomic networks, including hydrogen bonds, ionic bonds, and nonpolar interactions, all of which involve a structure-dependent electronic state. QM calculation is an ab initio approach that can determine the electronic state of a given molecular conformation. In general, the computational cost of QM calculations scales approximately as the fourth to sixth power of the number of basis functions; therefore, QM is mainly applied to small molecules. Several methods have been developed to overcome this limitation. QM/MM methods such as ONIOM are hybrid approaches that partition a molecular system into regions, applying quantum chemical calculations to the target region and molecular force field calculations to the rest. Such methods have also been used to study chemical and enzymatic reactions8.
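To make the scaling argument concrete, the following minimal sketch computes how much more expensive a calculation becomes when the number of basis functions grows, assuming the cost is a pure power law in the number of basis functions. The function name `relative_cost` and the example sizes are illustrative assumptions; real timings also depend on prefactors, integral screening, and hardware.

```python
def relative_cost(n_basis: int, n_ref: int, exponent: float) -> float:
    """Cost of a calculation with n_basis functions relative to one with n_ref.

    Assumes cost ~ (number of basis functions) ** exponent and nothing else,
    which is an idealization of the fourth-to-sixth-power scaling quoted above.
    """
    return (n_basis / n_ref) ** exponent


if __name__ == "__main__":
    # Doubling the system size (1000 -> 2000 basis functions, hypothetical values).
    for exponent in (4, 5, 6):
        factor = relative_cost(2000, 1000, exponent)
        print(f"N^{exponent} scaling: doubling the basis costs {factor:.0f}x more")
```

Under this idealized model, doubling the number of basis functions raises the cost by a factor of 16 to 64, which illustrates why conventional QM calculations are mostly restricted to small molecules.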

Currently, the fragment molecular orbital (FMO) method9 is a promising full-QM method applicable to biological macromolecules. The FMO method divides biological macromolecules such as proteins and nucleic acids into residue-based fragments and performs quantum chemical calculations on them (Fig. 1a). The FMO method has been implemented in programs such as GAMESS10,11,12 and ABINIT-MP13,14,15 and is still under active development.

Fig. 1

Overview of the FMO-based quantum chemical energy dataset for protein structures. (a) A protein structure is divided into fragments in units of amino acid residues. (b) IFIE/PIEDA data are calculated from the interactions between fragments. (c) The dataset includes protein atomic coordinates and IFIE/PIEDA energy data.

Data obtained using the FMO method include the interfragment interaction energy (IFIE, also called pair interaction energy (PIE)), the total energy, and atomic charges. The advantage of IFIE/PIE is that it describes inter-residue interactions and thereby facilitates the energetic interpretation of inter- and intramolecular interactions (Fig. 1b). Pair interaction energy decomposition analysis (PIEDA)16 is a method for fragment interaction analysis that decomposes the IFIE into four components: electrostatic interaction (ES), exchange repulsion (EX), charge transfer with higher-order mixed terms (CT+mix), and dispersion interaction (DI). It can be used to determine quantitatively which of these components dominates the binding between fragments. For example, hydrogen bonds, which often occur between backbone and side chains of amino acid residues, can be assessed using the ES and CT+mix components. The DI component is particularly suitable for assessing nonpolar interactions and makes a large contribution to CH/π and π–π interactions17,18,19,20,21. Computational analyses of protein–ligand binding based on experimental structures have been reported22,23.

IFIE and PIEDA in the FMO method have the following relationships. The total energy of a molecule can be calculated using the following equation9:

$${E}_{{\rm{total}}}\approx {\sum }_{I > J}^{N}\left({E}_{{IJ}}^{{\prime} }-{E}_{I}^{{\prime} }-{E}_{J}^{{\prime} }\right)+{\sum }_{I > J}^{N}{\rm{Tr}}\left({\triangle D}^{{IJ}}{V}^{{IJ}}\right)+{\sum }_{I}^{N}{E}_{I}^{{\prime} }$$

(1)

where \({E}_{{IJ}}^{{\prime} }\), \({E}_{I}^{{\prime} }\) and \({E}_{J}^{{\prime} }\) are the energies, excluding the environmental electrostatic potential, of the fragment pair IJ, fragment I and fragment J, respectively; N is the number of fragments in the molecule; \({\triangle D}^{{IJ}}\) is the difference density matrix; and \({V}^{{IJ}}\) is the electrostatic potential of the surrounding fragments. The IFIE is defined by the following equation:

$${\triangle E}_{{IJ}}=\left({E}_{{IJ}}^{{\prime} }-{E}_{I}^{{\prime} }-{E}_{J}^{{\prime} }\right)+{\rm{Tr}}\left({\triangle D}^{{IJ}}{V}^{{IJ}}\right)$$

(2)

PIEDA components16 can be obtained from the following equation:

$${\triangle E}_{{IJ}}=\triangle {E}_{{IJ}}^{{\rm{ES}}}+\triangle {E}_{{IJ}}^{{\rm{EX}}}+\triangle {E}_{{IJ}}^{{\rm{CT}}+{\rm{mix}}}+\triangle {E}_{{IJ}}^{{\rm{DI}}}$$

(3)

That is, the IFIE is expressed as the sum of four energy components.
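The relationships in Eqs. (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any FMO program: the function names, the dictionary layout, and all numerical values are hypothetical, and the total energy is taken as the sum of the monomer energies over all fragments plus the IFIEs over all fragment pairs.

```python
from itertools import combinations


def ifie(e_dimer: float, e_i: float, e_j: float, tr_dv: float) -> float:
    """Eq. (2): IFIE of a fragment pair from E'_IJ, E'_I, E'_J and Tr(dD^IJ V^IJ)."""
    return (e_dimer - e_i - e_j) + tr_dv


def total_energy(monomer: dict, dimer: dict, tr_dv: dict) -> float:
    """Eq. (1): monomer energies summed over fragments plus IFIEs over all pairs.

    Pairs are keyed as (i, j) with i < j; this covers the same unordered pairs
    as the I > J summation in the text.
    """
    energy = sum(monomer.values())
    for i, j in combinations(sorted(monomer), 2):
        energy += ifie(dimer[(i, j)], monomer[i], monomer[j], tr_dv[(i, j)])
    return energy


def pieda_sum(components: dict) -> float:
    """Eq. (3): IFIE recovered as ES + EX + CT+mix + DI."""
    return sum(components[key] for key in ("ES", "EX", "CT+mix", "DI"))


if __name__ == "__main__":
    # Toy three-fragment system; all numbers are made up for illustration.
    monomer = {1: -10.0, 2: -20.0, 3: -30.0}
    dimer = {(1, 2): -30.5, (1, 3): -40.25, (2, 3): -50.125}
    tr_dv = {(1, 2): 0.1, (1, 3): 0.0, (2, 3): -0.05}
    print("IFIE(1,2):", ifie(dimer[(1, 2)], monomer[1], monomer[2], tr_dv[(1, 2)]))
    print("E_total  :", total_energy(monomer, dimer, tr_dv))
    print("PIEDA sum:", pieda_sum({"ES": -5.0, "EX": 2.0, "CT+mix": -1.0, "DI": -1.5}))
```

Keeping the pair quantities in dictionaries keyed by fragment indices mirrors how IFIE tables are usually read: one row per fragment pair, with the four PIEDA components adding up to the IFIE of that pair.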

A well-known quantum chemistry dataset is QM9, which contains quantum chemical calculation values for molecular structures with up to nine non-hydrogen atoms24. Our group also provides FMO calculation data through the FMODB database, which contains electronic states of biological macromolecules25. As of July 23, 2024, FMODB included 37,450 entries derived from 7,783 unique PDB entries. Such datasets are used for machine learning applications, and all-electron protein data are already being used to build artificial intelligence platforms and for other purposes26. The data recorded in FMODB reflect the interests of the contributing researchers. For example, there are many calculations for the protein kinase family (e.g., CDK2, p38 MAP, and Aurora), the nuclear receptor family (e.g., ERα and ERβ), proteins related to SARS-CoV-2 (ref. 27), and apoproteins determined by X-ray crystallography25,28. The authors aim to make FMO calculation results available for all structures deposited in the PDB so that the FMO method can be applied widely. As of September 2024, there were over 220,000 entries in the PDB; however, analysis of all entries is only possible if sufficient computing resources, such as supercomputers, can be used without restriction. Because the convergence of FMO calculations depends on the atomic coordinates of proteins and can be unpredictable for individual proteins owing to differences in amino acid sequences and crystallographic conditions such as resolution, it is desirable to collect convergence rate data and FMO-based energy distributions for representative structures before performing FMO calculations for all proteins in the PDB.

SCOP2, a protein fold database, was chosen as the basis of the dataset in this study to provide FMO calculation data for a wide range of proteins29,30. SCOP2 is a hierarchical classification of protein folds based on their structural and evolutionary relationships. It was derived from a subset of experimentally determined protein structures deposited in the PDB. The database is periodically updated to include new families and structures. As of June 29, 2022, there were 5,936 families in SCOP2. In this study, we present a complete FMO computational dataset that covers all experimentally characterized protein folds. This dataset, derived from protein structures associated with SCOP2 families, serves as a valuable resource for assessing the current capabilities of FMO methods and allows researchers to easily access quantum chemistry data for areas of interest.

In the FMO method, as in any QM calculation, judicious selection of the calculation method and basis set is of paramount importance for obtaining reliable and accurate results. The Hartree–Fock (HF) method is a fundamental ab initio quantum chemical method that uses the Hamiltonian operator and a single Slater determinant to approximate the ground-state wave function of a molecular system. Although the minimal STO-3G basis set offers advantages in computational cost, at least a double-zeta basis set with polarization functions is required to describe the various interactions in biomolecules. In the context of FMO calculations, the MP2/6-31G* level of theory (FMO-MP2/6-31G*) is preferred because of its balance between accuracy and computational cost. This is because, in contrast to the HF method, the MP2 method (second-order Møller–Plesset perturbation theory)31,32,33 can account for electron correlation, and the 6-31G* basis set includes polarization functions on non-hydrogen atoms. FMO-MP2/6-31G* is often used in the study of medium-sized organic compounds and in the analysis of intermolecular interactions, including hydrogen bonds, CH/π34 and π–π interactions between small molecules and proteins35,36. Additionally, all data published in FMODB use this level of theory25. Validation of the energy values obtained by the FMO method with various combinations of calculation methods and basis sets has been restricted to a small number of systems37. However, the recent development of supercomputers has made it possible to use higher levels of theory.

Basis functions are mathematical representations that approximate the spatial distribution of electrons within atomic orbitals. These functions are used to express molecular orbitals as linear combinations of atomic orbitals. The characteristics of the basis sets used in this study are listed in Table 1. In this study, we extended the 6-31G basis set with polarization functions either on non-hydrogen atoms only (6-31G*) or on both non-hydrogen and hydrogen atoms (6-31G**), thereby increasing the accuracy of the electronic structure calculations. In addition, we used the correlation-consistent polarized valence double-zeta (cc-pVDZ) basis set, which was specifically designed to account for electron correlation effects. Consequently, our dataset now includes the FMO-MP2/6-31G*, FMO-MP2/6-31G**, and FMO-MP2/cc-pVDZ levels of theory. Whereas 6-31G* places polarization functions only on non-hydrogen atoms, both cc-pVDZ and 6-31G** also place polarization functions (i.e., additional p-type functions) on hydrogen atoms. The cc-pVDZ basis set is distinguished by its use of Dunning-type functions and is constructed as a correlation-consistent basis38. Because CH/π and π–π interactions, which arise from dispersion forces associated with electron correlation, as well as hydrogen bonds promote protein folding, the use of either 6-31G** or cc-pVDZ is considered necessary to describe the polarization of hydrogen atoms correctly.
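As a rough illustration of how these basis sets differ in size, the sketch below counts contracted basis functions for a small molecule. The per-atom counts are assumptions stated in the code: they apply to H and to the first-row atoms C, N and O only, use Cartesian (six-component) d functions for the Pople sets and spherical (five-component) d functions for cc-pVDZ, and may differ from the conventions of a particular program. The helper `count_basis_functions` and the glycine example are hypothetical and are not taken from Table 1.

```python
# Contracted basis functions per atom (assumed conventions, see lead-in):
#   6-31G   : H = 2 (2s),          C/N/O = 9  (3s2p)
#   6-31G*  : H = 2,               C/N/O = 15 (3s2p + 6 Cartesian d)
#   6-31G** : H = 5 (2s + 3 p),    C/N/O = 15
#   cc-pVDZ : H = 5 (2s1p),        C/N/O = 14 (3s2p + 5 spherical d)
FUNCTIONS_PER_ATOM = {
    "6-31G": {"H": 2, "heavy": 9},
    "6-31G*": {"H": 2, "heavy": 15},
    "6-31G**": {"H": 5, "heavy": 15},
    "cc-pVDZ": {"H": 5, "heavy": 14},
}

FIRST_ROW_HEAVY = {"C", "N", "O"}


def count_basis_functions(formula: dict, basis: str) -> int:
    """Total basis functions for a molecule given as {element: atom count}."""
    per_atom = FUNCTIONS_PER_ATOM[basis]
    total = 0
    for element, n_atoms in formula.items():
        if element == "H":
            total += n_atoms * per_atom["H"]
        elif element in FIRST_ROW_HEAVY:
            total += n_atoms * per_atom["heavy"]
        else:
            raise ValueError(f"element {element!r} is outside this sketch")
    return total


if __name__ == "__main__":
    glycine = {"C": 2, "H": 5, "N": 1, "O": 2}  # C2H5NO2
    for basis in FUNCTIONS_PER_ATOM:
        print(f"{basis:8s} {count_basis_functions(glycine, basis)}")
```

For glycine, adding p functions on hydrogen (6-31G* to 6-31G**) raises the count from 85 to 100 under these assumptions, which shows why hydrogen polarization carries a noticeable cost in hydrogen-rich biomolecules.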

Table 1. Properties of the basis sets used in this study.

To date, no quantum chemical dataset has covered more than 5,000 protein structures classified into different families and calculated at multiple quantum chemical levels of theory. The present dataset is not only useful for analyzing protein functions and interactions, but is also expected to serve as training data for developing machine learning models for protein charge prediction. Notably, providing energy values calculated using three different basis sets for the same fragment pairs facilitates analysis of the influence of hydrogen atom polarization and electron correlation on intermolecular interactions.