A benchmark database for variations




2. Variation datasets affecting protein stability

A. Datasets of single variants

These benchmark datasets of variations affecting protein stability have been collected from the literature.

Dataset 1

This dataset contains 1784 mutations from 80 proteins with experimentally determined ΔΔG values in ProTherm (ProTherm update Dec. 19, 2008). It consists of 1,154 positive cases, of which 931 are destabilizing (ΔΔG ≥ 0.5 kcal/mol) and 222 are stabilizing (ΔΔG ≤ -0.5 kcal/mol), and 631 neutral cases (0.5 kcal/mol ≥ ΔΔG ≥ -0.5 kcal/mol).
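The three classes can be derived from a ΔΔG value with a simple threshold check. The following is a minimal sketch, assuming the cut-offs are inclusive at ±0.5 kcal/mol for the destabilizing and stabilizing classes and that everything in between is neutral. The function name `classify_ddg` and the example values are hypothetical and not part of the dataset.

```python
def classify_ddg(ddg, threshold=0.5):
    """Label a ΔΔG value (kcal/mol) with the ±0.5 cut-offs used for Dataset 1.

    Assumption: positive ΔΔG means destabilizing, as in the description above,
    and values exactly on a cut-off are counted as non-neutral.
    """
    if ddg >= threshold:
        return "destabilizing"
    if ddg <= -threshold:
        return "stabilizing"
    return "neutral"


# Illustrative values only, not taken from the dataset.
examples = [1.2, -0.8, 0.1, 0.5, -0.5]
labels = [classify_ddg(value) for value in examples]
print(labels)
```

Other datasets or tools may use the opposite sign convention for ΔΔG, so the labels should be checked against the convention of the file being processed.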

Download: Dataset 1
References:
Khan S, Vihinen M. Performance of protein stability predictors. Hum Mutat. 2010, 31(6):675-684.
Kumar MD, Bava KA, Gromiha MM, Prabakaran P, Kitajima K, Uedaira H, Sarai A: ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucleic Acids Res 2006, 34(Database issue):D204-206.   PUBMED  

Dataset 2

This dataset of 2156 variations was compiled from a list of 964 single mutations (Guerois et al., 2002) and a set of 2972 single variations obtained from the ProTherm database (Kumar et al., 2006), after filtering for duplicate entries. Structures determined by NMR are excluded from this dataset, and when several ΔΔG values were available for a single variation, only their average is given.
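Collapsing repeated measurements into one average ΔΔG per variation can be sketched as follows. This is an illustration only, assuming each record is identified by a protein identifier and a variation label; the record layout, the identifiers, and the function name `average_ddg` are hypothetical and do not reflect the actual file format or the authors' procedure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (protein ID, variation, ΔΔG in kcal/mol).
records = [
    ("1abc", "A45G", 1.0),
    ("1abc", "A45G", 2.0),   # second measurement of the same variation
    ("1abc", "L10V", -0.4),
    ("2xyz", "A45G", 0.5),   # same label, different protein: kept separate
]


def average_ddg(rows):
    """Return one mean ΔΔG per (protein, variation) pair."""
    grouped = defaultdict(list)
    for protein, variation, ddg in rows:
        grouped[(protein, variation)].append(ddg)
    return {key: mean(values) for key, values in grouped.items()}


averaged = average_ddg(records)
for key, value in sorted(averaged.items()):
    print(key, value)
```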

Download: Dataset 2

Reference: Potapov V, Cohen M, Schreiber G. Assessing computational methods for predicting protein stability upon mutation: good on average but not in the details. Protein Eng Des Sel. 2009, 22(9):553-560.   PUBMED  

Dataset 3

This dataset is composed of two subsets: a training dataset containing 339 variants studied experimentally in nine proteins, and a test dataset containing 625 variants from ProTherm.

  1. Training dataset: 339 variants from 9 proteins.  Download: Dataset 3(a)
  2. Blind test dataset: 625 variants from 28 proteins. Download: Dataset 3(b)

Reference: Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol. 2002, 320(2):369-387.   PUBMED  

Dataset 4

This dataset is derived from the July 2003 release of ProTherm and contains two subsets. The first one, S1615, was used for training and testing the neural network system. The second one, S388, was used as the test set and contains 388 variations collected only at physiological conditions. S388 is a subset of S1615. Only single variations with ΔΔG values in ProTherm and structures deposited in the PDB are present in the datasets.

  1. Training dataset: S1615 - 1615 variants from 42 proteins. Download: Dataset 4 (a)
  2. Test dataset: S388 (a subset of S1615) - 388 variants from 17 proteins. Download: Dataset 4(b)

Reference: Capriotti E, Fariselli P, Casadio R. A neural-network-based method for predicting protein stability changes upon single point mutations. Bioinformatics. 2004, 20 Suppl 1:i63-68.   PUBMED  

Dataset 5

This dataset consists of stability-affecting variants taken from the ProTherm database (updated February 22, 2013). The correctness and quality of each variant were checked manually. Several variants from the ProTherm database failed the quality criteria and were excluded. In total, the dataset contains 1,564 variations from 99 proteins, 77% of which came from ProTherm. The remaining variants have been corrected from the versions present in ProTherm or are new additions. This dataset has been used to train and test a novel tool, PON-Tstab, for predicting the effects of variants on stability.

  1. Training dataset: Dataset used for training and testing PON-Tstab. Download: PON-Tstab dataset

Reference: A manuscript describing the dataset and the tool, PON-Tstab, has been submitted for publication. In the meantime, please contact the authors for more details and use the following link for citation: http://structure.bmc.lu.se/PON-Tstab/

Dataset 6

Datasets used to train and test I-Mutant2.0.

  1. 2087 variants with sequence information Download: I_Mutant2.0_S2087 dataset
  2. 1948 variants with 3D structures Download: I_Mutant2.0_S2087 dataset

Reference: Capriotti, E.; Fariselli, P.; Casadio, R. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res 2005, 33:W306-W310.   PUBMED  

Dataset 7

Datasets used by Saraboji and coworkers.

  1. 1791 variations with PDB structure. Thermal denaturation method Download: Saraboji_S1791 dataset
  2. 1396 variants with thermal denaturation Download: Saraboji_S1396 dataset
  3. 2204 variants with chemical denaturation Download: Saraboji_S2204 dataset

Reference: Saraboji, K.; Gromiha, M. M.; Ponnuswamy, M. N. Average assignment method for predicting the stability of protein mutants. Biopolymers 2006, 82:80-92 doi: 10.1002/bip.20462.   PUBMED  

Dataset 8

Dataset used for iPTREE-STAB

  1. 1859 single variants in 64 proteins Download: iPTREE-STAB_S1859 dataset

Reference: Huang, L. T.; Gromiha, M. M.; Ho, S. Y. iPTREE-STAB: interpretable decision tree based method for predicting protein stability changes upon mutations. Bioinformatics 2007, 23:1292-1293.   PUBMED  

Dataset 9

Datasets used for SVM-WIN31 and SVM-3D12

  1. 1681 substitutions in 58 proteins Download: SVM-WIN31_SVM-3D12_S1681 dataset
  2. 1634 variants in 55 proteins, PDB structures available Download: SVM-WIN31_SVM-3D12_S1634 dataset
  3. 499 additional variants from a later version of ProTherm Download: SVM-WIN31_SVM-3D12_S499 dataset

Reference: Capriotti, E.; Fariselli, P.; Rossi, I.; Casadio, R. A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 2008, 9 Suppl 2:S6 doi: 10.1186/1471-2105-9-S2-S6   PUBMED  

Dataset 10

Dataset used for PoPMuSiC-2.0

  1. 2648 substitutions in 131 proteins Download: PoPMuSiC-2.0_S2648 dataset

Reference: Dehouck, Y.; Grosfils, A.; Folch, B.; Gilis, D.; Bogaerts, P.; Rooman, M. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 2009, 25:2537-2543 doi: 10.1093/bioinformatics/btp445   PUBMED  

Dataset 11

Dataset used for sMMGB

  1. 1109 variants Download: SMMGB_1109 dataset

Reference: Zhang, Z.; Wang, L.; Gao, Y.; Zhang, J.; Zhenirovskyy, M.; Alexov, E. Predicting folding free energy changes upon single point mutations. Bioinformatics 2012, 28:664-671. doi: 10.1093/bioinformatics/bts005   PUBMED  

Dataset 12

Dataset used for M8 and M47

  1. 2760 variants in 75 proteins Download: M47andM8_S2760 dataset
  2. 1810 variants in 71 proteins. Cases with ΔΔG between -0.5 and 0.5 kcal/mol excluded from S2760
    Download: M47andM8_S1810 dataset

Reference: Yang, Y.; Chen, B.; Tan, G.; Vihinen, M.; Shen, B. Structure-based prediction of the effects of a missense variant on protein stability. Amino Acids 2013, 44:847-855 doi: 10.1007/s00726-012-1407-7   PUBMED  

Dataset 13

Dataset used for EASE-MM

  1. 238 variants, subselection of I-Mutant2.0 Download: EASE-MM_S238 dataset
  2. 1676 variants Download: EASE-MM_S1676 dataset
  3. 543 variants in 55 proteins. Subset of the PoPMuSiC-2.0 dataset of 2648 variants, with <25% sequence identity to both S1676 and S236 Download: EASE-MM_S543 dataset

Reference: Folkman, L.; Stantic, B.; Sattar, A. Feature-based multiple models improve classification of mutation-induced stability changes. BMC Genomics 2014, 15 Suppl 4:S6 doi: 10.1186/1471-2164-15-S4-S6   PUBMED  

Dataset 14

Dataset used for HoTMuSiC

  1. 1626 variants in 90 proteins Download: HotMuSiC_S1626 dataset

Reference: Pucci, F.; Bourgeas, R.; Rooman, M. Predicting protein thermal stability changes upon point mutations using statistical potentials: Introducing HoTMuSiC. Sci Rep 2016, 6:23257 doi: 10.1038/srep23257   PUBMED  

Dataset 15

Dataset used for SAAFEC

  1. 1262 variants in 49 proteins Download: SAAFEC_S1262 dataset
  2. 983 variants in 42 proteins with 3D structures Download: SAAFEC_S983 dataset

Reference: Getov, I.; Petukh, M.; Alexov, E. SAAFEC: Predicting the Effect of Single Point Mutations on Protein Folding Free Energy Using a Knowledge-Modified MM/PBSA Approach. Int J Mol Sci 2016, 17:512 doi: 10.3390/ijms1704051   PUBMED  

Dataset 16

Dataset used for STRUM

  1. 3421 variants, protein structures available Download: STRUM_Q3421 dataset
  2. 306 variants in 32 proteins, sequence identity <60% to S2648 of PoPMuSiC Download: STRUM_Q306 dataset

Reference: Quan, L.; Lv, Q.; Zhang, Y. STRUM: structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics 2016, 32:2936-2946 doi: 10.1093/bioinformatics/btw361   PUBMED  

Dataset 17

Dataset used for a metapredictor

  1. 605 variants in 60 proteins. Measurements at pH 5-9 and temperature 20-30℃ Download: Broom_S605 dataset
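Selecting measurements by experimental conditions, as done for this set, can be sketched as a simple range filter. This is a minimal illustration, assuming both ranges are inclusive; the field names (`pH`, `temp_c`, `ddg`), the example rows, and the function name `within_conditions` are hypothetical and do not reflect the columns of the downloadable file.

```python
# Hypothetical measurements; values are illustrative, not from the dataset.
measurements = [
    {"variant": "A45G", "pH": 7.0, "temp_c": 25.0, "ddg": 1.1},
    {"variant": "L10V", "pH": 3.0, "temp_c": 25.0, "ddg": -0.3},  # pH too low
    {"variant": "K7E", "pH": 8.5, "temp_c": 40.0, "ddg": 0.6},    # too warm
    {"variant": "G12D", "pH": 5.0, "temp_c": 30.0, "ddg": 2.0},   # on both limits
]


def within_conditions(row, ph_range=(5.0, 9.0), temp_range=(20.0, 30.0)):
    """True if pH and temperature (in ℃) both fall inside the inclusive ranges."""
    ph_ok = ph_range[0] <= row["pH"] <= ph_range[1]
    temp_ok = temp_range[0] <= row["temp_c"] <= temp_range[1]
    return ph_ok and temp_ok


selected = [row["variant"] for row in measurements if within_conditions(row)]
print(selected)
```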

Reference: Broom, A.; Jacobi, Z.; Trainor, K.; Meiering, E. M. Computational tools help improve protein stability but with a solubility tradeoff. J Biol Chem 2017, 292:14349-14361 doi: 10.1074/jbc.M117.784165   PUBMED  

Dataset 18

Dataset used for Automute

  1. 1962 variants from S2204 of Saraboji et al., after removing cases that were missing from the PDB or had fewer than six nearest neighbours Download: AUTOMUTE_S1962 dataset
  2. 1925 variants selected from S1948 (I-Mutant2.0) after filtering Download: AUTOMUTE_S1925 dataset
  3. 1749 variants from S1791 of Saraboji et al., after removing cases that were missing from the PDB or had fewer than six nearest neighbours Download: AUTOMUTE_S1749 dataset

Reference: Masso, M.; Vaisman, II. Accurate prediction of stability changes in protein mutants by combining machine learning with structure based computational mutagenesis. Bioinformatics 2008, 24:2002-2009 doi: 10.1093/bioinformatics/btn353   PUBMED  

Dataset 19

Dataset for TP53 variants

  1. 42 variants in TP53 protein Download: 42_variations_in_P53 dataset

Reference: Pires, DE.; Ascher, DB.; Blundell, TL. mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics 2014, 30:335-342 doi: 10.1093/bioinformatics/btt691   PUBMED  

B. Datasets of double variants

These datasets contain cases with double variants.

Dataset 1

Dataset used for WET-STAB

  1. D180: 180 double variants in 27 proteins Download: D180 dataset

Reference: Huang, LT.; Gromiha, MM. Reliable prediction of protein thermostability change upon double mutation from amino acid sequence. Bioinformatics 2009, 25:2181-2187 doi: 10.1093/bioinformatics/btp370   PUBMED