A benchmark database for variations


General structural datasets

A. General structural datasets


Dataset used for PON-SC

Datasets for residue side-chain clashes: 7796 variations from the PDB in F1 and 350 variations from 5 test datasets in F2.

    F1    F2

Reference: Čalyševa, J., Vihinen, M. (2017). PON-SC - program for identifying steric clashes caused by amino acid substitutions. BMC Bioinformatics, 18(1), 531. doi:10.1186/s12859-017-1947-7.  PUBMED


A semi-automatically derived, hand-curated collection of proteins in which an amino acid has been changed by an SNV and for which 3D atomic coordinates are available in the PDB. F1 contains a benchmark dataset of 374 unique human variants, each corresponding to a different PDB entry.


Reference: Bhattacharya, R., Rose, P. W., Burley, S. K., & Prlić, A. (2017). Impact of genetic variation on three dimensional structure and function of proteins. PLoS ONE, 12(3), e0171355. doi:10.1371/journal.pone.0171355.  PUBMED


Dataset for Missense3D

1,965 disease-causing and 2,134 neutral variants.


Reference: Ittisoponpisan, S., Islam, S. A., Khanna, T., Alhuzimi, E., David, A., & Sternberg, M. J. E. (2019). Can Predicted Protein 3D Structures Provide Reliable Insights into whether Missense Variants Are Disease Associated? J Mol Biol, 431(11), 2197-2212. doi:10.1016/j.jmb.2019.04.009.  PUBMED


Dataset for protein structural analysis

6025 disease-associated and 4536 neutral variants.


Reference: Gao M, Zhou H, Skolnick J (2015). Insights into Disease-Associated Mutations in the Human Proteome through Protein Structural Analysis. Structure, 23(7):1362-9. doi:10.1016/j.str.2015.03.028  PUBMED


Dataset for analysis of accessibility of variants

F1 is a dataset of variations covered by 3D structure (HVAR3D-2.0). F2 is a dataset of protein sequences with variations (HVARSEQ).

    F1    F2

Reference: Savojardo C, Manfredi M, Martelli P, Casadio R (2021). Solvent Accessibility of Residues Undergoing Pathogenic Variations in Humans: From Protein Structures to Protein Sequences. Front Mol Biosci, 7:626363. doi: 10.3389/fmolb.2020.626363.  PUBMED

B. Transmembrane proteins


Membrane protein datasets with a total of 2058 variants in F1.

  1. HTPd_variants_info.csv
  2. HTPd.fasta
  3. DS508.fasta
  4. DS1289.fasta
  5. mpHTP.fasta
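
The `.fasta` files listed above can be read with a few lines of standard-library Python. The sketch below is only illustrative: it assumes plain FASTA formatting (a `>` header line followed by one or more sequence lines), and the example records are made up rather than taken from the datasets.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # join wrapped sequence lines
    return records


# Tiny in-memory example; headers and sequences are invented for illustration.
example = [">demo_protein_1", "MKTLLV", "AAGG", ">demo_protein_2", "MSTN"]
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))

# With a downloaded file (hypothetical local path), usage would look like:
# with open("HTPd.fasta") as handle:
#     records = parse_fasta(handle)
```

If two records share the same header, this sketch keeps only the last one. The column layout of HTPd_variants_info.csv is not described here, so it is safest to inspect its header row (for example with `csv.DictReader(...).fieldnames`) before relying on any column name.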

Reference: Orioli T, Vihinen M (2019). Benchmarking subcellular localization and variant tolerance predictors on membrane proteins. BMC Genomics, 20(Suppl 8):547. doi: 10.1186/s12864-019-5865-0.   PUBMED


Dataset for mCSM-membrane

Training dataset of 485 variants (347 pathogenic, 138 benign). Test dataset of 54 variants (38 pathogenic, 16 benign).
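
Both splits are dominated by pathogenic variants, so a predictor's accuracy on them is easier to interpret next to a trivial baseline. The following minimal sketch uses only the counts stated above; the helper name `majority_baseline` is hypothetical and is not part of mCSM-membrane.

```python
# Class counts as stated in the dataset description above.
splits = {
    "training": {"pathogenic": 347, "benign": 138},
    "test": {"pathogenic": 38, "benign": 16},
}


def majority_baseline(counts):
    """Return (majority label, accuracy of always predicting that label)."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


for name, counts in splits.items():
    label, acc = majority_baseline(counts)
    print(f"{name}: n={sum(counts.values())}, always-'{label}' accuracy={acc:.3f}")
# training: n=485, always-'pathogenic' accuracy=0.715
# test: n=54, always-'pathogenic' accuracy=0.704
```

A classifier evaluated on these splits needs to exceed roughly 0.70 accuracy before it improves on always answering "pathogenic", which is one reason class-balance-aware measures are often reported alongside accuracy.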

For the stability study, F1 contains the dataset used for cross-validation and F2 the dataset used as a blind test. For the pathogenicity study, F3 contains the dataset used for cross-validation and F4 the dataset used as a blind test.

    F1     F2     F3     F4

Reference: Pires D, Rodrigues C, Ascher D (2020). mCSM-membrane: predicting the effects of mutations on transmembrane proteins. Nucleic Acids Res, 48(W1):W147-W153. doi: 10.1093/nar/gkaa416.   PUBMED


Dataset for TMSNP

2624 pathogenic and 196,705 non-pathogenic variants used to train TMSNP, a transmembrane protein variant predictor.
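
The two classes differ in size by almost two orders of magnitude. The sketch below computes the imbalance ratio and one common reweighting heuristic, inverse-frequency ("balanced") class weights of the form N / (K * n_c). It is an illustration based only on the counts above; how TMSNP itself handled the imbalance is not stated here.

```python
# Class counts as stated in the dataset description above.
counts = {"pathogenic": 2624, "non-pathogenic": 196705}

total = sum(counts.values())   # N: all variants
n_classes = len(counts)        # K: number of classes

# Imbalance ratio: non-pathogenic variants per pathogenic variant.
ratio = counts["non-pathogenic"] / counts["pathogenic"]

# "Balanced" weights: each class contributes the same total weight (N / K).
weights = {label: total / (n_classes * n) for label, n in counts.items()}

print(f"total variants: {total}")
print(f"non-pathogenic per pathogenic: {ratio:.1f}")
for label, w in weights.items():
    print(f"weight[{label}] = {w:.4f}")
```

With these weights, the summed weight of each class is identical, so a weighted loss no longer rewards a model for ignoring the minority class.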


Reference: Garcia-Recio A, Gómez-Tamayo J, Reina I, Campillo M, Cordomí A, Olivella M, TMSNP: a web server to predict pathogenesis of missense mutations in the transmembrane region of membrane proteins, NAR Genom Bioinform. doi: 10.1093/nargab/lqab008.   PUBMED  


Dataset for transmembrane proteins

    F1     F2     F3     F4     F5

Reference: Ge F, Zhu Y, Xu J, Muhammad A, Song J, Yu D, MutTMPredictor: Robust and accurate cascade XGBoost classifier for prediction of mutations in transmembrane proteins, Comput Struct Biotechnol J. 2021 Nov 19;19:6400-6416. doi: 10.1016/j.csbj.2021.11.024.   PUBMED  

Last updated: 2022-02-25 by Niloofar Shirvanizadeh.