VariBench

A benchmark database for variations



1. Variation datasets affecting protein tolerance

DATASET 1

This is the neutral dataset, or nonsynonymous coding SNP dataset, comprising 21,170 human nonsynonymous coding SNPs with allele frequency >0.01 and chromosome sample count >49 from dbSNP build 131. This set was used for training PON-P.
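The selection criteria above can be sketched as a simple filter. This is a minimal illustration and not the pipeline used to build the dataset: it assumes the thresholds read as allele frequency > 0.01 and chromosome sample count > 49, and the `SnpRecord` class, its field names, and the example records are all hypothetical.

```python
from dataclasses import dataclass

# Assumed reading of the thresholds in the dataset description.
MIN_ALLELE_FREQUENCY = 0.01
MIN_CHROMOSOME_COUNT = 49


@dataclass(frozen=True)
class SnpRecord:
    """Hypothetical record layout; not the dbSNP or VariBench file format."""
    rs_id: str
    allele_frequency: float   # fraction between 0 and 1
    chromosome_count: int     # number of chromosomes sampled


def passes_frequency_filter(record: SnpRecord) -> bool:
    """Keep a SNP only if both values are strictly above their thresholds."""
    return (record.allele_frequency > MIN_ALLELE_FREQUENCY
            and record.chromosome_count > MIN_CHROMOSOME_COUNT)


def select_common_snps(records):
    """Return the records that pass the filter, preserving input order."""
    return [r for r in records if passes_frequency_filter(r)]


# Made-up records for illustration only.
example_records = [
    SnpRecord("rs_example_1", 0.25, 120),   # passes both thresholds
    SnpRecord("rs_example_2", 0.005, 300),  # allele frequency too low
    SnpRecord("rs_example_3", 0.10, 49),    # sample count not above 49
]
selected_ids = [r.rs_id for r in select_common_snps(example_records)]
print(selected_ids)
```

Both comparisons are strict, so a record sitting exactly on a threshold is excluded.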

DATASET 2

This is a subset of DATASET 1 from which cancer cases have been removed. It is also composed of neutral and pathogenic datasets.

DATASET 3

Amino acid substitutions annotated to affect protein activity were collected from the Protein Mutant Database (PMD). This set was used for testing PON-P.

DATASET 4

This is a subset of DATASET 1 obtained by clustering the protein sequences based on their sequence similarity to remove close homologues, which may cause problems with certain applications.

DATASET 5

This is a subset of DATASET 2 obtained by clustering the protein sequences based on their sequence similarity to remove close homologues, which may cause problems with certain applications.

DATASET 6

This is a subset of DATASET 3 extracted by clustering the protein sequences based on their sequence similarity.
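DATASETs 4 to 6 are described as subsets obtained by clustering protein sequences by similarity. The page does not state the clustering method or the similarity cut-off, so the following is only a rough sketch of the general idea: a greedy pass that keeps one representative per group of similar sequences. The similarity measure (`difflib.SequenceMatcher` from the Python standard library), the 0.9 threshold, and the example sequences are assumptions chosen for illustration, not the procedure used for these datasets.

```python
from difflib import SequenceMatcher


def similarity(seq_a: str, seq_b: str) -> float:
    """Rough string similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def pick_representatives(sequences: dict, threshold: float = 0.9) -> dict:
    """Greedy redundancy reduction.

    Walk through the sequences in input order and keep a sequence only if it
    is less similar than `threshold` to every representative kept so far.
    """
    representatives = {}
    for name, seq in sequences.items():
        if all(similarity(seq, rep) < threshold
               for rep in representatives.values()):
            representatives[name] = seq
    return representatives


# Made-up sequences: P2 differs from P1 at a single position, P3 is unrelated.
example_sequences = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQ",
    "P2": "MKTAYIAKQRQISFVKSHFSRA",
    "P3": "GSSGSSGLLVPRGSHMASMTGG",
}
kept = pick_representatives(example_sequences)
print(list(kept))
```

A greedy pass like this depends on input order: the first member of each similar group becomes its representative. Dedicated sequence clustering tools are the usual choice for real protein sets.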

DATASET 7

This is a subset of DATASET 2 filtered by the availability of features used in PON-P2. This dataset is used for training and testing PON-P2.

DATASET 8

These datasets were developed and used for the evaluation of selected prediction tools and for training of the consensus classifier PredictSNP.

DATASET 9

Filtered versions of five publicly available benchmark datasets for pathogenicity prediction. The sets were filtered/selected from HumVar, ExoVar, PredictSNP, VariBench and SwissVar.

DATASET 10

Protein-specific and general pathogenicity predictors for amino acid substitutions

Last updated: 2017-05-15.