A benchmark database for neutral variations from dbSNP

VariSNP is a benchmark database suite comprising variation datasets that can be used for developing and testing the performance of variant effect prediction tools. VariSNP contains datasets selected from dbSNP, from which disease-related variants found in ClinVar, Swiss-Prot and PhenCode have been filtered out, so all remaining variations are considered neutral or non-pathogenic.

The dataset columns are described below. Columns 1-23 come from dbSNP, columns 24-29 have been generated with the Mutalyzer Name Checker tool (Mutalyzer), and columns 30-32 have been generated with the VariOtator batch tool (VariOtator):

  1. dbSNP_id: dbSNP RefSNP cluster ID number (rs#)
  2. heterozygosity: Estimated average heterozygosity from allele frequencies of this RefSNP. Values are between 0 and 1. NCBI provides a document describing the computation of average heterozygosity and standard error for dbSNP RefSNP clusters
  3. heterozygosity_standard_error: Standard error of heterozygosity estimate. See column 2
  4. creation_date: Date when the RefSNP cluster was instantiated
  5. creation_build: Build (NCBI release) number when the RefSNP cluster was created
  6. update_date: Most recent date the RefSNP cluster was updated (member added or deleted)
  7. update_build: Build number (NCBI release) when the RefSNP cluster was updated
  8. observed_alleles: Observed variation alleles. All allele(s) observed at this position in the reference, for example A/C, A/C/G/T or -/ACC
  9. asn_from: Start position of the SNP on the contig, counting from 0. This position is always counted from the beginning of the contig, regardless of the SNP orientation to the contig and regardless of the contig orientation to the chromosome
  10. asn_to: End position of the SNP on the contig
  11. reference_allele: Reference allele(s), this can be a '-' in the case of an insertion
  12. orientation: Orientation of RefSNP sequence to contig sequence. Values are 'forward' or 'reverse'
  13. minor_allele_frequency: Global minor allele frequency (MAF). dbSNP reports the minor allele frequency for each rs included in a default global population. Since this is provided to distinguish common polymorphisms from rare variants, the MAF is actually the frequency of the second most frequent allele. In other words, if there are 3 alleles with frequencies of 0.50, 0.49 and 0.01, the MAF will be reported as 0.49. The current default global population is the 1000 Genomes phase 1 genotype data from 1094 worldwide individuals, released in the May 2011 dataset. Values from 0 to 0.50
  14. minor_allele: Minor allele
  15. sample_size: Sample size, which is the number of chromosomes in the sample population
  16. validation: Validation method, type of evidence used to confirm the variation. Present values can be byHapMap; byOtherPop; byFrequency; by1000G; by2Hit2Allele; byCluster
  17. hgvs_names: Description(s) of the variation according to HGVS recommendations
  18. allele_origin: Genetic origin of the allele, e.g. germline, somatic, inherited, maternal
  19. clinical_significance: Clinical significance. Assertions of clinical significance for alleles of human sequence variations are reported as provided by the submitter and not interpreted by NCBI. Submissions based on processing data from OMIM® were assigned the value of ‘probable-pathogenic’. If there is a published authoritative guideline about the pathogenicity of any allele, that is included in the report. The supported values are: unknown, untested, non-pathogenic, probable-non-pathogenic, probable-pathogenic, pathogenic, drug-response, histocompatibility, other
  20. functional_class: Variation functional class. Variations are assigned functional classes, which report whether a variation is located in a locus region, in a transcript, or in a coding region. This column contains one or more functional classes (fxnClass). Possible values are cds-indel, downstream-variant-500B, frameshift-variant, intron-variant, missense, nc-transcript-variant, reference, splice-acceptor-variant, splice-donor-variant, stop-gained, stop-lost, synonymous-codon, upstream-variant-2KB, utr-variant-3-prime. This column can also contain the Sequence Ontology term corresponding to the functional class (soTerm), the mRNA accession (mrnaAcc) and version (mrnaVer), the gene symbol (symbol) and the Entrez gene ID (geneid)
  21. ncbi_gi: NCBI gi number
  22. ncbi_accession: NCBI accession and version number of reference sequence, e.g. NG_01234.5
  23. gene_symbol: Gene symbol (provided by HGNC)
  24. refseq_start_description: Description relative to transcription start on reference sequence
  25. coding_dna_description: Coding DNA variant description according to HGVS recommendations
  26. protein_description: Protein variant description according to HGVS recommendations
  27. coding_reference: NCBI RefSeq accession and version number (mRNA), e.g. NM_01234.5
  28. protein_reference: NCBI RefSeq accession and version number (protein), e.g. NP_01234.5
  29. predicted_RNA_variation: Predicted RNA variant description according to HGVS recommendations (without reference)
  30. DNA_annotation: Variation Ontology VariO annotation on DNA level
  31. RNA_annotation: Variation Ontology VariO annotation on RNA level
  32. protein_annotation: Variation Ontology VariO annotation on protein level
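
The column descriptions above can be illustrated with a minimal Python sketch. It is not part of VariSNP, and the helper names (column_source, split_observed_alleles, global_maf) are hypothetical, introduced only for this example. The sketch maps a column number to the resource that produced it (columns 1-23, 24-29 and 30-32), splits an observed_alleles value on the '/' separator shown in column 8, and reproduces the column 13 rule that the reported MAF is the frequency of the second most frequent allele. It works on individual values only and makes no assumption about the file layout or delimiter of the datasets.

```python
from typing import Dict, List


def column_source(column_number: int) -> str:
    """Return the resource that produced a given column (1-32)."""
    if 1 <= column_number <= 23:
        return "dbSNP"
    if 24 <= column_number <= 29:
        return "Mutalyzer"
    if 30 <= column_number <= 32:
        return "VariOtator"
    raise ValueError(f"column number out of range: {column_number}")


def split_observed_alleles(observed: str) -> List[str]:
    """Split an observed_alleles value such as 'A/C' or '-/ACC' into alleles."""
    return observed.split("/")


def global_maf(allele_frequencies: Dict[str, float]) -> float:
    """Return the frequency of the second most frequent allele (column 13 rule)."""
    ranked = sorted(allele_frequencies.values(), reverse=True)
    if len(ranked) < 2:
        raise ValueError("at least two alleles are needed to report a MAF")
    return ranked[1]


# Worked example using the frequencies from the column 13 description.
print(column_source(25))
print(split_observed_alleles("-/ACC"))
print(global_maf({"A": 0.50, "C": 0.49, "G": 0.01}))
```

With three alleles at 0.50, 0.49 and 0.01, global_maf returns 0.49 rather than 0.01, which matches the example given for column 13.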