Protein Structure and
Bioinformatics Group

Prof. Mauno Vihinen

Lund University

News from PSB

2017-08-24 New publication:

Schaafsma GCP, 2017
Tools and annotations for variation.

Lund University Publications

Since the finishing of the Human Genome Project, many next-generation (NGS) or high-throughput sequencing platforms have emerged. One of the applications of NGS technology, variant discovery, can serve as a basis for precision medicine. Large sequencing projects are generating huge amounts of genetic variation data, which are stored in databases, either large central databases such as dbSNP, or gene- or disease-centered locus-specific databases (LSDBs). There are many variation databases with many different formats and varying quality. Apart from storage and analysis pipeline capacity problems, the interpretation of the variation is also an issue. Computational methods for predicting the effects of variants have been and are being developed, since experimental assessment of variation effects is often not feasible. Benchmark datasets are needed for the development and for performance assessment of such prediction methods. We studied quality related aspects of variant databases and benchmark datasets. The online tool called VariOtator was developed to aid in the consistent use of the Variation Ontology, which was specifically developed to describe variation. Standardization is one aspect of database quality; the use of an ontology for variant annotation will contribute to the enhancement of it. BTKbase is a locus-specific database containing information on variants in BTK, the gene involved in X-linked agammaglobulinemia (XLA), a primary immunodeficiency. If available, phenotypic data, i.e. the variant effects, are also provided. Statistics on variants and variation types showed that there is a wide spectrum of variants and variation types, and that the distribution of protein variants in the different BTK domains is not even. The VariSNP database containing datasets with neutral (non-pathogenic) variants was generated by selecting variants from dbSNP and filtering for variants found in the ClinVar, PhenCode and SwissProt databases. 
Variants in these three databases are considered to be disease-related. The VariSNP database contains 13 datasets following the functional classification of dbSNP, and is updated on a regular basis. To study the sensitivity to variation in different protein and disease groups, we predicted the pathogenicity of all possible single amino acid substitutions (SAASs) in all proteins in these groups, using the well-performing prediction method PON-P2. Large differences in the proportions of harmful, benign and unknown variants were found, and distinctive patterns of SAAS types were found, both in the original and variant amino acids. Representativeness is one quality aspect of variation benchmark datasets, and relates to the representation of the space of variants and their effects. We studied the coverage and distribution of protein features, including structure (CATH) and enzyme classification (EC), Pfam domains and Gene Ontology terms, in established benchmark datasets. None of the datasets is fully representative. Coverage of the features is in general better in the larger datasets, and better in the neutral datasets. At the higher levels of the CATH and EC classifications, all datasets were unbiased, but for the lower levels and other features, all datasets were biased.
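
The selection step described for VariSNP (take dbSNP variants, then drop those also found in disease-related sources) can be sketched as a set difference. This is only an illustration, not the actual VariSNP pipeline: the function name and the identifiers are hypothetical.

```python
def select_neutral(dbsnp_ids, disease_sources):
    """Return dbSNP identifiers that appear in none of the disease-related sources."""
    # Pool every identifier listed by any disease-related source.
    disease_related = set().union(*disease_sources.values())
    return sorted(set(dbsnp_ids) - disease_related)


# Hypothetical identifiers, for illustration only.
dbsnp = ["rs1", "rs2", "rs3", "rs4", "rs5"]
disease = {
    "ClinVar": {"rs2"},
    "PhenCode": {"rs4"},
    "SwissProt": {"rs2", "rs5"},
}

neutral = select_neutral(dbsnp, disease)
print(neutral)
```

Keeping the sources in a dictionary makes it easy to report which database excluded a given variant, should that be needed.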

2017-08-28 New publication:

Teku GN, 2017
Computational analysis on the effects of variations in T and B cells. Primary immunodeficiencies and cancer neoepitopes.

Lund University Publications

Computational approaches are essential to study the effects of inborn and somatic variations. Results from such studies contribute to better diagnosis and therapies. Primary immunodeficiencies (PIDs) are rare inborn defects of key immune response genes. Somatic variations are the main drivers of most cancers. Large and diverse data on PID genes and proteins can enable systems biology studies on their dynamic effects on T and B cells. Amino acid substitutions (AASs) are somatic variations that drive cancers. However, AASs also cause cancer-associated antigens that are recognized by lymphocytes as non-self, and are called neoantigens. Detailed analysis of these neoantigens can be performed due to the availability of cancer data from many consortia. The purpose of this thesis was to investigate the effects of PIDs on T and B cells and to explore features of neoepitopes in cancers. The objective of the first study was to detect the central T cell-specific protein network. The purpose of the second and third studies was to reconstruct the T and B cell network models and simulate the dynamic effects of PID perturbations. The aim of the fourth study was to characterize neoepitopes from pan-cancer datasets. The immunome interactome was reconstructed, and the links were weighted with gene expression correlation of integrated time series data (Paper I). The significance of the weighted links was computed with the Global Statistical Significance (GloSS) method, and the weighted interactome network was filtered to obtain the central T cell network. Next, the T cell network model was reconstructed from literature mining and the core T cell protein interaction network (Paper II). The B cell network model was reconstructed by mining the literature for central B cell interactions (Paper III). The normalized HillCube software was used to study the dynamic effects of PID perturbations in T and B cells. 
Proteome-wide AASs on putatively derived 8-, 9-, 10-, and 11-mer neoepitopes in 30 cancer types were analyzed with the NetMHC 4.0 software (Paper IV). The interconnectedness of the major T cell pathways is maintained in the central T cell PPI network. Empirical evidence from Gene Ontology term and essential gene enrichment analyses supported the central T cell network. The T and B cell simulations for several knockout PIDs correspond to previous results. In the T cell model, simulations for TCR, PTPRC, LCK, ZAP70 and ITK indicated profound disruption in network dynamics. BCL10, CARD11, MALT1, NEMO and MAP3K14 simulations showed significant effects. In the B cell model, the simulations for LYN, BTK, STIM1, ORAI1, CD19, CD21 and CD81 indicated profound changes to many proteins in the network. Severe effects were observed in the BCL10, CARD11, MALT1, NEMO, IKKB and WIPF1 knockout simulations. No major effects were observed for constitutively active PID proteins. The most likely epitopes are those which are detected by several major histocompatibility complexes (MHCs) and at several peptide lengths. Of all variants, 0.17% yield more than 100 neoepitopes. Amino acid distributions indicate that variants at all positions in neoepitopes of any length are, on average, more hydrophobic compared to the wild-type. The core T cell network approach is general and applicable to any system with adequate data. The T and B cell models enable the understanding of the dynamic effects of PID disease processes and reveal several novel proteins that may be of interest when diagnosing and treating immunological defects. The neoepitope characteristics can be employed for targeted cancer vaccine applications in personalized therapies.
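
To make the 8- to 11-mer idea concrete: each amino acid substitution gives rise to a set of candidate peptides, namely every window of the chosen length that still contains the substituted residue. The sketch below only enumerates those windows; it does not call NetMHC, and the sequence, position and function name are assumptions made for illustration.

```python
def variant_peptides(sequence, position, new_residue, lengths=(8, 9, 10, 11)):
    """Return, per length, all peptides that contain the substituted residue.

    position is 1-based, as in protein variant descriptions.
    """
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position outside sequence")
    variant = sequence[:index] + new_residue + sequence[index + 1:]
    peptides = {}
    for k in lengths:
        # Window start positions that keep the variant residue inside the peptide.
        first = max(0, index - k + 1)
        last = min(index, len(variant) - k)
        peptides[k] = [variant[s:s + k] for s in range(first, last + 1)]
    return peptides


# Hypothetical 33-residue sequence and substitution (S17W), for illustration only.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
candidates = variant_peptides(protein, 17, "W")
for k, peps in candidates.items():
    print(k, len(peps))
```

Away from the sequence ends a substitution yields k peptides of length k, so the four lengths together give 8 + 9 + 10 + 11 = 38 candidates per variant before any MHC binding prediction is applied.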

2017-06-22 New publication:

Daneshjou et al., 2017
Working towards precision medicine: predicting phenotypes from exomes in the Critical Assessment of Genome Interpretation (CAGI) challenges.

Precision medicine aims to predict a patient's disease risk and best therapeutic options by using that individual's genetic sequencing data. The Critical Assessment of Genome Interpretation (CAGI) is a community experiment consisting of genotype-phenotype prediction challenges; participants build models, undergo assessment, and share key findings. For CAGI 4, three challenges involved using exome sequencing data: bipolar disorder, Crohn's disease, and warfarin dosing. Previous CAGI challenges included prior versions of the Crohn's disease challenge. Here, we discuss the range of techniques used for phenotype prediction and discuss the methods used for assessing predictive models. Additionally, we outline some of the difficulties associated with making predictions and evaluating them. The lessons learned from the exome challenges can be applied to both research and clinical efforts to improve phenotype prediction from genotype. In addition, these challenges serve as a vehicle for sharing clinical and research exome data in a secure manner with scientists who have a broad range of expertise, contributing to a collaborative effort to advance our understanding of genotype-phenotype relationships.

2017-04-25 New publication:

Schaafsma G, Vihinen M, 2017
Large differences in proportions of harmful and benign amino acid substitutions between proteins and diseases.
Hum Mutat 38: 839-848 doi: 10.1002/humu.23236

Genes and proteins are known to have differences in their sensitivity to alterations. Despite numerous sequencing studies, proportions of harmful and harmless substitutions are not known for proteins and groups of proteins. To address this question, we predicted the outcome for all possible single amino acid substitutions in nine representative protein groups by using the PON-P2 method. The effects on 996 proteins were studied and vast differences were noticed. Proteins in the cancer group harbour the largest proportion of harmful variants (42.1%) while the non-disease group of proteins not known to have a disease association and not involved in the housekeeping functions had the lowest number of harmful variants (4.2%). Differences in the proportions of the harmful and benign variants are wide within each group but they still show clear differences between the groups. Frequently appearing protein domains show a wide spectrum of variant frequencies, whereas no major protein structural class-specific differences were noticed. Amino acid substitution types in the original and variant residues showed distinctive patterns, which are shared by all the protein groups. The observations are relevant for understanding genetic bases of diseases, variation interpretation and for the development of methods for that purpose.
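
As a rough illustration of what "all possible single amino acid substitutions" means in practice, the sketch below enumerates them for a short sequence. It shows only the enumeration, not the PON-P2 prediction itself; the function name and the toy sequence are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def all_substitutions(sequence):
    """Yield every single amino acid substitution as e.g. 'M1A' (1-based position)."""
    for position, original in enumerate(sequence, start=1):
        for variant in AMINO_ACIDS:
            if variant != original:
                yield f"{original}{position}{variant}"


# Toy three-residue sequence, for illustration only.
subs = list(all_substitutions("MKT"))
print(len(subs), subs[:3])
```

Each residue has 19 alternatives, so a protein of length L has 19 × L possible substitutions, which is why exhaustive analyses of this kind rely on computational predictors.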

2017-02-23 New publication:

Niroula A, Vihinen M, 2017
PON-P and PON-P2 predictor performance in CAGI challenges: Lessons learned.
Hum Mutat doi: 10.1002/humu.23199

Computational tools are widely used for ranking and prioritizing variants for characterizing their disease relevance. Since numerous tools have been developed, they have to be properly assessed before being applied. Critical Assessment of Genome Interpretation (CAGI) experiments have significantly contributed towards the assessment of prediction methods for various tasks. Within and outside the CAGI, we have addressed several questions that facilitate development and assessment of variation interpretation tools. These areas include collection and distribution of benchmark datasets, their use for systematic large scale method assessment, and the development of guidelines for reporting methods and their performance. For us, CAGI has provided a chance to experiment with new ideas, test the application areas of our methods, and network with other prediction method developers. In this article, we discuss our experiences and lessons learned from the various CAGI challenges. We describe our approaches, their performance, and impact of CAGI on our research. Finally, we discuss some of the possibilities that CAGI experiments have opened up and make some suggestions for future experiments.

2017-02-16 Update VariSNP datasets:

The VariSNP benchmark datasets were updated using the dbSNP XML datasets from the NCBI FTP website, which were last modified in November 2016 and contain variants from update build 149 (GRCh38p7).

2017-02-06 Update VariO:

Variation Ontology VariO was updated to version 1.05, with some minor changes and corrections. Some terms were removed:
VariO:0166 gene structure variation
VariO:0167 gene fusion
VariO:0168 gene deletion
VariO:0169 complete gene deletion
VariO:0170 partial gene deletion
VariO:0205 uncharacterized chromosomal variation
Some new terms were introduced:
VariO:0166 antigen receptor gene rearrangement
VariO:0167 effect on DNA form
VariO:0168 somatic hypermutation
VariO:0169 class switch recombination
VariO:0170 antigen receptor gene conversion
VariO:0205 dinucleotide expansion
VariO:0378 DNA transposon
VariO:0379 LINE
VariO:0380 SINE
VariO:0388 LTR
VariO:0389 pentanucleotide expansion
VariO:0390 effect on DNA double helix
VariO:0391 plasmid
VariO:0392 insertion sequence
VariO:0393 effect on DNA pseudoknot
VariO:0394 effect on DNA cruciform
VariO:0395 self-cleavage by ribozyme activity
VariO:0403 group I intron
VariO:0404 group II intron
VariO:0405 dicentric translocation
VariO:0406 dicentric isoduplication
VariO:0407 edited DNA
VariO:0408 RNA chimera
VariO:0409 frameshifted RNA
VariO:0410 spliced RNA
VariO:0411 alternatively spliced RNA
VariO:0412 effect on catalytic DNA activity
VariO:0413 effect on A DNA
VariO:0414 effect on B DNA
VariO:0415 effect on C DNA
VariO:0416 effect on L DNA
VariO:0417 effect on S DNA
VariO:0418 effect on D DNA
VariO:0419 effect on H DNA
VariO:0420 effect on four-stranded DNA
VariO:0421 effect on Z DNA
VariO:0422 effect on intramolecular DNA triple helix
VariO:0423 effect on intermolecular DNA triple helix
VariO:0424 effect on DNA-RNA hybrid
VariO:0425 effect on RNA triplex helix
VariO:0426 effect on four-stranded RNA
VariO:0427 type of chromosomal amplification
VariO:0428 genome variation
VariO:0429 complex genomic variation
VariO:0430 nucleotide expansion
VariO:0431 effect on R loop
VariO:0432 effect on T loop
VariO:0433 effect on D loop

2017-01-10 New publication:

Niroula A, Vihinen M, 2017
Predicting severity of disease-causing variants.
Hum Mutat 38: 357-364 doi: 10.1002/humu.23173

Most diseases, including those of genetic origin, express a continuum of severity. Clinical interventions for numerous diseases are based on the severity of the phenotype. Predicting severity due to genetic variants could facilitate diagnosis and choice of therapy. Although computational predictions have been used as evidence for classifying the disease-relevance of genetic variants, special tools for predicting disease severity in large scale are missing. Here, we manually curated a dataset containing variants leading to severe and less severe phenotypes and studied the abilities of variation impact predictors to distinguish between them. We found that these tools cannot separate the two groups of variants. Then, we developed a novel machine learning-based method, PON-PS, for classification of amino acid substitutions associated with benign, severe, and less severe phenotypes. We tested the method using an independent test dataset and variants in four additional proteins. For distinguishing severe and non-severe variants, PON-PS showed an accuracy of 61% in the test dataset, which is higher than for existing tolerance prediction methods. PON-PS is the first generic tool developed for this task. The tool can be used together with other evidence for improving diagnosis and prognosis and for prioritization of preventive interventions, clinical monitoring, and molecular tests.

2016-11-10 New publication:

Vihinen M, 2016
How to define pathogenicity, health and disease?
Hum Mutat 38: 129-136 doi: 10.1002/humu.23144

Scientific and clinical communities produce ever increasing amounts of data and details about health and disease. Our ability to understand and utilize this information is limited due to imprecise language and lack of well-defined concepts. This problem also involves the principal concepts of health, disease and pathogenicity. Here, a systematic model is presented for pathogenicity, as well as for health and disease. It has three components: extent, modulation and severity, which jointly define the continuum of pathogenicity. The model is population based, and once implemented can be used for numerous purposes such as diagnosis, patient stratification, prognosis, finding phenotype-genotype correlations or explaining adverse drug reactions. The new model has several benefits including health economy by allowing evidence based personalized/precision medicine.

2017-01-09 New publication:

Viennas E, Komianou A, Mizzi C, Stojiljkovic M, Mitropoulou C, Muilu J, Vihinen M, Grypioti P, Papadaki S, Pavlidis C, Zukic B, Katsila T, van der Spek PJ, Pavlovic S, Tzimas G, Patrinos GP. 2017
Expanded national database collection and data coverage in the FINDbase worldwide database for clinically relevant genomic variation allele frequencies.
Nucleic Acids Res. 45: D846-D853

FINDbase is a comprehensive data repository that records the prevalence of clinically relevant genomic variants in various populations worldwide, such as pathogenic variants leading mostly to monogenic disorders and pharmacogenomics biomarkers. The database also records the incidence of rare genetic diseases in various populations, all in well-distinct data modules. Here, we report extensive data content updates in all data modules, with direct implications to clinical pharmacogenomics. Also, we report significant new developments in FINDbase, namely (i) the release of a new version of the ETHNOS software that catalyzes development curation of national/ethnic genetic databases, (ii) the migration of all FINDbase data content into 90 distinct national/ethnic mutation databases, all built around Microsoft's PivotViewer software, (iii) new data visualization tools and (iv) the interrelation of FINDbase with the DruGeVar database with direct implications in clinical pharmacogenomics. The abovementioned updates further enhance the impact of FINDbase, as a key resource for Genomic Medicine applications.

2017-01-09 New publication:

Hamasy A, Wang Q, Blomberg KE, Mohammad DK, Yu L, Vihinen M, Berglöf A, Smith CI. 2017
Substitution scanning identifies a novel, catalytically active ibrutinib-resistant BTK cysteine 481 to threonine (C481T) variant.
Leukemia 31: 177-185

Irreversible Bruton tyrosine kinase (BTK) inhibitors, ibrutinib and acalabrutinib, have demonstrated remarkable clinical responses in multiple B-cell malignancies. Acquired resistance has been identified in a sub-population of patients in which mutations affecting BTK predominantly substitute cysteine 481 in the kinase domain for catalytically active serine, thereby ablating covalent binding of inhibitors. Activating substitutions in the BTK substrate phospholipase Cγ2 (PLCγ2) instead confer resistance independent of BTK. Herein, we generated all six possible amino acid substitutions due to single nucleotide alterations for the cysteine 481 codon, in addition to threonine, requiring two nucleotide substitutions, and performed functional analysis. Replacement by arginine, phenylalanine, tryptophan or tyrosine completely inactivated the catalytic activity, whereas substitution with glycine caused severe impairment. BTK with threonine replacement was catalytically active, similar to substitution with serine. We identify three potential ibrutinib resistance scenarios for cysteine 481 replacement: (1) Serine, being catalytically active and therefore predominating among patients. (2) Threonine, also being catalytically active, but predicted to be scarce, because two nucleotide changes are needed. (3) As BTK variants replaced with other residues are catalytically inactive, they presumably need compensatory mutations, therefore being very scarce. Glycine and tryptophan variants were not yet reported but likely also provide resistance.
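
The counts in this abstract (six amino acids reachable from a cysteine codon by one nucleotide change, threonine needing two) follow directly from the standard genetic code. The sketch below checks this for both cysteine codons, TGT and TGC; which of the two BTK uses at position 481 is not stated here, so neither is assumed. The function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_nucleotide_neighbours(codon):
    """Map every codon one substitution away to the amino acid it encodes."""
    neighbours = {}
    for i, base in enumerate(codon):
        for alt in BASES:
            if alt != base:
                new_codon = codon[:i] + alt + codon[i + 1:]
                neighbours[new_codon] = CODON_TABLE[new_codon]
    return neighbours


def min_changes_to(codon, amino_acid):
    """Smallest number of nucleotide substitutions that yields the amino acid."""
    return min(
        sum(a != b for a, b in zip(codon, target))
        for target, aa in CODON_TABLE.items()
        if aa == amino_acid
    )


for cys_codon in ("TGT", "TGC"):
    encoded = set(single_nucleotide_neighbours(cys_codon).values())
    reachable = sorted(encoded - {"C", "*"})  # drop synonymous and stop changes
    print(cys_codon, reachable, min_changes_to(cys_codon, "T"))
```

For either codon the reachable set is F, G, R, S, W and Y (phenylalanine, glycine, arginine, serine, tryptophan, tyrosine), and threonine is two substitutions away, in line with the text above.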

2016-11-02 New publication:

Vihinen M, 2016
Establishment of an international database for genetic variants in esophageal cancer.
Ann NY Acad Sci 1381: 45-49

The establishment of a database has been suggested in order to collect, organize, and distribute genetic information about esophageal cancer. The World Organization for Specialized Studies on Diseases of the Esophagus and the Human Variome Project will be in charge of a central database of information about esophageal cancer-related variations from publications, databases, and laboratories; in addition to genetic details, clinical parameters will also be included. The aim will be to get all the central players in research, clinical, and commercial laboratories to contribute. The database will follow established recommendations and guidelines. The database will require a team of dedicated curators with different backgrounds. Numerous layers of systematics will be applied to facilitate computational analyses. The data items will be extensively integrated with other information sources. The database will be distributed as open access to ensure exchange of the data with other databases. Variations will be reported in relation to reference sequences on three levels (DNA, RNA, and protein) whenever applicable. In the first phase, the database will concentrate on genetic variations including both somatic and germline variations for susceptibility genes. Additional types of information can be integrated at a later stage.

2016-09-15 New publication:

Vihinen M, 2016
Both generic and protein-specific tolerance predictors are needed.
Hum Mutat 37: 989

2016-06-27 New publication:

Yang Y, Niroula A, Shen B, Vihinen M, 2016
PON-Sol: prediction of effects of amino acid substitutions on protein solubility.
Bioinformatics 32: 2032-2034 doi: 10.1093/bioinformatics/btw066

Solubility is one of the fundamental protein properties. It is of great interest because of its relevance to protein expression. Reduced solubility and protein aggregation are also associated with many diseases.
We collected from literature the largest experimentally verified solubility affecting amino acid substitution (AAS) dataset and used it to train a predictor called PON-Sol. The predictor can distinguish both solubility decreasing and increasing variants from those not affecting solubility. PON-Sol has normalized correct prediction ratio of 0.491 on cross-validation and 0.432 for independent test set. The performance of the method was compared both to solubility and aggregation predictors and found to be superior. PON-Sol can be used for the prediction of effects of disease-related substitutions, effects on heterologous recombinant protein expression and enhanced crystallizability. One application is to investigate effects of all possible AASs in a protein to aid protein engineering.
PON-Sol is freely available online, as are the training and test data.

2016-06-09 Update VariSNP datasets:

The VariSNP benchmark datasets were updated using the dbSNP XML datasets from the NCBI FTP website, which were last modified in April 2016 and contain variants from update build 147 (GRCh38p2).

2016-03-21 New publication:

Niroula A, Vihinen M, 2016
PON-mt-tRNA: a multifactorial probability-based method for classification of mitochondrial tRNA variations.
Nucl. Acids Res. 44: 2020-2027 doi: 10.1093/nar/gkw046

Transfer RNAs (tRNAs) are essential for encoding the transcribed genetic information from DNA into proteins. Variations in the human tRNAs are involved in diverse clinical phenotypes. Interestingly, all pathogenic variations in tRNAs are located in mitochondrial tRNAs (mt-tRNAs). Therefore, it is crucial to identify pathogenic variations in mt-tRNAs for disease diagnosis and proper treatment. We collected mt-tRNA variations using a classification based on evidence from several sources and used the data to develop a multifactorial probability-based prediction method, PON-mt-tRNA, for classification of mt-tRNA single nucleotide substitutions. We integrated a machine learning-based predictor and an evidence-based likelihood ratio for pathogenicity using evidence of segregation, biochemistry and histochemistry to predict the posterior probability of pathogenicity of variants. The accuracy and Matthews correlation coefficient (MCC) of PON-mt-tRNA are 1.00 and 0.99, respectively. In the absence of evidence from segregation, biochemistry and histochemistry, PON-mt-tRNA classifies variations based on the machine learning method with an accuracy and MCC of 0.69 and 0.39, respectively. We classified all possible single nucleotide substitutions in all human mt-tRNAs using PON-mt-tRNA. The variations in the loops are more often tolerated compared to the variations in stems. The anticodon loop contains comparatively more predicted pathogenic variations than the other loops. PON-mt-tRNA is available online.
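
For readers unfamiliar with the two performance measures quoted above, the sketch below computes accuracy and the Matthews correlation coefficient from the four cells of a binary confusion matrix. The counts are made up for illustration and are not taken from the PON-mt-tRNA evaluation.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: true/false positives and negatives.
counts = dict(tp=40, tn=45, fp=5, fn=10)
print(round(accuracy(**counts), 3), round(mcc(**counts), 3))
```

MCC ranges from -1 to 1 and uses all four cells, which makes it more informative than accuracy when the pathogenic and neutral classes differ in size.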

2016-03-15 New publication:

Niroula A, Vihinen M, 2016
Variation Interpretation Predictors: Principles, Types, Performance and Choice.
Hum Mutat. 37: 579-597 doi: 10.1002/humu.22987

Next-generation sequencing methods have revolutionized the speed of generating variation information. Sequence data have a plethora of applications and will increasingly be used for disease diagnosis. Interpretation of the identified variants is usually not possible with experimental methods. This has caused a bottleneck that many computational methods aim at addressing. Fast and efficient methods for explaining the significance and mechanisms of detected variants are required for efficient precision/personalized medicine. Computational prediction methods have been developed in three areas to address the issue. There are generic tolerance (pathogenicity) predictors for filtering harmful variants. Gene/protein/disease-specific tools are available for some applications. Mechanism and effect-specific computer programs aim at explaining the consequences of variations. Here, we discuss the different types of predictors and their applications. We review available variation databases and prediction methods useful for variation interpretation. We discuss how the performance of methods is assessed and summarize existing assessment studies. A brief introduction is provided to the principles of the methods developed for variation interpretation as well as guidelines for how to choose the optimal tools and where the field is heading in the future.

2016-02-26 New publication:

Vihinen M, Hancock JM, Maglott DR, Landrum MJ, Schaafsma GCP, Taschner PEM, 2016
Human Variome Project Quality Assessment Criteria for Variation Databases.
Hum Mutat. 37: 549-558 doi: 10.1002/humu.22976

Numerous databases containing information about DNA, RNA, and protein variations are available. Gene-specific variant databases (locus-specific variation databases, LSDBs) are typically curated and maintained for single genes or groups of genes for a certain disease(s). These databases are widely considered as the most reliable information source for a particular gene/protein/disease, but it should also be made clear they may have widely varying contents, infrastructure, and quality. Quality is very important to evaluate because these databases may affect health decision-making, research, and clinical practice. The Human Variome Project (HVP) established a Working Group for Variant Database Quality Assessment. The basic principle was to develop a simple system that nevertheless provides a good overview of the quality of a database. The HVP quality evaluation criteria that resulted are divided into four main components: data quality, technical quality, accessibility, and timeliness. This report elaborates on the developed quality criteria and how implementation of the quality scheme can be achieved. Examples are provided for the current status of the quality items in two different databases, BTKbase, an LSDB, and ClinVar, a central archive of submissions about variants and their clinical significance.

2016-02-03 Update VariSNP datasets:

The VariSNP benchmark datasets were updated using the dbSNP XML datasets from the NCBI FTP website, which were last modified in January 2016 and contain variants from update build 146 (GRCh38p2).

2016-01-18 New publication:

Schaafsma GCP, Vihinen M, 2016
VariOtator, a software tool for variation annotation with the Variation Ontology.
Hum Mutat. 37: 344-349. doi: 10.1002/humu.22954.

The Variation Ontology (VariO) is used for describing and annotating types, effects, consequences and mechanisms of variations. To facilitate easy and consistent annotations, the online application VariOtator was developed. For variation type annotations VariOtator is fully automated, accepting variant descriptions in Human Genome Variation Society (HGVS) format, and generating VariO terms, either with or without full lineage, i.e. all parent terms. When a coding DNA variant description with a reference sequence is provided, VariOtator checks the description first with Mutalyzer and then generates the predicted RNA and protein descriptions with their respective VariO annotations. For the other sublevels - function, structure and property - annotations cannot be automated, and VariOtator generates annotation based on provided details. For VariO terms relating to structure and property, one can use attribute terms as modifiers and Evidence Code (ECO) terms for annotating experimental evidence. There is an online batch version, and stand-alone batch versions to be used with a Leiden Open Variation Database (LOVD) download file. A SOAP web service allows client programs to access VariOtator programmatically. Thus, systematic variation effect and type annotations can be efficiently generated to allow easy use and integration of variations and their consequences.

2015-12-03 Update VariSNP datasets:

The VariSNP benchmark datasets were updated using the dbSNP XML datasets from the NCBI FTP website, which were last modified in October 2015 and contain variants from update build 144 (GRCh38).

2015-09-09 New publication:

Niroula A, Vihinen M, 2015
Classification of amino acid substitutions in mismatch repair proteins using PON-MMR2.
Hum Mutat. 36: 1128-1134. doi: 10.1002/humu.22900.

2015-08-26 PON-P2 prediction data available:

Prediction data from PON-P2 for amino acid substitutions in COSMIC (v68) are available.

2015-08-20 Update VariO:

Variation Ontology VariO was updated to version 1.04, with some minor changes and corrections. Three new terms were introduced:
VariO:0017 nonsynonymous variation
VariO:0343 synonymous variation
VariO:0363 effect on RNA G-quadruplex

2015-08-19 New publication:

Vihinen M, 2015
Muddled genetic terms miss and mess the message.
Trends Genet. 31:423-425. doi: 10.1016/j.tig.2015.05.008

A critical aspect of science is the clear communication of complicated matters. However, language is often ambiguous, and the message can get lost in the telling. In particular, genetic terms can have different meanings for different people. Here, I discuss this problem and suggest remedies to clarify the message.

2015-08-19 New publication:

Wuttge DM, Carlsen AL, Teku G, Steen SO, Wildt M, Vihinen M, Hesselstrand R, Heegaard NH, 2015
Specific autoantibody profiles and disease subgroups correlate with circulating micro-RNA in systemic sclerosis.
Rheumatology (Oxford). 2015 Jul 10. pii: kev234. [Epub ahead of print]

2015-08-19 New publication:

Niroula A, Vihinen M, 2015
Harmful somatic amino acid substitutions affect key pathways in cancers.
BMC Med Genomics: 8(1):53 doi:10.1186/s12920-015-0125-x

Cancer is characterized by the accumulation of large numbers of genetic variations and alterations of multiple biological phenomena. Cancer genomics has largely focused on the identification of such genetic alterations and the genes containing them, known as 'cancer genes'. However, the non-functional somatic variations out-number functional variations and remain as a major challenge. Recurrent somatic variations are thought to be cancer drivers but they are present in only a small fraction of patients.
We performed an extensive analysis of amino acid substitutions (AASs) from 6,861 cancer samples (whole genome or exome sequences) classified into 30 cancer types and performed pathway enrichment analysis. We also studied the overlap between the cancers based on proteins containing harmful AASs and pathways affected by them.
We found that only a fraction of AASs (39.88%) are harmful even in known cancer genes. In addition, we found that proteins containing harmful AASs in cancers are often centrally located in protein interaction networks. Based on the proteins containing harmful AASs, we identified significantly affected pathways in 28 cancer types and indicate that proteins containing harmful AASs can affect pathways despite the frequency of AASs in them. Our cross-cancer overlap analysis showed that it would be more beneficial to identify affected pathways in cancers rather than individual genes and variations.
Pathways affected by harmful AASs reveal key processes involved in cancer development. Our approach filters out the putatively benign AASs, thus reducing the list of cancer variations and allowing reliable identification of affected pathways. The pathways identified in individual cancers and the overlap between cancer types open avenues for further experimental research and for developing targeted therapies and interventions.

2015-05-21 New publication:

Vihinen M, 2015
No more hidden solutions in bioinformatics.
Nature 521: 261 doi:10.1038/521261a

2015-04-23 New publication:

Vihinen M, 2015
The Importance of Proper Testing of Predictor Performance.
Hum Mutat. 36(5): iii-iv

2015-04-22 Update VariO:

Variation Ontology VariO was updated to version 1.03, with some minor changes and corrections.

2015-04-22 Update VariSNP datasets:

Because some 'Pathogenic/Likely pathogenic' entries were present in the VariSNP benchmark datasets updated on 2015-04-09, the sets were updated again to remove these entries. The cds-indel, downstream-variant-500B, frameshift-variant and stop-lost sets were not affected.

2015-04-09 Update VariSNP datasets:

The VariSNP benchmark datasets were updated using the dbSNP XML datasets from the NCBI FTP website, which were last modified in November 2014 and contain variants from update build 142 (GRCh38).

2015-03-31 New publication:
Väliaho J, Faisal I, Ortutay C, Smith CIE, Vihinen M, 2015
Characterization of all possible single nucleotide change-caused amino acid substitutions in the kinase domain of Bruton tyrosine kinase.
Hum Mutat. 36: 638-647

Knowledge about features distinguishing deleterious and neutral variations is crucial for interpretation of novel variants. Among the human protein kinases, Bruton tyrosine kinase (BTK) contains the highest number of unique disease-causing variations; still, this is just 10% of all the possible single nucleotide substitution-caused amino acid variations. Altogether 1495 such variants can appear in the BTK kinase domain (BTK-KD). We investigated them all with bioinformatic and protein structure analysis methods. Most disease-causing variations affect conserved and buried residues, disturbing protein stability. A minority of exposed residues is conserved, but strongly tied to pathogenicity. 67% of the variations are predicted to be harmful. In 39% of the residues, all the variants are likely harmful, while in 10% of sites all the substitutions are tolerated. Results indicate the importance of the entire kinase domain, involvement in numerous interactions, and intricate functional regulation by conformational change. These results can be extended to other protein kinases and organisms.

2015-03-31 New publication:
Smith TD, Vihinen M, 2015
Standard development at the Human Variome Project.
Database (2015) Vol. 2015: article ID bav024; doi:10.1093/database/bav024

The Human Variome Project (HVP) is a world organization working towards facilitating the collection, curation, interpretation and free and open sharing of genetic variation information. A key component of HVP activities is the development of standards and guidelines. HVP Standards are systems, procedures and technologies that the HVP Consortium has determined must be used by HVP-affiliated data sharing infrastructure and should be used by the broader community. HVP Guidelines are considered to be beneficial for HVP-affiliated data sharing infrastructure and the broader community to adopt. The HVP also maintains a process for assessing systems, processes and tools that implement HVP Standards and Guidelines. Recommended System Status is an accreditation process designed to encourage the adoption of HVP Standards and Guidelines. Here, we describe the HVP standards development process and discuss the accepted standards, guidelines and recommended systems as well as those under acceptance. Certain HVP Standards and Guidelines are already widely adopted by the community and there are committed users for the others.

Updated 2017-06-14 by Gerard Schaafsma

ImmunoDeficiency Resource (IDR)
Immunome Knowledge Base (IKB)

Bioinformatics services:
B-Cell Proteome
Bioinformatics benchmarks
PID classification

Standards and guidelines:
HVP Guidelines
Guidelines for prediction tools
Curating gene variant databases
HVP Country Nodes
Recommendations for LSDBs


Lund University
Medical Faculty
Department of Experimental Medical Science

Lund University, Sweden 2017 ©