Systematic classification of PIDs, Methods

Methods

The systematic classification method applies computational tools for clustering and network analysis. Altogether, six methods were used to group primary immunodeficiencies (PIDs) based on the characteristics they share: three clustering methods and three network community analysis methods.

Data on PIDs were collected from the ImmunoDeficiency Resource (IDR), IDdiagnostics, IDbases, and the literature. For each disease, all signs, symptoms and laboratory values mentioned in the literature were collected. In total, 87 informative parameters were used in the analysis, each with equal weight.
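As an illustration of what equal weighting means in practice, the sketch below encodes each disease as a vector over a fixed list of parameters and compares two diseases by the fraction of parameters on which they differ. The binary presence/absence coding, the distance measure, and the disease and parameter names are all assumptions made for this example; the source does not state how the parameters were encoded or compared.

```python
# Hypothetical toy data: each disease maps to the set of parameters recorded for it.
# Disease and parameter names are illustrative only, not taken from the study.
features = {
    "disease_A": {"low_IgG", "recurrent_infections", "low_B_cells"},
    "disease_B": {"low_IgG", "recurrent_infections"},
    "disease_C": {"eczema", "thrombocytopenia"},
}

# Fix one parameter order so every disease becomes a vector of the same length.
parameters = sorted(set().union(*features.values()))


def to_vector(present):
    """Encode a set of recorded parameters as a 0/1 vector (assumed coding)."""
    return [1 if p in present else 0 for p in parameters]


vectors = {name: to_vector(present) for name, present in features.items()}


def distance(u, v):
    """Fraction of parameters on which two diseases differ.

    Every parameter contributes the same amount, i.e. equal weight.
    """
    return sum(a != b for a, b in zip(u, v)) / len(u)


for first, second in [("disease_A", "disease_B"), ("disease_A", "disease_C")]:
    print(first, second, distance(vectors[first], vectors[second]))
```

With unequal weights, each term in the sum would be multiplied by a per-parameter factor; here that factor is implicitly 1 for all parameters.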

Cluster and network analyses were performed in the R statistical environment. Three partitioning methods related to K-means clustering were used to analyze the dataset. Partitioning Around Medoids (pam) partitions the data into k clusters around medoids, which are representative objects of the dataset; each remaining point is assigned to the cluster of its nearest medoid. The Clustering Large Applications (clara) method extends this approach to larger datasets by applying it to samples of the data, and returns a clustering of the data into k clusters. The Fuzzy Analysis Clustering (fanny) method computes a fuzzy clustering of the data into k clusters, in which each object receives a degree of membership in every cluster rather than a single assignment.
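The idea of clustering around medoids can be sketched in a few lines. The code below is a simplified swap-based k-medoids search written in Python for illustration; it is not the R pam implementation used in the study, it starts from a naive choice of medoids rather than a dedicated initialisation step, and the toy points and Manhattan distance are assumptions.

```python
def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k, dist):
    """Simplified swap-based k-medoids search (illustrative only)."""
    medoids = list(range(k))  # naive initialisation: the first k points
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                # Try replacing medoid i with the candidate point.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best:  # keep the swap only if it lowers the total cost
                    medoids, best, improved = trial, cost, True
    # Assign each point to the cluster of its nearest medoid.
    labels = [
        min(range(k), key=lambda j: dist(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = k_medoids(points, 2, manhattan)
print(medoids, labels)
```

Unlike K-means centroids, the medoids are always actual objects from the dataset, so the method needs only pairwise distances and not coordinate averages.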

Three methods were applied to find highly interconnected parts of the network. The first, walktrap community analysis, detects community structure via short random walks: it searches for densely connected subgraphs, i.e. communities, on the basis that short random walks moving from one node to a connected one tend to stay within the same community. The second method detects community structure from the leading eigenvector of the modularity matrix: it looks for densely connected subgraphs by calculating the leading non-negative eigenvector of the modularity matrix of the graph. The third method finds communities in graphs via a spin-glass model and simulated annealing.
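Modularity, the quantity behind the leading eigenvector method, scores how much denser the connections inside the communities are than would be expected by chance. The sketch below computes it for a given partition of a small undirected graph. It is a Python illustration with a hypothetical toy graph, not the R code used in the study, and it only evaluates a partition; it does not search for one.

```python
def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  e_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, e_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    degree = {}
    internal = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if membership[u] == membership[v]:
            c = membership[u]
            internal[c] = internal.get(c, 0) + 1
    degree_sum = {}
    for node, d in degree.items():
        c = membership[node]
        degree_sum[c] = degree_sum.get(c, 0) + d
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical toy graph: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

print(modularity(edges, two_groups))  # positive: the split follows the triangles
print(modularity(edges, one_group))   # zero: a single community carries no structure
```

A partition that follows the densely connected parts of the graph scores higher than one that ignores them, which is what makes the measure usable as an objective for community detection.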

To obtain a reliable and robust view of the disease grouping patterns, a consensus classification was generated based on the co-occurrence of diseases in the same group in four, five or six of the methods.
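The co-occurrence counting behind such a consensus can be sketched as follows: for every pair of diseases, count the number of methods that place them in the same group, and keep the pairs that reach the threshold of four. The disease names and group labels below are hypothetical, and the source does not describe how the supported pairs were assembled into final groups, so this Python sketch stops at the pair level.

```python
from itertools import combinations


def co_occurrence(assignments):
    """Count, for each pair of diseases, how many methods group them together.

    `assignments` holds one dict per method, mapping disease -> group label.
    Labels are only compared within a method, never across methods.
    """
    counts = {}
    for labels in assignments:
        for a, b in combinations(sorted(labels), 2):
            if labels[a] == labels[b]:
                counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


def consensus_pairs(assignments, min_methods=4):
    """Pairs of diseases grouped together by at least `min_methods` methods."""
    counts = co_occurrence(assignments)
    return {pair for pair, n in counts.items() if n >= min_methods}


# Hypothetical output of six methods for four diseases A-D.
assignments = [
    {"A": 0, "B": 0, "C": 1, "D": 1},
    {"A": 0, "B": 0, "C": 1, "D": 1},
    {"A": 1, "B": 1, "C": 0, "D": 0},
    {"A": 0, "B": 0, "C": 0, "D": 1},
    {"A": 0, "B": 1, "C": 1, "D": 1},
    {"A": 0, "B": 0, "C": 1, "D": 2},
]

print(co_occurrence(assignments))
print(consensus_pairs(assignments))
```

Because group labels are arbitrary within each method, comparing pairs rather than labels avoids having to match cluster 1 of one method to a cluster of another.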