ProCon algorithm

The method requires a reliable alignment of related sequences. Entropy and information calculations are used to quantify the type I and type II conservation according to

entropy = $-\sum_{a} P_{a_i} \ln P_{a_i}$ and

information = $\ln(N) + \sum_{a} P_{a_i} \ln P_{a_i}$,

where $P_{a_i}$ is the probability of amino acid $a_i$ at position $i$, and $N$ is the size of the amino acid alphabet (Shen and Vihinen 2004). For type II conservation, the amino acids are divided into six groups: hydrophobic (V, I, L, F, M, W, Y, C), negatively charged (D, E), positively charged (R, K, H), conformational (G, P), polar (N, Q, S), and an (A, T) group. The information for type II conservation is scaled by ln 20 / ln 6, so that it is on the same scale as the type I information. The modified mutual information (Clarke 1995) is used to determine the type III conservation.
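The type I and type II calculations can be illustrated with a minimal Python sketch. It is not the ProCon implementation: the function names are hypothetical, probabilities are estimated as plain observed frequencies in one alignment column, and the column is assumed to contain only the 20 standard amino acids (gap handling is not covered by the description above).

```python
import math
from collections import Counter

# Six physicochemical groups as listed in the text (group labels are illustrative).
GROUPS = {
    "hydrophobic": "VILFMWYC",
    "negative": "DE",
    "positive": "RKH",
    "conformational": "GP",
    "polar": "NQS",
    "AT": "AT",
}
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def entropy(symbols):
    """Shannon entropy (natural log) of the symbols observed in one column."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def type1_information(column):
    """information = ln(N) + sum(P ln P), with N = 20 amino acids."""
    return math.log(20) - entropy(column)


def type2_information(column):
    """Same formula over the six groups (N = 6), scaled by ln20/ln6."""
    grouped = [RESIDUE_TO_GROUP[aa] for aa in column]
    return (math.log(20) / math.log(6)) * (math.log(6) - entropy(grouped))
```

For a column such as "VILF", the four residues differ, so the type I information is only ln 5 (about 1.61), while all four belong to the hydrophobic group, so the scaled type II information reaches the maximum of ln 20 (about 3.00).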

The core algorithm for finding triplets is optimized and uses four nested loops, with index counters i, j, k and l (shown in the following figure):

a. i runs from 0 to the number of residues in a sequence. In the i loop, first check whether element ms[i] contains any sites.

b. If yes, enter the j loop (j runs from i+1 to the total number of residues), then enter the k loop (k runs from 0 to the number of sites in ms[i]). Check whether element ms[j] contains any sites.

c. If yes, enter the l loop (l runs from k+1 to the number of sites in ms[i]). Take element number l of ms[i] and compare it with all elements of ms[j]. This check is done with exponential search combined with binary search.

d. If a match is found, store the positions in the Triplet data structure.
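The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the ProCon source: `ms[i]` is assumed to be an ascending list of the sites paired with site i (only partners with a larger index), the names `exp_search` and `find_triplets` are hypothetical, and a plain list of tuples stands in for the Triplet data structure. The text does not say how k ties j to ms[i]; the sketch assumes j is the k-th site of ms[i], which folds the j and k loops into one.

```python
from bisect import bisect_left


def exp_search(sorted_sites, target):
    """Exponential search for a bracket, then binary search inside it."""
    if not sorted_sites:
        return False
    bound = 1
    # Double the bound until it passes the target or the end of the list.
    while bound < len(sorted_sites) and sorted_sites[bound] < target:
        bound *= 2
    lo = bound // 2
    hi = min(bound + 1, len(sorted_sites))
    pos = bisect_left(sorted_sites, target, lo, hi)
    return pos < hi and sorted_sites[pos] == target


def find_triplets(ms):
    """Return (i, j, m) where j and m are in ms[i] and m is also in ms[j]."""
    triplets = []
    for i, partners in enumerate(ms):
        if not partners:            # step a: ms[i] must contain sites
            continue
        for k, j in enumerate(partners):
            if not ms[j]:           # step b: ms[j] must contain sites
                continue
            for l in range(k + 1, len(partners)):
                m = partners[l]     # step c: look up ms[i][l] in ms[j]
                if exp_search(ms[j], m):
                    triplets.append((i, j, m))   # step d: store the match
    return triplets


# Four sites in which every pair is linked.
example = [[1, 2, 3], [2, 3], [3], []]
triplets = find_triplets(example)
```

Because each list is sorted, the exponential phase brackets the target in a number of steps logarithmic in its position, and the binary search then only inspects that bracket. For the fully linked four-site example, the sketch yields the four triplets (0, 1, 2), (0, 1, 3), (0, 2, 3) and (1, 2, 3).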

*Flow chart for conserved amino acid network identification.*

Shen, B. and M. Vihinen (2004). "Conservation and covariance in PH domain sequences: Physicochemical profile and information theoretical analysis of XLA-causing mutations in the Btk PH domain." Protein Eng Des Sel.