In protein sequence alignment, residue similarity is usually evaluated by a substitution matrix,
which scores all possible exchanges of one amino acid with another. Several matrices are
widely used in sequence alignment, including PAM matrices derived from homologous sequences
and BLOSUM matrices derived from aligned segments in the BLOCKS database. However, most
matrices have not addressed the high-order residue-residue interactions that are vital to the
biological properties of proteins. Taking into account the inherent correlation within residue triplets, we
present a new scoring scheme for sequence alignment. A protein sequence is treated as a series of
overlapping, successive 3-residue segments. The two edge residues of each triplet are each
classified as either hydrophobic or polar. The protein sequence is then rewritten as a triplet
sequence over an alphabet of 2 · 20 · 2 = 80 letters. Using a traditional approach, we construct a new
scoring scheme named TLESUMhp (TripLEt SUbstitution Matrices with hydrophobic and polar
information) for pairwise substitution of triplets, which characterizes the similarity of residue
triplets. The application of this matrix led to marked improvements in multiple sequence
alignment and in searching for structurally similar residue segments. The reason for the occurrence
of the “twilight zone,” i.e., the structure explosion of low-identity sequences, is also discussed.
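
As an illustration of the triplet rewriting described above, the following is a minimal Python sketch, not the authors' implementation. The hydrophobic/polar partition (`HYDROPHOBIC`), the function names, and the three-character letter format (such as `hKh`) are all assumptions made for this example; the abstract does not specify which residues fall in each class or how the two terminal residues are handled. Here the terminal residues serve only as edges and are not triplet centers.

```python
from typing import List

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: a hypothetical hydrophobic set, chosen only for illustration.
# The actual partition used for TLESUMhp is not given in the abstract.
HYDROPHOBIC = set("ACFGILMPVWY")


def edge_class(residue: str) -> str:
    """Reduce an edge residue to 'h' (hydrophobic) or 'p' (polar)."""
    return "h" if residue in HYDROPHOBIC else "p"


def to_triplets(seq: str) -> List[str]:
    """Rewrite a sequence as overlapping triplet letters.

    Each letter is <left edge class><central residue><right edge class>.
    A sequence of length n yields n - 2 letters (none if n < 3).
    """
    return [
        edge_class(seq[i - 1]) + seq[i] + edge_class(seq[i + 1])
        for i in range(1, len(seq) - 1)
    ]


# The full triplet alphabet: 2 edge classes * 20 residues * 2 edge classes.
TRIPLET_ALPHABET = [
    left + center + right
    for left in "hp"
    for center in AMINO_ACIDS
    for right in "hp"
]

if __name__ == "__main__":
    print(len(TRIPLET_ALPHABET))   # 80
    print(to_triplets("MKVLA"))    # ['hKh', 'pVh', 'hLh']
```

Under these assumptions, consecutive triplet letters overlap by two residues, so each central residue carries coarse information about both of its neighbors. A substitution matrix over this alphabet would have 80 × 80 entries rather than 20 × 20.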