Using Machine Learning to Discover Diagnostic Sequence Motifs
Duration of the Project: February 1997 - March 1999
Performance was measured using a new cost function, Relative Advantage (RA). The RA results show that using our best NPP predictor as a filter when searching for NPPs is more than 100 times as efficient as randomly selecting proteins for synthesis and testing them for biological activity.
RA may be used to measure the performance of a recognition model in any domain where 1) the proportion of positives in the set of examples is very small; 2) there is no benchmark recognition method; and 3) there is no guarantee that all positives can be identified as such. In such domains, the proportion of positive examples in the population is not known and a set of negatives cannot be identified with complete confidence.
A general method was developed for assessing the significance of the difference between RA values obtained in comparative trials. RA is estimated by summing the performance estimates for the individual test-set instances. The method relies on a) identically distributed random variables representing the outcome for each instance; b) a sample mean which approaches the population mean in the limit; and c) a relatively small sample variance.
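The idea of comparing two models through per-instance contributions can be illustrated with a short sketch. This is not the project's actual procedure: the RA formula is not given in this summary, so the per-instance scores are simply taken as inputs, and the paired z-test with a normal approximation is an assumption chosen because it rests on the same ingredients (identically distributed per-instance outcomes, a sample mean, and a sample variance). The function names paired_z and two_sided_p are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_z(scores_a, scores_b):
    """z-statistic for the mean per-instance difference between two models.

    scores_a and scores_b are hypothetical per-instance performance
    contributions for two models evaluated on the same test instances.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same instances")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two test instances")
    # Standard error of the mean difference, from the sample variance.
    se = stdev(diffs) / math.sqrt(n)
    if se == 0:
        raise ValueError("differences do not vary; z is undefined")
    return mean(diffs) / se


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation (assumed)."""
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up scores for illustration only, not project data.
model_a = [1, 1, 0, 1, 1, 0, 1, 1]
model_b = [0, 1, 0, 0, 1, 0, 0, 1]
z = paired_z(model_a, model_b)
p = two_sided_p(z)
print(f"z = {z:.3f}, two-sided p = {p:.3f}")
```

With only a handful of instances the normal approximation is rough; it becomes more reasonable as the test set grows and the sample variance of the differences stays small.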
The aim of this part of the project was to automatically generate motifs capable of accurately discriminating between protein sub-families. (There are many cases where knowing which sub-family a new protein belongs to is the key factor in predicting its likely biological function.)
Progol was used to generate rules which classify the cyclase proteins in a data-set supplied by Smith-Kline Beecham as belonging to either the guanylyl cyclase or the adenylyl cyclase sub-family, with an estimated accuracy of 98.6% +/- 1.4%. Note that a default rule which simply stated that every protein is guanylyl cyclase would have an accuracy of only 42.0% on this data-set.
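The comparison between an accuracy estimate with an error bar and a default-rule baseline can be sketched as follows. The summary does not state the size of the data-set or how the +/- figure was obtained, so the binomial standard error used here is an assumption, and the counts are hypothetical values chosen only because they are consistent with the reported percentages.

```python
import math


def accuracy_with_se(correct, total):
    """Accuracy and its binomial standard error, sqrt(p * (1 - p) / n)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    acc = correct / total
    se = math.sqrt(acc * (1 - acc) / total)
    return acc, se


def default_rule_accuracy(class_count, total):
    """Accuracy of a rule that assigns every example to a single class."""
    if total <= 0 or not 0 <= class_count <= total:
        raise ValueError("need 0 <= class_count <= total and total > 0")
    return class_count / total


# Hypothetical counts, NOT taken from the project data.
TOTAL = 69      # assumed number of cyclase proteins
GUANYLYL = 29   # assumed number labelled guanylyl cyclase
CORRECT = 68    # assumed number classified correctly by the learned rules

acc, se = accuracy_with_se(CORRECT, TOTAL)
baseline = default_rule_accuracy(GUANYLYL, TOTAL)
print(f"rules: {acc:.1%} +/- {se:.1%}; default rule: {baseline:.1%}")
```

Under these assumed counts the learned rules sit far above the default rule, which is the point the comparison in the text is making.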