Project funded by Smith-Kline Beecham

Using Machine Learning to Discover Diagnostic Sequence Motifs

Duration of the Project: February 1997 - March 1999

People

Results

Neuropeptide Precursors

This part of the project investigated whether Chomsky-like grammar representations are useful for learning accurate, comprehensible predictors of membership in biological sequence families. The positive-only learning framework of the Inductive Logic Programming (ILP) system CProgol was used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). This was the first real-world biological application of Progol's positive-only learning framework and the first attempt to acquire a grammar for a biological domain using ILP.

Performance was measured using a new cost function, Relative Advantage (RA). The RA results show that searching for NPPs by using our best NPP predictor as a filter is more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity.
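
The exact definition of RA is not reproduced on this page, so the following is only a minimal sketch of the kind of comparison described above. It assumes that proteins are drawn independently, so the expected number of tests before the first NPP is found is the reciprocal of the hit rate. The function names and the two hit rates are hypothetical and are not taken from the project.

```python
def expected_tests(hit_rate: float) -> float:
    """Expected number of independent draws until the first positive.

    Assumes each draw is positive with probability hit_rate
    (the mean of a geometric distribution).
    """
    if not 0.0 < hit_rate <= 1.0:
        raise ValueError("hit_rate must be in (0, 1]")
    return 1.0 / hit_rate


def efficiency_ratio(p_random: float, p_filtered: float) -> float:
    """How many times fewer tests are expected when drawing from the
    pool that passed the predictor than from the whole population."""
    return expected_tests(p_random) / expected_tests(p_filtered)


# Hypothetical rates, chosen only for illustration:
# 1 in 1000 proteins overall is an NPP, 1 in 5 of the filtered ones is.
ratio = efficiency_ratio(p_random=0.001, p_filtered=0.2)
print(round(ratio))
```

Under these assumptions the ratio reduces to the filtered hit rate divided by the random hit rate, which is why a filter can pay off even when the true proportion of positives is very small.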

RA may be used to measure the performance of a recognition model in any domain where 1) the proportion of positives in the set of examples is very small; 2) there is no benchmark recognition method; and 3) there is no guarantee that all positives can be identified as such. In such domains, the proportion of positive examples in the population is not known and a set of negatives cannot be identified with complete confidence.

A general method was developed for assessing the significance of the difference between RA values obtained in comparative trials. RA is estimated by summing the performance estimates for the individual test-set instances. The method relies on a) identically distributed random variables representing the outcome for each instance; b) a sample mean which approaches the population mean in the limit; and c) a relatively small sample variance.
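
The project's own significance method is not given on this page. As a generic illustration of how a sum of per-instance outcomes can be compared between two predictors, the sketch below uses a paired normal approximation. It assumes both predictors were scored on the same test instances, that each instance yields a numeric outcome, and that the sample is large enough for the approximation to be reasonable. The function name and the outcome lists are hypothetical.

```python
import math
from statistics import variance


def paired_z_for_sum_difference(outcomes_a, outcomes_b):
    """z statistic for the difference between two summed scores.

    outcomes_a[i] and outcomes_b[i] are the per-instance outcomes of
    two predictors on the same test instance i (an assumption).
    """
    if len(outcomes_a) != len(outcomes_b):
        raise ValueError("both predictors need the same test instances")
    diffs = [a - b for a, b in zip(outcomes_a, outcomes_b)]
    n = len(diffs)
    # Standard error of a sum of n i.i.d. terms is sqrt(n * variance).
    std_error = math.sqrt(n * variance(diffs))
    if std_error == 0.0:
        raise ValueError("differences have zero variance")
    return sum(diffs) / std_error


# Hypothetical 0/1 outcomes for eight shared test instances.
model_a = [1, 0, 1, 1, 0, 1, 1, 1]
model_b = [0, 0, 1, 0, 0, 1, 0, 1]
z = paired_z_for_sum_difference(model_a, model_b)
print(f"z = {z:.2f}")
```

A large absolute z suggests the difference between the two sums is unlikely to be due to sampling alone; with only a handful of instances, as in this toy input, the normal approximation is rough.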

Protein Sub-Families

The aim of this part of the project was to automatically generate motifs that accurately discriminate between protein sub-families. (In many cases, knowing which sub-family a new protein belongs to is the key factor in predicting its likely biological function.)

Progol was used to generate rules which classify the cyclase proteins in a data-set supplied by Smith-Kline Beecham as belonging to either the guanylyl cyclase or the adenylyl cyclase sub-family, with an estimated accuracy of 98.6% +/- 1.4%. Note that a rule which simply stated that every protein is a guanylyl cyclase would have an accuracy of only 42.0% on this data-set.
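
To make the baseline comparison concrete, the sketch below computes the accuracy of a rule that assigns one class to every example, together with a binomial standard error, which is one common way to attach an uncertainty to an accuracy estimate. The page does not say how the +/- 1.4% figure was derived, and the label counts used here are hypothetical, chosen only so that 42% of the examples are guanylyl cyclases.

```python
import math
from collections import Counter


def constant_rule_accuracy(labels, predicted_class):
    """Accuracy of a rule that assigns predicted_class to every example."""
    return Counter(labels)[predicted_class] / len(labels)


def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimated from n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Hypothetical label counts, not the project's data-set.
labels = ["guanylyl"] * 42 + ["adenylyl"] * 58
baseline = constant_rule_accuracy(labels, "guanylyl")
print(f"{baseline:.1%}")
```

Any learned rule set is only interesting if it clearly beats this kind of constant rule, which is the point of quoting the 42.0% figure alongside the learned accuracy.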

Datasets

Neuropeptide Precursors (NPPs) Data-set


This page is maintained by Chris Bryant.
Last updated on 23 April 2007.