EPSRC RGU

Grant GR/S68682

Efficient Biological Grammar Acquisition

Duration of the Project: 15 March 2004 - 14 December 2006.

People

Background

All living things contain proteins and every protein is made from a chain. But the links in these chains are not made from metal, they are made from molecules, and not all the links are the same. In fact, there are 20 different types, each of which can be represented by a letter. This means that we can represent a protein by a sequence of letters, analogous to a sentence in English. In much the same way that we have a grammar for sentences of English, we can have a grammar for sets of protein sequences.

Biologists often need to search for proteins that serve a particular function. One approach to doing this is to search for protein sequences that can be successfully parsed by a grammar describing the biological function of interest. This approach has the advantage that a successful parse can indicate why a protein has a particular function.

Linguistic methods have provided some interesting results in biology. However hand-crafting grammars is difficult and, because it requires human expertise, expensive. Thus, given the enormous volume of data arising from genome projects, there is a need to automate the acquisition of grammars from sets of biological sequences. This is the task addressed by this project.

Results

The computational methods used on the project to acquire grammars must be provided with knowledge of grammatical constructs. Many grammatical constructs are already known for most natural languages. In contrast, very few are available for many biological sequences. One of the outcomes of this project are methods for obtaining different types of grammatical constructs for biological sequences. A statistical analysis was performed which revealed which of these types significantly improve the accuracy of the grammars.

All the types of grammatical construct tested in this project are either generic, in the sense that identical ones could be applied to any similar biological problem, or they are automatically generated ones which could be generated from any set of protein sequences. Hence the results should be of general interest to both biologists who want to hand-craft grammars and researchers concerned with developing computer programs that can learn to classify proteins.

Prior to this project, a computational method of learning biological grammars had been developed by Muggleton et al. (2001). However progress had been hindered by the inefficiency of the learning method. One of the outcomes of this project is a set of methods of acquiring such grammars which are more efficient.

Datasets and Software


Search www.comp.rgu.ac.uk for:

This page is maintained by Chris Bryant.
Last updated on 11 June 2007.