Grant GR/S68682
Efficient Biological Grammar
Acquisition
Duration of the Project: 15 March 2004 - 14 December 2006.
People
- Principal Investigator: Dr Chris Bryant
- Research Assistant: Dr Rachel Cavill
- Research Fellow: Dr Daniel Fredouille
-
Collaborators:
Background
All living things contain proteins and every protein is made from a
chain. But the links in these chains are not made from metal, they
are made from molecules, and not all the links are the same. In fact,
there are 20 different types, each of which can be represented by a
letter. This means that we can represent a protein by a sequence of
letters, analogous to a sentence in English. In much the same way
that we have a grammar for sentences of English, we can have a grammar
for sets of protein sequences.
Biologists often need to search for proteins that serve a particular
function. One approach to doing this is to search for protein
sequences that can be successfully parsed by a grammar describing the
biological function of interest. This approach has the advantage that
a successful parse can indicate why a protein has a particular
function.
Linguistic methods have provided some interesting results in
biology. However hand-crafting grammars is difficult and, because it
requires human expertise, expensive. Thus, given the enormous volume
of data arising from genome projects, there is a need to automate the
acquisition of grammars from sets of biological sequences. This is
the task addressed by this project.
Results
The computational methods used on the project to acquire grammars must
be provided with knowledge of grammatical constructs. Many
grammatical constructs are already known for most natural languages.
In contrast, very few are available for many biological sequences.
One of the outcomes of this project are methods for obtaining
different types of grammatical constructs for biological sequences. A
statistical analysis was performed which revealed which of these types
significantly improve the accuracy of the grammars.
All the types of grammatical construct tested in this project are
either generic, in the sense that identical ones could be applied to
any similar biological problem, or they are automatically generated
ones which could be generated from any set of protein sequences. Hence
the results should be of general interest to both biologists who want
to hand-craft grammars and researchers concerned with developing
computer programs that can learn to classify proteins.
Prior to this project, a computational method of learning biological
grammars had been developed by Muggleton et al. (2001). However
progress had been hindered by the inefficiency of the learning method.
One of the outcomes of this project is a set of methods of acquiring
such grammars which are more efficient.
- Parsing - 2005
Both a theoretical and empirical comparison was made of the use of
depth-first top-down parsers and Earley parsers for biological
context-free grammars. The running times of these parsers was found
to be very sensitive to a particular feature of biological grammars:
the notion of a gap. A gap is an element of a grammar designed to
match any subsequence of the parsed string (possibly with some length
constraints). A method was devised for optimising the management of
gaps by these parsers. It was shown that if this optimisation method
is not used, then the way the grammar encodes gaps has to be carefully
chosen to prevent increases in running times.
The speed at which ILP systems can generate biological grammars has
previously been a bottleneck. When learning a biological grammar,
training sequences must be parsed repeatedly during the search and
thus the ILP system speed is dependent upon the efficiency of the
parser. We have shown empirically that, by making our parsers
available to ILP systems written in Prolog, the overall induction time
is reduced by, on average, 60% of the time that ILP systems would
normally spend parsing.
- Refinement Operator - 2006
Most ILP systems search the hypothesis space in a top-down manner,
from general to specific hypotheses, using a refinement operator. We
have developed a novel refinement operator implementation, which is
specifically designed for inferring biological grammars with ILP
techniques. This implementation is shown to significantly speed-up
inference times compared to the use of the classical refinement
operator: time gains larger than 5-fold were observed in 80% of the
experiments, and the maximum observed gain was over 300-fold.
Datasets and Software
-
Materials
associated with the paper: "Speeding up Parsing of Biological
Context-Free Grammars" which was published in the Sixteenth Annual
Symposium on Combinatorial Pattern Matching, 2005.
-
Materials
associated with the paper: "A Parser for the Efficient Induction of
Biological Grammars" which was published in the late breaking papers
track of the 15th International Conference on Inductive Logic
Programming, 2005.
-
Materials associated with the paper: "Pertinent Background
Knowledge for Learning Protein Grammars" which was published in the
proceedings of 17th European Conference on Machine Learning, 2006.
-
Materials associated with the paper: "An ILP Refinement Operator
for Biological Grammar Learning" which has been accepted for
publication in the proceedings of the 16th International Conference on
Inductive Logic Programming.
This page is maintained by Chris Bryant.
Last updated on 11 June 2007.