Knowledge Discovery from Text (KDT)
The aim is to discover knowledge for classification, clustering, retrieval
and reuse of textual data.
KDT is an increasingly important area of research driven by the dramatic
growth in access to very large text stores such as the WWW, email and
domain-specific reports.
At the heart of KDT is the need for tools that discover and exploit
recurring patterns in text. Typically, advances in Machine Learning
(ML) and statistics are exploited for pattern discovery. Knowledge acquisition
and representation techniques from Case-Based Reasoning (CBR) further
help transform discovered patterns into new knowledge structures. These
structures are more transparent and manageable. Importantly, they are able
to provide higher-level abstractions that facilitate the solution of text-related
tasks.
Projects
Textual Feature Extraction from Reports
Nirmalie Wiratunga,
Stewart Massie and
Susan Craw - nw@comp.rgu.ac.uk
Domain-specific reports are a good source of knowledge.
However, reasoning is facilitated once structured features are extracted.
In a joint project with the European Space Agency (ESA) in Darmstadt,
Germany, CBR is being applied to support ESA’s Anomaly Report
Processing task. The first phase involves mapping textual reports to
structured cases.
We have developed a prototype, CAM, which implements a novel
unsupervised feature extraction technique to derive a structured case
representation from text data. Word co-occurrence patterns are analysed
to calculate word similarity and this similarity knowledge aids search
for representative but diverse seed words. Sparse representations
are avoided by learning to generalise seed words with feature extraction
rules.
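The idea of using word co-occurrence to find representative but diverse seed words can be illustrated with a minimal sketch. This is not the CAM implementation: the document-level co-occurrence counts, the cosine measure, the greedy selection and the `max_sim` threshold are all assumptions made for illustration, and the rule-learning step that generalises seed words is not shown.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(docs):
    """Map each word to a Counter of the words it shares a document with."""
    vectors = {}
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            vectors.setdefault(a, Counter())[b] += 1
            vectors.setdefault(b, Counter())[a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse co-occurrence vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def select_seeds(vectors, k, max_sim=0.5):
    """Greedily pick up to k seed words (hypothetical selection strategy).

    Words with the most co-occurrences are tried first (representative);
    a word is skipped if it is too similar to a chosen seed (diverse).
    """
    ranked = sorted(vectors, key=lambda w: (-sum(vectors[w].values()), w))
    seeds = []
    for word in ranked:
        if all(cosine(vectors[word], vectors[s]) <= max_sim for s in seeds):
            seeds.append(word)
        if len(seeds) == k:
            break
    return seeds


# Toy reports, invented for this example.
reports = [
    "battery voltage drop",
    "battery voltage low",
    "antenna signal lost",
    "antenna signal weak",
]
vectors = cooccurrence_vectors(reports)
print(select_seeds(vectors, k=2))  # ['antenna', 'battery']
```

In this toy data "drop" and "low" appear in identical contexts, so their similarity is close to 1 and at most one of them could become a seed.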
We have also developed concept extraction tools that are applicable
in domains where reports are scarce, such as the SmartHouse domain.
Personal Email Trainer (PET)
Nirmalie Wiratunga
and Amandine Orecchioni – pet@comp.rgu.ac.uk
PET is partly funded by RGU’s commercialisation
fund. It is an email organisation plug-in for MS-Outlook. It applies
ML and CBR indexing techniques to learn a user’s email organisation
preferences. PET provides the following functionality: train on
previously organised emails; automatically move incoming emails to
Inbox folders; operate in conservative or liberal mode; and periodically
refine its knowledge so as to learn from mistakes.
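The functionality listed above can be sketched as a small nearest-neighbour folder suggester. This is an illustrative assumption, not PET's actual design: the class name `FolderSuggester`, the Jaccard similarity, and the reading of "conservative" and "liberal" as high and low confidence thresholds are all hypothetical.

```python
from collections import Counter


def tokens(text):
    """Represent an email as a set of lower-cased words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap between two token sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class FolderSuggester:
    # Hypothetical thresholds: conservative mode only files an email when
    # the evidence is strong; liberal mode files on weaker evidence.
    THRESHOLDS = {"conservative": 0.6, "liberal": 0.2}

    def __init__(self, mode="conservative", k=3):
        self.mode = mode
        self.k = k
        self.cases = []  # list of (token set, folder) pairs

    def train(self, filed_emails):
        """Store previously organised emails as cases."""
        for text, folder in filed_emails:
            self.cases.append((tokens(text), folder))

    def suggest(self, text):
        """Return a folder name, or None to leave the email in place."""
        query = tokens(text)
        scored = sorted(
            ((jaccard(query, t), folder) for t, folder in self.cases),
            key=lambda pair: pair[0],
            reverse=True,
        )
        votes = Counter()
        for sim, folder in scored[: self.k]:
            votes[folder] += sim  # similarity-weighted vote
        if not votes:
            return None
        folder, weight = votes.most_common(1)[0]
        confidence = weight / self.k
        return folder if confidence >= self.THRESHOLDS[self.mode] else None

    def learn(self, text, correct_folder):
        """Learn from a mistake by storing the user's correction as a case."""
        self.cases.append((tokens(text), correct_folder))


suggester = FolderSuggester(mode="liberal")
suggester.train([
    ("project budget meeting", "Work"),
    ("budget report project", "Work"),
    ("family holiday photos", "Personal"),
])
print(suggester.suggest("project budget update"))  # Work
```

With the same cases, switching `mode` to `"conservative"` makes `suggest` return `None` for this email, because the averaged similarity (about 0.33) falls below the stricter threshold.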
Propositional Semantic Indexing (PSI)
Nirmalie Wiratunga
and Robert Lothian
– nw@comp.rgu.ac.uk
The aim of this project is to discover knowledge in
the form of propositional clauses for textual feature selection and
extraction.
We have developed both supervised and unsupervised versions of PSI:
supervised PSI selects features with boosted stumps, and unsupervised
PSI selects features from word clusters using a footprint-based approach
from CBR. Feature extraction involves generalisation of selected features
and is achieved by learning feature extraction rules.
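Feature selection with boosted stumps can be sketched with a small AdaBoost loop over single-word decision stumps. This is a generic illustration, assuming binary word-presence features and +1/-1 class labels; it is not the PSI implementation, and the function name `boosted_stump_features` is hypothetical.

```python
from math import exp, log


def boosted_stump_features(docs, labels, rounds=3):
    """Return the words chosen by AdaBoost decision stumps, in order.

    docs   -- list of token sets
    labels -- list of +1 / -1 class labels
    """
    vocab = sorted(set().union(*docs))
    n = len(docs)
    weights = [1.0 / n] * n
    selected = []

    def predict(word, polarity, doc):
        # A stump predicts `polarity` when the word is present, else the opposite.
        return polarity if word in doc else -polarity

    for _ in range(rounds):
        best = None  # (weighted error, word, polarity)
        for word in vocab:
            for polarity in (1, -1):
                err = sum(
                    w for d, y, w in zip(docs, labels, weights)
                    if predict(word, polarity, d) != y
                )
                if best is None or err < best[0]:
                    best = (err, word, polarity)
        err, word, polarity = best
        if err >= 0.5:
            break  # no stump is better than chance on the current weights
        if word not in selected:
            selected.append(word)
        if err == 0:
            break  # a perfect stump leaves nothing to reweight
        alpha = 0.5 * log((1 - err) / err)
        # Increase the weight of misclassified documents, then renormalise.
        weights = [
            w * exp(-alpha * y * predict(word, polarity, d))
            for d, y, w in zip(docs, labels, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return selected


# Toy data, invented for this example.
docs = [
    {"engine", "failure", "thruster"},
    {"engine", "fault"},
    {"thruster", "nominal"},
    {"solar", "nominal"},
]
labels = [1, 1, -1, -1]
print(boosted_stump_features(docs, labels))  # ['engine']
```

Each boosting round contributes one word, so the number of rounds bounds the number of selected features.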
Knowledge Acquisition for Case-Retrieval Nets
Robert Lothian,
Sutanu Chakraborti,
Nirmalie
Wiratunga and Stuart Watt – sc@comp.rgu.ac.uk
This project applies statistical techniques to automate acquisition of
similarity and relevance knowledge for Case Retrieval Networks (CRNs)
for textual data. Of specific interest is Latent Semantic Indexing (LSI),
which is used to generate revised representations that allow for better
handling of polysemy and synonymy. A supervised version of LSI called
Sprinkled LSI (SLSI) has been developed. SLSI achieves performance
comparable to state-of-the-art Support Vector Machines, while preserving
the representational richness of LSI. Retrieval efficiency is improved by
restructuring the CRN in an initial pre-computation stage, which eliminates
redundant computations at run time. Experiments show that Fast CRNs (FCRNs)
scale much better than CRNs for dense representations.
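The effect of moving work into a pre-computation stage can be sketched as follows. The sketch assumes a CRN in which query terms activate similar terms through similarity arcs and then activate cases through relevance arcs, and it assumes that the restructuring folds the two steps into one precomputed term-to-case weight. The function names and toy weights are hypothetical.

```python
def crn_retrieve(query, similarity, relevance):
    """Two-step spreading activation: query terms -> similar terms -> cases."""
    activation = {}
    for term, strength in query.items():
        for other, sim in similarity.get(term, {}).items():
            activation[other] = activation.get(other, 0.0) + strength * sim
    scores = {}
    for term, strength in activation.items():
        for case, rel in relevance.get(term, {}).items():
            scores[case] = scores.get(case, 0.0) + strength * rel
    return scores


def precompute_effective_relevance(similarity, relevance):
    """Fold similarity and relevance arcs into direct term-to-case weights."""
    effective = {}
    for term, neighbours in similarity.items():
        row = effective.setdefault(term, {})
        for other, sim in neighbours.items():
            for case, rel in relevance.get(other, {}).items():
                row[case] = row.get(case, 0.0) + sim * rel
    return effective


def fcrn_retrieve(query, effective):
    """Single-step retrieval over the precomputed weights."""
    scores = {}
    for term, strength in query.items():
        for case, weight in effective.get(term, {}).items():
            scores[case] = scores.get(case, 0.0) + strength * weight
    return scores


# Toy network, invented for this example. Each term is fully similar to itself.
similarity = {
    "car": {"car": 1.0, "auto": 0.5},
    "auto": {"auto": 1.0, "car": 0.5},
    "engine": {"engine": 1.0},
}
relevance = {
    "car": {"c1": 1.0},
    "auto": {"c2": 1.0},
    "engine": {"c1": 1.0, "c2": 1.0},
}
query = {"car": 1.0, "engine": 1.0}

effective = precompute_effective_relevance(similarity, relevance)
print(crn_retrieve(query, similarity, relevance))  # {'c1': 2.0, 'c2': 1.5}
print(fcrn_retrieve(query, effective))             # {'c1': 2.0, 'c2': 1.5}
```

Both functions rank cases identically; the single-step version avoids repeating the similarity spreading for every query, which matters most when many terms are similar to one another.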
Sentiment Analysis from Textual Data
Nirmalie Wiratunga,
Rahman Mukras,
David Harper and Robert
Lothian
Sentiment Analysis of text involves the study of user opinions. Increasingly,
the Internet is used to publish user opinions (e.g. in blogs), and these
opinions contain useful information for commercial and political applications.
We are investigating two main approaches to tackling the problem of sentiment
analysis. The first adopts a linguistic approach by utilising the sentiment
properties in language structures. In contrast, the second exploits statistical
properties. We plan to extract complementary knowledge from both approaches
and develop techniques to identify and characterise sentiment at the word,
phrase and sentence level.
Intelligent Email Management
Nirmalie Wiratunga,
Amandine Orecchioni,
Susan
Craw and Stuart Watt
– ao@comp.rgu.ac.uk
Email has evolved from a simple medium of communication into a medium
for managing complex tasks. The number of emails received daily by a single
person has grown sharply in recent years. Consequently, managing
email is a demanding manual process involving multiple tasks: reading
and replying; prioritising the reading order; filtering spam and phishing
emails; avoiding viruses and worms; organising emails into a meaningful
folder structure; managing a social network; managing a diary; etc.
The aim of this project is to use Data Mining (DM) and ML techniques to
automate some of the email management tasks. In particular, we are
investigating the usefulness of a CBR approach to addressing the problem.
Intelligent Websites
Susan Craw, Ganesan
Bathumalai and Frank Hermann – Susan
Craw
This project seeks to add intelligence to a standard
website by exploiting Web Usage Mining. Web mining identifies users'
needs and information sources from both plentiful usage click-stream
data and knowledge of the website's content and structure. Mining
the usage data for a website identifies frequently followed paths
through the website.
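Identifying frequently followed paths can be sketched by counting page sequences across user sessions. This is a minimal illustration, assuming sessions are already available as ordered lists of page names; the function name `frequent_paths`, the fixed path length and the `min_support` threshold are assumptions, not details of the project.

```python
from collections import Counter


def frequent_paths(sessions, length=2, min_support=2):
    """Return (path, session count) pairs for paths seen in enough sessions.

    A path is a run of `length` consecutively visited pages. Each path is
    counted at most once per session.
    """
    counts = Counter()
    for pages in sessions:
        seen = set()
        for i in range(len(pages) - length + 1):
            seen.add(tuple(pages[i:i + length]))
        counts.update(seen)
    frequent = [(path, c) for path, c in counts.items() if c >= min_support]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


# Toy click-stream sessions, invented for this example.
sessions = [
    ["home", "research", "kdt"],
    ["home", "research", "staff"],
    ["home", "contact"],
]
print(frequent_paths(sessions))  # [(('home', 'research'), 2)]
```

A frequent path such as home to research could then suggest, for example, where a shortcut link would save users a step.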
The intelligent website can react autonomously to changes in access
patterns as well as the evolving website content and organisation.
The usage knowledge is applied to improve the web browsing experience
through the provision of a virtual website guide or by intelligent
manipulation of hyperlinks on the website.