
Knowledge Discovery from Text (KDT)


The aim is to discover knowledge for classification, clustering, retrieval and reuse of textual data.
KDT is an increasingly important area of research driven by the dramatic growth in access to very large text stores such as the WWW, email and domain-specific reports.
At the heart of KDT is the need for tools that discover and exploit recurring patterns in text. Typically, advances in Machine Learning (ML) and statistics are exploited for pattern discovery. Knowledge acquisition and representation techniques from Case-Based Reasoning (CBR) further help transform discovered patterns into new knowledge structures. These structures are more transparent and manageable and, importantly, provide higher-level abstractions that facilitate the solution of text-related tasks.

Projects

Textual Feature Extraction from Reports
Nirmalie Wiratunga, Stewart Massie and Susan Craw – nw@comp.rgu.ac.uk

Domain-specific reports are a good source of knowledge. However, reasoning is easier once structured features have been extracted from them.
In a joint project with the European Space Agency (ESA) in Darmstadt, Germany, CBR is being applied to support ESA’s Anomaly Report Processing task. The first phase involves mapping textual reports to structured cases.
We have developed a prototype, CAM, which implements a novel unsupervised feature extraction technique to derive a structured case representation from text data. Word co-occurrence patterns are analysed to calculate word similarity, and this similarity knowledge aids the search for representative but diverse seed words. Sparse representations are avoided by learning to generalise seed words with feature extraction rules.
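The idea of using co-occurrence similarity to pick representative but diverse seed words can be illustrated with a minimal sketch. It is not CAM's actual algorithm: the function names, the cosine-style similarity over document occurrence sets, the document-frequency ranking and the `max_sim` cut-off are all assumptions made for illustration, and the rule-learning step is not covered.

```python
import math

def occurrence_sets(docs):
    """Map each word to the set of document indices it occurs in."""
    occ = {}
    for i, doc in enumerate(docs):
        for word in set(doc.lower().split()):
            occ.setdefault(word, set()).add(i)
    return occ

def similarity(occ, a, b):
    """Cosine similarity of two words' binary document-occurrence vectors."""
    shared = len(occ[a] & occ[b])
    return shared / math.sqrt(len(occ[a]) * len(occ[b]))

def select_seeds(occ, k, max_sim=0.5):
    """Greedy selection (hypothetical criterion, not CAM's).

    Representative: frequent words are tried first.
    Diverse: a word is skipped if it is too similar to a chosen seed.
    """
    ranked = sorted(occ, key=lambda w: (-len(occ[w]), w))
    seeds = []
    for word in ranked:
        if len(seeds) == k:
            break
        if all(similarity(occ, word, s) <= max_sim for s in seeds):
            seeds.append(word)
    return seeds

# Toy reports, invented for this example.
reports = [
    "battery voltage drop",
    "battery voltage low",
    "antenna signal lost",
    "antenna signal weak",
]
occ = occurrence_sets(reports)
seeds = select_seeds(occ, k=2)
```

On these toy reports, "antenna" and "signal" always occur together, so only one of them can become a seed; the second seed is drawn from the battery-related words instead.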
We have also developed concept extraction tools that are applicable in domains where reports are scarce, such as the SmartHouse domain.


Personal Email Trainer (PET)
Nirmalie Wiratunga and Amandine Orecchioni – pet@comp.rgu.ac.uk

PET is partly funded by RGU’s commercialisation fund. It is an email organisation plug-in for MS-Outlook that applies ML and CBR indexing techniques to learn a user’s email organisation preferences. PET provides the following functionality: it trains on previously organised emails; automatically moves incoming emails to Inbox folders; operates in either conservative or liberal mode; and periodically refines its knowledge so as to learn from its mistakes.
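One plausible reading of the conservative and liberal modes is a confidence threshold on the folder suggestion: a conservative filer moves an email only when the evidence is strong. The sketch below illustrates that reading with a similarity-weighted nearest-neighbour vote. It is an assumption, not PET's implementation: the threshold values, the Jaccard similarity and every function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical confidence thresholds; PET's real settings are not documented here.
THRESHOLDS = {"conservative": 0.9, "liberal": 0.5}

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest_folder(email, filed, k=3):
    """Return (folder, confidence) from the k most similar filed emails."""
    query = tokens(email)
    scored = sorted(
        ((jaccard(query, tokens(text)), folder) for text, folder in filed),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, folder in scored:
        votes[folder] += sim  # similarity-weighted vote
    total = sum(votes.values())
    if total == 0:
        return None, 0.0
    best = max(votes, key=votes.get)
    return best, votes[best] / total

def file_email(email, filed, mode="conservative"):
    """Return the target folder, or None to leave the email in the Inbox."""
    folder, confidence = suggest_folder(email, filed)
    return folder if confidence >= THRESHOLDS[mode] else None

# Toy training data: (email text, folder chosen by the user).
filed = [
    ("project budget meeting agenda", "Work"),
    ("budget report for project", "Work"),
    ("family dinner on sunday", "Personal"),
    ("sunday football with family", "Personal"),
]
```

With this data, `file_email("sunday project meeting", filed, "liberal")` files the message under "Work", while the conservative mode leaves it in the Inbox because the neighbours disagree.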


Propositional Semantic Indexing (PSI)
Nirmalie Wiratunga and Robert Lothian – nw@comp.rgu.ac.uk

The aim of this project is to discover knowledge in the form of propositional clauses for textual feature selection and extraction.
We have developed both supervised and unsupervised versions of PSI: supervised PSI selects features with boosted stumps, and unsupervised PSI selects features from word clusters using a footprint-based approach from CBR. Feature extraction involves generalisation of the selected features and is achieved by learning feature extraction rules.
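Feature selection with boosted stumps can be sketched as follows, assuming a plain AdaBoost loop over single-word decision stumps in which the words chosen by the stumps become the selected features. This is a generic illustration rather than PSI itself; the function name, the data layout (documents as word sets, labels as +1/-1) and the stopping rules are assumptions.

```python
import math

def boosted_stump_features(docs, labels, rounds=3):
    """Return the words chosen by AdaBoost with one-word decision stumps.

    docs: list of word sets; labels: list of +1 / -1 class labels.
    A stump with polarity p predicts p if its word is present, else -p.
    """
    vocab = sorted(set().union(*docs))
    n = len(docs)
    weights = [1.0 / n] * n
    selected = []
    for _ in range(rounds):
        best = None  # (weighted error, word, polarity)
        for word in vocab:
            for polarity in (1, -1):
                err = sum(
                    w for w, d, y in zip(weights, docs, labels)
                    if (polarity if word in d else -polarity) != y
                )
                if best is None or err < best[0]:
                    best = (err, word, polarity)
        err, word, polarity = best
        if err >= 0.5:
            break  # no stump does better than chance
        if word not in selected:
            selected.append(word)
        if err == 0:
            break  # perfect stump: nothing left to reweight
        alpha = 0.5 * math.log((1 - err) / err)
        # Increase the weight of misclassified documents, then renormalise.
        weights = [
            w * math.exp(-alpha * y * (polarity if word in d else -polarity))
            for w, d, y in zip(weights, docs, labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return selected

# Toy data: +1 = complaint, -1 = praise.
docs = [{"refund", "order"}, {"refund", "late"},
        {"thanks", "order"}, {"thanks", "great"}]
labels = [1, 1, -1, -1]
features = boosted_stump_features(docs, labels)
```

Here "refund" separates the two classes on its own, so it is the only feature selected; on noisier data, later rounds would add words that correct the mistakes of earlier stumps.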

Knowledge Acquisition for Case-Retrieval Nets
Robert Lothian, Sutanu Chakraborti, Nirmalie Wiratunga and Stuart Watt – sc@comp.rgu.ac.uk
This project applies statistical techniques to automate the acquisition of similarity and relevance knowledge for Case Retrieval Networks (CRNs) for textual data. Of specific interest is Latent Semantic Indexing (LSI), which is used to generate revised representations that allow for better handling of polysemy and synonymy. A supervised version of LSI called Sprinkled LSI (SLSI) has been developed. SLSI achieves performance comparable to state-of-the-art Support Vector Machines, while preserving the representational richness of LSI. Retrieval efficiency is improved by restructuring the CRN in an initial pre-computation stage, which eliminates redundant computations at run time. Experiments show that Fast CRNs (FCRNs) scale up much better than CRNs for dense representations.
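The effect of the pre-computation stage can be illustrated with a small sketch. It assumes a textbook two-step CRN, in which a query activates information entities (here, words), activation spreads over similarity arcs, and relevance arcs then pass it on to cases. The "fast" variant folds both kinds of arc into one effective word-to-case weight ahead of time. The data layout and function names are assumptions, and the actual FCRN restructuring may differ in detail.

```python
def crn_retrieve(query, similarity, relevance):
    """Two-step spreading activation.

    query:      {word: activation}
    similarity: {word: {similar_word: weight}}  (include the word itself with 1.0)
    relevance:  {word: {case: weight}}
    """
    activation = {}
    for word, a in query.items():
        for other, s in similarity.get(word, {}).items():
            activation[other] = activation.get(other, 0.0) + a * s
    scores = {}
    for word, a in activation.items():
        for case, r in relevance.get(word, {}).items():
            scores[case] = scores.get(case, 0.0) + a * r
    return scores

def precompute_effective_relevance(similarity, relevance):
    """Fold similarity and relevance arcs into direct word-to-case weights."""
    effective = {}
    for word, sims in similarity.items():
        row = effective.setdefault(word, {})
        for other, s in sims.items():
            for case, r in relevance.get(other, {}).items():
                row[case] = row.get(case, 0.0) + s * r
    return effective

def fast_retrieve(query, effective):
    """Single-step retrieval over the pre-computed weights."""
    scores = {}
    for word, a in query.items():
        for case, w in effective.get(word, {}).items():
            scores[case] = scores.get(case, 0.0) + a * w
    return scores

# Toy network, invented for this example.
similarity = {
    "car": {"car": 1.0, "automobile": 0.5},
    "engine": {"engine": 1.0},
}
relevance = {
    "car": {"c1": 1.0},
    "automobile": {"c2": 1.0},
    "engine": {"c1": 0.5, "c2": 0.5},
}
query = {"car": 1.0}
slow = crn_retrieve(query, similarity, relevance)
fast = fast_retrieve(query, precompute_effective_relevance(similarity, relevance))
```

Both functions rank the cases identically; the saving comes from doing the similarity propagation once, offline, instead of once per query, which matters most when the similarity arcs are dense.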

Sentiment Analysis from Textual Data

Nirmalie Wiratunga, Rahman Mukras, David Harper and Robert Lothian

Sentiment analysis of text involves the study of user opinions. Increasingly, the Internet is used to publish user opinions (e.g. in blog spaces), and it contains useful information for commercial and political applications.

We are investigating two main approaches to the problem of sentiment analysis. The first adopts a linguistic approach by utilising the sentiment properties of language structures. In contrast, the second exploits statistical properties. We plan to extract complementary knowledge from both approaches and develop techniques to identify and characterise sentiment at the word, phrase and sentence level.


Intelligent Email Management
Nirmalie Wiratunga, Amandine Orecchioni, Susan Craw and Stuart Watt – ao@comp.rgu.ac.uk

Email has evolved from a simple medium of communication into a medium for managing complex tasks. The number of emails received daily by a single person has increased exponentially in recent years. Consequently, managing email is a demanding manual process that involves multiple tasks: reading and replying; prioritising the reading order; filtering spam and phishing emails; avoiding viruses and worms; organising emails into a meaningful folder structure; managing a social network; managing a diary; etc.

The aim of this project is to use Data Mining (DM) and ML techniques to automate some of these email management tasks. In particular, we are investigating the usefulness of a CBR approach to the problem.

Intelligent Websites
Susan Craw, Ganesan Bathumalai and Frank Hermann – Susan Craw

This project seeks to add intelligence to a standard website by exploiting Web Usage Mining. Web mining identifies users' needs and information sources from both plentiful usage click-stream data and knowledge of the website's content and structure. Mining the usage data for a website identifies frequently followed paths through the website.
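As a rough illustration of mining usage data for frequently followed paths, the sketch below counts contiguous page sequences across user sessions and keeps those seen in enough sessions. It is a minimal example under stated assumptions: sessions are assumed to be already extracted from the click-stream as ordered lists of page names, and the function name, `length` and `min_support` parameters are hypothetical.

```python
from collections import Counter

def frequent_paths(sessions, length=2, min_support=2):
    """Return {path: number of sessions containing it} for frequent paths.

    A path is a tuple of `length` consecutively visited pages.
    """
    counts = Counter()
    for session in sessions:
        paths_in_session = {
            tuple(session[i:i + length])
            for i in range(len(session) - length + 1)
        }
        counts.update(paths_in_session)  # count each path once per session
    return {path: c for path, c in counts.items() if c >= min_support}

# Toy sessions, invented for this example.
sessions = [
    ["home", "courses", "apply"],
    ["home", "courses", "staff"],
    ["home", "research"],
]
paths = frequent_paths(sessions)
```

In this toy data only the path from "home" to "courses" is followed in at least two sessions, so it is the one a site guide or link-reordering step might act on.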

The intelligent website can react autonomously to changes in access patterns as well as the evolving website content and organisation. The usage knowledge is applied to improve the web browsing experience through the provision of a virtual website guide or by intelligent manipulation of hyperlinks on the website.

 


 

© 2006 School of Computing, The Robert Gordon University