Classification and Retrieval of Software through their Description in Natural Language

© Maria del Rosario Girardi
Ph.D. Thesis No. 2782
Computer Science Department
University of Geneva, Switzerland

Abstract

This work explores information retrieval and natural language processing techniques to improve the effectiveness of software retrieval and to reduce the cost of software classification.

Its main contributions are: a mechanism for the automatic classification of software from its natural-language descriptions; a mechanism for effective retrieval of software through free-text queries; a knowledge-based internal representation of software components and their associated indexing information; and a similarity model that computes the closeness between user queries and the components of a software base in order to rank the retrieved candidates. A browsing mechanism based on the similarity model and on a clustering technique is also proposed for exploratory search.

The classification mechanism is supported by several linguistic strategies based on a software case formalism, which interprets each sentence of a software description as a set of semantic and nominal cases in a frame-based internal representation. The similarity model consists of a set of measures based on the lexical, syntactic and semantic information available in the internal representation of queries and software components, and on the conceptual distance between simple terms.
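To make the idea of a frame-based representation and a distance-based similarity measure more concrete, here is a minimal sketch in Python. It is not the model defined in the thesis: the case names (`action`, `object`), the tiny taxonomy in `PARENT`, the closeness formula 1 / (1 + distance) and all function names are assumptions chosen only for illustration.

```python
# Hypothetical mini-taxonomy (child -> parent); not taken from the thesis.
PARENT = {
    "sort": "order",
    "rank": "order",
    "order": "process",
    "search": "process",
    "file": "data",
    "list": "data",
}


def ancestors(term):
    """Return [term, parent, grandparent, ...] up to the taxonomy root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def conceptual_distance(a, b):
    """Number of taxonomy edges between a and b, or None if unrelated."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None


def term_closeness(a, b):
    """Assumed closeness: 1 for identical terms, decaying with distance."""
    d = conceptual_distance(a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)


def frame_similarity(query, component):
    """Average term closeness over the cases filled in the query frame."""
    if not query:
        return 0.0
    total = 0.0
    for case, term in query.items():
        if case in component:
            total += term_closeness(term, component[case])
    return total / len(query)


def rank(query, components):
    """Return (score, name) pairs, best match first."""
    scored = [(frame_similarity(query, frame), name)
              for name, frame in components.items()]
    return sorted(scored, reverse=True)


# Frames map a case name to the term filling it (both hypothetical).
query = {"action": "sort", "object": "file"}
components = {
    "qsort": {"action": "sort", "object": "list"},
    "grep": {"action": "search", "object": "file"},
}
ranking = rank(query, components)
```

In this sketch each frame is a plain dictionary, and a component scores higher when the terms filling its cases lie close in the taxonomy to the terms filling the same cases of the query. The measures in the thesis also draw on lexical and syntactic information, which is not modelled here.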

Four case studies were carried out on different software collections, using a prototype implemented in Prolog by BIM, to evaluate the classification strategies, the retrieval effectiveness and the usefulness of the browsing approach for identifying reuse opportunities.


Note: the text of the thesis manuscript is available as PostScript files, one file per chapter. Since many (but not all) Web browsers can automatically decompress gzip-compressed files, these PostScript files are available in both compressed and uncompressed form. Please use the compressed form to save network bandwidth if your browser can decompress it automatically. If you are not sure whether your browser can do so, try the compressed version of the cover page. If it displays correctly on your screen, you can safely retrieve the compressed files.

For those who cannot print PostScript documents in A4 format, a letter-size version is also available.


Compressed version, A4 format


Regular (uncompressed) version, A4 format


Compressed version, letter size format


Maria del Rosario Girardi
Visual Programming and Software Engineering Group
