A Similarity Measure for Retrieving Software Artifacts

1 Introduction

Reuse systems that index software components manually are difficult and expensive to set up. Automatic indexing is required to make software retrieval systems cost-effective. On the other hand, the effectiveness of traditional keyword-based retrieval systems is limited by the so-called "keyword barrier" [7], i.e. these systems are unable to improve retrieval effectiveness beyond a certain performance threshold. This situation is particularly critical in software retrieval, where users require high precision, i.e. they expect to retrieve only the best components for reuse rather than having to select the best one from a list that may contain many irrelevant components. Natural language processing techniques for acquiring lexical, syntactic and semantic information from software descriptions are potentially useful for improving retrieval effectiveness and reducing the cost of creating and maintaining software libraries.

This paper describes the retrieval mechanism of ROSA (Reuse Of Software Artifacts), a software reuse system (earlier described in [2][3][4]) based on the processing of the natural language descriptions of software artifacts. The main features of its classification mechanism [5] are also outlined.

The paper is organized as follows. Section 2 summarizes the main mechanisms in the current version of the reuse system. Section 3 outlines the semantic formalism used to identify, in a software description, the knowledge needed to catalogue a component in a software base. Section 4 introduces the mechanisms defined for the analysis of descriptions (morpholexical, syntactic and semantic analysis) and the semantic structure of the Software Base. Section 5 presents the mechanism for query processing and retrieval, together with the measures used for the similarity analysis of the indexing structures. Section 6 describes an experiment conducted to evaluate the effectiveness of the proposed approach. Section 7 summarizes related work in the area of reuse systems. Section 8 concludes the paper with some remarks on planned experiments with the system and further research.

2 Overview of the reuse system

This is a section of a local copy of the paper A Similarity Measure for Retrieving Software Artifacts by M. R. Girardi and B. Ibrahim.
