A Similarity Measure for Retrieving Software Artifacts

2 Overview of the reuse system

Figure 1 shows an overview of the current version of the reuse system. The system consists of a classification mechanism and a retrieval mechanism.

The classification system catalogues the software components in a software base through their natural-language descriptions. An acquisition mechanism automatically extracts from these descriptions the knowledge needed to catalogue the components in the software base. It extracts lexical, syntactic and semantic information and uses this knowledge to create a frame-like internal representation for each software component.

The interpretation mechanism used to analyse a description does not attempt to understand its meaning; it aims only to acquire automatically enough information to construct useful indexing units for software components. Semantic analysis of descriptions follows the rules of a semantic formalism. The formalism consists of a case system, constraints and heuristics that translate the description into an internal representation. Both syntactic and semantic rules are implemented in a grammar that parses descriptions into a set of frames. The semantic formalism is based on a set of semantic relationships between the noun phrases and the verb of a sentence. These relationships ensure that similar software descriptions have similar internal representations. A classification scheme for software components derives from the semantic formalism, through a set of generic frames. The internal representation of a description, constructed as an instance of these generic frames, constitutes the indexing unit for the software component.
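
As an illustration only, the idea of mapping an imperative description to a frame can be sketched in a few lines of Python. This is a deliberately naive sketch, not the grammar used by the system: the Frame class, the parse_imperative function and the preposition-to-case table are hypothetical names introduced here, and the case labels are assumptions rather than the cases of the actual formalism.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from prepositions to semantic cases. The case
# labels are illustrative assumptions, not the paper's case system.
PREPOSITION_CASES = {
    "in": "location",
    "from": "source",
    "to": "destination",
    "with": "instrument",
}

ARTICLES = {"a", "an", "the"}


@dataclass
class Frame:
    """Frame-like representation: the verb plus terms grouped by case."""
    action: str
    cases: dict = field(default_factory=dict)


def parse_imperative(description: str) -> Frame:
    """Naively map an imperative sentence to a frame.

    The first word is taken as the verb; words after it fill the
    'object' case until a known preposition switches the current case.
    """
    words = description.lower().rstrip(".").split()
    if not words:
        raise ValueError("empty description")
    frame = Frame(action=words[0])
    current_case = "object"
    for word in words[1:]:
        if word in PREPOSITION_CASES:
            current_case = PREPOSITION_CASES[word]
        elif word not in ARTICLES:
            frame.cases.setdefault(current_case, []).append(word)
    return frame
```

Under these assumptions, "Sort the lines in a file." and "Sort lines in the file" produce the same frame, which is the property the semantic relationships are meant to provide for similar descriptions.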

The WordNet [8] lexicon is used to obtain morphological information, grammatical categories of terms and lexical relationships between terms.

The Knowledge base is a base of frames in which each software component has a set of associated frames containing the internal representation of its description, along with other information associated with the component (source code, executable examples, reuse attributes, etc.).
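
One possible way to picture such an entry is the following minimal Python sketch. The ComponentEntry class, its field names and the register helper are hypothetical and chosen only to mirror the kinds of information listed above; the paper does not specify a storage layout.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentEntry:
    """Hypothetical Knowledge base entry for one software component."""
    frames: list                      # internal representation of the description
    source_code: str = ""             # e.g. a path to the component's source
    examples: list = field(default_factory=list)          # executable examples
    reuse_attributes: dict = field(default_factory=dict)  # free-form attributes


def register(knowledge_base: dict, name: str, **info) -> ComponentEntry:
    """Store an entry under the component name and return it."""
    entry = ComponentEntry(**info)
    knowledge_base[name] = entry
    return entry
```

For example, register(kb, "sorter", frames=[{"action": "sort", "object": "file"}]) stores one component whose description has been reduced to a single frame.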

An analysis mechanism similar to the one applied to software descriptions maps a free-text query to an internal representation. The retrieval system uses the set of frames generated for the query to identify similar ones in the Knowledge base.

The retrieval system searches the Knowledge base and selects components based on the closeness between the frames associated with the query and those associated with the software descriptions. Closeness measures are derived from the semantic formalism and from a conceptual distance measure between the terms in the compared frames. Software components are scored according to their closeness to the user query. Those with a score higher than a controlled threshold become the candidates for retrieval.
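
The scoring and thresholding step can be sketched as follows. This is a toy illustration under stated assumptions, not the closeness measure defined in the paper: frames are reduced to flat dictionaries from case name to a single term, the conceptual distances come from a small hand-written table (TOY_DISTANCE) rather than from a lexicon, and frame closeness is taken to be a plain average over the cases of the query. All names are hypothetical.

```python
# Toy conceptual distances between pairs of terms (smaller = closer).
# A real system would derive these from a lexicon; these values are
# invented for illustration.
TOY_DISTANCE = {
    frozenset(("sort", "order")): 1,
    frozenset(("file", "document")): 2,
}
MAX_DISTANCE = 4  # distance assumed for unrelated terms


def term_closeness(a: str, b: str) -> float:
    """Map a conceptual distance to a closeness value in [0, 1]."""
    if a == b:
        return 1.0
    distance = TOY_DISTANCE.get(frozenset((a, b)), MAX_DISTANCE)
    return 1.0 - distance / MAX_DISTANCE


def frame_closeness(query: dict, candidate: dict) -> float:
    """Average term closeness over the query's cases.

    Cases missing from the candidate contribute zero.
    """
    if not query:
        return 0.0
    shared = set(query) & set(candidate)
    total = sum(term_closeness(query[c], candidate[c]) for c in shared)
    return total / len(query)


def retrieve(query: dict, knowledge_base: dict, threshold: float) -> list:
    """Return (score, name) pairs scoring above the threshold, best first."""
    scored = [
        (frame_closeness(query, frame), name)
        for name, frame in knowledge_base.items()
    ]
    return sorted((pair for pair in scored if pair[0] > threshold), reverse=True)
```

With the query {"action": "sort", "object": "file"}, a component described as {"action": "order", "object": "file"} scores 0.875 and passes a threshold of 0.5, while one described as {"action": "send", "object": "message"} scores 0.0 and is discarded.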

As a first step, the system deals with imperative sentences for both queries and software component descriptions.

3 The semantic formalism

This is a section of a local copy of the paper A Similarity Measure for Retrieving Software Artifacts by M. R. Girardi and B. Ibrahim.
