
A Classification Scheme for Software Artifacts (Position Paper)

M. R. Girardi and B. Ibrahim
University of Geneva, C.U.I., Geneva, Switzerland.
(E-mail: {girardi, bertrand}@cui.unige.ch)

In Proceedings of Fifth ASIS SIG/CR (American Society for Information Science, Classification Research workshop), Oct. 16, 1994, Alexandria, Virginia, USA.


1 Introduction

Reuse systems that index software components manually are difficult and expensive to set up. Automatic indexing is required to make software retrieval systems cost-effective. On the other hand, the effectiveness of traditional keyword-based retrieval systems is limited by the so-called "keyword barrier": these systems cannot improve retrieval effectiveness beyond a certain performance threshold. This limitation is particularly critical in software retrieval, where users require high precision. Natural language processing techniques for acquiring lexical, syntactic and semantic information from software descriptions are potentially useful both to improve retrieval effectiveness and to reduce the cost of creating and maintaining software libraries.

This paper briefly describes the classification scheme of a software reuse system [1][2][3][4][5] based on processing the natural language descriptions of software components. The major requirements for the system are good retrieval effectiveness, cost-effectiveness and domain independence.

2 An overview of the reuse system

The current version of our reuse system consists of two major mechanisms: a classification mechanism and a retrieval mechanism.

The classification system [5] catalogues the software components in a software base through their descriptions in natural language. A knowledge acquisition mechanism automatically extracts from software descriptions the knowledge needed to catalogue them in the software base. The system extracts lexical, syntactic and semantic information and uses this knowledge to create a frame-like internal representation for each software component.

Semantic analysis of descriptions follows the rules of a semantic formalism. The semantic formalism is based on semantic relationships between noun phrases and the verb in a sentence. These semantic relationships ensure that similar software descriptions produce similar internal representations.

A classification scheme for software components derives from the semantic formalism, through a set of generic frames. The internal representation of a description constitutes the indexing unit for the software component, constructed as an instance of these generic frames.

Public domain lexicons (Webster and WordNet) are used to obtain the lexical and semantic information needed during the parsing process.

The knowledge base is a collection of frames in which each software component has a set of associated frames containing the internal representation of its description, along with other information associated with the component, such as source code, executable examples and reuse attributes.

The same analysis mechanism applied to software descriptions is used to map a free-text query into a frame-like internal representation. The retrieval system uses the set of frames associated with the query to identify similar ones in the knowledge base.

The retrieval system [4] looks for and selects components from the software base, based on the closeness between the frames associated with the query and those associated with software descriptions. Closeness measures are derived from the semantic formalism and from the conceptual distance between terms in the query and in software descriptions. Software components are scored according to their closeness to the user query. Those with a score higher than a given threshold become the reuse candidates.
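The scoring and threshold step can be pictured with a minimal Python sketch. It is not the closeness measure of [4], which is not reproduced in this paper: the stand-in measure below simply counts the query slots whose terms match exactly and ignores conceptual distance. The function names, the reduction of frames to plain dictionaries, and the component names are all assumptions made for illustration.

```python
def closeness(query_frame, description_frame):
    """Stand-in measure: fraction of query slots matched exactly by the description."""
    if not query_frame:
        return 0.0
    matched = sum(
        1
        for case, term in query_frame.items()
        if description_frame.get(case) == term
    )
    return matched / len(query_frame)


def reuse_candidates(query_frame, software_base, threshold):
    """Score every component and keep those scoring strictly above the threshold."""
    scored = [
        (closeness(query_frame, frame), name)
        for name, frame in software_base.items()
    ]
    # Best candidates first.
    return sorted((pair for pair in scored if pair[0] > threshold), reverse=True)


# Hypothetical software base: component names and frames are invented.
software_base = {
    "component_a": {"Action": "search", "Location": "file", "Goal": "string"},
    "component_b": {"Action": "count", "Location": "file"},
}
query = {"Action": "search", "Location": "file", "Goal": "string"}

print(reuse_candidates(query, software_base, threshold=0.5))
```

With a threshold of 0.5, only `component_a` (all three slots matched) is retained; `component_b` matches one slot out of three and is discarded.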

As a first step, the system handles only imperative sentences, for both queries and software component descriptions. Imperative sentences describe simple actions performed by a software component and, optionally, the object manipulated by the action, the manner in which the action is performed and other semantic information related to the action.

3 The classification formalism

A semantic formalism establishes the rules to generate the internal representation of both queries and natural language descriptions of software components. The formalism consists of a case system for simple imperative sentences with some constraints and heuristics that are used to map a description into a frame-like internal representation.

The case system basically consists of a sequence of one or more semantic cases. Semantic cases are associated with some syntactic components of an imperative sentence. An imperative sentence consists of a verb (representing an action) possibly followed by a noun phrase (representing the direct object of the action) and perhaps some embedded prepositional phrases. For instance, the sentence `search a file for a string' consists of the verb `search', in the infinitive form, followed by the noun phrase `a file', which represents the object manipulated by the action, and then by the prepositional phrase `for a string', which represents the goal of the `search' action. In the example, the semantic cases `Action', `Location' and `Goal' are respectively associated with the verb, the direct object and the prepositional phrase of the sentence.

Semantic cases show how noun phrases are semantically related to the verb in a sentence. For instance, in the sentence `search a file for a string', the semantic case `Goal' associated with the prepositional phrase `for a string' shows the target of the action `search'. We have defined a basic set of semantic cases for software descriptions by analyzing the short descriptions of Unix commands in manual pages. These semantic cases basically describe the functionality of the component (the action, the target of the action, the medium or location, the mode by which the action is performed, etc.).

A semantic case consists of a case generator (possibly omitted) followed by a nominal or verbal phrase. A case generator reveals the presence of a particular semantic case in a sentence. Case generators are mainly prepositions. For instance, in the sentence `search a file for a string', the preposition `for' in the prepositional phrase `for a string' suggests the `Goal' semantic case.
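The way case generators drive the assignment of semantic cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `parse_imperative` and the table `CASE_GENERATORS` are hypothetical, the only generator taken from the text is `for` suggesting `Goal`, and the direct object is always labelled `Location` as in the example sentence, whereas the actual system relies on a grammar with constraints and heuristics.

```python
# Hypothetical case-generator table: only `for` -> `Goal` comes from the text.
CASE_GENERATORS = {"for": "Goal"}
ARTICLES = {"a", "an", "the"}


def parse_imperative(sentence):
    """Map a simple imperative sentence to a list of (case, phrase) pairs."""
    words = sentence.lower().split()
    if not words:
        return []
    # The leading verb carries the `Action` case.
    cases = [("Action", words[0])]
    # Assumption: the direct object is labelled `Location`, as in the example.
    current_case, phrase = "Location", []
    for word in words[1:]:
        if word in CASE_GENERATORS:
            # A case generator closes the current phrase and opens a new case.
            if phrase:
                cases.append((current_case, " ".join(phrase)))
            current_case, phrase = CASE_GENERATORS[word], []
        elif word not in ARTICLES:
            phrase.append(word)
    if phrase:
        cases.append((current_case, " ".join(phrase)))
    return cases


print(parse_imperative("search a file for a string"))
# [('Action', 'search'), ('Location', 'file'), ('Goal', 'string')]
```

Prepositions missing from the table are treated as ordinary words here, so a realistic version would need a fuller table and a way to handle a generator that can signal more than one case.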

4 The classification process

Morpholexical, syntactic and semantic analysis of software descriptions is performed to map a description to a frame-like internal representation.

The purpose of morpholexical analysis is to process the individual words in a sentence to recognize their standard forms, their grammatical categories and their semantic relationships with other words in a lexicon. Two semantic relations between terms are currently considered: synonymy and hyponymy/hypernymy.
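As a rough illustration of the information gathered at this stage, the following Python sketch looks up standard forms and tests the two semantic relations on a toy lexicon. The lexicon entries and the function names are assumptions made for the example; the system described here draws this information from the Webster and WordNet lexicons rather than from hand-written tables.

```python
# Toy lexicon for illustration only; every entry below is an assumption.
STANDARD_FORMS = {"searching": "search", "files": "file", "integers": "integer"}
SYNONYM_SETS = [{"search", "seek"}, {"directory", "folder"}]
HYPERNYM = {"integer": "number", "number": "value"}  # hyponym -> hypernym


def standard_form(word):
    """Reduce a word to its standard form (lower-cased, inflection removed)."""
    word = word.lower()
    return STANDARD_FORMS.get(word, word)


def are_synonyms(a, b):
    """True if both words share a standard form or a synonym set."""
    a, b = standard_form(a), standard_form(b)
    return a == b or any(a in s and b in s for s in SYNONYM_SETS)


def is_hyponym_of(term, ancestor):
    """True if `ancestor` is reachable from `term` by following hypernym links."""
    term, ancestor = standard_form(term), standard_form(ancestor)
    seen = set()
    while term in HYPERNYM and term not in seen:
        seen.add(term)  # guard against cycles in a malformed lexicon
        term = HYPERNYM[term]
        if term == ancestor:
            return True
    return False


print(standard_form("Files"))                # file
print(are_synonyms("searching", "seek"))     # True
print(is_hyponym_of("integers", "value"))    # True
```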

Just after morpholexical analysis, both syntactic and semantic analysis of software descriptions are performed interactively by using a definite clause grammar. The defined grammar implements a subset of the grammar rules for imperative sentences in English. The grammar supports the case system and states domain-independent knowledge of the English language through a set of syntactic and semantic rules.

A set of semantic structures is generated as a result of the parsing process, representing the internal structures of software descriptions. A language for modelling these semantic structures is shown below.

The language defines a frame-like classification scheme for software components based on the defined semantic cases. The classification scheme consists of a hierarchical structure of generic frames (`IS-A-KIND-OF' relationship). Frames that are instances of these generic frames (`IS-A' relationship) implement the indexing units of software descriptions.

The generic frames model the semantic structures associated with verb phrases and noun phrases, as well as the information associated with software components, such as name, description, source code and executable examples.

Semantic cases are represented as slots in the frames. `Facets' are associated with each slot in a frame. The `value' facet describes either the value of the case or the name of the frame where the value is instantiated; the `domain' facet gives the type of the frame that describes the internal structure of the case; and the `category' facet gives the lexical category of the case. For instance, the `Location' slot in the verb phrase frame has a `domain' facet indicating that its constituents are described in a frame of type `noun phrase'.
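A minimal Python sketch of slots and facets follows, using the sentence `search a file for a string' as the example. The class names, the frame names (`vp1`, `np1`, `np2`), the `Head` slot of the noun phrase frames and the helper `resolve` are assumptions introduced for illustration; only the three facets and the `Action', `Location' and `Goal' cases come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Slot:
    """A semantic case in a frame, described by its facets."""
    value: Optional[str] = None     # `value` facet: a term, or the name of another frame
    domain: Optional[str] = None    # `domain` facet: type of the frame holding the structure
    category: Optional[str] = None  # `category` facet: lexical category of the case


@dataclass
class Frame:
    name: str
    kind: str  # generic frame this frame is an instance of (`IS-A`)
    slots: Dict[str, Slot] = field(default_factory=dict)


def resolve(slot, frames):
    """Return the frame named by the `value` facet, or the raw value if none."""
    if slot.domain is not None and slot.value in frames:
        return frames[slot.value]
    return slot.value


# Hypothetical internal representation of `search a file for a string`.
frames = {
    "np1": Frame("np1", "noun phrase", {"Head": Slot(value="file", category="noun")}),
    "np2": Frame("np2", "noun phrase", {"Head": Slot(value="string", category="noun")}),
}
frames["vp1"] = Frame(
    "vp1",
    "verb phrase",
    {
        "Action": Slot(value="search", category="verb"),
        "Location": Slot(value="np1", domain="noun phrase"),
        "Goal": Slot(value="np2", domain="noun phrase"),
    },
)

location = resolve(frames["vp1"].slots["Location"], frames)
print(location.kind, location.slots["Head"].value)  # noun phrase file
```

Storing a frame name in the `value` facet, rather than nesting frames directly, keeps each frame flat and lets several slots refer to the same noun phrase frame.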

Through the parsing process, the interpretation mechanism maps the verb, the direct object and each prepositional phrase in a sentence into a semantic case, based on both syntactic features and identified case generators.

References

[1] M. R. Girardi and B. Ibrahim, "New Approaches for Reuse Systems," Position Paper Collection of the 2nd International Workshop on Software Reuse (IWSR-2), Lucca, Italy, March 24-26, 1993.
[2] M. R. Girardi and B. Ibrahim, "An Approach to Improve the Effectiveness of Software Retrieval," Proceedings of the Third Irvine Software Symposium (ISS'93), Irvine, California, April 30, 1993, pp. 89-100.
[3] M. R. Girardi and B. Ibrahim, "A Software Reuse System based on Natural Language Specifications," Proceedings of the 5th International Conference on Computing and Information (ICCI'93), Sudbury, May 27-29, 1993, pp. 507-511.
[4] M. R. Girardi and B. Ibrahim, "A Similarity Measure for Retrieving Software Artifacts," Proceedings of the Sixth International Conference on Software Engineering and Knowledge Engineering (SEKE'94), Jurmala, Latvia, June 21-23, 1994, pp. 478-485.
[5] M. R. Girardi and B. Ibrahim, "Automatic Indexing of Software Artifacts," Proceedings of the 3rd International Conference on Software Reuse, Rio de Janeiro, Nov. 1-4, 1994. (To appear).
