VP-SE Research Group (C)

An approach to improve the effectiveness of software retrieval

M.R. Girardi and B. Ibrahim
Centre Universitaire d'Informatique - Universite de Geneve
24, rue General Dufour, CH-1211 Geneve 4, Switzerland.
E-mail: {girardi, bertrand}@cui.unige.ch.

Abstract

This paper introduces an approach to software retrieval based on partial natural language analysis of both queries and descriptions of software components. Retrieval is accomplished by matching the semantic structure associated with the query against those associated with the natural language descriptions of software components in a knowledge base. Semantic structures are created automatically by identifying thematic roles in the user input through syntactic techniques and some heuristics. A courseware development environment to be used in the evaluation of the retrieval system is also introduced. Related work on reuse systems and on the application of natural language processing techniques in information retrieval systems is discussed.

1 Introduction

Most approaches to software retrieval propose retrieving software components through a set of keywords provided by a user, using either a controlled or a free vocabulary [21][20]. These systems tend to be ineffective for users not familiar with the appropriate terminology. Moreover, in a particular domain, software libraries operate with a great number of common concepts (e.g. "file" in UNIX command descriptions), and retrieval through single keywords reduces the precision of the systems by retrieving a great number of irrelevant components.

The effectiveness of keyword-based retrieval systems is limited by the so-called "keyword barrier" [19], i.e. these systems are unable to improve retrieval effectiveness beyond a certain performance threshold. This situation is particularly critical in software retrieval, where users require high precision. Promoting software reuse in practice demands minimizing the effort of selecting a component from the list of retrieved ones; the list should thus contain only the components best suited to the user's requirements.

Some interviews [1], conducted to discover user needs and attitudes toward reuse, show that users consider reuse worthwhile, but most of them expect more from application generators or automatic programming tools than from reuse systems such as browsers or keyword-based systems. Users require friendlier and more effective support for reuse.

Queries in natural language make it possible to improve the precision of retrieval systems, for instance by allowing queries with multiple words or phrases extracted from the natural language specifications of software components.

Users also prefer natural language interfaces to keyword-based ones [1]. It is friendlier for a user to say "(I want a component to) search a file for a string" than to think of appropriate keywords, the corresponding classification schemes and Boolean combinations of keywords.

In information retrieval, some improvements in retrieval effectiveness have been obtained with multi-keyword systems, using statistical methods to index documents based on the multiple occurrences of keywords in a text. These improvements are not sufficient, and current research [23] recognizes that some form of natural language processing is necessary to increase effectiveness in text retrieval and other text processing operations. Thus, most recent information retrieval systems incorporate natural language processing techniques, and this experience seems useful for the design of software retrieval systems.

Natural language processing techniques have been applied in information retrieval systems mainly at the lexical, syntactic and semantic levels. Experiments with purely lexical and syntactic techniques have shown poor results, so many current approaches experiment with partial understanding of text through semantic analysis [19].

In this paper we introduce work in progress for a software reuse system that aims to provide high precision in retrieval by applying natural language processing techniques to both queries and natural language descriptions of software components. Additional support for application developers (for understanding and adapting software components) and for library managers (for creation, organization and reorganization of software components) is discussed elsewhere [9].

The paper is organized as follows. Section 2 describes the general architecture of the retrieval system and the mechanisms defined to improve retrieval effectiveness through semantic processing of both queries and descriptions of software components. Section 3 introduces the principal features of a courseware development environment we will use to evaluate this reuse approach. Section 4 summarizes related work on reuse systems and on the application of natural language processing techniques in information retrieval systems. Section 5 concludes the paper with some remarks on current and future research.

2 Software retrieval from natural language specifications

The retrieval system is composed of three subsystems (Figure 1): a lexical and syntactic analysis system, a semantic analysis system, and a matching and similarity analysis system. The first subsystem uses lexical processing techniques to identify the grammatical classes of the individual words that appear in the query by looking them up in a machine-readable dictionary; syntactic processing is then applied to determine the structure of the sentence. The semantic analysis system maps the grammatically parsed text into a knowledge representation formalism; we have chosen a frame-like formalism to represent the meaning of both the query and the natural language descriptions of software components. The third subsystem matches the semantic structure associated with a query against those associated with software components in a knowledge base. Partial matching is allowed, and a similarity measure establishes which candidates are the most relevant to retrieve. An independent thesaurus containing synonyms, generalizations and specializations of terms is used to disambiguate alternative interpretations of the query and to better match different formulations of the same query.
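The three stages can be sketched as a small pipeline. The tiny dictionary, the flat role-assignment rules and the slot-counting score below are illustrative assumptions standing in for the machine-readable dictionary, the syntactic analysis and the similarity measure, not the actual implementation:

```python
# A minimal sketch of the three-stage retrieval pipeline.  All names, the
# dictionary and the knowledge base contents are illustrative assumptions.

DICTIONARY = {                       # stands in for the machine-readable dictionary
    "search": "verb", "find": "verb",
    "a": "det", "the": "det",
    "file": "noun", "string": "noun", "references": "noun",
    "for": "prep", "by": "prep",
}

def lexical_syntactic_analysis(query):
    """Tag each word of the query with its grammatical class."""
    return [(w, DICTIONARY.get(w, "unknown")) for w in query.lower().split()]

def semantic_analysis(tagged):
    """Map the tagged sentence onto a frame of thematic roles (simplified:
    the direct object follows the verb, prepositions switch the role)."""
    frame = {}
    role = "OBJECT"
    for word, cls in tagged:
        if cls == "verb":
            frame["ACTION"] = word
        elif cls == "prep":
            role = {"for": "BENEFICIARY", "by": "MANNER"}.get(word, "LOCATION")
        elif cls == "noun":
            frame[role] = word
    return frame

def match(query_frame, knowledge_base):
    """Rank components by the number of slots shared with the query frame."""
    def score(comp_frame):
        return sum(1 for slot, val in query_frame.items()
                   if comp_frame.get(slot) == val)
    return sorted(knowledge_base, key=lambda c: score(c[1]), reverse=True)

kb = [("grep", {"ACTION": "search", "OBJECT": "file", "BENEFICIARY": "string"}),
      ("man",  {"ACTION": "find", "OBJECT": "references", "MANNER": "keyword"})]

frame = semantic_analysis(lexical_syntactic_analysis("search a file for a string"))
ranked = match(frame, kb)
```

The real system parses the sentence structure before assigning roles; this flat left-to-right approximation only conveys the division of labour between the three subsystems.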

As a first step we deal with simple imperative sentences (verb phrases) for both queries and software component descriptions. A verb phrase consists of a verb with an embedded noun phrase and possibly some embedded prepositional phrases. A prepositional phrase consists of a preposition and a noun phrase. A basic noun phrase consists of a noun, along with a determiner and perhaps one or more adjectives. The sentences describe simple actions that have to be performed by the computer (e.g. "search a file for a string"). Support for more complex software descriptions will be considered in further research by extending the semantic structure used in the internal representation of the software descriptions.
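This restricted grammar can be recognized with a toy recursive-descent procedure; the word lists below are illustrative assumptions, not the actual lexicon:

```python
# A toy recognizer for the restricted grammar:
#   VP -> verb NP (prep NP)* ;  NP -> [det] adj* noun
# Word lists are illustrative assumptions standing in for the dictionary.

VERBS = {"search", "find"}
DETS = {"a", "an", "the"}
PREPS = {"for", "by", "with", "in", "from"}
ADJS = {"regular", "text"}

def parse_noun_phrase(words, i):
    """NP -> [det] adj* noun ; returns index after the NP, or None."""
    if i < len(words) and words[i] in DETS:
        i += 1
    while i < len(words) and words[i] in ADJS:
        i += 1
    if i < len(words) and words[i] not in VERBS | PREPS:
        return i + 1          # treat the remaining word as the head noun
    return None

def parse_verb_phrase(sentence):
    """VP -> verb NP (prep NP)* ; True iff the whole sentence is consumed."""
    words = sentence.lower().split()
    if not words or words[0] not in VERBS:
        return False
    i = parse_noun_phrase(words, 1)
    if i is None:
        return False
    while i < len(words):
        if words[i] not in PREPS:
            return False
        i = parse_noun_phrase(words, i + 1)
        if i is None:
            return False
    return i == len(words)
```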

We expect to improve retrieval effectiveness by doing semantic processing of both queries and natural language specifications of existing components. However, we want to minimize the cost of generating semantic knowledge by introducing mechanisms for its automatic acquisition.

Semantic knowledge in knowledge bases is usually entered entirely "by hand". For example, in systems that use semantic representations like case frames, the case frames used to parse natural language queries have to be created manually. Coding knowledge by hand is time-consuming, tedious and error-prone. We expect to generate the semantic structures automatically by deriving thematic roles from the user input, using syntactic information and a set of heuristics.

Thematic roles show how noun phrases are semantically related to the verbs in a sentence. In an example taken from [27], the sentence "Robbie made coffee for Suzie with a percolator" contains four noun phrases, each of which is associated with a thematic role (Table 1): object (coffee), agent (Robbie), instrument (a percolator) and beneficiary (Suzie), indicating what was made, who made it, with what it was made and for whom it was made, respectively.


        Noun phrase  Thematic role
        -----------  -------------
        Robbie       agent
        coffee       object
        Suzie        beneficiary
        a percolator instrument
Table 1 - Correspondence between noun phrases and thematic roles


We have defined typical thematic roles found in software component descriptions by analysing the short descriptions of Unix commands. A partial list is shown in Table 2.


    Thematic Role   Description
    -------------   ------------------------------------------------
    AGENT           The software component that implements an action
    ACTION          The verb in a sentence other than the "be-verb"
    OBJECT          The thing affected by an action
    LOCATION        Where an action occurs
    TIME            When an action occurs
    PURPOSE         The reason for an action
    BENEFICIARY     The thing for which an action is performed
    DURATION        How long an action occurs
    INSTRUMENT      The thing with which an action is performed
    COMPARISON      The thing compared with the object
    DESTINATION     The destination of the object
    MANNER          The mode by which an action is performed
    SOURCE          The origin of the object
    CONDITION       The premise upon which an action is performed
Table 2 - Thematic roles found in natural language descriptions of software components


Syntactic components, mainly prepositions and certain conjunctions, reveal the presence of thematic roles in a text (Table 3). For example, in the sentence "search a file for a string", "a file" is the direct object of the phrase, revealing the OBJECT thematic role, and the preposition "for" reveals the BENEFICIARY thematic role (the thing for which an action is performed).


        Thematic role   Prepositional triggers
        -------------   ----------------------
        comparison      between, as
        location        in, on, to, into, within
        instrument      with, by
        destination     to
        beneficiary     for, to
        purpose         to, for
        manner          as, in, by, with, at
        source          from
        time            at, when
        duration        for
        condition       until
Table 3 - Thematic roles and prepositional triggers found in descriptions of Unix commands
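The trigger table above maps each preposition to one or more candidate thematic roles; rendered as a lookup (a sketch, ignoring the conjunction triggers and the heuristics that resolve the ambiguous cases):

```python
# Table 3 as a lookup from preposition to candidate thematic roles.
# An ambiguous trigger such as "for" yields several candidates, to be
# resolved later by the disambiguation heuristics.

PREPOSITIONAL_TRIGGERS = {
    "between": {"comparison"}, "as": {"comparison", "manner"},
    "in": {"location", "manner"}, "on": {"location"},
    "to": {"location", "destination", "beneficiary", "purpose"},
    "into": {"location"}, "within": {"location"},
    "with": {"instrument", "manner"}, "by": {"instrument", "manner"},
    "for": {"beneficiary", "purpose", "duration"},
    "at": {"manner", "time"}, "when": {"time"},
    "from": {"source"}, "until": {"condition"},
}

def candidate_roles(preposition):
    """All plausible thematic roles triggered by a preposition."""
    return PREPOSITIONAL_TRIGGERS.get(preposition.lower(), set())
```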


By using the information thus obtained through a simple syntactic analysis and some heuristics, we expect to generate automatically, for each query as well as for each software component description, one or more alternative frames representing plausible interpretations of the parsed phrase.


CLASS Simple_Action
   CASES
      AGENT: string            # software component name
      ACTION: verb             # the verb in the sentence
      OBJECT: noun_phrase      # the thing affected by the action
      LOCATION: noun_phrase    # where the action occurs
      TIME: noun_phrase        # when the action occurs
      PURPOSE: noun_phrase     # the reason for the action
      BENEFICIARY: noun_phrase # the thing for which the action is performed
      MANNER: noun_phrase      # the mode by which the action is performed
                .
                .
                .
   CLASS PROCEDURES
      Identify_thematic_role   #create instances by identifying thematic roles

   INSTANCE PROCEDURES
      Find_synonyms            # look for synonyms of a thematic role
      Find_generalizations     # look for generalizations of a thematic role
      Find_specializations     # look for specializations of a thematic role
                .
                .
                .
      Figure 2   - The generic definition of the semantic formalism
Figure 2 shows the frame-like semantic formalism we have defined to represent the meaning of either a query or a natural language description of a software component (a simple action). A set of slot-value pairs provides the structural description of the unit, representing typical thematic roles found in descriptions of software components. Procedural methods are associated with the class level, implementing the automatic extraction of thematic roles from the text input, and with the instance level, implementing the matching of the semantic structures. In Figure 3, "Simple_Action_1" and "Simple_Action_2" are instances created by the analysis of the sentences "search a file for a string" and "find reference pages by keyword", respectively.


    SIMPLE_ACTION_1
        AGENT          "grep"
(a)     ACTION         "search"
        OBJECT         noun_phrase_1   # "a file"
        BENEFICIARY    noun_phrase_2   # "a string"

    SIMPLE_ACTION_2
        AGENT          "man"
(b)     ACTION         "find"
        OBJECT         noun_phrase_3   # "reference pages"
        MANNER         noun_phrase_4   # "keyword"
Figure 3 - Internal representation of queries and software descriptions in the knowledge base: some instances.
(a) The frame for the "grep" Unix command: "search a file for a string"
(b) The frame for the "man" Unix command: "find reference pages by keyword"
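The matching of such frames against a query frame can proceed slot by slot. In the following sketch, a small synonym table stands in for the thesaurus, and the similarity measure shown (fraction of query slots matched) is only one illustrative choice among the alternatives left open here:

```python
# A sketch of the frame-matching step: two Simple_Action frames compared
# slot by slot.  The synonym table and the similarity measure are
# illustrative assumptions, not the system's actual thesaurus or metric.

SYNONYMS = {"search": {"find", "locate", "scan"},
            "file": {"document"}}

def slots_match(query_val, comp_val):
    """Exact match, or match through the thesaurus."""
    return (query_val == comp_val
            or comp_val in SYNONYMS.get(query_val, set())
            or query_val in SYNONYMS.get(comp_val, set()))

def similarity(query_frame, comp_frame):
    """Fraction of query slots matched by the component frame (0.0 .. 1.0).
    Partial matching: unmatched slots lower the score but do not reject."""
    slots = [s for s in query_frame if s != "AGENT"]
    if not slots:
        return 0.0
    hits = sum(1 for s in slots
               if s in comp_frame and slots_match(query_frame[s], comp_frame[s]))
    return hits / len(slots)

query = {"ACTION": "find", "OBJECT": "file", "BENEFICIARY": "string"}
grep = {"AGENT": "grep", "ACTION": "search", "OBJECT": "file",
        "BENEFICIARY": "string"}
```

Here "find" matches "search" through the thesaurus, so the query frame matches the "grep" frame perfectly despite the different verbs.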

To complement the heuristics used to acquire thematic roles, we have semantic information that helps the system reduce the number of plausible case frames generated for a sentence, based on other words in that sentence. For instance, the preposition "for" suggests two thematic roles: "purpose" (the reason for an action) and "beneficiary" (the thing for which an action is performed). A "purpose" is something immaterial, usually referring to an action (e.g. "create tag files for use with ex and vi"), whereas a "beneficiary" is something material, usually referring to an object (e.g. "generate a flow graph for a C program"). Therefore, depending on what comes after the preposition "for", the system chooses one thematic role over the other, instead of keeping the two possibilities in parallel. This semantic information will be obtained from a public domain thesaurus (e.g. WordNet), with possibly limited extensions that we will add manually for our application domains.
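This disambiguation heuristic can be sketched as follows; the two word lists are illustrative stand-ins for the action/object distinction that would actually be derived from WordNet:

```python
# A sketch of the "for" disambiguation heuristic: if the noun phrase after
# "for" denotes an action (immaterial), choose PURPOSE; if it denotes a
# concrete object (material), choose BENEFICIARY.  The word lists are
# illustrative assumptions standing in for WordNet-derived information.

ACTION_NOUNS = {"use", "printing", "editing", "compilation"}   # immaterial
OBJECT_NOUNS = {"string", "program", "file", "user"}           # material

def resolve_for(head_noun):
    """Pick one thematic role for a "for"-phrase from its head noun;
    fall back to both candidates when the noun is unknown."""
    if head_noun in ACTION_NOUNS:
        return ["PURPOSE"]
    if head_noun in OBJECT_NOUNS:
        return ["BENEFICIARY"]
    return ["PURPOSE", "BENEFICIARY"]    # keep both interpretations
```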

3 Evaluation

Our goals include evaluating the proposed approach in a system we have already developed for producing computer-based learning (CBL) material [11]. Our research group has created IDEAL (Interactive Development Environment for Active Learning), a development environment that makes use of modern Software Engineering techniques (semiformal graphical specifications and automatic code generation) to support the creation of courseware as effectively as possible.

Lesson designers draw the script of a lesson with a special-purpose graphical editor. The lesson specifications are the input to an automatic programming system which produces the source code associated with the lessons in one of several commonly used general-purpose programming languages. The specification language combines a graphical formalism with natural language components and is especially attractive to courseware designers who are not programming experts. Given the semiformal nature of the specification language, the automatic program generator that we have developed is not able to translate the whole specification into executable code. For this reason, we have included in our development environment a synchronous multi-window editor that shows the specification while the automatically generated code is completed by hand. The natural language elements of the specification formalism are mainly directives to the coder and answer analysis criteria. The automatic program generator generates an external procedure for each directive and a boolean function for each answer analysis criterion. The generator actually produces only the declarations for these subprograms; their bodies have to be written by a programmer.

The technique we have described in this paper will be used to increase the translation power of the IDEAL automatic program generator with a component reuse system. This reuse system will detect similarities between the natural language elements of a new specification and the description of existing code components. It will then suggest to the user all possible reuse opportunities.

4 Related work

4.1 Software reuse systems

Wood and Sommerville [28] propose a frame-based software component catalogue intended to capture the meaning of software components. The catalogue consists of a frame for each basic function the software performs, structured around the action, with slots for the objects manipulated by the component. However, the frames are not used to parse input sentences; instead, the retrieval system presents them as templates to be filled in by the user via menus. The time-consuming search through the possible fillers is left to the user, and the indexing task, i.e. the construction of frames for the software component catalogue, is performed manually.

Even though the indexing task in the faceted classification schema proposed by Prieto-Diaz [20] is not based on free-text analysis, the schema uses semantic information to describe software components similar to what we associate with thematic roles. The schema consists of four facets: the function performed by the component, the objects manipulated by the function, the data structure where the function takes place (medium or location), and the system to which the function belongs. The first three facets are considered sufficient to describe the functionality of the component. For example, the following terms, each corresponding to one of the facets, describe a Unix component that "locates" (function facet) "line numbers" (objects facet) in a "file" (data structure facet) and is part of a "line editor" (system facet).

Relatively few software reuse systems take advantage of the rich source of conceptual information available in the natural language documentation of software components to improve retrieval effectiveness. Existing systems may be classified in two basic groups: free-text indexing reuse systems, and knowledge-based reuse systems.

4.1.1 Free-text indexing systems

Free-text indexing systems automatically extract keyword attributes from the natural language specifications provided by the user, and those attributes are used to locate software components. Similarly, software components are classified in the software library by indexing them according to keyword attributes extracted from their natural language documentation. Such systems attempt to characterize the specification rather than understand it. Because these systems work at the lexical level, they ignore much of the useful syntactic, semantic and subject-matter-specific contextual information available in a natural language specification; no semantic knowledge is used and no interpretation of the document is given. Some systems that follow this approach are described below.

The GURU system [17] classifies software components according to the attributes automatically extracted from their natural language documentation by using an indexing scheme based on the notions of lexical affinity and quantity of information. The retrieval system accepts queries in free style natural language and the same technique used to index software components is applied to queries. Even without attempting any understanding of the query and without using any kind of semantic knowledge, the system exhibits better precision than single keyword-based systems by constructing queries with pairs of words that occur in the text, according to the concept of lexical affinity.

The RSL (Reusable Software Library) [2] includes the RSL database and four subsystems: a library management system, a user query system, a software retrieval and evaluation subsystem and a computer-aided design system. The library management system, used to populate and maintain the software library, automatically scans source code files and extracts specially labelled comment statements with attributes such as keywords describing the functionality of components, author, date created, etc. The user query system has a natural language front-end and interacts with the user in order to refine the query and reduce the set of retrieved components. The software retrieval and evaluation subsystem retrieves and scores software components based on keywords that describe data-structure attributes. Finally, the computer-aided design system is used to include retrieved and selected components into an object-oriented design specification.

4.1.2 Knowledge-based reuse systems

Knowledge-based reuse systems perform some kind of syntactic and semantic analysis of the natural language specification without attempting a complete understanding of the document. They are based upon a knowledge base which stores semantic information about the application domain and about the natural language itself. These systems can be more discriminating by analysing terms in their context rather than matching individual words. A thesaurus containing synonyms, specializations and generalizations of each term avoids the problem of having to master exact terminology. Some systems that follow this approach are described below.

LaSSIE (Large Software System Information Environment) [4] is composed of a knowledge base, a semantic analysis algorithm based on formal inference and a user interface incorporating a graphical browser and a natural language parser. LaSSIE helps programmers in searching useful information about a large software system and also supports the retrieval of components for reuse. The construction of the knowledge base is currently a manual process and future directions of the project point to the definition of knowledge acquisition mechanisms in order to automate that process.

NLH/E (Natural Language Help/English) [29] is a question-answering system that accepts queries in English. The evolution of NLH/E towards an expert system for reuse is one of the project goals. The system consists of a caseframe parser, a thesaurus and a caseframe grammar. The thesaurus and the grammar are specific to the Common Lisp domain. The input is restricted to imperative sentences or nominal phrases. Input sentences are matched against caseframes in a knowledge base. These frames are created manually. The parser simply returns the tree of matched caseframes. If there is no possible match that consumes all input, the parser returns the "best" matches, i.e. those that account for most of the input.

4.2 Natural language processing in information retrieval

NLP techniques have been applied in information retrieval systems mainly at the lexical, syntactic and semantic level.

The availability of machine readable dictionaries has allowed the use of lexical processing techniques in information retrieval. Stemming is a lexical technique frequently used. A stemming algorithm uses morphological information to remove suffixes from words, thus reducing variations on a root word to a single form or stem.
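As an illustration, a minimal suffix-stripping stemmer might look as follows; the suffix list is a small illustrative subset of what a full algorithm such as Porter's covers:

```python
# A minimal suffix-stripping stemmer.  The suffix list is a small,
# illustrative subset; a real system would use a complete algorithm
# (e.g. Porter's) with context-sensitive rules.

SUFFIXES = ["ations", "ation", "ing", "ers", "er", "ies", "es", "s", "ed"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of >= 3 letters,
    so that variant word forms reduce to a single stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

With this sketch, "searching", "searched" and "searches" all reduce to the same stem, so a query using one form matches documents using the others.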

Machine-readable dictionaries have also been used in information retrieval to index text and queries by word senses instead of the words themselves. These approaches aim to increase precision in retrieval by providing a solution to the polysemy phenomenon, i.e. the fact that words have multiple meanings. Polysemy reduces precision because the words used to match relevant documents may have other meanings present in irrelevant documents. The dictionary provides a definition for each sense of a word, and the system tries to infer which sense is most likely to be correct by looking for confirming evidence in the context in which the word appears. Experiments using this approach do not show considerable improvements in retrieval effectiveness, but the authors believe that more experimentation is needed.

Another linguistic phenomenon that reduces retrieval effectiveness, particularly recall, is synonymy, i.e. having multiple words or phrases to describe the same concept. Thesauri containing synonyms, generalizations, specializations of terms or other related information have been used to address this problem [14] [8].

NLP techniques for syntactic analysis have been used in information retrieval to index text by elements more complex than single words, such as noun phrases. Templates [5] and parsing techniques [25] [24] have been used to determine the boundaries of noun phrases and to index text and queries with them. Syntactic analysis in information retrieval is mostly restricted to noun phrases and is thus simpler than what is usually required in NLP applications. It has not been demonstrated, however, that any of these syntactic indexing techniques significantly improves retrieval performance.

One problem with indexing by noun phrases is that there are many different syntactic forms expressing the same concept (e.g. "information retrieval" and "retrieval of information"). Thus, finding similarities between two noun phrases that refer to a similar concept is a very difficult process: the phrase must have the same syntactic structure in both the document and the query for the concept to be matched, and this rarely happens.

The other problem is that queries normally denote supersets of what the relevant documents denote [10]. If the query is "software engineering methods", we want, among others, all the documents about "object oriented software engineering methods". The syntactic structures of the query and of the expressions making up the documents will be different, and we need some kind of syntactic correlate of the semantic superset/subset relation.

One approach to these problems is to normalize the syntactic structures into a canonical form. Noun phrases like "information retrieval" and "retrieval of information" give rise to the same canonical or dependency structure and can therefore be retrieved by the same query. This approach is taken by the CLARIT system [6], where text and queries are indexed by noun phrases in a normalized form. Input text is parsed and candidate noun phrases are identified. These noun phrases are normalized by mapping them into canonical frames (reflecting their lexical semantics), emphasizing modifier and head relations. The resulting terms are matched against a thesaurus containing the terminology (essentially a phrase list) of a specific domain. Through the match, terms are classified as exact (identical to some phrase in the list), general (terms in the list are part of the candidate term) or novel (new terms, not in the list, which are potential new index terms). Except for a small subset, the thesaurus is constructed automatically from documents in a specific domain, using CLARIT-based processing.
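The normalization idea can be illustrated with a sketch that reduces a simple noun phrase to a (head, modifiers) pair; the handling of "of"-phrases is a deliberate simplification of what CLARIT actually does:

```python
# A sketch of noun-phrase normalization into a canonical (head, modifiers)
# dependency form, so that "information retrieval" and "retrieval of
# information" index to the same structure.  The "of" handling is an
# illustrative simplification, not CLARIT's actual processing.

def canonical_form(noun_phrase):
    """Return (head, frozenset(modifiers)) for a simple noun phrase."""
    words = [w for w in noun_phrase.lower().split() if w not in ("a", "the")]
    if "of" in words:
        i = words.index("of")
        head, modifiers = words[i - 1], words[i + 1:]   # "retrieval of information"
    else:
        head, modifiers = words[-1], words[:-1]         # "information retrieval"
    return (head, frozenset(modifiers))
```

The same structure also gives a syntactic correlate of the superset/subset relation discussed above: a document phrase matches a query phrase when the heads agree and the query's modifier set is a subset of the document's.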

Experiments on retrieval by syntactic phrases have shown poor results. Furthermore, experiments on retrieval by statistical phrases exhibit better effectiveness than purely syntactic methods for information retrieval [7]. Salton [23] stated that "syntactic systems not based on contextual and other semantic considerations cannot be completely successful in eliminating ambiguities and constructing useful indexing units". Even though the application of pure syntactic analysis techniques in information retrieval has not shown real improvements in retrieval effectiveness, syntactic techniques have demonstrated their usefulness as auxiliary mechanisms for semantic processing such as caseframe parsing. Syntactic methods are used to take advantage of the text structure, to link words (e.g. how an adjective is linked to a noun) and to resolve some problems related to negations, conjunctions, etc.

Former approaches to semantic processing in information retrieval were based on the definition of a formal language (unambiguous, with simple interpretation rules and a logical structure) into which input text could be translated. More recently, knowledge representation frameworks from artificial intelligence, like semantic networks and frames, have been used to represent knowledge by specifying primitives or simple concepts and then combining them to define more complex concepts.

Most information retrieval systems that use semantic analysis techniques are conceptual information retrieval systems [13] [19]. In conceptual information retrieval, the user requests information and is given the information directly, not just a reference to where it may be found. The most common semantic representation structures used in information retrieval are case frames [13] [26] [12] and scripts [16].

The main difficulty of semantic analysis is the large amount of knowledge needed to process the meaning of a sentence. For that reason, semantic analysis is usually done only in a restricted domain, using a knowledge base manageable in size and complexity.

5 Concluding remarks

An approach to software retrieval based on partial analysis of free-text queries and descriptions of software components has been proposed. The system includes mechanisms for semantic processing based on the automatic acquisition of thematic roles from the text input through syntactic processing and some heuristics. The approach aims to provide better effectiveness than keyword-based reuse systems and a friendlier user interface.

A first prototype of the retrieval system is being constructed in BIMProlog [22], using the WordNet public domain thesaurus. Our purpose is to analyse the effectiveness of our retrieval approach using alternative matching algorithms and alternative similarity measures. We also want to compare the behavior of our system with that exhibited by traditional keyword-based retrieval systems.

Although we plan to experiment initially with retrieval from natural language specifications that describe simple actions performed by a computer, we expect to construct flexible mechanisms that can be applied to the retrieval of more complex code components and software development information with higher leverage, like design and requirements specifications.

References
