VP-SE Research Group (C)

A Software Reuse System Based on Natural Language Specifications

M. R. Girardi and B. Ibrahim
University of Geneva, C.U.I., Geneva, Switzerland.
E-mail: {girardi, bertrand}@cui.unige.ch


Abstract

Promoting software reuse practice requires more effective support. In this paper we discuss some problems in current software reuse systems and how current research in retrieval through natural language specifications addresses them. We introduce work in progress for a software reuse system that aims to provide high precision in retrieval by processing both queries in natural language and descriptions of components in a software library. Additional support for application developers (for understanding and adapting software components) and for library managers (for creation, organization and reorganization of reusable components) is also discussed.

1 Introduction

A survey of application programmers [1], conducted to discover user needs and attitudes toward reuse, shows that users consider reuse worthwhile, but most of them (especially those without object-oriented experience) expect more from application generators or from tools for automatic programming than from reuse systems like browsers.

A browser's usefulness actually depends on the features of the software library as well as on the expertise of its users. Such tools are useful for small libraries, for users who know the library content well and have extensive experience in using it, or as complements to faster retrieval systems, such as keyword-based reuse systems [23].

Research on more friendly and effective reuse systems is ongoing. A considerable number of tools and mechanisms for supporting reuse activities in software development have been proposed. They provide assistance either to application developers for retrieving [12][13][16][19], understanding [2][10], customizing [2][10][11][13] and composing [22] components in a software library, or to library managers to create [11][14], organize [12][17][24] and reorganize [6] reusable development information in the software library.

Most software retrieval systems retrieve a set of reusable candidates ranked by similarity to the user's requirements. However, users do not want to invest great effort in selecting a component. So, in most cases, the list of retrieved candidates is not completely analysed, even if the highly ranked components are discarded. The user assumes that the best suited components are the ones at the top of the list of candidates (e.g. the first to the third) and therefore examines only the information associated with those first components. If none of them satisfies the requirements, the user may try to refine or rewrite the original query but, in most cases, will abandon the search. Therefore, retrieval systems should exhibit more precision in their answers, by discarding obviously unwanted components from the set of candidates and by retrieving only those that satisfy the user's requirements more precisely.

Most software retrieval systems retrieve components through a set of keywords provided by a user employing either a controlled or a free vocabulary. These systems are simple and effective for experienced users. The effectiveness breaks down for users not familiar with the proper terminology. Such users may not know a proper keyword, and therefore may use a synonym, a related term, or a more general or specialized term. In such cases, keyword-based systems fail because they don't provide an answer or they retrieve a great number of irrelevant software components.

The survey introduced in [1] also shows that most of the interviewed users prefer natural language interfaces to keyword-based interfaces for retrieval systems. It seems more friendly for a user to say "(I want a component to) traverse a tree in x order or in y order" than to think of proper keywords, corresponding classification schemes and boolean combinations of keywords. This is even more critical for users not familiar with common terminology or, worse, with keyword systems based on a controlled vocabulary.

Queries in natural language make it possible to improve the precision of retrieval systems. Precision can be improved by querying and indexing software components with multiple words or phrases extracted from the natural language descriptions. In a particular domain, software libraries operate with a great number of common concepts (e.g. file in a UNIX system), and retrieval through single words reduces the precision of the system by retrieving a great number of components that are mostly irrelevant to the final requirements of the user. Achieving high precision is more crucial in software domains than in the typical domains of information retrieval systems. In software retrieval systems the main purpose is not to retrieve all the material in which a particular pattern exists but rather the components that best fit the desired functionality.

Another problem of software reuse is that users often expect the reuse system to suggest the components that are easiest to adapt. They also expect enough information to understand a component and guidelines that help them customize it. Some reuse systems include browsing facilities for examining information associated with the reusable components, like natural language documentation and executable examples [23]. Very often this information is not enough. Users often expect information about the effort required to modify the component as well as guidelines for selecting and applying the appropriate design and implementation decisions.

Another factor that limits the usefulness of current reuse systems is the lack of reusable information. Reusable development information includes not only code components but also generic software specifications like generic requirements and design specifications. How to design and maintain reusable development information is an open research issue [14].

In this paper we describe work in progress and research directions for a software reuse system. By processing both queries and descriptions of software components in natural language, we expect to achieve good precision in retrieval and a more friendly user interface. Additional support for application developers (for understanding and adapting software components) and for library managers (for creation, organization and reorganization of reusable components) is also discussed.

The paper is organized in the following way. Section 2 discusses current work on software reuse systems supporting retrieval based on natural language specifications. Section 3 describes our software reuse approach and section 4 introduces an environment for its evaluation. Section 5 concludes the paper with some remarks on current and future work.

2 Related work

Proposals for software reuse systems appear frequently in the literature in the form of browsers or keyword-based query systems. Relatively few software reuse systems take advantage of the rich source of conceptual information available in the natural language documentation of software components to improve retrieval effectiveness. Existing systems may be classified in two basic groups: free-text indexing reuse systems, and knowledge-based reuse systems.

Free-text indexing systems automatically extract keyword attributes from the natural language specifications provided by the user, and those attributes are used to locate software components. Similarly, software components are classified in the software library by indexing them according to keyword attributes extracted from their natural language documentation. Such systems attempt to characterize the specification rather than to understand it. These systems work at the lexical level, so they ignore much of the useful syntactic, semantic and subject-matter specific contextual information available in a natural language specification. No semantic knowledge is used and no interpretation of the document is given. The GURU [19] and RSL (Reusable Software Library) [4] systems follow this approach.
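The lexical approach can be illustrated with a minimal sketch. It is not the GURU or RSL algorithm; the stop-word list, the component names and the scoring by shared keyword count are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and", "or", "for", "on", "with"}

def extract_keywords(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def build_index(descriptions):
    """Map each keyword to the names of the components whose description contains it."""
    index = defaultdict(set)
    for name, text in descriptions.items():
        for keyword in extract_keywords(text):
            index[keyword].add(name)
    return index

def search(index, query):
    """Rank components by the number of keywords they share with the query."""
    scores = defaultdict(int)
    for keyword in extract_keywords(query):
        for name in index.get(keyword, ()):
            scores[name] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical library of three component descriptions.
descriptions = {
    "tree_walk": "traverse a tree in preorder",
    "file_copy": "copy a file to a directory",
    "file_search": "search a file for a pattern",
}
index = build_index(descriptions)
```

With this toy library, the query "search a file" ranks file_search above file_copy, while "visit nodes" retrieves nothing because neither word occurs in any description. The second case is the purely lexical failure that the knowledge-based systems try to avoid.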

Knowledge-based reuse systems perform some kind of syntactic and semantic analysis of the natural language specification without attempting to understand the document completely. They rely on a knowledge base which stores semantic information about the application domain and about the natural language itself. These systems can be more discriminating because they analyse terms in their context rather than matching individual words. A thesaurus containing synonyms, specializations and generalizations of each term avoids the problem of having to master exact terminology. LaSSIE (Large Software System Information Environment) [8] and NLH/E (Natural Language Help/English) [25] follow this approach.

3 An overview of our reuse system

The general purpose of our research is to contribute concrete solutions for controlling software development and maintenance costs by making the practice of software reuse easier. In particular, the research aims at defining a tool for retrieving software by describing it in free text. The system must be cost-effective, precise in its answers and domain-independent (i.e. independent of the application domain of the software libraries).

The tool consists of four complementary systems: a knowledge acquisition system, a software retrieval system, a customization system and a system to assist in the construction of general purpose code components. The following sections summarize the main features of the tool. For a more detailed description see [15].

3.1 The knowledge acquisition system

The knowledge acquisition system automatically extracts from the descriptions of software components the knowledge needed to classify them in the knowledge base. The system extracts lexical, syntactic and semantic knowledge from the descriptions in natural language, and this knowledge is used to generate the internal indexes of the components. We use a frame-like internal representation.

Our purpose is not to understand the natural language description of a component but to use the lexical, syntactic and semantic information in the description to construct indexing units that can ensure good precision in the retrieval process.

The formalism we use to translate a software description into a frame-like internal representation is based on the linguistic case theory of Fillmore [9]. In [15] we describe this formalism and we present some concrete examples.
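To give a rough idea of what a frame-like representation built on semantic cases might look like, here is a minimal sketch. It is not the formalism of [15]: the case names, the preposition-to-case mapping and the crude word-level parsing of a short imperative sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from prepositions to case names, loosely in the
# spirit of case grammar; it is not the formalism defined in [15].
PREPOSITION_CASES = {
    "with": "instrument",
    "from": "source",
    "to": "goal",
    "in": "location",
}
DETERMINERS = {"a", "an", "the"}

@dataclass
class Frame:
    """An action (the verb) plus the terms that fill each semantic case."""
    action: str
    cases: dict = field(default_factory=dict)

def parse_imperative(sentence):
    """Build a frame from a short imperative: verb, object, prepositional phrases."""
    words = [w for w in sentence.lower().split() if w not in DETERMINERS]
    if not words:
        raise ValueError("empty description")
    frame = Frame(action=words[0])
    current_case = "object"  # words after the verb fill the object case first
    for word in words[1:]:
        if word in PREPOSITION_CASES:
            current_case = PREPOSITION_CASES[word]
        else:
            frame.cases.setdefault(current_case, []).append(word)
    return frame

frame = parse_imperative("copy the file from a directory to the tape")
```

Here the frame has the action "copy" and three filled cases: object (file), source (directory) and goal (tape). A real translation would rely on a grammar and a thesaurus rather than on word position alone.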

3.2 The retrieval system

The retrieval system looks for and selects from the knowledge base the most relevant and flexible components for reuse, based on the requirements of the desired component stated in a query in natural language. The retrieval algorithm is based on the detection of similarities between the user query and the description of software components in the knowledge base.

Figure 1 shows an overview of the retrieval system [15]. A lexical, syntactic and semantic analysis is performed on the query to translate it into a frame-like internal representation. This is essentially the same process used by the knowledge acquisition system to classify the software components. The translation process uses lexical and semantic information in a thesaurus [20] and a semantic grammar, with constraints and heuristics, that we define specifically to support the formalism used for the internal representation of the descriptions.

The internal representation of the query is matched against the frames associated with software components in the knowledge base. Both perfect and partial matches are considered. A similarity analysis is performed and the most relevant components are selected from the knowledge base.

The similarity measure is a function of two quantities: the conceptual distance between the terms in the internal representation of the query and those in the internal representations of the component descriptions in the knowledge base, and the number of semantic cases in the frames that are matched. The conceptual distance is computed by considering synonyms, generalizations and specializations of terms in a thesaurus.
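One possible shape for such a measure is sketched below. The text does not give the actual function, so everything here is an assumption: the toy thesaurus, the choice to count generalization links as distance, and the averaging of 1/(1 + distance) over the cases of the query.

```python
# Hypothetical toy thesaurus: synonyms map to a canonical term, and each
# canonical term has at most one more general term.
SYNONYMS = {"walk": "traverse", "visit": "traverse", "duplicate": "copy"}
BROADER = {"tree": "structure", "list": "structure", "binary_tree": "tree"}

def canonical(term):
    """Replace a synonym by its canonical term."""
    return SYNONYMS.get(term, term)

def ancestors(term):
    """Return the term followed by its successively more general terms."""
    chain = [term]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain

def conceptual_distance(a, b):
    """Count generalization/specialization links between two terms; None if unrelated."""
    chain_a, chain_b = ancestors(canonical(a)), ancestors(canonical(b))
    for steps_a, term in enumerate(chain_a):
        if term in chain_b:
            return steps_a + chain_b.index(term)
    return None

def similarity(query, component):
    """Average closeness over the query's cases; unmatched cases contribute zero."""
    if not query:
        return 0.0
    total = 0.0
    for case, term in query.items():
        if case in component:
            distance = conceptual_distance(term, component[case])
            if distance is not None:
                total += 1.0 / (1 + distance)
    return total / len(query)
```

In this sketch a synonym counts as distance 0, so a query frame {action: walk, object: binary_tree} scores 0.75 against a component frame {action: traverse, object: tree}: the actions match exactly through the thesaurus and the objects are one generalization link apart. Both the number of matched cases and the distance per case therefore influence the score.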

The reuse attributes of the retrieved candidates (e.g. frequency of reuse and commonality) are measured to estimate the effort of reusing them. The purpose of the reuse effort computation is to avoid selecting components that are difficult to adapt, before the customization process begins.

3.3 The customization system

The customization system provides assistance in tailoring a retrieved generic component to specific requirements. Reusing a code component as a "black-box" (without modifying it) is not as common as reusing it as a "white-box" (by modifying it). The adaptation of a generic component to specific requirements is not an easy process. Therefore, some reuse guidelines are associated with the component to help the user modify it. These guidelines are captured from past development and reuse experience, in the form of design decisions [14]. They suggest alternative design decisions (e.g. generalization, specialization, composition or decomposition of software components) for adapting software components to particular requirements. They are also updated to reflect the customization difficulties encountered while using the system, in order to facilitate future adaptation of components.

3.4 The component builder

The code component builder assists in populating the knowledge base with generic code components and all the additional information (e.g. natural language descriptions) required to describe and reuse the components. Designing reusable software (from scratch or from existing components) is not an easy task. Good knowledge of the application domain and past experience in the development of applications in the domain are required. Existing components must be properly qualified according to their reusable attributes before their redesign and inclusion in the knowledge base [5][14]. Reuse guidelines are identified and attached to the components in the knowledge base in order to suggest typical uses of each component and design guidelines for adapting it to a new context.

Generic code components will be created from scratch or by an abstraction process from specific ones. Learning by analogy from past development experience and reverse engineering are proposed techniques to deal with this abstraction process. In the techniques for learning by analogy, the idea is to transform solutions of past problems into potential plans to solve newly generated problems [21]. Reverse engineering techniques may provide heuristics to abstract generic specifications from existing software at a particular specification level, and from the implementation level to the design level to the requirements specification level [3][7]. Some measurements of the level of reusability of the specifications extracted from existing applications will be made before their inclusion in the software library. Common functionality, ease of modification, correctness, readability and other attributes of reusability may be measured by applying a set of metrics and models [5].

4 An environment to evaluate the approach

The project's goals include evaluating the defined approach in a system we have already developed for producing computer-based learning (CBL) material [18]. Our research group has created IDEAL (Interactive Development Environment for Active Learning), a development environment that makes use of modern Software Engineering techniques (semiformal graphical specifications and automatic code generation) to support as effectively as possible the creation of courseware. The main goal of IDEAL is to facilitate the task of teachers and pedagogues who specify the lessons and the task of programmers who implement, test and maintain the source code associated with the specifications.

The lesson designers draw the script of a lesson with a special purpose graphical editor. The lesson specifications are the input to an automatic programming system which produces the source code associated with the lessons in one of several commonly used general-purpose programming languages. The specification language combines a graphical formalism with natural language components and is especially attractive to courseware designers who are not programming experts.

Given the semiformal aspect of the specification language, the automatic program generator that we have developed is not able to translate the whole specification into executable code. For this reason, we have included in our development environment a synchronous multi-window editor that simultaneously shows the specification while the automatically generated code is completed by hand. The natural language elements of the specification formalism are mainly directives to the coder and answer analysis criteria. A directive to the coder is generally a short piece of text describing an action that the computer should perform. By its nature, the text of a directive usually consists of very few short imperative sentences. The automatic program generator generates an external procedure for each of these directives and a boolean function for each answer analysis criterion. The generator actually produces only the declarations of these subprograms; their bodies have to be written by a programmer.
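The division of labour between generator and programmer can be pictured with a small sketch. It is not the IDEAL generator: the naming scheme, the do_ and check_ prefixes, the choice of Python as target language and the sample texts are assumptions made for illustration, and the sketch assumes the specification text contains no triple quotes.

```python
import re

def identifier(text):
    """Derive a hypothetical subprogram name from the first words of the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "_".join(words[:4])

def directive_stub(text):
    """Declare a procedure for a directive; the body is left to the programmer."""
    return (
        f"def do_{identifier(text)}():\n"
        f'    """Directive: {text}"""\n'
        f"    raise NotImplementedError\n"
    )

def criterion_stub(text):
    """Declare a boolean function for an answer analysis criterion."""
    return (
        f"def check_{identifier(text)}(answer) -> bool:\n"
        f'    """Criterion: {text}"""\n'
        f"    raise NotImplementedError\n"
    )

# Hypothetical specification elements.
generated = directive_stub("Display the score.") + "\n" + criterion_stub(
    "Answer is a positive integer"
)
```

The generated text declares do_display_the_score() and check_answer_is_a_positive(answer), each carrying the original natural language text as documentation. That text is exactly what a retrieval system could use as a query to look for an existing body.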

With the approach described earlier, the research aims at increasing the translation power of the IDEAL automatic program generator by retrieving previously written courseware code components that could be used in generating the source code that will satisfy the natural language specifications of the lesson designers.

5 Concluding remarks

We have introduced the main features of a software reuse system. The system is user-friendly, by allowing queries in natural language; precise, by constructing the indexing units of software components with lexical, syntactic and semantic information extracted from the descriptions of software components; domain-independent and cost-effective, by the automatic acquisition of the knowledge required for the retrieval system.

Current work is concentrating on the retrieval and knowledge acquisition systems. A first prototype is being developed in BIMprolog on a SUN platform. We have chosen this language not only for its facilities for rapid prototyping but also for its facilities for parsing and for the construction of the user interface.

Our purpose is to make a comparison of the effectiveness of the retrieval system by using alternative search methods and similarity measures. We want to experiment with retrieval by syntactic and semantic analysis of phrases and compare the behaviour of the system with the one exhibited by simple keyword-based retrieval systems. Thus, we expect to evaluate the potential improvements in the retrieval effectiveness introduced by the semantic analysis techniques.
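A comparison of this kind is commonly expressed with the standard information retrieval measures of precision (the fraction of retrieved components that are relevant) and recall (the fraction of relevant components that are retrieved). The sketch below computes both for one query; the component names and relevance judgments are hypothetical, and the text does not state that these exact measures will be used.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical results of two systems for the same query.
relevant = ["tree_walk", "tree_search"]
keyword_result = ["tree_walk", "file_walk", "dir_walk", "tree_copy"]
semantic_result = ["tree_walk"]

keyword_scores = precision_recall(keyword_result, relevant)
semantic_scores = precision_recall(semantic_result, relevant)
```

In this made-up case both systems find one of the two relevant components (recall 0.5), but the keyword result buries it among three irrelevant ones (precision 0.25 against 1.0). Averaging such figures over a set of queries would give one basis for comparing search methods and similarity measures.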

Future work includes the integration of the reuse system in the IDEAL environment and the population of the knowledge base with knowledge and software components from the courseware domain. This will allow us to evaluate the effectiveness and experiment with the retrieval system in a real environment. Existing courseware developed using IDEAL will be used as a source of potential code components to be redesigned and included in the knowledge base. Concrete applications will be developed by using the environment in order to measure the expected reductions in cost and development time through software reuse.

6 References
