VP-SE Research Group (C)

New Approaches for Reuse Systems (Position Paper)

M. R. Girardi and B. Ibrahim
University of Geneva - Centre Universitaire d'Informatique
24, rue du General Dufour, CH 1211 Geneve 4, Switzerland.
(E-mail: girardi@cui.unige.ch)

Keywords:

Software Reuse, Software Retrieval, Software Libraries

1 Introduction

Most software retrieval systems retrieve a set of reusable candidates ranked by similarity with user requirements. Very often, users do not want to invest great effort in selecting a component, so in most cases the list of retrieved candidates is not completely analysed, even if the highly ranked components are discarded. The user assumes that the best suited components are the ones at the top of the retrieved list (e.g. the first to the third). Thus, to select a component from the list, the user examines only the information associated with those first components. If none of them satisfies the requirements, the user may try to rewrite the original query but, in most cases, will abandon the search. Therefore, retrieval systems should exhibit more precision in their answers, by discarding obviously unwanted components from the set of candidates and by retrieving only the best ones for reuse.

Most software retrieval systems retrieve components through a set of keywords provided by a user employing either a controlled or a free vocabulary. These systems are simple and effective for experienced users. The effectiveness breaks down for users not familiar with the proper terminology. Such users may not know a proper keyword, and therefore may use a synonym, a related term, or a more general or specialized term. In such cases, keyword-based systems fail because they don't provide an answer or they retrieve a great number of irrelevant software components.

Some experiments [1] show that users prefer retrieval systems with natural language interfaces instead of keyword-based interfaces. It seems more friendly for a user to say "(I want a component to) traverse a tree in x order or in y order" than to think in proper keywords, corresponding classification schemes and boolean combinations of keywords.

Queries in natural language also allow for improving the precision of retrieval systems. Precision can be improved by allowing queries with multiple words or phrases extracted from the natural language specifications. In a particular domain, software libraries operate with a great number of common concepts (e.g. file in the UNIX system), and retrieval through single words reduces the precision of the system by retrieving a great number of irrelevant components.

In this paper we introduce work in progress for a software reuse system that aims to provide high precision in retrieval by allowing queries in natural language. Additional support for application developers (for understanding and adapting software components) and for library managers (for creation, organization and reorganization of reusable components) is also discussed.

2 Related work

Relatively few software reuse systems take advantage of the rich source of conceptual information available in the natural language documentation of software components to improve retrieval effectiveness. Existing systems may be classified in two basic groups: free-text indexing reuse systems and knowledge-based reuse systems.

Free-text indexing systems automatically extract keyword attributes from the natural language specifications provided by the user and those attributes are used to localize software components. Similarly, software components are classified in the software library by indexing them according to keyword attributes extracted from the natural language documentation of the software components. These systems work at the lexical level. No semantic knowledge is used and no interpretation of the document is given. The GURU system [5] and RSL (Reusable Software Library) [2] follow this approach.
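
The free-text indexing approach described above can be illustrated with a minimal sketch. It is not the GURU or RSL implementation; the stop-word list, the function names and the two component descriptions are hypothetical and chosen only for illustration. Keywords are extracted from each description, stored in an inverted index, and matched lexically against the keywords of the query.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and", "or", "for", "from"}

def extract_keywords(text):
    """Lowercase the text, keep alphabetic words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def build_index(descriptions):
    """Map each keyword to the set of components whose description contains it."""
    index = defaultdict(set)
    for name, text in descriptions.items():
        for keyword in extract_keywords(text):
            index[keyword].add(name)
    return index

def retrieve(index, query):
    """Rank components by the number of query keywords they share (lexical match only)."""
    scores = defaultdict(int)
    for keyword in extract_keywords(query):
        for name in index.get(keyword, ()):
            scores[name] += 1
    return sorted(scores, key=lambda n: (-scores[n], n))

# Hypothetical component descriptions, for illustration only.
descriptions = {
    "tree_walk": "Traverse a tree in preorder or postorder",
    "file_copy": "Copy a file to a directory",
}
index = build_index(descriptions)
```

With this index, the query "traverse the tree" finds tree_walk, whereas "walk the hierarchy" finds nothing, because no word of the query occurs literally in a description. This is the limitation of working at the lexical level.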

Knowledge-based reuse systems perform some kind of syntactic and semantic analysis of the natural language specification without attempting a complete understanding of the document. They are based upon a knowledge base which stores semantic information about the application domain and about the natural language itself. These systems can be more discriminating by analysing terms in their context rather than matching individual words. A thesaurus containing synonyms, specializations and generalizations of each term avoids the problem of having to master exact terminology. LaSSIE (Large Software System Information Environment) [3] and NLH/E (Natural Language Help/English) [8] follow this approach.
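
The role of the thesaurus can be sketched as a query expansion step. The fragment below is an illustration under stated assumptions, not the mechanism used by LaSSIE or NLH/E; the thesaurus entries and the name expand_query are hypothetical.

```python
# Hypothetical thesaurus: each term maps to its synonyms, specializations
# and generalizations. The entries are illustrative only.
THESAURUS = {
    "traverse": {"walk", "visit", "scan"},
    "tree": {"hierarchy", "graph"},
}

def expand_query(terms, thesaurus):
    """Return the query terms plus every thesaurus term related to one of them."""
    terms = set(terms)
    expanded = set(terms)
    for canonical, related in thesaurus.items():
        # A group applies if the user typed its head term or any related term.
        if canonical in terms or related & terms:
            expanded.add(canonical)
            expanded |= related
    return expanded
```

A user who writes "walk" and "hierarchy" thereby also searches for "traverse" and "tree", so a component described with the library's own terminology can still be matched.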

3 A knowledge-based software reuse system based on natural language specifications

In this section we outline a software reuse system based on natural language specifications. The project's goals include evaluating the defined approach in a system we have already developed for producing computer-based learning (CBL) material [4]. Our research group has created IDEAL (Interactive Development Environment for Active Learning), a development environment that makes use of modern Software Engineering techniques (semiformal graphical specifications and automatic code generation) to support as effectively as possible the creation of courseware.

The lesson designers draw the script of a lesson with a special purpose graphical editor. The lesson specifications are the input to an automatic programming system which produces the source code associated with the lessons in one of several commonly used general-purpose programming languages. The specification language combines a graphical formalism with natural language components and is especially attractive to courseware designers who are not programming experts. Given the semiformal aspect of the specification language, the automatic program generator that we have developed is not able to translate the whole specification into executable code. For this reason, we have included in our development environment a synchronous multi-window editor that simultaneously shows the specification while the automatically generated code is completed by hand. The natural language elements of the specification formalism are mainly directives to the coder and answer analysis criteria. The automatic program generator generates, for each directive, an external procedure and, for each answer analysis criterion, a boolean function. The generator actually produces only the declarations for these subprograms; their bodies have to be written by a programmer.
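
The division of labour between generator and programmer can be pictured with a small sketch. It is not the IDEAL generator: the function make_stubs, the naming scheme (directive_1, criterion_1) and the choice of Python as target language are assumptions made only for illustration. Each natural language element becomes a declaration whose body is left open.

```python
def make_stubs(directives, criteria):
    """Emit one procedure stub per directive and one boolean function stub
    per answer analysis criterion; the bodies are left for the programmer."""
    lines = []
    for i, text in enumerate(directives, start=1):
        lines += [
            f"def directive_{i}():",
            f'    """{text}"""',          # the original directive, kept as documentation
            "    raise NotImplementedError",
            "",
        ]
    for i, text in enumerate(criteria, start=1):
        lines += [
            f"def criterion_{i}(answer) -> bool:",
            f'    """{text}"""',          # the original criterion, kept as documentation
            "    raise NotImplementedError",
            "",
        ]
    return "\n".join(lines)

# Hypothetical lesson elements, for illustration only.
stubs = make_stubs(
    ["Show the map of Europe"],
    ["The answer names a capital city"],
)
```

The generated text declares directive_1 and criterion_1 with the natural language element as documentation. Filling in the raise statements is the manual step that the reuse system described below aims to reduce.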

With the approach we describe below, the research aims to find solutions to increase the translation power of the IDEAL automatic program generator by retrieving courseware code components that could be used in the generation of the source code that satisfies the natural language specifications of the lesson designers.

3.1 An overview of the reuse system

The system will be composed of a knowledge base and support for code generation. The knowledge base will contain domain knowledge of courseware applications and of selected subject matter commonly taught, together with code components of commonly used courseware.

The generation support will be composed of four complementary subsystems: a software retrieval system from natural language specifications; a customization system; a domain knowledge acquisition system and a system to assist in the construction of general purpose code components.

The software retrieval system will extend the translation power of the current IDEAL automatic code generator by providing the source code associated with the natural language components of the script of a lesson. The generation strategy will be based on the selection and reuse of code components from the knowledge base. The selection will consist of finding similarities between the natural language specification of a script component and the natural language descriptions of code components in the knowledge base.
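
As a point of reference for the selection step, the sketch below ranks component descriptions against a specification with cosine similarity over word frequencies. This measure is an assumption chosen for its simplicity; the similarity measure of the system is still to be defined, and the names term_vector, cosine and rank, as well as the library entries, are hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term frequencies for a natural language description."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity of two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def rank(specification, library):
    """Return (name, score) pairs with non-zero similarity, best first."""
    spec = term_vector(specification)
    scored = [(name, cosine(spec, term_vector(desc)))
              for name, desc in library.items()]
    return sorted((p for p in scored if p[1] > 0), key=lambda p: (-p[1], p[0]))

# Hypothetical knowledge base entries, for illustration only.
library = {
    "check_date": "check that the answer is a valid date",
    "show_picture": "display a picture on the screen",
}
ranking = rank("check whether the answer is a date", library)
```

In this example check_date is ranked first, but show_picture also receives a non-zero score because it shares the words "a" and "the" with the specification. Such accidental matches on single words are the kind of imprecision that syntactic and semantic analysis of phrases is expected to reduce.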

Very limited intervention by the coder will be needed to select a component for reuse from the list of retrieved candidates. Intervention by the coder, or even by the final user, could be required through a question-answer feedback mechanism that reduces the set of retrieved components by identifying unwanted features and discarding the components that exhibit them. The purpose is to minimize the user effort in the selection of a component for reuse from a set of retrieved components that, in some cases, can be very large.
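
One possible shape of such a feedback mechanism is sketched below, assuming that each retrieved component carries a set of feature labels. The representation, the feature names and the heuristic for choosing the next question are hypothetical; they are not part of the system specification.

```python
def most_discriminating_feature(candidates):
    """Pick the feature whose presence splits the candidate set most evenly,
    so that one answer from the user removes as many candidates as possible."""
    counts = {}
    for features in candidates.values():
        for f in features:
            counts[f] = counts.get(f, 0) + 1
    # Skip features shared by every candidate: rejecting one would empty the set.
    useful = {f: c for f, c in counts.items() if c < len(candidates)}
    if not useful:
        return None
    half = len(candidates) / 2
    # Ties are broken alphabetically because the features are scanned in sorted order.
    return min(sorted(useful), key=lambda f: abs(useful[f] - half))

def discard_unwanted(candidates, feature):
    """Drop every candidate that exhibits a feature the user has rejected."""
    return {name: fs for name, fs in candidates.items() if feature not in fs}

# Hypothetical retrieved components and their features, for illustration only.
candidates = {
    "quiz_a": {"timed", "multiple_choice"},
    "quiz_b": {"multiple_choice"},
    "quiz_c": {"free_text"},
    "quiz_d": {"timed", "free_text"},
}
remaining = discard_unwanted(candidates, "timed")
```

If the user answers that timed exercises are not wanted, only quiz_b and quiz_c remain, and the next question can be chosen among the features that still distinguish them.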

There will be cases, however, in which the retrieval mechanism fails because similar natural language descriptions are not found. This can happen because there are no similar components in the knowledge base or because the natural language descriptions of some components do not describe them correctly or sufficiently. In such cases, the coder will code the desired functionality by hand. The failures of the system will be recorded to identify the information needed for the reorganization of the knowledge base (more domain knowledge, new components that could be reused in similar situations, etc.), in order to improve the effectiveness of the system on later activations of the retrieval mechanism.

We plan to approach the software retrieval problem by integrating information retrieval techniques and syntactic and semantic natural language processing techniques. We expect to establish a theoretical and flexible framework for the retrieval system allowing experiments with alternative similarity measures, classification techniques, etc.

The customization system will provide assistance in tailoring a retrieved generic component to specific requirements. Some reuse guidelines will be associated with the component to help the user modify it. These guidelines can be captured from past development and reuse experience and will suggest alternative design decisions to adapt software components to particular requirements.

The domain knowledge acquisition system will provide the functionality to capture and organize application domain knowledge from experts in the area. The general courseware domain and selected domains of subject matter commonly taught will be considered. A knowledge acquisition technique will be defined to capture the domain knowledge needed in the retrieval system, by integrating techniques from Artificial Intelligence for extracting the expert knowledge used in expert systems [6] and techniques from Domain Analysis [7].

The code component builder will provide the functionality to assist in populating the knowledge base with generic code components reusable in courseware development, together with all the additional information (e.g. natural language descriptions) required to describe and reuse the components. Existing components must be properly qualified according to their reusable attributes before their redesign and inclusion in the knowledge base. Reuse guidelines should be identified and attached to the components in the knowledge base in order to suggest typical uses of each component and design guidelines to adapt it to a new context. Generic code components will be created from scratch or by an abstraction process from specific ones, using techniques like learning by analogy from past development experience and reverse engineering.

4 Concluding remarks

Current work is concentrating on the detailed specification of the retrieval system. Our purpose is to compare the effectiveness of the retrieval system when using alternative search methods and similarity measures. We want to experiment with retrieval by syntactic and semantic analysis of phrases and compare the behaviour of the system with that exhibited by simple keyword-based retrieval systems.

References