Most software retrieval systems retrieve components through a set of keywords provided by a user employing either a controlled or a free vocabulary. These systems are simple and effective for experienced users. The effectiveness breaks down for users not familiar with the proper terminology. Such users may not know a proper keyword, and therefore may use a synonym, a related term, or a more general or specialized term. In such cases, keyword-based systems fail because they don't provide an answer or they retrieve a great number of irrelevant software components.
Some experiments [1] show that users prefer retrieval systems with natural language interfaces instead of keyword-based interfaces. It seems more friendly for a user to say "(I want a component to) traverse a tree in x order or in y order" than to think in proper keywords, corresponding classification schemes and boolean combinations of keywords.
Queries in natural language also make it possible to improve the precision of retrieval systems. Precision can be improved by allowing queries with multiple words or phrases extracted from the natural language specifications. In a particular domain, software libraries operate with a great number of common concepts (e.g. file in a UNIX system), and retrieval through single words reduces the precision of the system by retrieving a great number of irrelevant components.
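The effect of phrase queries on precision can be illustrated with a minimal sketch. The four-component library, the function names and the choice of a single relevant component are assumptions made only for this illustration; they do not come from any system discussed in this paper.

```python
# Hypothetical mini-library: component name -> natural language description.
LIBRARY = {
    "copy_file": "copy a file to another directory",
    "delete_file": "delete a file from disk",
    "open_file": "open a file for reading",
    "sort_lines": "sort the lines of a file",
}


def match_word(word):
    """Return the components whose description contains the single word."""
    return sorted(n for n, d in LIBRARY.items() if word in d.split())


def match_phrase(phrase):
    """Return the components whose description contains the whole phrase."""
    padded = f" {phrase} "
    return sorted(n for n, d in LIBRARY.items() if padded in f" {d} ")


def precision(retrieved, relevant):
    """Fraction of the retrieved components that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)


relevant = {"delete_file"}
by_word = match_word("file")               # every component mentions "file"
by_phrase = match_phrase("delete a file")  # only one component matches
print(precision(by_word, relevant), precision(by_phrase, relevant))
```

In this toy library the common word "file" retrieves all four components, whereas the phrase retrieves only the relevant one.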
In this paper we introduce work in progress for a software reuse system that aims to provide high precision in retrieval by allowing queries in natural language. Additional support for application developers (for understanding and adapting software components) and for library managers (for creation, organization and reorganization of reusable components) is also discussed.
Free-text indexing systems automatically extract keyword attributes from the natural language specifications provided by the user, and those attributes are used to locate software components. Similarly, software components are classified in the software library by indexing them according to keyword attributes extracted from their natural language documentation. These systems work at the lexical level: no semantic knowledge is used and no interpretation of the document is given. The GURU system [5] and RSL (Reusable Software Library) [2] follow this approach.
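A minimal sketch of lexical free-text indexing follows. It is not the algorithm of GURU or RSL; the stop-word list, the function names and the ranking by number of shared keywords are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Assumed stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "for", "and", "or", "from"}


def extract_keywords(text):
    """Lowercase the text and keep the words that are not stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def build_index(descriptions):
    """Map each keyword to the set of components whose documentation uses it."""
    index = defaultdict(set)
    for name, text in descriptions.items():
        for keyword in extract_keywords(text):
            index[keyword].add(name)
    return index


def retrieve(index, query):
    """Rank components by the number of keywords they share with the query."""
    hits = defaultdict(int)
    for keyword in extract_keywords(query):
        for name in index.get(keyword, ()):
            hits[name] += 1
    return sorted(hits, key=lambda n: (-hits[n], n))


docs = {
    "tree_walk": "Traverse a tree in preorder",
    "list_sort": "Sort a list of records",
}
index = build_index(docs)
print(retrieve(index, "traverse the tree"))  # exact words match
print(retrieve(index, "walk a hierarchy"))   # synonyms do not match
```

The second query returns nothing although it expresses the same need, which is the limitation of purely lexical matching described above.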
Knowledge-based reuse systems perform some syntactic and semantic analysis of the natural language specification without attempting a complete understanding of the document. They rely on a knowledge base which stores semantic information about the application domain and about the natural language itself. These systems can be more discriminating because they analyse terms in their context rather than matching individual words. A thesaurus containing synonyms, specializations and generalizations of each term avoids the problem of having to master the exact terminology. LaSSIE (Large Software System Information Environment) [3] and NLH/E (Natural Language Help/English) [8] follow this approach.
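The role of the thesaurus can be sketched as follows. The entries and function names are hypothetical, only synonyms are modelled (specializations and generalizations are left out), and the sketch does not reflect how LaSSIE or NLH/E are actually implemented.

```python
# Hypothetical thesaurus: preferred term -> set of synonyms.
THESAURUS = {
    "traverse": {"walk", "visit"},
    "tree": {"hierarchy"},
    "delete": {"remove", "erase"},
}


def canonical_terms(words):
    """Replace every known synonym by its preferred term."""
    reverse = {syn: term for term, syns in THESAURUS.items() for syn in syns}
    return {reverse.get(w, w) for w in words}


def matches(query, description):
    """True if every query term, after normalization, occurs in the description."""
    q = canonical_terms(query.lower().split())
    d = canonical_terms(description.lower().split())
    return q <= d


print(matches("walk hierarchy", "traverse a tree in preorder"))
print(matches("sort hierarchy", "traverse a tree in preorder"))
```

With this normalization a user who writes "walk hierarchy" still reaches a component documented with "traverse" and "tree".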
The lesson designers draw the script of a lesson with a special-purpose graphical editor. The lesson specifications are the input to an automatic programming system which produces the source code associated with the lessons in one of several commonly used general-purpose programming languages. The specification language combines a graphical formalism with natural language components and is especially attractive to courseware designers who are not programming experts. Because the specification language is only semiformal, the automatic program generator that we have developed cannot translate the whole specification into executable code. For this reason, we have included in our development environment a synchronous multi-window editor that shows the specification while the automatically generated code is completed by hand. The natural language elements of the specification formalism are mainly directives to the coder and answer analysis criteria. The automatic program generator generates an external procedure for each directive and a boolean function for each answer analysis criterion. The generator produces only the declarations of these subprograms; their bodies have to be written by a programmer.
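The generation of declarations without bodies can be sketched as follows. This is not the IDEAL generator: the naming scheme, the use of Python as the target language and the function names are assumptions made only to illustrate the idea of one stub per directive and one boolean stub per answer analysis criterion.

```python
import re


def identifier(text, prefix):
    """Build a subprogram name from the first four words of the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return prefix + "_" + "_".join(words[:4])


def directive_stub(text):
    """Declaration of a procedure for a directive; the body is left to the coder."""
    name = identifier(text, "do")
    return (
        f"def {name}():\n"
        f'    """Directive: {text}"""\n'
        f"    raise NotImplementedError\n"
    )


def criterion_stub(text):
    """Declaration of a boolean function for an answer analysis criterion."""
    name = identifier(text, "is")
    return (
        f"def {name}(answer) -> bool:\n"
        f'    """Criterion: {text}"""\n'
        f"    raise NotImplementedError\n"
    )


print(directive_stub("Show the solution step by step"))
print(criterion_stub("Answer mentions photosynthesis"))
```

Each stub keeps the original natural language text in its documentation string, so the coder who completes the body can see what the lesson designer asked for.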
The research described below aims to increase the translation power of the IDEAL automatic program generator by retrieving courseware code components that can be used to generate source code satisfying the natural language specifications of the lesson designers.
The generation support will be composed of four complementary subsystems: a software retrieval system from natural language specifications; a customization system; a domain knowledge acquisition system and a system to assist in the construction of general purpose code components.
The software retrieval system will extend the translation power of the current IDEAL automatic code generator by providing the source code associated with the natural language components of the script of a lesson. The generation strategy will be based on the selection and reuse of code components from the knowledge base. The selection will consist of finding similarities between the natural language specification of a script component and the natural language descriptions of code components in the knowledge base.
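The selection step can be sketched with one possible similarity measure. The Jaccard coefficient over word sets, the threshold value and the sample knowledge base are assumptions for illustration; the measure to be used in the planned system has not been fixed.

```python
def tokens(text):
    """Lowercased set of the words of a description."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard coefficient: shared words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank(spec, knowledge_base, threshold=0.3):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(jaccard(spec, desc), name) for name, desc in knowledge_base.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored if score >= threshold]


# Hypothetical knowledge base: component name -> natural language description.
kb = {
    "check_number": "check that the answer is a number",
    "show_hint": "show a hint to the student",
}
print(rank("check the answer is a number", kb))
```

Replacing `jaccard` by another function with the same signature is enough to experiment with a different measure.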
Only very limited intervention by the coder will be needed to select a component for reuse from the list of retrieved candidates. The intervention of a coder, or even of an end user, may be required through a question-answer feedback mechanism that reduces the set of retrieved components by identifying unwanted features and discarding the components that exhibit them. The purpose is to minimize the user's effort in selecting a component for reuse from a set of retrieved components that, in some cases, can be very large.
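A question-answer feedback loop of this kind can be sketched as follows. The representation of components as sets of features, the heuristic of asking first about the feature that splits the candidates most evenly, and all names are assumptions for illustration, not the design of the planned mechanism.

```python
def next_question(candidates, asked):
    """Pick an unasked feature that some, but not all, candidates exhibit.

    Among those, prefer the feature that splits the candidates most evenly.
    """
    counts = {}
    for features in candidates.values():
        for feature in features:
            counts[feature] = counts.get(feature, 0) + 1
    total = len(candidates)
    useful = [f for f, c in counts.items() if c < total and f not in asked]
    if not useful:
        return None
    return min(sorted(useful), key=lambda f: abs(counts[f] - total / 2))


def discard(candidates, unwanted):
    """Remove every candidate that exhibits the unwanted feature."""
    return {n: fs for n, fs in candidates.items() if unwanted not in fs}


def narrow(candidates, is_unwanted, max_left=1):
    """Ask about features until few candidates remain or no question is left."""
    asked = set()
    while len(candidates) > max_left:
        feature = next_question(candidates, asked)
        if feature is None:
            break
        asked.add(feature)
        if is_unwanted(feature):  # the user's answer to the question
            candidates = discard(candidates, feature)
    return candidates


# Hypothetical retrieved components and their features.
retrieved = {
    "cmp_exact": {"case_sensitive", "numeric"},
    "cmp_loose": {"case_insensitive", "numeric"},
    "cmp_text": {"case_sensitive", "textual"},
}
print(narrow(retrieved, lambda f: f == "case_sensitive"))
```

Here the user's answers are simulated by a function; in an interactive setting `is_unwanted` would put the question to the coder.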
There will be cases, however, in which the retrieval mechanism fails because no similar natural language descriptions are found. This can happen because there are no similar components in the knowledge base or because the natural language descriptions of some components do not describe them correctly or sufficiently. In such cases, the coder will code the desired functionality by hand. The failures of the system will be recorded in order to identify the information needed to reorganize the knowledge base (more domain knowledge, new components that could be reused in similar situations, etc.) and thus improve the effectiveness of the system in later activations of the retrieval mechanism.
We plan to approach the software retrieval problem by integrating information retrieval techniques and syntactic and semantic natural language processing techniques. We expect to establish a theoretical and flexible framework for the retrieval system allowing experiments with alternative similarity measures, classification techniques, etc.
The customization system will provide assistance in tailoring a retrieved generic component to specific requirements. Reuse guidelines will be associated with each component to help the user modify it. These guidelines can be captured from past development and reuse experience and will suggest alternative design decisions for adapting software components to particular requirements.
The domain knowledge acquisition system will provide the functionality to capture and organize application domain knowledge from experts in the area. The general courseware domain and selected domains of commonly taught subject matters will be considered. A knowledge acquisition technique will be defined to capture the domain knowledge needed in the retrieval system, by integrating techniques from Artificial Intelligence for extracting the expert knowledge used in expert systems [6] and techniques from Domain Analysis [7].
The code component builder will provide the functionality to assist in populating the knowledge base with generic code components reusable in courseware development, together with all the additional information (e.g. natural language descriptions) required to describe and reuse the components. Existing components must be properly qualified according to their reusable attributes before their redesign and inclusion in the knowledge base. Reuse guidelines should be identified and attached to the components in the knowledge base in order to suggest typical uses of each component and design guidelines for adapting it to a new context. Generic code components will be created from scratch or abstracted from specific ones by using techniques such as learning by analogy from past development experience and reverse engineering.