 VP-SE Research Group
(C)
A browser's usefulness depends on the features of the software library as well as on the expertise of its users. Such tools are useful for small libraries, for users who know the library content well and have extensive experience in using it, or as a complement to faster retrieval mechanisms such as keyword-based reuse systems [23].
Research on more friendly and effective reuse systems is ongoing. A considerable number of tools and mechanisms for supporting reuse activities in software development have been proposed. They provide assistance either to application developers for retrieving [12][13][16][19], understanding [2][10], customizing [2][10][11][13] and composing [22] components in a software library, or to library managers to create [11][14], organize [12][17][24] and reorganize [6] reusable development information in the software library.
Most software retrieval systems return a set of reusable candidates ranked by similarity to the user's requirements. However, users do not want to invest much effort in selecting a component, so in most cases the list of retrieved candidates is not analysed completely, even when the highly ranked components are discarded. The user assumes that the best suited components are the ones at the top of the list (e.g. the first to the third). Thus, to select a component, the user examines only the information associated with those first components. If none of them satisfies the requirements, the user may try to refine or rewrite the original query but, in most cases, will abandon the search. Therefore, retrieval systems should exhibit more precision in their answers, by discarding obviously unwanted components from the set of candidates and by retrieving only those that satisfy the user's requirements more precisely.
Most software retrieval systems retrieve components through a set of keywords provided by a user employing either a controlled or a free vocabulary. These systems are simple and effective for experienced users. The effectiveness breaks down for users not familiar with the proper terminology. Such users may not know a proper keyword, and therefore may use a synonym, a related term, or a more general or specialized term. In such cases, keyword-based systems fail because they don't provide an answer or they retrieve a great number of irrelevant software components.
The survey introduced in [1] also shows that most of the interviewed users prefer natural language interfaces to keyword-based interfaces for retrieval systems. It seems friendlier for a user to say "(I want a component to) traverse a tree in x order or in y order" than to think in terms of proper keywords, corresponding classification schemes and boolean combinations of keywords. This is even more critical for users who are not familiar with the common terminology or, worse, who face keyword systems based on a controlled vocabulary.
Queries in natural language make it possible to improve the precision of retrieval systems. Precision can be improved by expressing queries, and indexing software components, with multiple words or phrases extracted from the natural language descriptions. In a particular domain, software libraries operate with a great number of common concepts (e.g. file in the UNIX system), and retrieval through single words reduces the precision of the system by retrieving a great number of components that are mostly irrelevant to the final requirements of the user. Achieving high precision is more crucial in software domains than in the typical domains of information retrieval systems. In software retrieval systems the main purpose is not to retrieve all the material in which a particular pattern exists but rather the components that best fit the desired functionality.
Another problem of software reuse is that users often expect the reuse system to suggest the components that are easiest to adapt. They also expect enough information to understand a component, together with guidelines that help them to customize it. Some reuse systems include browsing facilities for examining the information associated with the reusable components, such as natural language documentation and executable examples [23]. Very often this information is not enough. Users often expect information about the effort required to modify the component as well as guidelines for selecting and applying the appropriate design and implementation decisions.
Another factor that limits the usefulness of current reuse systems is the lack of reusable information. Reusable development information includes not only code components but also generic software specifications like generic requirements and design specifications. How to design and maintain reusable development information is an open research issue [14].
In this paper we describe work in progress and research directions for a software reuse system. By processing both queries and descriptions of software components in natural language, we expect to achieve good precision in retrieval and a more friendly user interface. Additional support for application developers (for understanding and adapting software components) and for library managers (for creation, organization and reorganization of reusable components) is also discussed.
The paper is organized as follows. Section 2 discusses current work on software reuse systems supporting retrieval based on natural language specifications. Section 3 describes our software reuse approach and Section 4 introduces an environment for its evaluation. Section 5 concludes the paper with some remarks on current and future work.
Free-text indexing systems automatically extract keyword attributes from the natural language specifications provided by the user, and those attributes are used to locate software components. Similarly, software components are classified in the software library by indexing them according to keyword attributes extracted from their natural language documentation. Such systems attempt to characterize the specification rather than to understand it. They work at the lexical level, so they ignore much of the useful syntactic, semantic and subject-matter-specific contextual information available in a natural language specification. No semantic knowledge is used and no interpretation of the document is given. The GURU [19] and RSL (Reusable Software Library) [4] systems follow this approach.
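The lexical-level behaviour described above can be illustrated with a minimal sketch. It is not the indexing scheme of GURU or RSL; the stop-word list, the component names and their descriptions are invented for illustration, and ranking by the number of shared keywords is an assumed scoring rule.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "or", "for", "on"}

# Invented component descriptions, used only for this illustration.
LIBRARY = {
    "tree_walk": "traverse a tree in preorder or postorder",
    "file_copy": "copy a file to a directory",
    "file_search": "search a file for a pattern",
}

def extract_keywords(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def build_index(descriptions):
    """Map each keyword to the names of the components whose description contains it."""
    index = defaultdict(set)
    for name, text in descriptions.items():
        for keyword in extract_keywords(text):
            index[keyword].add(name)
    return index

def retrieve(index, query):
    """Rank components by the number of query keywords they share (assumed scoring)."""
    scores = defaultdict(int)
    for keyword in extract_keywords(query):
        for name in index.get(keyword, ()):
            scores[name] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = build_index(LIBRARY)
print(retrieve(index, "search a file"))
print(retrieve(index, "walk a hierarchy"))
```

The second query finds nothing because "walk" and "hierarchy" never occur literally in any description, which is the terminology problem that purely lexical matching cannot avoid.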
Knowledge-based reuse systems perform some syntactic and semantic analysis of the natural language specification without attempting a complete understanding of the document. They rely on a knowledge base that stores semantic information about the application domain and about the natural language itself. These systems can be more discriminating because they analyse terms in their context rather than matching individual words. A thesaurus containing synonyms, specializations and generalizations of each term avoids the problem of having to master exact terminology. LaSSIE (Large Software System Information Environment) [8] and NLH/E (Natural Language Help/English) [25] follow this approach.
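As a rough illustration of how a thesaurus relaxes the need for exact terminology, the sketch below expands each query term with its synonyms, generalizations and specializations before matching. The thesaurus entries and the function names are hypothetical and are not taken from LaSSIE or NLH/E.

```python
# Hypothetical thesaurus fragment; the entries are invented for illustration.
THESAURUS = {
    "traverse": {"synonyms": {"walk", "visit"}, "broader": {"process"}, "narrower": set()},
    "tree": {"synonyms": set(), "broader": {"graph"}, "narrower": {"heap"}},
}

def expand(term):
    """Return the term together with every thesaurus term grouped with it."""
    related = {term}
    for head, entry in THESAURUS.items():
        group = {head} | entry["synonyms"] | entry["broader"] | entry["narrower"]
        if term in group:
            related |= group
    return related

def matches(query_terms, index_terms):
    """True if every query term, after expansion, meets at least one index term."""
    index_terms = set(index_terms)
    return all(expand(term) & index_terms for term in query_terms)

print(matches(["walk", "tree"], ["traverse", "tree"]))
print(matches(["sort", "tree"], ["traverse", "tree"]))
```

Treating all related terms as equivalent is a deliberate simplification; a more discriminating system would weight synonyms, generalizations and specializations differently.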
The tool consists of four complementary systems: a knowledge acquisition system, a software retrieval system, a customization system and a system to assist in the construction of general-purpose code components. The following sections summarize the main features of the tool. For a more detailed description see [15].
Our purpose is not to understand the natural language description of a component but to use the lexical, syntactic and semantic information in the description to construct indexing units that can ensure good precision in the retrieval process.
The formalism we use to translate a software description into a frame-like internal representation is based on the linguistic case theory of Fillmore [9]. In [15] we describe this formalism and we present some concrete examples.
Figure 1 shows an overview of the retrieval system [15]. A lexical, syntactic and semantic analysis is performed on the query to translate it into a frame-like internal representation. This is essentially the same process used by the knowledge acquisition system to classify the software components. The translation process uses lexical and semantic information from a thesaurus [20] together with a semantic grammar, with constraints and heuristics, that we defined specifically to support the formalism used for the internal representation of the descriptions.
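To give a feel for what a frame-like representation of a short description might look like, here is a deliberately naive sketch. It is not the semantic grammar of [15]: the case names, the preposition-to-case table and the assumption that the first word is the action are all hypothetical simplifications.

```python
# Hypothetical mapping from prepositions to semantic cases.
PREPOSITION_CASES = {
    "from": "source",
    "to": "destination",
    "with": "instrument",
    "in": "location",
}
ARTICLES = {"a", "an", "the"}

def to_frame(description):
    """Translate a short imperative phrase into a dict of semantic cases.

    Assumes the first word is the action and that each preposition opens a new case.
    """
    words = [w for w in description.lower().split() if w not in ARTICLES]
    if not words:
        return {}
    fillers = {}
    case = "object"  # words right after the verb fill the object case
    for word in words[1:]:
        if word in PREPOSITION_CASES:
            case = PREPOSITION_CASES[word]
        else:
            fillers.setdefault(case, []).append(word)
    frame = {"action": words[0]}
    frame.update({name: " ".join(ws) for name, ws in fillers.items()})
    return frame

print(to_frame("Copy a file from the disk to a directory"))
```

Because queries and component descriptions would go through the same translation, both end up in a form that can be compared case by case.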
The similarity measure is a function of two factors: the conceptual distance between the terms in the internal representation of the query and those in the internal representations of the natural language descriptions of software components in the knowledge base, and the number of semantic cases in the frames that are matched. The conceptual distance is computed by considering synonyms, generalizations and specializations of terms in a thesaurus.
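The text does not give the exact form of this function, so the following is only one possible reading. It assumes that synonyms are at distance zero, that each generalization or specialization step adds one to the distance, and that every matched case contributes 1/(1 + distance) to the score. The thesaurus tables and case names are invented.

```python
# Hypothetical thesaurus: synonyms map to a canonical term, PARENT gives the generalization.
SYNONYMS = {"walk": "traverse", "visit": "traverse"}
PARENT = {"binary_tree": "tree", "tree": "graph", "list": "sequence"}

def canonical(term):
    """Replace a synonym by its canonical term."""
    return SYNONYMS.get(term, term)

def generalizations(term):
    """Return the term followed by its successive generalizations."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def conceptual_distance(a, b):
    """Count the generalization/specialization steps between two terms, or None if unrelated."""
    chain_a = generalizations(canonical(a))
    chain_b = generalizations(canonical(b))
    for steps_a, term in enumerate(chain_a):
        if term in chain_b:
            return steps_a + chain_b.index(term)
    return None

def similarity(query_frame, component_frame):
    """Sum 1 / (1 + distance) over the cases filled in both frames with related terms."""
    score = 0.0
    for case, query_term in query_frame.items():
        component_term = component_frame.get(case)
        if component_term is None:
            continue
        distance = conceptual_distance(query_term, component_term)
        if distance is not None:
            score += 1.0 / (1 + distance)
    return score

query = {"action": "walk", "object": "binary_tree"}
component = {"action": "traverse", "object": "tree"}
print(similarity(query, component))
```

With this choice the score grows with the number of matched cases and shrinks as the matched terms drift apart in the thesaurus, which is the qualitative behaviour the description calls for.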
The reuse attributes of the retrieved candidates (e.g. frequency of reuse and commonality) are measured to estimate the effort of reusing them. The purpose of the reuse effort computation is to avoid selecting components that are difficult to adapt, before the customization process begins.
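How the attributes are combined into an effort estimate is not specified here. The sketch below therefore makes several loud assumptions: commonality is a value between 0 and 1, reuse frequency is normalized by the largest frequency among the candidates, both attributes are weighted equally, and candidates above an effort threshold are dropped before ranking by similarity. All names and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    similarity: float     # retrieval score, higher is better
    reuse_frequency: int  # how many times the component has been reused
    commonality: float    # assumed to lie in [0, 1]

def estimated_effort(candidate, max_frequency):
    """Assumed estimate: frequently reused, highly common components need less effort."""
    frequency = candidate.reuse_frequency / max_frequency if max_frequency else 0.0
    return 1.0 - 0.5 * (frequency + candidate.commonality)

def filter_by_effort(candidates, max_effort):
    """Drop candidates whose estimated effort exceeds max_effort, then rank by similarity."""
    max_frequency = max((c.reuse_frequency for c in candidates), default=0)
    kept = [c for c in candidates if estimated_effort(c, max_frequency) <= max_effort]
    return sorted(kept, key=lambda c: -c.similarity)

candidates = [
    Candidate("a", similarity=0.9, reuse_frequency=0, commonality=0.1),
    Candidate("b", similarity=0.8, reuse_frequency=10, commonality=0.8),
    Candidate("c", similarity=0.7, reuse_frequency=5, commonality=0.6),
]
print([c.name for c in filter_by_effort(candidates, max_effort=0.5)])
```

In this example the candidate with the best similarity score is discarded because it has never been reused and shares little with other components, so adapting it is presumed to be costly.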
Generic code components will be created from scratch or by an abstraction process from specific ones. Learning by analogy from past development experience and reverse engineering are the techniques proposed to deal with this abstraction process. In learning by analogy, the idea is to transform solutions of past problems into potential plans for solving newly generated problems [21]. Reverse engineering techniques may provide heuristics to abstract generic specifications from existing software at a particular specification level, and from the implementation level to the design level to the requirements specification level [3][7]. The level of reusability of the specifications extracted from existing applications will be measured before their inclusion in the software library. Common functionality, ease of modification, correctness, readability and other attributes of reusability may be measured by applying a set of metrics and models [5].
The lesson designers draw the script of a lesson with a special-purpose graphical editor. The lesson specifications are the input to an automatic programming system, which produces the source code associated with the lessons in one of several commonly used general-purpose programming languages. The specification language combines a graphical formalism with natural language components and is especially attractive to courseware designers who are not programming experts.
Given the semiformal nature of the specification language, the automatic program generator that we have developed cannot translate the whole specification into executable code. For this reason, our development environment includes a synchronous multi-window editor that displays the specification alongside the automatically generated code while the code is completed by hand. The natural language elements of the specification formalism are mainly directives to the coder and answer analysis criteria. A directive to the coder is generally a short piece of text describing an action that the computer should perform. By its nature, the text of a directive usually consists of a few short imperative sentences. For each directive the automatic program generator produces an external procedure, and for each answer analysis criterion a boolean function. The generator produces only the declarations of these subprograms; their bodies have to be written by a programmer.
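The division of labour between generator and programmer can be pictured with a small sketch. It does not reproduce the actual generator: the naming scheme, the choice of Python as the target language and the example texts are all assumptions made for illustration.

```python
import re

def identifier(prefix, text, max_words=4):
    """Build a subprogram name from the first few words of the text (assumed naming scheme)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "_".join([prefix] + words[:max_words])

def directive_stub(text):
    """Declare a procedure for a directive; the body is left to the programmer."""
    text = " ".join(text.split())  # collapse whitespace so the comment stays on one line
    return (
        f"def {identifier('do', text)}():\n"
        f"    # Directive: {text}\n"
        f"    raise NotImplementedError\n"
    )

def criterion_stub(text):
    """Declare a boolean function for an answer analysis criterion."""
    text = " ".join(text.split())
    return (
        f"def {identifier('is', text)}(answer) -> bool:\n"
        f"    # Criterion: {text}\n"
        f"    raise NotImplementedError\n"
    )

print(directive_stub("Display the score on the screen"))
print(criterion_stub("Answer contains the word tree"))
```

The text kept in each stub is the natural language fragment that a retrieval system could later use as a query for a previously written body.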
With the approach described earlier, the research aims to increase the translation power of the IDEAL automatic program generator by retrieving previously written courseware code components that could be used in generating source code that satisfies the natural language specifications of the lesson designers.
Current work is concentrating on the retrieval and knowledge acquisition systems. A first prototype is being developed in BIMprolog on a SUN platform. We have chosen this language not only for its facilities for rapid prototyping but also for its facilities for parsing and for the construction of the user interface.
Our purpose is to compare the effectiveness of the retrieval system under alternative search methods and similarity measures. We want to experiment with retrieval by syntactic and semantic analysis of phrases and to compare the behaviour of the system with that of simple keyword-based retrieval systems. In this way we expect to evaluate the potential improvements in retrieval effectiveness introduced by the semantic analysis techniques.
Future work includes the integration of the reuse system in the IDEAL environment and the population of the knowledge base with knowledge and software components from the courseware domain. This will allow us to evaluate the effectiveness and experiment with the retrieval system in a real environment. Existing courseware developed using IDEAL will be used as a source of potential code components to be redesigned and included in the knowledge base. Concrete applications will be developed by using the environment in order to measure the expected reductions in cost and development time through software reuse.