In Proceedings of Fifth ASIS SIG/CR (American Society for Information Science, Classification Research workshop), Oct. 16, 1994, Alexandria, Virginia, USA.
This paper briefly describes the classification scheme of a software reuse system [1][2][3][4][5] that processes natural-language descriptions of software components. The major requirements for the system are good retrieval effectiveness, cost-effectiveness and domain independence.
The classification system [5] catalogues the software components in a software base through their descriptions in natural language. A knowledge acquisition mechanism automatically extracts from software descriptions the knowledge needed to catalogue them in the software base. The system extracts lexical, syntactic and semantic information and this knowledge is used to create a frame-like internal representation for the software component.
Semantic analysis of descriptions follows the rules of a semantic formalism. The semantic formalism is based on semantic relationships between noun phrases and the verb in a sentence. These semantic relationships ensure that similar software descriptions produce similar internal representations.
A classification scheme for software components derives from the semantic formalism, through a set of generic frames. The internal representation of a description constitutes the indexing unit for the software component, constructed as an instance of these generic frames.
Public domain lexicons (Webster and WordNet) are used to get lexical and semantic information needed during the parsing process.
The Knowledge Base is a base of frames in which each software component has a set of associated frames. These frames contain the internal representation of the component's description along with other information associated with the component, such as source code, executable examples and reuse attributes.
The same analysis mechanism applied to software descriptions is used to map a free-text query into a frame-like internal representation. The retrieval system uses the set of frames associated with the query to identify similar ones in the Knowledge Base.
The retrieval system [4] searches the software base and selects components based on the closeness between the frames associated with the query and those associated with the software descriptions. Closeness measures are derived from the semantic formalism and from the conceptual distance between terms in the query and in the software descriptions. Software components are scored according to their closeness to the user query, and those with a score higher than a given threshold become the reuse candidates.
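The scoring and thresholding step can be illustrated with a minimal sketch. It is not the closeness measure of the system described here: frames are reduced to one head term per semantic case, and the names `TERM_DISTANCE`, `term_closeness`, `closeness` and `retrieve`, as well as the distance values and the averaging formula, are assumptions made only for illustration.

```python
from typing import Dict, List, Tuple

Frame = Dict[str, str]  # semantic case name -> head term (a simplification)

# Hypothetical conceptual distances between terms (assumed values).
# A real system would derive them from a lexicon.
TERM_DISTANCE = {
    frozenset(["search", "find"]): 1,
    frozenset(["file", "directory"]): 2,
}


def term_closeness(a: str, b: str) -> float:
    """Return a value in [0, 1]: 1.0 for identical terms, 0.0 if unrelated."""
    if a == b:
        return 1.0
    distance = TERM_DISTANCE.get(frozenset([a, b]))
    return 0.0 if distance is None else 1.0 / (1.0 + distance)


def closeness(query: Frame, description: Frame) -> float:
    """Average term closeness over the semantic cases present in the query."""
    if not query:
        return 0.0
    total = sum(
        term_closeness(value, description[case])
        for case, value in query.items()
        if case in description
    )
    return total / len(query)


def retrieve(query: Frame, base: Dict[str, Frame],
             threshold: float) -> List[Tuple[str, float]]:
    """Score every component and keep those strictly above the threshold."""
    scored = [(name, closeness(query, frame)) for name, frame in base.items()]
    candidates = [(name, score) for name, score in scored if score > threshold]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

With a base that holds a frame for a searching component and a frame for a listing component, a query such as {"Action": "find", "Location": "file", "Goal": "string"} and a threshold of 0.5 would keep only the searching component under these assumed distances.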
As a first step the system deals with imperative sentences for both queries and software component descriptions. Imperative sentences describe simple actions that are performed by a software component and perhaps the object manipulated by the action, the manner by which the action is performed and other semantic information related to the action.
The case system basically consists of a sequence of one or more semantic cases. Semantic cases are associated with some syntactic components of an imperative sentence. An imperative sentence consists of a verb (representing an action), possibly followed by a noun phrase (representing the direct object of the action) and perhaps some embedded prepositional phrases. For instance, the sentence `search a file for a string' consists of the verb `search', in the infinitive form, followed by the noun phrase `a file', which represents the object manipulated by the action, and by the prepositional phrase `for a string', which represents the goal of the `search' action. In the example, the semantic cases `Action', `Location' and `Goal' are associated with the verb, the direct object and the prepositional phrase of the sentence, respectively.
Semantic cases show how noun phrases are semantically related to the verb in a sentence. For instance, in the sentence `search a file for a string', the semantic case `Goal' associated with the prepositional phrase `for a string' shows that `a string' is the target of the action `search'. We have defined a basic set of semantic cases for software descriptions by analyzing the short descriptions of Unix commands in manual pages. These semantic cases basically describe the functionality of the component (the action, the target of the action, the medium or location, the mode by which the action is performed, etc.).
A semantic case consists of a case generator (possibly omitted) followed by a nominal or verbal phrase. A case generator reveals the presence of a particular semantic case in a sentence. Case generators are mainly prepositions. For instance, in the sentence `search a file for a string', the preposition `for' in the prepositional phrase `for a string' suggests the `Goal' semantic case.
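The role of case generators can be sketched as a simple table lookup over the words of an imperative sentence. This is only an illustration, not the parser of the system: only the pair `for' and `Goal' comes from the example in the text, the other table entries are assumptions, and the function name `split_imperative` and the label `DirectObject` are hypothetical. The sketch leaves the direct object without a semantic case, since choosing one (such as `Location' in the example) requires knowledge this sketch does not model.

```python
from typing import Dict, List, Tuple

# Hypothetical case-generator table. Only "for" -> "Goal" comes from the
# example in the text; the other entries are illustrative assumptions.
CASE_GENERATORS: Dict[str, str] = {
    "for": "Goal",
    "from": "Source",
    "to": "Destination",
    "with": "Instrument",
}


def split_imperative(sentence: str) -> List[Tuple[str, str]]:
    """Split an imperative sentence into (label, phrase) pairs.

    The first word is taken as the verb and labelled "Action". Words before
    the first case generator form the direct object, labelled "DirectObject"
    because this sketch does not decide its semantic case. Each case
    generator starts a new phrase labelled with the case it suggests.
    """
    words = sentence.lower().split()
    if not words:
        return []
    result = [("Action", words[0])]
    label, phrase = "DirectObject", []
    for word in words[1:]:
        if word in CASE_GENERATORS:
            if phrase:
                result.append((label, " ".join(phrase)))
            label, phrase = CASE_GENERATORS[word], []
        else:
            phrase.append(word)
    if phrase:
        result.append((label, " ".join(phrase)))
    return result


print(split_imperative("search a file for a string"))
# [('Action', 'search'), ('DirectObject', 'a file'), ('Goal', 'a string')]
```

A purely lexical lookup like this cannot resolve prepositions that suggest more than one case, which is one reason the system relies on grammar rules and lexicon information rather than on a table alone.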
The purpose of morpholexical analysis is to process the individual words in a sentence to recognize their standard forms, their grammatical categories and their semantic relationships with other words in a lexicon. Two semantic relations between terms are currently considered: synonymy and hyponymy/hypernymy.
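A minimal sketch of this step follows, using a tiny hand-made lexicon in place of the public domain lexicons. All entries, the inflection table and the names `LexEntry`, `analyze`, `are_synonyms` and `is_hyponym_of` are invented for illustration and are not taken from Webster or WordNet.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class LexEntry:
    lemma: str                      # standard form of the word
    category: str                   # grammatical category
    synonyms: Set[str] = field(default_factory=set)
    hypernym: Optional[str] = None  # more general term, if any


# Toy lexicon with invented entries (assumptions, not real lexicon data).
LEXICON: Dict[str, LexEntry] = {
    "search": LexEntry("search", "verb", {"seek"}),
    "seek": LexEntry("seek", "verb", {"search"}),
    "file": LexEntry("file", "noun", hypernym="document"),
    "document": LexEntry("document", "noun", hypernym="object"),
    "object": LexEntry("object", "noun"),
}

# Toy table mapping inflected forms to standard forms.
INFLECTIONS: Dict[str, str] = {
    "searches": "search",
    "searching": "search",
    "files": "file",
}


def analyze(word: str) -> Optional[LexEntry]:
    """Return the lexicon entry for a word, or None if it is unknown."""
    lowered = word.lower()
    return LEXICON.get(INFLECTIONS.get(lowered, lowered))


def are_synonyms(a: str, b: str) -> bool:
    """True if both words are known and one lists the other as a synonym."""
    ea, eb = analyze(a), analyze(b)
    if ea is None or eb is None:
        return False
    return eb.lemma in ea.synonyms or ea.lemma in eb.synonyms


def is_hyponym_of(word: str, ancestor: str) -> bool:
    """True if `ancestor` is reachable from `word` through hypernym links."""
    entry = analyze(word)
    seen: Set[str] = set()
    while entry is not None and entry.hypernym and entry.lemma not in seen:
        seen.add(entry.lemma)
        if entry.hypernym == ancestor:
            return True
        entry = LEXICON.get(entry.hypernym)
    return False
```

Under these toy entries, `analyze("Files")` yields the entry with standard form `file` and category `noun`, and `is_hyponym_of("files", "object")` follows two hypernym links.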
Just after morpholexical analysis, both syntactic and semantic analysis of software descriptions are performed interactively by using a definite clause grammar. The defined grammar implements a subset of the grammar rules for imperative sentences in English. The grammar supports the case system and states domain-independent knowledge of the English language through a set of syntactic and semantic rules.
A set of semantic structures is generated as a result of the parsing process, representing the internal structures of software descriptions. A language for modelling these semantic structures is shown below.
Case_frame --> FRAME Frame_name Hierarchical_link CASES Case_list
Hierarchical_link --> IS_A Frame_name | IS_A_KIND_OF Frame_name
Case_list --> Case (Case_list)
Case --> Case_name Facet
Case_name --> Semantic_case | Other_case
Semantic_case --> Action | Agent | Comparison | Condition | Destination | Duration | Goal | Instrument | Location | Manner | Purpose | Source | Time
Other_case --> Modifier | Head | Adjective_modifier | Participle_modifier | Noun_modifier
Facet --> VALUE Value | DOMAIN Frame_name | CATEGORY Lexical_category
Value --> string | Frame_name
Lexical_category --> verb | adj | noun | adv | component_id | string
The language defines a frame-like classification scheme for software components based on the defined semantic cases. The classification scheme consists of a hierarchical structure of generic frames (`IS-A-KIND-OF' relationship). Frames that are instances of these generic frames (`IS-A' relationship) implement the indexing units of software descriptions.
The generic frames model the semantic structures associated with verb phrases and noun phrases, as well as the information associated with software components, such as name, description, source code and executable examples.
Semantic cases are represented as slots in the frames. `Facets' are associated with each slot in a frame: the `value' facet gives either the value of the case or the name of the frame where the value is instantiated, the `domain' facet gives the type of the frame that describes the internal structure of the case, and the `category' facet gives the lexical category of the case. For instance, the `Location' slot in the verb phrase frame has a `domain' facet indicating that its constituents are described in a frame of type `noun phrase'.
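The slot and facet structure can be sketched with plain data classes. The representation below is an assumption about how such frames might be stored, not the internal format of the system: the frame names `vp1`, `np1` and `np2`, the type names `verb_phrase` and `noun_phrase`, and the helper `head_of` are hypothetical. It encodes the running example `search a file for a string'.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Facets:
    value: Optional[str] = None     # a string, or the name of another frame
    domain: Optional[str] = None    # type of the frame describing the slot
    category: Optional[str] = None  # lexical category of the case


@dataclass
class Frame:
    name: str
    link: str    # "IS_A" for instances, "IS_A_KIND_OF" for generic frames
    parent: str  # name of the frame this one is linked to
    cases: Dict[str, Facets] = field(default_factory=dict)


def build_example() -> Dict[str, Frame]:
    """Frames for `search a file for a string' (hypothetical frame names)."""
    frames = [
        Frame("np1", "IS_A", "noun_phrase",
              {"Head": Facets(value="file", category="noun")}),
        Frame("np2", "IS_A", "noun_phrase",
              {"Head": Facets(value="string", category="noun")}),
        Frame("vp1", "IS_A", "verb_phrase", {
            "Action": Facets(value="search", category="verb"),
            "Location": Facets(value="np1", domain="noun_phrase"),
            "Goal": Facets(value="np2", domain="noun_phrase"),
        }),
    ]
    return {frame.name: frame for frame in frames}


def head_of(frames: Dict[str, Frame], frame_name: str,
            case: str) -> Optional[str]:
    """Return the term filling a case, following a `domain' facet if present."""
    slot = frames[frame_name].cases.get(case)
    if slot is None:
        return None
    if slot.domain is None:
        return slot.value  # the value is stored directly in the slot
    nested = frames.get(slot.value)  # the value names another frame
    if nested is None:
        return None
    head = nested.cases.get("Head")
    return head.value if head is not None else None
```

Here `head_of(build_example(), "vp1", "Location")` follows the `domain' facet from the verb phrase frame into `np1` and returns its head term, which mirrors the `Location' example in the text.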
Through the parsing process, the interpretation mechanism maps the verb, the direct object and each prepositional phrase in a sentence into a semantic case, based on both syntactic features and identified case generators.