5 Similarity analysis

An analysis mechanism similar to the one applied to software descriptions maps the user's query (a verbal or nominal phrase) to a frame-like internal representation (Figure 1). This representation is then compared with those associated with software descriptions in the Knowledge Base. A similarity analysis is performed, and matching components are scored according to their closeness to the user query.

Two retrieval mechanisms are proposed: a main retrieval mechanism, based on the similarities of the semantic structures, and a complementary mechanism, based on the matching of the noun phrases in the semantic structures.

The main retrieval mechanism provides a good level of precision by reducing the number of irrelevant components that are retrieved. Retrieval is based on the detection of similar semantic structures, i.e. structures that share the same semantic cases and in which the terms in the shared semantic cases are lexically related.

Precision is controlled in two ways: by basing the similarity analysis on the syntactic and semantic information available in the internal representations of both the query and the software descriptions, and by scoring retrieved candidates against a threshold for the required similarity. The threshold controls the number of retrieved documents and thus the level of precision. Recall is increased by allowing partial matching of the semantic structures and by considering synonyms, hyponyms and hypernyms of the terms in the semantic cases.
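
The effect of the threshold can be illustrated with a minimal sketch. The component names, the scores and the function name `retrieve` are hypothetical and serve only to show how raising the threshold shrinks the retrieved set.

```python
def retrieve(scores, threshold):
    """Keep components scoring at or above the threshold, best first."""
    hits = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical similarity scores for three components.
scores = {"merge_files": 0.55, "sort_list": 0.9, "print_report": 0.2}

loose = retrieve(scores, threshold=0.5)   # two candidates survive
strict = retrieve(scores, threshold=0.8)  # only the closest one survives
```

A higher threshold returns fewer candidates, which is how the threshold trades recall for precision in this setting.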

The complementary retrieval mechanism is based on simple matching of the noun phrases identified in both the query and the software descriptions. The approach is independent of the semantic formalism used in the main retrieval mechanism, but it uses both the instance frames associated with noun phrases and the closeness measures defined for the main retrieval mechanism. This mechanism has been defined for comparative experimental work against the main retrieval mechanism: to evaluate the improvement in precision expected from processing semantic information, relative to processing purely syntactic information through the matching of noun phrases.

5.1 Similarity computation

Definition 1 Similarity between software descriptions
Similarity analysis is performed by comparing the internal representation of the query with those of the software component descriptions in the Knowledge Base and computing a similarity measure (Figure 1).

The function SCase (Fq, Fd) below measures the similarity between the frame Fq (the internal representation of the user query Q) and the frame Fd (the internal representation of a description D of a software component).

The measure is based both on the number of semantic cases that have a match and on the closeness between the terms in the matched cases. Semantic cases are weighted by a factor (wj) that represents the relative importance of a semantic case in describing the functionality of the component. Closeness between semantic cases is a function of the conceptual distance between the terms in the semantic cases, according to their `domain' or `category' facet (single terms, noun phrases or verb phrases).

\[ \mathrm{SCase}(F_q, F_d) = \sum_{j \in SC_q} w_j \cdot \mathrm{sc\_closeness}(c_{qj}, c_{dj}) \]

with,

\[ \sum_{j \in SC_q} w_j = 1 \]

where,

sc_closeness(cqj, cdj) =

SCq = {j| j is a semantic case in the frame of Q}
cqj = the term of the semantic case `j' in the frame of Q
cdj = the term of the semantic case `j' in the frame of D
wj = the weight of the semantic case `j'
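
The weighted measure described above can be sketched as follows. This is only an illustration under stated assumptions: frames are modelled as plain dictionaries from case name to term, the case names (`action`, `object`, `location`) and the weights are hypothetical, a case missing from the description is assumed to contribute nothing, and `exact_match` is a stand-in for sc_closeness, whose definition is not reproduced here.

```python
def scase(query_frame, desc_frame, weights, sc_closeness):
    """Weighted sum of closeness values over the semantic cases of the query."""
    total = 0.0
    for case, query_term in query_frame.items():
        if case in desc_frame:  # assumption: unmatched cases add nothing
            total += weights[case] * sc_closeness(query_term, desc_frame[case])
    return total


def exact_match(x, y):
    """Stand-in closeness: 1 for identical terms, 0 otherwise (assumption)."""
    return 1.0 if x == y else 0.0


# Hypothetical frames; the weights over the query's cases sum to 1.
query = {"action": "sort", "object": "file"}
description = {"action": "sort", "object": "list", "location": "memory"}
weights = {"action": 0.5, "object": 0.5}

score = scase(query, description, weights, exact_match)
```

Because the weights sum to 1 and each closeness value lies in [0, 1], the score also lies in [0, 1], which makes a single similarity threshold meaningful across queries.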

5.1.1 Closeness between single terms

Definition 2 Conceptual distance between single terms
The function dist(x,y) measures the conceptual distance between the single terms or collocations `x' and `y', by considering the distance of the terms in a lexicon, according to the lexical relations of synonymy, hyponymy and hypernymy described in section 4 (see example in Table 1).

Definition 3 Closeness between single terms
The function closeness(x,y) measures the closeness between the single terms or collocations `x' and `y'. The function decreases as the conceptual distance between the terms in Definition 2 grows (see example in Table 1).

\[ \mathrm{closeness}(x, y) = v^{\mathrm{dist}(x, y)} \quad \text{with } 0 < v < 1 \]
Table 1 - The distance and closeness values between some single terms or collocations and the collocation 'personal computer'

y                     dist   closeness (v = 0.5)   remarks
'PC'                  0      1                     synonym('personal computer', 'PC')
'desktop computer'    1      0.5                   hyponym('personal computer', 'desktop computer', 1)
'laptop'              2      0.25                  hyponym('personal computer', 'laptop', 2)
'digital computer'    1      0.5                   hypernym('personal computer', 'digital computer', 1)
'computer'            2      0.25                  hypernym('personal computer', 'computer', 2)
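
The closeness values in Table 1 can be reproduced with a short sketch. It reads the closeness formula as v raised to the power dist(x, y), which is the reading that matches the table. The distances are copied from Table 1 rather than computed from a lexicon, and the names `DIST_TO_PERSONAL_COMPUTER` and `closeness_from_distance` are illustrative.

```python
# Distances to 'personal computer', copied from Table 1 (not computed).
DIST_TO_PERSONAL_COMPUTER = {
    "PC": 0,
    "desktop computer": 1,
    "laptop": 2,
    "digital computer": 1,
    "computer": 2,
}


def closeness_from_distance(distance, v=0.5):
    """Closeness as v ** distance, with 0 < v < 1."""
    if not 0 < v < 1:
        raise ValueError("v must lie strictly between 0 and 1")
    return v ** distance


table = {
    term: closeness_from_distance(d)
    for term, d in DIST_TO_PERSONAL_COMPUTER.items()
}
```

With v = 0.5 each extra step in the lexicon halves the closeness, so synonyms score 1 and terms two steps away score 0.25.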

5.1.2 Closeness between noun phrases

An instance of a noun phrase frame `x' consists of a head `hx' (the main noun in the phrase) with a list `Mx' of zero, one or more modifiers (adjectives, nouns, participles).

Two assumptions are considered to compute the conceptual distance between noun phrases:

Definition 4 Conceptual distance between noun phrases
The function np_dist (x,y) (Figure 6) computes the conceptual distance between two noun phrase frames `x' and `y' as the sum of the conceptual distance between the head nouns and the conceptual distance between the lists of modifiers.

np_dist (x,y) = dist (hx, hy) + ml_dist (Mx, My)

where,

x, y - noun phrase instance frames
hx, hy - the head noun in frames x and y, respectively
Mx, My - the list of modifiers of the frames x and y, respectively.
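
A noun phrase frame and the sum in Definition 4 might be modelled as below. This is a sketch only: the class name `NounPhrase` is hypothetical, and `toy_dist` and `toy_ml_dist` are crude placeholders for dist and ml_dist, included so that the example runs on its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NounPhrase:
    """A noun phrase frame: a head noun plus zero or more modifiers."""
    head: str
    modifiers: tuple = ()


def np_dist(x, y, dist, ml_dist):
    """Head-noun distance plus modifier-list distance (Definition 4)."""
    return dist(x.head, y.head) + ml_dist(x.modifiers, y.modifiers)


def toy_dist(a, b):
    """Placeholder for dist: 0 for identical terms, 1 otherwise."""
    return 0 if a == b else 1


def toy_ml_dist(mx, my):
    """Placeholder for ml_dist: counts modifiers not shared by both lists."""
    return len(set(mx) ^ set(my))


x = NounPhrase("program", ("computer",))
y = NounPhrase("program")
d = np_dist(x, y, toy_dist, toy_ml_dist)  # identical heads, one extra modifier
```

Passing the two distance functions as arguments keeps the sum independent of how the lexicon and the modifier matching are implemented.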

Definition 5 Conceptual distance between modifiers
The matrix M_DIST(Mx,My), m x m, where m = max [length(Mx), length(My)], defines the conceptual distance between pairs of modifiers (mxi, myj) from the lists Mx and My.

The conceptual distance between modifiers is computed by taking the minimum of either the single term distance between the modifiers, according to the measure in Definition 2, or a distance value of `2', according to the assumption that a modifier specializes a head noun. As illustrated in Figure 8, a maximum distance of `2' between the lists mx1 hx and my1 hx is obtained by considering that my1 hx is a specialization of hx and that hx is a generalization of mx1 hx. For instance,

m_dist(computer, computer) = min (0, 2) = 0
m_dist(PC, computer) = min (1, 2) = 1
m_dist(person, computer) = min (∞, 2) = 2

When the lists of modifiers have different lengths, the matrix is completed with distance values of `1', reflecting that each modifier in the longer list can specialize the shorter one. For instance, the `1' distance values in the matrix M_DIST(mx1, my1 my2) in Figure 8 show that either mx1 my1 hx or mx1 my2 hx should be considered a specialization of mx1 hx.
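
The construction of the matrix in Definition 5 can be sketched as follows. The lexicon is a toy stand-in: only the PC/computer distance from the example above is recorded, and unrelated terms are assumed to be infinitely distant, so the cap of 2 applies to them. The names `TOY_DIST`, `dist` and `m_dist_matrix` are illustrative.

```python
import math

# Toy lexicon distance (assumption): only one related pair is known.
TOY_DIST = {frozenset(("PC", "computer")): 1}


def dist(x, y):
    """Single-term distance; unrelated terms are infinitely far apart."""
    if x == y:
        return 0
    return TOY_DIST.get(frozenset((x, y)), math.inf)


def m_dist_matrix(mx, my, dist):
    """Square modifier-distance matrix, padded with 1 for missing modifiers."""
    m = max(len(mx), len(my))
    matrix = [[1] * m for _ in range(m)]  # padding entries stay at 1
    for i, a in enumerate(mx):
        for j, b in enumerate(my):
            matrix[i][j] = min(dist(a, b), 2)  # a modifier pair is at most 2 apart
    return matrix


same = m_dist_matrix(["computer"], ["computer"], dist)
uneven = m_dist_matrix(["PC"], ["computer", "person"], dist)
```

In `uneven` the first row holds the real modifier distances and the second row is padding, since the first list has only one modifier.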

Definition 6 Conceptual distance between lists of modifiers
The function ml_dist (Mx, My) computes the conceptual distance between the lists of modifiers Mx = mx1 mx2 ... mx.length(Mx) and My = my1 my2 ... my.length(My) (Figure 7). The distance is computed by considering all the sets built with pairs of different modifiers from Mx and My, and taking the minimum sum of the distances between the pairs of modifiers for all those sets. Figure 8 illustrates an example of the computation of the conceptual distance between the lists of modifiers mx1 mx2 and my1 my2, as the minimum value of the sums of the elements of the main and secondary diagonal of the associated distance matrix.
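
Taking the minimum over all one-to-one pairings of modifiers amounts to a small assignment problem. The sketch below solves it by brute force over permutations, which is adequate for the short modifier lists of typical noun phrases. It assumes the square distance matrix of Definition 5 has already been built, and the example matrices are made-up values.

```python
from itertools import permutations


def ml_dist(matrix):
    """Minimum total distance over all one-to-one pairings of modifiers."""
    m = len(matrix)
    # Each permutation p pairs row i with column p[i]; an empty matrix
    # yields the single empty pairing, whose sum is 0.
    return min(
        sum(matrix[i][p[i]] for i in range(m))
        for p in permutations(range(m))
    )


# For a 2 x 2 matrix the candidates are the two diagonals.
main_diagonal_wins = ml_dist([[0, 2], [2, 1]])       # min(0 + 1, 2 + 2)
secondary_diagonal_wins = ml_dist([[2, 0], [0, 2]])  # min(2 + 2, 0 + 0)
```

The brute-force search grows factorially with the number of modifiers; for long lists a dedicated assignment algorithm would be a better choice.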

Definition 7 Closeness between noun phrases
The function np_closeness (x,y) measures the closeness between two noun phrase frames `x' and `y'. The function decreases as the conceptual distance between noun phrase frames in Definition 4 grows. Closeness can be computed either with modifiers taken into account (u < 1) or with modifiers discarded (u = 1), so that the relative effect of modifiers on effectiveness results can be evaluated (see example in Table 2).

\[ \mathrm{np\_closeness}(x, y) = u^{\mathrm{ml\_dist}(M_x, M_y)} \cdot \mathrm{closeness}(h_x, h_y) \quad \text{with } 0 < u \le 1 \]

Table 2 - The closeness values between some noun phrases and the noun phrase 'computer program'

x                      np_cl. (u = 1, v = 0.5)   np_cl. (u = 0.5, v = 0.8)   remarks
'computer program'     1                         1
'program'              1                         0.5
'machine program'      1                         0.5                         hyponym('machine', 'computer', 1)
'PC software'          0.5                       0.4                         hypernym('PC', 'computer', 1), hyponym('software', 'program', 1)
'conference program'   1                         0.25
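
The values in Table 2 can be checked with a self-contained sketch that combines Definitions 3 to 7. It reads np_closeness as u raised to the power ml_dist(Mx, My), multiplied by v raised to the power dist(hx, hy). The lexicon is a toy one built from the remarks column of Table 2, unrelated terms are assumed to be infinitely distant, and noun phrases are modelled as (head, modifiers) pairs. All identifiers are illustrative.

```python
import math
from itertools import permutations

# Toy lexicon distances taken from the remarks column of Table 2.
TOY_DIST = {
    frozenset(("machine", "computer")): 1,
    frozenset(("PC", "computer")): 1,
    frozenset(("software", "program")): 1,
}


def dist(x, y):
    """Single-term distance; unrelated terms are infinitely far apart."""
    if x == y:
        return 0
    return TOY_DIST.get(frozenset((x, y)), math.inf)


def ml_dist(mx, my):
    """Cheapest one-to-one pairing in the padded modifier-distance matrix."""
    m = max(len(mx), len(my))
    matrix = [[1] * m for _ in range(m)]  # padding entries stay at 1
    for i, a in enumerate(mx):
        for j, b in enumerate(my):
            matrix[i][j] = min(dist(a, b), 2)
    return min(
        sum(matrix[i][p[i]] for i in range(m))
        for p in permutations(range(m))
    )


def np_closeness(x, y, u, v):
    """u ** ml_dist(Mx, My) * v ** dist(hx, hy) for (head, modifiers) pairs."""
    (hx, mx), (hy, my) = x, y
    return (u ** ml_dist(mx, my)) * (v ** dist(hx, hy))


target = ("program", ["computer"])
phrases = {
    "computer program": ("program", ["computer"]),
    "program": ("program", []),
    "machine program": ("program", ["machine"]),
    "PC software": ("software", ["PC"]),
    "conference program": ("program", ["conference"]),
}
with_modifiers = {k: np_closeness(p, target, 0.5, 0.8) for k, p in phrases.items()}
without_modifiers = {k: np_closeness(p, target, 1, 0.5) for k, p in phrases.items()}
```

With u = 1 the modifier term is always 1, so only the head nouns matter and 'conference program' scores as high as 'computer program'; with u = 0.5 the unrelated modifier lowers its closeness to 0.25.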

5.2 Noun phrase similarity

Definition 8 Similarity between descriptions by considering syntactic phrases
The function SPhrase (NPq, NPd) measures the similarity between the set of noun phrases NPq in the frame Fq, associated with a user query Q, and the set of noun phrases NPd in the frame Fd, associated with a description D of a software component.

The similarity measure uses the closeness function in Definition 7 to compute the maximum closeness between each noun phrase in the query and the noun phrases in a software description. The average closeness value for all noun phrases in the query is then computed.

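\[ \mathrm{SPhrase}(NP_q, NP_d) = \frac{1}{|NP_q|} \sum_{x \in NP_q} \max_{y \in NP_d} \mathrm{np\_closeness}(x, y) \]

where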

NPq = list of noun phrase instance frames of Q
NPd = list of noun phrase instance frames of D
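
The average of best matches described in Definition 8 can be sketched as below. The closeness function is passed in as a parameter; `toy_closeness` is a lookup-table stand-in for np_closeness with made-up values, and returning 0 for an empty query or description is an assumption, since that case is not covered by the definition.

```python
def sphrase(np_query, np_desc, np_closeness):
    """Average, over query noun phrases, of the best closeness in the description."""
    if not np_query or not np_desc:
        return 0.0  # assumption: nothing to match gives zero similarity
    best = [max(np_closeness(x, y) for y in np_desc) for x in np_query]
    return sum(best) / len(best)


# Stand-in closeness values (illustrative only).
TOY_CLOSENESS = {
    ("computer program", "program"): 0.5,
    ("computer program", "PC software"): 0.4,
}


def toy_closeness(x, y):
    """1 for identical phrases, a table value if known, 0 otherwise."""
    if x == y:
        return 1.0
    return TOY_CLOSENESS.get((x, y), 0.0)


query_phrases = ["computer program", "text file"]
description_phrases = ["program", "PC software", "text file"]
similarity = sphrase(query_phrases, description_phrases, toy_closeness)
```

Here 'computer program' is best matched by 'program' (0.5) and 'text file' matches itself (1.0), so the average is 0.75. The measure is asymmetric: it averages over the query's noun phrases only, so extra noun phrases in the description do not lower the score.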

6 Experimental comparison


This is a section of a local copy of the paper A Similarity Measure for Retrieving Software Artifacts by M. R. Girardi and B. Ibrahim.
