A Similarity Measure for Retrieving Software Artifacts

5 Similarity analysis

An analysis mechanism similar to the one applied to software descriptions maps the user's query (a verbal or nominal phrase) to a frame-like internal representation (Figure 1). This representation is then compared with those associated with the software descriptions in the Knowledge Base. A similarity analysis is performed, and matching components are scored according to their closeness to the user query.

Two retrieval mechanisms are proposed: a main mechanism, based on the similarity of the semantic structures, and a complementary mechanism, based on the matching of the noun phrases in the semantic structures.

The main retrieval mechanism provides a good level of precision by reducing the number of irrelevant components that are retrieved. Retrieval is based on the detection of similar semantic structures, i.e. structures that share the same semantic cases and in which there is some lexical relationship between the terms in the shared semantic cases.

Precision is controlled in two ways: by basing the similarity analysis on the syntactic and semantic information available in the internal representations of both the query and the software descriptions, and by scoring retrieved candidates and establishing a threshold for the required similarity. The threshold controls the number of retrieved documents and thus the level of precision. Recall is increased by allowing partial matching of the semantic structures and by considering synonyms, hyponyms and hypernyms of the terms in the semantic cases.
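
The thresholding step described above can be sketched as follows. This is only an illustration: the function name `retrieve`, the component names and the scores are hypothetical, and treating the threshold as inclusive is an assumption rather than something the text specifies.

```python
from typing import Dict, List, Tuple


def retrieve(scores: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Keep the candidates whose similarity is at least `threshold`, best first."""
    hits = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical similarity scores for three component descriptions.
scores = {"sort_file": 0.9, "merge_lists": 0.55, "print_report": 0.2}

strict = retrieve(scores, threshold=0.8)   # fewer candidates, higher precision
lenient = retrieve(scores, threshold=0.5)  # more candidates, higher recall
```

Raising the threshold shrinks the result list, which is how the threshold trades recall for precision.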

The complementary retrieval mechanism is based on the simple matching of noun phrases identified in both the query and the software descriptions. The approach is independent of the semantic formalism used in the main retrieval mechanism, but it uses both the instance frames associated with noun phrases and the closeness measures defined for the main retrieval mechanism. This mechanism is intended for comparative experiments against the main retrieval mechanism: it serves to evaluate the improvement in precision expected from processing semantic information, relative to processing purely syntactic information through the matching of noun phrases.

5.1 Similarity computation

Definition 1 Similarity between software descriptions
Similarity analysis is performed by comparing the internal representation of the query with those of the software component descriptions in the Knowledge Base and computing a similarity measure (Figure 1).

The function S_Case (F_q, F_d) below measures the similarity between the frame F_q (the internal representation of the user query Q) and the frame F_d (the internal representation of a description D of a software component).

The measure is based both on the number of semantic cases that have a match and on the closeness between the terms in the matched cases. Semantic cases are weighted by a factor (w_j) that represents the relative importance of a semantic case in describing the functionality of the component. Closeness between semantic cases is a function of the conceptual distance between the terms in the semantic cases, according to their `domain' or `category' facet (single terms, noun phrases or verb phrases).

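S_Case (F_q, F_d) = \sum_{j \in SC_q} w_j . sc_closeness(c_qj, c_dj)

with,

\sum_{j \in SC_q} w_j = 1

where,

SC_q = {j | j is a semantic case in the frame of Q}
c_qj = the term of the semantic case `j' in the frame of Q
c_dj = the term of the semantic case `j' in the frame of D
w_j = the weight of the semantic case `j'
sc_closeness(c_qj, c_dj) = the closeness between the terms c_qj and c_dj, computed according to their `domain' or `category' facet (single terms, noun phrases or verb phrases)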
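
A minimal sketch of this measure, reading it as a weighted sum over the semantic cases of the query. Several things here are assumptions made for illustration: frames are modelled as dictionaries from case names to terms, the case names (`action`, `object`, `location`) and weights are hypothetical, a case missing from the description contributes nothing (partial matching), and the closeness function is passed in as a parameter, with a trivial exact-match stand-in used in the example.

```python
from typing import Callable, Dict


def s_case(
    frame_q: Dict[str, str],
    frame_d: Dict[str, str],
    weights: Dict[str, float],
    sc_closeness: Callable[[str, str], float],
) -> float:
    """Weighted sum of closeness values over the semantic cases of the query."""
    if abs(sum(weights[case] for case in frame_q) - 1.0) > 1e-9:
        raise ValueError("weights of the query's semantic cases must sum to 1")
    total = 0.0
    for case, term_q in frame_q.items():
        if case in frame_d:  # assumption: an unmatched case contributes 0
            total += weights[case] * sc_closeness(term_q, frame_d[case])
    return total


def exact_match(a: str, b: str) -> float:
    """Stand-in closeness: 1 for identical terms, 0 otherwise."""
    return 1.0 if a == b else 0.0


# Hypothetical frames and weights.
frame_q = {"action": "sort", "object": "file"}
frame_d = {"action": "sort", "object": "list", "location": "memory"}
weights = {"action": 0.6, "object": 0.4}

score = s_case(frame_q, frame_d, weights, exact_match)
```

In this example only the `action` case matches, so the score equals the weight of that case.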

5.1.1 Closeness between single terms

Definition 2 Conceptual distance between single terms
The function dist(x,y) measures the conceptual distance between the single terms or collocations `x' and `y', by considering the distance of the terms in a lexicon, according to the lexical relations of synonymy, hyponymy and hypernymy described in section 4 (see example in Table 1).

Definition 3 Closeness between single terms
The function closeness(x,y) measures the closeness between the single terms or collocations `x' and `y'. Closeness decreases exponentially as the conceptual distance between the terms (Definition 2) grows (see example in Table 1).

closeness(x,y) = v^{dist(x,y)} with 0 < v < 1

**Table 1** - The distance and closeness value between some single terms or collocations and the collocation 'personal computer'

| y | dist | closeness (v = 0.5) | remarks |
|---|---|---|---|
| 'PC' | 0 | 1 | synonym('personal computer', 'PC') |
| 'desktop computer' | 1 | 0.5 | hyponym('personal computer', 'desktop computer', 1) |
| 'laptop' | 2 | 0.25 | hyponym('personal computer', 'laptop', 2) |
| 'digital computer' | 1 | 0.5 | hypernym('personal computer', 'digital computer', 1) |
| 'computer' | 2 | 0.25 | hypernym('personal computer', 'computer', 2) |
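
The closeness values in Table 1 can be reproduced with a short sketch. The distances are copied from the table rather than computed from a lexicon, and representing unrelated terms by an infinite distance (which yields a closeness of 0) is an assumption made here for illustration.

```python
import math


def closeness(distance: float, v: float = 0.5) -> float:
    """Closeness v ** distance between two single terms, with 0 < v < 1."""
    if not 0 < v < 1:
        raise ValueError("v must satisfy 0 < v < 1")
    return v ** distance


# Distances from 'personal computer', copied from Table 1.
TABLE_1_DIST = {
    "PC": 0,
    "desktop computer": 1,
    "laptop": 2,
    "digital computer": 1,
    "computer": 2,
}

table_1_closeness = {term: closeness(d) for term, d in TABLE_1_DIST.items()}

# Assumption: unrelated terms are given an infinite distance.
unrelated = closeness(math.inf)
```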

5.1.2 Closeness between noun phrases

An instance of a noun phrase frame `x' consists of a head `h_x' (the main noun in the phrase) and a list `M_x' of zero or more modifiers (adjectives, nouns, participles).

Two assumptions are considered to compute the conceptual distance between noun phrases:

a list of modifiers specializes a head noun and the length of the list gives the distance between the head and the noun phrase (Figure 5);
the distance between noun phrases is the distance between their head nouns increased by the distance between the two lists of modifiers (Figure 6).

Definition 4 Conceptual distance between noun phrases
The function np_dist (x,y) (Figure 6) computes the conceptual distance between two noun phrase frames `x' and `y' as the sum of the conceptual distance between the head nouns and the conceptual distance between the lists of modifiers.

np_dist (x,y) = dist (h_x, h_y) + ml_dist (M_x, M_y)

where,

x, y - noun phrase instance frames
h_x, h_y - the head noun in frames x and y, respectively
M_x, M_y - the list of modifiers of the frames x and y, respectively.

Definition 5 Conceptual distance between modifiers
The matrix M_DIST(M_x,M_y), of size m x m, where m = max [length(M_x), length(M_y)], defines the conceptual distance between pairs of modifiers (m_xi, m_yj) from the lists M_x and M_y.

The conceptual distance between two modifiers is the minimum of the single-term distance between them (Definition 2) and a distance of `2', which follows from the assumption that a modifier specializes a head noun. As illustrated in Figure 8, a maximum distance of `2' between m_x1 h_x and m_y1 h_x is obtained by considering that m_y1 h_x is a specialization of h_x and that h_x is a generalization of m_x1 h_x. For instance,

m_dist(computer,computer) = min (0,2) = 0; m_dist(PC,computer) = min (1,2) = 1; m_dist(person,computer) = min (\infty,2) = 2

When the lists of modifiers have different lengths, the matrix is completed with distance values of `1', reflecting that each modifier in the longer list can specialize the other one. For instance, the `1' distance values in the matrix M_DIST(m_x1, m_y1 m_y2) in Figure 8 show that either m_x1 m_y1 h_x or m_x1 m_y2 h_x should be considered as a specialization of m_x1 h_x.
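
The construction of the modifier distance matrix can be sketched as below. The toy lexicon `TOY_DIST` is hypothetical and holds only the relation used in the examples above; giving unrelated terms an infinite single-term distance is also an assumption, and the function names are illustrative.

```python
import math
from typing import List

# Hypothetical toy lexicon: only the relation used in the text's examples.
TOY_DIST = {frozenset(("PC", "computer")): 1}


def dist(x: str, y: str) -> float:
    """Single-term distance; unrelated terms are assumed infinitely distant."""
    if x == y:
        return 0
    return TOY_DIST.get(frozenset((x, y)), math.inf)


def m_dist(x: str, y: str) -> float:
    """Distance between two modifiers, capped at 2."""
    return min(dist(x, y), 2)


def m_dist_matrix(mods_x: List[str], mods_y: List[str]) -> List[List[float]]:
    """Square m x m matrix; cells without a modifier pair are filled with 1."""
    m = max(len(mods_x), len(mods_y))
    matrix: List[List[float]] = [[1] * m for _ in range(m)]
    for i, mx in enumerate(mods_x):
        for j, my in enumerate(mods_y):
            matrix[i][j] = m_dist(mx, my)
    return matrix


# One modifier against two: the second row is padding.
example = m_dist_matrix(["PC"], ["computer", "person"])
```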

Definition 6 Conceptual distance between lists of modifiers
The function ml_dist (M_x, M_y) computes the conceptual distance between the lists of modifiers M_x = m_x1 m_x2 ... m_{x.length(M_x)} and M_y = m_y1 m_y2 ... m_{y.length(M_y)} (Figure 7). The distance is computed by considering all the sets built with pairs of different modifiers from M_x and M_y, and taking the minimum sum of the distances between the pairs of modifiers for all those sets. Figure 8 illustrates an example of the computation of the conceptual distance between the lists of modifiers m_x1 m_x2 and m_y1 m_y2, as the minimum value of the sums of the elements of the main and secondary diagonal of the associated distance matrix.
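
One way to read this definition is as a minimum-cost pairing over the square distance matrix: every way of pairing each row with a distinct column is tried and the smallest total is kept. The sketch below does this by brute force over permutations, which is an implementation choice made here for illustration and is only practical for short modifier lists. The example matrix is hypothetical.

```python
from itertools import permutations
from typing import List


def ml_dist(matrix: List[List[float]]) -> float:
    """Minimum, over all one-to-one pairings of rows and columns, of the summed distances."""
    m = len(matrix)
    # For m == 0 the only permutation is empty, so the distance is 0.
    return min(
        sum(matrix[i][cols[i]] for i in range(m))
        for cols in permutations(range(m))
    )


# Hypothetical 2 x 2 matrix: main diagonal sums to 1, secondary diagonal to 4.
two_by_two = [[0, 2], [2, 1]]
best = ml_dist(two_by_two)
```

For a 2 x 2 matrix the two permutations correspond exactly to the main and the secondary diagonal, matching the computation illustrated in Figure 8. For longer lists an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would avoid the factorial cost.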

Definition 7 Closeness between noun phrases
The function np_closeness (x,y) measures the closeness between two noun phrase frames `x' and `y'. Closeness decreases as the conceptual distance between the noun phrase frames (Definition 4) grows. It can be computed either considering modifiers or discarding them (u = 1), to evaluate the relative effect of modifiers on retrieval effectiveness (see example in Table 2).

np_closeness (x,y) = u^{ml_dist(M_x,M_y)} . closeness (h_x, h_y) with 0 < u \leq 1

**Table 2** - The closeness value between some noun phrases and the noun phrase 'computer program'

| x | np_cl. (u = 1, v = 0.5) | np_cl. (u = 0.5, v = 0.8) | remarks |
|---|---|---|---|
| 'computer program' | 1 | 1 | |
| 'program' | 1 | 0.5 | |
| 'machine program' | 1 | 0.5 | hyponym('machine', 'computer', 1) |
| 'PC software' | 0.5 | 0.4 | hypernym('PC', 'computer', 1), hyponym('software', 'program', 1) |
| 'conference program' | 1 | 0.25 | |
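
The values in Table 2 can be checked with a small sketch that reads the closeness as u raised to the modifier-list distance, multiplied by the single-term closeness of the heads. The representation of a noun phrase as a `(modifiers, head)` pair, the toy lexicon (limited to the relations listed in the remarks column) and the infinite distance for unrelated terms are all assumptions made for illustration.

```python
import math
from itertools import permutations
from typing import List, Tuple

NounPhrase = Tuple[List[str], str]  # (modifiers, head)

# Hypothetical toy lexicon: only the relations from the remarks of Table 2.
LEXICON = {
    frozenset(("machine", "computer")): 1,
    frozenset(("PC", "computer")): 1,
    frozenset(("software", "program")): 1,
}


def dist(x: str, y: str) -> float:
    """Single-term distance; unrelated terms are assumed infinitely distant."""
    return 0 if x == y else LEXICON.get(frozenset((x, y)), math.inf)


def ml_dist(mods_x: List[str], mods_y: List[str]) -> float:
    """Minimum-cost pairing over the padded modifier distance matrix."""
    m = max(len(mods_x), len(mods_y))
    matrix = [[1] * m for _ in range(m)]  # padding cells hold distance 1
    for i, mx in enumerate(mods_x):
        for j, my in enumerate(mods_y):
            matrix[i][j] = min(dist(mx, my), 2)  # modifier distance capped at 2
    return min(
        sum(matrix[i][cols[i]] for i in range(m))
        for cols in permutations(range(m))
    )


def np_closeness(x: NounPhrase, y: NounPhrase, u: float, v: float) -> float:
    """u ** ml_dist(modifiers) * v ** dist(heads); u = 1 discards modifiers."""
    (mods_x, head_x), (mods_y, head_y) = x, y
    return u ** ml_dist(mods_x, mods_y) * v ** dist(head_x, head_y)


reference: NounPhrase = (["computer"], "program")
rows = {
    "computer program": (["computer"], "program"),
    "program": ([], "program"),
    "machine program": (["machine"], "program"),
    "PC software": (["PC"], "software"),
    "conference program": (["conference"], "program"),
}
without_modifiers = {k: np_closeness(p, reference, u=1, v=0.5) for k, p in rows.items()}
with_modifiers = {k: np_closeness(p, reference, u=0.5, v=0.8) for k, p in rows.items()}
```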

5.2 Noun phrase similarity

Definition 8 Similarity between descriptions by considering syntactic phrases
The function S_Phrase (NP_q, NP_d) measures the similarity between the set of noun phrases NP_q in the frame F_q, associated with a user query Q, and the set of noun phrases NP_d in the frame F_d, associated with a description D of a software component.

The similarity measure uses the closeness function in Definition 7 to compute the maximum closeness between each noun phrase in the query and the noun phrases in a software description. The average closeness value for all noun phrases in the query is then computed.

S_Phrase (NP_q, NP_d) = (1 / |NP_q|) . \sum_{x \in NP_q} \max_{y \in NP_d} np_closeness (x, y)

where,

NP_q = list of noun phrase instance frames of Q
NP_d = list of noun phrase instance frames of D
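
The measure described above, the average over the query's noun phrases of the best closeness found in the description, can be sketched as follows. The closeness function is passed in as a parameter. The stand-in used in the example looks values up in a small table whose two entries are taken from Table 2 (u = 0.5, v = 0.8); scoring every unlisted pair as 0 and returning 0 for empty lists are simplifying assumptions.

```python
from typing import Callable, List


def s_phrase(
    np_q: List[str],
    np_d: List[str],
    np_closeness: Callable[[str, str], float],
) -> float:
    """Average, over the query's noun phrases, of the best closeness in the description."""
    if not np_q or not np_d:
        return 0.0  # assumption: nothing to compare means no similarity
    best = [max(np_closeness(x, y) for y in np_d) for x in np_q]
    return sum(best) / len(np_q)


# Stand-in closeness: two values from Table 2 (u = 0.5, v = 0.8).
KNOWN = {
    ("computer program", "PC software"): 0.4,
    ("computer program", "program"): 0.5,
}


def table_closeness(x: str, y: str) -> float:
    """Identical phrases score 1; unlisted pairs score 0 (a simplification)."""
    return 1.0 if x == y else KNOWN.get((x, y), 0.0)


query_nps = ["computer program", "file"]
description_nps = ["PC software", "program", "file"]
similarity = s_phrase(query_nps, description_nps, table_closeness)
```

Here 'computer program' is best matched by 'program' (0.5) and 'file' by itself (1), so the average is 0.75.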

6 Experimental comparison

This is a section of a local copy of the paper A Similarity Measure for Retrieving Software Artifacts by M. R. Girardi and B. Ibrahim.
