본문 바로가기

Lucene

[Lucene] similiarity simple java code

 static double cosine_similarity(Map<String, Double> v1, Map<String, Double> v2) {
           
Set<String> both = Sets.newHashSet(v1.keySet());
            both
.retainAll(v2.keySet());
           
double sclar = 0, norm1 = 0, norm2 = 0;
           
for (String k : both) sclar += v1.get(k) * v2.get(k);
           
for (String k : v1.keySet()) norm1 += v1.get(k) * v1.get(k);
           
for (String k : v2.keySet()) norm2 += v2.get(k) * v2.get(k);
           
return sclar / Math.sqrt(norm1 * norm2);
   
}

http://stackoverflow.com/questions/1997750/cosine-similarity/1997775#1997775 
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

http://comscience.tistory.com/6
 
--------------------------------------------------------------------------------------

Have a look at: http://en.wikipedia.org/wiki/Cosine_similarity.

If you have vectors A and B.

The similarity is defined as:

cosine(theta) = A . B / ||A|| ||B||

For a vector A = (a1, a2), ||A|| is defined as sqrt(a1^2 + a2^2)

For vector A = (a1, a2) and B = (b1, b2), A . B is defined as a1 b1 + a2 b2;

So for vector A = (a1, a2) and B = (b1, b2), the cosine similarity is given as:

(a1 b1 + a2 b2) / sqrt(a1^2 + a2^2) sqrt(b1^2 + b2^2)

Example:

A = (1, 0.5), B = (0.5, 1)

cosine(theta) = (0.5 + 0.5) / sqrt(5/4) sqrt(5/4) = 4/5