Lucene

lucene tf-idf weights


I used lucene-2.4.0 to get tf-idf. I'm not sure whether the newer versions
have direct methods for getting tf-idf values as well.
This is lengthy, but it might help.

   // Get a TermEnum over all the terms in the index (freader is a FilterIndexReader)
   TermEnum e = freader.terms();

   // find total number of docs
   int noOfDocs = freader.numDocs();

   // Get TermDocs Enum
   TermDocs td = freader.termDocs();

   // seeking through all the terms
   while(e.next()){

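     // get the term
     Term t = e.term();
     // only look at the contents field; skip terms from other fields so that
     // e.docFreq() below refers to the same term we seek to
     if (!"contents".equals(t.field())) continue;
     Term term = new Term("contents",t.text());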
     // Move td to the <document, frequency> pairs for the term t
     td.seek(term);


     // loop through each document containing the term.
     while(td.next())  {
       // cast to double: noOfDocs/e.docFreq() would otherwise be integer division
       double weight = td.freq()*Math.log((double) noOfDocs/e.docFreq());
     // do something with the weight
     //......

    }
}

   // release the enumerations when done
   td.close();
   e.close();
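
In case it helps to check the numbers without an index, here is a minimal sketch of the same weight (term frequency times the natural log of total docs over doc frequency) using plain Java collections. This is not Lucene code: the class name TfIdfSketch, the helper methods, and the three toy documents are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; none of these names come from Lucene.
public class TfIdfSketch {

    // weight = tf * ln(N / df); the cast avoids integer division
    static double weight(int termFreq, int numDocs, int docFreq) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    // Count, for each term, the number of documents that contain it at least once.
    static Map<String, Integer> docFreqs(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // a HashSet so a term repeated within one document is counted once
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        // toy "index": each inner list is one already-tokenized document
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("lucene", "index", "lucene"),
            Arrays.asList("index", "search"),
            Arrays.asList("search", "query"));

        Map<String, Integer> df = docFreqs(docs);

        for (int d = 0; d < docs.size(); d++) {
            // term frequencies within document d
            Map<String, Integer> tf = new HashMap<>();
            for (String term : docs.get(d)) {
                tf.merge(term, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> entry : tf.entrySet()) {
                double w = weight(entry.getValue(), docs.size(), df.get(entry.getKey()));
                System.out.printf("doc=%d term=%s weight=%.4f%n", d, entry.getKey(), w);
            }
        }
    }
}
```

A term that appears in every document gets weight 0 here, since ln(1) = 0. That is expected with this plain formula.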