Gerard Sychay wrote:
Hmm, the term vector does not have to consist of only term frequencies,
does it? To give weight to rare terms, could you create a term vector of
(TF*IDF) values for each term? Then, a distance function would measure
how many terms two vectors have in common, giving weight to how many
rare terms two vectors have in common.
Yeah, but if you're gonna do that why not just form a query with all
words in the source document, and let the Lucene engine do the idf/tf
calculations? I've done this and it seems to work fine.
Here's code I've used. It could be done better by avoiding QueryParser,
and on a long document odds are it will hit the exception for too many
clauses in a boolean query unless you raise Lucene's default limit
(1024 clauses), but this is the idea. srch is the entire body of the
source document.
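    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    // Builds one query out of every token the analyzer produces from srch.
    public static Query formSimilarQuery(String srch, Analyzer a)
        throws org.apache.lucene.queryParser.ParseException, IOException
    {
        StringBuffer sb = new StringBuffer();
        // "foo" is only a placeholder field name for the analyzer.
        TokenStream ts = a.tokenStream("foo", new StringReader(srch));
        Token t;
        while ((t = ts.next()) != null)
        {
            sb.append(t.termText()).append(' ');
        }
        // DFields.CONTENTS is a constant from my own code: the field to search.
        return QueryParser.parse(sb.toString(), DFields.CONTENTS, a);
    }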
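One possible way to stay under the clause limit mentioned above, sketched in plain Java with no Lucene calls: count the analyzed tokens, keep only the most frequent distinct ones, and pass that shorter string to the parser instead of the whole document. The class name QueryTerms, the method topTerms, and the cap are all hypothetical, and the tokens are assumed to have come out of the analyzer already. Note that collapsing repeats drops any extra weight a term would get from appearing in the query several times.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class QueryTerms {
    // Returns at most maxTerms distinct tokens, most frequent first,
    // joined by single spaces.
    static String topTerms(List<String> tokens, int maxTerms) {
        // LinkedHashMap remembers first-seen order, which keeps ties deterministic.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so terms with equal counts stay in first-seen order.
        entries.sort((x, y) -> Integer.compare(y.getValue(), x.getValue()));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < entries.size() && i < maxTerms; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(entries.get(i).getKey());
        }
        return sb.toString();
    }
}
```

The result would replace sb.toString() in the method above, with maxTerms chosen comfortably below whatever clause limit is configured.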
David Spencer [EMAIL PROTECTED] 06/01/04 08:25PM
Erik Hatcher wrote:
On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote:
Well, a question again, how does Lucene compute the score between a
document and a query?
And I might add, thus, this approach to similarity gives more weight to
rare terms that match, which one might want for this kind of similarity
measure.
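To make the rare-term weighting concrete, here is a small stand-alone sketch of the TF*IDF vector idea raised at the top of the thread. It is not Lucene's scoring code: the class TfIdfSketch, its method names, and the smoothed idf formula (1 + ln(numDocs / (docFreq + 1))) are assumptions chosen for illustration, and the document frequencies are supplied by hand rather than read from an index.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; all names here are hypothetical.
class TfIdfSketch {
    // Smoothed inverse document frequency: larger for terms found in fewer documents.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Builds a sparse vector of term -> (term frequency * idf) for one document.
    static Map<String, Double> weights(List<String> tokens,
                                       Map<String, Integer> docFreq,
                                       int numDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // Terms missing from docFreq are treated as appearing in no document.
            int df = docFreq.getOrDefault(e.getKey(), 0);
            w.put(e.getKey(), e.getValue() * idf(df, numDocs));
        }
        return w;
    }

    // Cosine similarity of two sparse vectors; 0.0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With these weights, a document that shares one rare term with the source scores higher than a document that shares one common term, which is the behaviour described above.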
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]