Re: similarity of two texts - another question

2004-06-02 Thread Gerard Sychay
Hmm, the term vector does not have to consist of only term frequencies,
does it? To give weight to rare terms, could you create a term vector of
(TF*IDF) values for each term?  Then, a distance function would measure
how many terms two vectors have in common, giving weight to how many
rare terms two vectors have in common.
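
A minimal sketch of that idea in plain Java, with no Lucene involved: each document becomes a map of term to TF*IDF weight, and two documents are compared by cosine similarity, so a shared rare term counts for more than a shared common one. The class and method names are hypothetical, and the smoothed idf, ln((N + 1) / (df + 1)) + 1, is just one common variant chosen here to avoid division by zero.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {

    // Weight each term of doc by termFrequency * idf, where
    // idf = ln((N + 1) / (df + 1)) + 1, N is the corpus size and
    // df is the number of corpus documents containing the term.
    static Map<String, Double> tfIdfVector(List<String> doc, List<List<String>> corpus) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String term : doc) {
            Integer old = tf.get(term);
            tf.put(term, old == null ? 1 : old + 1);
        }
        Map<String, Double> vector = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = 0;
            for (List<String> other : corpus) {
                if (other.contains(e.getKey())) {
                    df++;
                }
            }
            double idf = Math.log((corpus.size() + 1.0) / (df + 1.0)) + 1.0;
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    // Cosine similarity: dot product over the shared terms, divided by
    // the product of the two vector lengths. Returns 0 for an empty vector.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

With a toy corpus, two documents that share a less common word come out closer than two that share only a word found in every document.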

David Spencer [EMAIL PROTECTED] 06/01/04 08:25PM
Erik Hatcher wrote:
On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote:
Well, a question again, how does Lucene compute the score between a
document and a query?

And I might add, thus, this approach to similarity gives more weight to
rare terms that match, which one might want for this kind of similarity
measure.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: similarity of two texts - another question

2004-06-02 Thread David Spencer
Gerard Sychay wrote:
Hmm, the term vector does not have to consist of only term frequencies,
does it? To give weight to rare terms, could you create a term vector of
(TF*IDF) values for each term?  Then, a distance function would measure
how many terms two vectors have in common, giving weight to how many
rare terms two vectors have in common.

Yeah, but if you're going to do that, why not just form a query with all
the words in the source document and let the Lucene engine do the tf/idf
calculations? I've done this and it seems to work fine.

Here's code I've used. It could be done better by avoiding QueryParser,
and odds are it will hit the too-many-clauses exception for a boolean
query on a long document unless you raise Lucene's default limit, but
this is the idea. srch is the entire body of the source document.

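import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Builds a query from every token the analyzer produces for srch.
public static Query formSimilarQuery( String srch, Analyzer a)
    throws org.apache.lucene.queryParser.ParseException, IOException
{
    StringBuffer sb = new StringBuffer();
    // "foo" is only a placeholder field name for the analyzer.
    TokenStream ts = a.tokenStream( "foo", new StringReader( srch));
    Token t;
    while ( (t = ts.next()) != null)
    {
        sb.append( t.termText() + " ");   // space-separated terms
    }
    // DFields.CONTENTS is my own constant for the field to search.
    return QueryParser.parse( sb.toString(), DFields.CONTENTS, a);
}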
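
One way to stay under the clause limit mentioned above is to trim the token list before it reaches QueryParser. The following is only a sketch under assumptions: the class and method names are hypothetical, it uses no Lucene classes, and 1024 is the default BooleanQuery clause limit as far as I know. It keeps each distinct token once, in first-seen order, and stops at the cap.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class QueryTextSketch {

    // Assumed default for Lucene's BooleanQuery clause limit.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    // Joins the distinct tokens with spaces, keeping first-seen order
    // and emitting at most maxClauses of them.
    static String buildQueryText(List<String> tokens, int maxClauses) {
        Set<String> unique = new LinkedHashSet<String>(tokens);
        StringBuilder sb = new StringBuilder();
        int used = 0;
        for (String term : unique) {
            if (used == maxClauses) {
                break;
            }
            if (used > 0) {
                sb.append(' ');
            }
            sb.append(term);
            used++;
        }
        return sb.toString();
    }
}
```

Note the trade-off: dropping repeated tokens also drops whatever extra weight a repeated term would have carried in the query, and tokens containing query-syntax characters would still need escaping before parsing.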




Re: similarity of two texts - another question

2004-06-01 Thread David Spencer
Erik Hatcher wrote:
On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote:
Well, a question again, how does Lucene compute the score between a  
document and a query?

And I might add, thus, this approach to similarity gives more weight to 
rare terms that match, which one might want for this kind of similarity 
measure.

Using the equation here:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
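
To make the "more weight to rare terms" point concrete, here is a sketch of the two per-term factors as I understand Lucene's default Similarity implementation to compute them: tf as the square root of the term's frequency in the document, and idf as ln(numDocs / (docFreq + 1)) + 1. The class name is hypothetical and the code is standalone Java rather than the Lucene class itself; the full formula on the linked page has further factors (coord, length normalization, boosts, query normalization) that are left out here.

```java
final class DefaultSimilaritySketch {

    // Term-frequency factor: grows with the square root of the number
    // of times the term occurs in the document.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Inverse-document-frequency factor: the fewer documents contain
    // the term (docFreq), the larger the value.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}
```

In an index of 1000 documents, a term found in a single document gets a much larger idf than one found in 500 of them, so a match on the rare term moves the score more.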

