Hi,

I've done several projects where term frequency yields bad result sets and 
worse relevancy. These projects all had one thing in common: user-generated 
content with a competitive edge, i.e. classifieds web sites such as eBay. 
The internet is similar: it contains edited content, classifieds, and spam 
or other garbage. 

What do you do with tf in your internet-wide index? Do you impose a threshold 
or are you emitting 1.0f for each match?
For now I emit 1.0f for each match and rely on matches in multiple fields with 
varying boosts, plus various other methods, to improve relevancy. 
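
To make the options above concrete, here is a minimal Python sketch of the 
three tf variants I mean: square-root damping (which, as far as I know, is 
what Lucene's classic TF-IDF similarity uses), a thresholded tf, and a 
constant 1.0 per match. The function names, the cap of 3, the idf formula and 
the collection statistics are all made up for illustration; this is not Lucene 
or Solr code, and it ignores length normalization and field boosts.

```python
import math


def tf_sqrt(freq: int) -> float:
    """Square-root damping of the raw term frequency."""
    return math.sqrt(freq)


def tf_capped(freq: int, cap: int = 3) -> float:
    """Thresholded tf: occurrences beyond `cap` (an assumed value) add nothing."""
    return math.sqrt(min(freq, cap))


def tf_constant(freq: int) -> float:
    """Constant tf: 1.0 for any match, 0.0 for no match."""
    return 1.0 if freq > 0 else 0.0


def idf(doc_count: int, doc_freq: int) -> float:
    """A simple smoothed idf, assumed here only to complete the tf*idf product."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


def score(tokens, term, tf_fn, doc_count=1000, doc_freq=50):
    """tf*idf of a single term in a single document (hypothetical statistics)."""
    return tf_fn(tokens.count(term)) * idf(doc_count, doc_freq)


# One honest classified ad and one keyword-stuffed ad (toy data).
honest = "iphone 5 for sale good condition".split()
stuffed = ("iphone " * 25 + "for sale").split()

for name, fn in [("sqrt", tf_sqrt), ("capped", tf_capped), ("constant", tf_constant)]:
    ratio = score(stuffed, "iphone", fn) / score(honest, "iphone", fn)
    print(f"{name}: stuffed ad scores {ratio:.2f}x the honest ad")
```

With these toy documents the stuffed ad gets 5x the score of the honest one 
under square-root tf, roughly 1.7x with the cap, and the same score with 
constant tf, which is why I currently lean on the last option.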

Can tf*idf cope with unedited (and untrusted) documents at all? I've seen 
great relevancy with good content but really bad relevancy in several cases.

Thanks!
