Hi, I've worked on several projects where term frequency produced bad result sets and even worse relevancy. These projects all had one thing in common: user-generated content with a competitive edge, meaning classifieds web sites such as eBay. The internet is similar: it contains edited content, classifieds, and spam or other garbage.
What do you do with tf in your wide internet index? Do you impose a threshold, or do you emit 1.0f for each match? For now I emit 1.0f for each match and rely on matches in multiple fields with varying boosts, along with various other methods, to improve relevancy. Can tf*idf cope with non-edited (and untrusted) documents at all? I've seen great relevancy with good content but really bad relevancy in several other cases. Thanks!
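
To make the options in the question concrete, here is a minimal Python sketch of three ways to compute the tf factor: square-root damping, a constant 1.0 for any match, and a thresholded (capped) variant. This is only an illustration, not code from any search library. The function names, the cap value of 3, and the example numbers are assumptions made up for this sketch, and the idf shape is a simplified version of the classic 1 + ln(numDocs / (docFreq + 1)) form.

```python
import math

def tf_sqrt(freq):
    """Square-root damping: repeated terms still raise the score."""
    return math.sqrt(freq)

def tf_constant(freq):
    """Emit 1.0 for any match, ignoring how often the term occurs."""
    return 1.0 if freq > 0 else 0.0

def tf_capped(freq, cap=3):
    """Thresholded tf: occurrences beyond `cap` (an assumed value) add nothing."""
    return math.sqrt(min(freq, cap))

def idf(doc_freq, num_docs):
    """Simplified inverse document frequency (assumed form for this sketch)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))

def score(freq, doc_freq, num_docs, tf=tf_sqrt):
    """Single-term tf*idf score with a pluggable tf function."""
    return tf(freq) * idf(doc_freq, num_docs)

if __name__ == "__main__":
    # Hypothetical case: a keyword-stuffed listing repeats the term 100 times,
    # an ordinary listing mentions it twice.
    for name, fn in [("sqrt", tf_sqrt), ("constant", tf_constant), ("capped", tf_capped)]:
        stuffed = score(100, doc_freq=1000, num_docs=1_000_000, tf=fn)
        ordinary = score(2, doc_freq=1000, num_docs=1_000_000, tf=fn)
        print(name, round(stuffed / ordinary, 2))
```

With the constant variant, keyword stuffing gains nothing from tf, which is why ranking then has to come from field boosts and other signals. The capped variant is a middle ground that keeps a small reward for a term appearing more than once.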

