On Wednesday 27 July 2011 18:31:57 lewis john mcgibbney wrote:
> Hi Markus,
>
> I am getting you until the last parts of your comments.
>
> "cope with non-edited..." edited by whom? and for what purpose? To give a
> better relative tf score...

With edited content I mean content written by editors and other persons creating proper content.

> To comment on the first part, and please ignore or correct me if I am
> wrong, but do we not give each page and therefore each document an initial
> score of 1.0 which is then subsequently used by whichever scoring
> algorithm we plug in? If this is the case then how are we specifying score
> for a page and tf of some term with a document or tf-idf of that term over
> the entire document collection to determine relevance? How can we
> accurately disambiguate between these entities?

Link score is only a small part of the math. It's multiplied with tf, idf, norms, boosts, functions etc.

> As I said I'm losing you towards the end however it would be good
> discussion to explore behind the surface architecture.
>
> On Mon, Jul 25, 2011 at 10:23 PM, Markus Jelsma
> <[email protected]> wrote:
> > Hi,
> >
> > I've done several projects where term frequency yields bad result sets
> > and worse relevancy. These projects all had one similarity:
> > user-generated content with a competitive edge. The latter means
> > classifieds web sites such as eBay etc. The internet is something
> > similar. It contains edited content, classifieds and spam or other
> > garbage.
> >
> > What do you do with tf in your wide internet index? Do you impose a
> > threshold or are you emitting 1.0f for each match?
> > For now I emit 1.0f for each match and rely on matches in multiple fields
> > with varying boosts to improve relevancy and various other methods.
> >
> > Can tf*idf cope with non-edited (and untrusted) documents at all? I've
> > seen great relevancy with good content but really bad relevance in
> > several cases.
> >
> > Thanks!

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
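
To make the difference between a regular tf and "emitting 1.0f for each match" concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the actual Nutch/Solr/Lucene scoring code: the function names, the sample numbers and the simplified product (link score * tf * idf * boost, without norms, coord or query normalization) are all made up for the example. The sqrt(freq) tf and the 1 + ln(numDocs / (docFreq + 1)) idf follow the shape of Lucene's classic default similarity.

```python
import math


def classic_tf(freq: int) -> float:
    """Classic-style tf: grows with the square root of the term count."""
    return math.sqrt(freq)


def flat_tf(freq: int) -> float:
    """Flat tf: emit 1.0 for any match, however often the term repeats."""
    return 1.0 if freq > 0 else 0.0


def idf(num_docs: int, doc_freq: int) -> float:
    """Classic-style idf: rarer terms get a higher weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(freq, num_docs, doc_freq, tf_fn, link_score=1.0, boost=1.0):
    """Hypothetical, simplified score: the link score is just one factor
    that is multiplied with tf, idf and a field boost."""
    return link_score * tf_fn(freq) * idf(num_docs, doc_freq) * boost


# Assumed sample data: an edited page mentions the term 3 times,
# a keyword-stuffed classified repeats it 100 times.
NUM_DOCS, DOC_FREQ = 1000, 9

edited = score(3, NUM_DOCS, DOC_FREQ, classic_tf)
stuffed = score(100, NUM_DOCS, DOC_FREQ, classic_tf)

edited_flat = score(3, NUM_DOCS, DOC_FREQ, flat_tf)
stuffed_flat = score(100, NUM_DOCS, DOC_FREQ, flat_tf)

print("classic tf:", edited, stuffed)
print("flat tf:   ", edited_flat, stuffed_flat)
```

With the classic tf the stuffed document wins purely by repetition; with the flat tf both documents tie on this factor, so field boosts and the link score have to do the ranking.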

