Re: REPOST from another list: Question related to improving search results

2009-05-02 Thread Vaijanathrao
Hi Aditya, You can use any HTML parser if you are getting/crawling a page from Wikipedia, and ignore those sections which are repetitive. If you are using the Jericho parser, here is what you can do. URL u = new URL("any english wikipedia page"); Source src = new Source(u.openConnecti
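
The snippet above is cut off in the archive, so the Jericho code is not reconstructed here. As a rough, library-free illustration of the same goal (dropping text that repeats on every page before it is indexed), the sketch below swaps in a different technique: it counts how many documents each line appears in and discards lines that show up in more than a chosen fraction of them. The class name `RepeatedLineFilter`, the method `stripRepeatedLines`, the 0.5 threshold, and the sample pages are all made up for illustration; it assumes the pages have already been reduced to plain text, one block per line.

```java
import java.util.*;

public class RepeatedLineFilter {

    // Returns the documents with every line removed that occurs in more than
    // maxDocFraction of all documents (hypothetical helper, not a Jericho or Lucene API).
    static List<String> stripRepeatedLines(List<String> docs, double maxDocFraction) {
        // First pass: document frequency of each distinct, trimmed line.
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String line : doc.split("\n")) {
                String key = line.trim();
                if (!key.isEmpty() && seen.add(key)) {
                    docFreq.merge(key, 1, Integer::sum);
                }
            }
        }
        // Second pass: keep only lines that are rare enough across the collection.
        List<String> cleaned = new ArrayList<>();
        for (String doc : docs) {
            StringBuilder sb = new StringBuilder();
            for (String line : doc.split("\n")) {
                String key = line.trim();
                if (key.isEmpty()) {
                    continue;
                }
                double fraction = docFreq.get(key) / (double) docs.size();
                if (fraction <= maxDocFraction) {
                    if (sb.length() > 0) {
                        sb.append('\n');
                    }
                    sb.append(key);
                }
            }
            cleaned.add(sb.toString());
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Made-up sample pages: first and last lines stand in for shared page chrome.
        List<String> pages = Arrays.asList(
            "Navigation\nLucene is a search library\nRetrieved from Wikipedia",
            "Navigation\nJava is a programming language\nRetrieved from Wikipedia",
            "Navigation\nA payload is per-position metadata\nRetrieved from Wikipedia");
        for (String page : stripRepeatedLines(pages, 0.5)) {
            System.out.println(page);
        }
    }
}
```

With an HTML parser you would more likely drop the repetitive elements by tag or id before extracting text; the frequency approach is only a fallback for when the repeated sections are not easy to identify structurally.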

Re: REPOST from another list: Question related to improving search results

2009-05-02 Thread Michael McCandless
Why not remove that content from every doc during indexing? Or, if that's too harsh, you could massively reduce the score for hits in that section: e.g., during indexing, store payloads on the term occurrences that fall within the common section, and then use BoostingTermQuery to down-weight those hit
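
To make the down-weighting idea concrete, here is a toy sketch in plain Java. It does not use Lucene: the per-occurrence boolean flag stands in for the payload, and the `score` method stands in for what a payload-aware query would do at search time. The names `SectionWeightedScorer`, `Occurrence`, and `score`, and the weight 0.1, are assumptions chosen for illustration only.

```java
import java.util.*;

public class SectionWeightedScorer {

    // One term occurrence; inCommonSection plays the role of the payload
    // that would be attached to the position at index time.
    static final class Occurrence {
        final String term;
        final boolean inCommonSection;

        Occurrence(String term, boolean inCommonSection) {
            this.term = term;
            this.inCommonSection = inCommonSection;
        }
    }

    // Sums one point per matching occurrence, scaled by commonSectionWeight
    // when the occurrence lies in the shared (repetitive) section.
    static double score(String queryTerm, List<Occurrence> doc, double commonSectionWeight) {
        double total = 0.0;
        for (Occurrence occ : doc) {
            if (occ.term.equals(queryTerm)) {
                total += occ.inCommonSection ? commonSectionWeight : 1.0;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up document: "wikipedia" appears once in the body and twice in the shared footer.
        List<Occurrence> doc = Arrays.asList(
            new Occurrence("wikipedia", false),
            new Occurrence("search", false),
            new Occurrence("wikipedia", true),
            new Occurrence("wikipedia", true));
        System.out.println(score("wikipedia", doc, 0.1));
        System.out.println(score("search", doc, 0.1));
    }
}
```

Setting the weight to 0 is equivalent to the first suggestion (removing the common content entirely), so the weight gives a dial between "ignore" and "count normally".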