Hi Aditya,
You can use any HTML parser if you are fetching/crawling a page from Wikipedia,
and ignore the sections that are repetitive.
If you are using the Jericho parser, here is what you can do:
URL u = new URL("any english wikipedia page");
Source src = new Source(u.openConnecti
Why not remove that content from every doc during indexing?
Or, if that's too harsh, you could massively reduce the score for hits
in that section: e.g., during indexing, store payloads on the term
occurrences that fall within the common section, and then use
BoostingTermQuery to down-weight those hit
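
For the first suggestion (removing the shared content from every doc before indexing), here is a rough sketch in plain Java, independent of Jericho or Lucene. It assumes each page has already been split into paragraphs, and it treats any paragraph that shows up in more than a given fraction of the pages as "common". The class name CommonSectionFilter, the method stripCommonParagraphs and the threshold are made up for illustration; the thread does not say how the repeated sections should be detected, so take this as one possible approach rather than the recommended one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: drops paragraphs that repeat across many documents.
class CommonSectionFilter {

    // Returns a copy of each document without the paragraphs that occur
    // in more than maxDocFraction of all documents.
    static List<List<String>> stripCommonParagraphs(List<List<String>> docs,
                                                    double maxDocFraction) {
        // Document frequency: in how many documents does each paragraph occur?
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // Count a paragraph at most once per document.
            Set<String> seen = new HashSet<>();
            for (String paragraph : doc) {
                seen.add(paragraph.trim());
            }
            for (String paragraph : seen) {
                docFreq.merge(paragraph, 1, Integer::sum);
            }
        }

        // A paragraph is kept only if its document frequency is within the limit.
        double limit = maxDocFraction * docs.size();
        List<List<String>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            List<String> kept = new ArrayList<>();
            for (String paragraph : doc) {
                if (docFreq.get(paragraph.trim()) <= limit) {
                    kept.add(paragraph);
                }
            }
            result.add(kept);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample pages; the last paragraph is shared by all of them.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("Text unique to page one.", "Shared footer section"),
            Arrays.asList("Text unique to page two.", "Shared footer section"),
            Arrays.asList("Text unique to page three.", "Shared footer section"));

        // Drop anything that appears in more than half of the pages.
        for (List<String> doc : stripCommonParagraphs(docs, 0.5)) {
            System.out.println(doc);
        }
    }
}
```

Only what survives the filter would then be handed to the indexer. Exact string matching is the simplest choice here; if the repeated sections differ slightly from page to page, some normalisation or fuzzy matching would be needed instead.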