Alexander,

We can already remove boilerplate from HTML pages thanks to Boilerpipe in Tika (there is an open issue on JIRA for this). Markus is looking for a way to classify an entire page as content-rich vs. mostly links.

Markus: I don't know of any specific literature on the subject, but determining a ratio of tool words (determiners, conjunctions, etc.) to the size of the text or the number of links sounds like a good approach. I think the new scoring API (see wiki) could also be used / extended for this kind of task.
Jul

On 5 July 2011 06:52, Alexander Aristov <[email protected]> wrote:
> I have successfully used some algorithms which sort out useful text from
> HTML pages.
>
> This page gave me ideas:
> http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
>
> Best Regards
> Alexander Aristov
>
>
> On 4 July 2011 22:55, Markus Jelsma <[email protected]> wrote:
>
> > Hi,
> >
> > Because most of the internet is garbage, I'd like not to index garbage.
> > There is a huge number of pages that consist of just links and almost
> > no text.
> >
> > To filter these pages out I intend to build an indexing filter. The
> > problem is how to detect whether a page should be considered a link
> > page. From what I've seen there should be a distinct ratio between the
> > amount of text and the number of outlinks to the same and other
> > domains.
> >
> > My question: has anyone come across literature on this topic? Or does
> > someone already have such a ratio defined?
> >
> > Thanks!

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
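
A rough sketch of the two heuristics discussed above (tool-word density in the extracted text, and the ratio of text length to the number of outlinks) might look like the following. The word list, the thresholds, and the way the two signals are combined are assumptions made for illustration only; they are not taken from Nutch, Tika, or anything defined in this thread.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of two heuristics for flagging "mostly links" pages:
 * 1) density of tool words (determiners, conjunctions, prepositions) in the text
 * 2) amount of extracted text per outlink
 * Word list and thresholds below are invented for the example.
 */
public class LinkPageHeuristic {

  // Small, hypothetical set of English function words; a real filter would use a fuller list.
  private static final Set<String> TOOL_WORDS = new HashSet<>(Arrays.asList(
      "the", "a", "an", "and", "or", "but", "of", "in", "on", "to", "with", "for", "at", "by"));

  // Fraction of tokens expected to be tool words in running prose (assumed threshold).
  private static final double MIN_TOOL_WORD_RATIO = 0.15;

  // Minimum characters of text per outlink before a page is treated as "mostly links" (assumed).
  private static final double MIN_CHARS_PER_OUTLINK = 50.0;

  /** Ratio of tool words to total tokens in the extracted text. */
  public static double toolWordRatio(String text) {
    String[] tokens = text.toLowerCase().split("\\W+");
    if (tokens.length == 0) {
      return 0.0;
    }
    long toolWords = Arrays.stream(tokens).filter(TOOL_WORDS::contains).count();
    return (double) toolWords / tokens.length;
  }

  /** True if the page looks like a link list rather than content-rich prose. */
  public static boolean isLinkPage(String extractedText, int outlinkCount) {
    double charsPerOutlink = outlinkCount == 0
        ? Double.MAX_VALUE
        : (double) extractedText.length() / outlinkCount;
    // Either signal alone flags the page; combining them differently is a design choice.
    return toolWordRatio(extractedText) < MIN_TOOL_WORD_RATIO
        || charsPerOutlink < MIN_CHARS_PER_OUTLINK;
  }

  public static void main(String[] args) {
    String prose = "The crawler fetched a page and extracted the main text with a parser.";
    String linkFarm = "Home Products Downloads Contact Sitemap Login Register";
    System.out.println(isLinkPage(prose, 1));      // false: enough prose per link
    System.out.println(isLinkPage(linkFarm, 40));  // true: almost no text, many outlinks
  }
}

In a Nutch setup this kind of check would presumably live in a custom indexing filter (or feed into the scoring API mentioned above), with the extracted text coming from the parse and the outlink count from the parse's outlinks; the thresholds would need tuning against real pages.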

