In that article the author extracts the text (with links), splits the whole text into chunks (by lines in the simplest case, or by paragraphs), and then compares each chunk's text with the number of links or the amount of garbage text in it.
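A minimal sketch of that idea, not the article's actual code: it assumes chunks are delimited by a hand-picked set of block-level tags, measures text by character count, and uses a made-up 0.5 threshold. The names `ChunkStats`, `link_density` and `is_link_page` are hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as chunk boundaries (an assumption; other splits are possible).
BLOCK_TAGS = {"p", "div", "li", "td", "br", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose text content is not visible page text.
SKIP_TAGS = {"script", "style"}


class ChunkStats(HTMLParser):
    """Collects (text_chars, anchor_chars, link_count) for each chunk."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._text = self._anchor = self._links = 0
        self._in_anchor = 0
        self._skip = 0

    def _flush(self):
        # Close the current chunk if it holds any text or links.
        if self._text or self._links:
            self.chunks.append((self._text, self._anchor, self._links))
        self._text = self._anchor = self._links = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_anchor += 1
            self._links += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_anchor:
            self._in_anchor -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        n = len(data.strip())
        self._text += n
        if self._in_anchor:
            self._anchor += n

    def close(self):
        super().close()
        self._flush()


def link_density(html):
    """Fraction of visible text characters that sit inside <a> elements."""
    parser = ChunkStats()
    parser.feed(html)
    parser.close()
    text = sum(c[0] for c in parser.chunks)
    anchor = sum(c[1] for c in parser.chunks)
    # A page with no text at all gives nothing to measure; report 0.0.
    return anchor / text if text else 0.0


def is_link_page(html, threshold=0.5):
    """True if link text dominates the page (threshold is an arbitrary guess)."""
    return link_density(html) > threshold
```

The per-chunk tuples in `ChunkStats.chunks` are kept so that individual chunks can be scored as well; `link_density` only sums them for a page-level figure.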
You can take these figures as input and discard a page if the ratio is not good.

Best Regards
Alexander Aristov

On 5 July 2011 12:41, Markus Jelsma <[email protected]> wrote:

> Thanks, both of you.
> I'll do some research on the corpus I have. And Sujit's page is always a
> nice read!
>
> > Alexander,
> >
> > We can already remove boilerplate from HTML pages thanks to Boilerpipe in
> > Tika (there is an open issue on JIRA for this). Markus is looking for a
> > way to classify an entire page as content-rich vs mostly links.
> >
> > Markus: don't know any specific literature on the subject but determining
> > a ratio of tool words (determiners, conjunctions, etc...) vs size of the
> > text or number of links sounds like a good approach. I think that the new
> > scoring API (see wiki) could also be used / extended to do this kind of
> > task
> >
> > Jul
> >
> > On 5 July 2011 06:52, Alexander Aristov <[email protected]> wrote:
> > > I have successfully used some of the algorithms which sort out useful
> > > text from html pages.
> > >
> > > this page gave me ideas.
> > >
> > > http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
> > >
> > > Best Regards
> > > Alexander Aristov
> > >
> > > On 4 July 2011 22:55, Markus Jelsma <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > Because most of the internet is garbage, I'd like not to index
> > > > garbage. There is a huge number of pages that consist of just links
> > > > and almost no text.
> > > >
> > > > To filter these pages out I intend to build an indexing filter. The
> > > > problem is how to detect whether a page is considered a link page.
> > > > From what I've seen there should be a distinct ratio between amount
> > > > of text and number of outlinks to the same and other domains.
> > > >
> > > > My question, has anyone come across literature on this topic? Or
> > > > does someone already have such a ratio defined?
> > > >
> > > > Thanks!

