Alexander,

We can already remove boilerplate from HTML pages thanks to Boilerpipe in
Tika (there is an open issue on JIRA for this). Markus is looking for a way
to classify an entire page as content-rich vs mostly links.
Markus: I don't know of any specific literature on the subject, but determining
a ratio of tool words (determiners, conjunctions, etc.) to the size of the
text or the number of links sounds like a good approach. I think that the new
scoring API (see wiki) could also be used / extended for this kind of task.
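
Something along these lines might work as a first cut. This is only an
untested sketch, assuming Jsoup for the parsing; the tool-word list and the
thresholds are placeholders, not tuned values:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Rough heuristic: flag a page as a "link page" when it has little text per
// outlink and a low share of tool words. Cut-offs are for illustration only.
public class LinkPageHeuristic {

  private static final Set<String> TOOL_WORDS = new HashSet<String>(Arrays.asList(
      "the", "a", "an", "and", "or", "but", "of", "in", "on", "to", "with"));

  public static boolean looksLikeLinkPage(String html) {
    Document doc = Jsoup.parse(html);
    String text = doc.body() != null ? doc.body().text() : "";
    int numLinks = doc.select("a[href]").size();

    String[] tokens = text.toLowerCase().split("\\W+");
    int toolWords = 0;
    for (String t : tokens) {
      if (TOOL_WORDS.contains(t)) toolWords++;
    }

    double textPerLink = numLinks == 0 ? text.length()
                                       : (double) text.length() / numLinks;
    double toolWordRatio = tokens.length == 0 ? 0.0
                                              : (double) toolWords / tokens.length;

    // Mostly-link pages tend to have very little text per anchor and a low
    // fraction of determiners / conjunctions.
    return textPerLink < 25.0 && toolWordRatio < 0.05;
  }
}

In an actual Nutch indexing filter you could probably skip the reparsing and
compute the two numbers from the parse text and outlinks you already have.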

Jul


On 5 July 2011 06:52, Alexander Aristov <[email protected]> wrote:

> I have successfully used some algorithms which sort out useful text from
> HTML pages.
>
> This page gave me ideas:
> http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
>
> Best Regards
> Alexander Aristov
>
>
> On 4 July 2011 22:55, Markus Jelsma <[email protected]> wrote:
>
> > Hi,
> >
> > Because most of the internet is garbage, I'd like not to index garbage.
> > There is a huge number of pages that consist of just links and almost no
> > text.
> >
> > To filter these pages out I intend to build an indexing filter. The
> > problem is how to detect whether a page is considered a link page. From
> > what I've seen there should be a distinct ratio between the amount of
> > text and the number of outlinks to the same and other domains.
> >
> > My question: has anyone come across literature on this topic? Or does
> > someone already have such a ratio defined?
> >
> > Thanks!
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
