The author of that article extracts the text (including links), splits it
into chunks (by line in the simplest case, or by paragraph), and then
compares each chunk against the number of links or the amount of garbage
text it contains.

You can use these figures as input and discard a page when the ratio is
poor, i.e. when too much of the page is links.
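
To make the idea concrete, here is a minimal sketch in Python. It is not the
code from the article. It splits a page into chunks at block-level tags,
measures for each chunk what fraction of its words sit inside <a> elements,
and reports the share of chunks that are mostly links. The names
(LinkTextCounter, link_chunk_ratio, is_link_page) and both thresholds (0.5
and 0.6) are my own assumptions and would need tuning against a real corpus.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Collects (word_count, link_word_count) for each chunk of a page."""

    # Hypothetical choice of tags that start or end a chunk.
    BLOCK_TAGS = {"p", "div", "li", "td", "tr", "br", "ul", "ol",
                  "table", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.chunks = []      # finished chunks: (words, link_words)
        self._cur = [0, 0]    # counters for the chunk being built
        self._in_link = 0     # nesting depth of <a>
        self._skip = 0        # nesting depth of <script>/<style>

    def _flush(self):
        # Close the current chunk; chunks without any words are dropped.
        if self._cur[0]:
            self.chunks.append((self._cur[0], self._cur[1]))
        self._cur = [0, 0]

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        n = len(data.split())
        self._cur[0] += n
        if self._in_link:
            self._cur[1] += n

    def close(self):
        super().close()
        self._flush()


def link_chunk_ratio(html, max_link_density=0.5):
    """Fraction of chunks whose words are mostly link text."""
    parser = LinkTextCounter()
    parser.feed(html)
    parser.close()
    chunks = parser.chunks
    if not chunks:
        return 1.0  # assumption: a page with no text counts as content-free
    linky = sum(1 for words, link_words in chunks
                if link_words / words > max_link_density)
    return linky / len(chunks)


def is_link_page(html, max_linky_chunks=0.6):
    """True when more than max_linky_chunks of the chunks are mostly links."""
    return link_chunk_ratio(html) > max_linky_chunks
```

An indexing filter could call is_link_page() on the raw HTML and skip the
document when it returns True. Counting words rather than characters is just
one option; the same structure works with character counts or with the
number of outlinks per chunk.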

Best Regards
Alexander Aristov


On 5 July 2011 12:41, Markus Jelsma <[email protected]> wrote:

> Thanks, both of you.
> I'll do some research on the corpus I have. And Sujit's page is always a
> nice read!
>
> > Alexander,
> >
> > We can already remove boilerplate from HTML pages thanks to Boilerpipe in
> > Tika (there is an open issue on JIRA for this). Markus is looking for a
> > way to classify an entire page as content-rich vs mostly links.
> > Markus: I don't know any specific literature on the subject, but
> > determining a ratio of tool words (determiners, conjunctions, etc.) vs
> > size of the text or number of links sounds like a good approach. I think
> > that the new scoring API (see wiki) could also be used / extended to do
> > this kind of task.
> >
> > Jul
> >
> > On 5 July 2011 06:52, Alexander Aristov <[email protected]> wrote:
> > > I have successfully used some of the algorithms which sort out useful
> > > text from HTML pages.
> > >
> > > This page gave me ideas:
> > >
> > > http://sujitpal.blogspot.com/2009/11/extracting-useful-text-from-html.html
> > >
> > > Best Regards
> > > Alexander Aristov
> > >
> > > On 4 July 2011 22:55, Markus Jelsma <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > Because most of the internet is garbage, I'd like not to index
> > > > garbage. There is a huge number of pages that consist of just links
> > > > and almost no text.
> > > >
> > > > To filter these pages out I intend to build an indexing filter. The
> > > > problem is how to detect whether a page is considered a link page.
> > > > From what I've seen there should be a distinct ratio between amount
> > > > of text and number of outlinks to the same and other domains.
> > > >
> > > > My question: has anyone come across literature on this topic? Or does
> > > > someone already have such a ratio defined?
> > > >
> > > > Thanks!
>
