See https://issues.apache.org/jira/browse/NUTCH-961

On 29 April 2011 15:46, Hannes Carl Meyer <[email protected]> wrote:

> Hi,
>
> Have a look at boilerpipe: http://code.google.com/p/boilerpipe/
>
> BR
>
> Hannes
>
> On Fri, Apr 29, 2011 at 11:26 AM, jotta <[email protected]> wrote:
>
> > Hi!
> >
> > I have to crawl a couple of sites (each of them is different). The problem
> > is that most of the crawled content is rubbish (page headers, menus,
> > adverts, footers, etc.).
> > How do you extract just the relevant content from such pages? Is it a good
> > approach to parse the HTML in my own plugin, take content only from the
> > HTML tags I want, and then index that content into a new field?
> >
> > Regards,
> > Jotta
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Getting-content-from-crawling-site-s-tp2878602p2878602.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
