Re: Tell Nutch to only crawl parts of document

Christian Kunz Thu, 02 Feb 2017 08:05:01 -0800

Hi Markus,

thanks, I will check this.


Regards,
Christian


-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Thursday, 2 February 2017 16:36
To: [email protected]
Subject: RE: Tell Nutch to only crawl parts of document

Hello Christian, you are probably talking about text extraction, which is done 
in the parse step. Nutch's Tika parser supports Boilerpipe text extraction. It 
is not very accurate in some cases, but it is the open source solution that is 
available. Check nutch-default.xml for its settings.
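
As a rough sketch of the settings referred to above: the property names below 
are the ones recalled from nutch-default.xml and are an assumption, so verify 
them against the nutch-default.xml shipped with your Nutch version. Overrides 
normally go into nutch-site.xml rather than into nutch-default.xml itself.

```xml
<!-- nutch-site.xml: sketch only; confirm property names in your nutch-default.xml -->
<property>
  <name>tika.extractor</name>
  <!-- assumed valid values: "none" (default) or "boilerpipe" -->
  <value>boilerpipe</value>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- assumed valid values: DefaultExtractor, ArticleExtractor, CanolaExtractor -->
  <value>ArticleExtractor</value>
</property>
```

These settings only affect documents that are actually parsed by the Tika 
plugin. If HTML pages are handled by a different parse plugin in your setup, 
the Boilerpipe extraction will likely not be applied to them.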

Regards,
Markus

 
 
-----Original message-----
> From: Christian Kunz <[email protected]>
> Sent: Thursday 2nd February 2017 15:23
> To: [email protected]
> Subject: Tell Nutch to only crawl parts of document
> 
> Hi everybody,
> 
> we've got a problem using Nutch: on the website that has to be crawled, there 
> is a navigation bar at the top of each page. Nutch indexes the navigation 
> text of every page, so for queries that match terms in the navigation, every 
> page is returned as a result.
> 
> Is there a way to tell Nutch to only crawl parts of a page, such as only the 
> main content?
> 
> Thanks in advance and regards,
> Christian 
>
