Hi Markus, thanks, I will check this.
Regards,
Christian

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, 2 February 2017 16:36
To: [email protected]
Subject: RE: Tell Nutch to only crawl parts of document

Hello Christian,

you are probably talking about text extraction, which is done in the parse step. Nutch's Tika parser supports Boilerpipe text extraction. It is not very accurate in some cases, but it is the open-source solution that is available. Check nutch-default.xml for its settings.

Regards,
Markus

-----Original message-----
> From: Christian Kunz <[email protected]>
> Sent: Thursday 2nd February 2017 15:23
> To: [email protected]
> Subject: Tell Nutch to only crawl parts of document
>
> Hi everybody,
>
> we've got a problem using Nutch: on the website that has to be crawled, there
> is a navigation bar at the top of each page. Nutch crawls the navigation of
> each page, so for certain queries (terms that appear in the navigation) every
> page is returned as a result.
>
> Is there a way to tell Nutch to crawl only parts of a page, such as only the
> main content?
>
> Thanks in advance and regards,
> Christian

