Hi Mark,

Maybe you can use a boilerplate-removal algorithm (Boilerpipe). Tika supports it, and NUTCH-961 covers exposing that support in Nutch. Look at https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-961
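
For illustration, here is a rough sketch of what enabling this might look like in conf/nutch-site.xml. It assumes a Nutch build that already includes the NUTCH-961 patch (a stock 1.9 may not) and that HTML pages are actually routed to the parse-tika plugin rather than parse-html. The property names below are the ones I believe appear in nutch-default.xml of later releases, so please check them against your own version before relying on them.

```xml
<!-- Assumption: these properties exist only in builds containing NUTCH-961. -->
<property>
  <name>tika.extractor</name>
  <!-- "none" leaves the text untouched; "boilerpipe" turns on boilerplate removal -->
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- Boilerpipe extractor to use, e.g. DefaultExtractor or ArticleExtractor -->
  <value>ArticleExtractor</value>
</property>
```

After changing this, the pages would need to be re-parsed and re-indexed for the Solr content field to change.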
Talat

On May 14, 2015 5:32 PM, "Mark Wilson" <[email protected]> wrote:
> Hi everyone.
>
> I wonder if anyone can help me.
>
> I am crawling our site with nutch 1.9, and would like to be able to parse
> the pages but not the headers, navbar and footer.
>
> The reason for this is because when you post it to Solr, the content field
> starts with the same text for all pages, and if you query for text that is
> in the navbar for instance, it includes all your pages.
>
> Is there any way of configuring Nutch to do this?
>
> Kind Regards
>
> Mark

