Re: parsing pages but removing headers and footers

Talat Uyarer Thu, 14 May 2015 07:59:15 -0700

Hi Mark,

Maybe you can use a boilerplate removal algorithm. Tika has support for it. Look at
https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-961
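
As a rough sketch of what the configuration could look like, assuming your
Nutch build includes the NUTCH-961 change (1.9 may need the patch or an
upgrade) and that HTML is parsed by parse-tika rather than parse-html. The
property names below are as I remember them from nutch-default.xml, so please
check them against your own copy before relying on them:

```xml
<!-- nutch-site.xml: sketch only, verify names against your nutch-default.xml -->
<property>
  <name>tika.extractor</name>
  <!-- assumed values: "boilerpipe" enables boilerplate removal, "none" disables it -->
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- assumed choices: DefaultExtractor, ArticleExtractor or CanolaExtractor -->
  <value>ArticleExtractor</value>
</property>
```

After changing this you would need to re-parse and re-index the pages before
the content field in Solr changes.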


Talat
On May 14, 2015 5:32 PM, "Mark Wilson" <[email protected]> wrote:

> Hi everyone.
>
> I wonder if anyone can help me.
>
> I am crawling our site with Nutch 1.9, and would like to be able to parse
> the pages but not the headers, navbar and footer.
>
> The reason is that when the pages are posted to Solr, the content field
> starts with the same text for every page, and a query for text that is in
> the navbar, for instance, matches all of the pages.
>
> Is there any way of configuring Nutch to do this?
>
> Kind Regards
>
> Mark
>
>
>
