Hi Mark, 

You can use the Boilerpipe support that ships with Tika; it tries to detect 
the main content (text) of the page and ignore the noise around it. Current 
versions of Tika support this, but Nutch doesn't expose a configuration option 
to enable it. You could apply the patch in [1]; it needs updating to work with 
the Nutch 1.9 source code, but that shouldn't be hard. 

Another option is [2]. You'll also need to apply a patch, and then configure 
the property "parser.html.NodesToExclude" in your nutch-site.xml file. Its 
value is a list of nodes, separated by |, that will not be indexed; the JIRA 
issue describes the exact format.
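
To illustrate what excluding nodes amounts to, here is a minimal standalone 
Python sketch. It is not Nutch code and does not use the patch from [2]; the 
(tag, attribute, value) rule triples and the names EXCLUDE, MainTextExtractor 
and extract_text are made up for this illustration, and it assumes the HTML 
has balanced tags.

```python
from html.parser import HTMLParser

# Hypothetical rules: (tag, attribute, value) triples naming nodes to drop.
# This is NOT the format used by parser.html.NodesToExclude; see the JIRA.
EXCLUDE = {("div", "id", "header"), ("div", "id", "footer")}

# Void elements never get an end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class MainTextExtractor(HTMLParser):
    """Collects text, skipping everything inside excluded nodes."""

    def __init__(self, exclude):
        super().__init__()
        self.exclude = exclude
        self.skip_depth = 0  # > 0 while inside an excluded node
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.skip_depth:
            # Already inside an excluded node: track nesting only.
            self.skip_depth += 1
        elif any((tag, name, value) in self.exclude for name, value in attrs):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html, exclude=EXCLUDE):
    parser = MainTextExtractor(exclude)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<html><body>'
    '<div id="header">Site nav<br><a href="/">Home</a></div>'
    '<div class="main"><p>Real content</p></div>'
    '<div id="footer">Copyright</div>'
    '</body></html>'
)
print(extract_text(page))
```

With the sample page above, only the text inside <div class="main"> survives, 
because the header and footer nodes match the exclusion rules.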

Regards,

[1] https://issues.apache.org/jira/browse/NUTCH-961
[2] https://issues.apache.org/jira/browse/NUTCH-585


----- Original Message ----- 
From: "Mark Wilson" <[email protected]> 
To: [email protected] 
Sent: Wednesday, June 3, 2015 11:43:46 AM 
Subject: [MASSMAIL]Crawling pages but ignoring header and footer 

Does anyone know of a way to crawl a website but ignore headers and footers, 
or include just the main content of a site by, say, only including content in 
a <div class="main">? 

I have tried using https://github.com/BayanGroup/nutch-custom-search in Nutch 
1.9 but I can't get it to work. 

Any ideas greatly appreciated. 

Thanks 

Regards 

Mark Wilson 
[email protected] 



