Re: [MASSMAIL]Crawling pages but ignoring header and footer

Mark Wilson Fri, 05 Jun 2015 01:38:31 -0700

Thanks for the info, Jorge. I'll take a look at these. Cheers, Mark


On 4 Jun 2015, at 02:29, Jorge Luis Betancourt González wrote:

> Hi Mark, 
> 
> You can use the Boilerpipe feature that comes with Tika, which tries to 
> detect the main content (text) of the page and ignore the noise around 
> it. Although current versions of Tika support this, Nutch doesn't 
> expose a configuration option to enable it. You could apply the patch in 
> [1]; it needs updating to work with the Nutch 1.9 source code, but that 
> shouldn't be hard. Another option is [2]: you'll also need to 
> apply a patch, then set the "parser.html.NodesToExclude" property in 
> your nutch-site.xml file to a list of nodes, separated by |, 
> that will not be indexed. The JIRA issue describes the format of this 
> setting.
> 
> Regards,
> 
> [1] https://issues.apache.org/jira/browse/NUTCH-961
> [2] https://issues.apache.org/jira/browse/NUTCH-585
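
As a rough illustration of the "keep only the content inside <div class="main">" idea from the original question, here is a minimal sketch in Python using only the standard library's html.parser. It runs outside Nutch entirely: it is not the NUTCH-961 or NUTCH-585 patch and does not use Boilerpipe. The names MainContentExtractor and extract_main_text are hypothetical, and the sketch assumes reasonably well-formed markup in which every non-void tag is closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Collects text found inside the first-level match of <tag class="...">.

    Hypothetical helper for illustration; assumes non-void tags are closed.
    """

    def __init__(self, tag="div", css_class="main"):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # 0 means we are outside the target element
        self.chunks = []    # text fragments collected inside the target

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            # Already inside the target: track nesting so we know when it ends.
            self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text seen while inside the target element.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html, tag="div", css_class="main"):
    """Return the text inside the target element, joined by single spaces."""
    parser = MainContentExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling extract_main_text(page_html) returns the text inside the target element and drops everything outside it, such as header and footer blocks. Excluding a list of nodes, as the "parser.html.NodesToExclude" approach does, would be the inverse: skip text while inside a listed node and keep the rest.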
> 
> 
> ----- Original Message ----- 
> From: "Mark Wilson" <[email protected]> 
> To: [email protected] 
> Sent: Wednesday, June 3, 2015 11:43:46 AM 
> Subject: [MASSMAIL]Crawling pages but ignoring header and footer 
> 
> Does anyone know of a way to crawl a website but ignore headers and footers, 
> or include just the main content of a site by, say, only including content in a 
> <div class="main">? 
> 
> I have tried using https://github.com/BayanGroup/nutch-custom-search in Nutch 
> 1.9 but I can't get it to work. 
> 
> Any ideas greatly appreciated. 
> 
> Thanks 
> 
> Regards 
> 
> Mark Wilson 
> [email protected] 
> 
> 
> 
> 

Mark Wilson
[email protected]

