Thanks for the info, Jorge. I'll take a look at these.

Cheers,
Mark
On 4 Jun 2015, at 02:29, Jorge Luis Betancourt González wrote:

> Hi Mark,
>
> You can use the boilerpipe feature that comes with Tika, which tries to
> detect the main content (text) of the page and ignore all the noise around
> it. Although current versions of Tika support this, Nutch doesn't expose a
> configuration option to enable it. You could apply the patch in [1]; it
> needs updating to work with the Nutch 1.9 source code, but that shouldn't
> be hard. Another option is [2]: you'll also need to apply a patch and then
> configure the property "parser.html.NodesToExclude" in your nutch-site.xml
> file. There you can set a list of nodes, separated by |, that will not be
> indexed; the JIRA issue describes the format of this configuration.
>
> Regards,
>
> [1] https://issues.apache.org/jira/browse/NUTCH-961
> [2] https://issues.apache.org/jira/browse/NUTCH-585
>
> ----- Original Message -----
> From: "Mark Wilson" <[email protected]>
> To: [email protected]
> Sent: Wednesday, June 3, 2015 11:43:46 AM
> Subject: [MASSMAIL]Crawling pages but ignoring header and footer
>
> Does anyone know of a way to crawl a website but ignore headers and
> footers, or include just the main content of a site by, say, only
> including content in a <div class="main">?
>
> I have tried using https://github.com/BayanGroup/nutch-custom-search in
> Nutch 1.9, but I can't get it to work.
>
> Any ideas greatly appreciated.
>
> Thanks
>
> Regards
>
> Mark Wilson
> [email protected]

Mark Wilson
[email protected]
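
For reference, the idea of keeping only the text inside a <div class="main"> can be sketched outside Nutch with the Python standard library. This is only an illustration of the filtering idea, not the NUTCH-585 patch, the boilerpipe extractor, or any Nutch API. The names MainContentExtractor and extract_main_text are hypothetical, and the sketch assumes reasonably well-formed markup in which every non-void element is closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Hypothetical helper: collects text inside <div class="main"> only."""

    def __init__(self, tag="div", css_class="main"):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # nesting depth inside the wanted element; 0 = outside
        self.chunks = []    # text fragments found inside the wanted element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            # Already inside the main element: track nested elements.
            self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of the first-level <div class="main"> blocks, space-joined."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = ('<html><body><div id="header">Site nav</div>'
            '<div class="main"><h1>Title</h1><p>Body<br>text</p></div>'
            '<div id="footer">Copyright</div></body></html>')
    print(extract_main_text(page))
```

Header and footer text is dropped because it sits outside the matching div; an exclusion list in the style of parser.html.NodesToExclude would do the inverse, skipping the listed nodes and keeping the rest.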

