On Wed, Feb 12, 2014 at 6:04 AM, Markus Källander < [email protected]> wrote:

> Hi,
>
> The patch seems to fulfil my needs, but how do I use it with Nutch 1.7?

From your local trunk checkout, run these commands in a shell:

    wget https://issues.apache.org/jira/secure/attachment/12495393/blacklist_whitelist_plugin.patch
    patch -p0 < blacklist_whitelist_plugin.patch
    ant clean runtime

If all three commands succeed, the patch has been applied to your local copy of the Nutch codebase and the runtime rebuilt. That patch is old and I am not sure it will apply or compile cleanly, so you may have to look in the codebase and tweak it.

Thanks,
Tejas

> Is the patch not released yet?
>
> Markus Källander
>
> Mobile +46 73 622 0547
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: 11 February 2014 17:44
> To: [email protected]
> Subject: Re: HTML tag filtering
>
> Hi Markus,
>
> in short, you have to write a parse filter plugin which does the
> following in the filter(...) method:
> 1. traverse the DOM tree and construct a "clean" text by skipping certain
>    content. See
>      o.a.n.utils.NodeWalker
>      o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the
>      parse-html plugin)
> 2. then replace the old plain text in ParseResult with the new "clean"
>    text
>
> Maybe this issue can help (there is also a patch but I'm not sure whether
> it's working and fulfills your needs):
> https://issues.apache.org/jira/browse/NUTCH-585
>
> Sebastian
>
> On 02/11/2014 04:24 PM, Markus Källander wrote:
> > Hi,
> >
> > How do I skip indexing of HTML tags with certain ids or CSS classes? I
> > am using Nutch 1.7.
> >
> > Thanks
> > Markus
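
For reference, step 1 of the approach Sebastian describes in the quoted message (walk the DOM, skip unwanted elements, collect the remaining text) could look roughly like the sketch below. It is not Nutch code and not the NUTCH-585 patch: the class and method names (CleanTextSketch, isSkipped, collectText, cleanText) are made up for illustration, and it uses the JDK's own XML parser on well-formed XHTML as a stand-in for the DOM that Nutch would hand to a parse filter. Inside a real plugin, only the traversal part would be relevant, and the resulting text would then replace the plain text in the ParseResult (step 2).

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects text while skipping
// elements whose id or CSS class is on a skip list.
class CleanTextSketch {

    /** True if the element's id, or any one of its CSS classes, is on a skip list. */
    static boolean isSkipped(Element el, Set<String> skipIds, Set<String> skipClasses) {
        // getAttribute returns "" when the attribute is absent.
        if (skipIds.contains(el.getAttribute("id"))) {
            return true;
        }
        for (String cls : el.getAttribute("class").trim().split("\\s+")) {
            if (skipClasses.contains(cls)) {
                return true;
            }
        }
        return false;
    }

    /** Depth-first walk that appends text nodes, ignoring skipped subtrees entirely. */
    static void collectText(Node node, Set<String> skipIds, Set<String> skipClasses,
                            StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && isSkipped((Element) node, skipIds, skipClasses)) {
            return; // drop this element and everything below it
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, skipIds, skipClasses, out);
        }
    }

    /** Parses well-formed XHTML (an assumption of this sketch) and returns the "clean" text. */
    static String cleanText(String xhtml, Set<String> skipIds, Set<String> skipClasses) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), skipIds, skipClasses, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

Calling cleanText with a skip list of ids such as "nav" and classes such as "ad" returns only the text outside those elements. Skipping whole subtrees is a design choice of this sketch; a real filter might instead want to keep some nested content.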

