Hi,

I am trying to run the patch command and get this error:

$ patch -p0 < blacklist_whitelist_plugin.patch
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/src/java/at/scintillation/nutch/BlacklistWhitelistIndexer.java
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/src/java/at/scintillation/nutch/BlacklistWhitelistParser.java
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/README.txt
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/build.xml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/ivy.xml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/plugin.xml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/build.xml
Hunk #1 FAILED at 62 (different line endings).
1 out of 1 hunk FAILED -- saving rejects to file src/plugin/build.xml.rej

Any hints? I am trying to apply the patch to the source of the tagged 1.7 release.

Markus Källander
Mobile +46 73 622 0547

-----Original Message-----
From: Tejas Patil [mailto:[email protected]]
Sent: den 13 februari 2014 01:40
To: [email protected]
Subject: Re: HTML tag filtering

On Wed, Feb 12, 2014 at 6:04 AM, Markus Källander <[email protected]> wrote:

> Hi,
>
> The patch seems to fulfil my needs, but how do I use it with Nutch 1.7?

From your local trunk checkout, run these commands in a shell:

    wget https://issues.apache.org/jira/secure/attachment/12495393/blacklist_whitelist_plugin.patch
    patch -p0 < blacklist_whitelist_plugin.patch
    ant clean runtime

Now you have successfully applied that patch to your local copy of the Nutch codebase.
That patch is old and I am not sure if it would compile correctly, so you may have to look in the codebase and tweak it.

Thanks,
Tejas

> Is the patch not released yet?
>
> Markus Källander
> Mobile +46 73 622 0547
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: den 11 februari 2014 17:44
> To: [email protected]
> Subject: Re: HTML tag filtering
>
> Hi Markus,
>
> in short, you have to write a parse filter plugin which does the
> following in its filter(...) method:
> 1. traverse the DOM tree and construct a "clean" text by skipping
>    certain content. See o.a.n.util.NodeWalker and
>    o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the
>    parse-html plugin)
> 2. then replace the old plain text in ParseResult by the new "clean"
>    text
>
> Maybe this issue can help (there is also a patch but I'm not sure
> whether it's working and fulfills your needs):
> https://issues.apache.org/jira/browse/NUTCH-585
>
> Sebastian
>
> On 02/11/2014 04:24 PM, Markus Källander wrote:
> > Hi,
> >
> > How do I skip indexing of HTML tags with certain id:s or css
> > classes? I am using Nutch 1.7.
> >
> > Thanks
> > Markus
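
A note on the failed hunk reported at the top of this thread: the "(different line endings)" remark appears to mean that the hunk would match src/plugin/build.xml apart from CR/LF differences. One plausible reading, which is an assumption and not confirmed anywhere in the thread, is that build.xml in the 1.7 source has CRLF line endings while the patch, after its CRs were stripped, has LF. The sketch below shows one way to check for and remove carriage returns; count_cr_lines and strip_cr are hypothetical helper names, not part of patch or Nutch.

```shell
# Hypothetical helper: print how many lines of a file contain a carriage
# return. 0 means the file has no CRs (LF-only line endings).
count_cr_lines() {
  # grep -c exits non-zero when the count is 0, so mask that with "|| true".
  grep -c "$(printf '\r')" "$1" || true
}

# Hypothetical helper: rewrite a file in place without any carriage returns.
# Assumption: the file is plain text, so dropping every CR is safe.
strip_cr() {
  tmp="$1.lf.$$"
  tr -d '\r' < "$1" > "$tmp" && mv "$tmp" "$1"
}
```

With these defined, `count_cr_lines src/plugin/build.xml` shows whether the target file contains CRs. If it does, running `strip_cr src/plugin/build.xml` in a clean copy of the source and then repeating the `patch -p0` command may let the hunk apply. The `--binary` option named in the patch output goes the other way: it keeps the CRs in the patch instead of stripping them.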

