RE: HTML tag filtering

Markus Källander Thu, 13 Feb 2014 00:03:33 -0800

Hi,

Trying to run the patch command and get this error:

$ patch -p0 < blacklist_whitelist_plugin.patch
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/src/java/at/scintillation/nutch/BlacklistWhitelistIndexer.java
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/src/java/at/scintillation/nutch/BlacklistWhitelistParser.java
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/README.txt
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/build.xml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/ivy.xml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/index-blacklist-whitelist/plugin.xml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/plugin/build.xml
Hunk #1 FAILED at 62 (different line endings).
1 out of 1 hunk FAILED -- saving rejects to file src/plugin/build.xml.rej

Any hints? I am trying to apply the patch to the source of the tagged 1.7 release.

Markus Källander

Mobile +46 73 622 0547

-----Original Message-----
From: Tejas Patil [mailto:[email protected]] 
Sent: den 13 februari 2014 01:40
To: [email protected]
Subject: Re: HTML tag filtering

On Wed, Feb 12, 2014 at 6:04 AM, Markus Källander < 
[email protected]> wrote:

> Hi,
>
> The patch seems to fulfil my needs, but how do I use it with Nutch 1.7

From your local trunk checkout, run these commands in a shell:

wget https://issues.apache.org/jira/secure/attachment/12495393/blacklist_whitelist_plugin.patch
patch -p0 < blacklist_whitelist_plugin.patch
ant clean runtime

Once those commands succeed, the patch is applied to your local copy of the
Nutch codebase. The patch is old and I am not sure it will compile cleanly, so
you may have to look through the codebase and tweak it.

Thanks,
Tejas

> ? Is the patch not released yet?
>
> Markus Källander
>
> Mobile +46 73 622 0547
>
>
>
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: den 11 februari 2014 17:44
> To: [email protected]
> Subject: Re: HTML tag filtering
>
> Hi Markus,
>
> in short, you have to write a parse filter plugin whose filter(...)
> method does the following:
> 1. Traverse the DOM tree and construct a "clean" text by skipping
>    certain content. See o.a.n.util.NodeWalker and
>    o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the
>    parse-html plugin).
> 2. Replace the old plain text in the ParseResult with the new "clean"
>    text.
>
> Maybe this issue can help (there is also a patch but I'm not sure 
> whether it's working and fulfills your needs):
>  https://issues.apache.org/jira/browse/NUTCH-585
>
> Sebastian
>
> On 02/11/2014 04:24 PM, Markus Källander wrote:
> > Hi,
> >
> > How do I skip indexing of HTML tags with certain ids or CSS
> > classes? I am using Nutch 1.7.
> >
> > Thanks
> > Markus
> >
>
>
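
The first step Sebastian describes (walk the DOM, skip unwanted elements,
collect the remaining text) might look roughly like the sketch below. It uses
only the JDK's org.w3c.dom classes and no Nutch APIs, so it can run on its
own. The class name CleanTextSketch, its method names, and the sample page are
all hypothetical, and it parses well-formed XHTML with the JDK XML parser
purely for illustration. In a real parse filter the DOM would come from Nutch,
and the resulting string would still have to be written back into the
ParseResult (step 2), which is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical, standalone illustration; not part of Nutch or of the NUTCH-585 patch.
public class CleanTextSketch {

    /** Returns true if the element's id or one of its CSS classes is blacklisted. */
    static boolean isSkipped(Element el, Set<String> skipIds, Set<String> skipClasses) {
        // getAttribute returns "" when the attribute is absent.
        if (skipIds.contains(el.getAttribute("id"))) {
            return true;
        }
        String classAttr = el.getAttribute("class").trim();
        if (classAttr.isEmpty()) {
            return false;
        }
        for (String cls : classAttr.split("\\s+")) {
            if (skipClasses.contains(cls)) {
                return true;
            }
        }
        return false;
    }

    /** Depth-first walk that appends text nodes, pruning blacklisted subtrees. */
    static void collectText(Node node, Set<String> skipIds, Set<String> skipClasses,
                            StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && isSkipped((Element) node, skipIds, skipClasses)) {
            return; // skip this element and everything below it
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, skipIds, skipClasses, out);
        }
    }

    /** Parses well-formed XHTML (an assumption of this sketch) and returns the clean text. */
    public static String cleanText(String xhtml, Set<String> skipIds, Set<String> skipClasses)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), skipIds, skipClasses, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"nav\">Home | About</div>"
                + "<p>Main article text.</p><p class=\"ad wide\">Buy now</p></body></html>";
        // Skip the element with id "nav" and any element carrying the class "ad".
        System.out.println(cleanText(page,
                Collections.singleton("nav"), Collections.singleton("ad")));
    }
}
```

With the sample page above, only the text of the plain paragraph survives.
Pruning the whole subtree at a blacklisted element is the simplest choice, and
it also drops any text nested inside that element.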
