Re: HTML tag filtering

Tejas Patil Wed, 12 Feb 2014 16:40:13 -0800

On Wed, Feb 12, 2014 at 6:04 AM, Markus Källander <
[email protected]> wrote:


> Hi,
>
> The patch seems to fulfil my needs, but how do I use it with Nutch 1.7


From your local trunk checkout, run these commands in a shell:

  wget https://issues.apache.org/jira/secure/attachment/12495393/blacklist_whitelist_plugin.patch
  patch -p0 < blacklist_whitelist_plugin.patch
  ant clean runtime

The patch is now applied to your local copy of the Nutch codebase. It is
old, though, and I am not sure it still compiles cleanly, so you may have to
look through the code and tweak it.

Thanks,
Tejas

> ? Is the patch not released yet?
>
> Markus Källander
>
> Mobile +46 73 622 0547
>
>
>
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: den 11 februari 2014 17:44
> To: [email protected]
> Subject: Re: HTML tag filtering
>
> Hi Markus,
>
> In short, you have to write a parse filter plugin whose filter(...) method
> does the following:
> 1. Traverse the DOM tree and construct a "clean" text by skipping certain
>    content. See o.a.n.util.NodeWalker and
>    o.a.n.parse.html.DOMContentUtils.getTextHelper(...) (part of the
>    parse-html plugin).
> 2. Replace the old plain text in the ParseResult with the new "clean" text.
>
> Maybe this issue can help (there is also a patch but I'm not sure whether
> it's working and fulfills your needs):
>  https://issues.apache.org/jira/browse/NUTCH-585
>
> Sebastian
>
> On 02/11/2014 04:24 PM, Markus Källander wrote:
> > Hi,
> >
> > How do I skip indexing of HTML tags with certain ids or CSS classes? I
> > am using Nutch 1.7.
> >
> > Thanks
> > Markus
> >
>
>
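
For illustration, the first step Sebastian describes (walk the DOM and build a "clean" text while skipping certain elements) could look roughly like the sketch below. It is not taken from the NUTCH-585 patch and does not use any Nutch classes: it relies only on the JDK's DOM API, and the names CleanTextSketch, isSkipped, collectText and cleanText are hypothetical. It also assumes well-formed markup for the demo. Wiring this into a parse filter's filter(...) method and replacing the text in the ParseResult is not shown.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects text from a DOM tree
// while skipping elements that carry a blacklisted id or CSS class.
public class CleanTextSketch {

    /** Returns true if the element has a blacklisted id or CSS class. */
    static boolean isSkipped(Element el, Set<String> skipIds, Set<String> skipClasses) {
        // getAttribute returns "" when the attribute is absent
        if (skipIds.contains(el.getAttribute("id"))) {
            return true;
        }
        for (String cls : el.getAttribute("class").trim().split("\\s+")) {
            if (skipClasses.contains(cls)) {
                return true;
            }
        }
        return false;
    }

    /** Depth-first walk that appends text nodes and prunes blacklisted subtrees. */
    static void collectText(Node node, Set<String> skipIds, Set<String> skipClasses,
                            StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && isSkipped((Element) node, skipIds, skipClasses)) {
            return; // skip this element and everything below it
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, skipIds, skipClasses, out);
        }
    }

    /** Parses well-formed markup and returns its text without the skipped parts. */
    static String cleanText(String xhtml, Set<String> skipIds, Set<String> skipClasses) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), skipIds, skipClasses, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"nav\">Home</div>"
                + "<p>Real content</p><div class=\"ad big\">Buy</div></body></html>";
        Set<String> ids = new HashSet<>(Arrays.asList("nav"));
        Set<String> classes = new HashSet<>(Arrays.asList("ad"));
        System.out.println(cleanText(page, ids, classes)); // prints "Real content"
    }
}
```

In a real plugin the DOM would come from the HTML parser rather than from a string, so only the isSkipped/collectText part of the sketch would carry over.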
