Thanks.

In my case I don't want to save the page content in the segments at all,
so that disk space isn't wasted on unneeded data.

I guess it's simpler to do this while indexing, by implementing an index filter
that skips any document containing those words.
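
A minimal sketch of the check such a filter would need, in plain Java with no
Nutch classes. The class name BlockedWordCheck and the word list are made up
for illustration. My assumption is that an indexing filter would run this on
the parsed page text and drop the document when it returns true; I have not
checked that against the Nutch plugin API.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: decides whether a page's text
// contains any word from a blocked list.
public class BlockedWordCheck {
    private final Set<String> blocked = new HashSet<>();

    public BlockedWordCheck(String... words) {
        for (String w : words) {
            // Store lower-cased so matching is case-insensitive.
            blocked.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true if any whole word of the text is in the blocked list. */
    public boolean containsBlockedWord(String text) {
        if (text == null) {
            return false;
        }
        // Split on anything that is not a letter or digit.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (blocked.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

This matches whole words only, so a blocked word inside a longer word is not
caught. Since it runs at index time, it would not by itself keep the content
out of the segments.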

Regards,





________________________________
From: Scott Gonyea <[email protected]>
To: [email protected]
Sent: Mon, August 23, 2010 7:04:33 PM
Subject: Re: nutch plugin to filter indexing by content!

Not to my knowledge.  You may want to look for where
"regex-normalize.xml" is being used and write a plugin there.  It would
be useful, certainly.  I'm looking to eventually do the same, but at index
time.

Scott

On Mon, Aug 23, 2010 at 8:11 AM, Ahmad Al-Amri <[email protected]> wrote:

>
> Hello,
>
> I want to check whether a web page contains certain words and, if so, NOT
> index it while crawling, and also prevent its URL from being added to my
> crawldb.
>
> I just want to ask whether there is a plug-in that does this, or something
> similar, that I could start from.
>
> Thank you.
>
>
>



      
