From what I remember of earlier advice, you really want to use the
Automaton filter if at all possible, rather than a series of straight
regexes.  Matching with the Automaton should be linear in the number
of characters in the URL.  Building the actual automaton can be fairly
time consuming, but since you'll be reusing it often, it is likely
worth the cost.

http://nutch.apache.org/apidocs-1.2/org/apache/nutch/urlfilter/automaton/package-summary.html
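
To give an idea of the usage (a minimal sketch with a made-up pattern,
not an actual Nutch filter rule), you build the automaton once up front
and then reuse the compiled RunAutomaton for every URL:

import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class AutomatonUrlCheck {

    // Built once (the expensive step) and reused for every URL checked.
    private static final RunAutomaton MATCHER =
            new RunAutomaton(new RegExp(".*\\.example\\.com/.*").toAutomaton());

    public static void main(String[] args) {
        String url = "http://www.example.com/page.html";
        // run() walks the DFA once per character, so matching is
        // linear in the length of the URL.
        System.out.println(MATCHER.run(url));
    }
}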

A series of Java regexes should also be linear in the number of
characters in the URL, assuming you avoid certain constructs; the
primary culprit is anything that causes backtracking, such as a
backreference that requires a later part of the match to equal an
earlier group/subgroup.  Each regex you add increases the constant
multiplier in front of the number of characters.
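
For example (hypothetical patterns, just to illustrate the point), the
first pattern below avoids backreferences, while the second uses \1 and
forces the engine to backtrack:

import java.util.regex.Pattern;

public class RegexCost {

    // Compiled once and reused; no backreferences, so no pathological backtracking.
    private static final Pattern SAFE =
            Pattern.compile("^https?://([^/]+\\.)?example\\.com/.*$");

    // The \1 backreference makes the engine check that a later part of the
    // match equals an earlier group, which is the construct to avoid.
    private static final Pattern BACKTRACKING =
            Pattern.compile("^https?://(\\w+)\\.example\\.com/\\1/.*$");

    public static void main(String[] args) {
        String url = "http://www.example.com/www/index.html";
        System.out.println(SAFE.matcher(url).matches());          // true
        System.out.println(BACKTRACKING.matcher(url).matches());  // true
    }
}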

I've used the Automaton library, and it works well if you can live
within its limitations (it is a classic regex matcher with a limited
set of operators relative to, say, Perl 5 Compatible Regular
Expressions: no backreferences or look-around).
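
One thing that may help with a growing list of rules (this is just a
sketch with hypothetical patterns; the actual automaton urlfilter plugin
reads its rules from its own configuration file) is that you can union
all the individual rules into a single minimized automaton, so the
matching cost per URL no longer depends on how many rules you have:

import java.util.Arrays;
import java.util.List;

import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

public class CombinedFilter {
    public static void main(String[] args) {
        // Hypothetical exclusion rules.
        List<String> rules = Arrays.asList(
                ".*\\.spam-site\\.com/.*",
                ".*\\.another-bad-site\\.net/.*");

        // Union the rules into one automaton and minimize it; afterwards a
        // single linear pass over the URL checks all rules at once.
        Automaton combined = Automaton.makeEmpty();
        for (String rule : rules) {
            combined = combined.union(new RegExp(rule).toAutomaton());
        }
        combined.minimize();
        RunAutomaton matcher = new RunAutomaton(combined);

        System.out.println(matcher.run("http://www.spam-site.com/index.html"));  // true
        System.out.println(matcher.run("http://www.good-site.org/index.html"));  // false
    }
}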

I don't have any practical experience with Nutch on a large-scale
crawl, but based on my experience using regular expressions and the
Automaton library, I know the latter is much faster.  I recall Andrej
talking about it being much faster as well.  It might also be
worthwhile for Nutch to look into Lucene's optimized versions of
Automaton (they ported over several critical operations for use in
Lucene's fuzzy matching when computing the Levenshtein distance).

I can't seem to find the thread where I saw that advice given, but you
can see the thread below where they discuss adding the Automaton URL
filter back in Nutch 0.8, and it agrees with my experience using both.

http://lucene.472066.n3.nabble.com/Much-faster-RegExp-lib-needed-in-nutch-td623308.html

Kirby



On Thu, Jun 2, 2011 at 2:42 PM, MilleBii <[email protected]> wrote:
> What will be the impact of a growing big regex-urlfilter ?
>
> I ask this because there are more & more sites that I want to filter out;
> it will limit the # of unnecessary pages at the cost of a lot of URL
> verification.
> Side question: since I already have pages from those sites in the crawldb,
> will they ever be removed? What would be the method to remove them?
>
> --
> -MilleBii-
>
