From what I remember of earlier advice, you really want to use the Automaton filter if at all possible, rather than a series of plain regexes. Matching with the Automaton should be linear in the number of characters in the URL. Building the actual automaton can be fairly time consuming, but since you'll be re-using it often, it is likely worth the cost.

http://nutch.apache.org/apidocs-1.2/org/apache/nutch/urlfilter/automaton/package-summary.html

A series of Java regexes should also be linear in the number of characters in the URL, assuming you avoid certain constructs. The main culprit is anything that causes backtracking, in particular a backreference that requires one group/subgroup to equal a later group/subgroup. Each regex also adds to the constant factor in front of the number of characters, so a long rule list gets slower even when every individual rule is well behaved.
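To make the backtracking point concrete, here is a toy illustration (the class name and patterns are made up for the example). The first pattern uses only plain operators, so matching is effectively one scan over the URL; the second uses a backreference, which is exactly the kind of construct that forces the engine to try many splits of the input, and which a DFA-based matcher cannot express at all:

    import java.util.regex.Pattern;

    public class BacktrackDemo {
        public static void main(String[] args) {
            String url = "http://example.com/aaaaaaaaaaaaaaaaaaaa!";

            // Plain pattern: no backreferences, no lookaround. An automaton
            // could match this, and the regex engine handles it in roughly
            // one left-to-right pass.
            Pattern linear = Pattern.compile("http://[a-z.]+/a+!");
            System.out.println(linear.matcher(url).matches()); // true

            // Backreference: \1 must repeat exactly what group 1 matched, so
            // the engine tries different ways of splitting the run of a's
            // until one works. This is the operator class that the Automaton
            // library deliberately leaves out.
            Pattern backref = Pattern.compile("http://[a-z.]+/(a+)\\1a*!");
            System.out.println(backref.matcher(url).matches()); // true, but found by trial and error
        }
    }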
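And on the Automaton side, here is a rough sketch of the build-once, re-use-often pattern, using the dk.brics.automaton library directly (which is, as far as I know, what the Nutch plugin wraps); the pattern and URLs are made up for the example:

    import dk.brics.automaton.RegExp;
    import dk.brics.automaton.RunAutomaton;

    public class AutomatonFilterSketch {
        public static void main(String[] args) {
            // The expensive step: compile the expression down to a DFA.
            // Do this once and keep the RunAutomaton around.
            RegExp pattern = new RegExp("http://([a-z0-9]+\\.)*example\\.com/.*");
            RunAutomaton matcher = new RunAutomaton(pattern.toAutomaton());

            // The cheap step: run() is a single state-machine walk over the
            // string, so each check is linear in the length of the URL.
            String[] urls = {
                "http://www.example.com/index.html",   // prints -> true
                "http://spammy-site.com/index.html"    // prints -> false
            };
            for (String url : urls) {
                System.out.println(url + " -> " + matcher.run(url));
            }
        }
    }

You pay the compile cost up front, and after that each check is one table lookup per character of the URL, which is where the speed difference over a long stack of java.util.regex patterns comes from.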
I've used the Automaton library, and if you can work within its limitations (it is a classic regex matcher, with a limited set of operators relative to, say, Perl 5 compatible regular expressions), it works well. I don't have any practical experience with Nutch on a large-scale crawl, but based on my experience with regular expressions and the Automaton library, I know it is much faster. I recall Andrzej talking about it being much faster as well.

It might also be worthwhile for Nutch to look into Lucene's optimized version of Automaton; they ported over several critical operations for use in Lucene's fuzzy matching when computing the Levenshtein distance.

I can't seem to find the thread where I saw that advice given, but you can see the thread where they discussed adding the Automaton URL filter back in Nutch 0.8, and it seems to agree with my experience using both:

http://lucene.472066.n3.nabble.com/Much-faster-RegExp-lib-needed-in-nutch-td623308.html

Kirby

On Thu, Jun 2, 2011 at 2:42 PM, MilleBii <[email protected]> wrote:
> What will be the impact of a growing big regex-urlfilter ?
>
> I ask this because there are more & more sites that I want to filter out,
> it will limit the # of unecessary pages at a cost of lots of url
> verification.
> Side question since I already have pages from those sites in the crawldb,
> will they be removed ever ? What would be the method to remove them ?
>
> --
> -MilleBii-

