Yes, I remember reading that a few years ago. But frankly, I can't design such a finite automaton by hand, and it will be ever-changing, by the way.
Even adding regexes by hand is most likely a daunting task for me.

2011/6/2 Kirby Bohling <[email protected]>

> From what I remember of earlier advice, you really want to use the
> Automaton filter if at all possible, rather than a series of straight
> regexes. Using the Automaton should be linear with respect to the
> number of characters in the URL. Building the actual automaton could
> be fairly time consuming, but as you'll be reusing it often, it is
> likely worth the cost.
>
> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/urlfilter/automaton/package-summary.html
>
> A series of Java regexes should also be linear in the number of
> characters in the URL, assuming you avoid specific constructs (the
> primary culprit is the construct that causes backtracking, where the
> engine effectively tries to ensure that one group/subgroup is equal
> to a later group/subgroup). Each regex adds to the constant multiple
> in front of the number of characters.
>
> I've used the Automaton library, and it works well if you can stay
> within its limitations (it is a classic regex matcher with limited
> operators relative to, say, Perl 5 Compatible Regular Expressions).
>
> I don't have any practical experience with Nutch for a large-scale
> crawl, but based on my experience using regular expressions and the
> Automaton library, I know it is much faster. I recall Andrej talking
> about it being much faster. It might also be worthwhile for Nutch to
> look into Lucene's optimized versions of Automaton (they ported over
> several critical operations for use in Lucene's fuzzy matching when
> computing the Levenshtein distance).
>
> I can't seem to find the thread where I saw that advice given, but
> you can see the thread where they discuss adding the Automaton URL
> filter back in Nutch 0.8; it seems to agree with my experience in
> using both.
>
> http://lucene.472066.n3.nabble.com/Much-faster-RegExp-lib-needed-in-nutch-td623308.html
>
> Kirby
>
> On Thu, Jun 2, 2011 at 2:42 PM, MilleBii <[email protected]> wrote:
> > What will be the impact of a growing, big regex-urlfilter?
> >
> > I ask this because there are more and more sites that I want to
> > filter out; it will limit the # of unnecessary pages at the cost of
> > lots of URL verification.
> > Side question: since I already have pages from those sites in the
> > crawldb, will they ever be removed? What would be the method to
> > remove them?
> >
> > --
> > -MilleBii-

--
-MilleBii-
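To make Kirby's point concrete, here is a minimal sketch using the dk.brics.automaton library that the Nutch automaton filter is built on. The exclusion patterns and site names are made-up placeholders, and a real filter would load its rules from the plugin's configuration file rather than hard-coding them:

    import dk.brics.automaton.Automaton;
    import dk.brics.automaton.RegExp;
    import dk.brics.automaton.RunAutomaton;

    public class AutomatonFilterSketch {
        public static void main(String[] args) {
            // Hypothetical per-site exclusions, unioned into one expression.
            String excluded =
                "http://(www\\.)?(spamsite1|spamsite2)\\.example\\.com/.*";

            // Building the automaton can be slow, but it is done once and
            // the deterministic matcher is reused for every URL.
            Automaton a = new RegExp(excluded).toAutomaton();
            RunAutomaton matcher = new RunAutomaton(a);

            // Matching is then linear in the length of the URL, with no
            // backtracking, no matter how many sites were unioned in.
            String url = "http://www.spamsite1.example.com/some/page.html";
            System.out.println((matcher.run(url) ? "reject: " : "accept: ") + url);
        }
    }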

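The group-equality construct Kirby warns about is the backreference: the engine must guess how much text the group consumed and retry other splits on failure, which is what breaks the linear-time guarantee. A small, hypothetical java.util.regex illustration:

    import java.util.regex.Pattern;

    public class BackrefDemo {
        public static void main(String[] args) {
            // \1 is a backreference: group 1 must be matched again verbatim,
            // so the engine may backtrack over the possible group lengths.
            Pattern backref = Pattern.compile("(a+)\\1");

            // No backreference: this pattern is automaton-friendly and
            // can be matched in a single linear pass.
            Pattern plain = Pattern.compile("a+");

            String input = "aaaaaaaaaa"; // ten 'a' characters
            System.out.println(backref.matcher(input).matches()); // true (5 + 5)
            System.out.println(plain.matcher(input).matches());   // true
        }
    }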

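For context on the original question, per-site exclusions accumulate in conf/regex-urlfilter.txt roughly as below (the site names are placeholders). Rules are tried top-down and the first match decides, so every URL pays the cost of each rule above the one that finally matches:

    # Excerpt of a hypothetical conf/regex-urlfilter.txt.
    # '-' rejects the URL, '+' accepts it; the first matching rule wins.
    -^http://(www\.)?spamsite1\.example\.com/
    -^http://(www\.)?spamsite2\.example\.com/
    # accept anything else
    +.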