Yes, I remember reading that a few years ago. But frankly, I can't design such a finite automaton by hand, and it will be ever-changing, by the way.
Even adding regexes by hand is most likely a daunting task for me.

2011/6/2 Kirby Bohling <[email protected]>

> From what I remember of earlier advice, you really want to use the
> Automaton filter if at all possible, rather than a series of straight
> regexes. Using the Automaton should be linear with respect to the
> number of characters in the URL. Building the actual automaton could
> be fairly time consuming, but as you'll be reusing it often, it is
> likely worth the cost.
>
> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/urlfilter/automaton/package-summary.html
>
> A series of Java regexes should also be linear in the number of
> characters in the URL, assuming you avoid specific constructs (the
> primary culprit is the construct that causes backtracking, where the
> engine effectively tries to ensure that one group/subgroup is equal
> to a later group/subgroup). Each regex adds to the constant multiple
> in front of the number of characters.
>
> I've used the Automaton library, and it works well if you can stay
> within its limitations (it is a classic regex matcher with limited
> operators relative to, say, Perl 5 Compatible Regular Expressions).
>
> I don't have any practical experience with Nutch for a large-scale
> crawl, but based on my experience using regular expressions and the
> Automaton library, I know it is much faster. I recall Andrej talking
> about it being much faster. It might also be worthwhile for Nutch to
> look into Lucene's optimized versions of Automaton (they ported over
> several critical operations for use in Lucene's fuzzy matching when
> computing the Levenshtein distance).
>
> I can't seem to find the thread where I saw that advice given, but
> you can see the thread where they discuss adding the Automaton URL
> filter back in Nutch 0.8; it seems to agree with my experience in
> using both.
>
> http://lucene.472066.n3.nabble.com/Much-faster-RegExp-lib-needed-in-nutch-td623308.html
>
> Kirby
>
> On Thu, Jun 2, 2011 at 2:42 PM, MilleBii <[email protected]> wrote:
> > What will be the impact of a growing, big regex-urlfilter?
> >
> > I ask this because there are more and more sites that I want to
> > filter out; it will limit the # of unnecessary pages at the cost of
> > lots of URL verification.
> > Side question: since I already have pages from those sites in the
> > crawldb, will they ever be removed? What would be the method to
> > remove them?
> >
> > --
> > -MilleBii-

--
-MilleBii-
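To make Kirby's point concrete, here is a minimal sketch using the dk.brics.automaton library that the Nutch automaton filter is built on. The exclusion patterns and site names are made-up placeholders, and a real filter would load its rules from the plugin's configuration file rather than hard-coding them:

    import dk.brics.automaton.Automaton;
    import dk.brics.automaton.RegExp;
    import dk.brics.automaton.RunAutomaton;

    public class AutomatonFilterSketch {
        public static void main(String[] args) {
            // Hypothetical per-site exclusions, unioned into one expression.
            String excluded =
                "http://(www\\.)?(spamsite1|spamsite2)\\.example\\.com/.*";

            // Building the automaton can be slow, but it is done once and
            // the deterministic matcher is reused for every URL.
            Automaton a = new RegExp(excluded).toAutomaton();
            RunAutomaton matcher = new RunAutomaton(a);

            // Matching is then linear in the length of the URL, with no
            // backtracking, no matter how many sites were unioned in.
            String url = "http://www.spamsite1.example.com/some/page.html";
            System.out.println((matcher.run(url) ? "reject: " : "accept: ") + url);
        }
    }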

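The group-equality construct Kirby warns about is the backreference: the engine must guess how much text the group consumed and retry other splits on failure, which is what breaks the linear-time guarantee. A small, hypothetical java.util.regex illustration:

    import java.util.regex.Pattern;

    public class BackrefDemo {
        public static void main(String[] args) {
            // \1 is a backreference: group 1 must be matched again verbatim,
            // so the engine may backtrack over the possible group lengths.
            Pattern backref = Pattern.compile("(a+)\\1");

            // No backreference: this pattern is automaton-friendly and
            // can be matched in a single linear pass.
            Pattern plain = Pattern.compile("a+");

            String input = "aaaaaaaaaa"; // ten 'a' characters
            System.out.println(backref.matcher(input).matches()); // true (5 + 5)
            System.out.println(plain.matcher(input).matches());   // true
        }
    }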

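For context on the original question, per-site exclusions accumulate in conf/regex-urlfilter.txt roughly as below (the site names are placeholders). Rules are tried top-down and the first match decides, so every URL pays the cost of each rule above the one that finally matches:

    # Excerpt of a hypothetical conf/regex-urlfilter.txt.
    # '-' rejects the URL, '+' accepts it; the first matching rule wins.
    -^http://(www\.)?spamsite1\.example\.com/
    -^http://(www\.)?spamsite2\.example\.com/
    # accept anything else
    +.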