The underlying Automaton documentation is here: http://www.brics.dk/automaton/doc/index.html?dk/brics/automaton/RegExp.html
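
For context, here is a minimal sketch of how that RegExp class is typically
driven (a hypothetical standalone example with a made-up class name, pattern
and URLs; the dk.brics.automaton calls are the library API as documented
above, not the actual Nutch plugin code):

    import dk.brics.automaton.Automaton;
    import dk.brics.automaton.RegExp;
    import dk.brics.automaton.RunAutomaton;

    public class AutomatonFilterSketch {
        public static void main(String[] args) {
            // Parse the (restricted) regular expression into an automaton once...
            RegExp regExp = new RegExp("http://(www\\.)?domain\\..*");
            Automaton automaton = regExp.toAutomaton();

            // ...then compile it into a table-driven matcher for repeated use.
            RunAutomaton matcher = new RunAutomaton(automaton);

            // Each check is a single left-to-right pass over the URL's characters.
            System.out.println(matcher.run("http://www.domain.com/some/page")); // true
            System.out.println(matcher.run("http://other.example.org/"));       // false
        }
    }

The cost is in parsing the expression and building the transition table up
front; after that, run() does no backtracking at all.
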
I am pretty sure they just extract the RegExp and hand it to that library,
and don't do any pre-processing or parsing of it in Nutch, from a quick scan
of the code.

Kirby

On Sat, Jun 4, 2011 at 4:45 AM, MilleBii <[email protected]> wrote:
> I could not find any documentation on the syntax the automaton URL filter
> accepts?
> Any ideas? I will update the wiki accordingly.
>
> 2011/6/4 MilleBii <[email protected]>
>
>> The regexes were not optimized against backtracking, as I did not know
>> about it. A typical one looked like this:
>>
>> -.*\.domain\.*.*
>>
>> I guess something like this would be better and give less backtracking:
>>
>> http:\/\/www\.domain\..*
>>
>> As for the automaton, no reason a priori, since I really never looked at
>> it. Does it use TRIE-like pattern matching? That would be very fast and
>> appropriate, I guess.
>> I will have a go at it and see how it helps. Thx.
>>
>> 2011/6/4 Julien Nioche <[email protected]>
>>
>>> As Kirby pointed out, the automaton-based filter should be far more
>>> efficient; plus, its syntax is more restricted than the regex one, but
>>> not dissimilar. What do your filters look like? Any reason why you
>>> can't use the automaton instead?
>>>
>>> Julien
>>>
>>> On 4 June 2011 09:44, MilleBii <[email protected]> wrote:
>>>
>>> > Just for the record, the impact can be very, very bad if you add too
>>> > many regexes. I just finished a test and the generate step alone
>>> > became a factor of 20 slower after adding 30 or so regexes to the
>>> > filter. So beware.
>>> >
>>> > 2011/6/3 MilleBii <[email protected]>
>>> >
>>> > > Indeed, I'm running a vertical search engine too; however, I want
>>> > > to improve it on several fronts.
>>> > > Scoring is one way, but it does not prevent uninteresting content
>>> > > from creeping into the crawldb, which eventually grows too big,
>>> > > wasting resources for nothing.
>>> > >
>>> > > Filtering is another way, at the cost of a lot of regexes, hence
>>> > > this question.
>>> > >
>>> > > Third, I see crawldb pruning: you want to ditch all URLs that are
>>> > > below a certain score. That is a question I asked a long time ago,
>>> > > and the answer was "write your own mapred job for that", a bit too
>>> > > far-fetched for me so far.
>>> > >
>>> > > What would be ideal for me is to be able to extract properties
>>> > > about pages/URLs in whatever phase (scoring, indexing) and be able
>>> > > to use those properties during the generate phase as a kind of
>>> > > feedback loop. It is a real pain to be forced to try to merge this
>>> > > information into a score.
>>> > >
>>> > > 2011/6/2 Kirby Bohling <[email protected]>
>>> > >
>>> > >> I see from your e-mails that you are modifying the scoring
>>> > >> algorithm; the only other option I see is to write a scoring
>>> > >> algorithm which detects that this is content you don't want to
>>> > >> crawl and lowers the score. As I recall, links with the highest
>>> > >> score are crawled first, so in the end that might be easier. That
>>> > >> sounds like writing a vertical search engine of some type (either
>>> > >> that, or a spam detector with your personal/custom definition of
>>> > >> spam).
>>> > >>
>>> > >> I know several people on this list or the dev list are writing
>>> > >> vertical search engines; maybe they would have more thoughts or
>>> > >> info.
>>> > >>
>>> > >> Kirby
>>> > >>
>>> > >> On Thu, Jun 2, 2011 at 3:47 PM, MilleBii <[email protected]> wrote:
>>> > >> > Yes, I remember reading that a few years ago.
>>> > >> > But frankly, I can't design such a finite automaton by hand, and
>>> > >> > it will be ever-changing anyway.
>>> > >> >
>>> > >> > Even adding regexes by hand is most likely a daunting task for me.
>>> > >> >
>>> > >> > 2011/6/2 Kirby Bohling <[email protected]>
>>> > >> >
>>> > >> >> From what I remember of earlier advice, you really want to use
>>> > >> >> the Automaton filter if at all possible, rather than a series
>>> > >> >> of straight regexes. Using the Automaton should be linear with
>>> > >> >> respect to the number of characters in the URL. Building the
>>> > >> >> actual automaton could be fairly time consuming, but as you'll
>>> > >> >> be reusing it often, it is likely worth the cost.
>>> > >> >>
>>> > >> >> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/urlfilter/automaton/package-summary.html
>>> > >> >>
>>> > >> >> A series of Java regexes should also be linear in the number of
>>> > >> >> characters in the URL, assuming you avoid specific constructs
>>> > >> >> (the ones that cause backtracking; backreferences, where the
>>> > >> >> engine effectively has to ensure that one group/subgroup is
>>> > >> >> equal to a later group/subgroup, are the primary culprit). Each
>>> > >> >> regex will add to the constant multiple in front of the number
>>> > >> >> of characters.
>>> > >> >>
>>> > >> >> I've used the Automaton library, and it works well if you can
>>> > >> >> live within its limitations (it is a classic regex matcher with
>>> > >> >> limited operators relative to, say, Perl 5 Compatible Regular
>>> > >> >> Expressions).
>>> > >> >>
>>> > >> >> I don't have any practical experience with Nutch for a
>>> > >> >> large-scale crawl, but based upon my experience with using
>>> > >> >> regular expressions and the Automaton library, I know it is
>>> > >> >> much faster. I recall Andrzej talking about it being much
>>> > >> >> faster. It might also be worthwhile for Nutch to look into
>>> > >> >> Lucene's optimized versions of Automaton (they ported over
>>> > >> >> several critical operations for use in Lucene and the fuzzy
>>> > >> >> matching when computing the Levenshtein distance).
>>> > >> >>
>>> > >> >> I can't seem to find the thread where I saw that advice given,
>>> > >> >> but you can see the thread where they discuss adding the
>>> > >> >> Automaton URL filter back in Nutch 0.8, and it seems to agree
>>> > >> >> with my experience in using both:
>>> > >> >>
>>> > >> >> http://lucene.472066.n3.nabble.com/Much-faster-RegExp-lib-needed-in-nutch-td623308.html
>>> > >> >>
>>> > >> >> Kirby
>>> > >> >>
>>> > >> >> On Thu, Jun 2, 2011 at 2:42 PM, MilleBii <[email protected]> wrote:
>>> > >> >> > What will be the impact of a regex-urlfilter that keeps
>>> > >> >> > growing big?
>>> > >> >> >
>>> > >> >> > I ask this because there are more and more sites that I want
>>> > >> >> > to filter out; it would limit the number of unnecessary
>>> > >> >> > pages, at the cost of a lot of URL verification.
>>> > >> >> > Side question: since I already have pages from those sites in
>>> > >> >> > the crawldb, will they ever be removed? What would be the
>>> > >> >> > method to remove them?
>>> > >> >> >
>>> > >> >> > --
>>> > >> >> > -MilleBii-
>>> > >> >
>>> > >> > --
>>> > >> > -MilleBii-
>>> > >
>>> > > --
>>> > > -MilleBii-
>>> >
>>> > --
>>> > -MilleBii-
>>>
>>> --
>>> *
>>> *Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>
>> --
>> -MilleBii-
>
> --
> -MilleBii-
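
Re the backtracking point in the thread above, here is a rough illustrative
micro-check with plain java.util.regex (hypothetical class name, pattern and
URL; this is not Nutch's RegexURLFilter code). Note that the leading '-' on
the filter line quoted in the thread is the regex-urlfilter exclude marker,
not part of the regex itself.

    import java.util.regex.Pattern;

    public class RegexCostSketch {
        public static void main(String[] args) {
            // Unanchored form: the leading ".*" forces the engine to try
            // "\.domain\." at every position of the URL before it can
            // reject a non-matching URL.
            Pattern loose = Pattern.compile(".*\\.domain\\..*");

            // Anchored form: a literal prefix that either matches at the
            // start of the URL or fails within the first few characters.
            Pattern anchored = Pattern.compile("^http://(www\\.)?domain\\..*");

            String url = "http://www.example.org/a/very/long/path/that/never/matches";
            System.out.println(loose.matcher(url).matches());    // false, after scanning the whole URL
            System.out.println(anchored.matcher(url).matches()); // false, almost immediately
        }
    }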
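
And on the observation that adding 30 or so regexes made the generate step
roughly 20 times slower: one appeal of the automaton filter is that many
host patterns can be unioned into a single automaton up front, so every URL
is still checked in one pass regardless of how many sites are excluded. A
rough sketch, again with made-up patterns and assuming the standard
dk.brics.automaton API:

    import dk.brics.automaton.Automaton;
    import dk.brics.automaton.RegExp;
    import dk.brics.automaton.RunAutomaton;

    public class ExcludeListSketch {
        public static void main(String[] args) {
            // Hypothetical exclusion patterns, one per unwanted site.
            String[] patterns = {
                "http://(www\\.)?spam-site\\..*",
                "http://(www\\.)?other-junk\\..*",
                "http://ads\\..*"
            };

            // Union all patterns into one automaton; this work is done once.
            Automaton union = Automaton.makeEmpty();
            for (String p : patterns) {
                union = union.union(new RegExp(p).toAutomaton());
            }
            union.minimize(); // shrink the state table

            RunAutomaton exclude = new RunAutomaton(union);

            // Each URL is then checked in a single pass, independent of how
            // many patterns went into the union.
            String url = "http://www.spam-site.com/page";
            System.out.println(exclude.run(url) ? "reject " + url : "accept " + url);
        }
    }
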

