The underlying Automaton documentation is here: http://www.brics.dk/automaton/doc/index.html?dk/brics/automaton/RegExp.html
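
For context, here is a minimal sketch of how that RegExp class is typically
driven (a hypothetical standalone example with a made-up class name, pattern
and URLs; the dk.brics.automaton calls are the library API as documented
above, not the actual Nutch plugin code):

    import dk.brics.automaton.Automaton;
    import dk.brics.automaton.RegExp;
    import dk.brics.automaton.RunAutomaton;

    public class AutomatonFilterSketch {
        public static void main(String[] args) {
            // Parse the (restricted) regular expression into an automaton once...
            RegExp regExp = new RegExp("http://(www\\.)?domain\\..*");
            Automaton automaton = regExp.toAutomaton();

            // ...then compile it into a table-driven matcher for repeated use.
            RunAutomaton matcher = new RunAutomaton(automaton);

            // Each check is a single left-to-right pass over the URL's characters.
            System.out.println(matcher.run("http://www.domain.com/some/page")); // true
            System.out.println(matcher.run("http://other.example.org/"));       // false
        }
    }

The cost is in parsing the expression and building the transition table up
front; after that, run() does no backtracking at all.
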
I am pretty sure they just extract the RegExp and hand it to that library,
and don't do any pre-processing or parsing of it in Nutch, from a quick scan
of the code.

Kirby

On Sat, Jun 4, 2011 at 4:45 AM, MilleBii <[email protected]> wrote:
> I could not find any documentation on the syntax the automaton URL filter
> accepts?
> Any ideas? I will update the wiki accordingly.
>
> 2011/6/4 MilleBii <[email protected]>
>
>> The regexes were not optimized against backtracking, as I did not know
>> about it. A typical one looked like this:
>>
>> -.*\.domain\.*.*
>>
>> I guess something like this would be better and give less backtracking:
>>
>> http:\/\/www\.domain\..*
>>
>> As for the automaton, no reason a priori, since I really never looked at
>> it. Does it use TRIE-like pattern matching? That would be very fast and
>> appropriate, I guess.
>> I will have a go at it and see how it helps. Thx.
>>
>> 2011/6/4 Julien Nioche <[email protected]>
>>
>>> As Kirby pointed out, the automaton-based filter should be far more
>>> efficient; plus, its syntax is more restricted than the regex one, but
>>> not dissimilar. What do your filters look like? Any reason why you
>>> can't use the automaton instead?
>>>
>>> Julien
>>>
>>> On 4 June 2011 09:44, MilleBii <[email protected]> wrote:
>>>
>>> > Just for the record, the impact can be very, very bad if you add too
>>> > many regexes. I just finished a test and the generate step alone
>>> > became a factor of 20 slower after adding 30 or so regexes to the
>>> > filter. So beware.
>>> >
>>> > 2011/6/3 MilleBii <[email protected]>
>>> >
>>> > > Indeed, I'm running a vertical search engine too; however, I want
>>> > > to improve it on several fronts.
>>> > > Scoring is one way, but it does not prevent uninteresting content
>>> > > from creeping into the crawldb, which eventually grows too big,
>>> > > wasting resources for nothing.
>>> > >
>>> > > Filtering is another way, at the cost of a lot of regexes, hence
>>> > > this question.
>>> > >
>>> > > Third, I see crawldb pruning: you want to ditch all URLs that are
>>> > > below a certain score. That is a question I asked a long time ago,
>>> > > and the answer was "write your own mapred job for that", a bit too
>>> > > far-fetched for me so far.
>>> > >
>>> > > What would be ideal for me is to be able to extract properties
>>> > > about pages/URLs in whatever phase (scoring, indexing) and be able
>>> > > to use those properties during the generate phase as a kind of
>>> > > feedback loop. It is a real pain to be forced to try to merge this
>>> > > information into a score.
>>> > >
>>> > > 2011/6/2 Kirby Bohling <[email protected]>
>>> > >
>>> > >> I see from your e-mails that you are modifying the scoring
>>> > >> algorithm; the only other option I see is to write a scoring
>>> > >> algorithm which detects that this is content you don't want to
>>> > >> crawl and lowers the score. As I recall, links with the highest
>>> > >> score are crawled first, so in the end that might be easier. That
>>> > >> sounds like writing a vertical search engine of some type (either
>>> > >> that, or a spam detector with your personal/custom definition of
>>> > >> spam).
>>> > >>
>>> > >> I know several people on this list or the dev list are writing
>>> > >> vertical search engines; maybe they would have more thoughts or
>>> > >> info.
>>> > >>
>>> > >> Kirby
>>> > >>
>>> > >> On Thu, Jun 2, 2011 at 3:47 PM, MilleBii <[email protected]> wrote:
>>> > >> > Yes, I remember reading that a few years ago.
>>> > >> > But frankly, I can't design such a finite automaton by hand, and
>>> > >> > it will be ever-changing anyway.
>>> > >> >
>>> > >> > Even adding regexes by hand is most likely a daunting task for me.
>>> > >> >
>>> > >> > 2011/6/2 Kirby Bohling <[email protected]>
>>> > >> >
>>> > >> >> From what I remember of earlier advice, you really want to use
>>> > >> >> the Automaton filter if at all possible, rather than a series
>>> > >> >> of straight regexes. Using the Automaton should be linear with
>>> > >> >> respect to the number of characters in the URL. Building the
>>> > >> >> actual automaton could be fairly time consuming, but as you'll
>>> > >> >> be reusing it often, it is likely worth the cost.
>>> > >> >>
>>> > >> >> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/urlfilter/automaton/package-summary.html
>>> > >> >>
>>> > >> >> A series of Java regexes should also be linear in the number of
>>> > >> >> characters in the URL, assuming you avoid specific constructs
>>> > >> >> (the ones that cause backtracking; backreferences, where the
>>> > >> >> engine effectively has to ensure that one group/subgroup is
>>> > >> >> equal to a later group/subgroup, are the primary culprit). Each
>>> > >> >> regex will add to the constant multiple in front of the number
>>> > >> >> of characters.
>>> > >> >>
>>> > >> >> I've used the Automaton library, and it works well if you can
>>> > >> >> live within its limitations (it is a classic regex matcher with
>>> > >> >> limited operators relative to, say, Perl 5 Compatible Regular
>>> > >> >> Expressions).
>>> > >> >>
>>> > >> >> I don't have any practical experience with Nutch for a
>>> > >> >> large-scale crawl, but based upon my experience with using
>>> > >> >> regular expressions and the Automaton library, I know it is
>>> > >> >> much faster. I recall Andrzej talking about it being much
>>> > >> >> faster. It might also be worthwhile for Nutch to look into
>>> > >> >> Lucene's optimized versions of Automaton (they ported over
>>> > >> >> several critical operations for use in Lucene and the fuzzy
>>> > >> >> matching when computing the Levenshtein distance).
>>> > >> >>
>>> > >> >> I can't seem to find the thread where I saw that advice given,
>>> > >> >> but you can see the thread where they discuss adding the
>>> > >> >> Automaton URL filter back in Nutch 0.8, and it seems to agree
>>> > >> >> with my experience in using both:
>>> > >> >>
>>> > >> >> http://lucene.472066.n3.nabble.com/Much-faster-RegExp-lib-needed-in-nutch-td623308.html
>>> > >> >>
>>> > >> >> Kirby
>>> > >> >>
>>> > >> >> On Thu, Jun 2, 2011 at 2:42 PM, MilleBii <[email protected]> wrote:
>>> > >> >> > What will be the impact of a regex-urlfilter that keeps
>>> > >> >> > growing big?
>>> > >> >> >
>>> > >> >> > I ask this because there are more and more sites that I want
>>> > >> >> > to filter out; it would limit the number of unnecessary
>>> > >> >> > pages, at the cost of a lot of URL verification.
>>> > >> >> > Side question: since I already have pages from those sites in
>>> > >> >> > the crawldb, will they ever be removed? What would be the
>>> > >> >> > method to remove them?
>>> > >> >> >
>>> > >> >> > --
>>> > >> >> > -MilleBii-
>>> > >> >
>>> > >> > --
>>> > >> > -MilleBii-
>>> > >
>>> > > --
>>> > > -MilleBii-
>>> >
>>> > --
>>> > -MilleBii-
>>>
>>> --
>>> *
>>> *Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>
>> --
>> -MilleBii-
>
> --
> -MilleBii-
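
Re the backtracking point in the thread above, here is a rough illustrative
micro-check with plain java.util.regex (hypothetical class name, pattern and
URL; this is not Nutch's RegexURLFilter code). Note that the leading '-' on
the filter line quoted in the thread is the regex-urlfilter exclude marker,
not part of the regex itself.

    import java.util.regex.Pattern;

    public class RegexCostSketch {
        public static void main(String[] args) {
            // Unanchored form: the leading ".*" forces the engine to try
            // "\.domain\." at every position of the URL before it can
            // reject a non-matching URL.
            Pattern loose = Pattern.compile(".*\\.domain\\..*");

            // Anchored form: a literal prefix that either matches at the
            // start of the URL or fails within the first few characters.
            Pattern anchored = Pattern.compile("^http://(www\\.)?domain\\..*");

            String url = "http://www.example.org/a/very/long/path/that/never/matches";
            System.out.println(loose.matcher(url).matches());    // false, after scanning the whole URL
            System.out.println(anchored.matcher(url).matches()); // false, almost immediately
        }
    }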
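
And on the observation that adding 30 or so regexes made the generate step
roughly 20 times slower: one appeal of the automaton filter is that many
host patterns can be unioned into a single automaton up front, so every URL
is still checked in one pass regardless of how many sites are excluded. A
rough sketch, again with made-up patterns and assuming the standard
dk.brics.automaton API:

    import dk.brics.automaton.Automaton;
    import dk.brics.automaton.RegExp;
    import dk.brics.automaton.RunAutomaton;

    public class ExcludeListSketch {
        public static void main(String[] args) {
            // Hypothetical exclusion patterns, one per unwanted site.
            String[] patterns = {
                "http://(www\\.)?spam-site\\..*",
                "http://(www\\.)?other-junk\\..*",
                "http://ads\\..*"
            };

            // Union all patterns into one automaton; this work is done once.
            Automaton union = Automaton.makeEmpty();
            for (String p : patterns) {
                union = union.union(new RegExp(p).toAutomaton());
            }
            union.minimize(); // shrink the state table

            RunAutomaton exclude = new RunAutomaton(union);

            // Each URL is then checked in a single pass, independent of how
            // many patterns went into the union.
            String url = "http://www.spam-site.com/page";
            System.out.println(exclude.run(url) ? "reject " + url : "accept " + url);
        }
    }
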

