I see from your e-mails that you are modifying the scoring algorithm; the only other option I see is to write a scoring algorithm which detects that this is content you don't want to crawl and lowers the score. As I recall, links with the highest score are crawled first, so in the end that might be easier. That sounds like writing a vertical search engine of some type (either that, or a spam detector with your personal/custom definition of spam).
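To make the score-demotion idea concrete, here is a rough standalone sketch (this is not the actual Nutch ScoringFilter plugin interface, which has several more methods to implement; the class name, method name, host list, and demotion factor below are all invented for illustration). The idea is just to multiply the score down for hosts you have decided you don't want, so the generator sorts those URLs to the back of the fetch list:

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Illustration only: a real Nutch scoring plugin would implement
    // org.apache.nutch.scoring.ScoringFilter; the names here are invented.
    public class DemotingScorerSketch {

        // Hosts whose pages we consider unwanted for this crawl (hypothetical).
        private final Set<String> unwantedHosts = new HashSet<String>(
            Arrays.asList("spam-site-a.com", "spam-site-b.net"));

        // Multiply the score down hard so these URLs sort to the back of the
        // fetch list (highest score is fetched first).
        private static final float DEMOTION_FACTOR = 0.001f;

        public float adjustSortValue(String url, float currentScore) {
            try {
                String host = new URL(url).getHost().toLowerCase();
                if (unwantedHosts.contains(host)) {
                    return currentScore * DEMOTION_FACTOR;
                }
            } catch (MalformedURLException e) {
                // Unparseable URLs get demoted as well.
                return currentScore * DEMOTION_FACTOR;
            }
            return currentScore;
        }

        public static void main(String[] args) {
            DemotingScorerSketch scorer = new DemotingScorerSketch();
            System.out.println(scorer.adjustSortValue("http://spam-site-a.com/x", 1.0f)); // 0.001
            System.out.println(scorer.adjustSortValue("http://example.org/y", 1.0f));     // 1.0
        }
    }

In a real plugin you would wire this logic into the scoring hook Nutch calls when sorting the fetch list, and load the host list from configuration rather than hard-coding it.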
I know several people on this list or the dev list are writing vertical
search engines; maybe they would have more thoughts or info. I've also put
a rough sketch of driving the Automaton library directly at the bottom of
this message, below the quoted thread.

Kirby

On Thu, Jun 2, 2011 at 3:47 PM, MilleBii <[email protected]> wrote:
> Yes, I remember reading that a few years ago.
> But frankly I can't design such a finite automaton by hand, and it
> will be ever changing, by the way.
>
> Even adding regexes by hand is most likely a daunting task for me.
>
> 2011/6/2 Kirby Bohling <[email protected]>
>
>> From what I remember of earlier advice, you really want to use the
>> Automaton filter if at all possible, rather than a series of straight
>> regexes. Using the Automaton should be linear with respect to the
>> number of characters in the URL. Building the actual automaton could
>> be fairly time consuming, but as you'll be re-using it often, it is
>> likely worth the cost.
>>
>> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/urlfilter/automaton/package-summary.html
>>
>> A series of Java regexes should also be linear in the number of
>> characters in the URL, assuming you avoid the constructs that cause
>> backtracking (back-references, where the engine has to check that one
>> group/subgroup is equal to a later group/subgroup, are the primary
>> culprit). Each regex adds to the constant multiple in front of the
>> number of characters.
>>
>> I've used the Automaton library, and it works well if you can stay
>> within its limitations (it is a classic regex matcher with limited
>> operators relative to, say, Perl 5 Compatible Regular Expressions).
>>
>> I don't have any practical experience with Nutch for a large-scale
>> crawl, but based upon my experience with regular expressions and the
>> Automaton library, I know it is much faster. I recall Andrej talking
>> about it being much faster. It might also be worthwhile for Nutch to
>> look into Lucene's optimized versions of Automaton (they ported over
>> several critical operations for use in Lucene and the fuzzy matching
>> when computing the Levenshtein distance).
>>
>> I can't seem to find the thread where I saw that advice given, but
>> you can see the thread where they discuss adding the Automaton URL
>> filter back in Nutch 0.8, and it seems to agree with my experience
>> using both.
>>
>> http://lucene.472066.n3.nabble.com/Much-faster-RegExp-lib-needed-in-nutch-td623308.html
>>
>> Kirby
>>
>>
>> On Thu, Jun 2, 2011 at 2:42 PM, MilleBii <[email protected]> wrote:
>> > What will be the impact of a growing, big regex-urlfilter?
>> >
>> > I ask this because there are more & more sites that I want to
>> > filter out; it will limit the # of unnecessary pages at the cost of
>> > lots of URL verification.
>> > Side question: since I already have pages from those sites in the
>> > crawldb, will they ever be removed? What would be the method to
>> > remove them?
>> >
>> > --
>> > -MilleBii-
>> >
>
>
> --
> -MilleBii-
>
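As promised above, here is a rough sketch of combining many exclude patterns into a single dk.brics automaton and running it against each URL, so matching stays linear in the URL length no matter how many patterns you add. The class name, the example patterns, and the null-means-drop convention (borrowed from Nutch's URLFilter interface) are mine; the actual urlfilter-automaton plugin reads its patterns from a rules file instead:

    import java.util.Arrays;
    import java.util.List;

    import dk.brics.automaton.Automaton;
    import dk.brics.automaton.BasicOperations;
    import dk.brics.automaton.RegExp;
    import dk.brics.automaton.RunAutomaton;

    // Rough sketch only; it just shows the library usage, not the plugin.
    public class AutomatonExcludeFilterSketch {

        private final RunAutomaton matcher;

        public AutomatonExcludeFilterSketch(List<String> excludePatterns) {
            // Union all exclude patterns into one automaton, then compile it
            // into a RunAutomaton so each URL is checked in O(length) time.
            Automaton union = Automaton.makeEmpty();
            for (String pattern : excludePatterns) {
                union = BasicOperations.union(union, new RegExp(pattern).toAutomaton());
            }
            union.determinize();
            this.matcher = new RunAutomaton(union);
        }

        // Same convention as Nutch URL filters: return null to drop the URL,
        // or the URL itself to keep it.
        public String filter(String url) {
            return matcher.run(url) ? null : url;
        }

        public static void main(String[] args) {
            // Hypothetical sites to exclude from the crawl.
            AutomatonExcludeFilterSketch f = new AutomatonExcludeFilterSketch(Arrays.asList(
                "http://(www\\.)?spam-site-a\\.com/.*",
                "http://(www\\.)?spam-site-b\\.net/.*"));

            System.out.println(f.filter("http://www.spam-site-a.com/page.html")); // null (dropped)
            System.out.println(f.filter("http://example.org/index.html"));        // kept
        }
    }

The up-front cost is building and determinizing the union automaton, but, as noted in the quoted thread, you pay that once and re-use the matcher for every URL.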

