Indeed, I'm running a vertical search engine too; however, I want to improve it on several fronts. Scoring is one way, but it does not prevent uninteresting content from creeping into the crawldb, which eventually grows too big, wasting resources for nothing.
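To make that scoring route concrete, here is a minimal standalone sketch of demoting the score of URLs from hosts considered uninteresting, so the generator picks them last; the class name, host list and demotion factor are all hypothetical, and this is not a drop-in Nutch ScoringFilter, only the core decision one would wire into whatever scoring hook is used:

import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical sketch: demote the score of URLs from unwanted hosts so the
 * generator sorts them to the back of the fetch queue. Not a Nutch
 * ScoringFilter implementation; only the core decision is shown.
 */
public class ScoreDemoter {

    // Illustrative blacklist; in practice this would come from configuration.
    private static final List<String> UNWANTED_HOSTS =
        Arrays.asList("spam-example.com", "junk-example.net");

    /** Returns the (possibly demoted) score to use when sorting fetch candidates. */
    public static float adjust(String url, float currentScore) {
        for (String host : UNWANTED_HOSTS) {
            if (url.contains(host)) {
                return currentScore * 0.01f; // push it to the back of the queue
            }
        }
        return currentScore;
    }

    public static void main(String[] args) {
        System.out.println(adjust("http://spam-example.com/page", 1.0f)); // 0.01
        System.out.println(adjust("http://good-example.org/page", 1.0f)); // 1.0
    }
}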
Filtering is another way, at the cost of a lot of regexes, hence this question. Third, I see crawldb pruning: you ditch all URLs that are below a certain score. That is a question I asked a long time ago, and the answer was "write your own mapred for that", a bit too far-fetched for me so far.
What would be ideal for me is to be able to extract properties about pages/URLs in whatever phase (scoring, indexing) and to use those properties during the generate phase as a kind of feedback loop. It is a real pain to be forced to merge all this information into a single score.

2011/6/2 Kirby Bohling <[email protected]>

> I see from your e-mails that you are modifying the scoring algorithm.
> The only other option I see is to write a scoring algorithm which
> detects that this is content you don't want to crawl, and lowers the
> score. As I recall, links with the highest score are crawled first,
> so in the end that might be easier. Which sounds like it'd be writing
> a vertical search engine of some type (either that, or a spam
> detector with your personal/custom definition of spam).
>
> I know several people on this list or the dev list are writing
> vertical search engines; maybe they would have more thoughts or info.
>
> Kirby
>
> On Thu, Jun 2, 2011 at 3:47 PM, MilleBii <[email protected]> wrote:
> > Yes, I remember reading that a few years ago.
> > But frankly, I can't design such a finite automaton by hand, one
> > which will be ever-changing, by the way.
> >
> > Even adding regexes by hand is most likely a daunting task for me.
> >
> > 2011/6/2 Kirby Bohling <[email protected]>
> >
> >> From what I remember of earlier advice, you really want to use the
> >> Automaton filter if at all possible, rather than a series of straight
> >> regexes. Using the Automaton should be linear with respect to the
> >> number of characters in the URL. Building the actual automaton could
> >> be fairly time-consuming, but as you'll be re-using it often, it is
> >> likely worth the cost.
> >>
> >> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/urlfilter/automaton/package-summary.html
> >>
> >> A series of Java regexes should also be linear in the number of
> >> characters in the URL, assuming you avoid the specific constructs
> >> that cause backtracking (back-references, where the regex has to
> >> check that one group/subgroup equals a later group/subgroup, are the
> >> primary culprit). Each regex, though, adds to the constant multiple
> >> in front of the number of characters.
> >>
> >> I've used the Automaton library, and it works well if you can live
> >> within its limitations (it is a classic regex matcher with limited
> >> operators relative to, say, Perl 5 Compatible Regular Expressions).
> >>
> >> I don't have any practical experience with Nutch for a large-scale
> >> crawl, but based on my experience with regular expressions and the
> >> Automaton library, I know it is much faster. I recall Andrzej talking
> >> about it being much faster. It might also be worthwhile for Nutch to
> >> look into Lucene's optimized versions of Automaton (they ported over
> >> several critical operations for use in Lucene and the fuzzy matching
> >> when computing the Levenshtein distance).
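As a rough illustration of the advice above, assuming the dk.brics.automaton library (the engine behind Nutch's urlfilter-automaton plugin), a large set of exclusion patterns can be compiled once into a single automaton and then matched in time linear in the URL length; the patterns and class name below are purely illustrative:

import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

/**
 * Sketch: union many exclusion patterns into one deterministic automaton
 * and reuse it for every URL. Matching cost does not grow with the number
 * of patterns, only with the length of the URL being tested.
 */
public class CombinedUrlFilter {

    private final RunAutomaton matcher;

    public CombinedUrlFilter(String... patterns) {
        // Join the patterns with '|' (union), compile once, reuse everywhere.
        Automaton a = new RegExp(String.join("|", patterns)).toAutomaton();
        a.determinize(); // deterministic automaton => linear-time matching
        this.matcher = new RunAutomaton(a);
    }

    /** True if the URL matches one of the exclusion patterns. */
    public boolean isExcluded(String url) {
        return matcher.run(url);
    }

    public static void main(String[] args) {
        CombinedUrlFilter filter = new CombinedUrlFilter(
            "http://(www\\.)?spam-example\\.com/.*",
            "http://.*\\.junk-example\\.net/.*");
        System.out.println(filter.isExcluded("http://www.spam-example.com/x")); // true
        System.out.println(filter.isExcluded("http://good-example.org/page"));  // false
    }
}

The plain java.util.regex alternative would loop over a list of compiled Pattern objects, so every added rule adds to the per-URL constant Kirby mentions, while the combined automaton keeps a single pass per URL.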
> >> I can't seem to find the thread where I saw that advice given, but
> >> you can see the thread where they discuss adding the Automaton URL
> >> filter back in Nutch 0.8, and it seems to agree with my experience
> >> in using both.
> >>
> >> http://lucene.472066.n3.nabble.com/Much-faster-RegExp-lib-needed-in-nutch-td623308.html
> >>
> >> Kirby
> >>
> >> On Thu, Jun 2, 2011 at 2:42 PM, MilleBii <[email protected]> wrote:
> >> > What will be the impact of a growing regex-urlfilter?
> >> >
> >> > I ask this because there are more and more sites that I want to
> >> > filter out; it will limit the number of unnecessary pages at the
> >> > cost of a lot of URL verification.
> >> > Side question: since I already have pages from those sites in the
> >> > crawldb, will they ever be removed? What would be the method to
> >> > remove them?
> >> >
> >> > --
> >> > -MilleBii-
> >>
> >
> > --
> > -MilleBii-
>

--
-MilleBii-
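For the crawldb pruning mentioned at the top of the thread ("write your own mapred"), the heart of such a job might look roughly like the map step below; this is only a sketch, assuming Nutch's CrawlDatum and the old Hadoop mapred API, with the job driver, the crawldb input/output formats and the swap of the resulting directory into crawldb/current all left out:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

/**
 * Sketch of a crawldb-pruning map step: copy through every entry whose
 * score is at or above a threshold, drop the rest. The job driver and
 * the output handling are omitted.
 */
public class PruneLowScoreMapper extends MapReduceBase
    implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {

  private float minScore = 0.0f;

  @Override
  public void configure(JobConf job) {
    // Hypothetical property name; use whatever your job driver sets.
    minScore = job.getFloat("prune.min.score", 0.0f);
  }

  @Override
  public void map(Text url, CrawlDatum datum,
                  OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    if (datum.getScore() >= minScore) {
      output.collect(url, datum);   // keep this URL
    }                               // otherwise silently drop it
  }
}

Entries below the threshold simply never reach the output, so the rewritten crawldb no longer carries them into the next generate/fetch cycle.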

