Indeed, I'm running a vertical search engine too; however, I want to improve it on several fronts. Scoring is one way, but it does not prevent uninteresting content from creeping into the crawldb, which eventually grows too big, wasting resources for nothing.
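To make that scoring route concrete, here is a minimal standalone sketch of demoting the score of URLs from hosts considered uninteresting, so the generator picks them last; the class name, host list and demotion factor are all hypothetical, and this is not a drop-in Nutch ScoringFilter, only the core decision one would wire into whatever scoring hook is used:

import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical sketch: demote the score of URLs from unwanted hosts so the
 * generator sorts them to the back of the fetch queue. Not a Nutch
 * ScoringFilter implementation; only the core decision is shown.
 */
public class ScoreDemoter {

    // Illustrative blacklist; in practice this would come from configuration.
    private static final List<String> UNWANTED_HOSTS =
        Arrays.asList("spam-example.com", "junk-example.net");

    /** Returns the (possibly demoted) score to use when sorting fetch candidates. */
    public static float adjust(String url, float currentScore) {
        for (String host : UNWANTED_HOSTS) {
            if (url.contains(host)) {
                return currentScore * 0.01f; // push it to the back of the queue
            }
        }
        return currentScore;
    }

    public static void main(String[] args) {
        System.out.println(adjust("http://spam-example.com/page", 1.0f)); // 0.01
        System.out.println(adjust("http://good-example.org/page", 1.0f)); // 1.0
    }
}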
Filtering is another way, at the cost of a lot of regexes, hence this question. Third, I see crawldb pruning: you ditch all URLs that are below a certain score. That is a question I asked a long time ago, and the answer was "write your own mapred for that", a bit too far-fetched for me so far.
What would be ideal for me is to be able to extract properties about pages/URLs in whatever phase (scoring, indexing) and to use those properties during the generate phase as a kind of feedback loop. It is a real pain to be forced to merge all this information into a single score.

2011/6/2 Kirby Bohling <[email protected]>

> I see from your e-mails that you are modifying the scoring algorithm.
> The only other option I see is to write a scoring algorithm which
> detects that this is content you don't want to crawl, and lowers the
> score. As I recall, links with the highest score are crawled first,
> so in the end that might be easier. Which sounds like it'd be writing
> a vertical search engine of some type (either that, or a spam
> detector with your personal/custom definition of spam).
>
> I know several people on this list or the dev list are writing
> vertical search engines; maybe they would have more thoughts or info.
>
> Kirby
>
> On Thu, Jun 2, 2011 at 3:47 PM, MilleBii <[email protected]> wrote:
> > Yes, I remember reading that a few years ago.
> > But frankly, I can't design such a finite automaton by hand, one
> > which will be ever-changing, by the way.
> >
> > Even adding regexes by hand is most likely a daunting task for me.
> >
> > 2011/6/2 Kirby Bohling <[email protected]>
> >
> >> From what I remember of earlier advice, you really want to use the
> >> Automaton filter if at all possible, rather than a series of straight
> >> regexes. Using the Automaton should be linear with respect to the
> >> number of characters in the URL. Building the actual automaton could
> >> be fairly time-consuming, but as you'll be re-using it often, it is
> >> likely worth the cost.
> >>
> >> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/urlfilter/automaton/package-summary.html
> >>
> >> A series of Java regexes should also be linear in the number of
> >> characters in the URL, assuming you avoid the specific constructs
> >> that cause backtracking (back-references, where the regex has to
> >> check that one group/subgroup equals a later group/subgroup, are the
> >> primary culprit). Each regex, though, adds to the constant multiple
> >> in front of the number of characters.
> >>
> >> I've used the Automaton library, and it works well if you can live
> >> within its limitations (it is a classic regex matcher with limited
> >> operators relative to, say, Perl 5 Compatible Regular Expressions).
> >>
> >> I don't have any practical experience with Nutch for a large-scale
> >> crawl, but based on my experience with regular expressions and the
> >> Automaton library, I know it is much faster. I recall Andrzej talking
> >> about it being much faster. It might also be worthwhile for Nutch to
> >> look into Lucene's optimized versions of Automaton (they ported over
> >> several critical operations for use in Lucene and the fuzzy matching
> >> when computing the Levenshtein distance).
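As a rough illustration of the advice above, assuming the dk.brics.automaton library (the engine behind Nutch's urlfilter-automaton plugin), a large set of exclusion patterns can be compiled once into a single automaton and then matched in time linear in the URL length; the patterns and class name below are purely illustrative:

import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;
import dk.brics.automaton.RunAutomaton;

/**
 * Sketch: union many exclusion patterns into one deterministic automaton
 * and reuse it for every URL. Matching cost does not grow with the number
 * of patterns, only with the length of the URL being tested.
 */
public class CombinedUrlFilter {

    private final RunAutomaton matcher;

    public CombinedUrlFilter(String... patterns) {
        // Join the patterns with '|' (union), compile once, reuse everywhere.
        Automaton a = new RegExp(String.join("|", patterns)).toAutomaton();
        a.determinize(); // deterministic automaton => linear-time matching
        this.matcher = new RunAutomaton(a);
    }

    /** True if the URL matches one of the exclusion patterns. */
    public boolean isExcluded(String url) {
        return matcher.run(url);
    }

    public static void main(String[] args) {
        CombinedUrlFilter filter = new CombinedUrlFilter(
            "http://(www\\.)?spam-example\\.com/.*",
            "http://.*\\.junk-example\\.net/.*");
        System.out.println(filter.isExcluded("http://www.spam-example.com/x")); // true
        System.out.println(filter.isExcluded("http://good-example.org/page"));  // false
    }
}

The plain java.util.regex alternative would loop over a list of compiled Pattern objects, so every added rule adds to the per-URL constant Kirby mentions, while the combined automaton keeps a single pass per URL.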
> >> I can't seem to find the thread where I saw that advice given, but
> >> you can see the thread where they discuss adding the Automaton URL
> >> filter back in Nutch 0.8, and it seems to agree with my experience
> >> in using both.
> >>
> >> http://lucene.472066.n3.nabble.com/Much-faster-RegExp-lib-needed-in-nutch-td623308.html
> >>
> >> Kirby
> >>
> >> On Thu, Jun 2, 2011 at 2:42 PM, MilleBii <[email protected]> wrote:
> >> > What will be the impact of a growing regex-urlfilter?
> >> >
> >> > I ask this because there are more and more sites that I want to
> >> > filter out; it will limit the number of unnecessary pages at the
> >> > cost of a lot of URL verification.
> >> > Side question: since I already have pages from those sites in the
> >> > crawldb, will they ever be removed? What would be the method to
> >> > remove them?
> >> >
> >> > --
> >> > -MilleBii-
> >>
> >
> > --
> > -MilleBii-
>

--
-MilleBii-
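For the crawldb pruning mentioned at the top of the thread ("write your own mapred"), the heart of such a job might look roughly like the map step below; this is only a sketch, assuming Nutch's CrawlDatum and the old Hadoop mapred API, with the job driver, the crawldb input/output formats and the swap of the resulting directory into crawldb/current all left out:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

/**
 * Sketch of a crawldb-pruning map step: copy through every entry whose
 * score is at or above a threshold, drop the rest. The job driver and
 * the output handling are omitted.
 */
public class PruneLowScoreMapper extends MapReduceBase
    implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {

  private float minScore = 0.0f;

  @Override
  public void configure(JobConf job) {
    // Hypothetical property name; use whatever your job driver sets.
    minScore = job.getFloat("prune.min.score", 0.0f);
  }

  @Override
  public void map(Text url, CrawlDatum datum,
                  OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    if (datum.getScore() >= minScore) {
      output.collect(url, datum);   // keep this URL
    }                               // otherwise silently drop it
  }
}

Entries below the threshold simply never reach the output, so the rewritten crawldb no longer carries them into the next generate/fetch cycle.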

