Hi Otis

> In my case the crawl would be wide, which is good for URL distribution, but
> bad for DNS.
> What's recommended for DNS caching?  I do see
> http://wiki.apache.org/nutch/OptimizingCrawls -- does that mean setting up a
> local DNS server (e.g. bind), or something like pdnsd, or something else?
>

I used bind for local DNS caching when running a 400-node cluster on EC2 for
Similarpages; I'm sure there are other tools that work just as well.
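
In case it's useful, here is a rough sketch of what that setup looked like: a
caching-only bind on each node (or a couple of shared resolvers), with
resolv.conf pointing at it. The file paths and the allowed network below are
assumptions, adapt them to your distro and cluster layout:

# minimal caching-only resolver (paths / networks are assumptions)
cat > /etc/named.conf <<'EOF'
options {
  directory "/var/named";
  recursion yes;                           // act as a caching resolver
  allow-query { 127.0.0.1; 10.0.0.0/8; };  // local node / cluster only
};
EOF
service named restart

# make the crawler node use the local cache
echo "nameserver 127.0.0.1" > /etc/resolv.conf

The whole point is simply that repeated lookups for the hosts in your
fetchlists get answered locally instead of hammering the upstream resolver.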

[...]


> > The time spent in generate and update is proportional to the size of the
> > crawldb. It might take half the total time at one point, but will take more
> > than that. The best option would probably be to generate multiple segments
> > in one go (see options for the Generator), fetch all the segments one by
> > one, then merge them with the crawldb in a single call to update.
>
> Right.
> But with time (or, more precisely, as the crawldb grows) this generation will
> start taking more and more time, and there is no way around that, right?
>

Nope, there's no way around that. Nutch 2.0 will be faster for the updates
compared to 1.x, but generation will still be proportional to the size of the
crawldb.
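
To make the multi-segment approach you quoted concrete, this is roughly what it
looks like with the 1.x command-line tools (paths, the -topN value and the
segment name pattern are placeholders; the Generator option is -maxNumSegments
in recent versions, check the usage message for yours):

# generate several fetchlists in one pass over the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 250000 -maxNumSegments 4

# fetch (and parse) each generated segment one by one
for seg in crawl/segments/2011*; do
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
done

# then a single updatedb pass over all of them
# (assuming the glob only matches this round's segments)
bin/nutch updatedb crawl/crawldb crawl/segments/2011*

That way you pay the cost of reading the crawldb once for N fetchlists and once
for the update, instead of once per segment.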


>
> > You will also inevitably hit slow servers which will have an impact on the
> > fetch rate - although not as bad as before the introduction of the timeout
> > on fetching.
>
> Right, I remember this problem.  So now one can specify how long each fetch
> should last, and fetching will stop when that time is reached?
>

Exactly - you give it, say, 60 minutes and it will stop fetching after that.
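
For reference, the knob behind this is (if I remember the name correctly) the
fetcher.timelimit.mins property, with -1 meaning no limit; double-check the
name in nutch-default.xml for your version. Set it in conf/nutch-site.xml, or
pass it for a single run if your version runs the fetcher through ToolRunner
(recent 1.x does):

# rough sketch: cap this fetch at 60 minutes; the segment path is a placeholder
bin/nutch fetch -D fetcher.timelimit.mins=60 crawl/segments/20110218123456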


>
> How does one guess what time limit to pick, especially since fetch runs can
> vary in how fast they are depending on which hosts are in them?
>

Empirically :-) Take a largish value, observe the fetch and the point at which
it starts to slow down, then reduce accordingly.
Sounds a bit like a recipe, doesn't it?


>
> Wouldn't it be better to express this in requests/second instead of time, so
> that you can say "when fetching goes below N requests per second and stays
> like that for M minutes, abort the fetch"?
>

This would be a nice feature indeed. The timeout is an efficient but somewhat
crude mechanism; it has proved useful though, as fetches could hang on a single
host for a looooooooooooooong time, which on a large cluster means big money.


>
> What if you have a really fast fetch run going on, but the time limit is
> still reached and the fetch aborted?  What do you do?  Restart the fetch with
> the same list of generated URLs as before?  Somehow restart with only
> unfetched URLs?  Generate a whole new fetchlist (which ends up being slow)?
>

You won't need to restart the fetch with the same list. The unfetched URLs
should end up in the next round of generation.
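
If you want to check that, the crawldb stats show how many URLs are sitting in
each status (fetched, unfetched, gone, ...), so you can see what the next
generate has to work with:

# count entries per status in the crawldb
bin/nutch readdb crawl/crawldb -stats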


> > As a result your crawl will just be churning URLs generated automatically
> > from adult sites, and despite the fact that your crawldb will contain loads
> > of URLs there will be very few useful ones.
>
> One man's trash is another man's...
>

Even if adult sites are what you really want to crawl, there is still a need
for filtering / normalisation strategies.
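
For example, conf/regex-urlfilter.txt takes one rule per line, '+' to accept
and '-' to reject, applied top to bottom (so reject rules must come before the
final catch-all). The patterns below are purely illustrative, you would tune
them for whatever you do (or don't) want in the crawldb:

# skip URLs with session-id style parameters that bloat the crawldb
-(?i)[?&](sid|sessionid|phpsessid)=
# skip obvious auto-generated calendar pages
-.*/calendar/[0-9]{4}/[0-9]{2}/
# accept everything else
+.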


> > Anyway, it's not just a matter of pages / second. Doing large, open crawls
> > brings up a lot of interesting challenges :-)
>
> Yup.  Thanks Julien!
>
>
You are welcome.

Julien



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
