> I was saying that based on what the previous poster stated.  Also the fact
> that I have read through quite a bit of posts stating that the problem with
> crawling in a vertical environment has to do with the way fetcher2 was
> built.  The fetches are grouped by domain name and if you have a lot of urls
> from the same domain then you are not able to do quick mapreduce jobs.
>

Nutch's default behaviour is to be polite to the hosts it visits. If you own
the hosts (or have an agreement with the owner) you can of course hit them
as hard as you want by setting a higher number of threads per host or a
shorter delay between hits. If you don't own the hosts then you simply should
not do that; stick to the Nutch defaults as a matter of courtesy. Moreover, if
you are too aggressive in your choice of parameters you will probably be
blacklisted by the target servers and won't be allowed to fetch any content.
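
To put rough numbers on why many URLs from one domain make a fetch slow under
polite settings, here is a minimal sketch in Python. It is a simplified model,
not Nutch's actual scheduler: it assumes one request at a time per host, a
fixed delay between consecutive requests to the same host, and it ignores
download time. The function name, the host names and the 5 second delay are
all made up for illustration.

```python
def min_fetch_seconds(urls_per_host, delay_seconds):
    """Lower bound on wall-clock fetch time under a simple politeness model.

    Assumptions (illustrative only, not Nutch's real scheduling):
      - each host is hit by one request at a time
      - delay_seconds elapse between consecutive requests to the same host
      - different hosts are fetched in parallel, download time is ignored
    """
    # The host with the most URLs dominates the total time.
    largest_queue = max(urls_per_host.values(), default=0)
    # N requests to one host need N - 1 waits between them.
    return max(largest_queue - 1, 0) * delay_seconds


if __name__ == "__main__":
    # Hypothetical vertical crawl: almost everything sits on one host.
    segment = {"shop.example.com": 10_000, "blog.example.org": 50}
    seconds = min_fetch_seconds(segment, delay_seconds=5.0)
    print(f"at least {seconds / 3600:.1f} hours")
```

Under these assumptions the big host alone needs 9,999 waits of 5 seconds, so
the segment cannot finish in much less than 14 hours however many fetcher
threads you run. That is the cost of politeness, not a flaw specific to
vertical crawls.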

Let's be completely clear once and for all: there is no particular issue
with using Nutch for vertical crawls. Loads of people have done that and
still do.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
