> I was saying that based on what the previous poster stated. Also the fact
> that I have read through quite a bit of posts stating that the problem with
> crawling in a vertical environment has to do with the way fetcher2 was
> built. The fetches are grouped by domain name and if you have a lot of urls
> from the same domain then you are not able to do quick mapreduce jobs.
Nutch's default behaviour is to be polite to the hosts it visits. If you own the hosts (or have an agreement with the owner), you can of course hit them as hard as you want by setting a higher number of threads per host or a shorter delay between hits. If you don't own the hosts, you simply should not do that; use the Nutch defaults as a matter of courtesy. Moreover, if you are too aggressive in your choice of parameters, you will probably be blacklisted by the target servers and won't be allowed to fetch any content.

Let's be completely clear once and for all: there is no particular issue with using Nutch for vertical crawls. Loads of people have done that and still do.

Julien

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com