> > Well, Nutch (resp. Hadoop) is designed to process large amounts of data.
> > Job management has some overhead (and some artificial sleeps):
> > 5 cycles * 4 jobs (generate/fetch/parse/update) = 20 jobs.
> > 6s per job seems roughly ok, though it could be slightly faster.
>
> Yes, this test is not well designed for Nutch, but I thought, as Stefan
> said, there might be a config setting or hardcoded delay somewhere in the
> Nutch files I could try to reduce, since I will use it on a single machine.
If your crawl is small and does not require more than one machine, then you could use the local mode instead of the distributed one.

Nutch is designed for large-scale crawling: if you are after a low-latency crawler, then it is not the right tool. Many people do use it on a small scale, though, and do not find the slight overhead of the distributed machinery to be a big issue. A few seconds is quite insignificant considering that a round of fetching can take hours.

J.

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
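For what it's worth, switching to local mode is usually just a matter of telling the Hadoop layer not to submit jobs to a cluster. A minimal sketch, assuming a Nutch 1.x setup (the exact property name depends on the bundled Hadoop version, so check the conf/ directory of your release):

```xml
<!-- conf/mapred-site.xml (or conf/nutch-site.xml):
     run all MapReduce jobs in-process on a single machine
     instead of submitting them to a cluster.
     Older Hadoop releases use "mapred.job.tracker" with value "local";
     newer ones use "mapreduce.framework.name" with value "local". -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
```

With that in place you can run the usual generate/fetch/parse/updatedb commands from bin/nutch directly, without starting any Hadoop daemons.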

