Ken, thanks, I guess that's a good hint! I'm using the simple org.apache.nutch.crawl.Crawl to perform the crawl, so I guess the Map-Reduce job is running with a pretty minimal configuration.
@Andrzej, could you give me a hint where to configure the number of reduce tasks in Nutch 0.9? (running on a single machine)

Regards,

Hannes

On Sat, Nov 20, 2010 at 7:06 PM, Ken Krugler <[email protected]> wrote:

>
> On Nov 20, 2010, at 7:51am, Hannes Carl Meyer wrote:
>
>> Thank you for sharing your experiences!
>>
>> In my case the web servers are pretty stable and we are allowed to perform
>> intensive crawling, which makes it easy to increase the threads per host.
>>
>> IMHO the fetch process isn't really the bottleneck. It is the processing
>> between fetches, when merging and updating the crawldb.
>>
>> We are using 16-core hardware. During the fetch process CPU usage is
>> around 1000 %, but in between fetches it is always around 90-100 % on a
>> single core.
>
> In regular map-reduce Hadoop jobs you get this situation if the job has
> been configured to use a single reducer, and thus only one core is active.
>
> Though it would surprise me if the crawlDB update job was configured this
> way, as I don't see a reason why the crawlDB has to be a single file in
> HDFS.
>
> Andrzej and others would know best, of course.
>
> -- Ken
>
>> On Sat, Nov 20, 2010 at 11:33 AM, Ye T Thet <[email protected]> wrote:
>>
>>> Hannes,
>>>
>>> I guess it depends on the situation:
>>> - your server specs (where the crawler is running) and
>>> - the hosts' specs
>>>
>>> Anyway, I have been crawling around 50 hosts. I tweaked a few settings
>>> to get it right for my situation.
>>>
>>> Currently I am using 500 threads and 10 threads per host.
>>>
>>> In my opinion, the total number of crawler threads does not matter much,
>>> because the crawler does not take many resources (memory and CPU). As
>>> long as your server's network bandwidth can handle it, it should be fine.
>>>
>>> In my case, the number of threads per host matters, because some of my
>>> servers cannot handle that much bandwidth.
>>>
>>> Not sure if it helps, but I had to adjust fetcher.server.delay,
>>> fetcher.server.min.delay and fetcher.max.crawl.delay because my hosts
>>> sometimes cannot handle that many threads.
>>>
>>> Warm Regards,
>>>
>>> Y.T. Thet
>>>
>>> On Thu, Nov 18, 2010 at 11:06 PM, Hannes Carl Meyer <[email protected]> wrote:
>>>
>>>> Hi Ken,
>>>>
>>>> Our crawler is allowed to hit those hosts frequently at night, so we
>>>> are not getting a penalty ;-)
>>>>
>>>> Could you imagine running Nutch in this case with about 400 threads,
>>>> with 1 thread per host and a delay of 1.0?
>>>>
>>>> I tried that way but experienced some really long idle times... My idea
>>>> was one thread per host. That would mean adding another host would
>>>> require adding an additional thread.
>>>>
>>>> Regards
>>>>
>>>> Hannes
>>>>
>>>> On Thu, Nov 18, 2010 at 3:36 PM, Ken Krugler <[email protected]> wrote:
>>>>
>>>>> If you're hitting each host with 45 threads, you better be on really
>>>>> good terms with those webmasters :)
>>>>>
>>>>> With 90 total threads, that means as few as 2 hosts are active at any
>>>>> time, yes?
>>>>>
>>>>> -- Ken
>>>>>
>>>>> On Nov 18, 2010, at 3:51am, Hannes Carl Meyer wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600
>>>>>> pages each. That makes a volume of 240,000 fetched pages - I want to
>>>>>> get all of them.
>>>>>>
>>>>>> Can anyone give me advice on the right threads/delay/per-host
>>>>>> configuration in this environment?
>>>>>>
>>>>>> My current conf:
>>>>>>
>>>>>> <property>
>>>>>>   <name>fetcher.server.delay</name>
>>>>>>   <value>1.0</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fetcher.threads.fetch</name>
>>>>>>   <value>90</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fetcher.threads.per.host</name>
>>>>>>   <value>45</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>   <name>fetcher.threads.per.host.by.ip</name>
>>>>>>   <value>false</value>
>>>>>> </property>
>>>>>>
>>>>>> The total runtime is about 5 hours.
>>>>>>
>>>>>> How can performance be improved? (I still have enough CPU and
>>>>>> bandwidth.)
>>>>>>
>>>>>> Note: This runs on a single machine; distribution to other machines
>>>>>> is not planned.
>>>>>>
>>>>>> Thanks and Regards
>>>>>>
>>>>>> Hannes
>>>>>
>>>>> --------------------------
>>>>> Ken Krugler
>>>>> +1 530-210-6378
>>>>> http://bixolabs.com
>>>>> e l a s t i c   w e b   m i n i n g
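
For reference, the one-thread-per-host variant mentioned in the quoted thread (about 400 threads, 1 thread per host, a delay of 1.0) could be sketched roughly as below. The property names and values are the ones quoted in the thread; placing them in conf/nutch-site.xml inside a <configuration> element is an assumption about a typical setup, and this is the variant that showed long idle times, so it is an illustration and not a tested recommendation.

```xml
<?xml version="1.0"?>
<!-- Sketch only: assumed to live in conf/nutch-site.xml as local overrides. -->
<configuration>

  <!-- Seconds to wait between successive requests to the same host. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>

  <!-- Total fetcher threads: roughly one per host for about 400 hosts. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>400</value>
  </property>

  <!-- At most one thread fetching from any single host at a time. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>

  <!-- Unchanged from the original configuration: count threads per host name, not per IP. -->
  <property>
    <name>fetcher.threads.per.host.by.ip</name>
    <value>false</value>
  </property>

</configuration>
```

With these values each host is asked for at most one page per second, so a host with about 600 pages needs at least 600 seconds (10 minutes) of fetching, no matter how many threads run in total.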

