I'm going to give it a try and configure a pseudo-distributed environment on our testing machine (which also has 16 cores and 24 GB RAM).
I'll get back here after testing it!

On Sat, Nov 20, 2010 at 10:53 PM, Ken Krugler <[email protected]> wrote:

> [snip]
>
>>> During fetching, multiple threads are spawned which will then use all
>>> your cores.
>>>
>>> But during regular map-reduce tasks (such as the CrawlDB update), you'll
>>> get a single map and a single reduce running sequentially.
>>
>> (Actually, LocalJobTracker will create multiple map tasks - as many as
>> there are input splits - but running sequentially.)
>
> Sorry, I was being vague in my wording. I meant one mapper and one reducer,
> which won't be run in parallel.
>
> You're right that there will be N map tasks, one per split (which typically
> means one per HDFS block).
>
>>> To get reasonable performance from one box, you'd need to set up Hadoop
>>> to run in pseudo-distributed mode, and then run your Nutch crawl as a
>>> regular/distributed job.
>>>
>>> And also tweak the hadoop-site.xml settings, to specify something like 6
>>> mappers and 6 reducers (leave four cores for the JobTracker, NameNode,
>>> TaskTracker, and DataNode).
>>>
>>> But I'll confess, I've never tried to run a real job this way.
>>
>> I have. Within the limits of single-machine performance this works
>> reasonably well - if you have a node with 4 cores and enough RAM, then
>> you can easily run 4 tasks in parallel. Jobs then become limited by the
>> amount of disk IO.
>
> I'd be interested in hearing back from Hannes as to performance with a
> 16-core box. Based on the paper by the IRLBot team, it seems like this
> could scale pretty well.
>
> They did wind up having to install a lot of disks in their crawling box.
> And as Andrzej mentions, disk I/O will become a bottleneck, especially for
> CrawlDB updates (less so for fetching or parsing).
>
> If you have multiple drives, then you could run multiple DataNodes, and
> configure each one to use a separate disk.
>
> I don't have a good sense of whether it would be worthwhile to use
> replication, but in the past Hadoop had some issues running with a
> replication of 1, so I'd probably set this to 2.
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
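P.S. For anyone who wants to try the same setup, here's roughly the
hadoop-site.xml I plan to start from, based on Ken's suggestions. Treat it
as a sketch, not a tested config: the property names are the ones from the
Hadoop 0.20.x line that Nutch ships with, the /disk1 and /disk2 paths are
placeholders for whatever drives your box actually has, and the 6/6 task
limits are just Ken's numbers.

  <configuration>
    <!-- Pseudo-distributed mode: NameNode and JobTracker on localhost -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>

    <!-- Ken's suggestion: cap at 6 map + 6 reduce slots, leaving cores
         free for the JobTracker, NameNode, TaskTracker and DataNode -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>6</value>
    </property>

    <!-- Spread HDFS data over separate physical disks to ease the
         disk-IO bottleneck; placeholder paths, adjust to your drives -->
    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
    </property>

    <!-- Replication of 2, as Ken suggests. Note this only takes effect
         if at least two DataNodes are running; with a single DataNode
         the blocks would just sit under-replicated, so I may have to
         fall back to 1 -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
  </configuration>

(Hadoop 0.20 still reads hadoop-site.xml, though the same properties can
also go into the newer core-site.xml / hdfs-site.xml / mapred-site.xml
split. I'll report back on which values actually hold up on the 16-core
box.)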

