Re: Performance Configuration on Focused Web Crawl

Hannes Carl Meyer Sat, 20 Nov 2010 10:52:36 -0800

Ken, thanks, I guess that's a good hint!

I'm using the simple org.apache.nutch.crawl.Crawl to perform the crawl, so I
guess the Map-Reduce job is running with a pretty minimal configuration.


@Andrzej, could you give me a hint on where to configure the number of reduce
tasks in Nutch 0.9? (Running on a single machine.)
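
For reference, a minimal sketch of where the reducer count normally lives,
assuming the Hadoop version bundled with Nutch 0.9 reads conf/hadoop-site.xml
and honours the standard mapred.map.tasks / mapred.reduce.tasks properties.
The values are placeholders, not recommendations. One caveat, also an
assumption about that Hadoop version: with mapred.job.tracker left at "local"
the job runs in-process through the local runner, which is generally
understood to use a single reducer regardless of this setting, so a
pseudo-distributed setup on the one machine may be needed before the value has
any effect.

```xml
<!-- conf/hadoop-site.xml (assumed location for Nutch 0.9) -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <!-- placeholder: a hint for the number of map tasks per job -->
    <value>16</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <!-- placeholder: number of reduce tasks per job, e.g. for the crawldb update -->
    <value>8</value>
  </property>
</configuration>
```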

Regards,

Hannes

On Sat, Nov 20, 2010 at 7:06 PM, Ken Krugler <[email protected]> wrote:

>
> On Nov 20, 2010, at 7:51am, Hannes Carl Meyer wrote:
>
>> Thank you for sharing your experiences!
>>
>> In my case the web servers are pretty stable and we are allowed to perform
>> intensive crawling, which makes it easy to increase the threads per host.
>>
>> IMHO the fetch process isn't really the bottleneck. It is the processing
>> between fetches, when merging and updating the crawldb.
>>
>> We are using 16-core hardware. During the fetch process CPU usage is
>> around 1000%, but in between fetches it is always around 90-100% on a
>> single core.
>>
>
> In regular map-reduce Hadoop jobs you get this situation if the job has
> been configured to use a single reducer, and thus only one core is active.
>
> Though it would surprise me if the crawlDB update job was configured this
> way, as I don't see a reason why the crawlDB has to be a single file in
> HDFS.
>
> Andrzej and others would know best, of course.
>
> -- Ken
>
>
>
>
>> On Sat, Nov 20, 2010 at 11:33 AM, Ye T Thet <[email protected]>
>> wrote:
>>
>>> Hannes,
>>>
>>> I guess it depends on the situation:
>>> - your server specs (where the crawler is running) and
>>> - the hosts' specs
>>>
>>> Anyway, I have been crawling around 50 hosts. I tweaked a few settings to
>>> get it right for my situation.
>>>
>>> Currently I am using 500 threads in total and 10 threads per host.
>>>
>>> In my opinion, the total number of crawler threads does not matter much,
>>> because the crawler does not take much in the way of resources (memory and
>>> CPU). As long as your server's network bandwidth can handle it, it should
>>> be fine.
>>>
>>> In my case, the number of threads per host matters, because some of my
>>> servers cannot handle that much bandwidth.
>>>
>>> Not sure if it helps, but I had to adjust fetcher.server.delay,
>>> fetcher.server.min.delay and fetcher.max.crawl.delay because my hosts
>>> sometimes cannot handle that many threads.
>>>
>>>
>>> Warm Regards,
>>>
>>> Y.T. Thet
>>>
>>>
>>>
>>>
>>> On Thu, Nov 18, 2010 at 11:06 PM, Hannes Carl Meyer <
>>> [email protected]> wrote:
>>>
>>>> Hi Ken,
>>>>
>>>> our crawler is allowed to hit those hosts frequently at night, so we
>>>> are not getting a penalty ;-)
>>>>
>>>> Could you imagine running Nutch in this case with about 400 threads, with
>>>> 1 thread per host and a delay of 1.0?
>>>>
>>>> I tried it that way but experienced some really long idle times... My idea
>>>> was one thread per host. That would mean adding another host would require
>>>> adding an additional thread.
>>>>
>>>> Regards
>>>>
>>>> Hannes
>>>>
>>>> On Thu, Nov 18, 2010 at 3:36 PM, Ken Krugler <[email protected]> wrote:
>>>>
>>>>> If you're hitting each host with 45 threads, you better be on really good
>>>>> terms with those webmasters :)
>>>>>
>>>>> With 90 total threads, that means as few as 2 hosts are active at any time,
>>>>> yes?
>>>>>
>>>>> -- Ken
>>>>>
>>>>>
>>>>>
>>>>> On Nov 18, 2010, at 3:51am, Hannes Carl Meyer wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600 pages.
>>>>>> That makes a volume of 240,000 fetched pages - I want to get all of them.
>>>>>>
>>>>>> Can anyone give me advice on the right threads/delay/per-host configuration
>>>>>> in this environment?
>>>>>>
>>>>>> My current conf:
>>>>>>
>>>>>> <property>
>>>>>>     <name>fetcher.server.delay</name>
>>>>>>     <value>1.0</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>     <name>fetcher.threads.fetch</name>
>>>>>>     <value>90</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>     <name>fetcher.threads.per.host</name>
>>>>>>     <value>45</value>
>>>>>> </property>
>>>>>>
>>>>>> <property>
>>>>>>     <name>fetcher.threads.per.host.by.ip</name>
>>>>>>     <value>false</value>
>>>>>> </property>
>>>>>>
>>>>>> The total runtime is about 5 hours.
>>>>>>
>>>>>> How can performance be improved? (I still have enough CPU and bandwidth.)
>>>>>>
>>>>>> Note: This runs on a single machine; distribution to other machines is not
>>>>>> planned.
>>>>>>
>>>>>> Thanks and Regards
>>>>>>
>>>>>> Hannes
>>>>>>
>>>>>>
>>>>> --------------------------
>>>>> Ken Krugler
>>>>> +1 530-210-6378
>>>>> http://bixolabs.com
>>>>> e l a s t i c   w e b   m i n i n g
>>>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
