Re: how to enhance job start up speed?

Bertrand Dechoux Mon, 13 Aug 2012 09:08:27 -0700

Seems like you want to misuse Hadoop but maybe I still don't understand
your context.


The standard way would be to split your files across multiple map tasks.
Each map could then profit from data locality. Do one part of the worker
logic in the mapper and then use a reducer to aggregate all the results
(which could be another part of your worker). That way you would be able to
parallelise your worker logic on a single file. You seem to be avoiding a
reducer in order to lessen the network traffic. That's a valid concern, but
reducers do have their uses too.
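
To make the shape concrete, here is a minimal sketch in plain Python (not the
Hadoop API). It assumes the worker is a word count; the names make_splits,
map_split and reduce_counts are made up for illustration only. Each split is
processed independently, and a final step merges the partial results:

```python
from collections import Counter
from functools import reduce


def make_splits(lines, lines_per_split):
    """Cut the input into independent chunks, standing in for input splits."""
    return [lines[i:i + lines_per_split]
            for i in range(0, len(lines), lines_per_split)]


def map_split(split):
    """Worker logic, part 1: count words within one split only."""
    counts = Counter()
    for line in split:
        counts.update(line.split())
    return counts


def reduce_counts(partials):
    """Worker logic, part 2: merge the per-split counts into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())


lines = ["a b a", "b c", "a", "c c b"]
splits = make_splits(lines, 2)
# Each call below is independent, so in a real job it could run on
# whichever node holds that part of the file.
partials = [map_split(s) for s in splits]
total = reduce_counts(partials)
print(dict(total))
```

Only the small per-split counts have to travel to the aggregation step, not
the file contents, which is why a reducer does not necessarily mean heavy
network traffic.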

Bertrand


On Mon, Aug 13, 2012 at 5:53 PM, Matthias Kricke <
[email protected]> wrote:

> @Bejoy KS: Thanks for your advice.
>
> @Bertrand: It is parallelisable, this is just a test case. In later cases
> there will be a lot of big files, each of which should be processed
> completely in one map step. We want to minimize the overhead of network traffic. The
> idea is to execute some worker (could be different stuff, e.g. wordcount,
> linecount, translation etc) at the node where the file is situated.
>
> If I get it right so far, we need to do several things. First, the chunk
> size should be as big as the file. Then the file is on a single node of the
> hadoop cluster, am I right? And then set the file to non-splittable.
>
> Do you have any more advice? Anyway, thanks so far!
>
> Greetings,
> MK
>
>
> 2012/8/13 Bertrand Dechoux <[email protected]>
>
>> It was almost what I was getting at but I was not sure about your
>> problem.
>> Basically, Hadoop is only adding overhead due to the way your job is
>> constructed.
>> Now the question is: why do you need a single mapper? Is your need truly
>> not 'parallelisable'?
>>
>> Bertrand
>>
>>
>> On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <[email protected]> wrote:
>>
>>> Hi Matthias
>>>
>>> When a MapReduce program is used, there are some extra steps, like
>>> checking the input and output dirs, calculating input splits, the
>>> JobTracker (JT) assigning a TaskTracker (TT) to execute the task, etc.
>>>
>>> If your file is non-splittable, then one map task per file will be
>>> generated irrespective of the number of HDFS blocks. Some blocks will then
>>> be on a different node than the one where the map task is executed, so time
>>> will be spent on the network transfer.
>>>
>>> In your case MR would be an overhead: your file is non-splittable, hence
>>> there is no parallelism, and there is also the overhead of copying blocks
>>> to the map task node.
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from handheld, please excuse typos.
>>> ------------------------------
>>> From: Matthias Kricke <[email protected]>
>>> Sender: [email protected]
>>> Date: Mon, 13 Aug 2012 16:33:06 +0200
>>> To: <[email protected]>
>>> ReplyTo: [email protected]
>>> Subject: Re: how to enhance job start up speed?
>>>
>>> OK, I'll try to clarify:
>>>
>>> 1) The worker is the logic inside my mapper and is the same in both cases.
>>> 2) I have two cases. In the first one I use hadoop to execute my worker,
>>> and in the second one I execute my worker without hadoop (a simple read of
>>> the file).
>>>    For both cases I measured the time the worker needs and the time the
>>> surroundings need (so I have two values for each case). The worker took
>>> the same time in both cases for the same input (this is expected). But the
>>> surroundings' share of the run time was 17 percentage points higher when
>>> using hadoop.
>>> 3) ~3 GB.
>>>
>>> I want to know how to reduce this difference and where it comes from.
>>> I hope that helped? If not, feel free to ask again :)
>>>
>>> Greetings,
>>> MK
>>>
>>> P.S. just for your information, I did the same test with hypertable as
>>> well.
>>> I got:
>>>  * worker without anything: 15% overhead
>>>  * worker with hadoop: 32% overhead
>>>  * worker with hypertable: 53% overhead
>>> Remark: overhead was measured in comparison to the worker. For example,
>>> hypertable uses 53% of the whole process time, while the worker uses 47%.
>>>
>>> 2012/8/13 Bertrand Dechoux <[email protected]>
>>>
>>>> I am not sure I understand, and I guess I am not the only one.
>>>>
>>>> 1) What's a worker in your context? Only the logic inside your Mapper
>>>> or something else?
>>>> 2) You should clarify your cases. You seem to have two cases but both
>>>> are in overhead so I am assuming there is a baseline? Hadoop vs sequential,
>>>> so sequential is not Hadoop?
>>>> 3) What is the size of the file?
>>>>
>>>> Bertrand
>>>>
>>>>
>>>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I'm using CDH3u3.
>>>>> If I want to process one file that is set to non-splittable, hadoop starts
>>>>> one Mapper and no Reducer (that's OK for this test scenario). The Mapper
>>>>> goes through a configuration step where some variables for the worker
>>>>> inside the mapper are initialized.
>>>>> Now the Mapper gives me K,V-pairs, which are lines of an input file. I
>>>>> process the V with the worker.
>>>>>
>>>>> When I compare the run time of hadoop to the run time of the same
>>>>> process run sequentially, I get:
>>>>>
>>>>> worker time --> same in both cases
>>>>>
>>>>> case: mapper --> overhead of ~32% to the worker process (same for
>>>>> bigger chunk size)
>>>>> case: sequential --> overhead of ~15% to the worker process
>>>>>
>>>>> It shouldn't be that much slower: because the file is non-splittable, the
>>>>> mapper will be executed where HDFS stores the data, won't it?
>>>>> Where did those 17% go? How can I reduce this? Does hadoop need the whole
>>>>> time for reading or streaming the data out of HDFS?
>>>>>
>>>>> I would appreciate your help,
>>>>>
>>>>> Greetings
>>>>> mk
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Bertrand Dechoux
>>>>
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>


-- 
Bertrand Dechoux
