It seems you want to misuse Hadoop, but maybe I still don't understand your context.
The standard way would be to split your files across multiple map tasks, so that each map task can profit from data locality. Do part of the worker's job in the mapper, then use a reducer to aggregate all the results (which could be another part of your worker). That way you can parallelise your worker logic over a single file.

You seem to be avoiding a reducer in order to lessen the network traffic. That's a valid concern, but reducers do have their uses too.

Bertrand

On Mon, Aug 13, 2012 at 5:53 PM, Matthias Kricke <[email protected]> wrote:

> @Bejoy KS: Thanks for your advice.
>
> @Bertrand: It is parallelisable, this is just a test case. In later cases
> there will be a lot of big files, each of which should be processed completely
> in one map step. We want to minimize the overhead of network traffic. The
> idea is to execute some worker (could be different stuff, e.g. wordcount,
> linecount, translation etc.) at the node where the file is situated.
>
> If I get it right so far, we need to do several things... First, the chunk size
> should be as big as the file. Then the file is on a single node of the
> hadoop cluster, am I right? And
> set the file to non-splittable.
>
> Do you have some more advice? Anyway, thanks so far!
>
> Greetings,
> MK
>
>
> 2012/8/13 Bertrand Dechoux <[email protected]>
>
>> It was almost what I was getting at but I was not sure about your
>> problem.
>> Basically, Hadoop is only adding overhead due to the way your job is
>> constructed.
>> Now the question is: why do you need a single mapper? Is your need truly
>> not 'parallelisable'?
>>
>> Bertrand
>>
>>
>> On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <[email protected]> wrote:
>>
>>> Hi Matthias
>>>
>>> When a MapReduce program is used there are some extra steps like
>>> checking for input and output dirs, calculating input splits, the JT assigning
>>> a TT for executing the task etc.
>>>
>>> If your file is non-splittable, then one map task per file will be
>>> generated irrespective of the number of HDFS blocks. Now some blocks will
>>> be on a different node than the node where the map task is executed, so time
>>> will be spent here on the network transfer.
>>>
>>> In your case MR would be an overhead, as your file is non-splittable, hence
>>> no parallelism, and there is also an overhead of copying blocks to the map
>>> task node.
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from handheld, please excuse typos.
>>> ------------------------------
>>> *From:* Matthias Kricke <[email protected]>
>>> *Sender:* [email protected]
>>> *Date:* Mon, 13 Aug 2012 16:33:06 +0200
>>> *To:* <[email protected]>
>>> *ReplyTo:* [email protected]
>>> *Subject:* Re: how to enhance job start up speed?
>>>
>>> Ok, I try to clarify:
>>>
>>> 1) The worker is the logic inside my mapper and the same for both cases.
>>> 2) I have two cases. In the first one I use hadoop to execute my worker
>>> and in the second one I execute my worker without hadoop (simple read of the
>>> file).
>>> Now I measured, for both cases, the time the worker and
>>> the surroundings need (so I have two values for each case). The worker took
>>> the same time in both cases for the same input (this is expected). But the
>>> surroundings took 17% more time when using hadoop.
>>> 3) ~ 3GB.
>>>
>>> I want to know how to reduce this difference and where it comes from.
>>> I hope that helped? If not, feel free to ask again :)
>>>
>>> Greetings,
>>> MK
>>>
>>> P.S. just for your information, I did the same test with hypertable as
>>> well.
>>> I got:
>>> * worker without anything: 15% overhead
>>> * worker with hadoop: 32% overhead
>>> * worker with hypertable: 53% overhead
>>> Remark: overhead was measured in comparison to the worker, e.g.
>>> hypertable uses 53% of the whole process time, while the worker uses 47%.
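
A side note on reading these figures: the remark defines overhead as a share of the whole process time, so the percentages cannot simply be subtracted from one another. Assuming that definition applies to all three cases and that the worker time is the same in each (as stated in the thread), the total run time is the worker time divided by (1 - overhead share). The sketch below works this out with the worker time normalised to 1.0; the function name `total_time` and the dictionary keys are made up for illustration.

```python
def total_time(worker_time: float, overhead_share: float) -> float:
    """Total run time when `overhead_share` is the fraction of the
    whole run that is spent outside the worker."""
    if not 0.0 <= overhead_share < 1.0:
        raise ValueError("overhead_share must be in [0, 1)")
    return worker_time / (1.0 - overhead_share)


# Overhead shares quoted in the thread; worker time normalised to 1.0
# (an assumption made only to compare the cases).
shares = {"sequential": 0.15, "hadoop": 0.32, "hypertable": 0.53}
totals = {name: total_time(1.0, share) for name, share in shares.items()}
ratio_hadoop_vs_sequential = totals["hadoop"] / totals["sequential"]

for name, t in totals.items():
    print(f"{name}: total = {t:.3f} x worker, surroundings = {t - 1.0:.3f} x worker")
print(f"hadoop / sequential = {ratio_hadoop_vs_sequential:.2f}")
```

Under that reading, the Hadoop run takes about 1.25 times as long as the sequential run, and its surroundings cost roughly 0.47 worker-times against roughly 0.18, so the gap is larger than the 17-point difference suggests.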
>>>
>>> 2012/8/13 Bertrand Dechoux <[email protected]>
>>>
>>>> I am not sure I understand and I guess I am not the only one.
>>>>
>>>> 1) What's a worker in your context? Only the logic inside your Mapper
>>>> or something else?
>>>> 2) You should clarify your cases. You seem to have two cases but both
>>>> are in overhead so I am assuming there is a baseline? Hadoop vs sequential,
>>>> so sequential is not Hadoop?
>>>> 3) What is the size of the file?
>>>>
>>>> Bertrand
>>>>
>>>>
>>>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I'm using CDH3u3.
>>>>> If I want to process one file, set to non-splittable, hadoop starts one
>>>>> Mapper and no Reducer (that's ok for this test scenario). The Mapper
>>>>> goes through a configuration step where some variables for the worker
>>>>> inside the mapper are initialized.
>>>>> Now the Mapper gives me K,V-pairs, which are lines of an input file. I
>>>>> process the V with the worker.
>>>>>
>>>>> When I compare the run time of hadoop to the run time of the same
>>>>> process in a sequential manner, I get:
>>>>>
>>>>> worker time --> same in both cases
>>>>>
>>>>> case: mapper --> overhead of ~32% to the worker process (same for
>>>>> bigger chunk size)
>>>>> case: sequential --> overhead of ~15% to the worker process
>>>>>
>>>>> It shouldn't be that much slower: because of non-splittable, the mapper
>>>>> will be executed where the data is saved by HDFS, won't it?
>>>>> Where did those 17% go? How to reduce this? Does hadoop need the whole
>>>>> time for reading or streaming the data out of HDFS?
>>>>>
>>>>> I would appreciate your help,
>>>>>
>>>>> Greetings
>>>>> mk
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Bertrand Dechoux
>>>>
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>
--
Bertrand Dechoux
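
As a rough illustration of the pattern described in the reply at the top of this thread (split the file, do part of the worker in each map task, aggregate in a reducer), here is a plain-Python simulation. It does not use the Hadoop API: threads stand in for map tasks, a list of lines stands in for the file, and every name (`split_lines`, `map_split`, `reduce_partials`, `run_job`) is hypothetical. The worker here is a simple line and word count, one of the examples mentioned in the thread.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_lines(lines, n_splits):
    """Cut the input into contiguous chunks (a stand-in for input splits)."""
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_split(chunk):
    """Mapper-side part of the worker: a partial result from one split only."""
    return {"lines": len(chunk),
            "words": sum(len(line.split()) for line in chunk)}


def reduce_partials(a, b):
    """Reducer-side part of the worker: merge two partial results."""
    return {key: a[key] + b[key] for key in a}


def run_job(lines, n_splits=4):
    """Run the map step over all splits in parallel, then reduce."""
    splits = split_lines(lines, n_splits)
    with ThreadPoolExecutor(max_workers=n_splits) as pool:
        partials = list(pool.map(map_split, splits))
    return reduce(reduce_partials, partials, {"lines": 0, "words": 0})


data = ["to be or not to be", "that is the question", "", "whether tis nobler"]
result = run_job(data, n_splits=2)
print(result)
```

Only the small partial dictionaries travel from the map step to the reduce step, which is why a reducer that aggregates compact results need not add much network traffic. Whether that holds for a given worker depends on how large its intermediate output is.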
