That is almost what I was getting at, but I was not sure I understood your problem. Basically, Hadoop is only adding overhead because of the way your job is constructed. Now the question is: why do you need a single mapper? Is your task truly not parallelisable?
Bertrand

On Mon, Aug 13, 2012 at 4:49 PM, Bejoy KS <[email protected]> wrote:

> Hi Matthias
>
> When a MapReduce program is used there are some extra steps, like
> checking the input and output dirs, calculating input splits, the JT
> assigning a TT to execute the task, etc.
>
> If your file is non-splittable, then one map task per file will be
> generated irrespective of the number of HDFS blocks. Some of those blocks
> will be on a different node than the one where the map task is executed,
> so time will be spent on network transfer.
>
> In your case MR is an overhead: your file is non-splittable, hence
> no parallelism, and there is also the overhead of copying blocks to the
> map task node.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Matthias Kricke <[email protected]>
> *Sender: * [email protected]
> *Date: *Mon, 13 Aug 2012 16:33:06 +0200
> *To: *<[email protected]>
> *ReplyTo: * [email protected]
> *Subject: *Re: how to enhance job start up speed?
>
> Ok, I'll try to clarify:
>
> 1) The worker is the logic inside my mapper and is the same for both cases.
> 2) I have two cases. In the first one I use Hadoop to execute my worker
> and in the second one I execute my worker without Hadoop (simple read of
> the file).
> I measured, for both cases, the time the worker and the surroundings
> need (so I have two values for each case). The worker took the same time
> in both cases for the same input (this is expected). But the surroundings
> took 17% more time when using Hadoop.
> 3) ~ 3GB.
>
> I want to know how to reduce this difference and where it comes from.
> I hope that helped? If not, feel free to ask again :)
>
> Greetings,
> MK
>
> P.S. Just for your information, I did the same test with Hypertable as
> well.
> I got:
> * worker without anything: 15% overhead
> * worker with hadoop: 32% overhead
> * worker with hypertable: 53% overhead
> Remark: overhead was measured in comparison to the worker, e.g. Hypertable
> uses 53% of the whole process time, while the worker uses 47%.
>
> 2012/8/13 Bertrand Dechoux <[email protected]>
>
>> I am not sure I understand, and I guess I am not the only one.
>>
>> 1) What's a worker in your context? Only the logic inside your Mapper or
>> something else?
>> 2) You should clarify your cases. You seem to have two cases but both
>> have overhead, so I am assuming there is a baseline? Hadoop vs sequential,
>> so sequential is not Hadoop?
>> 3) What is the size of the file?
>>
>> Bertrand
>>
>>
>> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
>> [email protected]> wrote:
>>
>>> Hello all,
>>>
>>> I'm using CDH3u3.
>>> If I want to process one file, set to non-splittable, Hadoop starts one
>>> Mapper and no Reducer (that's ok for this test scenario). The Mapper
>>> goes through a configuration step where some variables for the worker
>>> inside the mapper are initialized.
>>> Now the Mapper gives me K,V pairs, which are lines of an input file. I
>>> process the V with the worker.
>>>
>>> When I compare the run time of Hadoop to the run time of the same
>>> process run sequentially, I get:
>>>
>>> worker time --> same in both cases
>>>
>>> case: mapper --> overhead of ~32% to the worker process (same for bigger
>>> chunk size)
>>> case: sequential --> overhead of ~15% to the worker process
>>>
>>> It shouldn't be that much slower: because the file is non-splittable,
>>> the mapper will be executed where the data is stored by HDFS, won't it?
>>> Where did those 17% go? How can I reduce this? Does Hadoop need the whole
>>> time for reading or streaming the data out of HDFS?
>>>
>>> I would appreciate your help,
>>>
>>> Greetings
>>> mk
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>

--
Bertrand Dechoux
