Hi, there is a fixed overhead for scheduling and starting the job in MR mode. The minimum job time I have seen in my (limited) experience is around 1 minute for a piece of code that did basically nothing on a small dataset. If your job takes 1 minute locally, it's not a good candidate for parallelization :)
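To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. It is not a measurement. It assumes a fixed startup overhead of about 60 seconds (the figure mentioned above), 4 workers (the 4 datanodes in the original post), and perfectly linear speedup, which is optimistic for a real cluster. The function names are hypothetical and only serve the illustration.

```python
def mapreduce_time(local_seconds, overhead_seconds=60.0, workers=4):
    """Naive estimate of MR-mode run time.

    Assumption: the work splits perfectly across `workers`, and the
    scheduling/startup cost is a constant `overhead_seconds`.
    """
    return overhead_seconds + local_seconds / workers


def break_even_local_seconds(overhead_seconds=60.0, workers=4):
    """Local run time above which the naive model favours MR mode.

    Solves local = overhead + local / workers for local.
    """
    if workers <= 1:
        # With a single worker the overhead can never be recovered.
        return float("inf")
    return overhead_seconds * workers / (workers - 1)


if __name__ == "__main__":
    for local in (60, 600, 6000):
        estimate = mapreduce_time(local)
        print(f"local {local:>5}s -> estimated MR time {estimate:.0f}s")
    print(f"break-even local time: {break_even_local_seconds():.0f}s")
```

Under these assumptions a job that takes 60 seconds locally is estimated at 75 seconds in MR mode, and the break-even point sits at 80 seconds of local run time. Real jobs will need to be well past that before the cluster pays off, since shuffle and reduce costs are ignored here.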
As suggested by others, try bigger numbers. A 10x increase should already give you something more meaningful in my opinion.

Cheers,
--
Gianmarco

On Fri, Jan 6, 2012 at 10:22, Prashant Kommireddi <[email protected]> wrote:

> FYI, local mode is ideally suited for debugging (easier since its a
> single process). It is not suited for large datasets, that is the goal
> of mapreduce.
>
> It might never be apples to apples if you are comparing the 2, since
> the variables differ. For large datasets you might notice the job
> choking on local mode.
>
> -Prashant
>
> On Jan 6, 2012, at 1:16 AM, Prashant Kommireddi <[email protected]> wrote:
>
> > I would recommend trying it with a few GBs.
> >
> > I'm curious as to why you are benchmarking local vs mapreduce?
> >
> > Thanks,
> > Prashant
> >
> > On Jan 6, 2012, at 12:46 AM, Michael Lok <[email protected]> wrote:
> >
> >> Hi Prashant,
> >>
> >> Thanks for the input. Any idea what would be a good size to perform
> >> benchmark on?
> >>
> >> Thanks.
> >>
> >> On Fri, Jan 6, 2012 at 4:29 PM, Prashant Kommireddi <[email protected]> wrote:
> >>> Hi Michael,
> >>>
> >>> That does not seem large enough for benchmarking/comparison. Please try
> >>> increasing the filesize to make it a fair comparison :)
> >>> It might be possible the cost of spawning multiple tasks across the nodes
> >>> is more than cost of running the job with little data locally.
> >>>
> >>> Thanks,
> >>> Prashant
> >>>
> >>> On Fri, Jan 6, 2012 at 12:10 AM, Michael Lok <[email protected]> wrote:
> >>>
> >>>> Hi Prashant,
> >>>>
> >>>> 1000 and 4600 records respectively :) Hence the output from the cross
> >>>> join is 4 million records.
> >>>>
> >>>> I suppose I should increase the number of records to take advantage of
> >>>> the parallel features? :)
> >>>>
> >>>> Thanks.
> >>>>
> >>>> On Fri, Jan 6, 2012 at 4:04 PM, Prashant Kommireddi <[email protected]> wrote:
> >>>>> What is the filesize of the 2 data sets? If the datasets are really
> >>>>> small, making it run distributed might not really give any advantage
> >>>>> over local mode.
> >>>>>
> >>>>> Also the benefits of parallelism depends on how much data is being
> >>>>> sent to the reducers.
> >>>>>
> >>>>> -Prashant
> >>>>>
> >>>>> On Jan 5, 2012, at 11:52 PM, Michael Lok <[email protected]> wrote:
> >>>>>
> >>>>>> Hi folks,
> >>>>>>
> >>>>>> I've a simple script which does CROSS join (thanks to Dimitry for the
> >>>>>> tip :D) and calls a UDF to perform simple matching between 2 values
> >>>>>> from the joined result.
> >>>>>>
> >>>>>> The script was initially executed via local mode and the average
> >>>>>> execution time is around 1 minute.
> >>>>>>
> >>>>>> However, when the script is executed via mapreduce mode, it averages
> >>>>>> 2+ minutes. The cluster I've setup consists of 4 datanodes.
> >>>>>>
> >>>>>> I've tried setting the "default_parallel" setting to 5 and 10, but it
> >>>>>> doesn't affect the performance.
> >>>>>>
> >>>>>> Is there anything I should look at? BTW, the data size is pretty
> >>>>>> small; around 4 million records generated from the CROSS operation.
> >>>>>>
> >>>>>> Here's the script which I'm referring to:
> >>>>>>
> >>>>>> set debug 'on';
> >>>>>> set job.name 'vacancy cross';
> >>>>>> set default_parallel 5;
> >>>>>>
> >>>>>> register pig/*.jar;
> >>>>>>
> >>>>>> define DIST com.pig.udf.Distance();
> >>>>>>
> >>>>>> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray,
> >>>>>> jsstate:chararray);
> >>>>>>
> >>>>>> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray,
> >>>>>> vacstate:chararray);
> >>>>>>
> >>>>>> cx = cross js, vac;
> >>>>>>
> >>>>>> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);
> >>>>>>
> >>>>>> store d into 'out' using PigStorage(',');
> >>>>>>
> >>>>>> Thanks!
