I would recommend trying it with a few GBs. I'm curious as to why you are benchmarking local vs mapreduce?
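
For a rough sense of scale, here is a small Python sketch (not part of the original Pig script) that works out how many records the CROSS produces from the two input sizes mentioned in this thread (1000 and 4600), and roughly what that comes to on disk. The function names and the 40-byte average record size are assumptions for illustration only; measure the real output to get an actual figure.

```python
def cross_rows(left_rows: int, right_rows: int) -> int:
    """Record count of a CROSS: every left record pairs with every right record."""
    return left_rows * right_rows


def estimated_bytes(rows: int, avg_row_bytes: int) -> int:
    """Rough output size, given an assumed average record size in bytes."""
    return rows * avg_row_bytes


# Input sizes quoted in the thread.
JOBSEEKER_ROWS = 1000
VACANCY_ROWS = 4600

# Assumption: about 40 bytes per output record (not a measured value).
ASSUMED_AVG_ROW_BYTES = 40

rows = cross_rows(JOBSEEKER_ROWS, VACANCY_ROWS)
size_mb = estimated_bytes(rows, ASSUMED_AVG_ROW_BYTES) / 1_000_000

print(f"records after CROSS: {rows}")
print(f"estimated output size: {size_mb:.0f} MB (assumed row size)")
```

Under those assumptions the CROSS yields 4.6 million records and something on the order of 184 MB, which may well be small enough for job startup overhead to dominate.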
Thanks,
Prashant

On Jan 6, 2012, at 12:46 AM, Michael Lok <[email protected]> wrote:

> Hi Prashant,
>
> Thanks for the input. Any idea what would be a good size to perform
> the benchmark on?
>
>
> Thanks.
>
> On Fri, Jan 6, 2012 at 4:29 PM, Prashant Kommireddi <[email protected]>
> wrote:
>> Hi Michael,
>>
>> That does not seem large enough for benchmarking/comparison. Please try
>> increasing the file size to make it a fair comparison :)
>> It might be possible that the cost of spawning multiple tasks across the
>> nodes is more than the cost of running the job with little data locally.
>>
>> Thanks,
>> Prashant
>>
>> On Fri, Jan 6, 2012 at 12:10 AM, Michael Lok <[email protected]> wrote:
>>
>>> Hi Prashant,
>>>
>>> 1000 and 4600 records respectively :) Hence the output from the cross
>>> join is 4 million records.
>>>
>>> I suppose I should increase the number of records to take advantage of
>>> the parallel features? :)
>>>
>>>
>>> Thanks.
>>>
>>> On Fri, Jan 6, 2012 at 4:04 PM, Prashant Kommireddi <[email protected]>
>>> wrote:
>>>> What is the file size of the 2 data sets? If the datasets are really
>>>> small, making it run distributed might not really give any advantage
>>>> over local mode.
>>>>
>>>> Also, the benefits of parallelism depend on how much data is being
>>>> sent to the reducers.
>>>>
>>>> -Prashant
>>>>
>>>> On Jan 5, 2012, at 11:52 PM, Michael Lok <[email protected]> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I've a simple script which does a CROSS join (thanks to Dimitry for the
>>>>> tip :D) and calls a UDF to perform simple matching between 2 values
>>>>> from the joined result.
>>>>>
>>>>> The script was initially executed via local mode and the average
>>>>> execution time is around 1 minute.
>>>>>
>>>>> However, when the script is executed via mapreduce mode, it averages
>>>>> 2+ minutes. The cluster I've set up consists of 4 datanodes.
>>>>>
>>>>> I've tried setting the "default_parallel" setting to 5 and 10, but it
>>>>> doesn't affect the performance.
>>>>>
>>>>> Is there anything I should look at? BTW, the data size is pretty
>>>>> small; around 4 million records generated from the CROSS operation.
>>>>>
>>>>> Here's the script which I'm referring to:
>>>>>
>>>>> set debug 'on';
>>>>> set job.name 'vacancy cross';
>>>>> set default_parallel 5;
>>>>>
>>>>> register pig/*.jar;
>>>>>
>>>>> define DIST com.pig.udf.Distance();
>>>>>
>>>>> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray,
>>>>> jsstate:chararray);
>>>>>
>>>>> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray,
>>>>> vacstate:chararray);
>>>>>
>>>>> cx = cross js, vac;
>>>>>
>>>>> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate,
>>>>> vacstate);
>>>>>
>>>>> store d into 'out' using PigStorage(',');
>>>>>
>>>>>
>>>>> Thanks!
>>>
