Hi Prashant,

Thanks for the input. Any idea what would be a good size to perform the benchmark on?
Thanks.

On Fri, Jan 6, 2012 at 4:29 PM, Prashant Kommireddi <[email protected]> wrote:
> Hi Michael,
>
> That does not seem large enough for benchmarking/comparison. Please try
> increasing the file size to make it a fair comparison :)
> It might be possible that the cost of spawning multiple tasks across the
> nodes is more than the cost of running the job with little data locally.
>
> Thanks,
> Prashant
>
> On Fri, Jan 6, 2012 at 12:10 AM, Michael Lok <[email protected]> wrote:
>
>> Hi Prashant,
>>
>> 1000 and 4600 records respectively :) Hence the output from the cross
>> join is 4 million records.
>>
>> I suppose I should increase the number of records to take advantage of
>> the parallel features? :)
>>
>>
>> Thanks.
>>
>> On Fri, Jan 6, 2012 at 4:04 PM, Prashant Kommireddi <[email protected]>
>> wrote:
>> > What is the file size of the 2 data sets? If the datasets are really
>> > small, making it run distributed might not really give any advantage
>> > over local mode.
>> >
>> > Also, the benefits of parallelism depend on how much data is being
>> > sent to the reducers.
>> >
>> > -Prashant
>> >
>> > On Jan 5, 2012, at 11:52 PM, Michael Lok <[email protected]> wrote:
>> >
>> >> Hi folks,
>> >>
>> >> I've a simple script which does a CROSS join (thanks to Dimitry for
>> >> the tip :D) and calls a UDF to perform simple matching between 2
>> >> values from the joined result.
>> >>
>> >> The script was initially executed via local mode and the average
>> >> execution time is around 1 minute.
>> >>
>> >> However, when the script is executed via mapreduce mode, it averages
>> >> 2+ minutes. The cluster I've set up consists of 4 datanodes.
>> >>
>> >> I've tried setting the "default_parallel" setting to 5 and 10, but it
>> >> doesn't affect the performance.
>> >>
>> >> Is there anything I should look at? BTW, the data size is pretty
>> >> small; around 4 million records generated from the CROSS operation.
>> >>
>> >> Here's the script which I'm referring to:
>> >>
>> >> set debug 'on';
>> >> set job.name 'vacancy cross';
>> >> set default_parallel 5;
>> >>
>> >> register pig/*.jar;
>> >>
>> >> define DIST com.pig.udf.Distance();
>> >>
>> >> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray, jsstate:chararray);
>> >>
>> >> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray, vacstate:chararray);
>> >>
>> >> cx = cross js, vac;
>> >>
>> >> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);
>> >>
>> >> store d into 'out' using PigStorage(',');
>> >>
>> >>
>> >> Thanks!
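For anyone following along: the thread doesn't include the source of com.pig.udf.Distance, so here is a minimal sketch of what a matching UDF of that shape might look like, assuming it simply compares the two state strings and returns 0 for a match and 1 otherwise (the real implementation may compute something more involved):

    package com.pig.udf;

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical sketch of a Distance UDF: takes (jsstate, vacstate)
    // and returns 0 when the two states match, 1 otherwise. The actual
    // com.pig.udf.Distance used in the script above may differ.
    public class Distance extends EvalFunc<Integer> {

        @Override
        public Integer exec(Tuple input) throws IOException {
            if (input == null || input.size() < 2) {
                return null;
            }
            String jsState = (String) input.get(0);
            String vacState = (String) input.get(1);
            if (jsState == null || vacState == null) {
                return null;
            }
            return jsState.trim().equalsIgnoreCase(vacState.trim()) ? 0 : 1;
        }
    }

With the script's define DIST com.pig.udf.Distance(); line in place, the call DIST(jsstate, vacstate) in the foreach passes both fields to exec() as a single Tuple.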
