Hi Prashant, 1000 and 4600 records respectively :) Hence the output from the cross join is 4 million records.
I suppose I should increase the number of records to take advantage of the parallel features? :) Thanks. On Fri, Jan 6, 2012 at 4:04 PM, Prashant Kommireddi <[email protected]> wrote: > What is the filesize' of the 2 data sets? If the datasets are really > small, making it run distributed might not really give any advantage > over local mode. > > Also the benefits of parallelism depends on how much data is being > sent to the reducers. > > -Prashant > > On Jan 5, 2012, at 11:52 PM, Michael Lok <[email protected]> wrote: > >> Hi folks, >> >> I've a simple script which does CROSS join (thanks to Dimitry for the >> tip :D) and calls a UDF to perform simple matching between 2 values >> from the joined result. >> >> The script was initially executed via local mode and the average >> execution time is around 1 minute. >> >> However, when the script is executed via mapreduce mode, it averages >> 2+ minutes. The cluster I've setup consists of 4 datanodes. >> >> I've tried setting the "default_parallel" setting to 5 and 10, but it >> doesn't affect the performance. >> >> Is there anything I should look at? BTW, the data size is pretty >> small; around 4 million records generated from the CROSS operation. >> >> Here's the script which I'm referring to: >> >> set debug 'on'; >> set job.name 'vacancy cross'; >> set default_parallel 5; >> >> register pig/*.jar; >> >> define DIST com.pig.udf.Distance(); >> >> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray, >> jsstate:chararray); >> >> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray, >> vacstate:chararray); >> >> cx = cross js, vac; >> >> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate); >> >> store d into 'out' using PigStorage(','); >> >> >> Thanks!
