Hi folks,
I have a simple script that does a CROSS join (thanks to Dimitry for the
tip :D) and calls a UDF to perform a simple match between two values
from the joined result.
I first ran the script in local mode, where the average execution time
is around 1 minute.
However, when the script is run in mapreduce mode, it averages
2+ minutes. The cluster I've set up consists of 4 datanodes.
I've tried setting "default_parallel" to 5 and 10, but neither value
changes the execution time.
Is there anything I should look at? BTW, the data set is pretty
small: the CROSS operation generates around 4 million records.
Here's the script which I'm referring to:
set debug 'on';
set job.name 'vacancy cross';
set default_parallel 5;
register pig/*.jar;
define DIST com.pig.udf.Distance();
js = load 'jobseeker.csv' using PigStorage(',')
    as (ic:chararray, jsstate:chararray);
vac = load 'vacancy.csv' using PigStorage(',')
    as (id:chararray, vacstate:chararray);
cx = cross js, vac;
d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);
store d into 'out' using PigStorage(',');
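
For reference, the parallelism mentioned above can also be requested on the CROSS operator itself rather than globally. Both "default_parallel" and the PARALLEL clause only control the number of reduce tasks; the number of map tasks follows the input splits. The following is only a sketch of that variant using the same relations; the value 10 is an illustrative assumption, not a tested recommendation, and the final EXPLAIN is there just to print the plans Pig would generate instead of running the job.

```pig
-- Sketch only: same inputs and UDF as the script above.
-- The reducer count is requested on the CROSS operator instead of
-- through default_parallel. The value 10 is an assumption.
register pig/*.jar;
define DIST com.pig.udf.Distance();

js = load 'jobseeker.csv' using PigStorage(',')
    as (ic:chararray, jsstate:chararray);
vac = load 'vacancy.csv' using PigStorage(',')
    as (id:chararray, vacstate:chararray);

-- PARALLEL applies to the reduce phase of this operator only
cx = cross js, vac parallel 10;

d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);

-- Print the logical, physical, and MapReduce plans for d
-- rather than storing the result
explain d;
```

The EXPLAIN output should make it possible to see how many MapReduce jobs the script compiles to and where the CROSS lands in them.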
Thanks!