Hi folks,

I've a simple script which does CROSS join (thanks to Dimitry for the
tip :D) and calls a UDF to perform simple matching between 2 values
from the joined result.

The script was initially executed via local mode and the average
execution time is around 1 minute.

However, when the script is executed via mapreduce mode, it averages
2+ minutes.  The cluster I've setup consists of 4 datanodes.

I've tried setting the "default_parallel" setting to 5 and 10, but it
doesn't affect the performance.

Is there anything I should look at?  BTW, the data size is pretty
small; around 4 million records generated from the CROSS operation.

Here's the script which I'm referring to:

set debug 'on';
set job.name 'vacancy cross';
set default_parallel 5;

register pig/*.jar;

define DIST com.pig.udf.Distance();

js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray,
jsstate:chararray);

vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray,
vacstate:chararray);

cx = cross js, vac;

d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);

store d into 'out' using PigStorage(',');


Thanks!

Reply via email to