Hi Prashant,

1000 and 4600 records respectively :)  Hence the output from the cross
join is about 4.6 million records.
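
A quick sanity check on that count, as a minimal Python sketch. Only the
1000 and 4600 figures come from my data sets; the sample rows and the
cross_size helper are made up for illustration.

```python
from itertools import product


def cross_size(n_left: int, n_right: int) -> int:
    """Rows emitted by a cross join: every left row pairs with every right row."""
    return n_left * n_right


# Hypothetical sample rows standing in for jobseeker.csv and vacancy.csv.
js = [("ic1", "KL"), ("ic2", "Penang")]
vac = [("v1", "KL"), ("v2", "Johor"), ("v3", "Penang")]

# Build the cross product the same way CROSS pairs up the two relations.
pairs = [j + v for j, v in product(js, vac)]
assert len(pairs) == cross_size(len(js), len(vac))

# Record counts from the two real data sets.
print(cross_size(1000, 4600))  # prints 4600000
```

So the UDF gets called roughly 4.6 million times, once per pair.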

I suppose I should increase the number of records to take advantage of
the parallel features? :)


Thanks.

On Fri, Jan 6, 2012 at 4:04 PM, Prashant Kommireddi <[email protected]> wrote:
> What is the file size of the 2 data sets? If the datasets are really
> small, making it run distributed might not really give any advantage
> over local mode.
>
> Also the benefits of parallelism depend on how much data is being
> sent to the reducers.
>
> -Prashant
>
> On Jan 5, 2012, at 11:52 PM, Michael Lok <[email protected]> wrote:
>
>> Hi folks,
>>
>> I've a simple script which does CROSS join (thanks to Dimitry for the
>> tip :D) and calls a UDF to perform simple matching between 2 values
>> from the joined result.
>>
>> The script was initially executed via local mode and the average
>> execution time is around 1 minute.
>>
>> However, when the script is executed via mapreduce mode, it averages
>> 2+ minutes.  The cluster I've set up consists of 4 datanodes.
>>
>> I've tried setting "default_parallel" to 5 and 10, but it
>> doesn't affect the performance.
>>
>> Is there anything I should look at?  BTW, the data size is pretty
>> small; around 4 million records generated from the CROSS operation.
>>
>> Here's the script which I'm referring to:
>>
>> set debug 'on';
>> set job.name 'vacancy cross';
>> set default_parallel 5;
>>
>> register pig/*.jar;
>>
>> define DIST com.pig.udf.Distance();
>>
>> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray,
>> jsstate:chararray);
>>
>> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray,
>> vacstate:chararray);
>>
>> cx = cross js, vac;
>>
>> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);
>>
>> store d into 'out' using PigStorage(',');
>>
>>
>> Thanks!
