Re: Local executes faster compared to mapreduce mode

Prashant Kommireddi Fri, 06 Jan 2012 00:05:38 -0800

What is the file size of the 2 data sets? If the data sets are really
small, running the script distributed might not give any advantage
over local mode, since the fixed cost of launching MapReduce jobs can
outweigh the actual work.
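
If you don't have the numbers handy, here is a rough sketch for counting
the records in each input. It reuses the load statements from your
script; the alias names js_all, js_cnt, vac_all and vac_cnt are made up
for illustration, and I have not run this against your data.

```pig
-- Load both inputs with the same schemas as in the original script.
js  = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray, jsstate:chararray);
vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray, vacstate:chararray);

-- GROUP ... ALL puts every record into a single group so COUNT can see them all.
-- Note that COUNT skips records whose first field is null.
js_all  = group js all;
js_cnt  = foreach js_all generate COUNT(js);

vac_all = group vac all;
vac_cnt = foreach vac_all generate COUNT(vac);

-- Print the two counts.
dump js_cnt;
dump vac_cnt;
```

The product of the two counts should roughly match the ~4 million
records you see coming out of the CROSS.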


Also, the benefit of parallelism depends on how much data is being
sent to the reducers.

-Prashant

On Jan 5, 2012, at 11:52 PM, Michael Lok <[email protected]> wrote:

> Hi folks,
>
> I've a simple script which does a CROSS join (thanks to Dimitry for the
> tip :D) and calls a UDF to perform simple matching between 2 values
> from the joined result.
>
> The script was initially executed via local mode and the average
> execution time is around 1 minute.
>
> However, when the script is executed via mapreduce mode, it averages
> 2+ minutes.  The cluster I've set up consists of 4 datanodes.
>
> I've tried setting the "default_parallel" setting to 5 and 10, but it
> doesn't affect the performance.
>
> Is there anything I should look at?  BTW, the data size is pretty
> small; around 4 million records generated from the CROSS operation.
>
> Here's the script which I'm referring to:
>
> set debug 'on';
> set job.name 'vacancy cross';
> set default_parallel 5;
>
> register pig/*.jar;
>
> define DIST com.pig.udf.Distance();
>
> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray,
> jsstate:chararray);
>
> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray,
> vacstate:chararray);
>
> cx = cross js, vac;
>
> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);
>
> store d into 'out' using PigStorage(',');
>
>
> Thanks!
