Hi Prashant,

Thanks for the input.  Any idea what would be a good size to run the
benchmark on?


Thanks.
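In case it helps anyone else trying the same comparison, here's a rough sketch of how the inputs could be scaled up; the sizes, id prefixes, and state codes are placeholders I picked, not anything from the actual data sets, but the two-column CSV layout matches what the script loads:

```python
# Hypothetical generator for larger benchmark inputs.  Row counts,
# id prefixes, and state codes below are made-up placeholders; only the
# (id, state) two-column CSV shape comes from the Pig script in the thread.
import csv
import random

STATES = ["NY", "CA", "TX", "WA", "IL"]  # placeholder state codes

def gen_rows(n, id_prefix):
    """Return n (id, state) pairs with sequential ids and random states."""
    return [(f"{id_prefix}{i:08d}", random.choice(STATES)) for i in range(n)]

def write_csv(path, rows):
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

if __name__ == "__main__":
    # 100k x 100k inputs give a 10-billion-row cross product -- tune to
    # whatever makes the distributed run's overhead worth paying.
    write_csv("jobseeker.csv", gen_rows(100_000, "IC"))
    write_csv("vacancy.csv", gen_rows(100_000, "VAC"))
```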

On Fri, Jan 6, 2012 at 4:29 PM, Prashant Kommireddi <[email protected]> wrote:
> Hi Michael,
>
> That does not seem large enough for benchmarking/comparison. Please try
> increasing the filesize to make it a fair comparison :)
> It might be possible that the cost of spawning multiple tasks across the
> nodes is more than the cost of running the job locally on a small amount
> of data.
>
> Thanks,
> Prashant
>
> On Fri, Jan 6, 2012 at 12:10 AM, Michael Lok <[email protected]> wrote:
>
>> Hi Prashant,
>>
>> 1000 and 4600 records respectively :)  Hence the output from the cross
>> join is about 4.6 million records.
>>
>> I suppose I should increase the number of records to take advantage of
>> the parallel features? :)
>>
>>
>> Thanks.
>>
>> On Fri, Jan 6, 2012 at 4:04 PM, Prashant Kommireddi <[email protected]>
>> wrote:
>> > What is the filesize of the 2 data sets? If the datasets are really
>> > small, making it run distributed might not really give any advantage
>> > over local mode.
>> >
>> > Also the benefits of parallelism depend on how much data is being
>> > sent to the reducers.
>> >
>> > -Prashant
>> >
>> > On Jan 5, 2012, at 11:52 PM, Michael Lok <[email protected]> wrote:
>> >
>> >> Hi folks,
>> >>
>> >> I've a simple script which does CROSS join (thanks to Dimitry for the
>> >> tip :D) and calls a UDF to perform simple matching between 2 values
>> >> from the joined result.
>> >>
>> >> The script was initially executed via local mode and the average
>> >> execution time is around 1 minute.
>> >>
>> >> However, when the script is executed via mapreduce mode, it averages
>> >> 2+ minutes.  The cluster I've set up consists of 4 datanodes.
>> >>
>> >> I've tried setting the "default_parallel" setting to 5 and 10, but it
>> >> doesn't affect the performance.
>> >>
>> >> Is there anything I should look at?  BTW, the data size is pretty
>> >> small; around 4 million records generated from the CROSS operation.
>> >>
>> >> Here's the script which I'm referring to:
>> >>
>> >> set debug 'on';
>> >> set job.name 'vacancy cross';
>> >> set default_parallel 5;
>> >>
>> >> register pig/*.jar;
>> >>
>> >> define DIST com.pig.udf.Distance();
>> >>
>> >> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray,
>> >> jsstate:chararray);
>> >>
>> >> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray,
>> >> vacstate:chararray);
>> >>
>> >> cx = cross js, vac;
>> >>
>> >> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate,
>> vacstate);
>> >>
>> >> store d into 'out' using PigStorage(',');
>> >>
>> >>
>> >> Thanks!
>>