Hi,

Thanks for the additional info.  I'll try a substantial increase in
record size and see how that performs.
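For reference, here is one way I could generate the scaled-up inputs. This is just a sketch: the key prefixes and state values are made up, but the two files match the (ic, jsstate) and (id, vacstate) schemas in the script further down the thread, at 10x the original 1,000 and 4,600 record counts.

```python
import csv
import random

# Hypothetical state values; only the two-column CSV shape matters here.
STATES = ["Johor", "Kedah", "Melaka", "Pahang", "Penang",
          "Perak", "Sabah", "Sarawak", "Selangor"]

def make_rows(n, prefix):
    """Return n (key, state) pairs with synthetic keys like 'IC000042'."""
    return [("%s%06d" % (prefix, i), random.choice(STATES))
            for i in range(n)]

def write_csv(path, rows):
    """Write rows as a comma-separated file, one record per line."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# 10x the original sizes (1,000 jobseekers, 4,600 vacancies):
write_csv("jobseeker.csv", make_rows(10000, "IC"))
write_csv("vacancy.csv", make_rows(46000, "VAC"))
```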

The benchmark is to see what kind of performance gain I can get from
mapreduce mode, depending on the number of nodes.
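A back-of-the-envelope model of what to expect, using the roughly fixed ~60s job-startup overhead mentioned below (all numbers illustrative, and assuming the work splits evenly across nodes):

```python
def mr_time(work_secs, nodes, overhead_secs=60.0):
    """Estimated MR wall time: fixed startup overhead plus the
    parallelizable work divided across the nodes."""
    return overhead_secs + work_secs / nodes

# A ~60s local job on my 4-node cluster: overhead dominates,
# so MR mode comes out slower than local mode.
print(mr_time(60.0, 4))    # 75.0 seconds vs 60s locally

# A 10x bigger job is where the cluster starts to pay off.
print(mr_time(600.0, 4))   # 210.0 seconds vs 600s locally
```

This matches what I'm seeing: with only ~1 minute of real work, the cluster can't beat local mode no matter what default_parallel is set to.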


Thanks.

On Sat, Jan 7, 2012 at 10:38 PM, Gianmarco De Francisci Morales
<[email protected]> wrote:
> Hi,
> there is a fixed overhead for scheduling and starting the job in MR mode.
> The minimum job time I have seen in my (limited) experience is around 1
> minute for a piece of code that did basically nothing on a small dataset.
> If your job takes 1 minute locally, it's not a good candidate for
> parallelization :)
>
> As suggested by others, try bigger numbers.
> A 10x increase should already give you something more meaningful in my
> opinion.
>
> Cheers,
> --
> Gianmarco
>
>
>
> On Fri, Jan 6, 2012 at 10:22, Prashant Kommireddi <[email protected]>wrote:
>
>> FYI, local mode is ideally suited for debugging (easier since it's a
>> single process). It is not suited for large datasets; those are what
>> mapreduce is for.
>>
>> It might never be apples to apples if you are comparing the two, since
>> the variables differ. For large datasets you might notice the job
>> choking in local mode.
>>
>> -Prashant
>>
>> On Jan 6, 2012, at 1:16 AM, Prashant Kommireddi <[email protected]>
>> wrote:
>>
>> > I would recommend trying it with a few GBs.
>> >
>> > I'm curious as to why you are benchmarking local vs mapreduce?
>> >
>> > Thanks,
>> > Prashant
>> >
>> > On Jan 6, 2012, at 12:46 AM, Michael Lok <[email protected]> wrote:
>> >
>> >> Hi Prashant,
>> >>
>> >> Thanks for the input.  Any idea what would be a good size to perform
>> >> benchmark on?
>> >>
>> >>
>> >> Thanks.
>> >>
>> >> On Fri, Jan 6, 2012 at 4:29 PM, Prashant Kommireddi <
>> [email protected]> wrote:
>> >>> Hi Michael,
>> >>>
>> >>> That does not seem large enough for benchmarking/comparison. Please try
>> >>> increasing the filesize to make it a fair comparison :)
>> >>> It is possible that the cost of spawning multiple tasks across the
>> >>> nodes is more than the cost of running the job with little data locally.
>> >>>
>> >>> Thanks,
>> >>> Prashant
>> >>>
>> >>> On Fri, Jan 6, 2012 at 12:10 AM, Michael Lok <[email protected]>
>> wrote:
>> >>>
>> >>>> Hi Prashant,
>> >>>>
>> >>>> 1000 and 4600 records respectively :)  Hence the output from the
>> >>>> cross join is 4.6 million records.
>> >>>>
>> >>>> I suppose I should increase the number of records to take advantage of
>> >>>> the parallel features? :)
>> >>>>
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>> On Fri, Jan 6, 2012 at 4:04 PM, Prashant Kommireddi <
>> [email protected]>
>> >>>> wrote:
>> >>>>> What is the filesize of the 2 data sets? If the datasets are really
>> >>>>> small, making it run distributed might not really give any advantage
>> >>>>> over local mode.
>> >>>>>
>> >>>>> Also, the benefits of parallelism depend on how much data is being
>> >>>>> sent to the reducers.
>> >>>>>
>> >>>>> -Prashant
>> >>>>>
>> >>>>> On Jan 5, 2012, at 11:52 PM, Michael Lok <[email protected]> wrote:
>> >>>>>
>> >>>>>> Hi folks,
>> >>>>>>
>> >>>>>> I have a simple script which does a CROSS join (thanks to Dimitry
>> >>>>>> for the tip :D) and calls a UDF to perform simple matching between
>> >>>>>> 2 values from the joined result.
>> >>>>>>
>> >>>>>> The script was initially executed via local mode and the average
>> >>>>>> execution time is around 1 minute.
>> >>>>>>
>> >>>>>> However, when the script is executed via mapreduce mode, it averages
>> >>>>>> 2+ minutes.  The cluster I've setup consists of 4 datanodes.
>> >>>>>>
>> >>>>>> I've tried setting "default_parallel" to 5 and 10, but it doesn't
>> >>>>>> affect the performance.
>> >>>>>>
>> >>>>>> Is there anything I should look at?  BTW, the data size is pretty
>> >>>>>> small; around 4 million records generated from the CROSS operation.
>> >>>>>>
>> >>>>>> Here's the script which I'm referring to:
>> >>>>>>
>> >>>>>> set debug 'on';
>> >>>>>> set job.name 'vacancy cross';
>> >>>>>> set default_parallel 5;
>> >>>>>>
>> >>>>>> register pig/*.jar;
>> >>>>>>
>> >>>>>> define DIST com.pig.udf.Distance();
>> >>>>>>
>> >>>>>> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray,
>> >>>>>> jsstate:chararray);
>> >>>>>>
>> >>>>>> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray,
>> >>>>>> vacstate:chararray);
>> >>>>>>
>> >>>>>> cx = cross js, vac;
>> >>>>>>
>> >>>>>> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate, vacstate);
>> >>>>>>
>> >>>>>> store d into 'out' using PigStorage(',');
>> >>>>>>
>> >>>>>>
>> >>>>>> Thanks!
>> >>>>
>>
