I would recommend trying it with a few GBs.

I'm curious as to why you are benchmarking local vs mapreduce?

Thanks,
Prashant

On Jan 6, 2012, at 12:46 AM, Michael Lok <[email protected]> wrote:

> Hi Prashant,
>
> Thanks for the input.  Any idea what would be a good size to perform
> benchmark on?
>
>
> Thanks.
>
> On Fri, Jan 6, 2012 at 4:29 PM, Prashant Kommireddi <[email protected]> 
> wrote:
>> Hi Michael,
>>
>> That does not seem large enough for benchmarking/comparison. Please try
>> increasing the filesize to make it a fair comparison :)
>> It might be possible the cost of spawning multiple tasks across the nodes
>> is more than cost of running the job with little data locally.
>>
>> Thanks,
>> Prashant
>>
>> On Fri, Jan 6, 2012 at 12:10 AM, Michael Lok <[email protected]> wrote:
>>
>>> Hi Prashant,
>>>
>>> 1000 and 4600 records respectively :)  Hence the output from the cross
>>> join is 4 million records.
>>>
>>> I suppose I should increase the number of records to take advantage of
>>> the parallel features? :)
>>>
>>>
>>> Thanks.
>>>
>>> On Fri, Jan 6, 2012 at 4:04 PM, Prashant Kommireddi <[email protected]>
>>> wrote:
>>>> What is the filesize' of the 2 data sets? If the datasets are really
>>>> small, making it run distributed might not really give any advantage
>>>> over local mode.
>>>>
>>>> Also the benefits of parallelism depends on how much data is being
>>>> sent to the reducers.
>>>>
>>>> -Prashant
>>>>
>>>> On Jan 5, 2012, at 11:52 PM, Michael Lok <[email protected]> wrote:
>>>>
>>>>> Hi folks,
>>>>>
>>>>> I've a simple script which does CROSS join (thanks to Dimitry for the
>>>>> tip :D) and calls a UDF to perform simple matching between 2 values
>>>>> from the joined result.
>>>>>
>>>>> The script was initially executed via local mode and the average
>>>>> execution time is around 1 minute.
>>>>>
>>>>> However, when the script is executed via mapreduce mode, it averages
>>>>> 2+ minutes.  The cluster I've setup consists of 4 datanodes.
>>>>>
>>>>> I've tried setting the "default_parallel" setting to 5 and 10, but it
>>>>> doesn't affect the performance.
>>>>>
>>>>> Is there anything I should look at?  BTW, the data size is pretty
>>>>> small; around 4 million records generated from the CROSS operation.
>>>>>
>>>>> Here's the script which I'm referring to:
>>>>>
>>>>> set debug 'on';
>>>>> set job.name 'vacancy cross';
>>>>> set default_parallel 5;
>>>>>
>>>>> register pig/*.jar;
>>>>>
>>>>> define DIST com.pig.udf.Distance();
>>>>>
>>>>> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray,
>>>>> jsstate:chararray);
>>>>>
>>>>> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray,
>>>>> vacstate:chararray);
>>>>>
>>>>> cx = cross js, vac;
>>>>>
>>>>> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate,
>>> vacstate);
>>>>>
>>>>> store d into 'out' using PigStorage(',');
>>>>>
>>>>>
>>>>> Thanks!
>>>

Reply via email to