Hi, thanks for the additional info. I'll try a substantial increase in record size and see how that performs.
The benchmark is to see what kind of performance gain I can get from
mapreduce mode depending on the number of nodes. Thanks.

On Sat, Jan 7, 2012 at 10:38 PM, Gianmarco De Francisci Morales <[email protected]> wrote:

> Hi,
> there is a fixed overhead for scheduling and starting the job in MR mode.
> The minimum job time I have seen in my (limited) experience is around 1
> minute for a piece of code that did basically nothing on a small dataset.
> If your job takes 1 minute locally, it's not a good candidate for
> parallelization :)
>
> As suggested by others, try bigger numbers.
> A 10x increase should already give you something more meaningful in my
> opinion.
>
> Cheers,
> --
> Gianmarco
>
> On Fri, Jan 6, 2012 at 10:22, Prashant Kommireddi <[email protected]> wrote:
>
>> FYI, local mode is ideally suited for debugging (easier since it's a
>> single process). It is not suited for large datasets; that is the goal
>> of mapreduce.
>>
>> It might never be apples to apples if you are comparing the two, since
>> the variables differ. For large datasets you might notice the job
>> choking in local mode.
>>
>> -Prashant
>>
>> On Jan 6, 2012, at 1:16 AM, Prashant Kommireddi <[email protected]> wrote:
>>
>>> I would recommend trying it with a few GBs.
>>>
>>> I'm curious as to why you are benchmarking local vs mapreduce?
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Jan 6, 2012, at 12:46 AM, Michael Lok <[email protected]> wrote:
>>>
>>>> Hi Prashant,
>>>>
>>>> Thanks for the input. Any idea what would be a good size to run the
>>>> benchmark on?
>>>>
>>>> Thanks.
>>>>
>>>> On Fri, Jan 6, 2012 at 4:29 PM, Prashant Kommireddi <[email protected]> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> That does not seem large enough for benchmarking/comparison. Please try
>>>>> increasing the file size to make it a fair comparison :)
>>>>> It might be possible that the cost of spawning multiple tasks across
>>>>> the nodes is more than the cost of running the job locally with little
>>>>> data.
>>>>>
>>>>> Thanks,
>>>>> Prashant
>>>>>
>>>>> On Fri, Jan 6, 2012 at 12:10 AM, Michael Lok <[email protected]> wrote:
>>>>>
>>>>>> Hi Prashant,
>>>>>>
>>>>>> 1000 and 4600 records respectively :) Hence the output from the cross
>>>>>> join is 4 million records.
>>>>>>
>>>>>> I suppose I should increase the number of records to take advantage
>>>>>> of the parallel features? :)
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Fri, Jan 6, 2012 at 4:04 PM, Prashant Kommireddi <[email protected]> wrote:
>>>>>>
>>>>>>> What is the file size of the 2 data sets? If the datasets are really
>>>>>>> small, making them run distributed might not really give any
>>>>>>> advantage over local mode.
>>>>>>>
>>>>>>> Also, the benefits of parallelism depend on how much data is being
>>>>>>> sent to the reducers.
>>>>>>>
>>>>>>> -Prashant
>>>>>>>
>>>>>>> On Jan 5, 2012, at 11:52 PM, Michael Lok <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> I've a simple script which does a CROSS join (thanks to Dimitry for
>>>>>>>> the tip :D) and calls a UDF to perform simple matching between 2
>>>>>>>> values from the joined result.
>>>>>>>>
>>>>>>>> The script was initially executed via local mode, and the average
>>>>>>>> execution time is around 1 minute.
>>>>>>>>
>>>>>>>> However, when the script is executed via mapreduce mode, it averages
>>>>>>>> 2+ minutes. The cluster I've set up consists of 4 datanodes.
>>>>>>>>
>>>>>>>> I've tried setting "default_parallel" to 5 and 10, but it doesn't
>>>>>>>> affect the performance.
>>>>>>>>
>>>>>>>> Is there anything I should look at? BTW, the data size is pretty
>>>>>>>> small; around 4 million records generated from the CROSS operation.
>>>>>>>>
>>>>>>>> Here's the script I'm referring to:
>>>>>>>>
>>>>>>>> set debug 'on';
>>>>>>>> set job.name 'vacancy cross';
>>>>>>>> set default_parallel 5;
>>>>>>>>
>>>>>>>> register pig/*.jar;
>>>>>>>>
>>>>>>>> define DIST com.pig.udf.Distance();
>>>>>>>>
>>>>>>>> js = load 'jobseeker.csv' using PigStorage(',') as (ic:chararray,
>>>>>>>> jsstate:chararray);
>>>>>>>>
>>>>>>>> vac = load 'vacancy.csv' using PigStorage(',') as (id:chararray,
>>>>>>>> vacstate:chararray);
>>>>>>>>
>>>>>>>> cx = cross js, vac;
>>>>>>>>
>>>>>>>> d = foreach cx generate ic, jsstate, id, vacstate, DIST(jsstate,
>>>>>>>> vacstate);
>>>>>>>>
>>>>>>>> store d into 'out' using PigStorage(',');
>>>>>>>>
>>>>>>>> Thanks!
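To act on the 10x suggestion above, here is a minimal sketch of a generator for larger synthetic inputs matching the two-column schemas the script loads (`ic`/`jsstate` and `id`/`vacstate`). The state codes, code format, and row counts are illustrative assumptions, not values from the thread:

```python
import csv
import random
import string

# Illustrative state codes; the thread never shows actual values,
# only that both state columns are chararray.
STATES = ["JHR", "KDH", "KTN", "MLK", "NSN", "PHG", "PRK",
          "PLS", "PNG", "SBH", "SWK", "SGR", "TRG", "KUL"]

def random_code(n=8):
    """Random fixed-length alphanumeric code for the first column (ic/id)."""
    return "".join(random.choices(string.ascii_uppercase + string.digits, k=n))

def write_dataset(path, rows):
    """Write a two-column CSV (code, state) matching the script's load schema."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for _ in range(rows):
            w.writerow([random_code(), random.choice(STATES)])

if __name__ == "__main__":
    # 10x the original 1,000 and 4,600 rows. The CROSS output then grows to
    # 10,000 * 46,000 = 460 million records, which should dwarf the ~1 minute
    # fixed job-startup overhead mentioned above.
    write_dataset("jobseeker.csv", 10_000)
    write_dataset("vacancy.csv", 46_000)
```

With inputs at this scale, the mapreduce run has enough work per task for `default_parallel` to start mattering, making the local-vs-cluster comparison more meaningful.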
