Jumping in a bit late on this. I am not sure how valuable such benchmarks are from an academic standpoint. You wouldn't really be testing the performance of the algorithms, but of the implementations, with a lot of unknowns in the middle -- Pig and Hive use different serialization formats, different code for reading and passing around data, construct their pipelines differently, treat garbage collection differently, and so on. I wouldn't be surprised if measuring theoretically identical join algorithms (say, the regular hash join) gave you different results. Moreover, all of these things are highly dependent on tuning and memory settings, and are a moving target as both projects keep improving their codebases.

Two ideas:

1) If you are specifically interested in benchmarks, an interesting benchmarking problem might be adjusting the various JVM parameters to identify what effect they have on the execution of Pig and Hive jobs, and whether the same parameters turn out to be ideal for both. That way you are isolating your test to a single variable (since you are only comparing Pig to Pig, and Hive to Hive). It would be really cool if you came up with something that cleverly searched the total space of the JVM parameters and identified likely best configurations, without doing an exhaustive search of the space of course. There are pointers to some JVM resources here:
http://www.quora.com/What-are-some-useful-tips-for-tuning-programs-running-on-the-JVM?q=jvm+tun
You might even try measuring what effect using different garbage collectors has. Try to do your experiments on a real cluster; I suspect results from AWS will be a bit questionable since their virtualization tech will get in your way.

2) I would love to see someone do proper cost-based optimization for either Hive or Pig. I know several people have tried in the past but nothing that really worked came of it... I'd be happy to help brainstorm approaches.

-D

2011/1/4 Alan Gates <[email protected]>

> Hi Michal,
>
> A couple of areas where you could study performance without duplicating
> Robert Stewart's work come to mind. One is in the area of how data skew
> affects performance. This is a very real world concern since in my
> experience almost all input data is power law distributed. Consider for
> example if you want to join a highly skewed table against an evenly
> distributed table. Using the default join algorithm some small subset of
> your reducers will get the vast majority of the data. Pig has a join
> implementation called skew join that can handle this and evenly distribute
> the data. I believe Hive has a similar join implementation (I know at least
> that they planned to, I'm not sure if it's done yet or not). So you could
> test performance of skewed joins between the two as well as skewed versus
> non-skewed implementations of join.
>
> Another area that comes to mind is combining multiple grouping operations
> into one Map Reduce job. This is something we see used extensively at Yahoo
> as users often want to read data once and group it by different sets of
> keys. Both Pig and Hive have support for this. In Pig we call it
> multi-query. I think Hive calls it multiple insert or something like that.
> This is another area where you could test performance both between Pig and
> Hive and between using the multi-query algorithm and scanning the data
> multiple times.
>
> I hope those are helpful. Whatever you choose, good luck with your thesis.
>
> Alan.
>
>
> On Jan 4, 2011, at 1:21 PM, Michał Anglart wrote:
>
>> Hi Everybody,
>>
>> I'm a soon-to-graduate student of computer science at the University
>> of Wrocław in Poland. Currently I'm starting to write my master's thesis
>> and I'm looking for some inspiration/ideas.
>>
>> First of all I want to write about MapReduce - as far as I know nobody
>> took such topics as their thesis at my faculty, but the topic is
>> interesting, so someone should start. Lately I thought that maybe I
>> could consider comparing Java's MapReduce with Hive and Pig in terms
>> of their performance, optimizations that are used inside etc.
>> Personally I found it a nice idea as it would allow me to learn
>> both frameworks and take a look at the way they work. Unfortunately I
>> found out that Robert Stewart from Heriot-Watt University wrote his
>> thesis on "Performance & Programming Comparison of JAQL, Hive, Pig and
>> Java" which can be found via Google. I looked through this paper and
>> it looks quite similar to what I wanted to do.
>>
>> After this discovery I thought that maybe a slightly different
>> approach to performance comparison could prove to be a successful topic
>> for my master's thesis: specifically I'm thinking about comparing the
>> frameworks on some real-life problem. Robert in his paper ran the
>> experiments on a few quite simple problems like word count, a simple join
>> of two sets or log processing. I'm thinking about first: comparing
>> them on a real-life problem and second: looking for optimizations that can
>> be made in Pig or Hive (e.g. choosing a join strategy) and how they
>> affect the performance of the frameworks.
>>
>> Ok, after this long introduction I want to ask you: do you think it is
>> an interesting approach and does it make any sense? Is it worth trying?
>> If so - maybe you can suggest the features of the frameworks which I
>> should look at more closely and maybe some real-life problems that can be
>> used in the experiments?
>>
>> I look forward to any comments - thanks in advance.
>>
>> p.s. I've posted this message on both frameworks' mailing lists - hive and
>> pig.
>>
>>
>> Thanks!
>> Michal
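
For what it's worth, here is a rough sketch of what the non-exhaustive search from idea 1 could look like, using plain random search in Python. Nearly everything in it is an assumption made for illustration: the value ranges, the `run_job` callback (which you would replace with something that launches a real Pig or Hive job with the given child JVM options and returns its wall-clock time), and the `fake_job` cost model, which is made up so the snippet runs on its own. The JVM flags themselves are standard HotSpot options. Random search is only one simple strategy, not a claim about what works best here.

```python
import random

# Candidate values per option. The flags are standard HotSpot options;
# the value ranges are illustrative assumptions, not recommendations.
SEARCH_SPACE = {
    "heap_mb": [512, 1024, 2048, 4096],
    "new_ratio": [1, 2, 3, 4, 8],
    "collector": [
        "-XX:+UseSerialGC",
        "-XX:+UseParallelGC",
        "-XX:+UseConcMarkSweepGC",
    ],
}


def to_java_opts(config):
    """Render one configuration as a JVM options string."""
    return "-Xmx{heap_mb}m -XX:NewRatio={new_ratio} {collector}".format(**config)


def sample_config(rng):
    """Pick one value per option, uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(run_job, trials, seed=0):
    """Try up to `trials` random configurations and keep the fastest one.

    `run_job` is a hypothetical callback: it receives a JVM options string
    and returns the measured run time in seconds.
    """
    rng = random.Random(seed)
    seen = set()
    best_opts, best_time = None, float("inf")
    for _ in range(trials):
        opts = to_java_opts(sample_config(rng))
        if opts in seen:  # do not pay for the same configuration twice
            continue
        seen.add(opts)
        elapsed = run_job(opts)
        if elapsed < best_time:
            best_opts, best_time = opts, elapsed
    return best_opts, best_time


def fake_job(java_opts):
    """Made-up cost model standing in for a real Pig or Hive run."""
    heap_mb = int(java_opts.split()[0][len("-Xmx"):-1])
    gc_penalty = 0.0 if "UseParallelGC" in java_opts else 5.0
    return abs(heap_mb - 2048) / 100.0 + gc_penalty


if __name__ == "__main__":
    print(random_search(fake_job, trials=20))
```

On a real cluster each call to `run_job` is expensive and noisy, so you would probably want to repeat each configuration a few times and compare medians rather than single runs. Running the same search separately for Pig and for Hive keeps the comparison Pig-to-Pig and Hive-to-Hive, as suggested above.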
