MapReduce

Alan Gates Tue, 04 Jan 2011 13:39:21 -0800

Hi Michal,

A couple of areas where you could study performance without duplicating Robert Stewart's work come to mind. One is in the area of how data skew affects performance. This is a very real world concern since in my experience almost all input data is power law distributed. Consider for example if you want to join a highly skewed table against an evenly distributed table. Using the default join algorithm some small subset of your reducers will get the vast majority of the data. Pig has a join implementation called skew join that can handle this and evenly distribute the data. I believe Hive has a similar join implementation (I know at least that they planned to, I'm not sure if it's done yet or not). So you could test performance of skewed joins between the two as well as skewed versus non skewed implementations of join.
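
To make the skew effect concrete, here is a minimal Python sketch. It is not Pig's or Hive's actual code: it only simulates how rows with power-law-distributed join keys land on reducers under a plain hash partitioner, and how "salting" the key spreads a hot key over several reducers. The function names, the Zipf exponent, and the salting scheme are all assumptions chosen for illustration. In a real join, the rows of the other table would also have to be copied to every salt bucket of a key.

```python
import random
import zlib
from collections import Counter


def zipf_keys(n_rows, n_keys, exponent=1.2, seed=0):
    """Draw n_rows join keys from a Zipf-like (power-law) distribution."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_rows)


def reducer_loads(keys, n_reducers, salt_buckets=1, seed=0):
    """Return the number of rows each reducer receives.

    salt_buckets=1 models a plain hash partitioner: every row with the
    same key goes to the same reducer. salt_buckets > 1 appends a random
    salt to each key, so one hot key is spread over several reducers.
    """
    rng = random.Random(seed)
    loads = Counter()
    for key in keys:
        salt = rng.randrange(salt_buckets)
        # crc32 stands in for the partitioner's hash function (assumption).
        partition = zlib.crc32(f"{key}:{salt}".encode()) % n_reducers
        loads[partition] += 1
    return [loads[r] for r in range(n_reducers)]


if __name__ == "__main__":
    keys = zipf_keys(n_rows=100_000, n_keys=1_000)
    plain = reducer_loads(keys, n_reducers=10)
    salted = reducer_loads(keys, n_reducers=10, salt_buckets=50)
    print("plain  max/mean:", max(plain) / (sum(plain) / len(plain)))
    print("salted max/mean:", max(salted) / (sum(salted) / len(salted)))
```

The max/mean ratio is a rough proxy for how much longer the slowest reducer runs than the average one, which is the kind of number a skewed-versus-non-skewed join benchmark would try to measure on a real cluster.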

Another area that comes to mind is combining multiple grouping operations into one Map Reduce job. This is something we see used extensively at Yahoo as users often want to read data once and group it by different sets of keys. Both Pig and Hive have support for this. In Pig we call it multi-query. I think Hive calls it multiple insert or something like that. This is another area where you could test performance both between Pig and Hive and between using the multi-query algorithm and scanning the data multiple times.
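
The difference between the two strategies can be sketched in a few lines of Python. This is an in-memory illustration only, not how Pig's multi-query or Hive's multiple insert is implemented; the function names and the sample rows are hypothetical. The first function re-reads the input once per grouping, while the second reads it once and updates every grouping per row.

```python
from collections import defaultdict


def group_counts_multi_scan(read_rows, key_sets):
    """Baseline: scan the whole input once for each set of grouping keys."""
    results = []
    for keys in key_sets:
        counts = defaultdict(int)
        for row in read_rows():
            counts[tuple(row[k] for k in keys)] += 1
        results.append(dict(counts))
    return results


def group_counts_single_scan(read_rows, key_sets):
    """Shared scan: read the input once and feed every grouping per row."""
    results = [defaultdict(int) for _ in key_sets]
    for row in read_rows():
        for counts, keys in zip(results, key_sets):
            counts[tuple(row[k] for k in keys)] += 1
    return [dict(counts) for counts in results]


if __name__ == "__main__":
    rows = [
        {"country": "PL", "browser": "firefox"},
        {"country": "PL", "browser": "chrome"},
        {"country": "US", "browser": "firefox"},
    ]
    key_sets = [("country",), ("browser",), ("country", "browser")]
    print(group_counts_single_scan(lambda: iter(rows), key_sets))
```

Both functions return the same counts; what differs is how many times `read_rows` is called, which on a cluster corresponds to how many times the input is read from disk.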

I hope those are helpful. Whatever you choose, good luck with your thesis.


Alan.

On Jan 4, 2011, at 1:21 PM, Michał Anglart wrote:

Hi Everybody,

I'm a soon-to-graduate student of computer science at the University
of Wrocław in Poland. I'm currently starting to write my master's thesis
and I'm looking for some inspiration and ideas.

First of all, I want to write about MapReduce. As far as I know, nobody
at my faculty has taken such a topic for their thesis, but the topic is
interesting, so someone should start. Lately I thought that maybe I
could compare Java MapReduce with Hive and Pig in terms of their
performance, the optimizations used inside, etc. Personally I found it
a nice idea, as it would allow me to learn both frameworks and take a
look at the way they work. Unfortunately, I found out that Robert
Stewart from Heriot-Watt University wrote his thesis, "Performance &
Programming Comparison of JAQL, Hive, Pig and Java", which can be found
via Google. I looked through this paper and it looks quite similar to
what I wanted to do.

After this discovery I thought that a slightly different approach to
performance comparison might prove to be a successful topic for my
master's thesis: specifically, I'm thinking about comparing the
frameworks on a real-life problem. In his paper Robert ran the
experiments on a few quite simple problems, like word count, a simple
join of two sets, or log processing. I'm thinking about, first,
comparing them on a real-life problem and, second, looking for
optimizations that can be made in Pig or Hive (e.g. choosing a join
strategy) and how they affect the performance of the frameworks.

OK, after this long introduction I want to ask you: do you think it is
an interesting approach, and does it make any sense? Is it worth trying?
If so, maybe you can suggest features of the frameworks that I should
look at more closely, and maybe real-life problems that could be used
in the experiments?

I look forward to any comments - thanks in advance.

P.S. I've posted this message on both frameworks' mailing lists - hive and pig.



Thanks!
Michal

Re: Master thesis about Hive/Pig/MapReduce
