Hi Michal,
A couple of areas where you could study performance without
duplicating Robert Stewart's work come to mind. One is how data skew
affects performance. This is a very real-world concern, since in my
experience almost all input data follows a power-law distribution.
Consider, for example, joining a highly skewed table against an
evenly distributed table. With the default join algorithm, a small
subset of your reducers will get the vast majority of the data. Pig
has a join implementation called skew join that can handle this and
distribute the data evenly. I believe Hive has a similar join
implementation (I know at least that they planned to; I'm not sure
whether it's done yet). So you could test the performance of skewed
joins between the two, as well as skewed versus non-skewed
implementations of join.
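
To make the skew problem concrete, here is a rough Python sketch. It
is not Pig's or Hive's actual implementation; it only simulates how
rows would be assigned to reducers. The reducer count, the synthetic
key distribution, the hot-key threshold, and all function names are
assumptions made up for the illustration.

```python
from collections import Counter
import zlib

NUM_REDUCERS = 8  # assumed number of reducers, chosen for the illustration


def zipf_keys(num_keys=1000, total=100_000):
    """Build a deterministic power-law key list: key k appears ~total/(k*H) times."""
    harmonic = sum(1.0 / k for k in range(1, num_keys + 1))
    keys = []
    for k in range(1, num_keys + 1):
        keys.extend([k] * max(1, int(total / (k * harmonic))))
    return keys


def default_partition(key):
    """Plain hash partitioning: every row with the same key goes to one reducer."""
    return zlib.crc32(str(key).encode()) % NUM_REDUCERS


def reducer_loads_default(keys):
    """Rows per reducer when all rows are hash-partitioned on the join key."""
    loads = Counter(default_partition(k) for k in keys)
    return [loads.get(r, 0) for r in range(NUM_REDUCERS)]


def reducer_loads_skew_aware(keys, hot_threshold):
    """Rows per reducer when rows of hot keys are spread round-robin.

    A real skewed join would also have to send the matching rows of the
    other table to every reducer that holds part of a hot key; that
    extra cost is not modelled here.
    """
    freq = Counter(keys)
    loads = [0] * NUM_REDUCERS
    hot_rows_seen = 0
    for k in keys:
        if freq[k] >= hot_threshold:
            r = hot_rows_seen % NUM_REDUCERS  # spread hot rows evenly
            hot_rows_seen += 1
        else:
            r = default_partition(k)  # cold keys keep the default placement
        loads[r] += 1
    return loads


def imbalance(loads):
    """Busiest reducer divided by the mean load (1.0 means perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    keys = zipf_keys()
    plain = reducer_loads_default(keys)
    aware = reducer_loads_skew_aware(keys, hot_threshold=len(keys) // NUM_REDUCERS)
    print("default   :", plain, round(imbalance(plain), 2))
    print("skew-aware:", aware, round(imbalance(aware), 2))
```

The imbalance ratio is a crude stand-in for job runtime, since the job
finishes only when the busiest reducer does. In a real experiment you
would measure wall-clock time on the cluster instead.
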
Another area that comes to mind is combining multiple grouping
operations into one MapReduce job. This is something we see used
extensively at Yahoo, as users often want to read data once and group
it by different sets of keys. Both Pig and Hive support this. In Pig
we call it multi-query. I think Hive calls it multiple insert or
something like that. This is another area where you could test
performance, both between Pig and Hive and between using the
multi-query algorithm and scanning the data multiple times.
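
The difference between the two approaches can be sketched in a few
lines of Python. This is only an in-memory illustration of reading
the input once versus once per grouping, not how Pig or Hive plan
their jobs; the CountingSource class, the function names, and the
sample rows are all hypothetical.

```python
from collections import Counter


class CountingSource:
    """Wraps a list of rows and counts how many rows have been read in total."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_read = 0

    def scan(self):
        for row in self._rows:
            self.rows_read += 1
            yield row


def group_separately(source, key_sets):
    """One full scan per grouping, like running a separate job per group-by."""
    results = {}
    for keys in key_sets:
        counts = Counter()
        for row in source.scan():
            counts[tuple(row[k] for k in keys)] += 1
        results[keys] = counts
    return results


def group_in_one_pass(source, key_sets):
    """A single scan that feeds every grouping at once."""
    results = {keys: Counter() for keys in key_sets}
    for row in source.scan():
        for keys in key_sets:
            results[keys][tuple(row[k] for k in keys)] += 1
    return results


if __name__ == "__main__":
    rows = [
        {"country": "PL", "browser": "firefox", "day": 1},
        {"country": "PL", "browser": "chrome", "day": 1},
        {"country": "US", "browser": "firefox", "day": 2},
        {"country": "US", "browser": "firefox", "day": 2},
    ]
    key_sets = [("country",), ("browser",), ("country", "day")]

    separate = CountingSource(rows)
    combined = CountingSource(rows)
    # Both strategies must produce the same grouped counts.
    assert group_separately(separate, key_sets) == group_in_one_pass(combined, key_sets)
    print("rows read, separate scans:", separate.rows_read)
    print("rows read, single scan   :", combined.rows_read)
```

Both functions return the same counts; only the number of rows read
from the source differs, which is the I/O cost a benchmark of this
feature would be trying to measure.
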
I hope those are helpful. Whatever you choose, good luck with your
thesis.
Alan.
On Jan 4, 2011, at 1:21 PM, Michał Anglart wrote:
Hi Everybody,
I'm a soon-to-graduate student of computer science at the University
of Wrocław in Poland. I'm currently starting to write my master's
thesis and I'm looking for some inspiration/ideas.
First of all, I want to write about MapReduce - as far as I know
nobody at my faculty has taken such a topic for their thesis, but the
topic is interesting, so someone should start. Lately I thought that
maybe I could compare Java MapReduce with Hive and Pig in terms of
their performance, the optimizations used inside, etc.
Personally I found it a nice idea, as it would allow me to learn
both frameworks and take a look at the way they work. Unfortunately I
found out that Robert Stewart from Heriot-Watt University wrote his
thesis on "Performance & Programming Comparison of JAQL, Hive, Pig and
Java", which can be found via Google. I looked through this paper and
it looks quite similar to what I wanted to do.
After this discovery I thought that a slightly different approach to
performance comparison might prove to be a successful topic for my
master's thesis: specifically, I'm thinking about comparing the
frameworks on some real-life problem. Robert in his paper ran the
experiments on a few quite simple problems like word count, a simple
join of two sets, or log processing. I'm thinking about, first,
comparing them on a real-life problem and, second, looking for
optimizations that can be made in Pig or Hive (e.g. choosing a join
strategy) and how they affect the performance of the frameworks.
OK, after this long introduction I want to ask you: do you think this
is an interesting approach, and does it make any sense? Is it worth
trying? If so, maybe you can suggest features of the frameworks that I
should look at more closely, and maybe real-life problems that can be
used in the experiments?
I look forward to any comments - thanks in advance.
P.S. I've posted this message on both frameworks' mailing lists -
Hive and Pig.
Thanks!
Michal