Thanks for the suggestion. Could you please elaborate on "focusing on the design, the implementation with third party tools"? Do you mean comparing them? And by scripts, do you mean scripts for UDFs, SerDes and loaders?

Regards,
Sarfraz Rasheed Ramay (DIT)
Dublin, Ireland.

On Sat, May 3, 2014 at 4:23 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> IMHO, comparing the "performance" is boring and has been done umpteen times
> before. The world won't get much out of another performance benchmark,
> other than a bunch of fan boys saying "Look ours is faster hahahahah" and
> then the other side says "but in this case ours is faster and that is the
> more important case". Benchmarks are easy to bias and manipulate, and
> comparing two like but not exact systems is hard. For example you will see
> impala "winning" benchmarks HPC by re-writing queries, and then someone in
> tez re-writes it another way, tunes a setting, and then they are "winning"
> the benchmark.
>
> You would be better off focusing on the design, the implementation with
> third party tools (udfs, serdes, loaders), the nuances of a more
> procedural language than a declarative one. Look in the world for scripts
> and see who is deploying them effectively.
>
> On Sat, May 3, 2014 at 4:46 AM, Sarfraz Ramay <sarfraz.ra...@gmail.com> wrote:
>
>> Thanks Thejas for your input! These are interesting and very specific,
>> which is exactly what is required for a master's thesis.
>>
>> Are there any publications on Hive and the evaluation of its performance
>> that I can use for comparison?
>>
>> Regards,
>> Sarfraz Rasheed Ramay (DIT)
>> Dublin, Ireland.
>>
>> On Sat, May 3, 2014 at 3:07 AM, Thejas Nair <the...@hortonworks.com> wrote:
>>
>>> The primary difference between hive and pig is the language. There are
>>> implementation differences that will result in performance
>>> differences, but it will be hard to figure out what aspect of the
>>> implementation is responsible for what improvement.
>>>
>>> I think a more interesting project would be to compare the impact of
>>> various performance improvements in hive. There are many features that
>>> you can turn on and off.
>>>
>>> example -
>>> - hive vectorization
>>> - file format - text vs RCFile vs ORC
>>> - compressed vs uncompressed
>>> - mapreduce vs tez execution engine
>>> - stats optimized queries
>>>
>>> On Thu, May 1, 2014 at 5:47 AM, Sarfraz Ramay <sarfraz.ra...@gmail.com> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> It seems that both Hive and Pig are used for managing large data sets.
>>> >> Hive is more SQL oriented whereas Pig is more for the data flows. I am
>>> >> doing a master's thesis on the performance evaluation of both. Can
>>> >> someone please provide a list of tasks that would make for an
>>> >> interesting comparison?
>>> >>
>>> >> What is Hive good at?
>>> >>
>>> >> What is Pig good at?
>>> >>
>>> >> Ideally, I would like to take what Hive is good at and test it in Pig
>>> >> and vice versa. The competitive characteristics would make for an
>>> >> interesting comparison.
>>> >>
>>> >> Regards,
>>> >> Sarfraz Rasheed Ramay (DIT)
>>> >> Dublin, Ireland.
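
For reference, the on/off features listed in Thejas's quoted message could be enumerated as a test matrix along the following lines. This is only a rough sketch: the mapping from each feature to a Hive session property (`hive.vectorized.execution.enabled`, `hive.execution.engine`, `hive.compute.query.using.stats`) is an assumption that should be checked against the Hive version in use, the `facts` table name and the helper functions are hypothetical, and compression is left out because it is usually set per table rather than per session. Some combinations may not be meaningful on a given Hive release.

```python
from itertools import product

# Assumed mapping from the features in the quoted list to Hive session
# properties; verify each name against the Hive version under test.
TOGGLES = {
    "hive.vectorized.execution.enabled": ["false", "true"],
    "hive.execution.engine": ["mr", "tez"],
    "hive.compute.query.using.stats": ["false", "true"],
}

# Storage formats from the list; each would need its own copy of the data.
FILE_FORMATS = ["TEXTFILE", "RCFILE", "ORC"]


def benchmark_matrix():
    """Yield (file_format, settings) for every combination of toggles."""
    keys = list(TOGGLES)
    for fmt in FILE_FORMATS:
        for values in product(*(TOGGLES[k] for k in keys)):
            yield fmt, dict(zip(keys, values))


def render(fmt, settings, table="facts"):
    """Render one run as SET statements plus a placeholder comment."""
    lines = [f"SET {key}={value};" for key, value in settings.items()]
    # Hypothetical naming scheme: one table copy per storage format.
    lines.append(f"-- run the query under test against {table}_{fmt.lower()}")
    return "\n".join(lines)


if __name__ == "__main__":
    runs = list(benchmark_matrix())
    print(len(runs), "configurations")
    print(render(*runs[0]))
```

Changing one property at a time relative to a fixed baseline, rather than running the full cross product, would make it easier to attribute a speedup to a single feature.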