Tez is designed as a set of libraries and APIs that make it easier to write data processing applications on YARN. It provides no logical functionality by itself. Instead, it provides infrastructure pieces that take care of YARN scheduling, YARN container allocation, YARN container launch and setup, and other aspects of running on YARN such as reporting (for example ATS integration) and security. Think of Tez as providing the infrastructure to coordinate and orchestrate the application on YARN.
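To make that division of labor concrete, here is a toy sketch in Python. It is not Tez's API and involves no YARN at all. The names `run_dag`, `tasks` and `deps` are made up for illustration, and the only assumption is that the application can be described as a graph of tasks with dependencies. The "coordinator" only decides the order in which tasks run, while everything about the data lives in the application's own functions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Hypothetical toy coordinator: it only decides the ORDER of execution.

    tasks: dict of task name -> callable(upstream_outputs) -> output
           (application logic; the coordinator never looks inside it)
    deps:  dict of task name -> list of upstream task names (the graph)
    """
    outputs = {}
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: outputs[d] for d in deps.get(name, [])}
        outputs[name] = tasks[name](upstream)
    return outputs


# Application side: these functions carry all the data semantics.
tasks = {
    "scan": lambda _: [1, 2, 3, 4, 5, 6],
    "filter": lambda up: [x for x in up["scan"] if x % 2 == 0],
    "sum": lambda up: sum(up["filter"]),
}
deps = {"scan": [], "filter": ["scan"], "sum": ["filter"]}

result = run_dag(tasks, deps)
```

Swapping the filter for a join, or the list for a stream of records, changes only `tasks`. The coordinator stays the same, which is the sense in which the infrastructure provides no logical functionality of its own.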
MR was a logical application that provided Map-Reduce functional-style semantics with a key-value data model. Hive and Pig are record-oriented engines that provide higher-level logical functionality, but they were built on MR and had to translate their complex logical plans into MR. By switching to Tez, these applications get the necessary cluster coordination libraries from Tez, so it's easier for them to integrate natively with YARN instead of translating to MR semantics.
The DAG-based model in Tez comes from the DAG API that Tez exposes to define the structure of the application that will execute on YARN. This only defines the physical layout of the parts of the program that will get launched on YARN. What happens inside those launched programs is defined by the application, not Tez. Inside the launched programs, the application runs its own processing logic (e.g. joining or filtering data) and does some IO (say, to local storage or HDFS). Tez provides some helper libraries for the IO, but the application is free to write its own, so the IO is pluggable as well. Effectively, Tez provides a pluggable coordination layer for scheduling applications on the cluster. With the recent extensions made to Tez under TEZ-2003, it may be possible to extend this functionality beyond YARN clusters to other clusters like Mesos.
1) Tez provides building blocks that can be used to write higher-level engines like MR, Hive, Pig, etc. The application scenarios are any applications whose final scheduling structure looks like a DAG of distributed tasks.
2) The problem it is solving is to provide libraries that can be used by higher-level engines and other projects.
3) Hive and Pig use it because it only provides the cluster coordination and does not impose data semantics. So Hive and Pig can use their native data semantics (earlier they were translating to MR semantics).
Similarly, MR can be run using the Tez libraries, and it works today. There was a prototype of Spark running on YARN using Tez libraries for YARN scheduling. All of these are higher-level engines that provide data semantics and logical operations, while Tez provides the scheduling infrastructure to run on YARN.
4) "Don’t solve problems that have already been solved" reiterates the point about common libraries. Pig, Hive, Cascading, etc. don’t have to write the same code to solve the same problems if they can use Tez libraries for common functionality.
Hope that helps!
Bikas
-----Original Message-----
From: LLBian [mailto:[email protected]]
Sent: Wednesday, January 20, 2016 8:44 AM
To: [email protected]
Subject: What's the application scenario of Apache TEZ
Hello, Tez experts:
I know that Tez is used in DAG cases. Because it can avoid writing intermediate results to disk and can reuse containers, it is more effective than MR at processing small amounts of data. So maybe Hive on Tez is better than Hive on MR at processing small amounts of data, am I right?
Well, now, my questions are:
(1) Even though the main design themes are listed at https://tez.apache.org/ , I am still not very clear about its application scenarios. If there are some real, major enterprise applications, so much the better.
(2) I am still not very clear about what problem it is mainly used to solve.
(3) Why is it used for Hive and Pig? How is it better than Spark or MR?
(4) I looked at your official PPT and the paper "Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications", but it is still not very clear to me. How should I understand this: "Don’t solve problems that have already been solved. Or else you will have to solve them again!"? Is there any real example?
Apache Tez is a great product, and I hope to learn more about it. Any reply is very much appreciated.
Thank you & Best Regards.
---LLBian
