Tez is designed as a set of libraries and APIs that make it easier to
write data processing applications on YARN. It provides no logical
functionality by itself. Instead it provides infrastructure pieces that take
care of YARN scheduling, YARN container allocation, YARN container launch and
setup, and other YARN aspects such as reporting (e.g. ATS integration) and
security. Think of Tez as providing the infrastructure to coordinate and
orchestrate the application on YARN.

MR was both a cluster execution layer and a logical application that provided
Map-Reduce functional style semantics with a Key-Value data model. Hive and Pig
were record oriented engines that provided higher level logical functionality
but were built on MR and had to translate their complex logical plans into MR.
By switching to Tez, these applications get the necessary cluster coordination
libraries from Tez, so it is easier for them to integrate natively with YARN
instead of translating to MR semantics.

The DAG based model in Tez comes from the DAG API that Tez exposes to define
the structure of the application that will execute on YARN. This only defines
the physical layout of the parts of the program that will get launched on YARN.
What happens inside those launched programs is defined by the application - not
Tez. Inside the launched programs, the application runs its own processing
logic (e.g. joining or filtering data) and does some IO (say to local storage
or HDFS). Tez provides some helper libraries for the IO but the application is
free to write its own. So the IO is pluggable as well, which lets the
application customize it.
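
To make that split concrete, here is a minimal sketch in plain Java. It does
not use the Tez API: DagSketch, Dag, Processor, addVertex, addEdge and execute
are hypothetical names made up for illustration, and the in-process loop in
execute() only stands in for what a coordination layer would do across a
cluster. The point is that the DAG holds structure, while each vertex's logic
is supplied by the application.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch in plain Java. None of these types are Tez classes.
final class DagSketch {

    /** Application-owned logic. The coordination layer never looks inside it. */
    interface Processor {
        void run(String vertexName);
    }

    /** Structure only: named vertices and the directed edges between them. */
    static final class Dag {
        private final Map<String, Processor> vertices = new LinkedHashMap<>();
        private final Map<String, List<String>> downstream = new LinkedHashMap<>();
        private final Map<String, Integer> inputCount = new LinkedHashMap<>();

        Dag addVertex(String name, Processor processor) {
            if (vertices.containsKey(name)) {
                throw new IllegalArgumentException("duplicate vertex: " + name);
            }
            vertices.put(name, processor);
            downstream.put(name, new ArrayList<>());
            inputCount.put(name, 0);
            return this;
        }

        Dag addEdge(String from, String to) {
            if (!vertices.containsKey(from) || !vertices.containsKey(to)) {
                throw new IllegalArgumentException("unknown vertex in edge: " + from + " -> " + to);
            }
            downstream.get(from).add(to);
            inputCount.merge(to, 1, Integer::sum);
            return this;
        }

        /**
         * Stand-in for the coordination layer: orders the vertices so each one
         * runs only after all of its inputs, then invokes the application's
         * logic. Returns the order in which the vertices ran.
         */
        List<String> execute() {
            Map<String, Integer> remaining = new LinkedHashMap<>(inputCount);
            Deque<String> ready = new ArrayDeque<>();
            remaining.forEach((name, count) -> {
                if (count == 0) {
                    ready.add(name);
                }
            });

            List<String> order = new ArrayList<>();
            while (!ready.isEmpty()) {
                String current = ready.remove();
                order.add(current);
                for (String next : downstream.get(current)) {
                    // A vertex becomes ready once its last input has been ordered.
                    if (remaining.merge(next, -1, Integer::sum) == 0) {
                        ready.add(next);
                    }
                }
            }
            if (order.size() != vertices.size()) {
                throw new IllegalStateException("not a DAG: cycle detected");
            }
            for (String name : order) {
                vertices.get(name).run(name);
            }
            return order;
        }
    }

    public static void main(String[] args) {
        Dag dag = new Dag()
            .addVertex("scanOrders", v -> System.out.println(v + ": read input"))
            .addVertex("scanCustomers", v -> System.out.println(v + ": read input"))
            .addVertex("join", v -> System.out.println(v + ": join the two inputs"))
            .addVertex("aggregate", v -> System.out.println(v + ": filter and aggregate"))
            .addEdge("scanOrders", "join")
            .addEdge("scanCustomers", "join")
            .addEdge("join", "aggregate");
        System.out.println(dag.execute());
    }
}
```

Running main executes the two scans first, then join, then aggregate. Swapping
the lambdas for different logic changes what the program computes without
touching the scheduling code, which is the separation described above.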

Effectively, Tez provides a pluggable coordination layer for scheduling
applications on the cluster. With the recent extensions made to Tez under
TEZ-2003, it may be possible to extend this functionality beyond YARN clusters
to other clusters like Mesos.

1) Tez provides building blocks that can be used to write higher level
engines like MR, Hive, Pig etc. The application scenarios are any applications
whose final scheduling structure looks like a DAG of distributed tasks.
2) The problem it is solving is to provide libraries that can be used by higher
level engines and other projects.
3) Hive and Pig use it because it only provides the cluster coordination and
does not impose data semantics. So Hive and Pig can use their native data
semantics (earlier they were translating to MR semantics). Similarly, MR can be
run using the Tez libraries and it works today. There was a prototype of Spark
running on YARN using Tez libraries for YARN scheduling. All of these are
higher level engines that provide data semantics and logical operations while
Tez provides the scheduling infrastructure to run on YARN.
4) "Don’t solve problems that have already been solved" reiterates the point
about common libraries. Pig, Hive, Cascading, etc. don’t have to write the same
code to solve the same problems if they can use Tez libraries for common
functionality.

Hope that helps!
Bikas

-----Original Message-----
From: LLBian [mailto:[email protected]] 
Sent: Wednesday, January 20, 2016 8:44 AM
To: [email protected]
Subject: What's the application scenario of Apache TEZ


Hello, Tez experts:
      I understand that Tez is used for DAG cases.
      Because it can avoid writing intermediate results to disk and can reuse
containers, it is more efficient than MR at processing small amounts of data.
So I would think that Hive on Tez is better than Hive on MR for processing
small amounts of data. Am I right?
      My questions are:
(1) Even though the main design themes are listed at https://tez.apache.org/ ,
I am still not very clear about its application scenarios. If there are some
real, major enterprise applications, so much the better.
(2) I am still not very clear about what problem it is mainly used to solve.
(3) Why is it used for Hive and Pig? How is it better than Spark or MR?
(4) I looked at your official PPT and the paper "Apache Tez: A Unifying
Framework for Modeling and Building Data Processing Applications", but it is
still not very clear to me.
 How should I understand this: "Don’t solve problems that have already been
solved. Or else you will have to solve them again!"? Is there a real example?

      Apache Tez is a great product, and I hope to learn more about it.

 Any reply is very much appreciated.

Thank you & Best Regards.

---LLBian

   
