Tez is designed as a set of libraries and APIs that make it easier to
write data processing applications on YARN. It provides no logical
functionality by itself. Instead it provides infrastructure pieces that take
care of YARN scheduling, YARN container allocation, YARN container launch and
setup, and other aspects like reporting (e.g. ATS integration) and security.
Think of Tez as providing the infrastructure to coordinate and orchestrate the
application on YARN.

MR was a logical application that provided Map-Reduce functional-style
semantics with a key-value data model. Hive and Pig were record-oriented
engines that provided higher-level logical functionality but were built on MR
and had to translate their complex logical plans into MR. By switching to Tez,
these applications get the necessary cluster coordination libraries from Tez,
so it's easier for them to integrate natively with YARN instead of translating
to MR semantics.

The DAG-based model in Tez comes from the DAG API that Tez exposes to define
the structure of the application that will execute on YARN. This only defines
the physical layout of the parts of the program that will get launched on YARN.
What happens inside those launched programs is defined by the application, not
Tez. Inside the launched programs, the application runs its own processing
logic (e.g. joining or filtering data) and does some IO (say to local storage
or HDFS). Tez provides some helper libraries for the IO, but the application is
free to write its own. So the IO is pluggable too, and the application can
customize it.
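
To make that split concrete, here is a minimal toy sketch in Python. It is
not the Tez DAG API: ToyDag, run and all the vertex functions are hypothetical
names made up for illustration, and everything runs in one process instead of
in YARN containers. The only point is that the coordinator sees vertices and
edges, while the filter and join logic belongs entirely to the application.

```python
from collections import deque


class ToyDag:
    """Hypothetical stand-in for a DAG definition; NOT the Tez DAG API."""

    def __init__(self):
        self.processors = {}   # vertex name -> application-supplied function
        self.downstream = {}   # vertex name -> names of consumer vertices
        self.upstream = {}     # vertex name -> names of producer vertices

    def add_vertex(self, name, processor):
        self.processors[name] = processor
        self.downstream[name] = []
        self.upstream[name] = []

    def add_edge(self, src, dst):
        self.downstream[src].append(dst)
        self.upstream[dst].append(src)


def run(dag):
    """Toy coordinator: runs a vertex once all of its inputs are available."""
    pending = {v: len(ups) for v, ups in dag.upstream.items()}
    ready = deque(v for v, n in pending.items() if n == 0)
    outputs = {}
    while ready:
        v = ready.popleft()
        # Inputs arrive in the order the incoming edges were added.
        inputs = [outputs[u] for u in dag.upstream[v]]
        outputs[v] = dag.processors[v](inputs)
        for d in dag.downstream[v]:
            pending[d] -= 1
            if pending[d] == 0:
                ready.append(d)
    if len(outputs) != len(dag.processors):
        raise ValueError("graph has a cycle")
    return outputs


# Application-defined logic; the coordinator never looks inside these.
def read_orders(_inputs):
    return [("o1", "alice", 30), ("o2", "bob", 5), ("o3", "alice", 12)]


def read_users(_inputs):
    return {"alice": "DE", "bob": "US"}


def filter_large(inputs):
    (orders,) = inputs
    return [o for o in orders if o[2] >= 10]


def join_country(inputs):
    orders, users = inputs
    return [(order_id, users[user]) for order_id, user, _amount in orders]


dag = ToyDag()
dag.add_vertex("orders", read_orders)
dag.add_vertex("users", read_users)
dag.add_vertex("filter", filter_large)
dag.add_vertex("join", join_country)
dag.add_edge("orders", "filter")
dag.add_edge("filter", "join")
dag.add_edge("users", "join")

result = run(dag)["join"]
```

Swapping filter_large or join_country for different logic changes nothing in
run, which is roughly the relationship between an engine like Hive or Pig and
the coordination layer described above.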

Effectively, Tez provides a pluggable coordination layer for scheduling
applications on the cluster. With the recent extensions made to Tez under
TEZ-2003, it may be possible to extend this functionality beyond YARN clusters
to other clusters like Mesos.

1) Tez provides building blocks that can be used to write higher-level
engines like MR, Hive, Pig etc. The application scenarios are any applications
whose final scheduling structure looks like a DAG of distributed tasks.
2) The problem it is solving is to provide libraries that can be used by
higher-level engines and other projects.
3) Hive and Pig use it because it only provides the cluster coordination and
does not impose data semantics. So Hive and Pig can use their native data
semantics (earlier they were translating to MR semantics). Similarly, MR can be
run using the Tez libraries, and it works today. There was also a prototype of
Spark running on YARN using Tez libraries for YARN scheduling. All of these are
higher-level engines that provide data semantics and logical operations, while
Tez provides the scheduling infrastructure to run on YARN.
4) "Don’t solve problems that have already been solved" reiterates the point
about common libraries. Pig, Hive, Cascading, etc. don’t have to write the same
code to solve the same problems if they can use Tez libraries for the common
functionality.
Hope that helps!
Bikas
-Original Message-
From: LLBian [mailto:linanmengxia...@126.com]
Sent: Wednesday, January 20, 2016 8:44 AM
To: user@tez.apache.org
Subject: What's the application scenario of Apache TEZ
Hello, Tez experts:
I understand that Tez is used for DAG cases.
Because it can avoid writing intermediate results to disk and can reuse
containers, it is more efficient than MR at processing small amounts of data.
So I would think that Hive on Tez is better than Hive on MR for processing
small amounts of data. Am I right?
Well, now, my questions are:
(1) Even though the main design themes are listed at https://tez.apache.org/ ,
I am still not very clear about its application scenarios. If there are some
real, major enterprise applications, so much the better.
(2) I am still not very clear about what problem it is mainly used to solve.
(3) Why is it used for Hive and Pig? How is it better than Spark or MR?
(4) I looked at your official PPT and the paper "Apache Tez: A Unifying
Framework for Modeling and Building Data Processing Applications", but it is
still not very clear to me.
How should I understand this: "Don’t solve problems that have already been
solved. Or else you will have to solve them again!"? Is there any real example?
Apache Tez is a great product, and I hope to learn more about it.
Any reply is much appreciated.
Thank you & Best Regards.
---LLBian