RE: What's the application scenario of Apache TEZ

2016-01-20 Thread Bikas Saha
Tez is designed as a set of libraries and APIs that will make it easier to 
write data processing applications on YARN. It provides no logical 
functionality by itself. Instead it provides infrastructure pieces that take 
care of YARN scheduling, YARN container allocation, YARN container launch and 
setup, and other aspects like reporting (e.g. ATS integration) and security. Think 
of Tez as providing the infrastructure to coordinate and orchestrate the 
application on YARN.

MR was a logical application that provided Map-Reduce functional-style 
semantics with a key-value data model. Hive and Pig were record-oriented 
engines that provided higher level logical functionality but were built on MR 
and had to translate their complex logical plans into MR. By switching to Tez, 
these applications get the necessary cluster coordination libraries from Tez, so 
it's easier for them to natively integrate with YARN instead of translating to 
MR semantics.

The DAG based model in Tez comes from the DAG API that Tez exposes to define 
the structure of the application that will execute on YARN. This only defines 
the physical layout of parts of the program that will get launched on YARN. 
What happens inside those launched programs is defined by the application - not 
Tez. Inside the launched programs, the application runs its own processing 
logic (e.g. joining or filtering data) and does some IO (say to local storage or 
HDFS). Tez provides some helper libraries for the IO but the application is 
free to write its own. So pluggability of the IO is also provided by Tez to 
customize the application.
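
To make the split between structure and logic concrete, here is a minimal sketch in plain Java. It is a toy model and not the Tez API: the class ToyDag and its methods addVertex, addEdge and run are hypothetical names made up for illustration. The "framework" part only knows the shape of the graph and runs vertices in dependency order; what each vertex does is supplied by the application as a function.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy model only: ToyDag is a hypothetical class, NOT the Tez DAG API.
final class ToyDag {
    // Vertex name -> application-supplied processing logic.
    private final Map<String, Function<List<String>, List<String>>> logic = new LinkedHashMap<>();
    // Vertex name -> names of its downstream vertices.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    public ToyDag addVertex(String name, Function<List<String>, List<String>> processor) {
        logic.put(name, processor);
        downstream.put(name, new ArrayList<>());
        return this;
    }

    public ToyDag addEdge(String from, String to) {
        if (!logic.containsKey(from) || !logic.containsKey(to)) {
            throw new IllegalArgumentException("unknown vertex");
        }
        downstream.get(from).add(to);
        return this;
    }

    // "Framework" side: run vertices in topological order (Kahn's algorithm).
    // Root vertices receive sourceData; every other vertex receives the
    // concatenated output of its upstream vertices.
    public Map<String, List<String>> run(List<String> sourceData) {
        Map<String, Integer> pending = new LinkedHashMap<>();
        Map<String, List<String>> inputs = new LinkedHashMap<>();
        for (String v : logic.keySet()) {
            pending.put(v, 0);
            inputs.put(v, new ArrayList<>());
        }
        for (List<String> targets : downstream.values()) {
            for (String t : targets) {
                pending.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                inputs.get(e.getKey()).addAll(sourceData);
                ready.add(e.getKey());
            }
        }
        Map<String, List<String>> outputs = new LinkedHashMap<>();
        while (!ready.isEmpty()) {
            String v = ready.poll();
            List<String> out = logic.get(v).apply(inputs.get(v));
            outputs.put(v, out);
            for (String t : downstream.get(v)) {
                inputs.get(t).addAll(out);
                if (pending.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (outputs.size() != logic.size()) {
            throw new IllegalStateException("cycle detected");
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Application side: the logic inside each vertex is ours, not the framework's.
        ToyDag dag = new ToyDag()
            .addVertex("filter", rows -> rows.stream()
                .filter(r -> !r.isEmpty()).collect(Collectors.toList()))
            .addVertex("upper", rows -> rows.stream()
                .map(String::toUpperCase).collect(Collectors.toList()))
            .addEdge("filter", "upper");
        System.out.println(dag.run(Arrays.asList("a", "", "b")).get("upper")); // prints [A, B]
    }
}
```

In real Tez the vertices run as distributed tasks in YARN containers and the edges move data between machines; the sketch only shows who owns which decision.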

Effectively, Tez provides a pluggable coordination layer for scheduling 
applications on the cluster. With the recent extensions made to Tez under 
TEZ-2003, it may be possible to have the functionality extended not just to 
YARN clusters but also to other clusters like Mesos.

1) Tez is providing building blocks that can be used to write higher level 
engines like MR, Hive, Pig etc. Application scenarios are any applications 
whose final scheduling structure looks like a DAG of distributed tasks.
2) The problem it is solving is to provide libraries that can be used by higher 
level engines and other projects.
3) Hive and Pig use it because it only provides the cluster coordination and 
does not impose data semantics. So Hive and Pig can use their native data 
semantics (earlier they were translating to MR semantics). Similarly MR can be 
run using the Tez libraries and it works today. There was a prototype of Spark 
running on YARN using Tez libraries for YARN scheduling. All of these are 
higher level engines that provide data semantics and logical operations while 
Tez provides the scheduling infrastructure to run on YARN.
4) "Don’t solve problems that have already been solved" reiterates the point 
about common libraries. Pig, Hive, Cascading, etc. don’t have to write the same 
code to solve the same problems if they can use Tez libraries for common 
functionality.

Hope that helps!
Bikas

-Original Message-
From: LLBian [mailto:linanmengxia...@126.com] 
Sent: Wednesday, January 20, 2016 8:44 AM
To: user@tez.apache.org
Subject: What's the application scenario of Apache TEZ


Hello, Tez experts:
  I understand that Tez is used for DAG workloads.
  Because it can avoid writing intermediate results to disk and supports 
container reuse, it is more efficient than MR when processing small amounts of 
data. So I would think that Hive on Tez is better than Hive on MR for 
processing small amounts of data. Am I right?
  Now, my questions are:
(1) Even though the main design themes are listed at https://tez.apache.org/ , 
I am still not very clear about its application scenarios. If there are some 
real, major enterprise applications, so much the better.
(2) I am still not very clear about what problem it is mainly used to solve.
(3) Why is it used for Hive and Pig? How is it better than Spark or MR?
(4) I looked at your official PPT and the paper "Apache Tez: A Unifying 
Framework for Modeling and Building Data Processing Applications", but it is 
still not very clear to me.
 How should I understand this: "Don’t solve problems that have already been 
solved. Or else you will have to solve them again!"? Is there any real example?

  Apache Tez is a great product, and I hope to learn more about it.

 Any reply is much appreciated.

Thank you & Best Regards.

---LLBian

   



Re: What's the application scenario of Apache TEZ

2016-01-20 Thread Hitesh Shah
A couple of other points to add to Bikas’s email: 

Regarding your question on small data: No - Tez is geared to work in both small 
data and extremely large data cases. Hive should likely perform better with Tez 
regardless of data size unless there is a bad query plan created that is 
non-optimal for Tez.

For 3): Hive/Pig/Cascading, when used with MR, would deconstruct a single Hive 
query/Pig script into multiple MR jobs. This would end up reading/writing 
from/to HDFS multiple times. Furthermore, with MR, you are stuck fitting all 
your code into a Mapper and Reducer (each with only a single input and output) 
and using Shuffle for data transfer. This introduces additional 
inefficiencies. With Tez, a single Hive query can be converted into a single 
DAG. Vertices can run any kind of logic and the edges between vertices are not 
restricted to “shuffle-like” data transfer, which allows more optimizations at 
the query planning stages. The fact that Tez allows Hive/Pig to use smarter 
ways of processing queries/scripts is usually the biggest win in terms 
of performance. Spark is similarly better than MR as it provides a richer 
operator library in some sense. As for comparing Spark vs Tez, to some extent 
it is like comparing apples to oranges, as Tez is quite a low-level library. 
Depending on how an application is written to make use of Tez vs Spark, you 
will find different cases where one is faster than the other. 
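
The HDFS round-trip point can be illustrated with a small sketch in plain Java. This is a toy model under simplifying assumptions (a linear pipeline, a single in-memory store standing in for HDFS) and it is not Hive, MR, or Tez code; RoundTripDemo, CountingStore and the stage functions are hypothetical names. It runs the same three stages once as a chain of separate jobs and once as a single pipeline, and counts how often the store is touched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Toy model only: hypothetical names, not Hive, MR, or Tez code.
final class RoundTripDemo {
    // Stand-in for HDFS that counts how often it is read and written.
    static final class CountingStore {
        int reads = 0;
        int writes = 0;
        private List<Integer> data;

        CountingStore(List<Integer> initial) { data = new ArrayList<>(initial); }

        List<Integer> read() { reads++; return new ArrayList<>(data); }

        void write(List<Integer> rows) { writes++; data = new ArrayList<>(rows); }

        // Uncounted peek, used only to inspect the final result.
        List<Integer> snapshot() { return new ArrayList<>(data); }
    }

    // Three example stages: keep even numbers, double them, add one.
    static List<UnaryOperator<List<Integer>>> demoStages() {
        List<UnaryOperator<List<Integer>>> stages = new ArrayList<>();
        stages.add(rows -> rows.stream().filter(x -> x % 2 == 0).collect(Collectors.toList()));
        stages.add(rows -> rows.stream().map(x -> x * 2).collect(Collectors.toList()));
        stages.add(rows -> rows.stream().map(x -> x + 1).collect(Collectors.toList()));
        return stages;
    }

    // MR style: every stage is its own job that reads from and writes to the store.
    static void runAsChainedJobs(CountingStore store, List<UnaryOperator<List<Integer>>> stages) {
        for (UnaryOperator<List<Integer>> stage : stages) {
            store.write(stage.apply(store.read()));
        }
    }

    // DAG style: read once, pass intermediate rows straight to the next stage, write once.
    static void runAsSingleDag(CountingStore store, List<UnaryOperator<List<Integer>>> stages) {
        List<Integer> rows = store.read();
        for (UnaryOperator<List<Integer>> stage : stages) {
            rows = stage.apply(rows);
        }
        store.write(rows);
    }

    public static void main(String[] args) {
        CountingStore chained = new CountingStore(Arrays.asList(1, 2, 3, 4));
        runAsChainedJobs(chained, demoStages());
        CountingStore dag = new CountingStore(Arrays.asList(1, 2, 3, 4));
        runAsSingleDag(dag, demoStages());
        System.out.println("chained: reads=" + chained.reads + " writes=" + chained.writes);
        System.out.println("dag:     reads=" + dag.reads + " writes=" + dag.writes);
    }
}
```

With three stages the chained version touches the store three times for reads and three times for writes, while the single pipeline reads once and writes once, and both produce the same rows. Real queries are not linear and real edges are not in-memory handoffs, so treat the counts as an illustration of the shape of the saving, not as a measurement.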
 
— Hitesh
