Re: Porting legacy MapReduce application to Tez

Rohini Palaniswamy Tue, 22 Dec 2020 19:32:04 -0800

The Nutch jars have to be added to the distributed cache (
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#DistributedCache)
for them to be available on the classpath in the tasks. "Distributed cache"
is MapReduce terminology (from Hadoop 1.x). With YARN (Hadoop 2.x) the
implementation is via LocalResource (
https://blog.cloudera.com/resource-localization-in-yarn-deep-dive/). The Tez
user APIs expose LocalResource directly, while MapReduce still maintains the
original DistributedCache user APIs, with LocalResource as the underlying
implementation.

The hadoop jar command takes care of adding the jar named in the command to
the distributed cache. Any additional files need to be shipped with the
-files/-libjars/-archives options (
https://hadoop.apache.org/docs/r1.0.4/commands_manual.html#Generic+Options)
or via the mapreduce.job.cache.{files|archives} settings. yarn-tez mode
also honors the mapreduce.job.cache.{files|archives} settings, so instead
of adding the jars to tez.lib.uris.classpath, you can specify them via those
settings.
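
As a rough illustration of the last point, here is a minimal JDK-only sketch
that builds a value for mapreduce.job.cache.files. It assumes the setting
takes a comma-separated list of URIs. The jar locations and the
CacheFilesSetting class are hypothetical, and java.util.Properties only
stands in for the Hadoop job configuration.

```java
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

class CacheFilesSetting {
    // Property name taken from the message above.
    static final String CACHE_FILES = "mapreduce.job.cache.files";

    // Joins jar URIs into one comma-separated string (assumed value format),
    // trimming whitespace and skipping empty entries.
    static String joinUris(List<String> uris) {
        return uris.stream()
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Hypothetical HDFS locations of the extra jars; adjust to your cluster.
        List<String> jars = List.of(
                "hdfs:///apps/nutch/lib/nutch-core.jar",
                "hdfs:///apps/nutch/lib/nutch-plugins.jar");

        // Properties is a stand-in for the job configuration in this sketch.
        Properties props = new Properties();
        props.setProperty(CACHE_FILES, joinUris(jars));

        System.out.println(CACHE_FILES + "=" + props.getProperty(CACHE_FILES));
    }
}
```

In a real job the same string would go into the job configuration (for
example with -D on the command line) rather than a Properties object.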

Just a heads up: Tez is slightly more low-level and was meant to be used by
frameworks like Pig, Hive, Cascading, etc., so building a Tez application
DAG from scratch takes more code and is not as straightforward as
writing a mapper and reducer job. But it does come with a lot of
flexibility and the ability to customize the DAG, which can make a big
difference for some applications. For example, Twitter folks extended it and
used it in an application to do custom partitioning and routing of data (
https://issues.apache.org/jira/browse/TEZ-3209). Below are some classes
from Pig and Hive where DAGs are constructed, to give you a general idea.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezSessionManager.java
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java

Regards,
Rohini

On Mon, Dec 21, 2020 at 12:50 PM Lewis John McGibbney <lewi...@apache.org>
wrote:

> I found the Tez Counters package
>
> https://tez.apache.org/releases/0.9.2/tez-api-javadocs/index.html?org/apache/tez/common/counters/package-summary.html
> I'm going to experiment adapting the Injector job to use this package
> rather than the legacy Map and Reduce Context objects.
>
> On 2020/12/21 20:11:59, Lewis John McGibbney <lewi...@apache.org> wrote:
> > Hi László,
> > Thank you for the additional explanation. Adapting my configuration
> > based on your suggestions results in successful job execution as DAGs now.
> > A huge thank you :)
> >
> ..
>
