The Nutch jars have to be added to the distributed cache (https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#DistributedCache) for them to be available on the classpath in the tasks. "Distributed cache" is MapReduce terminology (from Hadoop 1.x). With YARN (Hadoop 2.x), the implementation is via LocalResource (https://blog.cloudera.com/resource-localization-in-yarn-deep-dive/). In the Tez user APIs you will find LocalResource, while MapReduce still maintains the original DistributedCache user APIs, with LocalResource as the underlying implementation.
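As a rough illustration of shipping extra jars through the cache: the sketch below builds the comma-separated URI list that the mapreduce.job.cache.files setting takes and renders it as a generic -D option. It is plain Java with no Hadoop dependency. The class name, helper names, and HDFS paths are made up for illustration, and the -D option is only picked up when the job is launched through ToolRunner/GenericOptionsParser.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Hadoop, Tez, or Nutch.
class CacheFilesOption {
    // Property name as used in this thread; its value is a comma-separated list of URIs.
    static final String CACHE_FILES_KEY = "mapreduce.job.cache.files";

    // Join jar URIs into comma-separated form, trimming whitespace and
    // dropping blank entries and duplicates while preserving order.
    static String cacheFilesValue(List<String> jarUris) {
        return jarUris.stream()
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .distinct()
                .collect(Collectors.joining(","));
    }

    // Render the value as a generic "-D property=value" command-line option.
    static String asGenericOption(List<String> jarUris) {
        return "-D " + CACHE_FILES_KEY + "=" + cacheFilesValue(jarUris);
    }

    public static void main(String[] args) {
        // Assumed HDFS locations, for illustration only.
        List<String> jars = List.of(
                "hdfs:///apps/nutch/lib/apache-nutch.jar",
                "hdfs:///apps/nutch/lib/plugin-deps.jar");
        System.out.println(asGenericOption(jars));
    }
}
```

The printed option would go between the main class and the job's own arguments on the hadoop jar command line. Whether the listed files also end up on the task classpath depends on the runtime, so check that against your setup.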
The hadoop jar command takes care of adding the jar named in the command to the distributed cache. Any additional files need to be shipped with the -files/-libjars/-archives options (https://hadoop.apache.org/docs/r1.0.4/commands_manual.html#Generic+Options) or by using the settings mapreduce.job.cache.{files|archives}. yarn-tez mode also honors the mapreduce.job.cache.{files|archives} settings, so instead of adding the jars to tez.lib.uris.classpath, you can specify them via those settings.

Just a heads up: Tez is slightly more low-level and was meant to be used by frameworks like Pig, Hive, Cascading, etc., so building a Tez application DAG from scratch is going to take more code and is not as straightforward as writing a mapper and reducer job. But it does come with a lot of flexibility and the ability to customize the DAG, which can make a big difference for some applications. For example, folks at Twitter extended it and used it in an application that does custom partitioning and routing of data (https://issues.apache.org/jira/browse/TEZ-3209).

Below are some classes from Pig and Hive where DAGs are constructed, to give you a general idea.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/DagUtils.java
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezSessionManager.java
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java

Regards,
Rohini

On Mon, Dec 21, 2020 at 12:50 PM Lewis John McGibbney <lewi...@apache.org> wrote:

> I found the Tez Counters package
>
> https://tez.apache.org/releases/0.9.2/tez-api-javadocs/index.html?org/apache/tez/common/counters/package-summary.html
> I'm going to experiment adapting the Injector job to use this package
> rather than the legacy Map and Reduce Context objects.
>
> On 2020/12/21 20:11:59, Lewis John McGibbney <lewi...@apache.org> wrote:
> > Hi László,
> > Thank you for the additional explanation. Adapting my configuration
> based on your suggestions results in successful job execution as DAG's now.
> A huge thank you :)
>
> > ..
>