Hi László,

Thank you for the additional explanation. After adapting my configuration based on your suggestions, the jobs now execute successfully as DAGs. A huge thank you :)
A couple of notes:

1. I was unable to use the Tez minimal distribution. I had to use the full 0.10.0-SNAPSHOT because the jetty-http-9.4.20.v20190813.jar dependency is absent from the minimal distribution.

2. Looking at the syslog for the DAG, I observed java.lang.NoSuchMethodException: java.nio.channels.ClosedByInterruptException. The full paste can be seen at https://paste.apache.org/mjw0c. Is this normal behavior?

3. We added some useful counters to the Injector job which are printed to the application log. An example is as follows:

2020-12-21 12:06:35,242 INFO mapreduce.Job: Job job_1608580287657_0003 completed successfully
2020-12-21 12:06:35,249 INFO mapreduce.Job: Counters: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls rejected by filters: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected after normalization and filtering: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected but already in CrawlDb: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total new urls injected: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls with status gone removed from CrawlDb (db.update.purge.404): 0
2020-12-21 12:06:35,293 INFO crawl.Injector: Injector: finished at 2020-12-21 12:06:35, elapsed: 00:00:14

As you can see, the counters apparently do not correctly represent the relevant entries in the newly created crawl database.
I can verify this by running the following read database job:

nutch readdb crawldb -stats

2020-12-21 12:08:06,558 INFO mapreduce.Job: Counters: 0
2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: TOTAL urls: 11322
2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: shortest fetch interval: 30 days, 00:00:00
2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: avg fetch interval: 30 days, 00:00:00
2020-12-21 12:08:06,598 INFO crawl.CrawlDbReader: longest fetch interval: 30 days, 00:00:00
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: earliest fetch time: Mon Dec 21 12:06:00 PST 2020
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: avg of fetch times: Mon Dec 21 12:06:00 PST 2020
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: latest fetch time: Mon Dec 21 12:06:00 PST 2020
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: retry 0: 11322
2020-12-21 12:08:06,605 INFO crawl.CrawlDbReader: score quantile 0.01: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.05: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.1: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.2: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.25: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.3: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.4: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.5: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.6: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.7: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.75: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.8: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.9: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.95: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.99: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: min score: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: avg score: 1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: max score: 1.0
2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 11322
2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: CrawlDb statistics: done

I'm investigating this now. Again, thank you very much for your help.

On 2020/12/21 00:04:36, László Bodor <bodorlaszlo0...@gmail.com> wrote:
> Hi!
>
> This is how I made it work (hadoop 3.1.3, tez 0.10.0), attached to drive:
> here
> <https://drive.google.com/file/d/1eFMUPSxFpJ0p7fi7IrsI3HAACa4m5s7n/view?usp=sharing>
>
> 1.
> hdfs dfs -mkdir -p /apps/tez
> hdfs dfs -put ~/Applications/apache/tez/tez.tar.gz /apps/tez
> ..
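For what it's worth, this is the kind of sanity check I am scripting while I investigate: a small sketch (not Nutch code, just parsing the log lines quoted above) that compares the Injector's reported counter against the TOTAL urls figure from readdb. On a fresh CrawlDb the two should agree; the mismatch is what suggests the counters are not propagated back from the Tez execution.

```python
import re

# Log lines copied verbatim from the runs above.
injector_log = "2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total new urls injected: 0"
readdb_log = "2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: TOTAL urls: 11322"

def counter(line, label):
    """Return the integer printed after 'label:' in a log line, or None."""
    m = re.search(re.escape(label) + r":\s*(\d+)", line)
    return int(m.group(1)) if m else None

injected = counter(injector_log, "Total new urls injected")
total = counter(readdb_log, "TOTAL urls")

# On a freshly created CrawlDb these should match; here 0 != 11322,
# i.e. the job-level counters came back empty while the database
# itself clearly contains the injected URLs.
print(injected, total)  # -> 0 11322
```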