Some more observations 4. When 'mapreduce.framework.name' is set to yarn, counters are present for the Injector job. The following is example output showing all of the Injector job counters.
2020-12-21 12:13:58,674 INFO mapreduce.Job: Job job_1608581566698_0001 completed successfully 2020-12-21 12:13:58,760 INFO mapreduce.Job: Counters: 52 File System Counters FILE: Number of bytes read=1456826 FILE: Number of bytes written=3699396 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=1163333 HDFS: Number of bytes written=794148 HDFS: Number of read operations=15 HDFS: Number of large read operations=0 HDFS: Number of write operations=6 Job Counters Launched map tasks=2 Launched reduce tasks=1 Data-local map tasks=2 Total time spent by all maps in occupied slots (ms)=19002 Total time spent by all reduces in occupied slots (ms)=7305 Total time spent by all map tasks (ms)=6334 Total time spent by all reduce tasks (ms)=2435 Total vcore-milliseconds taken by all map tasks=6334 Total vcore-milliseconds taken by all reduce tasks=2435 Total megabyte-milliseconds taken by all map tasks=9501000 Total megabyte-milliseconds taken by all reduce tasks=3652500 Map-Reduce Framework Map input records=22845 Map output records=22765 Map output bytes=1411290 Map output materialized bytes=1456832 Input split bytes=572 Combine input records=0 Combine output records=0 Reduce input groups=11322 Reduce shuffle bytes=1456832 Reduce input records=22765 Reduce output records=11322 Spilled Records=45530 Shuffled Maps =2 Failed Shuffles=0 Merged Map outputs=2 GC time elapsed (ms)=114 CPU time spent (ms)=0 Physical memory (bytes) snapshot=0 Virtual memory (bytes) snapshot=0 Total committed heap usage (bytes)=1046478848 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 injector urls_filtered=80 urls_injected=11443 urls_merged=11322 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=794148 2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls rejected by filters: 80 2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls injected after normalization and filtering: 11443 2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls injected but already in CrawlDb: 11322 2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total new urls injected: 121 2020-12-21 12:13:58,794 INFO crawl.Injector: Injector: Total urls with status gone removed from CrawlDb (db.update.purge.404): 0 2020-12-21 12:13:58,804 INFO crawl.Injector: Injector: finished at 2020-12-21 12:13:58, elapsed: 00:00:36 5. When 'mapreduce.framework.name' is set to 'yarn-tez' I am observing the following runtimes * 1st run: elapsed: 00:00:42 * 2nd run: elapsed: 00:00:13 * 3rd run: elapsed: 00:00:14 5. When 'mapreduce.framework.name' is set to 'yarn' I am observing the following runtimes * 1st run: elapsed: 00:00:34 * 2nd run: elapsed: 00:00:32 * 3rd run: elapsed: 00:00:34 After the first DAG run, it looks like there is a marked runtime improvement running the Injector job on Tez. It would be great if I could somehow explain this but my Tez knowledge is still very limited. I will keep digging though! I'm also going to create a tutorial on the Nutch wiki covering this entire experience. I'll maybe pull together a YouTube video or something as well. lewismc On 2020/12/21 20:11:59, Lewis John McGibbney <lewi...@apache.org> wrote: > Hi László, > Thank you for the additional explanation. Adapting my configuration based on > your suggestions results in successful job execution as DAG's now. A huge > thank you :) > > A couple of notes > 1. I was unable to use the Tez minimal distribution. I had to use the full > 0.10.0-SNAPSHOT due to the absence of the jetty-http-9.4.20.v20190813.jar > dependency in minimal distribution. > 2. Looking at the syslog for the DAG, I have observed > java.lang.NoSuchMethodException: > java.nio.channels.ClosedByInterruptException. The full paste can be seen at > https://paste.apache.org/mjw0c. Is this normal behavior? > 3. We added some useful counters to the Injector job which are printed to the > application log. Example is as follows > > 2020-12-21 12:06:35,242 INFO mapreduce.Job: Job job_1608580287657_0003 > completed successfully > 2020-12-21 12:06:35,249 INFO mapreduce.Job: Counters: 0 > 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls rejected by > filters: 0 > 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected > after normalization and filtering: 0 > 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected > but already in CrawlDb: 0 > 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total new urls > injected: 0 > 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls with status > gone removed from CrawlDb (db.update.purge.404): 0 > 2020-12-21 12:06:35,293 INFO crawl.Injector: Injector: finished at 2020-12-21 > 12:06:35, elapsed: 00:00:14 > > As you can see, apparently the counters are not correctly representing the > relevant entries in the newly created crawl database. I can verify this by > running the following read database job > > nutch readdb crawldb -stats > > 2020-12-21 12:08:06,558 INFO mapreduce.Job: Counters: 0 > 2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: Statistics for CrawlDb: > crawldb > 2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: TOTAL urls: 11322 > 2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: shortest fetch interval: > 30 days, 00:00:00 > 2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: avg fetch interval: 30 > days, 00:00:00 > 2020-12-21 12:08:06,598 INFO crawl.CrawlDbReader: longest fetch interval: > 30 days, 00:00:00 > 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: earliest fetch time: > Mon Dec 21 12:06:00 PST 2020 > 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: avg of fetch times: Mon Dec > 21 12:06:00 PST 2020 > 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: latest fetch time: Mon Dec > 21 12:06:00 PST 2020 > 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: retry 0: 11322 > 2020-12-21 12:08:06,605 INFO crawl.CrawlDbReader: score quantile 0.01: > 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.05: > 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.1: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.2: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.25: > 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.3: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.4: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.5: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.6: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.7: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.75: > 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.8: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.9: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.95: > 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.99: > 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: min score: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: avg score: 1.0 > 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: max score: 1.0 > 2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: status 1 (db_unfetched): > 11322 > 2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: CrawlDb statistics: done > > I'm investigating this now. > > Again, thank you very much for your help. > > On 2020/12/21 00:04:36, László Bodor <bodorlaszlo0...@gmail.com> wrote: > > Hi! > > > > This is how I made it work (hadoop 3.1.3, tez 0.10.0), attached to drive: > > here > > <https://drive.google.com/file/d/1eFMUPSxFpJ0p7fi7IrsI3HAACa4m5s7n/view?usp=sharing> > > > > 1. > > hdfs dfs -mkdir -p /apps/tez > > hdfs dfs -put ~/Applications/apache/tez/tez.tar.gz /apps/tez > > > .. >