Some more observations

4. When 'mapreduce.framework.name' is set to 'yarn', counters are present for 
the Injector job. The following is example output showing all of the Injector 
job counters. 

2020-12-21 12:13:58,674 INFO mapreduce.Job: Job job_1608581566698_0001 
completed successfully
2020-12-21 12:13:58,760 INFO mapreduce.Job: Counters: 52
        File System Counters
                FILE: Number of bytes read=1456826
                FILE: Number of bytes written=3699396
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1163333
                HDFS: Number of bytes written=794148
                HDFS: Number of read operations=15
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=6
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=19002
                Total time spent by all reduces in occupied slots (ms)=7305
                Total time spent by all map tasks (ms)=6334
                Total time spent by all reduce tasks (ms)=2435
                Total vcore-milliseconds taken by all map tasks=6334
                Total vcore-milliseconds taken by all reduce tasks=2435
                Total megabyte-milliseconds taken by all map tasks=9501000
                Total megabyte-milliseconds taken by all reduce tasks=3652500
        Map-Reduce Framework
                Map input records=22845
                Map output records=22765
                Map output bytes=1411290
                Map output materialized bytes=1456832
                Input split bytes=572
                Combine input records=0
                Combine output records=0
                Reduce input groups=11322
                Reduce shuffle bytes=1456832
                Reduce input records=22765
                Reduce output records=11322
                Spilled Records=45530
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=114
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=1046478848
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        injector
                urls_filtered=80
                urls_injected=11443
                urls_merged=11322
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=794148
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls rejected by 
filters: 80
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls injected 
after normalization and filtering: 11443
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls injected but 
already in CrawlDb: 11322
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total new urls injected: 
121
2020-12-21 12:13:58,794 INFO crawl.Injector: Injector: Total urls with status 
gone removed from CrawlDb (db.update.purge.404): 0
2020-12-21 12:13:58,804 INFO crawl.Injector: Injector: finished at 2020-12-21 
12:13:58, elapsed: 00:00:36
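
For anyone reproducing this, the switch between the two runtimes above is a 
single property. This is a sketch of what I have in mapred-site.xml; the 
property name is the standard Hadoop one, and note that the 'yarn-tez' value 
additionally requires the Tez client libraries and a tez-site.xml (pointing at 
the tarball in HDFS) on the classpath.

```xml
<!-- mapred-site.xml: selects the execution engine for MapReduce jobs.
     "yarn" runs classic MapReduce-on-YARN; "yarn-tez" routes jobs
     through the Tez runtime instead. -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```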

5. When 'mapreduce.framework.name' is set to 'yarn-tez' I am observing the 
following runtimes
  * 1st run: elapsed: 00:00:42
  * 2nd run: elapsed: 00:00:13
  * 3rd run: elapsed: 00:00:14

6. When 'mapreduce.framework.name' is set to 'yarn' I am observing the 
following runtimes
  * 1st run: elapsed: 00:00:34
  * 2nd run: elapsed: 00:00:32
  * 3rd run: elapsed: 00:00:34
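
A quick sketch for eyeballing the comparison: converting the "elapsed" values 
above to seconds with a small shell function (times copied from the runs 
listed in this message).

```shell
# Convert an HH:MM:SS elapsed time to seconds.
to_secs() { echo "$1" | awk -F: '{ print $1*3600 + $2*60 + $3 }'; }

echo "yarn-tez:"; for t in 00:00:42 00:00:13 00:00:14; do to_secs "$t"; done
echo "yarn:";     for t in 00:00:34 00:00:32 00:00:34; do to_secs "$t"; done
```

So after the warm-up run, the Tez executions settle at roughly 13-14 seconds 
versus a steady 32-34 seconds on plain YARN.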

After the first DAG run, there appears to be a marked runtime improvement when 
running the Injector job on Tez. It would be great if I could explain this, 
but my Tez knowledge is still very limited. I will keep digging though!

I'm also going to write a tutorial on the Nutch wiki covering this entire 
experience, and I may pull together a YouTube video as well.

lewismc

On 2020/12/21 20:11:59, Lewis John McGibbney <lewi...@apache.org> wrote: 
> Hi László,
> Thank you for the additional explanation. Adapting my configuration based on 
> your suggestions results in successful job execution as DAG's now. A huge 
> thank you :)
> 
> A couple of notes
> 1. I was unable to use the Tez minimal distribution. I had to use the full 
> 0.10.0-SNAPSHOT due to the absence of the jetty-http-9.4.20.v20190813.jar 
> dependency in the minimal distribution.
> 2. Looking at the syslog for the DAG, I have observed 
> java.lang.NoSuchMethodException: 
> java.nio.channels.ClosedByInterruptException. The full paste can be seen at 
> https://paste.apache.org/mjw0c. Is this normal behavior? 
> 3. We added some useful counters to the Injector job which are printed to the 
> application log. Example is as follows
> 
> 2020-12-21 12:06:35,242 INFO mapreduce.Job: Job job_1608580287657_0003 
> completed successfully
> 2020-12-21 12:06:35,249 INFO mapreduce.Job: Counters: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls rejected by 
> filters: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected 
> after normalization and filtering: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected 
> but already in CrawlDb: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total new urls 
> injected: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls with status 
> gone removed from CrawlDb (db.update.purge.404): 0
> 2020-12-21 12:06:35,293 INFO crawl.Injector: Injector: finished at 2020-12-21 
> 12:06:35, elapsed: 00:00:14
> 
> As you can see, apparently the counters are not correctly representing the 
> relevant entries in the newly created crawl database. I can verify this by 
> running the following read database job
> 
> nutch readdb crawldb -stats
> 
> 2020-12-21 12:08:06,558 INFO mapreduce.Job: Counters: 0
> 2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: Statistics for CrawlDb: 
> crawldb
> 2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: TOTAL urls: 11322
> 2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: shortest fetch interval:    
> 30 days, 00:00:00
> 2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: avg fetch interval: 30 
> days, 00:00:00
> 2020-12-21 12:08:06,598 INFO crawl.CrawlDbReader: longest fetch interval:     
> 30 days, 00:00:00
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: earliest fetch time:        
> Mon Dec 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: avg of fetch times: Mon Dec 
> 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: latest fetch time:  Mon Dec 
> 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: retry 0:    11322
> 2020-12-21 12:08:06,605 INFO crawl.CrawlDbReader: score quantile 0.01:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.05:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.1: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.2: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.25:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.3: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.4: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.5: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.6: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.7: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.75:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.8: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.9: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.95:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.99:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: min score:  1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: avg score:  1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: max score:  1.0
> 2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    
> 11322
> 2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> 
> I'm investigating this now.
> 
> Again, thank you very much for your help.
> 
> On 2020/12/21 00:04:36, László Bodor <bodorlaszlo0...@gmail.com> wrote: 
> > Hi!
> > 
> > This is how I made it work (hadoop 3.1.3, tez 0.10.0), attached to drive:
> > here
> > <https://drive.google.com/file/d/1eFMUPSxFpJ0p7fi7IrsI3HAACa4m5s7n/view?usp=sharing>
> > 
> > 1.
> > hdfs dfs -mkdir -p /apps/tez
> > hdfs dfs -put ~/Applications/apache/tez/tez.tar.gz /apps/tez
> > 
> ..
> 
