Re: Porting legacy MapReduce application to Tez

Lewis John McGibbney Mon, 21 Dec 2020 12:30:48 -0800

Some more observations

4. When 'mapreduce.framework.name' is set to yarn, counters are present for the 
Injector job. The following is example output showing all of the Injector job 
counters.


2020-12-21 12:13:58,674 INFO mapreduce.Job: Job job_1608581566698_0001 
completed successfully
2020-12-21 12:13:58,760 INFO mapreduce.Job: Counters: 52
        File System Counters
                FILE: Number of bytes read=1456826
                FILE: Number of bytes written=3699396
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1163333
                HDFS: Number of bytes written=794148
                HDFS: Number of read operations=15
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=6
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=19002
                Total time spent by all reduces in occupied slots (ms)=7305
                Total time spent by all map tasks (ms)=6334
                Total time spent by all reduce tasks (ms)=2435
                Total vcore-milliseconds taken by all map tasks=6334
                Total vcore-milliseconds taken by all reduce tasks=2435
                Total megabyte-milliseconds taken by all map tasks=9501000
                Total megabyte-milliseconds taken by all reduce tasks=3652500
        Map-Reduce Framework
                Map input records=22845
                Map output records=22765
                Map output bytes=1411290
                Map output materialized bytes=1456832
                Input split bytes=572
                Combine input records=0
                Combine output records=0
                Reduce input groups=11322
                Reduce shuffle bytes=1456832
                Reduce input records=22765
                Reduce output records=11322
                Spilled Records=45530
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=114
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=1046478848
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        injector
                urls_filtered=80
                urls_injected=11443
                urls_merged=11322
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=794148
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls rejected by 
filters: 80
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls injected 
after normalization and filtering: 11443
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total urls injected but 
already in CrawlDb: 11322
2020-12-21 12:13:58,793 INFO crawl.Injector: Injector: Total new urls injected: 
121
2020-12-21 12:13:58,794 INFO crawl.Injector: Injector: Total urls with status 
gone removed from CrawlDb (db.update.purge.404): 0
2020-12-21 12:13:58,804 INFO crawl.Injector: Injector: finished at 2020-12-21 
12:13:58, elapsed: 00:00:36

5. When 'mapreduce.framework.name' is set to 'yarn-tez' I am observing the 
following runtimes
  * 1st run: elapsed: 00:00:42
  * 2nd run: elapsed: 00:00:13
  * 3rd run: elapsed: 00:00:14

5. When 'mapreduce.framework.name' is set to 'yarn' I am observing the 
following runtimes
  * 1st run: elapsed: 00:00:34
  * 2nd run: elapsed: 00:00:32
  * 3rd run: elapsed: 00:00:34

After the first DAG run, it looks like there is a marked runtime improvement 
running the Injector job on Tez. It would be great if I could somehow explain 
this but my Tez knowledge is still very limited. I will keep digging though!

I'm also going to create a tutorial on the Nutch wiki covering this entire 
experience. I'll maybe pull together a YouTube video or something as well.

lewismc

On 2020/12/21 20:11:59, Lewis John McGibbney <lewi...@apache.org> wrote: 
> Hi László,
> Thank you for the additional explanation. Adapting my configuration based on 
> your suggestions results in successful job execution as DAG's now. A huge 
> thank you :)
> 
> A couple of notes
> 1. I was unable to use the Tez minimal distribution. I had to use the full 
> 0.10.0-SNAPSHOT due to the absence of the jetty-http-9.4.20.v20190813.jar 
> dependency in minimal distribution.
> 2. Looking at the syslog for the DAG, I have observed 
> java.lang.NoSuchMethodException: 
> java.nio.channels.ClosedByInterruptException. The full paste can be seen at 
> https://paste.apache.org/mjw0c. Is this normal behavior? 
> 3. We added some useful counters to the Injector job which are printed to the 
> application log. Example is as follows
> 
> 2020-12-21 12:06:35,242 INFO mapreduce.Job: Job job_1608580287657_0003 
> completed successfully
> 2020-12-21 12:06:35,249 INFO mapreduce.Job: Counters: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls rejected by 
> filters: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected 
> after normalization and filtering: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected 
> but already in CrawlDb: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total new urls 
> injected: 0
> 2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls with status 
> gone removed from CrawlDb (db.update.purge.404): 0
> 2020-12-21 12:06:35,293 INFO crawl.Injector: Injector: finished at 2020-12-21 
> 12:06:35, elapsed: 00:00:14
> 
> As you can see, apparently the counters are not correctly representing the 
> relevant entries in the newly created crawl database. I can verify this by 
> running the following read database job
> 
> nutch readdb crawldb -stats
> 
> 2020-12-21 12:08:06,558 INFO mapreduce.Job: Counters: 0
> 2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: Statistics for CrawlDb: 
> crawldb
> 2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: TOTAL urls: 11322
> 2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: shortest fetch interval:    
> 30 days, 00:00:00
> 2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: avg fetch interval: 30 
> days, 00:00:00
> 2020-12-21 12:08:06,598 INFO crawl.CrawlDbReader: longest fetch interval:     
> 30 days, 00:00:00
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: earliest fetch time:        
> Mon Dec 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: avg of fetch times: Mon Dec 
> 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: latest fetch time:  Mon Dec 
> 21 12:06:00 PST 2020
> 2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: retry 0:    11322
> 2020-12-21 12:08:06,605 INFO crawl.CrawlDbReader: score quantile 0.01:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.05:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.1: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.2: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.25:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.3: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.4: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.5: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.6: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.7: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.75:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.8: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.9: 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.95:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.99:        
> 1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: min score:  1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: avg score:  1.0
> 2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: max score:  1.0
> 2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    
> 11322
> 2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> 
> I'm investigating this now.
> 
> Again, thank you very much for your help.
> 
> On 2020/12/21 00:04:36, László Bodor <bodorlaszlo0...@gmail.com> wrote: 
> > Hi!
> > 
> > This is how I made it work (hadoop 3.1.3, tez 0.10.0), attached to drive:
> > here
> > <https://drive.google.com/file/d/1eFMUPSxFpJ0p7fi7IrsI3HAACa4m5s7n/view?usp=sharing>
> > 
> > 1.
> > hdfs dfs -mkdir -p /apps/tez
> > hdfs dfs -put ~/Applications/apache/tez/tez.tar.gz /apps/tez
> > 
> ..
>

Re: Porting legacy MapReduce application to Tez

Reply via email to