Hi László,
Thank you for the additional explanation. After adapting my configuration based 
on your suggestions, the jobs now execute successfully as DAGs. A huge thank 
you :)

A few notes:
1. I was unable to use the Tez minimal distribution. I had to use the full 
0.10.0-SNAPSHOT distribution because the minimal one lacks the 
jetty-http-9.4.20.v20190813.jar dependency.
2. Looking at the syslog for the DAG, I observed 
java.lang.NoSuchMethodException: java.nio.channels.ClosedByInterruptException. 
The full paste can be seen at https://paste.apache.org/mjw0c. Is this normal 
behavior?
3. We added some useful counters to the Injector job, which are printed to the 
application log. An example follows:

2020-12-21 12:06:35,242 INFO mapreduce.Job: Job job_1608580287657_0003 completed successfully
2020-12-21 12:06:35,249 INFO mapreduce.Job: Counters: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls rejected by filters: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected after normalization and filtering: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls injected but already in CrawlDb: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total new urls injected: 0
2020-12-21 12:06:35,282 INFO crawl.Injector: Injector: Total urls with status gone removed from CrawlDb (db.update.purge.404): 0
2020-12-21 12:06:35,293 INFO crawl.Injector: Injector: finished at 2020-12-21 12:06:35, elapsed: 00:00:14

As you can see, the counters (all zero) do not reflect the entries actually 
present in the newly created crawl database. I can verify this by running the 
following read-database job:

nutch readdb crawldb -stats

2020-12-21 12:08:06,558 INFO mapreduce.Job: Counters: 0
2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: Statistics for CrawlDb: crawldb
2020-12-21 12:08:06,586 INFO crawl.CrawlDbReader: TOTAL urls:   11322
2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: shortest fetch interval:      30 days, 00:00:00
2020-12-21 12:08:06,597 INFO crawl.CrawlDbReader: avg fetch interval:   30 days, 00:00:00
2020-12-21 12:08:06,598 INFO crawl.CrawlDbReader: longest fetch interval:       30 days, 00:00:00
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: earliest fetch time:  Mon Dec 21 12:06:00 PST 2020
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: avg of fetch times:   Mon Dec 21 12:06:00 PST 2020
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: latest fetch time:    Mon Dec 21 12:06:00 PST 2020
2020-12-21 12:08:06,600 INFO crawl.CrawlDbReader: retry 0:      11322
2020-12-21 12:08:06,605 INFO crawl.CrawlDbReader: score quantile 0.01:  1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.05:  1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.1:   1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.2:   1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.25:  1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.3:   1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.4:   1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.5:   1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.6:   1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.7:   1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.75:  1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.8:   1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.9:   1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.95:  1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: score quantile 0.99:  1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: min score:    1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: avg score:    1.0
2020-12-21 12:08:06,606 INFO crawl.CrawlDbReader: max score:    1.0
2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: status 1 (db_unfetched):      11322
2020-12-21 12:08:06,610 INFO crawl.CrawlDbReader: CrawlDb statistics: done

I'm investigating this now.
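In case anyone wants to cross-check the same thing on their side, this is the 
kind of inspection I am doing (a sketch assuming a standard Nutch 1.x install 
with `nutch` on the PATH; `crawldb_dump` is just a name I chose for the output 
directory):

```shell
# Summarize entry counts and statuses in the CrawlDb
# (this is where the -stats output above came from):
nutch readdb crawldb -stats

# Dump the individual CrawlDb entries as text for inspection,
# to compare against what the Injector counters reported:
nutch readdb crawldb -dump crawldb_dump
less crawldb_dump/part-r-00000
```

If the dump shows the 11322 entries while the job-level counters report 0, 
that points at the counters not being propagated from the Tez execution back 
to the client, rather than at the injection itself.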

Again, thank you very much for your help.

On 2020/12/21 00:04:36, László Bodor <bodorlaszlo0...@gmail.com> wrote: 
> Hi!
> 
> This is how I made it work (hadoop 3.1.3, tez 0.10.0), attached to drive:
> here
> <https://drive.google.com/file/d/1eFMUPSxFpJ0p7fi7IrsI3HAACa4m5s7n/view?usp=sharing>
> 
> 1.
> hdfs dfs -mkdir -p /apps/tez
> hdfs dfs -put ~/Applications/apache/tez/tez.tar.gz /apps/tez
> 
..
