Hi Zhiyuan,

Thanks for your response.

On 2020/12/11 03:51:17, Zhiyuan Yang <zhiyu...@apache.org> wrote:
> I think the first step can be simply trying replacing what you currently
> have in MapReduce with Tez,

I'm working on understanding how to do that :) I'm going to start with the
Injector tool.

The InjectorMapper [0] reads (i) the crawl database into which the seeds are
injected, and (ii) a plain-text seed file, parsing each line in a particular
way. Depending on configuration and command-line parameters, the URLs are
normalized and filtered using the configured plugins.

The InjectorReducer [1] combines multiple new entries for a URL based on some
logical rules. The result is the updated crawl database, written with
MapFileOutputFormat, where output keys are of type org.apache.hadoop.io.Text
and values are of a custom CrawlDatum type [2], which represents the crawl
state of a URL.

Do you have any advice on how one would go about transforming the above job
into a Tez DAG?

[0] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L105-L269
[1] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L271-L349
[2] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java

> instead of trying to integrate the entire data
> flow into Tez in a single step. My concern is whether there are some
> unpopular MapReduce features you rely on but are not supported by Tez yet.
>

I would not be surprised. I will only really know this once I encounter
something (which I hope does not happen).

Thanks again for any thoughts you have.

lewismc