Hi Zhiyuan,
Thanks for your response.

On 2020/12/11 03:51:17, Zhiyuan Yang <zhiyu...@apache.org> wrote: 
> I think the first step can be simply trying replacing what you currently
> have in MapReduce with Tez,

I'm working on understanding how to do that :)
I'm going to start with the Injector tool.
The InjectorMapper [0] reads (i) the crawl database into which the seeds are 
injected, and (ii) a plain-text seed file, parsing each line in a particular way. 
Depending on configuration and command-line parameters, the URLs are normalized 
and filtered using the configured plugins.
The InjectorReducer [1] combines multiple new entries for a URL based on some 
logical rules.
The result is the updated crawl database, written using MapFileOutputFormat, 
where output keys are of type org.apache.hadoop.io.Text and values are of a 
custom CrawlDatum type [2] which represents the crawl state of a URL.
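
To make the question more concrete, here is a minimal sketch of the two steps 
in plain Java, with no Hadoop or Tez dependencies. It is not the actual Nutch 
code: the Entry class is a hypothetical stand-in for CrawlDatum, the seed line 
layout (URL followed by tab-separated key=value metadata) is simplified, and 
the merge rule (an existing database entry wins over a newly injected one) is 
an assumption standing in for the full InjectorReducer logic.

```java
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InjectorSketch {

  /** Hypothetical stand-in for CrawlDatum: origin flag, score and metadata only. */
  static final class Entry {
    final boolean injected;              // true if the entry came from the seed file
    final float score;
    final Map<String, String> metadata;

    Entry(boolean injected, float score, Map<String, String> metadata) {
      this.injected = injected;
      this.score = score;
      this.metadata = metadata;
    }
  }

  /**
   * Map side: parse one seed line of the form "url<TAB>key=value<TAB>key=value".
   * Returns null for blank lines and lines starting with '#'.
   * URL normalization and filtering are omitted here.
   */
  static Map.Entry<String, Entry> parseSeedLine(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty() || trimmed.startsWith("#")) {
      return null;
    }
    String[] fields = trimmed.split("\t");
    String url = fields[0];
    float score = 1.0f;                  // assumed default score
    Map<String, String> metadata = new LinkedHashMap<>();
    for (int i = 1; i < fields.length; i++) {
      int eq = fields[i].indexOf('=');
      if (eq <= 0) {
        continue;                        // skip fields that are not key=value
      }
      String key = fields[i].substring(0, eq);
      String value = fields[i].substring(eq + 1);
      if (key.equals("nutch.score")) {
        score = Float.parseFloat(value);
      } else {
        metadata.put(key, value);
      }
    }
    return new AbstractMap.SimpleImmutableEntry<>(url, new Entry(true, score, metadata));
  }

  /**
   * Reduce side: merge all entries seen for one URL.
   * Assumed rule: an existing crawl database entry wins; otherwise the
   * first injected entry is kept.
   */
  static Entry merge(List<Entry> values) {
    Entry firstInjected = null;
    for (Entry e : values) {
      if (!e.injected) {
        return e;
      }
      if (firstInjected == null) {
        firstInjected = e;
      }
    }
    return firstInjected;
  }
}
```

In the real job the two inputs are read by the same mapper class and the 
reducer sees both kinds of entries grouped by URL, which is the part I am 
unsure how to express as vertices and edges.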

Do you have any advice on how one would go about transforming the above job 
into a Tez DAG?

[0] 
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L105-L269
[1] 
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L271-L349
[2] 
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDatum.java
  
> instead of trying to integrate the entire data
> flow into Tez in a single step. My concern is whether there are some
> unpopular MapReduce features you rely on but are not supported by Tez yet.
> 

I would not be surprised. I will only really know this once I encounter 
something (which I hope does not happen).

Thanks again for any thoughts you have.
lewismc
