Hi user@,

First, a couple of things:
1. Thanks to Jonathan Eagles (who I spoke to offlist) for explaining a bit 
about the Tez community. When I looked through the mailing lists, even though 
you had just made a release, I wasn't sure if the project was alive and 
kicking. Thanks Jonathan for confirming.
2. Based on my digging through documentation and YouTube videos, I pulled 
together TEZ-4257 [0] and the corresponding pull request [1]. I also saw that 
the Travis CI build was broken, so I opened TEZ-4258.

Now, the important stuff... I'm a long-time developer of the Apache Nutch 
project [2], a mature, production-ready web crawler. Nutch relies on 
Apache Hadoop data structures and makes heavy use of MapReduce.

A typical Nutch crawl lifecycle involves the following steps:
* inject - from a seed list either create or inject entries into an existing 
crawl database
* generate - create fetch lists from suitable entries present within the crawl 
database
* fetch - fetch the URL partitions
* parse - extract data and metadata from the fetched content
* updatedb - based upon what was fetched, update the crawl database 
* wash, rinse, repeat (there are other steps like indexing; however, for 
simplicity let's leave those steps out for now)
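
To make the cycle concrete, here is a minimal Python sketch of the control 
flow only. It does not call any Nutch code; the function name and the `rounds` 
parameter are hypothetical, and it assumes inject runs once up front while the 
remaining steps repeat in each round.

```python
def crawl_plan(rounds):
    """Return the ordered step names for a crawl of `rounds` rounds.

    Assumption: inject runs once, then generate/fetch/parse/updatedb
    repeat for every round. Indexing and other steps are left out.
    """
    plan = ["inject"]
    for _ in range(rounds):
        plan.extend(["generate", "fetch", "parse", "updatedb"])
    return plan


if __name__ == "__main__":
    # Print the step sequence for a two-round crawl.
    print(crawl_plan(2))
```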

I recently started a thread over on the Nutch dev@ list to see if there is any 
interest in investigating what it would take to evolve Nutch from MapReduce to 
Tez. In order to understand the programming model, I looked at the Tez Javadoc 
and examples, both of which have been useful.

I suppose I have one basic question. Given my brief explanation of the crawl 
cycle above, should I be looking to implement just one DAG covering the entire 
crawl cycle? Or something else?
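
To illustrate what I mean by "one DAG", here is a rough Python sketch that 
treats one pass through the cycle as a dependency graph and derives an 
execution order from it. This is not Tez code and uses no Tez API; the edges 
(for example, updatedb consuming both fetch and parse output) are my own 
simplifying assumptions, and `ROUND_DAG` and `execution_order` are hypothetical 
names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose output it consumes.
# Assumed edges for illustration only; the real data flow may differ.
ROUND_DAG = {
    "generate": {"inject"},
    "fetch": {"generate"},
    "parse": {"fetch"},
    "updatedb": {"fetch", "parse"},
}


def execution_order(dag):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    print(execution_order(ROUND_DAG))
```

Whether the "repeat" part belongs inside such a graph or in a driver that 
submits one graph per round is exactly what I am unsure about.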

Currently we automate the crawl cycle via a bash script, with each step 
executed in sequence. There are several appealing reasons why an explicit 
dataflow programming model would be advantageous, but I just need clarity on 
the correct approach.

Thank you for any assistance.
lewismc

[0] https://issues.apache.org/jira/projects/TEZ/issues/TEZ-4257
[1] https://github.com/apache/tez/pull/82
[2] http://nutch.apache.org
