Hi user@,

First, a couple of things:

1. Thanks to Jonathan Eagles (who I spoke to off-list) for explaining a bit about the Tez community. When I looked through the mailing lists, even though you had just made a release, I wasn't sure whether the project was alive and kicking. Thanks, Jonathan, for confirming.
2. Based on my digging through the documentation and YouTube videos, I pulled together TEZ-4257 [0] and the corresponding pull request [1]. I also saw that the TravisCI build was broken, so I created TEZ-4258.

Now, the important stuff...

I'm a long-time developer of the Apache Nutch project [2], a mature, production-ready Web crawler. Nutch relies on Apache Hadoop data structures and relies heavily on MapReduce. A typical Nutch crawl lifecycle involves the following steps:

* inject - from a seed list, either create a crawl database or inject entries into an existing one
* generate - produce fetch lists from suitable entries present within the crawl database
* fetch - fetch the URL partitions
* parse - extract data and metadata from the fetched content
* updatedb - based upon what was fetched, update the crawl database
* wash, rinse, repeat

(There are other steps, such as indexing; however, for simplicity, let's leave those out for now.)

I recently started a thread over on the Nutch dev@ list to see if there is any interest in investigating what it would take to evolve Nutch from MapReduce to Tez. In order to understand the programming model, I looked at the Tez Javadoc and examples, both of which have been useful.

I suppose I have one basic question. Given my brief explanation of the crawl cycle above, should I be looking to implement just one DAG covering the entire crawl cycle? Or something else? Currently we automate the crawl cycle via a bash script, with each step executed in sequence. There are several appealing reasons why an explicit data flow programming model would be advantageous, but I just need clarity on the correct approach.

Thank you for any assistance.

lewismc

[0] https://issues.apache.org/jira/projects/TEZ/issues/TEZ-4257
[1] https://github.com/apache/tez/pull/82
[2] http://nutch.apache.org
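
P.S. To make the question more concrete, here is a minimal sketch of the crawl cycle described above, modelled as a dependency graph in plain Java. It deliberately uses no Tez API: the class name CrawlCycleSketch and its methods are hypothetical, and the dependency edges are simply an assumption read off the step order listed above. It also assumes inject runs once and that only generate, fetch, parse and updatedb repeat. Because a DAG cannot contain a cycle, the "repeat" lives in an outer loop here, which is one way of picturing the "one DAG or something else" question rather than a recommendation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CrawlCycleSketch {

    // Hypothetical model of one crawl round: step name -> steps it depends on.
    // The edges are an assumption based on the sequential order of the steps.
    static Map<String, List<String>> crawlRound() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("generate", List.of());
        deps.put("fetch", List.of("generate"));
        deps.put("parse", List.of("fetch"));
        deps.put("updatedb", List.of("parse"));
        return deps;
    }

    // Kahn's algorithm: repeatedly emit steps whose dependencies are all done.
    static List<String> topologicalOrder(Map<String, List<String>> deps) {
        Map<String, Integer> remaining = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            remaining.put(e.getKey(), e.getValue().size());
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.poll();
            order.add(step);
            // Release every step that was waiting on the one just emitted.
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                if (e.getValue().contains(step)) {
                    int left = remaining.merge(e.getKey(), -1, Integer::sum);
                    if (left == 0) {
                        ready.add(e.getKey());
                    }
                }
            }
        }
        if (order.size() != deps.size()) {
            throw new IllegalStateException("dependency graph contains a cycle");
        }
        return order;
    }

    // Assumption: inject runs once, then the acyclic round is repeated
    // by an outer loop, since the repeat cannot be an edge inside the DAG.
    static List<String> plan(int rounds) {
        List<String> plan = new ArrayList<>();
        plan.add("inject");
        for (int i = 0; i < rounds; i++) {
            plan.addAll(topologicalOrder(crawlRound()));
        }
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(plan(2));
    }
}
```

Under these assumptions, each round is a straight chain of four steps, so the open question is whether that chain (or several rounds of it) should be a single DAG, or whether each step should remain its own job as in the current bash script.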