Hi Lewis!

Just out of curiosity, could you please point me to a place in the Nutch
code where some of the workflow steps are compiled into / executed by
MapReduce?
Also, again out of curiosity, how widely adopted is Apache Nutch? Could you
please send some references about Nutch adopters? This looks like an
interesting project.

Thanks,
Laszlo Bodor

On Fri, 11 Dec 2020 at 04:51, Zhiyuan Yang <zhiyu...@apache.org> wrote:

> I think the first step can simply be to try replacing what you currently
> have in MapReduce with Tez, instead of trying to integrate the entire data
> flow into Tez in a single step. My concern is whether you rely on some
> less commonly used MapReduce features that are not yet supported by Tez.
>
> Thanks,
> Zhiyuan
>
> On Fri, Dec 11, 2020 at 10:40 AM Lewis John McGibbney <lewi...@apache.org>
> wrote:
>
>> Hi user@,
>>
>> First, a couple of things:
>> 1. Thanks to Jonathan Eagles (whom I spoke to off-list) for explaining a
>> bit about the Tez community. When I looked through the mailing lists, even
>> though you guys just made a release, I wasn't sure if the project was
>> alive and kicking. Thanks, Jonathan, for confirming.
>> 2. Based on my digging through documentation and YouTube videos, I pulled
>> together TEZ-4257 [0] and the corresponding pull request [1]. I also saw
>> that the TravisCI build was broken so I produced TEZ-4258.
>>
>> Now, the important stuff... I'm a long-time developer of the Apache Nutch
>> project [2], a mature, production-ready Web crawler. Nutch is built on
>> Apache Hadoop data structures and relies heavily on MapReduce.
>>
>> A typical Nutch crawl lifecycle involves the following steps:
>> * inject - from a seed list, either create a new crawl database or
>> inject entries into an existing one
>> * generate - generate fetch lists from suitable entries present within
>> the crawl database
>> * fetch - fetch the URL partitions
>> * parse - extract data and metadata from the fetched content
>> * updatedb - based upon what was fetched, update the crawl database
>> * wash, rinse, repeat (there are other steps like indexing; however, for
>> simplicity let's leave those steps out for now)
>>
>> I recently started a thread over on the Nutch dev@ list to see if there
>> is any interest in investigating what it would take to evolve Nutch from
>> MapReduce --> Tez. In order to understand the programming model, I looked
>> at the Tez Javadoc and examples, both of which have been useful.
>>
>> I suppose I have one basic question. Given my brief explanation of the
>> crawl cycle above, should I be looking to implement just one DAG covering
>> the entire crawl cycle? Or something else?
>>
>> Currently we automate the crawl cycle via a bash script, with each step
>> executed in sequence. There are several appealing reasons why an explicit
>> data flow programming model would be advantageous, but I just need clarity
>> on the correct approach.
>>
>> Thank you for any assistance.
>> lewismc
>>
>> [0] https://issues.apache.org/jira/projects/TEZ/issues/TEZ-4257
>> [1] https://github.com/apache/tez/pull/82
>> [2] http://nutch.apache.org
>>
>

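As a rough illustration of the crawl cycle described in the quoted message, the steps can be written down as an explicit dependency graph instead of a fixed script order. The sketch below is plain Java and does not use the Nutch or Tez APIs. The class name, the method names and the dependency map are all hypothetical, and the assumption that each step consumes only the output of the step before it is a simplification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlCycleSketch {

    // Hypothetical dependency map: step -> steps whose output it consumes.
    // Deliberately listed out of order to show that the order is derived
    // from the dependencies, not from the order of declaration.
    static Map<String, List<String>> crawlCycle() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("updatedb", List.of("parse"));
        deps.put("parse", List.of("fetch"));
        deps.put("fetch", List.of("generate"));
        deps.put("generate", List.of("inject"));
        deps.put("inject", List.of());
        return deps;
    }

    // Depth-first topological sort: a step is emitted only after every
    // step it depends on has been emitted.
    static List<String> executionOrder(Map<String, List<String>> deps) {
        Set<String> done = new LinkedHashSet<>();
        Set<String> inProgress = new LinkedHashSet<>();
        for (String step : deps.keySet()) {
            visit(step, deps, done, inProgress);
        }
        return new ArrayList<>(done);
    }

    private static void visit(String step, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress) {
        if (done.contains(step)) {
            return;
        }
        if (!inProgress.add(step)) {
            // Revisiting a step that is still being resolved means a cycle.
            throw new IllegalStateException("Dependency cycle at " + step);
        }
        for (String upstream : deps.getOrDefault(step, List.of())) {
            visit(upstream, deps, done, inProgress);
        }
        inProgress.remove(step);
        done.add(step);
    }

    public static void main(String[] args) {
        // Prints [inject, generate, fetch, parse, updatedb]
        System.out.println(executionOrder(crawlCycle()));
    }
}
```

With a purely linear chain like this one, the derived order matches what the bash script already does. A graph representation only starts to pay off once steps have more than one input or can run side by side.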