Hi László,
Thanks for your response.

On 2020/12/12 09:43:33, László Bodor <bodorlaszlo0...@gmail.com> wrote: 
> Hi Lewis!
> 
> Just for curiosity's sake, could you please point me to a place in nutch
> code where some of the steps of the workflow are compiled into / done by
> MapReduce?

Please see my response to Zhiyuan earlier in this thread. I have broken down 
the Injector job and tried to describe the MapReduce logic without going into 
too many specifics. It would be greatly appreciated if you were able to take a 
look at that. Also, do you have any general guidance on how one would go about 
porting a MapReduce job to the Tez programming model? It's not clear to me how 
one identifies candidate Vertices and Edges. Thank you.

> Also - again for curiosity's sake - what about the adoption level of Apache
> Nutch, could please send references about Nutch adopters? This looks like
> an interesting project.

Nutch is probably the most popular open source crawler. I understand that Doug 
Cutting and others began writing it and realized that in order to scale the Web 
crawler they needed a distributed computing model. The Hadoop project was born 
out of Nutch so that gives you an idea of how long it's been around for. I've 
been on the project for many years and have interacted with literally thousands 
of people on the mailing lists. I suspect that it is in deployment in a lot of 
places. I will also say that it is not a particularly easy code base to 
understand... it is quite complex. Even though Nutch ships with a sensible 
default configuration, it is notoriously difficult to configure because, like 
Hadoop, it has hundreds of configuration parameters which may need to be tuned.

Thank you for helping me better understand the process of porting MapReduce 
jobs to Tez.
lewismc 
