Hi László,

Thanks for your response.

On 2020/12/12 09:43:33, László Bodor <bodorlaszlo0...@gmail.com> wrote:
> Hi Lewis!
>
> Just for curiosity's sake, could you please point me to a place in nutch
> code where some of the steps of the workflow are compiled into / done by
> MapReduce?
Please see my response to Zhiyuan earlier in this thread. I have broken down
the Injector job and tried to describe the MapReduce logic without going into
too many specifics. It would be greatly appreciated if you were able to take a
look at that.

Also, do you have any general guidance on how one would go about porting a
MapReduce job to the Tez programming model? It's not clear to me how one
identifies candidate Vertices and Edges.

Thank you.

> Also - again for curiosity's sake - what about the adoption level of Apache
> Nutch, could please send references about Nutch adopters? This looks like
> an interesting project.

Nutch is probably the most popular open source crawler. I understand that
Doug Cutting and others began writing it and realized that, in order to scale
the Web crawler, they needed a distributed computing model. The Hadoop project
was born out of Nutch, which gives you an idea of how long it has been around.
I've been on the project for many years and have interacted with literally
thousands of people on the mailing lists, so I suspect it is deployed in a lot
of places. I will also say that it is not a particularly easy code base to
understand... it is quite complex. Even though Nutch ships with sensible
default configuration, it is notoriously difficult to configure because, like
Hadoop, it has literally hundreds of configuration parameters which may need
to be tuned.

Thank you for assisting me with better understanding the process of evolving
MapReduce jobs --> Tez.

lewismc
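
P.S. To make the Vertices/Edges question concrete, below is a rough sketch of
how I currently imagine a single map -> shuffle -> reduce stage being expressed
as a two-vertex Tez DAG. It is loosely modelled on the Tez WordCount example;
the InjectMapProcessor / InjectReduceProcessor classes and the Text/Text key
and value types are placeholders for illustration, not actual Nutch or Tez
classes. Please correct me if this is the wrong mental model.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.mapreduce.input.MRInput;
import org.apache.tez.mapreduce.output.MROutput;
import org.apache.tez.mapreduce.processor.SimpleMRProcessor;
import org.apache.tez.runtime.api.ProcessorContext;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class InjectDagSketch {

  // Placeholder standing in for the old Mapper logic (e.g. read seed URLs).
  public static class InjectMapProcessor extends SimpleMRProcessor {
    public InjectMapProcessor(ProcessorContext context) { super(context); }
    @Override public void run() throws Exception { /* read MRInput, write to the edge */ }
  }

  // Placeholder standing in for the old Reducer logic (e.g. merge records).
  public static class InjectReduceProcessor extends SimpleMRProcessor {
    public InjectReduceProcessor(ProcessorContext context) { super(context); }
    @Override public void run() throws Exception { /* read from the edge, write MROutput */ }
  }

  public static DAG buildDag(TezConfiguration tezConf, String in, String out, int reducers) {
    // One Vertex per MapReduce phase: the Mapper becomes a "map" vertex...
    Vertex mapVertex = Vertex.create("InjectMap",
        ProcessorDescriptor.create(InjectMapProcessor.class.getName()));
    mapVertex.addDataSource("MRInput",
        MRInput.createConfigBuilder(new Configuration(tezConf), TextInputFormat.class, in).build());

    // ...and the Reducer becomes a "reduce" vertex whose explicit parallelism
    // plays the role of mapreduce.job.reduces.
    Vertex reduceVertex = Vertex.create("InjectReduce",
        ProcessorDescriptor.create(InjectReduceProcessor.class.getName()), reducers);
    reduceVertex.addDataSink("MROutput",
        MROutput.createConfigBuilder(new Configuration(tezConf), TextOutputFormat.class, out).build());

    // The implicit MapReduce shuffle (partition + sort) becomes an explicit
    // Edge connecting the two vertices.
    OrderedPartitionedKVEdgeConfig shuffle = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), Text.class.getName(), HashPartitioner.class.getName())
        .build();

    return DAG.create("inject")
        .addVertex(mapVertex)
        .addVertex(reduceVertex)
        .addEdge(Edge.create(mapVertex, reduceVertex, shuffle.createDefaultEdgeProperty()));
  }
}

My understanding is that the resulting DAG would then be submitted via
TezClient.submitDAG() in place of Job.waitForCompletion(), but again, please
correct me if I have any of this wrong.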