Hi

On Wednesday, February 27, 2013, adfel70 <[email protected]> wrote:
> Yes I looked at the code.

Great

> I saw that the shouldProcess() check is performed on each file in the mapper.
> I got used in Nutch 1.* to a method in which only a set of
> URLs is processed in each cycle.
> Is Nutch 2.* processing all the URLs in each cycle, and is this
> shouldProcess() check therefore required to make sure the same file isn't parsed twice?

Nothing is static in the Nutch 2.x code. I say this to make clear that if you have an itch and want to scratch it, then come on board and we can work on making sure that shouldProcess() prevents repeated or unnecessary parsing. We do not need that extra parsing, and even if it is not a bug (which it might be), it is still a pain and it annoys me.

> Also, I see that there is a loop on the depth parameter. So if the defined depth
> is greater than the actual depth of the site I'm crawling, the loop will
> just go on until it reaches the defined depth

I would think not. We cannot force fetching of content which simply does not exist. That said, we need to ensure that Nutch does not misinterpret our intentions.

I am at ApacheCon and looking at the Nutch code. It is 1am, so I will try to look at this tomorrow.

--
*Lewis*

