Re: why is nutch2.1 trying to parse the same documents again and again?

Lewis John Mcgibbney Wed, 27 Feb 2013 00:50:57 -0800

Hi

On Wednesday, February 27, 2013, adfel70 <[email protected]> wrote:
> Yes I looked at the code.
Great


> I saw that the shouldProcess() check is performed on each file in the mapper.
> In nutch1.* I got used to a model in which each cycle processes only a set
> of urls.
> Is nutch2.* processing all the urls in each cycle, so that this
> shouldProcess() check is required to make sure the same file isn't parsed
> twice?
Nothing is set in stone in the Nutch 2.x code. I say this to make clear that
if you have an itch and want to scratch it, then come on board and we can work
on making sure that shouldProcess() prevents repeated, unnecessary parsing. We
do not need this behaviour, and even if it is not a bug (which it might be) it
is still a pain, and it is annoying me too.
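
To make the idea concrete, here is a minimal sketch of the kind of check being discussed: parse a page only if it was fetched in the current batch and has not been parsed yet. This is not the actual Nutch code. The class ParseSkipSketch, the PageMarks type and the shouldParse() method are hypothetical names made up for illustration, and the real mark handling in 2.x may well differ.

```java
public class ParseSkipSketch {

    /** Hypothetical per-page state: the batch the page was fetched in and, if any, parsed in. */
    static final class PageMarks {
        final String fetchMark; // null if the page was never fetched
        String parseMark;       // null if the page was never parsed

        PageMarks(String fetchMark, String parseMark) {
            this.fetchMark = fetchMark;
            this.parseMark = parseMark;
        }
    }

    /** Returns true only for pages fetched in this batch that have not been parsed yet. */
    static boolean shouldParse(PageMarks page, String batchId) {
        if (page.fetchMark == null) {
            return false; // never fetched, so there is nothing to parse
        }
        if (!page.fetchMark.equals(batchId)) {
            return false; // fetched in a different batch, leave it alone
        }
        return page.parseMark == null; // skip pages that already carry a parse mark
    }

    public static void main(String[] args) {
        PageMarks page = new PageMarks("batch-1", null);
        System.out.println("first pass:  " + shouldParse(page, "batch-1"));
        page.parseMark = "batch-1"; // record that parsing happened
        System.out.println("second pass: " + shouldParse(page, "batch-1"));
    }
}
```

The point of the sketch is only that a check like this has to look at whether a page was already parsed, not just whether it was fetched, otherwise the same document keeps qualifying on every run.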

> Also, I see that there is a loop on depth parameter. So if the defined
> depth is greater than the actual depth of the site I'm crawling, the loop
> will just go on until it reaches the defined depth

I would think not. We cannot force fetching of content which simply does
not exist. That said, we need to ensure that Nutch does not misinterpret
our intentions.
I am at ApacheCon and I am looking at the Nutch code. It is 1am, so I will
try to look at this tomorrow.
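
For what it is worth, the behaviour I would expect from a depth loop looks roughly like the sketch below: stop as soon as a round has no new URLs, even if the configured depth has not been reached. This is an illustration under my own assumptions, not the Nutch crawl loop. DepthLoopSketch, crawl() and the in-memory link map are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthLoopSketch {

    /**
     * Walks a hypothetical in-memory link graph breadth-first from the seed and
     * returns how many rounds actually ran, which can be fewer than maxDepth.
     */
    static int crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        Set<String> seen = new HashSet<>();
        List<String> frontier = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);

        int rounds = 0;
        for (int depth = 0; depth < maxDepth; depth++) {
            if (frontier.isEmpty()) {
                break; // nothing left to fetch, so do not keep looping up to maxDepth
            }
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                for (String out : links.getOrDefault(url, Collections.emptyList())) {
                    if (seen.add(out)) { // add() is false for URLs already seen
                        next.add(out);
                    }
                }
            }
            frontier = next;
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Collections.singletonList("b"));
        links.put("b", Collections.singletonList("c"));
        System.out.println("rounds run with maxDepth=10: " + crawl(links, "a", 10));
    }
}
```

If the real loop does not bail out like this when a round produces nothing, then that is something we should look at.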

-- 
*Lewis*
