Quite a frustrating problem: after 24h of fetching properly, the
fetcher is spinwaiting on a fetch queue size of 6 and then aborts
the fetch.
I have this problem from time to time and I have no idea why.
Can anyone help or suggest something to debug this issue?
--
-MilleBii-
Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my
page has changed since the last time I crawled it? I patched Nutch to
properly handle modification dates, and then discovered that my web
site doesn't send Modification-Date because it uses shtml
(Server-parsed HTML). I
Hi Paul,
On Aug 19, 2009, at 6:08am, Paul Tomblin wrote:
Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my
page has changed since the last time I crawled it? I patched Nutch to
properly handle modification dates, and then discovered that my web
site doesn't send
On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
Another question: is Nutch smart enough to use that signature to
determine that, say, http://xcski.com/ and http://xcski.com/index.html
are the same page?
I believe the hashes would be the same for either raw MD5
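To illustrate the point about content digests, here is a minimal Python sketch. It assumes the signature is a plain MD5 over the raw page bytes; it is not Nutch's own signature code, and the function names and the sample page body are made up for the example. Because the URL is not an input to the hash, two URLs that return identical bytes get identical digests.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Return the hex MD5 digest of the raw page bytes (the URL is not an input)."""
    return hashlib.md5(body).hexdigest()


def has_changed(previous_digest: str, body: bytes) -> bool:
    """Report whether the freshly fetched bytes differ from the stored digest."""
    return content_digest(body) != previous_digest


# Hypothetical page body standing in for what both URLs return.
page = b"<html><body>Cross-country skiing</body></html>"

root_digest = content_digest(page)   # as if fetched from http://xcski.com/
index_digest = content_digest(page)  # as if fetched from http://xcski.com/index.html
same_page = root_digest == index_digest
```

One caveat with a raw-bytes digest: if the server-parsed page embeds anything that varies per request (a timestamp, for instance), the digest changes on every fetch even though the visible content has not.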
Hi,
I have read a few tutorials on running Nutch to crawl the web. However, I still
do not understand the meaning of the topN variable in the crawl command. The
tutorials suggest creating 3 segments and fetching them with topN=1000. What if
I create 100 segments, or only one? What would be
Well, in the segment there is nothing but _temporary... and then, after
spinwaiting a number of times on the last 6 elements, it aborts and the
segment is empty...
I guess everything is left in the tmp files.
If only it would abort gracefully and close the segment properly...
I basically lost 200k URLs.
Thanks. What if the URLs in my seed file do not have outlinks, let's say .pdf
files? Should I still specify the topN variable? All I need is to index all the
URLs in my seed file, and there are about 1 M of them.
Alex.
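
For what it's worth, a rough Python sketch of how topN acts as a per-segment cap: each generate round picks at most topN of the highest-scoring URLs that are due for fetching. The function name, the scores, and the example.com URLs are invented for illustration; this is not Nutch's generator code.

```python
import heapq


def generate_segment(due_urls, top_n):
    """Pick at most top_n (url, score) pairs, highest score first."""
    return heapq.nlargest(top_n, due_urls, key=lambda pair: pair[1])


# Hypothetical seed entries with made-up scores.
candidates = [
    ("http://example.com/a.pdf", 1.0),
    ("http://example.com/b.pdf", 0.5),
    ("http://example.com/c.pdf", 2.0),
]

segment = generate_segment(candidates, 2)           # capped at 2 URLs
everything = generate_segment(candidates, 1000000)  # cap larger than the list
```

Under this reading, a seed list with no outlinks needs either a topN at least as large as the list, or several generate/fetch rounds until every URL has been selected once.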
-Original Message-
From: Kirby Bohling <kirby.bohl...@gmail.com>
To:
I have Nutch v1.0 set up to use protocol-httpclient. I can authenticate to an
IIS 6.0 web server that requires Integrated Windows authentication (i.e.,
NTLM) if the account I use is *local*, but authentication fails if I use a
*domain* account.
For the purpose of testing I have