Fetcher aborting strangely

2009-08-19 Thread MilleBii
Quite a frustrating problem: after 24 hours of fetching properly, the fetcher spin-waits on a fetch queue of size 6 and then aborts the fetch. I have this problem from time to time and have no idea why. Can anyone help or suggest a way to debug this issue? -- -MilleBii-

Nutch.SIGNATURE_KEY

2009-08-19 Thread Paul Tomblin
Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my page has changed since the last time I crawled it? I patched Nutch to properly handle modification dates, and then discovered that my web site doesn't send Modification-Date because it uses shtml (Server-parsed HTML). I
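
As an illustration of the idea in this question, here is a minimal sketch of digest-based change detection. It assumes the stored signature is a plain MD5 hex digest of the raw page bytes; `content_digest` and `has_changed` are hypothetical helper names, not Nutch APIs, and Nutch's own signature classes may compute the value differently.

```python
import hashlib


def content_digest(body):
    """Return the MD5 hex digest of the raw page bytes (assumed signature form)."""
    return hashlib.md5(body).hexdigest()


def has_changed(previous_digest, body):
    """Report whether the page differs from the last crawl.

    previous_digest is the digest stored at the last crawl, or None
    if the page has never been fetched before.
    """
    if previous_digest is None:
        return True
    return previous_digest != content_digest(body)
```

One caveat with this approach: a server-parsed page that embeds anything dynamic (a timestamp, a visitor counter) would produce a new digest on every fetch even when the meaningful content is unchanged.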

Re: Nutch.SIGNATURE_KEY

2009-08-19 Thread Ken Krugler
Hi Paul, On Aug 19, 2009, at 6:08am, Paul Tomblin wrote: Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my page has changed since the last time I crawled it? I patched Nutch to properly handle modification dates, and then discovered that my web site doesn't send

Re: Nutch.SIGNATURE_KEY

2009-08-19 Thread Paul Tomblin
On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Another question: is Nutch smart enough to use that signature to determine that, say, http://xcski.com/ and http://xcski.com/index.html are the same page? I believe the hashes would be the same for either raw MD5
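
The point about identical hashes can be sketched as follows: a digest computed only from the content bytes does not depend on the URL, so two URLs that serve the same bytes land in the same group. This is a simplified illustration assuming a raw MD5 over the body; `group_by_digest` is a hypothetical helper, not part of Nutch, and it says nothing about how Nutch itself acts on matching signatures.

```python
import hashlib
from collections import defaultdict


def group_by_digest(pages):
    """Group URLs whose bodies hash to the same MD5 digest.

    pages maps URL -> raw body bytes. Only groups with more than
    one URL (likely duplicates) are returned, each sorted by URL.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[hashlib.md5(body).hexdigest()].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```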

topN value in crawl

2009-08-19 Thread alxsss
Hi, I have read a few tutorials on running Nutch to crawl the web. However, I still do not understand the meaning of the topN variable in the crawl command. In tutorials it is suggested to create 3 segments and fetch them with topN=1000. What if I create 100 segments, or only one? What would be
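
For readers with the same question: topN caps how many URLs are selected for each generated segment, taking the highest-scoring candidates first. The sketch below is a deliberately simplified model of that selection, not Nutch's generator (which also accounts for fetch schedules and other limits); the `crawl_db` layout and the `generate_segment` name are assumptions made for illustration.

```python
import heapq


def generate_segment(crawl_db, top_n):
    """Pick up to top_n unfetched URLs with the highest scores.

    crawl_db maps URL -> (score, fetched_flag). This layout is a
    hypothetical stand-in for a real crawl database.
    """
    unfetched = [
        (score, url)
        for url, (score, fetched) in crawl_db.items()
        if not fetched
    ]
    # nlargest returns the candidates in descending score order.
    return [url for _score, url in heapq.nlargest(top_n, unfetched)]
```

Under this model, a topN larger than the number of unfetched URLs simply selects all of them, and a small topN spread over many segments visits the same set in smaller batches.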

Re: Fetcher aborting strangely

2009-08-19 Thread MilleBii
Well, in the segment there is nothing but _temporary... and then, after spin-waiting a number of times on the last 6 elements, it aborts and the segment is empty... I guess everything is left in the tmp files. If only it would abort gracefully and close the segment properly... basically lost 200k URLs.

Re: topN value in crawl

2009-08-19 Thread alxsss
Thanks. What if the URLs in my seed file do not have outlinks, let's say .pdf files? Should I still specify the topN variable? All I need is to index all the URLs in my seed file, and there are about 1 M of them. Alex. -Original Message- From: Kirby Bohling kirby.bohl...@gmail.com To:

protocol-httpclient, NTLM, and Domain Controller authentication

2009-08-19 Thread Mike Hays
I have Nutch v1.0 set up to use protocol-httpclient. I can authenticate to an IIS 6.0 web server that requires Integrated Windows authentication (i.e., NTLM) if the account I use is *local*, but authentication fails if I use a *domain* account. For the purpose of testing I have