Re: Nutch.SIGNATURE_KEY

2009-08-22 Thread Andrzej Bialecki
Paul Tomblin wrote: On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote: Another question: is Nutch smart enough to use that signature to determine that, say, http://xcski.com/ and http://xcski.com/index.html are the same page? I believe the hashes would be the same

Nutch.SIGNATURE_KEY

2009-08-19 Thread Paul Tomblin
Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my page has changed since the last time I crawled it? I patched Nutch to properly handle modification dates, and then discovered that my web site doesn't send Modification-Date because it uses shtml (Server-parsed HTML). I

Re: Nutch.SIGNATURE_KEY

2009-08-19 Thread Ken Krugler
Hi Paul, On Aug 19, 2009, at 6:08am, Paul Tomblin wrote: Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my page has changed since the last time I crawled it? I patched Nutch to properly handle modification dates, and then discovered that my web site doesn't send

Re: Nutch.SIGNATURE_KEY

2009-08-19 Thread Paul Tomblin
On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote: Another question: is Nutch smart enough to use that signature to determine that, say, http://xcski.com/ and http://xcski.com/index.html are the same page? I believe the hashes would be the same for either raw MD5
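
The idea discussed in this thread, using a content digest in place of a modification date, can be sketched roughly as below. This is only an illustration under stated assumptions, not Nutch's own code: the class DigestCheck and the methods md5Hex and hasChanged are hypothetical names, the digest is assumed to be a plain MD5 over the raw page bytes, and only the standard java.security.MessageDigest API is used. The sample page bytes are made up.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: compares MD5 digests of raw content.
final class DigestCheck {

    // Returns the MD5 digest of the given bytes as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // A page counts as changed if there is no stored digest or the digests differ.
    static boolean hasChanged(String previousDigest, byte[] currentContent) {
        return previousDigest == null || !previousDigest.equals(md5Hex(currentContent));
    }

    public static void main(String[] args) {
        // Made-up content standing in for two fetches of the same page.
        byte[] firstFetch = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        byte[] secondFetch = "<html>hello, again</html>".getBytes(StandardCharsets.UTF_8);

        String stored = md5Hex(firstFetch);
        System.out.println("unchanged page: " + hasChanged(stored, firstFetch));
        System.out.println("edited page:    " + hasChanged(stored, secondFetch));
    }
}
```

Under the same assumption, two URLs that return byte-identical content (such as / and /index.html on one site) would produce equal digests, so equal digests can flag likely duplicates. A server-parsed page that embeds anything varying per request, such as a timestamp, would produce a new digest on every fetch even if nothing meaningful changed.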