Paul Tomblin wrote:
Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my
page has changed since the last time I crawled it? I patched Nutch to
properly handle modification dates, and then discovered that my web
site doesn't send Modification-Date because it uses shtml
(Server-parsed HTML). I
Hi Paul,
On Aug 19, 2009, at 6:08am, Paul Tomblin wrote:
Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my
page has changed since the last time I crawled it? I patched Nutch to
properly handle modification dates, and then discovered that my web
site doesn't send
On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
Another question: is Nutch smart enough to use that signature to
determine that, say, http://xcski.com/ and http://xcski.com/index.html
are the same page?
I believe the hashes would be the same for either raw MD5
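
To illustrate the idea being discussed, here is a minimal sketch of digest-based change detection using only the JDK. It is not Nutch's own code: the class name ContentDigest and the helpers md5Hex and hasChanged are made up for this example, and it assumes the signature is a plain MD5 over the raw fetched bytes. Under that assumption, two URLs that return byte-identical content produce the same digest, and a page counts as changed when its new digest differs from the stored one.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of Nutch.
public class ContentDigest {

    // Returns the hex-encoded MD5 digest of the raw page bytes.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // A page is treated as changed when the digest of the newly fetched
    // bytes differs from the digest stored at the previous crawl.
    static boolean hasChanged(String previousDigest, byte[] newContent) {
        return !md5Hex(newContent).equals(previousDigest);
    }

    public static void main(String[] args) {
        // Stand-ins for the bodies fetched from "/" and "/index.html".
        byte[] root = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        byte[] index = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        byte[] edited = "<html>hello again</html>".getBytes(StandardCharsets.UTF_8);

        String stored = md5Hex(root);
        System.out.println("same digest for both URLs: " + stored.equals(md5Hex(index)));
        System.out.println("changed since last crawl:  " + hasChanged(stored, edited));
    }
}
```

One caveat with this approach: a raw-bytes digest changes whenever any byte changes, so a server-parsed page that embeds something like a timestamp would look modified on every fetch.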