[ http://issues.apache.org/jira/browse/NUTCH-114?page=all ]
Andrzej Bialecki resolved NUTCH-114:
-
Resolution: Fixed
Applied with changes. Thanks!
getting number of urls and links from crawldb
-
On 02.12.2005 at 10:15, Andrzej Bialecki wrote:
Yes, this is required to detect unmodified content. A small note:
plain MD5Hash(byte[] content) is quite ineffective for many pages,
e.g. pages with a counter, or with ads. It would be good to provide
a framework for other implementations.
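
To illustrate the point about counters and ads, here is a minimal sketch of what such a pluggable framework might look like. All names here (ContentSignature, RawMd5Signature, NormalizedTextSignature, Md5) are hypothetical and are not taken from the Nutch code base. The normalization rule (drop digit runs, collapse whitespace, lowercase) is only one assumed heuristic among many possible ones.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical interface (not a Nutch API): one pluggable way to decide
// whether two fetches of a page count as "the same content".
interface ContentSignature {
    byte[] calculate(byte[] content);
}

// Baseline: MD5 over the raw bytes. Any changed byte changes the signature,
// so a hit counter or a rotating ad makes the page look modified.
final class RawMd5Signature implements ContentSignature {
    @Override
    public byte[] calculate(byte[] content) {
        return Md5.digest(content);
    }
}

// Illustrative alternative: remove digit runs, collapse whitespace and
// lowercase the text before hashing, so a changing counter value no longer
// alters the signature. Assumes the content is UTF-8 text.
final class NormalizedTextSignature implements ContentSignature {
    @Override
    public byte[] calculate(byte[] content) {
        String text = new String(content, StandardCharsets.UTF_8);
        String normalized = text.replaceAll("[0-9]+", "")
                                .replaceAll("\\s+", " ")
                                .trim()
                                .toLowerCase(Locale.ROOT);
        return Md5.digest(normalized.getBytes(StandardCharsets.UTF_8));
    }
}

// Small helper that wraps the checked exception from MessageDigest.
final class Md5 {
    private Md5() {}

    static byte[] digest(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, two fetches that differ only in a counter value ("Visitors: 41" vs. "Visitors: 42") get different signatures from RawMd5Signature but the same signature from NormalizedTextSignature. The trade-off is that any real change consisting only of digits would also go undetected, which is why the choice of implementation would probably need to be configurable.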
Andrzej Bialecki wrote:
Doug Cutting wrote:
Modify CrawlDatum to store the MD5Hash of the content of fetched urls.
Yes, this is required to detect unmodified content. A small note: plain
MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages
with a counter, or with ads. It