[jira] Resolved: (NUTCH-114) getting number of urls and links from crawldb

2005-12-02 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-114?page=all ] Andrzej Bialecki resolved NUTCH-114. Resolution: Fixed. Applied with changes. Thanks! Issue: getting number of urls and links from crawldb

Re: incremental crawling

2005-12-02 Thread Stefan Groschupf
On 02.12.2005 at 10:15, Andrzej Bialecki wrote:
> Yes, this is required to detect unmodified content. A small note: plain MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages with a counter, or with ads. It would be good to provide a framework for other implementations

Re: incremental crawling

2005-12-02 Thread Doug Cutting
Andrzej Bialecki wrote:
> Doug Cutting wrote:
>> Modify CrawlDatum to store the MD5Hash of the content of fetched urls.
> Yes, this is required to detect unmodified content. A small note: plain MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages with a counter, or with ads. It
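
The point raised in this thread, that a plain MD5 over the raw bytes reports a page as modified whenever a hit counter ticks, can be illustrated with a small sketch. This is not Nutch code and not the design the thread settled on: it is written in Python for brevity, and the names `md5_signature`, `normalized_signature`, and `is_unmodified` are hypothetical. The normalization shown (lowercasing, dropping digits, collapsing whitespace) is only an assumed example of an "other implementation" that could plug into such a framework.

```python
import hashlib
import re
from typing import Callable

# A signature is any function mapping page content to a comparable string.
# (Hypothetical type alias, used only for this illustration.)
Signature = Callable[[bytes], str]


def md5_signature(content: bytes) -> str:
    """Plain MD5 over the raw bytes: any changed byte changes the result."""
    return hashlib.md5(content).hexdigest()


_DIGITS = re.compile(rb"\d+")
_SPACE = re.compile(rb"\s+")


def normalized_signature(content: bytes) -> str:
    """MD5 over a normalized copy of the content.

    Assumed normalization: lowercase, remove digit runs (so counters do not
    matter), and collapse whitespace. A real implementation would likely
    need something smarter to ignore ads.
    """
    text = _DIGITS.sub(b"", content.lower())
    text = _SPACE.sub(b" ", text).strip()
    return hashlib.md5(text).hexdigest()


def is_unmodified(stored: str, content: bytes, signature: Signature) -> bool:
    """Compare the signature stored at the last fetch with a fresh one."""
    return stored == signature(content)


if __name__ == "__main__":
    old = b"<p>Hello</p> Visitors: 1041"
    new = b"<p>Hello</p> Visitors: 1042"
    # Only the counter changed between the two fetches.
    print(is_unmodified(md5_signature(old), new, md5_signature))
    print(is_unmodified(normalized_signature(old), new, normalized_signature))
```

Passing the signature function as a parameter is what makes the check pluggable: the stored value and the comparison stay the same while the hashing strategy can be swapped per site or per content type.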