Ron van der Vegt created NUTCH-2232:
---------------------------------------

             Summary: DeduplicationJob: Url is not decoded before the url length is compared.
                 Key: NUTCH-2232
                 URL: https://issues.apache.org/jira/browse/NUTCH-2232
             Project: Nutch
          Issue Type: Bug
          Components: crawldb
            Reporter: Ron van der Vegt

When certain documents have the same signature, the deduplication job marks one of them as the duplicate. The URLs are stored URL-encoded in the crawldb. When two URLs are compared by length, they are not decoded first, so the comparison uses the encoded length. This can be misleading, because every percent-encoded byte occupies three characters in the stored form (for example, %20 for a space).
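
To illustrate the effect described above, here is a minimal sketch that compares two URLs by raw length and by decoded length. It is not taken from the Nutch code base: the class name `DecodedLengthSketch`, the helpers `decode` and `decodedLength`, and the example URLs are all hypothetical, and UTF-8 is assumed as the encoding. Note that `java.net.URLDecoder` also turns `+` into a space, which may or may not be wanted for URL paths.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical illustration only; not the actual DeduplicationJob code.
class DecodedLengthSketch {

  // Decode a percent-encoded URL, assuming UTF-8.
  // Falls back to the raw string if the input is malformed.
  static String decode(String url) {
    try {
      return URLDecoder.decode(url, "UTF-8");
    } catch (UnsupportedEncodingException | IllegalArgumentException e) {
      return url;
    }
  }

  // Length of the URL after decoding, which is what a reader would see.
  static int decodedLength(String url) {
    return decode(url).length();
  }

  public static void main(String[] args) {
    String a = "http://example.com/caf%C3%A9";  // encoded form of ".../café"
    String b = "http://example.com/cafe-bar";

    // Comparing raw lengths treats "a" as the longer URL...
    System.out.println("raw:     " + a.length() + " vs " + b.length());
    // ...while comparing decoded lengths treats "a" as the shorter one.
    System.out.println("decoded: " + decodedLength(a) + " vs " + decodedLength(b));
  }
}
```

With these two example URLs the ordering flips depending on whether the lengths are taken before or after decoding, which is the kind of misleading result the report refers to.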