Ron van der Vegt created NUTCH-2232:
---------------------------------------

             Summary: DeduplicationJob: Url is not decoded before the url 
length is compared.
                 Key: NUTCH-2232
                 URL: https://issues.apache.org/jira/browse/NUTCH-2232
             Project: Nutch
          Issue Type: Bug
          Components: crawldb
            Reporter: Ron van der Vegt


When certain documents have the same signature, the deduplication job will 
mark one as a duplicate. URLs are stored URL-encoded in the crawldb. When 
two URLs are compared by length, they are not decoded first, so the 
comparison can be based on misleading URL lengths.
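
A minimal sketch of the effect, not the actual DeduplicationJob code: the 
class name, helper method and example URLs below are hypothetical. It assumes 
UTF-8 percent-encoding and uses java.net.URLDecoder, which also turns '+' into 
a space (form decoding), so a real fix might need a different decoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical illustration class; not part of Nutch.
class UrlLengthSketch {

  // Returns the length of the URL after percent-decoding.
  // Falls back to the raw length if the URL cannot be decoded.
  static int decodedLength(String url) {
    try {
      return URLDecoder.decode(url, "UTF-8").length();
    } catch (UnsupportedEncodingException | IllegalArgumentException e) {
      return url.length();
    }
  }

  public static void main(String[] args) {
    String a = "http://example.com/caf%C3%A9"; // encoded form of ".../café"
    String b = "http://example.com/cafe-bar";

    // Encoded lengths: a looks longer than b.
    System.out.println(a.length() + " vs " + b.length());           // 28 vs 27
    // Decoded lengths: a is actually shorter than b.
    System.out.println(decodedLength(a) + " vs " + decodedLength(b)); // 23 vs 27
  }
}
```

Here the order of the two URLs by length flips once they are decoded, which 
is the kind of misleading comparison described above.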



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
