[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature

2010-07-02 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884624#action_12884624
 ] 

Julien Nioche commented on NUTCH-835:
-

This patch is marked for 1.2 but has been committed only to trunk (2.0).
Shall we also apply it to /nutch/branches/branch-1.2?

 document deduplication (exact duplicates) failed using MD5Signature
 ---

              Key: NUTCH-835
              URL: https://issues.apache.org/jira/browse/NUTCH-835
          Project: Nutch
       Issue Type: Bug
 Affects Versions: 1.0.0, 1.1
      Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
         Reporter: Sebastian Nagel
         Assignee: Andrzej Bialecki
          Fix For: 1.2, 2.0


 The MD5Signature class calculates different signatures for identical
 documents. The reason is that
   byte[] data = content.getContent();
   ... StringBuilder().append(data) ...
 falls back to java.lang.Object.toString() to get a string representation
 of the (binary) content. For an array, that string is built from the
 object's identity hash code (e.g., [...@30dc9065), not from its contents,
 so it generally differs even for two byte arrays with identical content.
 A solution would be to take the MD5 sum of the binary content as the first
 part of the final signature calculation (the parsed content is the second
 part):
   ... .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
 Of course, there are many other solutions...
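
 The behaviour described above can be reproduced outside Nutch. The following
 is only a minimal sketch, not the committed patch: the class and method names
 (SignatureSketch, buggyInput, fixedInput) are hypothetical, and it uses the
 JDK's java.security.MessageDigest in place of the StringUtil and MD5Hash
 helpers named in the report, on the assumption that a hex-encoded MD5 of the
 raw bytes is what those helpers produce.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SignatureSketch {

    // Buggy variant: StringBuilder has no append(byte[]) overload, so this
    // resolves to append(Object) and records the array's identity-based
    // toString() value instead of its bytes.
    public static String buggyInput(byte[] data, String text) {
        return new StringBuilder().append(data).append(text).toString();
    }

    // Fixed variant (hypothetical stand-in for the proposal in the report):
    // hex-encoded MD5 of the raw bytes first, parsed text second.
    public static String fixedInput(byte[] data, String text) {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md5.digest(data)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.append(text).toString();
    }

    public static void main(String[] args) {
        byte[] a = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same content".getBytes(StandardCharsets.UTF_8);

        // Depends on object identity, not on the bytes in the arrays.
        System.out.println(buggyInput(a, "text").equals(buggyInput(b, "text")));

        // Prints true: identical bytes always give the same digest.
        System.out.println(fixedInput(a, "text").equals(fixedInput(b, "text")));
    }
}
```

 In practice the first comparison is expected to fail for two distinct arrays,
 which is the symptom reported here, while the second holds for any two arrays
 with the same bytes.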

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature

2010-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884540#action_12884540
 ] 

Hudson commented on NUTCH-835:
--

Integrated in Nutch-trunk #1195
(See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1195/])
NUTCH-835 Document deduplication failed using MD5Signature
(Sebastian Nagel via ab)

