[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-04-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633946#comment-13633946
 ] 

Hudson commented on MAPREDUCE-5065:
---

Integrated in Hadoop-Yarn-trunk #186 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/186/])
MAPREDUCE-5065. DistCp should skip checksum comparisons if block-sizes are 
different on source/target. Contributed by Mithun Radhakrishnan. (Revision 
1468629)

 Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1468629
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java


 DistCp should skip checksum comparisons if block-sizes are different on 
 source/target.
 --

 Key: MAPREDUCE-5065
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5065
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: distcp
Affects Versions: 2.0.3-alpha, 0.23.5
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Fix For: 3.0.0, 2.0.5-beta, 0.23.8

 Attachments: MAPREDUCE-5065.branch-0.23.patch, 
 MAPREDUCE-5065.branch-2.patch


 When copying files between 2 clusters with different default block-sizes, one 
 sees that the copy fails with a checksum-mismatch, even though the files have 
 identical contents.
 The reason is that on HDFS, a file's checksum is unfortunately a function of 
 the block-size of the file. So 2 files with identical contents (but different 
 block-sizes) can have different checksums. 
 (Thus, it's also possible for DistCp to fail to copy files on the same 
 file-system, if the source-file's block-size differs from HDFS default, and 
 -pb isn't used.)
 I propose that we skip checksum comparisons under the following conditions:
 1. -skipCrc is specified.
 2. File-size is 0 (in which case the call to the checksum-servlet is moot).
 3. source.getBlockSize() != target.getBlockSize(), since the checksums are 
 guaranteed to differ in this case.
 I have a patch for #3.
 Edit: I've modified the fix to warn the user (instead of skipping the 
 checksum-check). Skipping parity-checks is unsafe. The code now fails the 
 copy, and suggests that the user either use -pb to preserve block-size, or 
 consider -skipCrc (and forgo copy validation entirely).
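
 To illustrate why block-size leaks into the checksum, here is a simplified 
 sketch. It assumes a composite checksum in the MD5-of-MD5-of-CRC32 style: a 
 CRC32 per chunk, an MD5 over each block's CRCs, and an MD5 over the block 
 digests. The class name, method name and parameters are hypothetical, and 
 this is not HDFS's actual implementation or wire format; it only shows that 
 regrouping the same bytes into different blocks changes the result.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.CRC32;

public class BlockChecksumSketch {

    // Hypothetical helper: a simplified composite checksum, not the HDFS code.
    static byte[] fileChecksum(byte[] data, int blockSize, int bytesPerCrc) {
        try {
            MessageDigest fileDigest = MessageDigest.getInstance("MD5");
            for (int blockStart = 0; blockStart < data.length; blockStart += blockSize) {
                int blockEnd = Math.min(blockStart + blockSize, data.length);
                MessageDigest blockDigest = MessageDigest.getInstance("MD5");
                // One CRC32 per chunk; chunks never span a block boundary.
                for (int off = blockStart; off < blockEnd; off += bytesPerCrc) {
                    int len = Math.min(bytesPerCrc, blockEnd - off);
                    CRC32 crc = new CRC32();
                    crc.update(data, off, len);
                    blockDigest.update(
                            ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
                }
                // The file-level digest sees one MD5 per block, so the number
                // and content of its inputs depend on the block size.
                fileDigest.update(blockDigest.digest());
            }
            return fileDigest.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31);
        }
        byte[] small = fileChecksum(data, 1024, 512); // 4 blocks
        byte[] large = fileChecksum(data, 2048, 512); // 2 blocks
        System.out.println("identical bytes, equal checksums? "
                + Arrays.equals(small, large));
    }
}
```

 In this sketch the two results are equal only when the data is grouped into 
 the same blocks, for example when the whole file fits in a single block under 
 both block sizes.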

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-04-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633987#comment-13633987
 ] 

Hudson commented on MAPREDUCE-5065:
---

Integrated in Hadoop-Hdfs-0.23-Build #584 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/584/])
MAPREDUCE-5065. DistCp should skip checksum comparisons if block-sizes are 
different on source/target. Contributed by Mithun Radhakrishnan. (Revision 
1468636)

 Result = UNSTABLE
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1468636
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
* 
/hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* 
/hadoop/common/branches/branch-0.23/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java




[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-04-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633999#comment-13633999
 ] 

Hudson commented on MAPREDUCE-5065:
---

Integrated in Hadoop-Hdfs-trunk #1375 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1375/])
MAPREDUCE-5065. DistCp should skip checksum comparisons if block-sizes are 
different on source/target. Contributed by Mithun Radhakrishnan. (Revision 
1468629)

 Result = FAILURE
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1468629
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java




[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-04-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634055#comment-13634055
 ] 

Hudson commented on MAPREDUCE-5065:
---

Integrated in Hadoop-Mapreduce-trunk #1402 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1402/])
MAPREDUCE-5065. DistCp should skip checksum comparisons if block-sizes are 
different on source/target. Contributed by Mithun Radhakrishnan. (Revision 
1468629)

 Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1468629
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java




[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-04-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633462#comment-13633462
 ] 

Hudson commented on MAPREDUCE-5065:
---

Integrated in Hadoop-trunk-Commit #3618 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3618/])
MAPREDUCE-5065. DistCp should skip checksum comparisons if block-sizes are 
different on source/target. Contributed by Mithun Radhakrishnan. (Revision 
1468629)

 Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1468629
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* 
/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java




[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-04-16 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633470#comment-13633470
 ] 

Kihwal Lee commented on MAPREDUCE-5065:
---

I've committed this to trunk, branch-2 and branch-0.23. 



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-04-09 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626624#comment-13626624
 ] 

Kihwal Lee commented on MAPREDUCE-5065:
---

The patch looks good to me. [~cutting], are you okay with the change?



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-04-09 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627076#comment-13627076
 ] 

Doug Cutting commented on MAPREDUCE-5065:
-

+1 This looks great to me.  Thanks!



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-04-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626263#comment-13626263
 ] 

Hadoop QA commented on MAPREDUCE-5065:
--

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12577712/MAPREDUCE-5065.branch-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-distcp.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3511//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3511//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-18 Thread Dave Thompson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605205#comment-13605205
 ] 

Dave Thompson commented on MAPREDUCE-5065:
--

Reviewed latest patch.  Looks good.  +1



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-18 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605626#comment-13605626
 ] 

Kihwal Lee commented on MAPREDUCE-5065:
---

Review comments:
* Add a reasonable timeout to the test case. This is a relatively new rule. It 
applies even when you are modifying existing test cases. Please take into 
account that tests may run on slower hardware.
* If we suggest -skipCrc along with -pb, we should probably inform users of the 
risk of skipping validation.
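
The second review point could be addressed along the following lines. This is 
a minimal sketch and not the committed patch: the class name, method names and 
message wording are hypothetical, and only the -pb and -skipCrc options come 
from this issue. It assumes the comparison is skipped when -skipCrc is given or 
the file is empty, and that a mismatch message gains a hint (including the risk 
of skipping validation) when the block-sizes differ.

```java
public class ChecksumFailureHintSketch {

    /** Comparison is skipped only when the user opted out or the file is empty. */
    static boolean shouldCompareChecksums(boolean skipCrc, long fileLength) {
        return !skipCrc && fileLength > 0;
    }

    /**
     * Builds the failure text for a checksum mismatch. When the block-sizes
     * differ, it appends the two suggested remedies and states the risk of
     * the second one.
     */
    static String mismatchMessage(String source, String target,
                                  long sourceBlockSize, long targetBlockSize) {
        StringBuilder msg = new StringBuilder("Checksum mismatch between ")
                .append(source).append(" and ").append(target).append('.');
        if (sourceBlockSize != targetBlockSize) {
            msg.append(" Source and target differ in block-size;")
               .append(" rerun with -pb to preserve block-sizes,")
               .append(" or use -skipCrc to skip the comparison.")
               .append(" Note: -skipCrc disables copy validation")
               .append(" and may mask data corruption.");
        }
        return msg.toString();
    }

    public static void main(String[] args) {
        // Empty file: nothing to compare.
        System.out.println(shouldCompareChecksums(false, 0L));
        // 64 MB source blocks vs. 128 MB target blocks: the hint is appended.
        System.out.println(mismatchMessage(
                "hdfs://a/f", "hdfs://b/f", 64L << 20, 128L << 20));
    }
}
```

Keeping the risk note in the same message as the -skipCrc suggestion means a 
user who sees the failure also sees what they give up by taking that route.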



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604815#comment-13604815
 ] 

Hadoop QA commented on MAPREDUCE-5065:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12574096/MAPREDUCE-5065.branch-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 one of tests included doesn't have a timeout.{color}

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-distcp.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3424//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3424//console

This message is automatically generated.

 DistCp should skip checksum comparisons if block-sizes are different on 
 source/target.
 --

 Key: MAPREDUCE-5065
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5065
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: distcp
Affects Versions: 2.0.3-alpha, 0.23.5
Reporter: Mithun Radhakrishnan
Assignee: Mithun Radhakrishnan
 Attachments: MAPREDUCE-5065.branch-0.23.patch, 
 MAPREDUCE-5065.branch-2.patch


 When copying files between 2 clusters with different default block-sizes, one 
 sees that the copy fails with a checksum-mismatch, even though the files have 
 identical contents.
 The reason is that on HDFS, a file's checksum is unfortunately a function of 
 the block-size of the file. So you could have 2 different files with 
 identical contents (but different block-sizes) have different checksums. 
 (Thus, it's also possible for DistCp to fail to copy files on the same 
 file-system, if the source-file's block-size differs from HDFS default, and 
 -pb isn't used.)
 I propose that we skip checksum comparisons under the following conditions:
 1. -skipCrc is specified.
 2. File-size is 0 (in which case the call to the checksum-servlet is moot).
 3. source.getBlockSize() != target.getBlockSize(), since the checksums are 
 guaranteed to differ in this case.
 I have a patch for #3.
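
The three conditions above could be sketched roughly as follows. This is only an illustration in plain Java, not the code from the attached patches: the class name ChecksumSkipSketch, the method shouldSkipChecksum and its parameters are hypothetical, and a real check would presumably read the length and block size from the source and target file status objects.

```java
/** Hypothetical helper for illustration; not taken from the attached patches. */
final class ChecksumSkipSketch {

    /**
     * Returns true when comparing source and target checksums should be skipped.
     *
     * @param skipCrc         whether -skipCrc was given on the command line
     * @param fileLength      length of the source file in bytes
     * @param sourceBlockSize block size of the source file in bytes
     * @param targetBlockSize block size of the target file in bytes
     */
    public static boolean shouldSkipChecksum(boolean skipCrc, long fileLength,
                                             long sourceBlockSize, long targetBlockSize) {
        if (skipCrc) {
            return true;                  // condition 1: the user asked to skip
        }
        if (fileLength == 0) {
            return true;                  // condition 2: nothing to checksum
        }
        // condition 3: with the current HDFS file checksum, differing block
        // sizes make the two checksums differ even for identical contents
        return sourceBlockSize != targetBlockSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Same contents copied from a 64 MB-block file to a 128 MB-block file.
        System.out.println(shouldSkipChecksum(false, 10 * mb, 64 * mb, 128 * mb));
        // Same block size on both sides: the comparison is still performed.
        System.out.println(shouldSkipChecksum(false, 10 * mb, 128 * mb, 128 * mb));
    }
}
```

With this shape, the check is skipped only for the files that would fail anyway, and every other file is still verified.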

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-15 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603418#comment-13603418
 ] 

Mithun Radhakrishnan commented on MAPREDUCE-5065:
-

I'm with you on the need for a blocksize-independent checksum. I wasn't 
convinced that combining CRC32-checksums together to form a higher-level 
checksum could be correct. (Thanks for the explanation.)

{quote}
instruct her to run with -pb, not -skipCrc.
{quote}

Yep, that should take care of #2 (above), but not #1. The user will still need 
to fail first and rerun, because she's unlikely to know that some of her 
source-files might have non-default block-sizes. Unless the checksum 
calculation is fixed (or -pb is default), I don't think DistCp should enforce a 
check that's a guaranteed failure, under unforeseeable circumstances.



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603437#comment-13603437
 ] 

Hadoop QA commented on MAPREDUCE-5065:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
http://issues.apache.org/jira/secure/attachment/12573882/MAPREDUCE-5065.branch-2.patch
against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-tools/hadoop-distcp.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3419//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3419//console

This message is automatically generated.



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-15 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603493#comment-13603493
 ] 

Kihwal Lee commented on MAPREDUCE-5065:
---

bq. Another option might be to implement a checksum that's 
blocksize-independent...

Reading the whole metadata may be too much, especially for huge files. It will be 
better if we make the computation happen where the data is. :)
 
Most hashing is incremental, so if DFSClient feeds the last hash state into 
the next datanode and lets it continue updating it, the result will be 
independent of block size. The current way of doing file checksum allows 
calculating individual block checksums in parallel, but we are not taking 
advantage of it in DFSClient anyway. So I don't think there won't be any 
significant changes in performance or overhead.

We should probably continue this discussion in a separate jira.
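
As a rough illustration of the incremental-hashing point, the sketch below feeds the same bytes into one running CRC32 in pieces of different sizes. It is plain Java rather than HDFS code: the class IncrementalChecksumSketch and the method chainedCrc are hypothetical names, and a single java.util.zip.CRC32 object stands in for the hash state that DFSClient would have to hand from one datanode to the next.

```java
import java.util.zip.CRC32;

/** Hypothetical illustration; a single CRC32 object models the handed-over hash state. */
final class IncrementalChecksumSketch {

    /** Feeds data into one running CRC32, one "block" of blockSize bytes at a time. */
    public static long chainedCrc(byte[] data, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        CRC32 crc = new CRC32();   // running state shared by all blocks
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            crc.update(data, off, len);   // continue from the previous block's state
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = new byte[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31 + 7);
        }
        // The block boundaries differ, but the final value does not.
        System.out.println(chainedCrc(data, 1024) == chainedCrc(data, 4096));
    }
}
```

The cost of chaining is that blocks have to be processed in file order, which is the parallelism trade-off mentioned above.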



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-15 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603494#comment-13603494
 ] 

Kihwal Lee commented on MAPREDUCE-5065:
---

bq. So I don't think there won't be any significant changes in performance or 
overhead.
Sorry, unintended double negation.



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-15 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603502#comment-13603502
 ] 

Kihwal Lee commented on MAPREDUCE-5065:
---

Filed HDFS-4605 for block-size independent FileChecksum in HDFS.



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-14 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602538#comment-13602538
 ] 

Doug Cutting commented on MAPREDUCE-5065:
-

This seems like it could give false comfort.  Rather, it would be safer to 
advise people, when they attempt to copy files with different block sizes, 
to specify either -pb or -skipCrc.  So better documentation, warnings and error 
messages might suffice.  Then the results of a distcp could still be trusted 
unless you've explicitly specified -skipCrc.



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-14 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602647#comment-13602647
 ] 

Mithun Radhakrishnan commented on MAPREDUCE-5065:
-

Hello, Doug. Thank you for looking at this.

(For the moment, let's ignore that while DistCp code in 2.0/ does honour 
-skipCrc, 0.23/ code does not. I'll update the 0.23 patch to bring both of 
these to parity.)

IMO, it will not suffice to only document this in docs/code/warning-messages:

1. The user isn't likely to realize that the default block-sizes differ between 
source and target. She is even less likely to perceive the difference if the 
block-sizes on the source-files were explicitly set to a non-default value. 
(And that's entirely possible with FileSystem.create().)
The most likely manner in which she'd notice is when DistCp fails on 
checksum-diff, at which point the warning would instruct her to -skipCrc on the 
rerun.

2. Using -skipCrc will disable checksum-checks on all files copied. It's 
preferable to apply checks where we can, and skip only where block-sizes differ 
(because that's a guaranteed failure).

One alternative is to make -pb/-skipCrc default, but that's undesirable as well.



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-14 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602752#comment-13602752
 ] 

Doug Cutting commented on MAPREDUCE-5065:
-

I think we should instead probably instruct her to run with -pb, not -skipCrc.

Another option might be to implement a checksum that's blocksize-independent, 
for when block sizes are different.  Currently the file checksum works by 
taking the CRC32 for every 512-byte chunk of the block, combining these with 
MD5 into a single checksum for the block, then combining these with MD5 into a 
single checksum for the file.  The first combination is done at the Datanode 
(in DataXceiver#blockChecksum) and the second at the client (in 
DFSClient#getFileChecksum).  If instead the client could directly retrieve the 
list of CRC32s from the datanode then it could combine them into a 
blocksize-independent checksum (so long as blockSize is a multiple of 
bytesPerChecksum and bytesPerChecksum is the same between the filesystems, 
which is ordinarily the case).  Op.java already includes a READ_METADATA 
operation, presumably intended to return the CRC32s to the client, but it is 
not implemented.  We'd probably want to extend the getFileChecksum API to 
permit specifying the type of checksum requested, whether MD5MD5CRC32 or 
MD5CRC32.  This would be a significant effort and it touches core bits of HDFS 
so should not be approached lightly.
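
To make the block-size dependence concrete, here is a simplified model of the two schemes in plain Java. It is a sketch under stated assumptions, not the HDFS implementation: FileChecksumSketch, md5Md5Crc32 and md5Crc32 are hypothetical names, everything runs in one process instead of being split between datanode and client, and the exact bytes fed to MD5 do not match the real wire format. It assumes 512 bytes per checksum and a block size that is a multiple of that.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Simplified, hypothetical model of the two checksum schemes; not HDFS code. */
final class FileChecksumSketch {
    static final int BYTES_PER_CHECKSUM = 512;

    private static MessageDigest md5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Feeds the CRC32 of each 512-byte chunk in [start, end) into the digest. */
    private static void addChunkCrcs(MessageDigest digest, byte[] data, int start, int end) {
        for (int off = start; off < end; off += BYTES_PER_CHECKSUM) {
            CRC32 crc = new CRC32();
            crc.update(data, off, Math.min(BYTES_PER_CHECKSUM, end - off));
            digest.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
        }
    }

    /** MD5 over per-block MD5s of chunk CRC32s: the result depends on blockSize. */
    public static byte[] md5Md5Crc32(byte[] data, int blockSize) {
        MessageDigest fileDigest = md5();
        for (int start = 0; start < data.length; start += blockSize) {
            MessageDigest blockDigest = md5();
            addChunkCrcs(blockDigest, data, start, Math.min(start + blockSize, data.length));
            fileDigest.update(blockDigest.digest());   // one MD5 per block
        }
        return fileDigest.digest();
    }

    /** MD5 over all chunk CRC32s in file order: block boundaries drop out. */
    public static byte[] md5Crc32(byte[] data, int blockSize) {
        MessageDigest fileDigest = md5();
        for (int start = 0; start < data.length; start += blockSize) {
            addChunkCrcs(fileDigest, data, start, Math.min(start + blockSize, data.length));
        }
        return fileDigest.digest();
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 17 + 3);
        }
        System.out.println("MD5MD5CRC32 equal across block sizes: "
            + Arrays.equals(md5Md5Crc32(data, 1024), md5Md5Crc32(data, 2048)));
        System.out.println("MD5CRC32 equal across block sizes: "
            + Arrays.equals(md5Crc32(data, 1024), md5Crc32(data, 2048)));
    }
}
```

In this model the nested scheme hashes a different number of per-block digests for each block size, while the flat scheme always sees the same sequence of chunk CRC32s.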



[jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.

2013-03-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601937#comment-13601937
 ] 

Hadoop QA commented on MAPREDUCE-5065:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
http://issues.apache.org/jira/secure/attachment/12573645/MAPREDUCE-5065.branch23.patch
against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3412//console

This message is automatically generated.
