Thank you, Elliot!

On 4/28/16 03:40, Elliot West wrote:
I've raised this as an issue:

https://issues.apache.org/jira/browse/HDFS-10338

On Wednesday, 27 April 2016, Elliot West wrote:

    Hello,

    We are using DistCp V2 to replicate data between two HDFS file
    systems. We were working on the assumption that we could rely on CRC
    checks to ensure that the data was replicated correctly. However,
    after examining the DistCp source code it seems that there are edge
    cases where the CRCs could differ and yet the copy succeeds even
    when we are not skipping CRC checks.

    I'm wondering whether this is by design and if so, the reasoning
    behind it? If this is a bug, I'd like to raise an issue to fix it.
    If it is by design, I'd like to propose the introduction of an option
    for stricter CRC checks.

    The code in question is contained in the method:

        org.apache.hadoop.tools.util.DistCpUtils#checksumsAreEqual(...)

    which can be seen here:

        https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java#L457


    Specifically this code block suggests that if there is a failure
    when trying to read the source or target checksum then the method
    will return 'true', implying that the check succeeded. In actual
    fact we just failed to obtain the checksum and could perform no check.

        try {
          sourceChecksum = sourceChecksum != null ? sourceChecksum : sourceFS
              .getFileChecksum(source);
          targetChecksum = targetFS.getFileChecksum(target);
        } catch (IOException e) {
          LOG.error("Unable to retrieve checksum for " + source + " or " + target, e);
        }
        return (sourceChecksum == null || targetChecksum == null ||
                sourceChecksum.equals(targetChecksum));

    Ideally I'd like to be able to configure a check where we require
    that both the source and target CRCs are retrieved and compared, and
    if for any reason either of the CRC retrievals fails then an
    exception is thrown. I do appreciate that some FileSystems cannot
    return CRCs but these could still be handled correctly as they would
    simply return null and not throw an exception (I assume).
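
    A minimal sketch of what such a strict check might look like. It is
    not DistCp code: the class StrictChecksumCheck, the ChecksumSupplier
    interface and the method checksumsAreEqualStrict are hypothetical
    names, and ChecksumSupplier only stands in for a call such as
    FileSystem#getFileChecksum so that the example runs without Hadoop
    on the classpath. A retrieval failure propagates as an IOException,
    while a null checksum is still treated as "no check possible".

```java
import java.io.IOException;

// Hypothetical illustration only; none of these names exist in Hadoop.
final class StrictChecksumCheck {

    /** Stand-in for a checksum lookup such as FileSystem#getFileChecksum(Path). */
    @FunctionalInterface
    interface ChecksumSupplier {
        /** Returns the checksum, or null if the file system cannot provide one. */
        Object get() throws IOException;
    }

    /**
     * Strict comparison: an IOException raised while retrieving either
     * checksum is not swallowed, so the caller sees the failure instead
     * of a silent "true".
     */
    static boolean checksumsAreEqualStrict(ChecksumSupplier source,
                                           ChecksumSupplier target)
            throws IOException {
        Object sourceChecksum = source.get(); // failure propagates to the caller
        Object targetChecksum = target.get(); // failure propagates to the caller

        if (sourceChecksum == null || targetChecksum == null) {
            // Assumption: null means "checksums not supported", so there is
            // nothing to compare and the copy is accepted, as it is today.
            return true;
        }
        return sourceChecksum.equals(targetChecksum);
    }
}
```

    With this shape, a caller would have to decide explicitly how to
    treat the IOException (fail the copy, retry, and so on) rather than
    having the failure reported as a successful comparison.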

    I'd appreciate any thoughts on this matter.

    Elliot.



