[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-04-16  Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated HDFS-13056:
-
Fix Version/s: 3.1.1

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Assignee: Dennis Huo
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.003.patch, 
> HDFS-13056-branch-2.8.004.patch, HDFS-13056-branch-2.8.005.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, HDFS-13056.004.patch, 
> HDFS-13056.005.patch, HDFS-13056.006.patch, HDFS-13056.007.patch, 
> HDFS-13056.008.patch, HDFS-13056.009.patch, HDFS-13056.010.patch, 
> HDFS-13056.011.patch, HDFS-13056.012.patch, HDFS-13056.013.patch, 
> HDFS-13056.014.patch, Reference_only_zhen_PPOC_hadoop2.6.X.diff, 
> hdfs-file-composite-crc32-v1.pdf, hdfs-file-composite-crc32-v2.pdf, 
> hdfs-file-composite-crc32-v3.pdf
>
>
> FileChecksum was first introduced in 
> [https://issues-test.apache.org/jira/browse/HADOOP-3981] and has remained 
> defined as MD5-of-MD5-of-CRC ever since: per-512-byte chunk CRCs are 
> already stored as part of datanode metadata, and the MD5 approach is used 
> to compute an aggregate value in a distributed manner, with individual 
> datanodes computing the per-block MD5-of-CRCs in parallel and the HDFS 
> client computing the second-level MD5.
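>  
> As a rough single-process illustration of that shape (not HDFS's actual 
> byte layout, and using java.util.zip.CRC32 for brevity where HDFS defaults 
> to CRC32C; the class and constant names below are made up for the sketch):
> {code:java}
> import java.security.MessageDigest;
> import java.util.zip.CRC32;
> 
> // Sketch of the MD5-of-MD5-of-CRC shape. HDFS distributes this work:
> // each datanode computes its blocks' MD5-of-CRCs from the chunk CRCs it
> // already stores; the client computes the top-level MD5 over those.
> public class Md5Md5CrcSketch {
>   static final int BYTES_PER_CRC = 512;             // chunk size
>   static final int BLOCK_SIZE = 4 * BYTES_PER_CRC;  // toy block size
> 
>   static byte[] fileChecksum(byte[] data) throws Exception {
>     MessageDigest topLevel = MessageDigest.getInstance("MD5");
>     for (int b = 0; b < data.length; b += BLOCK_SIZE) {
>       MessageDigest perBlock = MessageDigest.getInstance("MD5");
>       int blockEnd = Math.min(b + BLOCK_SIZE, data.length);
>       for (int c = b; c < blockEnd; c += BYTES_PER_CRC) {
>         CRC32 crc = new CRC32();  // per-chunk CRC, kept in block metadata
>         crc.update(data, c, Math.min(BYTES_PER_CRC, blockEnd - c));
>         int v = (int) crc.getValue();
>         perBlock.update(new byte[] {
>             (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v});
>       }
>       topLevel.update(perBlock.digest());  // client-side second-level MD5
>     }
>     return topLevel.digest();
>   }
> }
> {code}
> Note that the result changes whenever BLOCK_SIZE (or BYTES_PER_CRC) 
> changes, even for identical file contents, which is exactly the 
> comparability problem described next.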
>  
> An often-cited shortcoming of this approach is that the FileChecksum is 
> sensitive to the internal block-size and chunk-size configuration, so HDFS 
> files written with different block/chunk settings cannot be compared. More 
> commonly, different HDFS clusters may use different block sizes, in which 
> case a data migration cannot use the FileChecksum for distcp's rsync 
> functionality or for verifying end-to-end data integrity (on top of the 
> low-level integrity checks applied at data-transfer time).
>  
> This was revisited in https://issues.apache.org/jira/browse/HDFS-8430 
> during the addition of checksum support for striped erasure-coded files; 
> while there was some discussion of using CRC composability, it ultimately 
> settled on the hierarchical MD5 approach, which adds the further problem 
> that checksums of ordinary replicated files are not comparable to those of 
> striped files.
>  
> This feature proposes to add a "COMPOSITE-CRC" FileChecksum type which 
> uses CRC composition to remain completely chunk/block agnostic and allows 
> comparison between striped and replicated files, between different HDFS 
> instances, and possibly even between HDFS and other external storage 
> systems. The feature can be added in place, compatible with existing block 
> metadata, and doesn't need to change the normal path of chunk 
> verification, so it is minimally invasive. This also means even large 
> preexisting HDFS deployments could adopt the feature to retroactively 
> sync data. A detailed design document can be found here: 
> https://storage.googleapis.com/dennishuo/hdfs-file-composite-crc32-v1.pdf
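>  
> The primitive that makes this possible is the standard GF(2) identity that 
> the CRC of a concatenation A||B is computable from CRC(A), CRC(B), and 
> length(B) alone, with no re-reading of data. As a rough illustration, the 
> sketch below transliterates zlib's well-known crc32_combine() into Java 
> (illustrative only; the patch ships its own composition utilities and also 
> covers CRC32C):
> {code:java}
> import java.util.zip.CRC32;
> 
> public final class CrcCombineSketch {
>   // Multiply a 32-bit vector by a 32x32 GF(2) matrix (one long per column).
>   private static long times(long[] mat, long vec) {
>     long sum = 0;
>     for (int i = 0; vec != 0; vec >>>= 1, i++) {
>       if ((vec & 1) != 0) sum ^= mat[i];
>     }
>     return sum;
>   }
> 
>   private static void square(long[] dst, long[] mat) {
>     for (int n = 0; n < 32; n++) dst[n] = times(mat, mat[n]);
>   }
> 
>   /** CRC of A||B given crc1 = CRC(A), crc2 = CRC(B), len2 = length of B. */
>   public static long combine(long crc1, long crc2, long len2) {
>     if (len2 <= 0) return crc1;
>     long[] even = new long[32];
>     long[] odd = new long[32];
>     odd[0] = 0xedb88320L;                    // reflected CRC-32 polynomial
>     for (int n = 1; n < 32; n++) odd[n] = 1L << (n - 1);
>     square(even, odd);                       // operator for 2 zero bits
>     square(odd, even);                       // operator for 4 zero bits
>     do {                                     // append len2 zero bytes to crc1
>       square(even, odd);                     // first pass: 8 bits = 1 byte
>       if ((len2 & 1) != 0) crc1 = times(even, crc1);
>       len2 >>= 1;
>       if (len2 == 0) break;
>       square(odd, even);
>       if ((len2 & 1) != 0) crc1 = times(odd, crc1);
>       len2 >>= 1;
>     } while (len2 != 0);
>     return (crc1 ^ crc2) & 0xffffffffL;
>   }
> }
> {code}
> For example, combine(crc("hello "), crc("world"), 5) equals crc("hello 
> world") no matter where the split falls, which is why the composed file 
> CRC comes out independent of chunk and block boundaries.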







[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-04-10  Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HDFS-13056:
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.2.0
   Status: Resolved  (was: Patch Available)

Committed patch 14 to trunk.

Thanks [~dennishuo] for the great work here! Also thanks [~wzzdreamer], 
[~ajayydv], [~arpitagarwal], [~ste...@apache.org] for the reviews and comments!

[~dennishuo]:

Please add a release note for this (by editing the JIRA and entering it 
into the 'Release Note' field).

It seems you wanted to target this to branch-2.8. If so, please feel free 
to reopen and attach branch-2 patches. Precommit will only run if the JIRA 
is in the Patch Available state.




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-04-03  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.014.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-27  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.013.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-24  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.012.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-24  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.011.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-23  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.010.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-22  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Open  (was: Patch Available)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-22  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Patch Available  (was: Open)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-22  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.009.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-22  Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated HDFS-13056:
-
Attachment: (was: IMG_1964.JPG)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-22  Tsz Wo Nicholas Sze (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz Wo Nicholas Sze updated HDFS-13056:
---
Attachment: IMG_1964.JPG




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-07  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Patch Available  (was: Open)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-07  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Open  (was: Patch Available)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-07  Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.008.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-07 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056-branch-2.8.005.patch

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Assignee: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.003.patch, 
> HDFS-13056-branch-2.8.004.patch, HDFS-13056-branch-2.8.005.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, HDFS-13056.004.patch, 
> HDFS-13056.005.patch, HDFS-13056.006.patch, HDFS-13056.007.patch, 
> HDFS-13056.008.patch, Reference_only_zhen_PPOC_hadoop2.6.X.diff, 
> hdfs-file-composite-crc32-v1.pdf, hdfs-file-composite-crc32-v2.pdf, 
> hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-05 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.007.patch

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Assignee: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.003.patch, 
> HDFS-13056-branch-2.8.004.patch, HDFS-13056-branch-2.8.poc1.patch, 
> HDFS-13056.001.patch, HDFS-13056.002.patch, HDFS-13056.003.patch, 
> HDFS-13056.003.patch, HDFS-13056.004.patch, HDFS-13056.005.patch, 
> HDFS-13056.006.patch, HDFS-13056.007.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-05 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056-branch-2.8.004.patch

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Assignee: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.003.patch, 
> HDFS-13056-branch-2.8.004.patch, HDFS-13056-branch-2.8.poc1.patch, 
> HDFS-13056.001.patch, HDFS-13056.002.patch, HDFS-13056.003.patch, 
> HDFS-13056.003.patch, HDFS-13056.004.patch, HDFS-13056.005.patch, 
> HDFS-13056.006.patch, HDFS-13056.007.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-02 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Open  (was: Patch Available)

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.003.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, HDFS-13056.004.patch, 
> HDFS-13056.005.patch, HDFS-13056.006.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-02 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Patch Available  (was: Open)

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.003.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, HDFS-13056.004.patch, 
> HDFS-13056.005.patch, HDFS-13056.006.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-02 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.006.patch

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.003.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, HDFS-13056.004.patch, 
> HDFS-13056.005.patch, HDFS-13056.006.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-02 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056-branch-2.8.003.patch

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.003.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, HDFS-13056.004.patch, 
> HDFS-13056.005.patch, HDFS-13056.006.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-01 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Patch Available  (was: Open)

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.poc1.patch, 
> HDFS-13056.001.patch, HDFS-13056.002.patch, HDFS-13056.003.patch, 
> HDFS-13056.003.patch, HDFS-13056.004.patch, HDFS-13056.005.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-01 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Open  (was: Patch Available)

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.poc1.patch, 
> HDFS-13056.001.patch, HDFS-13056.002.patch, HDFS-13056.003.patch, 
> HDFS-13056.003.patch, HDFS-13056.004.patch, HDFS-13056.005.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-01 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.005.patch

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.poc1.patch, 
> HDFS-13056.001.patch, HDFS-13056.002.patch, HDFS-13056.003.patch, 
> HDFS-13056.003.patch, HDFS-13056.004.patch, HDFS-13056.005.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-01 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056-branch-2.8.002.patch

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.002.patch, HDFS-13056-branch-2.8.poc1.patch, 
> HDFS-13056.001.patch, HDFS-13056.002.patch, HDFS-13056.003.patch, 
> HDFS-13056.003.patch, HDFS-13056.004.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-01 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Patch Available  (was: Open)

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, HDFS-13056.004.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-01 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Open  (was: Patch Available)

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, HDFS-13056.004.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-03-01 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.004.patch

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, HDFS-13056.004.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-02-22 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Patch Available  (was: Open)

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-02-22 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Status: Open  (was: Patch Available)

> Expose file-level composite CRCs in HDFS which are comparable across 
> different instances/layouts
> 
>
> Key: HDFS-13056
> URL: https://issues.apache.org/jira/browse/HDFS-13056
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, distcp, erasure-coding, federation, hdfs
>Affects Versions: 3.0.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: HDFS-13056-branch-2.8.001.patch, 
> HDFS-13056-branch-2.8.poc1.patch, HDFS-13056.001.patch, HDFS-13056.002.patch, 
> HDFS-13056.003.patch, HDFS-13056.003.patch, 
> Reference_only_zhen_PPOC_hadoop2.6.X.diff, hdfs-file-composite-crc32-v1.pdf, 
> hdfs-file-composite-crc32-v2.pdf, hdfs-file-composite-crc32-v3.pdf



[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-02-22 Thread Dennis Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.003.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-02-21 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Status: Open  (was: Patch Available)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-02-21 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Status: Patch Available  (was: Open)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-02-21 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.003.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-02-21 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Status: Open  (was: Patch Available)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-02-21 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Status: Patch Available  (was: Open)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-02-20 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Status: Patch Available  (was: Open)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-02-20 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.002.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-01-31 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Attachment: hdfs-file-composite-crc32-v3.pdf




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-01-31 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056.001.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-01-31 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Attachment: hdfs-file-composite-crc32-v2.pdf




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-01-26 Thread zhenzhao wang (JIRA)

zhenzhao wang updated HDFS-13056:
-
Attachment: Reference_only_zhen_PPOC_hadoop2.6.X.diff




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-01-26 Thread zhenzhao wang (JIRA)

zhenzhao wang updated HDFS-13056:
-
Attachment: (was: HDFSFileLevelChecksum.diff)




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-01-26 Thread zhenzhao wang (JIRA)

zhenzhao wang updated HDFS-13056:
-
Attachment: HDFSFileLevelChecksum.diff




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-01-24 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056-branch-2.8.001.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-01-23 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Attachment: HDFS-13056-branch-2.8.poc1.patch




[jira] [Updated] (HDFS-13056) Expose file-level composite CRCs in HDFS which are comparable across different instances/layouts

2018-01-23 Thread Dennis Huo (JIRA)

Dennis Huo updated HDFS-13056:
--
Attachment: hdfs-file-composite-crc32-v1.pdf
