Hairong Kuang wrote:
If end-to-end integrity is a concern, we could let the client generate the
checksums and send them to the data node following the block data.
I created an issue in Jira related to this issue:
https://issues.apache.org/jira/browse/HADOOP-928
The idea there is to first make it possible …
A checksummed filesystem that embeds checksums into the data makes that data
unreadable by tools that don't anticipate checksums. In HDFS, data is
accessible only via the HDFS client, so this is not an issue: the checksums
can be stripped out before they reach clients. But for Local and S3, where
data may be accessed by other tools, embedded checksums are a problem.
A checksum file per block would hold many CRCs, one per 64k chunk or so of
the block, so it would still permit random access: the datanode would only
checksum the data accessed, plus on average an extra 32k.
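The chunk arithmetic behind that claim can be sketched as follows (a hypothetical helper, not HDFS code; the 64 KB chunk size is the one suggested above):

```java
// Hypothetical helper, not HDFS code: with a 64 KB checksum chunk, a read
// of [offset, offset + len) must be verified from the start of the first
// chunk it touches to the end of the last chunk it touches.
public class ChunkRange {
    static final long CHUNK = 64 * 1024;

    // First byte that must be checksummed for a read starting at offset.
    static long chunkStart(long offset) {
        return (offset / CHUNK) * CHUNK;
    }

    // One past the last byte that must be checksummed.
    static long chunkEnd(long offset, long len) {
        return ((offset + len + CHUNK - 1) / CHUNK) * CHUNK;
    }

    // Bytes checksummed beyond what the client actually asked for.
    static long extraBytes(long offset, long len) {
        return (chunkEnd(offset, len) - chunkStart(offset)) - len;
    }
}
```

With uniformly distributed read offsets, the extra bytes at the front of the range average half a chunk, which is where the ~32k figure comes from.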
Also, if datanodes were to send the checksum after the data on a read, the
client could verify the data as it arrives, preserving the end-to-end check.
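A minimal sketch of that client-side verification (class and method names are hypothetical, not HDFS APIs):

```java
import java.util.zip.CRC32;

// Sketch of the end-to-end check (names hypothetical, not HDFS APIs): the
// client accumulates a CRC over the bytes as they arrive and compares it
// with the trailing checksum the datanode sends after the data.
public class EndToEndCheck {
    // CRC32 of a byte buffer, as the client would compute while reading.
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // True if the locally computed CRC matches the trailing checksum.
    static boolean verify(byte[] received, long trailingChecksum) {
        return crcOf(received) == trailingChecksum;
    }
}
```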
Doug Cutting wrote:
Hairong Kuang wrote:
> Another option is to create a checksum file per block at the data node
> where the block is placed.
Yes, but then we'd need a separate checksum implementation for intermediate
data, and for other distributed filesystems that don't already guarantee
end-to-end data integrity.
Another option is to create a checksum file per block at the data node where
the block is placed. This approach clearly separates data and checksums and
does not require many changes to open(), seek() and length(). For create,
when a block is written to a data node, the data node creates a checksum
file for that block.
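The write path could look roughly like this (a sketch only; the layout and names are hypothetical, not the real HDFS on-disk format):

```java
import java.util.zip.CRC32;

// Sketch only (layout and names hypothetical, not the real HDFS format):
// on create, the datanode computes one CRC32 per 64 KB chunk of the block
// and would store the resulting array in a companion checksum file.
public class BlockChecksums {
    static final int CHUNK = 64 * 1024;

    // One CRC32 per chunk of blockData; the last chunk may be short.
    static int[] chunkChecksums(byte[] blockData) {
        int chunks = (blockData.length + CHUNK - 1) / CHUNK;
        int[] crcs = new int[chunks];
        for (int i = 0; i < chunks; i++) {
            int off = i * CHUNK;
            int len = Math.min(CHUNK, blockData.length - off);
            CRC32 crc = new CRC32();
            crc.update(blockData, off, len);
            crcs[i] = (int) crc.getValue();
        }
        return crcs;
    }
}
```

On a read, the datanode would load only the CRC entries covering the chunks touched by the request, which is what keeps random access cheap.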