Re: Understanding compression in hdfs

2012-07-29 Thread Yaron Gonen
Is the checksum pluggable? CRC-32 is good for error detection, not for duplicate detection, and duplicate detection is what I need.
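For illustration, a minimal sketch of the kind of cryptographic per-block hash Yaron is after, using the JDK's MessageDigest; the helper class and method here are hypothetical, not part of Hadoop. Unlike CRC-32, a SHA-1 digest is collision-resistant enough to be compared across blocks when looking for duplicates.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class BlockHasher {
        // Compute a SHA-1 digest over one block's worth of bytes.
        public static byte[] sha1(byte[] data, int off, int len)
                throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(data, off, len);
            return md.digest();
        }
    }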

Re: Understanding compression in hdfs

2012-07-29 Thread Brock Noland
Also note that HDFS already does checksums, which I believe you can retrieve:

http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)
http://hadoop.apache.org/common/docs/r1.0.3/hdfs_design.html#Data+Integrity

Brock
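For reference, a minimal sketch of calling that API against Hadoop 1.x (the path is only an example); on HDFS the result is an MD5-of-MD5s-of-CRC32s checksum over the whole file rather than a per-block hash:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChecksumExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Returns the filesystem's file-level checksum,
            // or null if the filesystem does not support checksums.
            FileChecksum checksum = fs.getFileChecksum(new Path("/user/yaron/data.seq"));
            if (checksum != null) {
                System.out.println(checksum.getAlgorithmName() + ": " + checksum);
            }
        }
    }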

Re: Understanding compression in hdfs

2012-07-29 Thread Yaron Gonen
Thanks! I'll dig into those classes to figure out my next step. Anyway, I just realized that block-level compression has nothing to do with HDFS blocks: an HDFS block can contain an unknown number of compressed blocks, which makes my efforts kind of worthless. Thanks again!
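A small sketch of why the two notions of "block" are independent: the SequenceFile compression block and the HDFS block are configured by separate properties (the values below are only examples):

    import org.apache.hadoop.conf.Configuration;

    public class BlockSizes {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // SequenceFile's compression "block": the in-memory buffer the writer
            // compresses as one unit (about 1 MB by default).
            conf.setInt("io.seqfile.compress.blocksize", 1000000);
            // HDFS block: the unit of storage and replication (e.g. 64 MB).
            conf.setLong("dfs.block.size", 64L * 1024 * 1024);
        }
    }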

Re: Understanding compression in hdfs

2012-07-29 Thread Tim Broberg
What if you wrote a CompressionOutputStream class that wraps around the existing ones and outputs a hash every so many bytes, and a CompressionInputStream that checks them? ...and a Codec that wraps your compressors around arbitrary existing codecs. Sounds like a bunch of work.
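A rough sketch of that wrapper idea against Hadoop's org.apache.hadoop.io.compress.CompressionOutputStream API; the class name, the choice of SHA-1, and hashing the uncompressed input rather than the compressed output are all assumptions, and where the digest would actually be emitted is left open:

    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.io.compress.CompressionOutputStream;

    // Hypothetical wrapper: forwards all writes to an underlying
    // CompressionOutputStream while feeding the same bytes into a SHA-1 digest.
    public class HashingCompressionOutputStream extends CompressionOutputStream {
        private final CompressionOutputStream wrapped;
        private final MessageDigest digest;

        public HashingCompressionOutputStream(CompressionOutputStream wrapped)
                throws NoSuchAlgorithmException {
            super(wrapped);
            this.wrapped = wrapped;
            this.digest = MessageDigest.getInstance("SHA-1");
        }

        @Override
        public void write(int b) throws IOException {
            digest.update((byte) b);
            wrapped.write(b);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            digest.update(b, off, len);
            wrapped.write(b, off, len);
        }

        @Override
        public void finish() throws IOException {
            wrapped.finish();
            // digest.digest() would yield the hash of everything written so far;
            // a real implementation would record or emit it here.
        }

        @Override
        public void resetState() throws IOException {
            digest.reset();
            wrapped.resetState();
        }
    }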

Understanding compression in hdfs

2012-07-29 Thread Yaron Gonen
Hi,
I've created a SequenceFile.Writer with block-level compression. I'd like to compute a SHA-1 hash for each block written. How do I do that? I didn't see any way to take control of the compression in order to know when a block ends.
Thanks,
Yaron
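For context, a minimal sketch of creating such a block-compressed writer with the Hadoop 1.x API (the path, key/value types, and codec are only examples); the writer decides internally when to flush each compressed block, which is exactly what makes the block boundaries hard to intercept:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class BlockCompressedWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Block-level compression: records are buffered and compressed in
            // batches; the writer flushes each block when its buffer fills up.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path("/user/yaron/data.seq"),
                    IntWritable.class, Text.class,
                    SequenceFile.CompressionType.BLOCK, new DefaultCodec());
            try {
                writer.append(new IntWritable(1), new Text("hello"));
            } finally {
                writer.close();
            }
        }
    }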