The SequenceFile.Reader will work perfectly! (I should have seen that.) As always - thanks, Harsh.
On Thu, Dec 5, 2013 at 2:22 AM, Harsh J <[email protected]> wrote:
> If you're looking for file header/contents based inspection, you could
> download the file and run the Linux utility 'file' on the file, and it
> should tell you the format.
>
> I don't know about Snappy (AFAIK, we don't have a snappy
> frame/container format support in Hadoop yet, although upstream Snappy
> issue 34 seems resolved now), but Gzip files can be identified simply
> by their header bytes for the magic sequence.
>
> If its sequence files you are looking to analyse, a simple way is to
> read its first few hundred bytes, which should have the codec string
> in it. Programmatically you can use
> https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/io/SequenceFile.Reader.html#getCompressionCodec()
> for sequence files.
>
> On Thu, Dec 5, 2013 at 5:10 AM, alex bohr <[email protected]> wrote:
> > What's the best way to check the compression codec that an HDFS file was
> > written with?
> >
> > We use both Gzip and Snappy compression so I want a way to determine how a
> > specific file is compressed.
> >
> > The closest I found is the getCodec but that relies on the file name suffix
> > ... which don't exist since Reducers typically don't add a suffix to the
> > filenames they create.
> >
> > Thanks
>
> --
> Harsh J
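
For the archives, here is a rough, untested sketch of both checks Harsh describes: sniffing the gzip magic bytes (0x1f 0x8b) for raw gzip files, and asking SequenceFile.Reader#getCompressionCodec() for sequence files. It assumes the Hadoop 1.x-era SequenceFile.Reader(fs, path, conf) constructor (matching the r1.0.4 docs linked above) and that the HDFS path is passed as the first argument; the class and argument handling are just illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;

public class CodecSniffer {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);          // hypothetical: path passed on the command line
    FileSystem fs = path.getFileSystem(conf);

    // Peek at the first few bytes to decide what kind of file this is.
    byte[] header = new byte[3];
    FSDataInputStream in = fs.open(path);
    try {
      in.readFully(0, header);
    } finally {
      in.close();
    }

    if ((header[0] & 0xff) == 0x1f && (header[1] & 0xff) == 0x8b) {
      // Raw gzip files are identified purely by their two magic header bytes.
      System.out.println(path + ": gzip");
    } else if (header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
      // Sequence files start with the 'SEQ' magic; ask the reader which codec was used.
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      try {
        CompressionCodec codec = reader.getCompressionCodec();
        System.out.println(path + ": sequence file, codec = "
            + (codec == null ? "none" : codec.getClass().getName()));
      } finally {
        reader.close();
      }
    } else {
      System.out.println(path + ": unrecognized header");
    }
  }
}

A snappy-compressed sequence file should report org.apache.hadoop.io.compress.SnappyCodec here; as Harsh notes, standalone Snappy data outside a container format isn't covered by this kind of check.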
