The SequenceFile.Reader will work perfectly! (I should have seen that.) As always - thanks, Harsh.
On Thu, Dec 5, 2013 at 2:22 AM, Harsh J <[email protected]> wrote:
> If you're looking for file header/contents based inspection, you could
> download the file and run the Linux utility 'file' on the file, and it
> should tell you the format.
>
> I don't know about Snappy (AFAIK, we don't have a snappy
> frame/container format support in Hadoop yet, although upstream Snappy
> issue 34 seems resolved now), but Gzip files can be identified simply
> by their header bytes for the magic sequence.
>
> If its sequence files you are looking to analyse, a simple way is to
> read its first few hundred bytes, which should have the codec string
> in it. Programmatically you can use
> https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/io/SequenceFile.Reader.html#getCompressionCodec()
> for sequence files.
>
> On Thu, Dec 5, 2013 at 5:10 AM, alex bohr <[email protected]> wrote:
> > What's the best way to check the compression codec that an HDFS file was
> > written with?
> >
> > We use both Gzip and Snappy compression so I want a way to determine how a
> > specific file is compressed.
> >
> > The closest I found is the getCodec but that relies on the file name suffix
> > ... which don't exist since Reducers typically don't add a suffix to the
> > filenames they create.
> >
> > Thanks
>
> --
> Harsh J
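
For the archives, here is a rough, untested sketch of both checks Harsh describes: sniffing the gzip magic bytes (0x1f 0x8b) for raw gzip files, and asking SequenceFile.Reader#getCompressionCodec() for sequence files. It assumes the Hadoop 1.x-era SequenceFile.Reader(fs, path, conf) constructor (matching the r1.0.4 docs linked above) and that the HDFS path is passed as the first argument; the class and argument handling are just illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;

public class CodecSniffer {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);          // hypothetical: path passed on the command line
    FileSystem fs = path.getFileSystem(conf);

    // Peek at the first few bytes to decide what kind of file this is.
    byte[] header = new byte[3];
    FSDataInputStream in = fs.open(path);
    try {
      in.readFully(0, header);
    } finally {
      in.close();
    }

    if ((header[0] & 0xff) == 0x1f && (header[1] & 0xff) == 0x8b) {
      // Raw gzip files are identified purely by their two magic header bytes.
      System.out.println(path + ": gzip");
    } else if (header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') {
      // Sequence files start with the 'SEQ' magic; ask the reader which codec was used.
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      try {
        CompressionCodec codec = reader.getCompressionCodec();
        System.out.println(path + ": sequence file, codec = "
            + (codec == null ? "none" : codec.getClass().getName()));
      } finally {
        reader.close();
      }
    } else {
      System.out.println(path + ": unrecognized header");
    }
  }
}

A snappy-compressed sequence file should report org.apache.hadoop.io.compress.SnappyCodec here; as Harsh notes, standalone Snappy data outside a container format isn't covered by this kind of check.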
