Thanks for the reply. Sequence files with block compression might work.
However, it's not clear to me whether it's possible to read sequence files
using an external table.
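For concreteness, this is roughly what I had in mind — a sketch, assuming it
works at all (table name, columns, and HDFS path are all placeholders):

```sql
-- Hypothetical external table over block-compressed sequence files.
-- Table name, columns, delimiter, and location are made up for illustration.
CREATE EXTERNAL TABLE events (
  id BIGINT,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/data/events';
```

If Hive can read the compressed sequence files through a definition like this,
then other tools could presumably read the same files directly from HDFS.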

On 5 November 2012 16:04, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Compression is a confusing issue. Sequence files that are in block
> format are always splittable regardless of what compression for the
> block is chosen. The Programming Hive book has an entire section
> dedicated to the permutations of compression options.
>
> Edward
> On Mon, Nov 5, 2012 at 10:57 AM, Krishna Rao <krishnanj...@gmail.com>
> wrote:
> > Hi all,
> >
> > I'm looking into finding a suitable format to store data in HDFS, so that
> > it's available for processing by Hive. Ideally I would like to satisfy
> the
> > following:
> >
> > 1. store the data in a format that is readable by multiple Hadoop
> projects
> > (eg. Pig, Mahout, etc.), not just Hive
> > 2. work with a Hive external table
> > 3. store data in a compressed format that is splittable
> >
> > (1) is a requirement because Hive isn't appropriate for all the problems
> > that we want to throw at Hadoop.
> >
> > (2) is really more of a consequence of (1). Ideally we want the data
> stored
> > in some open format that is compressed in HDFS.
> > This way we can just point Hive, Pig, Mahout, etc at it depending on the
> > problem.
> >
> > (3) is obviously so it plays well with Hadoop.
> >
> > Gzip is no good because it is not splittable. Snappy looked promising,
> but
> > it is splittable only if used with a non-external Hive table.
> > LZO also looked promising, but I wonder whether it is future proof
> > given the licensing issues surrounding it.
> >
> > So far, the only solution I could find that satisfies all the above
> seems to
> > be bzip2 compression, but concerns about its performance make me wary
> about
> > choosing it.
> >
> > Is bzip2 the only option I have? Or have I missed some other compression
> > option?
> >
> > Cheers,
> >
> > Krishna
>