Thanks for the reply. Block-compressed SequenceFiles might work. However, it's not clear to me whether a Hive external table can read SequenceFiles.
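For what it's worth, Hive's DDL does accept STORED AS SEQUENCEFILE on an external table, so something like the following sketch should work (table name, columns, delimiter, and HDFS path below are made up for illustration; it assumes the SequenceFile values are delimited text):

```sql
-- Hypothetical external table over SequenceFiles already sitting in HDFS.
-- Hive reads only the value part of each key/value record.
CREATE EXTERNAL TABLE events (
  id BIGINT,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/data/events';
```

Dropping the table would then leave the underlying files in /data/events untouched, which is the usual reason for using EXTERNAL here.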
On 5 November 2012 16:04, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> Compression is a confusing issue. Sequence files that are in block
> format are always splittable regardless of which compression codec is
> chosen for the block. The Programming Hive book has an entire section
> dedicated to the permutations of compression options.
>
> Edward
>
> On Mon, Nov 5, 2012 at 10:57 AM, Krishna Rao <krishnanj...@gmail.com> wrote:
> > Hi all,
> >
> > I'm looking into finding a suitable format to store data in HDFS, so that
> > it's available for processing by Hive. Ideally I would like to satisfy the
> > following:
> >
> > 1. store the data in a format that is readable by multiple Hadoop projects
> > (eg. Pig, Mahout, etc.), not just Hive
> > 2. work with a Hive external table
> > 3. store data in a compressed format that is splittable
> >
> > (1) is a requirement because Hive isn't appropriate for all the problems
> > we want to throw at Hadoop.
> >
> > (2) is really a consequence of (1): ideally the data is stored in some
> > open, compressed format in HDFS, so that we can point Hive, Pig, Mahout,
> > etc. at it depending on the problem.
> >
> > (3) is so it plays well with Hadoop.
> >
> > Gzip is no good because it is not splittable. Snappy looked promising, but
> > it is splittable only when used with a non-external Hive table.
> > LZO also looked promising, but I wonder whether it is future-proof given
> > the licensing issues surrounding it.
> >
> > So far, the only option I've found that satisfies all of the above is
> > bzip2 compression, but concerns about its performance make me wary of
> > choosing it.
> >
> > Is bzip2 the only option I have? Or have I missed some other compression
> > option?
> >
> > Cheers,
> >
> > Krishna
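To make Edward's point concrete: getting block-compressed SequenceFile output from Hive is a matter of session settings rather than table DDL. A sketch (setting names are the standard Hive/Hadoop 1.x ones; the codec choice here is just an example, and any codec works for splitting since the SequenceFile container itself provides the sync points):

```sql
-- Ask Hive to compress job output.
SET hive.exec.compress.output=true;
-- Compress whole blocks of records, not individual records;
-- block-compressed SequenceFiles stay splittable with any codec.
SET mapred.output.compression.type=BLOCK;
-- Example codec; Gzip or others would be equally splittable inside
-- a block-compressed SequenceFile.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```

With those set, an INSERT into a table declared STORED AS SEQUENCEFILE produces splittable, compressed files that Pig, Mahout, etc. can also read.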