Cloudera has published a blog post [1] about Parquet that seems to answer most of these questions. I would encourage you to read it. It specifically discusses the relationship with Trevni:

  Parquet is designed to bring efficient columnar storage to Hadoop.
  Compared to, and learning from, the initial work done toward this goal
  in Trevni, Parquet includes the following enhancements:

  * Efficiently encode nested structures and sparsely populated data
    based on the Google Dremel definition/repetition levels
  * Provide extensible support for per-column encodings (e.g. delta,
    run length, etc)
  * Provide extensibility of storing multiple types of data in column
    data (e.g. indexes, bloom filters, statistics)
  * Offer better write performance by storing metadata at the end of
    the file

Jarcec

Links:
1: http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/

On Tue, Mar 12, 2013 at 01:06:04PM -0700, Kevin Olson wrote:
> Second on that. Parquet looks compelling, but I'm curious to understand
> why Cloudera suddenly switched from espousing future support for Trevni
> to teaming with Twitter on Parquet.
>
> On Tue, Mar 12, 2013 at 11:01 AM, Stan Rosenberg
> <[email protected]> wrote:
>
> > Dmitriy,
> >
> > Please excuse my ignorance. What is/was wrong with trevni
> > (https://github.com/cutting/trevni) ?
> >
> > Thanks,
> >
> > stan
> >
> > On Tue, Mar 12, 2013 at 11:45 AM, Dmitriy Ryaboy <[email protected]>
> > wrote:
> > > Fellow Hadoopers,
> > >
> > > We'd like to introduce a joint project between Twitter and Cloudera
> > > engineers -- a new columnar storage format for Hadoop called Parquet
> > > (http://parquet.github.com).
> > >
> > > We created Parquet to make the advantages of compressed, efficient
> > > columnar data representation available to any project in the Hadoop
> > > ecosystem, regardless of the choice of data processing framework,
> > > data model, or programming language.
> > >
> > > Parquet is built from the ground up with complex nested data
> > > structures in mind. We adopted the repetition/definition level
> > > approach to encoding such data structures, as described in Google's
> > > Dremel paper; we have found this to be a very efficient method of
> > > encoding data in non-trivial object schemas.
> > >
> > > Parquet is built to support very efficient compression and encoding
> > > schemes. Parquet allows compression schemes to be specified on a
> > > per-column level, and is future-proofed to allow adding more
> > > encodings as they are invented and implemented. We separate the
> > > concepts of encoding and compression, allowing parquet consumers to
> > > implement operators that work directly on encoded data without
> > > paying decompression and decoding penalty when possible.
> > >
> > > Parquet is built to be used by anyone. The Hadoop ecosystem is rich
> > > with data processing frameworks, and we are not interested in
> > > playing favorites. We believe that an efficient, well-implemented
> > > columnar storage substrate should be useful to all frameworks
> > > without the cost of extensive and difficult to set up dependencies.
> > >
> > > The initial code, available at https://github.com/Parquet, defines
> > > the file format, provides Java building blocks for processing
> > > columnar data, and implements Hadoop Input/Output Formats, Pig
> > > Storers/Loaders, and an example of a complex integration --
> > > Input/Output formats that can convert Parquet-stored data directly
> > > to and from Thrift objects.
> > >
> > > A preview version of Parquet support will be available in Cloudera's
> > > Impala 0.7.
> > >
> > > Twitter is starting to convert some of its major data source to
> > > Parquet in order to take advantage of the compression and
> > > deserialization savings.
> > >
> > > Parquet is currently under heavy development. Parquet's near-term
> > > roadmap includes:
> > > * Hive SerDes (Criteo)
> > > * Cascading Taps (Criteo)
> > > * Support for dictionary encoding, zigzag encoding, and RLE encoding
> > >   of data (Cloudera and Twitter)
> > > * Further improvements to Pig support (Twitter)
> > >
> > > Company names in parenthesis indicate whose engineers signed up to
> > > do the work -- others can feel free to jump in too, of course.
> > >
> > > We've also heard requests to provide an Avro container layer,
> > > similar to what we do with Thrift. Seeking volunteers!
> > >
> > > We welcome all feedback, patches, and ideas; to foster community
> > > development, we plan to contribute Parquet to the Apache Incubator
> > > when the development is farther along.
> > >
> > > Regards,
> > > Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy
> > > Ryaboy, Jonathan Coveney, and friends.
> >
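
For anyone unfamiliar with the encodings named in the thread (delta and run length in the blog excerpt, zigzag and RLE in the roadmap), here is a minimal Python sketch of how they can be chained on a single integer column. It is an illustration only and does not reproduce Parquet's actual on-disk format; every function name below (zigzag_encode, delta_encode, rle_encode, encode_column, and so on) is made up for this example, and the choice to chain delta, then zigzag, then run length is an assumption for demonstration, not something the announcement specifies.

```python
from itertools import groupby


def zigzag_encode(n):
    """Map signed ints to unsigned ones: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return 2 * n if n >= 0 else -2 * n - 1


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def delta_encode(values):
    """Store each value as the difference from the previous one (first vs 0)."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a flat list."""
    return [v for v, count in runs for _ in range(count)]


def encode_column(values):
    # Hypothetical pipeline: delta -> zigzag -> run length.
    return rle_encode([zigzag_encode(d) for d in delta_encode(values)])


def decode_column(runs):
    return delta_decode([zigzag_decode(z) for z in rle_decode(runs)])


if __name__ == "__main__":
    column = [100, 101, 102, 103, 103, 102]
    runs = encode_column(column)
    print(runs)
    print(decode_column(runs) == column)
```

The point of the sketch is that slowly changing values turn into small, repetitive numbers after the delta step, zigzag keeps small negative deltas small as unsigned values, and run length then collapses the repeats. Because each step is a plain transformation of the column, a general-purpose compressor can still be applied afterwards, which matches the separation of encoding and compression described in the announcement.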
