IMO, it would be enlightening for Hadoop users to compare Parquet with Trevni and ORCFile, all of which are relatively new columnar formats for Hadoop. Do we really need three columnar formats?
On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy <[email protected]> wrote:
> Fellow Hadoopers,
>
> We'd like to introduce a joint project between Twitter and Cloudera
> engineers -- a new columnar storage format for Hadoop called Parquet
> (http://parquet.github.com).
>
> We created Parquet to make the advantages of compressed, efficient
> columnar data representation available to any project in the Hadoop
> ecosystem, regardless of the choice of data processing framework, data
> model, or programming language.
>
> Parquet is built from the ground up with complex nested data structures
> in mind. We adopted the repetition/definition level approach to encoding
> such data structures, as described in Google's Dremel paper; we have
> found this to be a very efficient method of encoding data in non-trivial
> object schemas.
>
> Parquet is built to support very efficient compression and encoding
> schemes. Parquet allows compression schemes to be specified on a
> per-column level, and is future-proofed to allow adding more encodings
> as they are invented and implemented. We separate the concepts of
> encoding and compression, allowing Parquet consumers to implement
> operators that work directly on encoded data without paying the
> decompression and decoding penalty when possible.
>
> Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
> data processing frameworks, and we are not interested in playing
> favorites. We believe that an efficient, well-implemented columnar
> storage substrate should be useful to all frameworks without the cost of
> extensive and difficult-to-set-up dependencies.
>
> The initial code, available at https://github.com/Parquet, defines the
> file format, provides Java building blocks for processing columnar data,
> and implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an
> example of a complex integration -- Input/Output formats that can
> convert Parquet-stored data directly to and from Thrift objects.
>
> A preview version of Parquet support will be available in Cloudera's
> Impala 0.7.
>
> Twitter is starting to convert some of its major data sources to Parquet
> in order to take advantage of the compression and deserialization
> savings.
>
> Parquet is currently under heavy development. Parquet's near-term
> roadmap includes:
> * Hive SerDes (Criteo)
> * Cascading Taps (Criteo)
> * Support for dictionary encoding, zigzag encoding, and RLE encoding of
>   data (Cloudera and Twitter)
> * Further improvements to Pig support (Twitter)
>
> Company names in parentheses indicate whose engineers signed up to do
> the work -- others can feel free to jump in too, of course.
>
> We've also heard requests to provide an Avro container layer, similar to
> what we do with Thrift. Seeking volunteers!
>
> We welcome all feedback, patches, and ideas; to foster community
> development, we plan to contribute Parquet to the Apache Incubator when
> the development is further along.
>
> Regards,
> Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
> Jonathan Coveney, and friends.
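For readers new to the repetition/definition level approach mentioned in the announcement, here is a minimal worked sketch in Java. It assumes a toy column whose schema is equivalent to List<List<Integer>> per record (two repeated levels, so the maximum repetition and definition levels are both 2); the class and method names are illustrative only and are not Parquet's API.

    import java.util.List;

    public class DremelLevelsSketch {

        // Emit Dremel-style (value, repetitionLevel, definitionLevel) triples
        // for one record of a List<List<Integer>> column. The repetition
        // level says at which repeated ancestor the value starts a new entry
        // (0 = start of a new record); the definition level says how many
        // repeated levels along the path are actually present.
        static void encode(List<List<Integer>> record) {
            if (record.isEmpty()) {
                emit(null, 0, 0);                // outer list empty: nothing defined
                return;
            }
            int repetition = 0;                  // first value in a record starts at r=0
            for (List<Integer> inner : record) {
                if (inner.isEmpty()) {
                    emit(null, repetition, 1);   // inner list present but empty
                } else {
                    for (Integer v : inner) {
                        emit(v, repetition, 2);  // value fully defined
                        repetition = 2;          // next value repeats at the inner level
                    }
                }
                repetition = 1;                  // next inner list repeats at the outer level
            }
        }

        static void emit(Integer value, int r, int d) {
            System.out.printf("value=%-4s r=%d d=%d%n", value, r, d);
        }

        public static void main(String[] args) {
            // [[1, 2], [3]] -> (1, r=0, d=2), (2, r=2, d=2), (3, r=1, d=2)
            encode(List.of(List.of(1, 2), List.of(3)));
        }
    }

Because the levels are small integers bounded by the schema depth, the level streams themselves compress very well, which is part of why the approach works for non-trivial schemas.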
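The announcement's claim about operators working directly on encoded data can be made concrete with a small sketch: with a run-length-encoded column, a counting predicate touches each run once rather than each value, skipping the decode step entirely. The Run type and the column contents here are illustrative, not Parquet structures.

    import java.util.List;

    public class RleScanSketch {
        // One run: `value` repeated `count` times in the original column.
        record Run(int value, long count) {}

        // Count how many stored values equal `target`, touching each run once.
        static long countEquals(List<Run> column, int target) {
            long total = 0;
            for (Run run : column) {
                if (run.value() == target) {
                    total += run.count();   // whole run matches; no decoding needed
                }
            }
            return total;
        }

        public static void main(String[] args) {
            // Encoded column 5,5,5,5,7,7,5 stored as three runs.
            List<Run> column = List.of(new Run(5, 4), new Run(7, 2), new Run(5, 1));
            System.out.println(countEquals(column, 5));  // prints 5
        }
    }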
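Of the encodings on the roadmap above, zigzag is easy to show end to end: it maps signed integers to unsigned codes so that values of small magnitude, positive or negative, become small numbers that downstream varint or RLE schemes can pack tightly. A self-contained sketch, independent of any Parquet API:

    public class ZigZagSketch {
        // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
        static int encode(int n) { return (n << 1) ^ (n >> 31); }
        static int decode(int z) { return (z >>> 1) ^ -(z & 1); }

        public static void main(String[] args) {
            for (int n : new int[] {0, -1, 1, -2, 2, -64}) {
                System.out.printf("%d -> %d -> %d%n", n, encode(n), decode(encode(n)));
            }
        }
    }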
