Hi folks,

Thanks for your interest. The Cloudera blog post has a few additional
bullet points about the differences between Trevni and Parquet:
http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/
D

On Tue, Mar 12, 2013 at 3:40 PM, Luke Lu <[email protected]> wrote:

> IMO, it'll be enlightening to Hadoop users to compare Parquet with Trevni
> and ORCFile, all of which are columnar formats for Hadoop that are
> relatively new. Do we really need 3 columnar formats?
>
>
> On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Fellow Hadoopers,
>>
>> We'd like to introduce a joint project between Twitter and Cloudera
>> engineers -- a new columnar storage format for Hadoop called Parquet
>> (http://parquet.github.com).
>>
>> We created Parquet to make the advantages of compressed, efficient
>> columnar data representation available to any project in the Hadoop
>> ecosystem, regardless of the choice of data processing framework, data
>> model, or programming language.
>>
>> Parquet is built from the ground up with complex nested data structures
>> in mind. We adopted the repetition/definition level approach to encoding
>> such data structures, as described in Google's Dremel paper; we have
>> found this to be a very efficient method of encoding data with
>> non-trivial object schemas.
>>
>> Parquet is built to support very efficient compression and encoding
>> schemes. Parquet allows compression schemes to be specified at the
>> per-column level, and is future-proofed to allow adding more encodings
>> as they are invented and implemented. We separate the concepts of
>> encoding and compression, allowing Parquet consumers to implement
>> operators that work directly on encoded data without paying the
>> decompression and decoding penalty when possible.
>>
>> Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
>> data processing frameworks, and we are not interested in playing
>> favorites. We believe that an efficient, well-implemented columnar
>> storage substrate should be useful to all frameworks without the cost
>> of extensive and difficult-to-set-up dependencies.
>>
>> The initial code, available at https://github.com/Parquet, defines the
>> file format, provides Java building blocks for processing columnar
>> data, and implements Hadoop Input/Output Formats, Pig Storers/Loaders,
>> and an example of a complex integration -- Input/Output formats that
>> can convert Parquet-stored data directly to and from Thrift objects.
>>
>> A preview version of Parquet support will be available in Cloudera's
>> Impala 0.7.
>>
>> Twitter is starting to convert some of its major data sources to
>> Parquet in order to take advantage of the compression and
>> deserialization savings.
>>
>> Parquet is currently under heavy development. Parquet's near-term
>> roadmap includes:
>> * Hive SerDes (Criteo)
>> * Cascading Taps (Criteo)
>> * Support for dictionary encoding, zigzag encoding, and RLE encoding
>> of data (Cloudera and Twitter)
>> * Further improvements to Pig support (Twitter)
>>
>> Company names in parentheses indicate whose engineers signed up to do
>> the work -- others should feel free to jump in too, of course.
>>
>> We've also heard requests to provide an Avro container layer, similar
>> to what we do with Thrift. Seeking volunteers!
>>
>> We welcome all feedback, patches, and ideas; to foster community
>> development, we plan to contribute Parquet to the Apache Incubator when
>> development is further along.
>>
>> Regards,
>> Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
>> Jonathan Coveney, and friends.
>>
>
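
For a concrete feel for the pieces described in the announcement, below is
a minimal sketch of writing a nested record with parquet-mr's example
helpers. It assumes a later parquet-mr release than this thread: the
org.apache.parquet package names, the ExampleParquetWriter builder, and
the users.parquet output path are illustrative assumptions, not APIs from
the original announcement. The comments show the (repetition, definition)
level pairs that the Dremel-style encoding assigns to the repeated field.

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteNestedParquet {
  public static void main(String[] args) throws Exception {
    // A nested schema: an optional group holding a repeated field, the
    // kind of structure repetition/definition levels are designed for.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message user {\n"
      + "  required binary name (UTF8);\n"
      + "  optional group phones {\n"
      + "    repeated binary number (UTF8);\n"
      + "  }\n"
      + "}");

    try (ParquetWriter<Group> writer =
             ExampleParquetWriter.builder(new Path("users.parquet"))
                 .withType(schema)
                 // The codec is chosen separately from the column
                 // encodings, reflecting the encoding/compression split
                 // the announcement describes.
                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                 .build()) {
      SimpleGroupFactory groups = new SimpleGroupFactory(schema);

      // Two phone numbers: the phones.number column stores them as
      // (repetition, definition) = (0, 2) then (1, 2).
      Group alice = groups.newGroup().append("name", "alice");
      alice.addGroup("phones")
           .append("number", "555-0100")
           .append("number", "555-0101");
      writer.write(alice);

      // No phones group at all: the phones.number column stores a single
      // null marker with (repetition, definition) = (0, 0).
      writer.write(groups.newGroup().append("name", "bob"));
    }
  }
}

Because the levels, not the values, carry the nesting structure, a reader
scanning only phones.number can reconstruct which numbers belong to which
record without touching the other columns.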
