Hi folks,

Thanks for your interest. The Cloudera blog post has a few additional
bullet points about the differences between Trevni and Parquet:
http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/
D

On Tue, Mar 12, 2013 at 3:40 PM, Luke Lu <[email protected]> wrote:

> IMO, it'll be enlightening to Hadoop users to compare Parquet with Trevni
> and ORCFile, all of which are columnar formats for Hadoop that are
> relatively new. Do we really need 3 columnar formats?
>
>
> On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Fellow Hadoopers,
>>
>> We'd like to introduce a joint project between Twitter and Cloudera
>> engineers -- a new columnar storage format for Hadoop called Parquet
>> (http://parquet.github.com).
>>
>> We created Parquet to make the advantages of compressed, efficient
>> columnar data representation available to any project in the Hadoop
>> ecosystem, regardless of the choice of data processing framework, data
>> model, or programming language.
>>
>> Parquet is built from the ground up with complex nested data structures
>> in mind. We adopted the repetition/definition level approach to encoding
>> such data structures, as described in Google's Dremel paper; we have
>> found this to be a very efficient method of encoding data with
>> non-trivial object schemas.
>>
>> Parquet is built to support very efficient compression and encoding
>> schemes. Parquet allows compression schemes to be specified at the
>> per-column level, and is future-proofed to allow adding more encodings
>> as they are invented and implemented. We separate the concepts of
>> encoding and compression, allowing Parquet consumers to implement
>> operators that work directly on encoded data without paying the
>> decompression and decoding penalty when possible.
>>
>> Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
>> data processing frameworks, and we are not interested in playing
>> favorites. We believe that an efficient, well-implemented columnar
>> storage substrate should be useful to all frameworks without the cost
>> of extensive and difficult-to-set-up dependencies.
>>
>> The initial code, available at https://github.com/Parquet, defines the
>> file format, provides Java building blocks for processing columnar
>> data, and implements Hadoop Input/Output Formats, Pig Storers/Loaders,
>> and an example of a complex integration -- Input/Output formats that
>> can convert Parquet-stored data directly to and from Thrift objects.
>>
>> A preview version of Parquet support will be available in Cloudera's
>> Impala 0.7.
>>
>> Twitter is starting to convert some of its major data sources to
>> Parquet in order to take advantage of the compression and
>> deserialization savings.
>>
>> Parquet is currently under heavy development. Parquet's near-term
>> roadmap includes:
>> * Hive SerDes (Criteo)
>> * Cascading Taps (Criteo)
>> * Support for dictionary encoding, zigzag encoding, and RLE encoding
>> of data (Cloudera and Twitter)
>> * Further improvements to Pig support (Twitter)
>>
>> Company names in parentheses indicate whose engineers signed up to do
>> the work -- others should feel free to jump in too, of course.
>>
>> We've also heard requests to provide an Avro container layer, similar
>> to what we do with Thrift. Seeking volunteers!
>>
>> We welcome all feedback, patches, and ideas; to foster community
>> development, we plan to contribute Parquet to the Apache Incubator when
>> development is further along.
>>
>> Regards,
>> Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
>> Jonathan Coveney, and friends.
>>
>
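
For a concrete feel for the pieces described in the announcement, below is
a minimal sketch of writing a nested record with parquet-mr's example
helpers. It assumes a later parquet-mr release than this thread: the
org.apache.parquet package names, the ExampleParquetWriter builder, and
the users.parquet output path are illustrative assumptions, not APIs from
the original announcement. The comments show the (repetition, definition)
level pairs that the Dremel-style encoding assigns to the repeated field.

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteNestedParquet {
  public static void main(String[] args) throws Exception {
    // A nested schema: an optional group holding a repeated field, the
    // kind of structure repetition/definition levels are designed for.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message user {\n"
      + "  required binary name (UTF8);\n"
      + "  optional group phones {\n"
      + "    repeated binary number (UTF8);\n"
      + "  }\n"
      + "}");

    try (ParquetWriter<Group> writer =
             ExampleParquetWriter.builder(new Path("users.parquet"))
                 .withType(schema)
                 // The codec is chosen separately from the column
                 // encodings, reflecting the encoding/compression split
                 // the announcement describes.
                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                 .build()) {
      SimpleGroupFactory groups = new SimpleGroupFactory(schema);

      // Two phone numbers: the phones.number column stores them as
      // (repetition, definition) = (0, 2) then (1, 2).
      Group alice = groups.newGroup().append("name", "alice");
      alice.addGroup("phones")
           .append("number", "555-0100")
           .append("number", "555-0101");
      writer.write(alice);

      // No phones group at all: the phones.number column stores a single
      // null marker with (repetition, definition) = (0, 0).
      writer.write(groups.newGroup().append("name", "bob"));
    }
  }
}

Because the levels, not the values, carry the nesting structure, a reader
scanning only phones.number can reconstruct which numbers belong to which
record without touching the other columns.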
