Re: Introducing Parquet: efficient columnar storage for Hadoop.

Jarek Jarcec Cecho Wed, 13 Mar 2013 12:39:48 -0700

Cloudera has published a blog post [1] about Parquet that seems to answer most 
of these questions. I would encourage you to read it. It specifically addresses 
the relationship with Trevni:


Parquet is designed to bring efficient columnar storage to Hadoop. Compared to, 
and learning from, the initial work done toward this goal in Trevni, Parquet 
includes the following enhancements:

* Efficiently encode nested structures and sparsely populated data based on the 
Google Dremel definition/repetition levels
* Provide extensible support for per-column encodings (e.g. delta, run length, 
etc)
* Provide extensibility of storing multiple types of data in column data (e.g. 
indexes, bloom filters, statistics)
* Offer better write performance by storing metadata at the end of the file
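
To make the per-column encoding point concrete, here is a small Python sketch of
two of the encodings named above, run length and delta. It is only an
illustration of the general techniques and does not reflect Parquet's actual
on-disk format or API; the function names and the sample columns are made up
for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


def delta_encode(values):
    """Keep the first value, then store each difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original column by taking a running sum of the deltas."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Hypothetical columns: a low-cardinality string column and a sorted timestamp column.
country = ["US", "US", "US", "DE", "DE", "FR"]
timestamp = [1000, 1003, 1004, 1010]

country_rle = rle_encode(country)        # [("US", 3), ("DE", 2), ("FR", 1)]
timestamp_delta = delta_encode(timestamp)  # [1000, 3, 1, 6]
```

Because each column holds values of a single type, a different encoding can be
picked per column: run length suits the repetitive country column, while delta
turns the slowly growing timestamps into small numbers that compress well.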

Jarcec

Links:
1: http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/

On Tue, Mar 12, 2013 at 01:06:04PM -0700, Kevin Olson wrote:
> Second on that. Parquet looks compelling, but I'm curious to understand why
> Cloudera suddenly switched from espousing future support for Trevni to
> teaming with Twitter on Parquet.
> 
> On Tue, Mar 12, 2013 at 11:01 AM, Stan Rosenberg
> <[email protected]> wrote:
> 
> > Dmitriy,
> >
> > Please excuse my ignorance.  What is/was wrong with trevni
> > (https://github.com/cutting/trevni) ?
> >
> > Thanks,
> >
> > stan
> >
> > On Tue, Mar 12, 2013 at 11:45 AM, Dmitriy Ryaboy <[email protected]>
> > wrote:
> > > Fellow Hadoopers,
> > >
> > > We'd like to introduce a joint project between Twitter and Cloudera
> > > engineers -- a new columnar storage format for Hadoop called Parquet (
> > > http://parquet.github.com).
> > >
> > > We created Parquet to make the advantages of compressed, efficient columnar
> > > data representation available to any project in the Hadoop ecosystem,
> > > regardless of the choice of data processing framework, data model, or
> > > programming language.
> > >
> > > Parquet is built from the ground up with complex nested data structures in
> > > mind. We adopted the repetition/definition level approach to encoding such
> > > data structures, as described in Google's Dremel paper; we have found this
> > > to be a very efficient method of encoding data in non-trivial object
> > > schemas.
> > >
> > > Parquet is built to support very efficient compression and encoding
> > > schemes. Parquet allows compression schemes to be specified on a per-column
> > > level, and is future-proofed to allow adding more encodings as they are
> > > invented and implemented. We separate the concepts of encoding and
> > > compression, allowing Parquet consumers to implement operators that work
> > > directly on encoded data without paying a decompression and decoding
> > > penalty when possible.
> > >
> > > Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
> > > data processing frameworks, and we are not interested in playing favorites.
> > > We believe that an efficient, well-implemented columnar storage substrate
> > > should be useful to all frameworks without the cost of extensive and
> > > difficult to set up dependencies.
> > >
> > > The initial code, available at https://github.com/Parquet, defines the file
> > > format, provides Java building blocks for processing columnar data, and
> > > implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example
> > > of a complex integration -- Input/Output formats that can convert
> > > Parquet-stored data directly to and from Thrift objects.
> > >
> > > A preview version of Parquet support will be available in Cloudera's Impala
> > > 0.7.
> > >
> > > Twitter is starting to convert some of its major data sources to Parquet in
> > > order to take advantage of the compression and deserialization savings.
> > >
> > > Parquet is currently under heavy development. Parquet's near-term roadmap
> > > includes:
> > > * Hive SerDes (Criteo)
> > > * Cascading Taps (Criteo)
> > > * Support for dictionary encoding, zigzag encoding, and RLE encoding of
> > > data (Cloudera and Twitter)
> > > * Further improvements to Pig support (Twitter)
> > >
> > > Company names in parentheses indicate whose engineers signed up to do the
> > > work -- others can feel free to jump in too, of course.
> > >
> > > We've also heard requests to provide an Avro container layer, similar to
> > > what we do with Thrift. Seeking volunteers!
> > >
> > > We welcome all feedback, patches, and ideas; to foster community
> > > development, we plan to contribute Parquet to the Apache Incubator when the
> > > development is farther along.
> > >
> > > Regards,
> > > Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
> > > Jonathan Coveney, and friends.
> >
