Based on your first example, it looks like what you want is actually
run-length encoding (which Parquet does support
<https://github.com/Parquet/parquet-format/blob/master/Encodings.md>).
Repetition and definition levels are used to reconstruct nested or repeated
(array) data that has been shredded so that each column can be stored
separately (allowing you to avoid reading data for columns you don't care
about).
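
For example (a minimal sketch, not from the thread; the path and table
name are placeholders), Spark SQL reads only the columns a query touches:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Hypothetical file written earlier; only col1's column chunks are read
// from disk for this query, thanks to the columnar layout.
sqlContext.parquetFile("/tmp/data.parquet").registerTempTable("data")
sqlContext.sql("SELECT col1 FROM data").collect()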

Spark SQL is likely the easiest way for you to achieve what you want, and
it does support nested and array data (though it does not look like your
schema has any).  Given the original data you could save it as Parquet as
follows:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Brings the implicit RDD-to-SchemaRDD conversion into scope, which is
// what makes saveAsParquetFile available below.
import sqlContext._

case class Data(col1: Int, col2: Long, col3: Long, col4: Int)

sc.parallelize(
  Data(14, 1234, 1422, 3)  ::
  Data(14, 3212, 1542, 2)  ::
  Data(14, 8910, 1422, 8)  ::
  Data(15, 1234, 1542, 9)  ::
  Data(15, 8897, 1422, 13) :: Nil).saveAsParquetFile(...)

Note that this is only an illustration of the API; if the data is large
you will not want to construct it all as a static List on the driver and
parallelize it.  Instead, transform it into the case class representation
using a map or something similar and then call saveAsParquetFile.
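
For example (a minimal sketch; the input path, record layout, and output
path are placeholders, not from the thread):

// Parse a comma-separated text file into the case class defined above;
// the import sqlContext._ above provides saveAsParquetFile on the result.
val data = sc.textFile("hdfs:///tmp/input.csv").map { line =>
  val f = line.split(',')
  Data(f(0).toInt, f(1).toLong, f(2).toLong, f(3).toInt)
}
data.saveAsParquetFile("hdfs:///tmp/data.parquet")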

On Fri, Sep 26, 2014 at 9:00 AM, Frank Austin Nothaft <fnoth...@berkeley.edu
> wrote:

> Matthes,
>
> Ah, gotcha! Repeated items in Parquet seem to correspond to the ArrayType
> in Spark-SQL. I only use Spark, but it does look like that should be
> supported in Spark-SQL 1.1.0. I'm not sure, though, whether you can apply
> predicates on repeated items from Spark-SQL.
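>
> For example (a minimal sketch; the class, field names, and path are
> made up), an array column can be modeled as a Seq field on a case class:
>
> case class Event(id: Int, values: Seq[Long])
>
> // Assumes import sqlContext._ for the implicit SchemaRDD conversion.
> sc.parallelize(Event(1, Seq(10L, 20L)) :: Event(2, Seq(30L)) :: Nil)
>   .saveAsParquetFile("/tmp/events.parquet")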
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
>
> On Sep 26, 2014, at 8:48 AM, matthes <mdiekst...@sensenetworks.com> wrote:
>
> > Hi Frank,
> >
> > thanks a lot for your response, this is very helpful!
> >
> > Actually I'm trying to figure out whether the current Spark version
> > supports repetition levels
> > (https://blog.twitter.com/2013/dremel-made-simple-with-parquet), but now
> > it looks good to me.
> > It is very hard to find good information about that. Now I found this as
> > well:
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala;h=1dc58633a2a68cd910c1bab01c3d5ee1eb4f8709;hb=f479cf37
> >
> > I wasn't sure of that because nested data can be many different things!
> > If it works with SQL, being able to find the firstRepeatedid or
> > secoundRepeatedid would be awesome. But if it only works with a kind of
> > map/reduce job, that is also good. The most important thing is to filter
> > on the first or second repeated value as fast as possible, and in
> > combination as well.
> > I'm starting to play with these things now to get the best search
> > results!
> >
> > My schema looks like this:
> >
> > val nestedSchema =
> >   """message nestedRowSchema
> >      {
> >        int32 firstRepeatedid;
> >        repeated group level1
> >        {
> >          int64 secoundRepeatedid;
> >          repeated group level2
> >          {
> >            int64 value1;
> >            int32 value2;
> >          }
> >        }
> >      }
> >   """
> >
> > Best,
> > Matthes
> >
