No, there is no trick. This is because Drill reads the data as it is physically written. At some point we will add the ability to interpret these types according to their logical type, but that will require that the parquet files are written with the correct OriginalType metadata. I don't know whether Hive or Spark currently do this.
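For what it's worth, you can check what metadata a given file actually carries by reading the footer with parquet-mr. The sketch below is untested and assumes the parquet-hadoop artifact is on the classpath (the path is the one from your example); it prints each top-level field's OriginalType, which would be LIST if the writer annotated the column and null if it didn't:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Read the file footer and pull out the physical schema.
val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/tmp/testjson_spark/part-r-00001.parquet"))
val schema = footer.getFileMetaData.getSchema

// OriginalType (LIST, MAP, ...) is the logical annotation a reader would
// need in order to map the physical 3-level group back to an array.
schema.getFields.asScala.foreach { field =>
  println(s"${field.getName}: ${field.getOriginalType}")
}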
On Fri, Aug 28, 2015 at 4:44 PM, Hao Zhu <[email protected]> wrote:

> Thanks, and I do not want to argue about whether Drill's parquet format is
> valid or Spark/Hive is doing the right thing.
>
> The current concern is that nested types in parquet generated by
> Spark/Hive cannot be read properly in Drill.
>
> Take the previous simple list for example. If it is converted by Spark to
> a parquet file:
>
> 1. Spark can read it as a list:
>
> val parquetFile =
>   sqlContext.parquetFile("/tmp/testjson_spark/part-r-00001.parquet")
> parquetFile.registerTempTable("parquetFile")
> val myresult = sqlContext.sql("SELECT * FROM parquetFile limit 10")
> myresult.map(t => "Name: " + t(0)).collect().foreach(println)
>
> Name: ArrayBuffer(1, 2, 3)
>
> 2. Drill can only return it like this:
>
> select * from dfs.`/tmp/testjson_spark/part-r-00001.parquet`;
> +------------------------------------------------+
> |                       c1                       |
> +------------------------------------------------+
> | {"bag":[{"array":1},{"array":2},{"array":3}]}  |
> +------------------------------------------------+
> 1 row selected (0.335 seconds)
>
> Is there any trick for reading the list properly in Drill?
>
> Thanks,
> Hao
>
> On Fri, Aug 28, 2015 at 4:20 PM, Steven Phillips <[email protected]>
> wrote:
>
> > Both the parquet and Drill internal data models are based on protobuf,
> > meaning there are required, optional, and repeated fields. In this
> > model, repeated fields cannot be null, nor can they have null elements.
> > The 3-layer nested structure is necessary to represent a field where
> > the array itself is nullable, as well as the elements of the array.
> >
> > We are going to add nullability to repeated types in Drill, and when we
> > do so, it would make sense to adopt the same format for representing
> > them in parquet that other projects have adopted.
> >
> > At the same time, I would argue that the fact that Drill writes the
> > parquet data in a different format than Spark SQL is not a problem. The
> > format that Drill currently writes is perfectly valid, and other
> > parquet tools should be able to interpret it just fine. It's just that
> > this way of writing an array doesn't allow for null values, which Drill
> > internally doesn't currently support anyway.
> >
> > On Fri, Aug 28, 2015 at 11:41 AM, Hao Zhu <[email protected]> wrote:
> >
> > > Hi Team,
> > >
> > > I want to raise a topic about the standard for parquet nested data
> > > types. First, let me show a simple example.
> > >
> > > Sample JSON file:
> > > {"c1":[1,2,3]}
> > >
> > > Using Spark to convert it to parquet, the schema is:
> > >
> > > c1: OPTIONAL F:1
> > > .bag: REPEATED F:1
> > > ..array: OPTIONAL INT64 R:1 D:3
> > >
> > > Using Drill to create the parquet file, the schema is:
> > >
> > > c1: REPEATED INT64 R:1 D:1
> > >
> > > As a result, Drill cannot read the parquet nested data types
> > > generated by Spark, or even Hive (see DRILL-1999
> > > <https://issues.apache.org/jira/browse/DRILL-1999>).
> > > The Spark community's answer to this question about the standard for
> > > parquet nested data types is here:
> > > https://www.mail-archive.com/[email protected]/msg35663.html
> > >
> > > What is Drill's standpoint on this topic? Do we need to reach an
> > > agreement on the standard for nested data types in the parquet
> > > community?
> > >
> > > Any comment is welcome.
> > >
> > > Thanks,
> > > Hao
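P.S. Until the logical types are interpreted, one possible workaround is to unnest the Spark layout by hand with FLATTEN, since Drill already exposes the column as a map containing the repeated `bag` group. This is an untested sketch against the file from your example (the aliases t, f, elem, and c1_elem are made up, and `array` has to be back-quoted because ARRAY is a reserved word in Drill's SQL):

-- Flatten the repeated inner group, then project the element field.
select f.elem.`array` as c1_elem
from (
  select flatten(t.c1.`bag`) as elem
  from dfs.`/tmp/testjson_spark/part-r-00001.parquet` t
) f;

If it works, that should give one row per element (1, 2, 3) instead of the single nested map.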
