Per the previous email thread from the Spark community, it seems they are
following this Parquet logical-type standard:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types

Should Drill follow the same?


On Tue, Sep 1, 2015 at 10:09 AM, Steven Phillips <[email protected]> wrote:

> No, there is no trick. This is because Drill reads the data as it is
> physically written. At some point, we will add the ability to interpret
> these types according to their logical type. However, that will require
> that the Parquet files are written with the correct OriginalType metadata.
> I don't know whether Hive or Spark is currently doing this.
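>
> For what it's worth, the OriginalType metadata in question is the LIST
> annotation from the parquet-format spec. A file written with it would carry
> a schema along these lines (a sketch of the spec's representation, not
> output from any particular writer):
>
>   optional group c1 (LIST) {
>     repeated group list {
>       optional int64 element;
>     }
>   }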
>
> On Fri, Aug 28, 2015 at 4:44 PM, Hao Zhu <[email protected]> wrote:
>
> > Thanks. I do not want to argue about whether Drill's Parquet format is
> > valid or whether Spark/Hive is doing the right thing.
> >
> > My current concern is that nested types in Parquet files generated by
> > Spark/Hive cannot be read properly in Drill.
> >
> > Take the previous simple list as an example. If it is converted by
> > Spark to a Parquet file:
> > 1. Spark can read it as a list
> > val parquetFile =
> > sqlContext.parquetFile("/tmp/testjson_spark/part-r-00001.parquet")
> > parquetFile.registerTempTable("parquetFile")
> > val myresult = sqlContext.sql("SELECT * FROM parquetFile limit 10")
> > myresult.map(t => "Name: " + t(0)).collect().foreach(println)
> >
> > Name: ArrayBuffer(1, 2, 3)
> >
> > 2. Drill can only return it like this:
> > > select * from dfs.`/tmp/testjson_spark/part-r-00001.parquet`;
> > +------------------------------------------------+
> > |                       c1                       |
> > +------------------------------------------------+
> > | {"bag":[{"array":1},{"array":2},{"array":3}]}  |
> > +------------------------------------------------+
> > 1 row selected (0.335 seconds)
> >
> >
> > Is there any trick to reading the list properly in Drill?
> >
> > Thanks,
> > Hao
> >
> >
> >
> >
> >
> > On Fri, Aug 28, 2015 at 4:20 PM, Steven Phillips <[email protected]>
> > wrote:
> >
> > > Both the Parquet and Drill internal data models are based on protobuf,
> > > meaning there are required, optional, and repeated fields. In this
> > > model, repeated fields cannot be null, nor can they have null elements.
> > > The 3-level nested structure is necessary to represent a field where
> > > the array itself is nullable, as well as the elements of the array.
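> > >
> > > For the example list in the original mail below, the annotated 3-level
> > > form from LogicalTypes.md expresses both nullability points explicitly
> > > (a sketch of the spec's representation, not output from any particular
> > > writer):
> > >
> > >   optional group c1 (LIST) {   <- the list itself may be null
> > >     repeated group list {      <- one entry per array element
> > >       optional int64 element;  <- each element may be null
> > >     }
> > >   }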
> > >
> > > We are going to add nullability to repeated types in Drill, and when
> > > we do so, it would make sense to adopt the same format for representing
> > > them in Parquet that other projects have adopted.
> > >
> > > At the same time, I would argue that the fact that Drill writes the
> > > Parquet data in a different format than Spark SQL is not a problem. The
> > > format that Drill currently writes is perfectly valid, and other Parquet
> > > tools should be able to interpret it just fine. It's just that this way
> > > of writing an array doesn't allow for null values, which Drill doesn't
> > > currently support internally anyway.
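> > >
> > > Concretely, a sketch of the schema Drill writes today for the same
> > > column would be just the single-level form:
> > >
> > >   repeated int64 c1;
> > >
> > > which is valid Parquet, but has nowhere to record a null list or a
> > > null element.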
> > >
> > > On Fri, Aug 28, 2015 at 11:41 AM, Hao Zhu <[email protected]> wrote:
> > >
> > > > Hi Team,
> > > >
> > > > I want to raise a topic about the standard for Parquet nested data
> > > > types. First, let me show you a simple example.
> > > >
> > > > Sample JSON file:
> > > > {"c1":[1,2,3]}
> > > >
> > > > Using Spark to convert it to a Parquet file, the schema is:
> > > > c1:          OPTIONAL F:1
> > > > .bag:        REPEATED F:1
> > > > ..array:     OPTIONAL INT64 R:1 D:3
> > > >
> > > > Using Drill to create the Parquet file, the schema will be:
> > > > c1:          REPEATED INT64 R:1 D:1
> > > >
> > > > As a result, Drill cannot read the Parquet nested data types
> > > > generated by Spark, or even by Hive (see DRILL-1999
> > > > <https://issues.apache.org/jira/browse/DRILL-1999>).
> > > > The Spark community's answer to this question about the standard for
> > > > Parquet nested data types is here:
> > > > https://www.mail-archive.com/[email protected]/msg35663.html
> > > >
> > > > What is Drill's standpoint on this topic? Do we need to reach some
> > > > agreement on a standard for nested data types in the Parquet
> > > > community?
> > > >
> > > > Any comment is welcome.
> > > >
> > > > Thanks,
> > > > Hao
> > > >
> > >
> >
>
