Thanks, and I do not want to argue about whether Drill's Parquet format is
valid or whether Spark/Hive is doing the right thing.
My current concern is that nested types in Parquet files generated by
Spark/Hive cannot be read properly in Drill.
Take the previous simple list as an example, and suppose it has been
converted by Spark to a Parquet file.
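The conversion was presumably done along these lines (a minimal sketch
using the Spark 1.x shell API; the input path /tmp/testjson.json is
hypothetical):
val json = sqlContext.jsonFile("/tmp/testjson.json")  // contains {"c1":[1,2,3]}
json.saveAsParquetFile("/tmp/testjson_spark")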
1. Spark can read it as a list:
val parquetFile = sqlContext.parquetFile("/tmp/testjson_spark/part-r-00001.parquet")
parquetFile.registerTempTable("parquetFile")
val myresult = sqlContext.sql("SELECT * FROM parquetFile limit 10")
myresult.map(t => "Name: " + t(0)).collect().foreach(println)

Name: ArrayBuffer(1, 2, 3)
2. Drill can only return it like this:
> select * from dfs.`/tmp/testjson_spark/part-r-00001.parquet`;
+------------------------------------------------+
|                       c1                       |
+------------------------------------------------+
| {"bag":[{"array":1},{"array":2},{"array":3}]}  |
+------------------------------------------------+
1 row selected (0.335 seconds)
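Note that Spark itself collapses the three Parquet levels back into a
single logical array on read; a quick check (the printed schema is
approximate):
parquetFile.printSchema()
// root
//  |-- c1: array (nullable = true)
//  |    |-- element: long (containsNull = true)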
Is there any trick to reading the list properly in Drill?
Thanks,
Hao
On Fri, Aug 28, 2015 at 4:20 PM, Steven Phillips <[email protected]> wrote:
> Both Parquet's and Drill's internal data models are based on protobuf,
> meaning there are required, optional, and repeated fields. In this model,
> repeated fields cannot be null, nor can they have null elements. The
> 3-layer nested structure is necessary to represent a field where both the
> array itself and its elements are nullable.
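>
> To make the three layers concrete, here is one reading of the
> Spark-generated schema quoted below (annotations are editorial):
>
>   c1: OPTIONAL       <- the list itself may be null
>   .bag: REPEATED     <- one repetition per list element
>   ..array: OPTIONAL  <- each element may be null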
>
> We are going to add nullability to repeated types in Drill, and when we do
> so, it would make sense to adopt the same format for representing them in
> parquet that other projects have adopted.
>
> At the same time, I would argue that the fact that Drill writes the
> parquet data in a different format than Spark SQL is not a problem. The
> format that Drill currently writes is perfectly valid, and other parquet
> tools should be able to interpret it just fine. It's just that this way of
> writing an array doesn't allow for null values (neither a null array nor
> null elements), which Drill internally doesn't currently support anyway.
>
> On Fri, Aug 28, 2015 at 11:41 AM, Hao Zhu <[email protected]> wrote:
>
> > Hi Team,
> >
> > I want to raise one topic about the standard for Parquet nested data
> > types. First, let me show you a simple example.
> >
> > Sample Json file:
> > {"c1":[1,2,3]}
> >
> > Using Spark to convert it to parquet, the schema is:
> > c1: OPTIONAL F:1
> > .bag: REPEATED F:1
> > ..array: OPTIONAL INT64 R:1 D:3
> >
> > Using Drill to create the Parquet file, the schema will be:
> > c1: REPEATED INT64 R:1 D:1
> >
> > So this is why Drill cannot read the nested Parquet data types
> > generated by Spark, or even Hive (see DRILL-1999
> > <https://issues.apache.org/jira/browse/DRILL-1999>).
> > The Spark community's answer to this question about a standard for
> > Parquet nested data types is here:
> > https://www.mail-archive.com/[email protected]/msg35663.html
> >
> > What is Drill's standpoint on this topic? Do we need to reach some
> > agreement on a standard for nested data types in the Parquet community?
> >
> > Any comment is welcome.
> >
> > Thanks,
> > Hao
> >
>