Both the Parquet and Drill internal data models are based on protobuf's, meaning
there are required, optional, and repeated fields. In this model, repeated
fields cannot themselves be null, nor can they contain null elements. The
3-level nested structure is necessary to represent a field where both the
array itself and the elements of the array are nullable.
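
As a sketch, the 3-level encoding that the Parquet format documentation
describes for lists looks like this (the `list` and `element` field names are
the spec's conventions; `c1` matches the example later in this thread):

```
optional group c1 (LIST) {
  repeated group list {
    optional int64 element;
  }
}
```

The outer `optional` makes the array itself nullable, and the inner `optional`
makes individual elements nullable; the middle `repeated` group carries the
repetition, which by itself can express neither kind of null.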

We are going to add nullability to repeated types in Drill, and when we do
so, it would make sense to adopt the same format for representing them in
Parquet that other projects have adopted.

At the same time, I would argue that the fact that Drill writes Parquet
data in a different format than Spark SQL is not itself a problem. The format
that Drill currently writes is perfectly valid, and other Parquet tools should
be able to interpret it just fine. It's just that this way of writing an
array doesn't allow for null values, which Drill internally doesn't
currently support anyway.

On Fri, Aug 28, 2015 at 11:41 AM, Hao Zhu <[email protected]> wrote:

> Hi Team,
>
> I want to raise one topic about the Standard of Parquet nested data types.
> Firstly let me show you one simple example.
>
> Sample Json file:
> {"c1":[1,2,3]}
>
> Using Spark to convert it to parquet, the schema is:
>  c1:          OPTIONAL F:1
> .bag:        REPEATED F:1
> ..array:     OPTIONAL INT64 R:1 D:3
>
> Using Drill to create parquet file, schema will be:
> c1:          REPEATED INT64 R:1 D:1
>
> So this means that Drill cannot read the Parquet nested data types
> generated by Spark, or even Hive (see DRILL-1999
> <https://issues.apache.org/jira/browse/DRILL-1999>).
> The Spark community's answer to this question about the standard for Parquet
> nested data types is in:
> https://www.mail-archive.com/[email protected]/msg35663.html
>
> What is Drill's stand point on this topic? Do we need to make some
> agreement on the standard of nested data types in the Parquet community?
>
> Any comment is welcome.
>
> Thanks,
> Hao
>
