Hi Jim,

Unfortunately this is neither possible in Spark nor a standard practice for Parquet.
In your case, repeated int64 c1 actually doesn't capture the full semantics, because it represents a *required* array of long values containing zero or more *non-null* elements. However, when inferring the schema from JSON files, it's not safe to assume any field is non-nullable, so we always generate a nullable schema for JSON. The schema generated by Spark SQL for the JSON snippet you provided is:

message root {
  optional group c1 (LIST) {
    repeated group bag {
      optional int64 array;
    }
  }
}

The outer optional means the array field c1 itself can be null, and the inner optional means elements contained in the array can also be null. That's the reason why parquet-format defines a 3-level structure to represent LIST; this is different from Protocol Buffers. Another thing to note is that extra nested levels are super cheap in Parquet (almost zero cost), because only leaf nodes are materialized in the physical file.

If you are worried about interoperability with other Parquet libraries like parquet-protobuf, then to the best of my knowledge, Spark SQL 1.5 is currently the only system that can correctly interpret Parquet files generated by other systems, because it is the only one that implements all the backwards-compatibility rules defined in parquet-format.

As for Parquet compatibility, it's a little bit complicated and requires some background knowledge to understand. You may find more details in this section of the parquet-format spec <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.

Although Parquet was designed with interoperability in mind, in the early days the format spec didn't explicitly specify how nested types like LIST and MAP should be represented. The consequence is that different Parquet libraries, including Spark SQL, all use different representations and are incompatible with each other in many cases. For example, to represent a required list of strings containing no null values, all of the Parquet schemas below are valid:

// parquet-protobuf style
message m0 {
  repeated binary f (UTF8);
}

// parquet-avro style
message m1 {
  required group f (LIST) {
    repeated binary array (UTF8);
  }
}

// parquet-thrift style
message m2 {
  required group f (LIST) {
    repeated binary f_tuple (UTF8);
  }
}

// standard layout defined in the most recent parquet-format spec
message m3 {
  required group f (LIST) {
    repeated group list {
      required binary element (UTF8);
    }
  }
}

Apparently, this badly hurts Parquet interoperability. To fix this issue, parquet-format recently defined standard layouts for nested types <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types>, as well as backwards-compatibility rules for reading legacy Parquet files (1 <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>, 2 <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps>).

We implemented all of these rules on the read path in Spark SQL 1.5, which means we can now read non-standard legacy Parquet files generated by various systems. However, we haven't refactored the write path to follow the spec yet; that is a task for 1.6.
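In case it helps to see this end to end, here is a minimal spark-shell sketch (using the DataFrameReader/Writer API available since 1.4 instead of load/save; the paths are just taken from your example, and the commented output is what I'd expect to see, not copied from a run):

// Schemas inferred from JSON are always nullable: both the array field
// and its elements are marked nullable.
val df = sqlContext.read.json("/tmp/test.json")
df.printSchema()
// root
//  |-- c1: array (nullable = true)
//  |    |-- element: long (containsNull = true)

// Writing this DataFrame produces the bag/array layout shown above,
// since the 1.5 write path doesn't follow the standard layout yet.
df.repartition(1).write.parquet("/tmp/testjson_spark")

// The 1.5 read path understands this layout as well as the legacy
// layouts written by parquet-avro, parquet-thrift, parquet-protobuf, etc.
val back = sqlContext.read.parquet("/tmp/testjson_spark")
back.printSchema()

The extra bag level only exists in the physical Parquet layout; the DataFrame schema you work with is still just a nullable array of longs.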
HTH.

Cheng

On Fri, Aug 28, 2015 at 6:53 AM, Jim Green <openkbi...@gmail.com> wrote:

> Hi Team,
>
> Say I have a test.json file: {"c1":[1,2,3]}
> I can create a parquet file like:
> var df = sqlContext.load("/tmp/test.json","json")
> var df_c = df.repartition(1)
> df_c.select("*").save("/tmp/testjson_spark","parquet")
>
> The output parquet file's schema is like:
> c1: OPTIONAL F:1
> .bag: REPEATED F:1
> ..array: OPTIONAL INT64 R:1 D:3
>
> Is there any way to avoid using ".bag"? Instead, can we create the
> parquet file using column type "REPEATED INT64"?
> The expected data type is:
> c1: REPEATED INT64 R:1 D:1
>
> Thanks!
> --
> Thanks,
> www.openkb.info
> (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)