[jira] [Updated] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+

2016-07-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-16344:
-
Target Version/s: 2.1.0  (was: 2.0.0)

> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> 
>
> Key: SPARK-16344
> URL: https://issues.apache.org/jira/browse/SPARK-16344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> This is a weird corner case. Users may hit this issue if they have a schema 
> that
> # has an array field whose element type is a struct, and
> # the struct has one and only one field, and
> # that field is named "element".
> The following Spark shell snippet for Spark 1.6 reproduces this bug:
> {code}
> case class A(element: Long)
> case class B(f: Array[A])
> val path = "/tmp/silly.parquet"
> Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path)
> val df = sqlContext.read.parquet(path)
> df.printSchema()
> // root
> //  |-- f0: array (nullable = true)
> //  |    |-- element: struct (containsNull = true)
> //  |    |    |-- element: long (nullable = true)
> df.show()
> {code}
> Exception thrown:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: Expected instance of group converter 
> but got 
> "org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter"
> at 
> org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
> at 
> org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:266)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
> at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
> at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
> ... 

[jira] [Updated] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+

2016-07-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16344:
---
Description: 
This is a weird corner case. Users may hit this issue if they have a schema that

# has an array field whose element type is a struct, and
# the struct has one and only one field, and
# that field is named "element".

The following Spark shell snippet for Spark 1.6 reproduces this bug:

{code}
case class A(element: Long)
case class B(f: Array[A])

val path = "/tmp/silly.parquet"
Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path)

val df = sqlContext.read.parquet(path)
df.printSchema()
// root
//  |-- f0: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- element: long (nullable = true)

df.show()
{code}

Exception thrown:

{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
block -1 in file 
file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: Expected instance of group converter 
but got 
"org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter"
at 
org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
at 
org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:266)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
... 26 more
{noformat}

Spark 2.0.0-SNAPSHOT and Spark master also suffer from this issue. To reproduce 
it with these versions, just replace {{sqlContext}} in the above snippet with 
{{spark}}.
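
For reference, the Spark 2.0 variant of the read step is just (a minimal 
sketch, reusing the {{path}} from the snippet above):

{code}
// Spark 2.0+: SparkSession replaces SQLContext as the entry point
val df = spark.read.parquet(path)
df.printSchema()
df.show()  // fails with the same ParquetDecodingException
{code}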

The reason behind this is related to the Parquet backwards-compatibility rules 
for LIST types defined in the [parquet-format 
spec|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists].

The Spark SQL schema shown above

{noformat}
root
 |-- f0: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- element: long (nullable = true)
{noformat}

is equivalent to the following SQL type:

{noformat}
STRUCT<
  f: ARRAY<
    STRUCT<element: BIGINT>
  >
>
{noformat}
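
For illustration, the same shape can be built with Spark SQL's type API (a 
sketch; the column name {{f0}} follows the repro snippet above, and the long 
field maps to {{LongType}}):

{code}
import org.apache.spark.sql.types._

// ARRAY of a single-field STRUCT whose only field is named "element"
val schema = StructType(Seq(
  StructField("f0",
    ArrayType(
      StructType(Seq(StructField("element", LongType, nullable = true))),
      containsNull = true),
    nullable = true)))
{code}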

According to the parquet-format spec, the standard layout of a LIST-like 
structure is a 3-level layout.

[jira] [Updated] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+

2016-07-01 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16344:
---
Description: 
Array of struct with a single field name "element" can't be decoded from 
Parquet files written by Spark 1.6+

The following Spark shell snippet for Spark 1.6 reproduces this bug:

{code}
case class A(element: Long)
case class B(f: Array[A])

val path = "/tmp/silly.parquet"
Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path)

val df = sqlContext.read.parquet(path)
df.printSchema()
// root
//  |-- f0: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- element: long (nullable = true)

df.show()
{code}

Exception thrown:

{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
block -1 in file 
file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: Expected instance of group converter 
but got 
"org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter"
at 
org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
at 
org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:266)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
... 26 more
{noformat}

Spark 2.0.0-SNAPSHOT and Spark master also suffer from this issue. To reproduce 
it with these versions, just replace {{sqlContext}} in the above snippet with 
{{spark}}.

The reason behind this is related to the Parquet backwards-compatibility rules 
for LIST types defined in the [parquet-format 
spec|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists].

The Spark SQL schema shown above

{noformat}
root
 |-- f0: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- element: long (nullable = true)
{noformat}

is equivalent to the following SQL type:

{noformat}
STRUCT<
  f: ARRAY<
    STRUCT<element: BIGINT>
  >
>
{noformat}

According to the parquet-format spec, the standard layout of a LIST-like 
structure is a 3-level layout:

{noformat}
<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}
{noformat}
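
For the repro schema at the top of this description, instantiating this 
standard layout would give roughly the following file schema (a sketch, 
assuming the writer uses the standard 3-level layout; {{spark_schema}} is the 
message name Spark typically emits):

{noformat}
message spark_schema {
  optional group f0 (LIST) {
    repeated group list {
      optional group element {
        optional int64 element;
      }
    }
  }
}
{noformat}

Note that the standard layout's {{element}} wrapper group and the struct's own 
single field end up with the same name, {{element}}. Since the 
backwards-compatibility rules above also have to accept legacy 2-level layouts 
in which the repeated group is the element type itself, this name clash appears 
to be what confuses the reader and leads to the ClassCastException shown 
earlier.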