[jira] [Commented] (SPARK-8128) Schema Merging Broken: Dataframe Fails to Recognize Column in Schema

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15558088#comment-15558088
 ] 

Hyukjin Kwon commented on SPARK-8128:
-------------------------------------

I am not 100% sure, but I recall a similar issue being resolved. Could you 
confirm this is still happening in recent versions? - [~brdwrd]
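
For reference, a minimal repro sketch that could be used to check (assuming 
the Spark 2.x read API; the /tmp paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Two parquet directories with different schemas: only the second one
    # contains mediaProcessingState.
    spark.range(5).selectExpr("id AS _id") \
        .write.parquet("/tmp/spark-8128/a")
    spark.range(5).selectExpr("id AS _id", "id AS mediaProcessingState") \
        .write.parquet("/tmp/spark-8128/b")

    # Read both with schema merging, then filter on the merged-in column.
    df = spark.read.option("mergeSchema", "true") \
        .parquet("/tmp/spark-8128/a", "/tmp/spark-8128/b")
    df.printSchema()
    df.filter(df.mediaProcessingState == 3).count()  # failed on 1.3/1.4 per this report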

> Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
> --------------------------------------------------------------------
>
> Key: SPARK-8128
> URL: https://issues.apache.org/jira/browse/SPARK-8128
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.3.0, 1.3.1, 1.4.0
>Reporter: Brad Willard
>
> I'm loading a folder of about 600 parquet files into one dataframe, so 
> schema merging is involved. There is a bug in the schema merging: printing 
> the schema shows all the attributes, but running a query that filters on one 
> of those attributes errors saying it's not in the schema. The filter is 
> incorrectly pushed down to one of the parquet files that does not have that 
> attribute.
> sdf = sql_context.parquet('/parquet/big_data_folder')
> sdf.printSchema()
> root
>  |-- _id: string (nullable = true)
>  |-- addedOn: string (nullable = true)
>  |-- attachment: string (nullable = true)
>  ...
>  |-- items: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- _id: string (nullable = true)
>  |    |    |-- addedOn: string (nullable = true)
>  |    |    |-- authorId: string (nullable = true)
>  |    |    |-- mediaProcessingState: long (nullable = true)
>  |-- mediaProcessingState: long (nullable = true)
>  |-- title: string (nullable = true)
>  |-- key: string (nullable = true)
> sdf.filter(sdf.mediaProcessingState == 3).count()
> causes this exception
> Py4JJavaError: An error occurred while calling o67.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 1106 in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in 
> stage 4.0 (TID 70565, XXX): java.lang.IllegalArgumentException: 
> Column [mediaProcessingState] was not found in schema!
> at parquet.Preconditions.checkArgument(Preconditions.java:47)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
> at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
> at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
> at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at 
> 

[jira] [Commented] (SPARK-8128) Schema Merging Broken: Dataframe Fails to Recognize Column in Schema

2015-06-30 Thread Brad Willard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609100#comment-14609100
 ] 

Brad Willard commented on SPARK-8128:
-------------------------------------

This is only broken for queries. I can load the dataframe from the unquery-able 
parquet files and re-save it as a new parquet file, and the schema merging will 
then produce a query-able parquet file.

So my current workaround is to generate the partial parquet files and use the 
schema merging to make a new final parquet file, which sucks, but at least I 
can query it.
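
A sketch of that workaround with the 1.3/1.4 API (paths are placeholders):

    # Load the folder with schema merging, then re-save; the rewritten files
    # all carry the full merged schema, so filters work against the copy.
    sdf = sql_context.parquet('/parquet/big_data_folder')
    sdf.saveAsParquetFile('/parquet/big_data_folder_merged')

    merged = sql_context.parquet('/parquet/big_data_folder_merged')
    merged.filter(merged.mediaProcessingState == 3).count()

Since the stack trace fails inside Parquet's row-group filtering 
(RowGroupFilter.filterRowGroups), it may also be worth trying 
sql_context.setConf('spark.sql.parquet.filterPushdown', 'false') so the 
predicate is not validated against each file's own schema, though that is an 
untested guess on my part.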
