[jira] [Commented] (SPARK-8128) Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
[ https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15558088#comment-15558088 ]

Hyukjin Kwon commented on SPARK-8128:
-------------------------------------

I am not 100% sure, but I recall seeing a similar issue resolved. Could you confirm this is still happening in recent versions? - [~brdwrd]

> Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
> --------------------------------------------------------------------
>
>                 Key: SPARK-8128
>                 URL: https://issues.apache.org/jira/browse/SPARK-8128
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0
>            Reporter: Brad Willard
>
> I'm loading a folder of about 600 parquet files into one dataframe, so schema merging is involved. There is a bug in the schema merging: printing the schema shows all the attributes, but running a query that filters on one of those attributes errors out, saying the attribute is not in the schema. The query is incorrectly pushed down to one of the parquet files that does not have that attribute.
>
> sdf = sql_context.parquet('/parquet/big_data_folder')
> sdf.printSchema()
>
> root
>  |-- _id: string (nullable = true)
>  |-- addedOn: string (nullable = true)
>  |-- attachment: string (nullable = true)
>  ...
>  |-- items: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- _id: string (nullable = true)
>  |    |    |-- addedOn: string (nullable = true)
>  |    |    |-- authorId: string (nullable = true)
>  |    |    |-- mediaProcessingState: long (nullable = true)
>  |-- mediaProcessingState: long (nullable = true)
>  |-- title: string (nullable = true)
>  |-- key: string (nullable = true)
>
> sdf.filter(sdf.mediaProcessingState == 3).count()
>
> causes this exception:
>
> Py4JJavaError: An error occurred while calling o67.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1106 in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in stage 4.0 (TID 70565, XXX): java.lang.IllegalArgumentException: Column [mediaProcessingState] was not found in schema!
>     at parquet.Preconditions.checkArgument(Preconditions.java:47)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>     at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>     at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>     at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>     at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>     at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>     at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>     at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
>     at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     ...
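[Editor's note] The failure mode in the trace is easier to see side by side: the dataframe's printed schema is the union of all the per-file footers, while the Parquet row-group filter (`SchemaCompatibilityValidator`) validates a pushed-down predicate against each individual file's own schema. A toy sketch in plain Python — not Spark's actual implementation, just the shape of the mismatch:

```python
# Per-file schemas: only some parquet files carry 'mediaProcessingState'.
file_a = {"_id": "string", "title": "string"}
file_b = {"_id": "string", "title": "string", "mediaProcessingState": "long"}

# printSchema() reflects the union of all file footers...
merged = {**file_a, **file_b}
assert "mediaProcessingState" in merged

# ...but the pushed-down filter is checked per file, so the predicate on a
# merged-only column fails on any file that lacks the column, mirroring
# "Column [mediaProcessingState] was not found in schema!".
def validate_filter(column, file_schema):
    if column not in file_schema:
        raise ValueError("Column [%s] was not found in schema!" % column)

validate_filter("mediaProcessingState", file_b)   # passes
try:
    validate_filter("mediaProcessingState", file_a)
except ValueError as e:
    print(e)
```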
[ https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609100#comment-14609100 ]

Brad Willard commented on SPARK-8128:
-------------------------------------

This is only broken on queries. I can load the dataframe from the unquery-able parquet files and re-save it as a new parquet file, and schema merging then produces a query-able parquet file. So my current workaround is to generate the partial parquet files and use schema merging to write a new, final parquet file, which sucks, but at least I can query it.
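[Editor's note] The read-then-rewrite workaround described in the comment above can be sketched roughly as follows, using the Spark 1.3-era DataFrame API (`saveAsParquetFile`; path names hypothetical). This is a sketch requiring a running `SQLContext`, not a standalone script:

```
# Read the folder with schema merging: the resulting dataframe carries the
# full union schema.
sdf = sql_context.parquet('/parquet/big_data_folder')

# Re-save as a single consolidated dataset, so every output file's footer
# carries the merged schema.
sdf.saveAsParquetFile('/parquet/big_data_folder_merged')

# The rewritten dataset is now query-able with predicate filters.
merged = sql_context.parquet('/parquet/big_data_folder_merged')
merged.filter(merged.mediaProcessingState == 3).count()
```

Separately — not from this report, just a hedged guess for this class of error — disabling Parquet predicate pushdown (e.g. `sql_context.setConf("spark.sql.parquet.filterPushdown", "false")`) may avoid the per-file filter validation entirely, at the cost of reading more data.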