pin_zhang created SPARK-23371: --------------------------------- Summary: Parquet Footer data is wrong on window in parquet format partition table Key: SPARK-23371 URL: https://issues.apache.org/jira/browse/SPARK-23371 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.2, 2.1.1 Reporter: pin_zhang
On window Run SQL in spark shell spark.sql("create table part_test (id string )partitioned by( index int) stored as parquet") spark.sql("insert into part_test partition (index =1) values ('1')") Get exception when query spark.sql("select * from part_test ").show() For the parquet.Version in parquet-hadoop-bundle-1.6.0.jar cannot load the version info in spark on window. Classloader try to get version in the parquet-format-2.3.0-incubating.jar 18/02/09 16:58:48 WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr org.apache.parquet.VersionParser$VersionParseException: Could not parse created_ by: parquet-mr using format: (.+) version ((.*) )?(build ?(.*)) at org.apache.parquet.VersionParser.parse(VersionParser.java:112) at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptSt atistics.java:60) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParq uetStatistics(ParquetMetadataConverter.java:263) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(Parque tFileReader.java:583) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetF ileReader.java:513) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR ecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR ecordReader.nextBatch(VectorizedParquetRecordReader.java:225) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR ecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNe xt(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNex t(FileScanRDD.scala:109) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIt erator(FileScanRDD.scala:184) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNex t(FileScanRDD.scala:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIte rator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIte rator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRo wIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon $1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.s cala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.s cala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$ap ply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$ap ply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala: 38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor. java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor .java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org