Hi Don, I believe this should be fixed in 2.0.1. Please refer to this PR: https://github.com/apache/spark/pull/14339
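If upgrading is not an option right away, a possible workaround (just a sketch of the renaming approach, not the actual fix in that PR; the stripDots helper and the "squares_nodots" path are made-up names) is to rename the dotted columns to dot-free names before writing the Parquet files, for example in spark-shell:

import org.apache.spark.sql.DataFrame
import spark.implicits._  // already in scope in spark-shell

// Rename every column whose name contains a '.' so Spark 2.0.0 can resolve it
// on read; here the dot is simply replaced with '_'.
def stripDots(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (d, name) =>
    if (name.contains(".")) d.withColumnRenamed(name, name.replace(".", "_")) else d
  }

val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "squared.value")

// Write and read back with dot-free column names; the read no longer hits the
// "Unable to resolve" AnalysisException.
stripDots(squaresDF).write.parquet("squares_nodots")
val inSquare = spark.read.parquet("squares_nodots")
inSquare.take(2)

This only helps if the downstream jobs can tolerate the substitute names, or if you rename back after reading.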
On 1 Sep 2016 2:48 a.m., "Don Drake" <dondr...@gmail.com> wrote:
> I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark 2.0 and have encountered some interesting issues.
>
> First, it seems the SQL parsing is different, and I had to rewrite some SQL that was doing a mix of inner joins (using where syntax, not inner) and outer joins to get the SQL to work. It was complaining about columns not existing. I can't reproduce that one easily and can't share the SQL. Just curious if anyone else is seeing this?
>
> I do have a showstopper problem with Parquet datasets that have fields containing a "." in the field name. This data comes from an external provider (CSV) and we just pass through the field names. This has worked flawlessly in Spark 1.5 and 1.6, but now Spark can't seem to read these Parquet files.
>
> I've reproduced a trivial example below. Jira created: https://issues.apache.org/jira/browse/SPARK-17341
>
> Spark context available as 'sc' (master = local[*], app id = local-1472664486578).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "squared.value")
> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
>
> scala> squaresDF.take(2)
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
>
> scala> squaresDF.write.parquet("squares")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> [the INFO lines above repeat once per Parquet writer]
> Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory Scaling row group sizes to 96.54% for 7 writers
> Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory Scaling row group sizes to 84.47% for 8 writers
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 16
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [value] INT32: 1 values, 4B raw, 6B comp, 1 pages, encodings: [BIT_PACKED, PLAIN]
> [the flush/write lines above repeat for each writer and column chunk]
>
> scala> val inSquare = spark.read.parquet("squares")
> inSquare: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
>
> scala> inSquare.take(2)
> org.apache.spark.sql.AnalysisException: Unable to resolve squared.value given [value, squared.value];
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2558)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
>   ... 48 elided
>
> scala>
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143