Hi Don, I believe this should be fixed in 2.0.1. Please refer to this PR: https://github.com/apache/spark/pull/14339
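If upgrading is not an option right away, a possible workaround (just a sketch of the renaming approach, not the actual fix in that PR; the stripDots helper and the "squares_nodots" path are made-up names) is to rename the dotted columns to dot-free names before writing the Parquet files, for example in spark-shell:

import org.apache.spark.sql.DataFrame
import spark.implicits._  // already in scope in spark-shell

// Rename every column whose name contains a '.' so Spark 2.0.0 can resolve it
// on read; here the dot is simply replaced with '_'.
def stripDots(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (d, name) =>
    if (name.contains(".")) d.withColumnRenamed(name, name.replace(".", "_")) else d
  }

val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "squared.value")

// Write and read back with dot-free column names; the read no longer hits the
// "Unable to resolve" AnalysisException.
stripDots(squaresDF).write.parquet("squares_nodots")
val inSquare = spark.read.parquet("squares_nodots")
inSquare.take(2)

This only helps if the downstream jobs can tolerate the substitute names, or if you rename back after reading.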
On 1 Sep 2016 2:48 a.m., "Don Drake" <dondr...@gmail.com> wrote:
> I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark 2.0 and have encountered some interesting issues.
>
> First, it seems the SQL parsing is different, and I had to rewrite some SQL that was doing a mix of inner joins (using where syntax, not inner) and outer joins to get the SQL to work. It was complaining about columns not existing. I can't reproduce that one easily and can't share the SQL. Just curious if anyone else is seeing this?
>
> I do have a showstopper problem with Parquet datasets that have fields containing a "." in the field name. This data comes from an external provider (CSV) and we just pass through the field names. This has worked flawlessly in Spark 1.5 and 1.6, but now Spark can't seem to read these Parquet files.
>
> I've reproduced a trivial example below. Jira created: https://issues.apache.org/jira/browse/SPARK-17341
>
> Spark context available as 'sc' (master = local[*], app id = local-1472664486578).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "squared.value")
> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
>
> scala> squaresDF.take(2)
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
>
> scala> squaresDF.write.parquet("squares")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> [the INFO lines above repeat once per Parquet writer]
> Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory Scaling row group sizes to 96.54% for 7 writers
> Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory Scaling row group sizes to 84.47% for 8 writers
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 16
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [value] INT32: 1 values, 4B raw, 6B comp, 1 pages, encodings: [BIT_PACKED, PLAIN]
> [the flush/write lines above repeat for each writer and column chunk]
>
> scala> val inSquare = spark.read.parquet("squares")
> inSquare: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
>
> scala> inSquare.take(2)
> org.apache.spark.sql.AnalysisException: Unable to resolve squared.value given [value, squared.value];
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2558)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
>   ... 48 elided
>
> scala>
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143