[jira] [Comment Edited] (SPARK-36861) Partition columns are overly eagerly parsed as dates
[ https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423402#comment-17423402 ]

Senthil Kumar edited comment on SPARK-36861 at 10/1/21, 7:24 PM:
----------------------------------------------------------------

Yes, in Spark 3.3 the 'hour' column is created as DateType, but I can still see the hour part in the subdirectories that were created:

{code:scala}
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(("2021-01-01T00", 0), ("2021-01-01T01", 1), ("2021-01-01T02", 2)).toDF("hour", "i")
df: org.apache.spark.sql.DataFrame = [hour: string, i: int]

scala> df.write.partitionBy("hour").parquet("/tmp/t1")

scala> spark.read.parquet("/tmp/t1").schema
res1: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(hour,DateType,true))
{code}

and the subdirectories created are:

{code}
ls -l
total 0
-rw-r--r--  1 senthilkumar  wheel    0 Oct  2 00:44 _SUCCESS
drwxr-xr-x  4 senthilkumar  wheel  128 Oct  2 00:44 hour=2021-01-01T00
drwxr-xr-x  4 senthilkumar  wheel  128 Oct  2 00:44 hour=2021-01-01T01
drwxr-xr-x  4 senthilkumar  wheel  128 Oct  2 00:44 hour=2021-01-01T02
{code}

It would be helpful if you could share the list of subdirectories created in your case.

> Partition columns are overly eagerly parsed as dates
> ----------------------------------------------------
>
>                 Key: SPARK-36861
>                 URL: https://issues.apache.org/jira/browse/SPARK-36861
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Tanel Kiis
>            Priority: Blocker
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> In Spark 3.1 the 'hour' column is parsed as a string type, but in the 3.2 RC
> it is parsed as a date type and the hour part is lost.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
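[Editor's note] For readers hitting this while the fix lands: if the goal is simply to keep 'hour' as a string, partition column type inference can be bypassed at read time. The sketch below is illustrative, not from the ticket; it assumes an active SparkSession named `spark` and the /tmp/t1 layout from the repro above, and uses the standard `spark.sql.sources.partitionColumnTypeInference.enabled` conf plus an explicit read schema.

```scala
// Sketch of two possible workarounds (assumes a running SparkSession `spark`
// and data written as in the repro: /tmp/t1/hour=2021-01-01T00/... etc.)
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// 1) Disable partition column type inference globally: all partition
//    columns are then read back as strings, so the hour suffix survives.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val asString = spark.read.parquet("/tmp/t1")

// 2) Or pin the schema for this one read; a user-supplied schema takes
//    precedence over inference for the partition column.
val schema = StructType(Seq(
  StructField("i", IntegerType),
  StructField("hour", StringType)))
val pinned = spark.read.schema(schema).parquet("/tmp/t1")
```

Either way the round-tripped schema keeps `hour` as StringType, matching the Spark 3.1 behavior described in the issue.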
[jira] [Comment Edited] (SPARK-36861) Partition columns are overly eagerly parsed as dates
[ https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420549#comment-17420549 ]

Gengliang Wang edited comment on SPARK-36861 at 9/27/21, 8:06 AM:
-----------------------------------------------------------------

Hmm, the PR https://github.com/apache/spark/pull/33709 is only on master. I can't reproduce your case on 3.2.0 RC4 with:

{code:scala}
scala> val df = Seq(("2021-01-01T00", 0), ("2021-01-01T01", 1), ("2021-01-01T02", 2)).toDF("hour", "i")

scala> df.write.partitionBy("hour").parquet("/tmp/t1")

scala> spark.read.parquet("/tmp/t1").schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(hour,StringType,true))
{code}

The issue can be reproduced on Spark master, though.
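[Editor's note] The difference between 3.2.0 RC4 and master comes down to how leniently a candidate partition value is tried as a date. The following is a simplified, hypothetical model of the inference step in plain Scala, not Spark's actual code; `inferType` is an illustrative helper. Strict ISO-8601 parsing rejects "2021-01-01T00" and falls back to string, while a lenient parser that truncates at 'T' yields a date and silently drops the hour.

```scala
import java.time.LocalDate
import scala.util.Try

// Hypothetical sketch of partition value type inference (not Spark's code).
// Strict mode accepts only a complete ISO-8601 date. Lenient mode, mimicking
// the behavior reported on master, truncates at 'T' before parsing.
def inferType(value: String, lenient: Boolean): String = {
  val candidate = if (lenient) value.takeWhile(_ != 'T') else value
  if (Try(LocalDate.parse(candidate)).isSuccess) "DateType" else "StringType"
}

val v = "2021-01-01T00"
println(inferType(v, lenient = false)) // StringType: the hour suffix is kept
println(inferType(v, lenient = true))  // DateType: parses as 2021-01-01, hour lost
```

Under this model, a directory name like hour=2021-01-01T00 only becomes a DateType column when the parser is willing to ignore everything after the date part, which is exactly the "overly eager" parsing the issue title describes.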
[jira] [Comment Edited] (SPARK-36861) Partition columns are overly eagerly parsed as dates
[ https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420532#comment-17420532 ]

Tanel Kiis edited comment on SPARK-36861 at 9/27/21, 7:45 AM:
-------------------------------------------------------------

[~Gengliang.Wang] I think that this should be considered a blocker for the 3.2 release.