[jira] [Comment Edited] (SPARK-36861) Partition columns are overly eagerly parsed as dates

2021-10-01 Thread Senthil Kumar (Jira)


[ https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423402#comment-17423402 ]

Senthil Kumar edited comment on SPARK-36861 at 10/1/21, 7:24 PM:
-----------------------------------------------------------------

Yes, in Spark 3.3 the hour column is created as "DateType", but I could still see the hour part in the subdirs that were created:

===

Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0-SNAPSHOT
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(("2021-01-01T00", 0), ("2021-01-01T01", 1), ("2021-01-01T02", 2)).toDF("hour", "i")
df: org.apache.spark.sql.DataFrame = [hour: string, i: int]

scala> df.write.partitionBy("hour").parquet("/tmp/t1")

scala> spark.read.parquet("/tmp/t1").schema
res1: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(hour,DateType,true))

scala>

===

 

and the subdirs created are:

===

ls -l
total 0
-rw-r--r-- 1 senthilkumar wheel 0 Oct 2 00:44 _SUCCESS
drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T00
drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T01
drwxr-xr-x 4 senthilkumar wheel 128 Oct 2 00:44 hour=2021-01-01T02

===

 

It would be helpful if you could share the list of sub-dirs created in your case.


> Partition columns are overly eagerly parsed as dates
> -----------------------------------------------------
>
> Key: SPARK-36861
> URL: https://issues.apache.org/jira/browse/SPARK-36861
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Tanel Kiis
> Priority: Blocker
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> In Spark 3.1 the 'hour' column is parsed as a string type, but in the 3.2 RC it
> is parsed as a date type and the hour part is lost.




[jira] [Comment Edited] (SPARK-36861) Partition columns are overly eagerly parsed as dates

2021-09-27 Thread Gengliang Wang (Jira)


[ https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420549#comment-17420549 ]

Gengliang Wang edited comment on SPARK-36861 at 9/27/21, 8:06 AM:
------------------------------------------------------------------

Hmm, the PR https://github.com/apache/spark/pull/33709 is only on master. I 
can't reproduce your case on 3.2.0 RC4 with:

{code:scala}
> val df = Seq(("2021-01-01T00", 0), ("2021-01-01T01", 1), ("2021-01-01T02", 2)).toDF("hour", "i")
> df.write.partitionBy("hour").parquet("/tmp/t1")
> spark.read.parquet("/tmp/t1").schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(i,IntegerType,true), StructField(hour,StringType,true))
{code}

The issue can be reproduced on Spark master though.




> Partition columns are overly eagerly parsed as dates
> -----------------------------------------------------
>
> Key: SPARK-36861
> URL: https://issues.apache.org/jira/browse/SPARK-36861
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Tanel Kiis
> Priority: Major
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> In Spark 3.1 the 'hour' column is parsed as a string type, but in the 3.2 RC it
> is parsed as a date type and the hour part is lost.




[jira] [Comment Edited] (SPARK-36861) Partition columns are overly eagerly parsed as dates

2021-09-27 Thread Tanel Kiis (Jira)


[ https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420532#comment-17420532 ]

Tanel Kiis edited comment on SPARK-36861 at 9/27/21, 7:45 AM:
--------------------------------------------------------------

[~Gengliang.Wang] I think that this should be considered a blocker for the 3.2 release.


> Partition columns are overly eagerly parsed as dates
> -----------------------------------------------------
>
> Key: SPARK-36861
> URL: https://issues.apache.org/jira/browse/SPARK-36861
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Tanel Kiis
> Priority: Major
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> In Spark 3.1 the 'hour' column is parsed as a string type, but in the 3.2 RC it
> is parsed as a date type and the hour part is lost.


