[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-14 Thread Simon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248899#comment-17248899
 ] 

Simon commented on SPARK-33571:
---

[~maxgekk] OK, all clear. Thanks again for the clarifications!

> Handling of hybrid to proleptic calendar when reading and writing Parquet 
> data not working correctly
> 
>
> Key: SPARK-33571
> URL: https://issues.apache.org/jira/browse/SPARK-33571
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Simon
>Priority: Major
> Fix For: 3.1.0
>
>
> The handling of old dates written with older Spark versions (<2.4.6) using 
> the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working 
> correctly.
> From what I understand it should work like this:
>  * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 
> 1900-01-01T00:00:00Z
>  * Only applies when reading or writing parquet files
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should 
> show the same values in Spark 3.0.1 with for example `df.show()` as they did 
> in Spark 2.4.5
>  * When reading parquet files written with Spark < 2.4.6 which contain dates 
> or timestamps before the above mentioned moments in time and 
> `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps 
> should show different values in Spark 3.0.1 with for example `df.show()` than 
> they did in Spark 2.4.5
>  * When writing parquet files with Spark >= 3.0.0 which contain dates or 
> timestamps before the above mentioned moments in time, a 
> `SparkUpgradeException` should be raised informing the user to choose either 
> `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to 
> find any clear documentation on the expected behavior. The understanding I 
> have was pieced together from the mailing list 
> ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), 
> the blog post linked there, and looking at the Spark code.
> From our testing we're seeing several issues:
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 
> and contains fields of type `TimestampType` with timestamps before 
> the above mentioned moments in time, without `datetimeRebaseModeInRead` set, 
> doesn't raise the `SparkUpgradeException`; it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5
>  * Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 
> and contains fields of type `TimestampType` or `DateType` with 
> dates or timestamps before the above mentioned moments in time, with 
> `datetimeRebaseModeInRead` set to `LEGACY`, results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.
> I've made some scripts to help with testing and to show the behavior; they 
> use pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here: 
> [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the 
> outputs in a comment below as well.






[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-09 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246717#comment-17246717
 ] 

Maxim Gekk commented on SPARK-33571:


> Is the behavior of `spark.sql.legacy.parquet.int96RebaseModeIn*`, to be 
> introduced in Spark 3.1, the same as for `datetimeRebaseModeIn*`?

Yes.
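
For illustration, a minimal PySpark sketch of using those INT96 configs 
(assuming Spark 3.1 with the `spark.sql.legacy.parquet.int96RebaseModeIn*` 
names mentioned above; the input path is hypothetical):

{code:python}
from pyspark.sql import SparkSession

# A sketch assuming Spark 3.1+, where the INT96-specific rebase configs exist.
spark = (SparkSession.builder
         .appName("int96-rebase-demo")
         # Rebase INT96 timestamps from the hybrid calendar while reading.
         .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
         # Write INT96 timestamps using the proleptic Gregorian calendar as-is.
         .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
         .getOrCreate())

# "old-int96.parquet" stands in for a file written by Spark 2.4.x.
spark.read.parquet("old-int96.parquet").show(truncate=False)
{code}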

> So Spark will check the parquet metadata for Spark version and the 
> `datetimeRebaseModeInRead` metadata key and use the correct behavior.

Correct, except for the names of the metadata keys. Spark checks the metadata 
keys defined here: 
https://github.com/MaxGekk/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/package.scala#L58-L68
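
As an illustration, the key/value metadata in a Parquet footer can be 
inspected with PyArrow (a sketch; the file name is hypothetical):

{code:python}
import pyarrow.parquet as pq

# Read only the Parquet footer and print its key/value metadata.
meta = pq.read_metadata("part-00000.snappy.parquet").metadata or {}
for key, value in meta.items():
    print(key.decode(), "=", value.decode())
# A file written by Spark carries the "org.apache.spark.version" key; files
# written in LEGACY mode also carry the legacy-datetime marker key defined in
# the code linked above.
{code}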

> If those are not set it will raise an exception and ask the user to define 
> the mode. Is that correct?

Yes. Spark should raise the exception if it is not clear which calendar the 
writer used.

> but from my testing Spark 3 does the same by default, not sure if that aligns 
> with your findings?

Spark 3.0.0-SNAPSHOT saved timestamps as TIMESTAMP_MICROS in parquet until 
https://github.com/apache/spark/pull/28450 . I just wanted to say that the 
datetimeRebaseModeIn* configs you pointed out have no impact on INT96 in Spark 
3.0.

> What is the expected behavior for TIMESTAMP_MICROS and TIMESTAMP_MILLIS with 
> regards to this?

The same as for the DATE type. Spark takes into account the same SQL configs 
and metadata keys from parquet files.
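
For example, a minimal PySpark sketch (with a hypothetical output path) of 
writing TIMESTAMP_MICROS so that the datetime rebase configs apply to 
timestamps:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micros-rebase-demo").getOrCreate()

# Write timestamps as TIMESTAMP_MICROS instead of the INT96 default, so the
# datetimeRebaseModeInWrite/InRead configs govern the rebasing.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
df = spark.sql("SELECT timestamp'1001-01-01 00:00:00' AS ts")
df.write.mode("overwrite").parquet("/tmp/micros-demo")  # hypothetical path

# Reading back: LEGACY rebases to the hybrid calendar, CORRECTED reads as-is,
# so the two modes should show different values for such an ancient timestamp.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.read.parquet("/tmp/micros-demo").show(truncate=False)
{code}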




[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-08 Thread Simon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245969#comment-17245969
 ] 

Simon commented on SPARK-33571:
---

[~maxgekk] Thanks for taking the time to look into this, for the updates to the 
documentation and for the explanation!
The actual data I ran into this issue with used the year 220, so that's why I 
used it; of course that's the one century with a 0-day diff :P The table with 
the different diffs between the two calendars cleared it up a lot. I used some 
different dates and can now also see the differences between the two read modes.

If you don't mind I have two additional questions:
> Spark 2.4.5 writes timestamps as parquet INT96 type. The SQL config 
> `datetimeRebaseModeInRead` does not influence on reading such types in Spark 
> 3.0.1, so, Spark performs rebasing always (LEGACY mode). We recently added 
> separate configs for INT96...

Is the behavior of `spark.sql.legacy.parquet.int96RebaseModeIn*`, to be 
introduced in Spark 3.1, the same as for `datetimeRebaseModeIn*`? So Spark 
will check the parquet metadata for the Spark version and the 
`datetimeRebaseModeInRead` metadata key and use the correct behavior. If those 
are not set it will raise an exception and ask the user to define the mode. Is 
that correct?

(P.S. You explicitly mention Spark 2.4.5 writes timestamps as INT96, but from 
my testing Spark 3 does the same by default, not sure if that aligns with your 
findings?)

> For INT96, it seems it is correct behavior. We should observe different 
> results for TIMESTAMP_MICROS and TIMESTAMP_MILLIS types, see the SQL config 
> spark.sql.parquet.outputTimestampType.

What is the expected behavior for TIMESTAMP_MICROS and TIMESTAMP_MILLIS with 
regards to this?

[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-04 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243856#comment-17243856
 ] 

Apache Spark commented on SPARK-33571:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30604




[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-03 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243441#comment-17243441
 ] 

Maxim Gekk commented on SPARK-33571:


I opened the PR [https://github.com/apache/spark/pull/30596] with some 
improvements to the config docs. [~hyukjin.kwon] [~cloud_fan] could you review 
it, please?




[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243440#comment-17243440
 ] 

Apache Spark commented on SPARK-33571:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30596




[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-01 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241408#comment-17241408
 ] 

Maxim Gekk commented on SPARK-33571:


[~simonvanderveldt] Looking at the dates you tested, both 1880-10-01 and 
2020-10-01 belong to the Gregorian calendar, so there should be no diffs.

For the date 0220-10-01, please have a look at the table which I built in the 
PR: https://github.com/apache/spark/pull/28067 . The table shows that there is 
no diff between the two calendars for that year.
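
As a rough illustration (my own back-of-the-envelope sketch, not taken from 
the PR), the per-century shift between the two calendars follows from the 
Gregorian leap-year rule:

{code:python}
def hybrid_to_gregorian_shift_days(year: int) -> int:
    """Days to add to a hybrid (Julian) calendar date to get the same day
    in the proleptic Gregorian calendar, for dates before 1582-10-15.
    Sketch only: the shift actually changes on 1 March of century years,
    which is ignored here."""
    # Century years not divisible by 400 are leap years in Julian only;
    # the calendars were aligned so that the gap was 10 days in 1582.
    return year // 100 - year // 400 - 2

for y in (220, 1001, 1582):
    print(y, hybrid_to_gregorian_shift_days(y))
# 220  -> 0   (no diff, as noted above)
# 1001 -> 6   (matches the 6-day shift in the 1001-01-0x outputs in this thread)
# 1582 -> 10
{code}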




[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-01 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241400#comment-17241400
 ] 

Maxim Gekk commented on SPARK-33571:


Spark 3.0.1 shows different results as well:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_275)
scala> 
spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false)
20/12/01 12:31:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.SparkUpgradeException: You may get a different result due to 
the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps 
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files 
may be written by Spark 2.x or legacy versions of Hive, which uses a legacy 
hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
calendar. See more details in SPARK-31404. You can set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
datetime values w.r.t. the calendar difference during reading. Or set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
datetime values as it is.

scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", 
"LEGACY")

scala> 
spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false)
+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-01|1001-01-01|
|1001-01-01|1001-01-02|
|1001-01-01|1001-01-03|
|1001-01-01|1001-01-04|
|1001-01-01|1001-01-05|
|1001-01-01|1001-01-06|
|1001-01-01|1001-01-07|
|1001-01-01|1001-01-08|
+----------+----------+


scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", 
"CORRECTED")

scala> 
spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false)
+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-07|1001-01-07|
|1001-01-07|1001-01-08|
|1001-01-07|1001-01-09|
|1001-01-07|1001-01-10|
|1001-01-07|1001-01-11|
|1001-01-07|1001-01-12|
|1001-01-07|1001-01-13|
|1001-01-07|1001-01-14|
+----------+----------+

{code}


[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-01 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241379#comment-17241379
 ] 

Maxim Gekk commented on SPARK-33571:


I have tried to reproduce the issue on the master branch by reading the file 
saved by Spark 2.4.5 
(https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data):
{code:scala}
  test("SPARK-33571: read ancient dates saved by Spark 2.4.5") {
withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> 
LEGACY.toString) {
  val path = 
getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet")
  val df = spark.read.parquet(path)
  df.show(false)
}
withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> 
CORRECTED.toString) {
  val path = 
getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet")
  val df = spark.read.parquet(path)
  df.show(false)
}
  }
{code}

The results are different in LEGACY and in CORRECTED modes:
{code}
+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-01|1001-01-01|
|1001-01-01|1001-01-02|
|1001-01-01|1001-01-03|
|1001-01-01|1001-01-04|
|1001-01-01|1001-01-05|
|1001-01-01|1001-01-06|
|1001-01-01|1001-01-07|
|1001-01-01|1001-01-08|
+----------+----------+

+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-07|1001-01-07|
|1001-01-07|1001-01-08|
|1001-01-07|1001-01-09|
|1001-01-07|1001-01-10|
|1001-01-07|1001-01-11|
|1001-01-07|1001-01-12|
|1001-01-07|1001-01-13|
|1001-01-07|1001-01-14|
+----------+----------+
{code}


[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-12-01 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241339#comment-17241339
 ] 

Maxim Gekk commented on SPARK-33571:


[~simonvanderveldt] Thank you for the detailed description and your 
investigation. Let me clarify a few things:

> From our testing we're seeing several issues:
> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and 
> contains fields of type `TimestampType` which contain timestamps before the 
> above mentioned moments in time without `datetimeRebaseModeInRead` set 
> doesn't raise the `SparkUpgradeException`; it succeeds without any changes to 
> the resulting dataframe compared to that dataframe in Spark 2.4.5

Spark 2.4.5 writes timestamps as the parquet INT96 type. The SQL config 
`datetimeRebaseModeInRead` does not influence reading of such types in Spark 
3.0.1, so Spark always performs rebasing (LEGACY mode). We recently added 
separate configs for INT96:
* https://github.com/apache/spark/pull/30056
* https://github.com/apache/spark/pull/30121

The changes will be released with Spark 3.1.0.

> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and 
> contains fields of type `TimestampType` or `DateType` which contain dates or 
> timestamps before the above mentioned moments in time with 
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the 
> dataframe as when using `CORRECTED`, so it seems like no rebasing is 
> happening.

For INT96, this seems to be the correct behavior. We should observe different 
results for the TIMESTAMP_MICROS and TIMESTAMP_MILLIS types; see the SQL 
config spark.sql.parquet.outputTimestampType.

The DATE case is more interesting as we must see a difference in results for 
ancient dates. I will investigate this case. 
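
As an aside, a minimal PySpark sketch of such a check (with a hypothetical 
path), reading the same 2.4.5-written file under both modes:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rebase-diff-check").getOrCreate()
path = "output/datespark245"  # hypothetical: dates written by Spark 2.4.5

def read_rows(mode):
    # The rebase mode is a runtime SQL config, so it can be switched between
    # reads; collect() forces the read under the currently set mode.
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", mode)
    return spark.read.parquet(path).collect()

legacy_rows = read_rows("LEGACY")
corrected_rows = read_rows("CORRECTED")

# For ancient dates the two reads should differ; equality here would
# reproduce the "no rebasing is happening" symptom quoted above.
print("identical:", legacy_rows == corrected_rows)
{code}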

 


[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-11-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240456#comment-17240456
 ] 

Hyukjin Kwon commented on SPARK-33571:
--

cc [~maxgekk] FYI




[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-11-26 Thread Simon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239204#comment-17239204
 ] 

Simon commented on SPARK-33571:
---

Below is the output of the date test script with the noise removed.
Writing without additional config works as expected: Spark 3.0.1 throws a 
`SparkUpgradeException`.
Reading without additional config works as expected: Spark 3.0.1 throws a 
`SparkUpgradeException` when reading parquet files written with Spark 2.4.5.

Reading using the two different `datetimeRebaseModeInRead` modes doesn't work 
though; it shows no difference:

{code:java}
Spark version: 3.0.1
Spark conf [('spark.app.name', 'read-data'), 
('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'LEGACY'), 
('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), 
('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')]
Reading parquet files from output/datespark245/*.parquet
root
 |-- row: string (nullable = true)
 |-- date: date (nullable = true)

+---+----------+
|row|      date|
+---+----------+
|  1|0220-10-01|
|  2|1880-10-01|
|  3|2020-10-01|
+---+----------+

done
...
Spark version: 3.0.1
Spark conf [('spark.app.name', 'read-data'), 
('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED'), 
('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), 
('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')]
Reading parquet files from output/datespark245/*.parquet
root
 |-- row: string (nullable = true)
 |-- date: date (nullable = true)

+---+----------+
|row|      date|
+---+----------+
|  1|0220-10-01|
|  2|1880-10-01|
|  3|2020-10-01|
+---+----------+

done
{code}
Note there is no difference in the dates shown.


[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly

2020-11-26 Thread Simon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239193#comment-17239193
 ] 

Simon commented on SPARK-33571:
---

Below is the output of the timestamp test script with the noise removed.
# Writing:
```
Spark version: 2.4.5
Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 
'client'), ('spark.app.name', 'generate-timestamp-data'), 
('spark.ui.showConsoleProgress', 'true')]
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:50:38|
|  2|1880-10-01 10:50:38|
|  3|2020-10-01 10:10:10|
+---+-------------------+

Writing parquet files to output/timestampspark245/
done
...
Spark version: 2.4.6
Spark conf [('spark.master', 'local[*]'), ('spark.submit.deployMode', 
'client'), ('spark.app.name', 'generate-timestamp-data'), 
('spark.ui.showConsoleProgress', 'true')]
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:50:38|
|  2|1880-10-01 10:50:38|
|  3|2020-10-01 10:10:10|
+---+-------------------+

Writing parquet files to output/timestampspark246/
done
...
Spark version: 3.0.1
Spark conf [('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), 
('spark.submit.deployMode', 'client'), ('spark.app.name', 
'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')]
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:10:10|
|  2|1880-10-01 10:10:10|
|  3|2020-10-01 10:10:10|
+---+-------------------+

Writing parquet files to output/timestampspark301/
done
```
Note no exception was raised when writing old timestamps to parquet in Spark 
3.0.1.

# Reading:
```
Spark version: 3.0.1
Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), 
('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), 
('spark.ui.showConsoleProgress', 'true')]
Reading parquet files from output/timestampspark245/*.parquet
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:50:38|
|  2|1880-10-01 10:50:38|
|  3|2020-10-01 10:10:10|
+---+-------------------+

done
...
Spark version: 3.0.1
Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), 
('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), 
('spark.ui.showConsoleProgress', 'true')]
Reading parquet files from output/timestampspark246/*.parquet
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:50:38|
|  2|1880-10-01 10:50:38|
|  3|2020-10-01 10:10:10|
+---+-------------------+

done
...
Spark version: 3.0.1
Spark conf [('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), 
('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), 
('spark.ui.showConsoleProgress', 'true')]
Reading parquet files from output/timestampspark301/*.parquet
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:10:10|
|  2|1880-10-01 10:10:10|
|  3|2020-10-01 10:10:10|
+---+-------------------+

done
```
