[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248899#comment-17248899 ]

Simon commented on SPARK-33571:
-------------------------------

[~maxgekk] OK, all clear. Thanks again for the clarifications!

> Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33571
>                 URL: https://issues.apache.org/jira/browse/SPARK-33571
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.0.0, 3.0.1
>            Reporter: Simon
>            Priority: Major
>             Fix For: 3.1.0
>
> The handling of old dates written with older Spark versions (<2.4.6) using the hybrid calendar in Spark 3.0.0 and 3.0.1 seems to be broken/not working correctly.
> From what I understand it should work like this:
> * Only relevant for `DateType` before 1582-10-15 or `TimestampType` before 1900-01-01T00:00:00Z
> * Only applies when reading or writing Parquet files
> * When reading Parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInRead`
> * When reading Parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `LEGACY`, the dates and timestamps should show the same values in Spark 3.0.1, with for example `df.show()`, as they did in Spark 2.4.5
> * When reading Parquet files written with Spark < 2.4.6 which contain dates or timestamps before the above mentioned moments in time and `datetimeRebaseModeInRead` is set to `CORRECTED`, the dates and timestamps should show different values in Spark 3.0.1, with for example `df.show()`, than they did in Spark 2.4.5
> * When writing Parquet files with Spark > 3.0.0 which contain dates or timestamps before the above mentioned moments in time, a `SparkUpgradeException` should be raised informing the user to choose either `LEGACY` or `CORRECTED` for the `datetimeRebaseModeInWrite`
> First of all I'm not 100% sure all of this is correct. I've been unable to find any clear documentation on the expected behavior. The understanding I have was pieced together from the mailing list ([http://apache-spark-user-list.1001560.n3.nabble.com/Spark-3-0-1-new-Proleptic-Gregorian-calendar-td38914.html]), the blog post linked there and looking at the Spark code.
> From our testing we're seeing several issues:
> * Reading Parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and that contains fields of type `TimestampType` with timestamps before the above mentioned moments in time, without `datetimeRebaseModeInRead` set, doesn't raise the `SparkUpgradeException`; it succeeds without any changes to the resulting dataframe compared to that dataframe in Spark 2.4.5
> * Reading Parquet data with Spark 3.0.1 that was written with Spark 2.4.5 and that contains fields of type `TimestampType` or `DateType` with dates or timestamps before the above mentioned moments in time, with `datetimeRebaseModeInRead` set to `LEGACY`, results in the same values in the dataframe as when using `CORRECTED`, so it seems like no rebasing is happening.
> I've made some scripts to help with testing/show the behavior; they use pyspark 2.4.5, 2.4.6 and 3.0.1. You can find them here: [https://github.com/simonvanderveldt/spark3-rebasemode-issue]. I'll post the outputs in a comment below as well.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
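The first bullet of the issue description gives the range in which the writer's calendar matters at all. As a rough illustration only, the check could be sketched like this in plain Python. The helper name `is_ambiguous` is hypothetical and not part of Spark, and treating naive timestamps as UTC is an assumption made for this sketch.

```python
from datetime import date, datetime, timezone

# Thresholds quoted in the issue description.
DATE_CUTOVER = date(1582, 10, 15)
TIMESTAMP_CUTOVER = datetime(1900, 1, 1, tzinfo=timezone.utc)


def is_ambiguous(value):
    """Return True if `value` lies in the range where the writer's calendar matters.

    Hypothetical helper for illustration; it is not part of Spark's API.
    """
    # datetime is a subclass of date, so it has to be tested first.
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption of this sketch: naive timestamps are taken as UTC.
            value = value.replace(tzinfo=timezone.utc)
        return value < TIMESTAMP_CUTOVER
    return value < DATE_CUTOVER


print(is_ambiguous(date(1001, 1, 1)))       # True
print(is_ambiguous(date(2020, 10, 1)))      # False
print(is_ambiguous(datetime(1880, 10, 1)))  # True
```

Note that a value inside this range is only potentially affected; whether the two calendars actually disagree depends on the concrete date.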
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246717#comment-17246717 ]

Maxim Gekk commented on SPARK-33571:
------------------------------------

> The behavior of the to be introduced in Spark 3.1 `spark.sql.legacy.parquet.int96RebaseModeIn*` is the same as for `datetimeRebaseModeIn*`?

Yes.

> So Spark will check the parquet metadata for Spark version and the `datetimeRebaseModeInRead` metadata key and use the correct behavior.

Correct, except for the names of the metadata keys. For the keys Spark checks, see https://github.com/MaxGekk/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/package.scala#L58-L68

> If those are not set it will raise an exception and ask the user to define the mode. Is that correct?

Yes. Spark should raise the exception if it is not clear which calendar the writer used.

> but from my testing Spark 3 does the same by default, not sure if that aligns with your findings?

Spark 3.0.0-SNAPSHOT saved timestamps as TIMESTAMP_MICROS in parquet until https://github.com/apache/spark/pull/28450 . I just wanted to say that the configs `datetimeRebaseModeIn*` you pointed out don't affect INT96 in Spark 3.0.

> What is the expected behavior for TIMESTAMP_MICROS and TIMESTAMP_MILLIS with regards to this?

The same as for the DATE type. Spark takes into account the same SQL configs and metadata keys from parquet files.
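The decision described in this exchange (look at the file metadata first, fall back to the SQL config, and raise if the writer's calendar is still unclear) could be sketched roughly as follows. This is not Spark's implementation. The two key names are placeholders, since the real ones are defined in the linked source and are not reproduced here. `UpgradeError` stands in for `SparkUpgradeException`. The rule that files from a writer older than 3.0 are treated as legacy is an assumption based on the statement that Spark 2.x used the hybrid calendar.

```python
from typing import Mapping, Optional

# Placeholder names only: the real metadata keys live in the Spark source
# linked in the comment above and are deliberately not guessed here.
VERSION_KEY = "example.writer.version"            # hypothetical
LEGACY_DATETIME_KEY = "example.legacy.datetime"   # hypothetical


class UpgradeError(Exception):
    """Stand-in for SparkUpgradeException in this sketch."""


def resolve_rebase_mode(file_meta: Mapping[str, str],
                        mode_by_config: Optional[str] = None) -> str:
    """Pick LEGACY or CORRECTED for one Parquet file (illustrative only)."""
    version = file_meta.get(VERSION_KEY)
    if version is not None:
        major = int(version.split(".")[0])
        # Assumption: writers before 3.0 used the hybrid calendar, and a
        # newer writer marks files it wrote in legacy mode with a second key.
        if major < 3 or LEGACY_DATETIME_KEY in file_meta:
            return "LEGACY"
        return "CORRECTED"
    if mode_by_config is None:
        # The writer's calendar is unknown and the user has not chosen a mode.
        raise UpgradeError("set the rebase mode to LEGACY or CORRECTED")
    return mode_by_config


print(resolve_rebase_mode({VERSION_KEY: "3.0.1"}))  # CORRECTED
print(resolve_rebase_mode({}, "LEGACY"))            # LEGACY
```

In the sessions shown elsewhere in this thread the exception only appears once an old value is actually read, which this sketch does not model.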
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245969#comment-17245969 ]

Simon commented on SPARK-33571:
-------------------------------

[~maxgekk] Thanks for taking the time to look into this, for the updates to the documentation and for the explanation!

The actual data I ran into this issue with used the year 220, so that's why I used it; of course that's the one century with a 0 day diff :P
The table with the different diffs between the two calendars cleared it up a lot. I used some different dates and can now also see the differences between the two read modes.

If you don't mind I have two additional questions:

> Spark 2.4.5 writes timestamps as parquet INT96 type. The SQL config `datetimeRebaseModeInRead` does not influence reading such types in Spark 3.0.1, so, Spark performs rebasing always (LEGACY mode). We recently added separate configs for INT96...

Is the behavior of the `spark.sql.legacy.parquet.int96RebaseModeIn*` configs, to be introduced in Spark 3.1, the same as for `datetimeRebaseModeIn*`? So Spark will check the parquet metadata for Spark version and the `datetimeRebaseModeInRead` metadata key and use the correct behavior. If those are not set it will raise an exception and ask the user to define the mode. Is that correct?
(P.S. You explicitly mention Spark 2.4.5 writes timestamps as INT96, but from my testing Spark 3 does the same by default, not sure if that aligns with your findings?)

> For INT96, it seems it is correct behavior. We should observe different results for TIMESTAMP_MICROS and TIMESTAMP_MILLIS types, see the SQL config spark.sql.parquet.outputTimestampType.

What is the expected behavior for TIMESTAMP_MICROS and TIMESTAMP_MILLIS with regards to this?
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243856#comment-17243856 ]

Apache Spark commented on SPARK-33571:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30604
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243441#comment-17243441 ]

Maxim Gekk commented on SPARK-33571:
------------------------------------

I opened the PR [https://github.com/apache/spark/pull/30596] with some improvements for config docs. [~hyukjin.kwon] [~cloud_fan] could you review it, please?
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243440#comment-17243440 ]

Apache Spark commented on SPARK-33571:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30596
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241408#comment-17241408 ]

Maxim Gekk commented on SPARK-33571:
------------------------------------

[~simonvanderveldt] Looking at the dates you tested, both 1880-10-01 and 2020-10-01 belong to the Gregorian calendar, so there should be no diffs. For the date 0220-10-01, please have a look at the table which I built in the PR: https://github.com/apache/spark/pull/28067 . The table shows that there are no diffs between the 2 calendars for that year.
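The point about year 220 can be checked without Spark. The sketch below uses the standard Julian Day Number formulas to compute how many days the same calendar label is apart in the hybrid (Julian before 1582-10-15) and the proleptic Gregorian calendar. The function names are made up for this illustration, and the labels 1582-10-05 to 1582-10-14, which do not exist in the hybrid calendar, are not handled specially.

```python
def julian_day_number(year, month, day, julian):
    """Julian Day Number of a calendar label; `julian` selects the calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4
    if julian:
        return jdn - 32083
    return jdn - y // 100 + y // 400 - 32045


def hybrid_vs_proleptic_diff_days(year, month, day):
    """Days between the same label in the hybrid and the proleptic calendar."""
    if (year, month, day) >= (1582, 10, 15):
        # From the cutover on, the hybrid calendar is the Gregorian calendar.
        return 0
    return (julian_day_number(year, month, day, julian=True)
            - julian_day_number(year, month, day, julian=False))


print(hybrid_vs_proleptic_diff_days(220, 10, 1))   # 0
print(hybrid_vs_proleptic_diff_days(1001, 1, 1))   # 6
print(hybrid_vs_proleptic_diff_days(1880, 10, 1))  # 0
print(hybrid_vs_proleptic_diff_days(2020, 10, 1))  # 0
```

All three dates from the test scripts give 0, which is why LEGACY and CORRECTED looked identical there, while a date such as 1001-01-01 gives 6.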
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241400#comment-17241400 ]

Maxim Gekk commented on SPARK-33571:

Spark 3.0.1 shows different results as well:
{code:scala}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_275)

scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false)
20/12/01 12:31:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.

scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")

scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false)
+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-01|1001-01-01|
|1001-01-01|1001-01-02|
|1001-01-01|1001-01-03|
|1001-01-01|1001-01-04|
|1001-01-01|1001-01-05|
|1001-01-01|1001-01-06|
|1001-01-01|1001-01-07|
|1001-01-01|1001-01-08|
+----------+----------+

scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")

scala> spark.read.parquet("/Users/maximgekk/proj/parquet-read-2_4_5_files/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet").show(false)
+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-07|1001-01-07|
|1001-01-07|1001-01-08|
|1001-01-07|1001-01-09|
|1001-01-07|1001-01-10|
|1001-01-07|1001-01-11|
|1001-01-07|1001-01-12|
|1001-01-07|1001-01-13|
|1001-01-07|1001-01-14|
+----------+----------+
{code}
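The six-day gap between the LEGACY and CORRECTED outputs in this comment matches the offset between the Julian calendar (which the hybrid calendar uses for dates before 1582-10-15) and the proleptic Gregorian calendar around the year 1001. Below is a minimal sketch in plain Python, independent of Spark. The function names are hypothetical, and the conversion uses the standard integer Julian Day Number formula rather than Spark's own rebasing code.

```python
from datetime import date

# The Julian Day Number (JDN) of proleptic Gregorian 0001-01-01 is 1721426,
# and date.toordinal() counts that same day as 1.
JDN_MINUS_ORDINAL = 1721425


def julian_calendar_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a date expressed in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def julian_to_proleptic_gregorian(year: int, month: int, day: int) -> date:
    """Label the same physical day in the proleptic Gregorian calendar.

    Python's datetime.date is proleptic Gregorian, so fromordinal() gives
    the label that a reader without rebasing would show.
    """
    jdn = julian_calendar_to_jdn(year, month, day)
    return date.fromordinal(jdn - JDN_MINUS_ORDINAL)


if __name__ == "__main__":
    # The 'plain' column of the test file holds 1001-01-01 .. 1001-01-08.
    for d in range(1, 9):
        print(f"1001-01-{d:02d} (Julian) ->", julian_to_proleptic_gregorian(1001, 1, d))
```

Under these assumptions, 1001-01-01 in the Julian calendar corresponds to 1001-01-07 in the proleptic Gregorian calendar, which is the pair of values shown by the two modes.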
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241379#comment-17241379 ]

Maxim Gekk commented on SPARK-33571:

I have tried to reproduce the issue on the master branch by reading the file saved by Spark 2.4.5 (https://github.com/apache/spark/tree/master/sql/core/src/test/resources/test-data):
{code:scala}
test("SPARK-33571: read ancient dates saved by Spark 2.4.5") {
  // LEGACY: rebase the stored values w.r.t. the calendar difference while reading.
  withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> LEGACY.toString) {
    val path = getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet")
    val df = spark.read.parquet(path)
    df.show(false)
  }
  // CORRECTED: read the stored values as they are, without rebasing.
  withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> CORRECTED.toString) {
    val path = getResourceParquetFilePath("test-data/before_1582_date_v2_4_5.snappy.parquet")
    val df = spark.read.parquet(path)
    df.show(false)
  }
}
{code}
The results differ between the LEGACY and CORRECTED modes:
{code}
+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-01|1001-01-01|
|1001-01-01|1001-01-02|
|1001-01-01|1001-01-03|
|1001-01-01|1001-01-04|
|1001-01-01|1001-01-05|
|1001-01-01|1001-01-06|
|1001-01-01|1001-01-07|
|1001-01-01|1001-01-08|
+----------+----------+

+----------+----------+
|dict      |plain     |
+----------+----------+
|1001-01-07|1001-01-07|
|1001-01-07|1001-01-08|
|1001-01-07|1001-01-09|
|1001-01-07|1001-01-10|
|1001-01-07|1001-01-11|
|1001-01-07|1001-01-12|
|1001-01-07|1001-01-13|
|1001-01-07|1001-01-14|
+----------+----------+
{code}
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17241339#comment-17241339 ]

Maxim Gekk commented on SPARK-33571:

[~simonvanderveldt] Thank you for the detailed description and your investigation. Let me clarify a few things:

> From our testing we're seeing several issues:
> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that
> contains fields of type `TimestampType` which contain timestamps before the
> above mentioned moments in time without `datetimeRebaseModeInRead` set
> doesn't raise the `SparkUpgradeException`, it succeeds without any changes to
> the resulting dataframe compared to that dataframe in Spark 2.4.5

Spark 2.4.5 writes timestamps as the Parquet INT96 type. The SQL config `datetimeRebaseModeInRead` does not affect how this type is read in Spark 3.0.1, so Spark always performs rebasing (LEGACY mode). We recently added separate configs for INT96:
* https://github.com/apache/spark/pull/30056
* https://github.com/apache/spark/pull/30121

The changes will be released with Spark 3.1.0.

> Reading parquet data with Spark 3.0.1 that was written with Spark 2.4.5 that
> contains fields of type `TimestampType` or `DateType` which contain dates or
> timestamps before the above mentioned moments in time with
> `datetimeRebaseModeInRead` set to `LEGACY` results in the same values in the
> dataframe as when using `CORRECTED`, so it seems like no rebasing is
> happening.

For INT96, this seems to be the correct behavior. We should observe different results for the TIMESTAMP_MICROS and TIMESTAMP_MILLIS types; see the SQL config spark.sql.parquet.outputTimestampType.

The DATE case is more interesting, as we must see a difference in results for ancient dates. I will investigate this case.
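For readers unfamiliar with INT96, the sketch below decodes the 12-byte layout commonly used for Parquet INT96 timestamps: 8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes holding a Julian day number. It is plain Python, not Spark's reader; the function names are hypothetical. Because Python's `datetime` is proleptic Gregorian, the result corresponds to interpreting the stored value as-is, with no calendar rebasing applied.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_int96_timestamp(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack (nanoseconds within day, Julian day) into 12 little-endian bytes."""
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte INT96 value into a UTC datetime, without rebasing."""
    if len(raw) != 12:
        raise ValueError("an INT96 timestamp must be exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime has microsecond resolution, so sub-microsecond digits are dropped.
    return UNIX_EPOCH + timedelta(days=days_since_epoch,
                                  microseconds=nanos_of_day // 1000)


if __name__ == "__main__":
    one_hour_ns = 3_600_000_000_000
    raw = encode_int96_timestamp(JULIAN_DAY_OF_UNIX_EPOCH + 1, one_hour_ns)
    print(decode_int96_timestamp(raw))
```

Since the stored value is only a day count plus a time of day, which calendar label an ancient timestamp receives depends on whether the reader rebases it, which is what the separate INT96 configs mentioned in this comment control.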
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17240456#comment-17240456 ]

Hyukjin Kwon commented on SPARK-33571:

cc [~maxgekk] FYI
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239204#comment-17239204 ]

Simon commented on SPARK-33571:

Below is the output of the date test script with the noise removed.

Writing without additional config works as expected: Spark 3.0.1 throws a `SparkUpgradeException`.

Reading without additional config works as expected: Spark 3.0.1 throws a `SparkUpgradeException` when reading parquet files written with Spark 2.4.5.

Reading with the two different `datetimeRebaseModeInRead` modes doesn't work though; it shows no difference:
{code:java}
Spark version: 3.0.1
Spark conf
[('spark.app.name', 'read-data'), ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'LEGACY'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')]
Reading parquet files from output/datespark245/*.parquet
root
 |-- row: string (nullable = true)
 |-- date: date (nullable = true)

+---+----------+
|row|      date|
+---+----------+
|  1|0220-10-01|
|  2|1880-10-01|
|  3|2020-10-01|
+---+----------+

done
...
Spark version: 3.0.1
Spark conf
[('spark.app.name', 'read-data'), ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')]
Reading parquet files from output/datespark245/*.parquet
root
 |-- row: string (nullable = true)
 |-- date: date (nullable = true)

+---+----------+
|row|      date|
+---+----------+
|  1|0220-10-01|
|  2|1880-10-01|
|  3|2020-10-01|
+---+----------+

done
{code}
Note that there is no difference in the dates shown.
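When checking whether two rebase modes should show different dates, it helps to know how far apart the hybrid (Julian before 1582-10-15) and proleptic Gregorian calendars are for a given date. The sketch below uses the closed-form century rule for that offset. It is an arithmetic observation about the calendars in plain Python, not Spark's rebasing logic and not something stated in this comment; the function name is hypothetical.

```python
def hybrid_to_proleptic_shift_days(year: int, month: int, day: int) -> int:
    """Days by which the proleptic Gregorian label is ahead of the hybrid label.

    Dates on or after the 1582-10-15 cutover are Gregorian in both calendars,
    so the shift is 0. Earlier dates are Julian in the hybrid calendar, and the
    gap changes only at century boundaries: century // 100 - century // 400 - 2.
    January and February still belong to the previous year's offset. Julian
    Feb 29 in century years is an edge case this sketch does not treat specially.
    """
    if (year, month, day) >= (1582, 10, 15):
        return 0
    y = year - 1 if month <= 2 else year
    return y // 100 - y // 400 - 2


if __name__ == "__main__":
    # The three dates used in the date test script.
    for ymd in [(220, 10, 1), (1880, 10, 1), (2020, 10, 1)]:
        print(ymd, "->", hybrid_to_proleptic_shift_days(*ymd), "day(s)")
    # A date from the Spark test file before_1582_date_v2_4_5.snappy.parquet.
    print((1001, 1, 1), "->", hybrid_to_proleptic_shift_days(1001, 1, 1), "day(s)")
```

Under this rule the shift is zero for all three dates in the test script (the two calendars coincide throughout the third century, and the other two dates are after the cutover), while 1001-01-01 shifts by six days. This may be why the script's output looks identical in both modes, so a test date such as 1001-01-01 is a more telling choice.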
[jira] [Commented] (SPARK-33571) Handling of hybrid to proleptic calendar when reading and writing Parquet data not working correctly
[ https://issues.apache.org/jira/browse/SPARK-33571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17239193#comment-17239193 ]

Simon commented on SPARK-33571:

Below is the output of the timestamp test script with the noise removed.

# Writing
```
Spark version: 2.4.5
Spark conf
[('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')]
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:50:38|
|  2|1880-10-01 10:50:38|
|  3|2020-10-01 10:10:10|
+---+-------------------+

Writing parquet files to output/timestampspark245/
done
...
Spark version: 2.4.6
Spark conf
[('spark.master', 'local[*]'), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')]
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:50:38|
|  2|1880-10-01 10:50:38|
|  3|2020-10-01 10:10:10|
+---+-------------------+

Writing parquet files to output/timestampspark246/
done
...
Spark version: 3.0.1
Spark conf
[('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'generate-timestamp-data'), ('spark.ui.showConsoleProgress', 'true')]
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:10:10|
|  2|1880-10-01 10:10:10|
|  3|2020-10-01 10:10:10|
+---+-------------------+

Writing parquet files to output/timestampspark301/
done
```
Note that no exception was raised when writing old timestamps to parquet in Spark 3.0.1.

# Reading
```
Spark version: 3.0.1
Spark conf
[('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')]
Reading parquet files from output/timestampspark245/*.parquet
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:50:38|
|  2|1880-10-01 10:50:38|
|  3|2020-10-01 10:10:10|
+---+-------------------+

done
...
Spark version: 3.0.1
Spark conf
[('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')]
Reading parquet files from output/timestampspark246/*.parquet
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:50:38|
|  2|1880-10-01 10:50:38|
|  3|2020-10-01 10:10:10|
+---+-------------------+

done
...
Spark version: 3.0.1
Spark conf
[('spark.app.name', 'read-data'), ('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.ui.showConsoleProgress', 'true')]
Reading parquet files from output/timestampspark301/*.parquet
root
 |-- row: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+---+-------------------+
|row|          timestamp|
+---+-------------------+
|  1|0220-10-01 10:10:10|
|  2|1880-10-01 10:10:10|
|  3|2020-10-01 10:10:10|
+---+-------------------+

done
```