Bruce Robbins created SPARK-30951:
-------------------------------------

             Summary: Potential data loss for legacy applications after switch 
to proleptic Gregorian calendar
                 Key: SPARK-30951
                 URL: https://issues.apache.org/jira/browse/SPARK-30951
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Bruce Robbins


From SPARK-26651:
{quote}The changes might impact on the results for dates and timestamps before 
October 15, 1582 (Gregorian)
{quote}
We recently discovered that some large scale Spark 2.x applications rely on 
dates before October 15, 1582.
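For context on why that date matters: the hybrid calendar switches from the Julian calendar to the Gregorian calendar at October 15, 1582, while the proleptic Gregorian calendar extends Gregorian rules backward indefinitely, so the two calendars map pre-1582 dates to different physical days. A minimal sketch in plain Python (outside Spark; the function names are ours), using the standard Julian-day-number formulas:

```python
# Convert a calendar date to a Julian Day Number (JDN) under each calendar.
# Standard integer formulas; all division is floor division.

def gregorian_to_jdn(y, m, d):
    # Proleptic Gregorian calendar (Gregorian leap rules for all years)
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - yy // 100 + yy // 400 - 32045

def julian_to_jdn(y, m, d):
    # Julian calendar (what the hybrid calendar uses before 1582-10-15)
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

# The two calendars line up at the 1582 cutover...
print(julian_to_jdn(1582, 10, 4))      # 2299160: last Julian day in the hybrid calendar
print(gregorian_to_jdn(1582, 10, 15))  # 2299161: the very next physical day

# ...but the same year/month/day before the cutover lands on different days:
print(julian_to_jdn(1200, 1, 1) - gregorian_to_jdn(1200, 1, 1))  # 7
```

So a day count written for "1200-01-01" under one calendar renders as a different date string, seven days off, when read under the other.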

Two cases came up recently:
 * An application that uses a commercial third-party library to encode 
sensitive dates. On insert, the library encodes the actual date as some other 
date. On select, the library decodes the date back to the original date. The 
encoded value could be any date, including one before October 15, 1582 (e.g., 
"0602-04-04").
 * An application that uses a specific unlikely date (e.g., "1200-01-01") as a 
marker to indicate "unknown date" (in lieu of null).

Both sites ran into problems after another component in their system was 
upgraded to use the proleptic Gregorian calendar. Spark applications that read 
files created by the upgraded component were interpreting encoded or marker 
dates incorrectly, and vice versa. Also, their data now had a mix of calendars 
(hybrid and proleptic Gregorian) with no metadata to indicate which file used 
which calendar.

Both sites had enormous amounts of existing data, so re-encoding the dates 
using some other scheme was not a feasible solution.

This is relevant to Spark 3:

Any Spark 2 application that uses such date-encoding schemes may run into 
trouble when run on Spark 3. The application may not properly interpret the 
dates previously written by Spark 2. Also, once the Spark 3 version of the 
application writes data, the tables will have a mix of calendars (hybrid and 
proleptic Gregorian) with no metadata to indicate which file uses which 
calendar.

Similarly, sites might run with mixed Spark versions, resulting in data written 
by one version that cannot be interpreted by the other. And as above, the 
tables will now have a mix of calendars with no way to detect which file uses 
which calendar.

As with the two real-life example cases, these applications may have enormous 
amounts of legacy data, so re-encoding the dates using some other scheme may 
not be feasible.

We might want to consider a configuration setting to allow the user to specify 
the calendar for storing and retrieving date and timestamp values (not sure how 
such a flag would affect other date and timestamp-related functions). I realize 
the change is far bigger than just adding a configuration setting.
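One possible shape for such a setting is a read-side "rebase": when the user declares a file to be hybrid-calendar data, re-interpret each stored day count's year/month/day under the Julian calendar and re-encode it under the proleptic Gregorian calendar. A hedged Python sketch, not an actual Spark API — function names are invented, and a real implementation would pass counts on or after the 1582 cutover through unchanged:

```python
EPOCH_JDN = 2440588  # Julian Day Number of 1970-01-01 (proleptic Gregorian)

def julian_from_jdn(jdn):
    # Invert a Julian Day Number into (year, month, day) under the Julian calendar
    c = jdn + 32082
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = d - 4800 + m // 10
    return year, month, day

def gregorian_to_jdn(y, m, d):
    # (year, month, day) to Julian Day Number under the proleptic Gregorian calendar
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - yy // 100 + yy // 400 - 32045

def rebase_hybrid_to_proleptic(days_since_epoch):
    # Read the stored pre-cutover day count as a Julian-calendar date, then
    # re-encode the same year/month/day under the proleptic Gregorian calendar.
    y, m, d = julian_from_jdn(days_since_epoch + EPOCH_JDN)
    return gregorian_to_jdn(y, m, d) - EPOCH_JDN

# A hybrid writer's "1200-01-01" (day count -281230) rebases to a count that a
# proleptic reader also renders as "1200-01-01" (seven physical days earlier):
print(rebase_hybrid_to_proleptic(-281230))  # -281237
```

This only helps when the reader knows (or is told) which calendar produced the file, which is exactly the metadata gap described above.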

Here's a quick example of where trouble may happen, using the real-life case of 
the marker date.

In Spark 2.4:
{noformat}
scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count
res0: Long = 1
scala>
{noformat}
In Spark 3.0 (reading from the same legacy file):
{noformat}
scala> spark.read.orc(s"$home/data/datefile").filter("dt == '1200-01-01'").count
res0: Long = 0
scala> 
{noformat}
By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
proleptic Gregorian calendar between 2.x and 3.x. After some upgrade headaches 
related to dates before 1582, the Hive community made the following changes:
 * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
checks a configuration setting to determine which calendar to use.
 * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
stores the calendar type in the metadata.
 * When reading date or timestamp data from ORC, Parquet, and Avro files, Hive 
checks the metadata for the calendar type.
 * When reading date or timestamp data from ORC, Parquet, and Avro files that 
lack calendar metadata, Hive's behavior is determined by a configuration 
setting. This allows Hive to read legacy data (note: if the data already 
consists of a mix of calendar types with no metadata, there is no good 
solution).
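
Hive's read-side resolution boils down to a metadata-first lookup with a configured fallback. A hypothetical sketch — the metadata key and setting names here are illustrative, not Hive's real property names:

```python
# Hypothetical metadata-first calendar resolution, modeled on Hive's approach.
# "writer.calendar" and "default.legacy.calendar" are invented names.

def resolve_calendar(file_metadata: dict, config: dict) -> str:
    # 1. Trust the calendar recorded by the writer, if present.
    recorded = file_metadata.get("writer.calendar")
    if recorded in ("hybrid", "proleptic"):
        return recorded
    # 2. Legacy file with no calendar metadata: fall back to the
    #    user-configured default for pre-upgrade data.
    return config.get("default.legacy.calendar", "hybrid")

# A new file carries its own answer; a legacy file defers to configuration:
print(resolve_calendar({"writer.calendar": "proleptic"}, {}))       # proleptic
print(resolve_calendar({}, {"default.legacy.calendar": "hybrid"}))  # hybrid
```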



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
