[jira] [Created] (SPARK-44717) "pyspark.pandas.resample" is incorrect when DST is overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not applied

Attila Zsolt Piros (Jira) Mon, 07 Aug 2023 21:02:37 -0700

Attila Zsolt Piros created SPARK-44717:
------------------------------------------


             Summary: "pyspark.pandas.resample" is incorrect when DST is 
overlapped and setting "spark.sql.timestampType" to TIMESTAMP_NTZ does not 
applied
                 Key: SPARK-44717
                 URL: https://issues.apache.org/jira/browse/SPARK-44717
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 3.4.1, 3.4.0, 4.0.0
            Reporter: Attila Zsolt Piros


To reproduce, use one of the existing tests:
- the "11H" case of test_dataframe_resample 
(pyspark.pandas.tests.test_resample.ResampleTests)
- the "1001H" case of test_series_resample 
(pyspark.pandas.tests.test_resample.ResampleTests)

Run it after setting the time zone to, for example, New York, by adding the 
following Python code to a "setUpClass":
{noformat}
os.environ["TZ"] = 'America/New_York'
{noformat}
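
As a rough sketch of where that line could go, assuming a plain unittest-style class: the class name below is hypothetical, and the time.tzset() call and the restore logic in tearDownClass are additions for illustration, not part of the existing test. The variable is set before the parent setUpClass runs, on the assumption that the parent is what starts the Spark session.

```python
import os
import time
import unittest


class NewYorkTimezoneTestCase(unittest.TestCase):  # hypothetical class name
    @classmethod
    def setUpClass(cls):
        # Remember the previous value so it can be restored afterwards.
        cls._saved_tz = os.environ.get("TZ")
        os.environ["TZ"] = "America/New_York"
        # Assumption: make the Python process itself pick up the new zone.
        # time.tzset() only exists on Unix-like systems.
        if hasattr(time, "tzset"):
            time.tzset()
        super().setUpClass()

    @classmethod
    def tearDownClass(cls):
        super().tearDownClass()
        # Put the environment back the way it was found.
        if cls._saved_tz is None:
            os.environ.pop("TZ", None)
        else:
            os.environ["TZ"] = cls._saved_tz
        if hasattr(time, "tzset"):
            time.tzset()
```

In the real test suite the same two class methods would be added to the existing test class instead of a new one.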

You will get the error for the latter one:

{noformat}
======================================================================
FAIL [4.219s]: test_series_resample (pyspark.pandas.tests.test_resample.ResampleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 276, in test_series_resample
    self._test_resample(self.pdf3.A, self.psdf3.A, ["1001H"], "right", "right", "sum")
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_resample.py", line 259, in _test_resample
    self.assert_eq(
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 457, in assert_eq
    _assert_pandas_almost_equal(lobj, robj)
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 228, in _assert_pandas_almost_equal
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [DIFFERENT_PANDAS_SERIES] Series are not almost equal:
Left:
Freq: 1001H
float64
Right:
float64
{noformat}

The problem is that the pyspark resample produces more resampled rows than 
pandas does. The DST change causes the extra rows, because the computed 
__tmp_resample_bin_col__ looks something like this:

{noformat}
|__index_level_0__  |__tmp_resample_bin_col__|A                   |
.....
|2011-03-08 00:00:00|2011-03-26 11:00:00     |0.3980551570183919  |
|2011-03-09 00:00:00|2011-03-26 11:00:00     |0.6511376673995046  |
|2011-03-10 00:00:00|2011-03-26 11:00:00     |0.6141085426890365  |
|2011-03-11 00:00:00|2011-03-26 11:00:00     |0.11557638066163867 |
|2011-03-12 00:00:00|2011-03-26 11:00:00     |0.4517788243490799  |
|2011-03-13 00:00:00|2011-03-26 11:00:00     |0.8637060550157284  |
|2011-03-14 00:00:00|2011-03-26 10:00:00     |0.8169499149450166  |
|2011-03-15 00:00:00|2011-03-26 10:00:00     |0.4585916249356583  |
|2011-03-16 00:00:00|2011-03-26 10:00:00     |0.8362472880832088  |
|2011-03-17 00:00:00|2011-03-26 10:00:00     |0.026716901748386812|
|2011-03-18 00:00:00|2011-03-26 10:00:00     |0.9086816462089563  |
{noformat}

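The bin value shifts by one hour (from 11:00:00 to 10:00:00) for the rows after 
2011-03-13, the day DST started in New York, so rows that should share a bin 
are split across two, which produces the extra lines.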
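
One way to see where a one-hour shift can come from, sketched with the Python standard library only: the day on which DST started is 24 hours long on the wall clock but only 23 hours long as elapsed time. The helper names are hypothetical and nothing here is taken from the resample code.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def elapsed_hours_naive(start, end):
    """Hours between two wall-clock values, ignoring any time zone."""
    return (end - start).total_seconds() / 3600


def elapsed_hours_in_zone(start, end, tz):
    """Hours between the same wall-clock values read as instants in tz."""
    start_utc = start.replace(tzinfo=tz).astimezone(timezone.utc)
    end_utc = end.replace(tzinfo=tz).astimezone(timezone.utc)
    return (end_utc - start_utc).total_seconds() / 3600


day_start = datetime(2011, 3, 13)
day_end = datetime(2011, 3, 14)

print(elapsed_hours_naive(day_start, day_end))              # 24.0
print(elapsed_hours_in_zone(day_start, day_end, NEW_YORK))  # 23.0
```

Any bin boundary derived from elapsed time instead of wall-clock time therefore lands one hour off once the range crosses 2011-03-13.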

Even setting the conf "spark.sql.timestampType" to "TIMESTAMP_NTZ" does not help.

You can see my tests here:
https://github.com/attilapiros/spark/pull/5

Pandas timestamps are time zone naive by default:

{noformat}
>>> import pandas as pd
>>> a = pd.Timestamp(year=2011, month=3, day=13, hour=1)
>>> b = pd.Timedelta(hours=1)

>>> a
Timestamp('2011-03-13 01:00:00')
>>> a+b
Timestamp('2011-03-13 02:00:00')
>>> a+b+b
Timestamp('2011-03-13 03:00:00')
{noformat}

But the pyspark TimestampType is time zone aware and follows DST, so adding one 
hour skips the nonexistent 02:00:00:

{noformat}
>>> sql("select  TIMESTAMP '2011-03-13 01:00:00'").show()
+-------------------------------+
|TIMESTAMP '2011-03-13 01:00:00'|
+-------------------------------+
|            2011-03-13 01:00:00|
+-------------------------------+

>>> sql("select  TIMESTAMP '2011-03-13 01:00:00' + make_interval(0,0,0,0,1,0,0)").show()
+--------------------------------------------------------------------+
|TIMESTAMP '2011-03-13 01:00:00' + make_interval(0, 0, 0, 0, 1, 0, 0)|
+--------------------------------------------------------------------+
|                                                 2011-03-13 03:00:00|
+--------------------------------------------------------------------+
{noformat}

The current resample code uses the above interval-based calculation.
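
The two behaviours can be put side by side without pandas or Spark. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the second function merely mimics the result shown above, it is not how Spark implements the addition.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def add_naive(ts, delta):
    """Wall-clock arithmetic, like the pandas example: no zone, no DST."""
    return ts + delta


def add_as_instant(ts, delta, tz):
    """Read ts as local time in tz, add delta in UTC, convert back."""
    instant = ts.replace(tzinfo=tz).astimezone(timezone.utc)
    shifted = instant + delta
    # Drop the tzinfo again so both results are directly comparable.
    return shifted.astimezone(tz).replace(tzinfo=None)


start = datetime(2011, 3, 13, 1, 0, 0)
one_hour = timedelta(hours=1)

print(add_naive(start, one_hour))                 # 2011-03-13 02:00:00
print(add_as_instant(start, one_hour, NEW_YORK))  # 2011-03-13 03:00:00
```

A resample that should match pandas needs the first kind of arithmetic for its bin boundaries.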




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
