[ https://issues.apache.org/jira/browse/SPARK-35515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351015#comment-17351015 ]

Martin Studer commented on SPARK-35515:
---------------------------------------

I'm happy to provide a PR if this seems like a sensible improvement.

> TimestampType: OverflowError: mktime argument out of range 
> -----------------------------------------------------------
>
>                 Key: SPARK-35515
>                 URL: https://issues.apache.org/jira/browse/SPARK-35515
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.1.1
>            Reporter: Martin Studer
>            Priority: Minor
>
> This issue occurs, for example, when trying to create a data frame from 
> Python {{datetime}} objects that are "out of range" where "out of range" is 
> platform-dependent due to the use of 
> [{{time.mktime}}|https://docs.python.org/3/library/time.html#time.mktime] in 
> {{TimestampType.toInternal}}:
> {code}
> import datetime
> spark_session.createDataFrame([(datetime.datetime(9999, 12, 31, 0, 0),)])
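> # On affected platforms this raises: OverflowError: mktime argument out of range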
> {code}
> A more direct way to reproduce the issue is by invoking 
> {{TimestampType.toInternal}} directly:
> {code}
> import datetime
> from pyspark.sql.types import TimestampType
> dt = datetime.datetime(9999, 12, 31, 0, 0)
> TimestampType().toInternal(dt)
> {code}
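> For context, the existing conversion in {{pyspark/sql/types.py}} looks roughly as 
> sketched below (an approximation for illustration, not the verbatim source): 
> timezone-aware values go through {{calendar.timegm}}, while timezone-naive values 
> go through {{time.mktime}}, which is where the platform-dependent overflow occurs:
> {code}
> import calendar
> import time
> # Rough sketch of the current TimestampType.toInternal behaviour:
> def toInternal_current(dt):
>     if dt is not None:
>         seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
>                    else time.mktime(dt.timetuple()))  # may raise OverflowError
>         return int(seconds) * 10 ** 6 + dt.microsecond
> {code}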
> The suggested improvement is to avoid {{time.mktime}} so that a wider range of 
> {{datetime}} values can be handled. A possible implementation may look as follows:
> {code}
> import datetime
> import pytz
> EPOCH_UTC = datetime.datetime(1970, 1, 1).replace(tzinfo=pytz.utc)
> LOCAL_TZ = datetime.datetime.now().astimezone().tzinfo
> def toInternal(dt):
>     if dt is not None:
>         # Interpret naive datetimes as local time (mirroring time.mktime),
>         # then convert to UTC and compute microseconds since the Unix epoch.
>         dt = dt if dt.tzinfo else dt.replace(tzinfo=LOCAL_TZ)
>         dt_utc = dt.astimezone(pytz.utc)
>         td = dt_utc - EPOCH_UTC
>         return (td.days * 86400 + td.seconds) * 10 ** 6 + td.microseconds
> {code}
> This relies on the ability to derive the local timezone; mechanisms other than 
> the one suggested above could be used for that.
> Test cases include:
> {code}
> dt1 = datetime.datetime(2021, 5, 25, 12, 23)
> dt2 = dt1.replace(tzinfo=pytz.timezone('Europe/Zurich'))
> dt3 = datetime.datetime(9999, 12, 31, 0, 0)
> dt4 = dt3.replace(tzinfo=pytz.timezone('Europe/Zurich'))
> toInternal(dt1) == TimestampType().toInternal(dt1)
> toInternal(dt2) == TimestampType().toInternal(dt2)
> toInternal(dt3) # TimestampType().toInternal(dt3) fails
> toInternal(dt4) == TimestampType().toInternal(dt4)
> {code}


