[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353519#comment-16353519
 ] 

Apache Spark commented on SPARK-23290:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20515

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Priority: Blocker
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" col) and the new 
> behavior (returning a datetime column). Since there may be user code relying 
> on the old behavior, I'd suggest reverting this specific part of this change. 
> Also note that the NOTE on the docstring for the "_to_corrected_pandas_type" 
> seems to be off, referring to the old behavior and not the current one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352140#comment-16352140
 ] 

Apache Spark commented on SPARK-23290:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20506

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Priority: Blocker
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" col) and the new 
> behavior (returning a datetime column). Since there may be user code relying 
> on the old behavior, I'd suggest reverting this specific part of this change. 
> Also note that the NOTE on the docstring for the "_to_corrected_pandas_type" 
> seems to be off, referring to the old behavior and not the current one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-05 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352120#comment-16352120
 ] 

Takuya Ueshin commented on SPARK-23290:
---

Thanks [~amenck] for clarifying.
I'll submit a pr for modifying to use {{datetime.date}} for date type and let's 
see feedback from the community.

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Priority: Blocker
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" col) and the new 
> behavior (returning a datetime column). Since there may be user code relying 
> on the old behavior, I'd suggest reverting this specific part of this change. 
> Also note that the NOTE on the docstring for the "_to_corrected_pandas_type" 
> seems to be off, referring to the old behavior and not the current one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-02 Thread Andre Menck (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350917#comment-16350917
 ] 

Andre Menck commented on SPARK-23290:
-

Hey [~ueshin] apologies, I tried to come up with a simpler example of the 
failure I saw, and ended up with an incorrect one! Here is a more 
straight-forward example of the failure in 2.3, specifically due to joining on 
columns with different (but similar) types:
{code}
>>> pdf = df.toPandas()
>>> pdf.dtypes
datedatetime64[ns]
num  int64
dtype: object
>>> type(pdf['date'][0])

>>> user_provided_pdf
 date  num
0  2015-01-011
>>> user_provided_pdf.dtypes
dateobject
num  int64
dtype: object
>>> type(user_provided_pdf['date'][0])

{code}
At this point, a simple example of change in functionality would be checking 
equality:
{code}
>>> pdf.loc[0,'date'] == user_provided_pdf.loc[0,'date']
False
{code}
In reality, I hit this when executing a join with a pandas dataframe obtained 
from another source:
{code}
>>> pdf.merge(user_provided_pdf, on=['date'], how='inner')
Empty DataFrame
Columns: [date, num_x, num_y]
Index: []
{code}
For 2.2, the equality above would hold and this join would produce a 
non-trivial output.

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Priority: Blocker
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" col) and the new 
> behavior (returning a datetime column). Since there may be user code relying 
> on the old behavior, I'd suggest reverting this specific part of this change. 
> Also note that the NOTE on the docstring for the "_to_corrected_pandas_type" 
> seems to be off, referring to the old behavior and not the current one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-02 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350860#comment-16350860
 ] 

Sameer Agarwal commented on SPARK-23290:


[~amenck] [~aash] any updates here?

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Priority: Blocker
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" col) and the new 
> behavior (returning a datetime column). Since there may be user code relying 
> on the old behavior, I'd suggest reverting this specific part of this change. 
> Also note that the NOTE on the docstring for the "_to_corrected_pandas_type" 
> seems to be off, referring to the old behavior and not the current one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-01 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349796#comment-16349796
 ] 

Takuya Ueshin commented on SPARK-23290:
---

Thanks for the report!

I'm afraid I couldn't figure out what's going on because your example is 
something wrong.

In your first example, the dtype of {{pdf['date']}} seems {{object}}, but the 
actual type is {{str}}:

{code:python}
>>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
>>> pdf.dtypes
dateobject
num  int64
dtype: object
>>> type(pdf['date'][0])

{code}

So the lambda should work because the function in the lambda is for string type:

{code:python}
>>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
02015-01-01
Name: date, dtype: object
{code}

Whereas Spark returns {{datetime.date}} in 2.2 and {{pd.Timestamp}} in 2.3:

{code:python}
>>> df = spark.createDataFrame([('2015-01-01', 1)], ['date', 
>>> 'num']).selectExpr("cast(date as date)", "num")
>>> df.printSchema()
root
 |-- date: date (nullable = true)
 |-- num: long (nullable = true)

>>> df.show()
+--+---+
|  date|num|
+--+---+
|2015-01-01|  1|
+--+---+
{code}

in 2.2:

{code:python}
>>> pdf = df.toPandas()
>>> pdf.dtypes
dateobject
num  int64
dtype: object
>>> type(pdf['date'][0])

{code}

in 2.3:

{code:python}
>>> pdf = df.toPandas()
>>> pdf.dtypes
datedatetime64[ns]
num  int64
dtype: object
>>> type(pdf['date'][0])

{code}

In this case, the lambda shouldn't work anyway.

Could you provide some other example to elaborate the problem?

IIUC, {{datetime.date}} and {{pd.Timestamp}} are kind of compatible, so we can 
handle them in the same way. cc: [~bryanc]

Thanks!


> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Priority: Blocker
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" col) and the new 
> behavior (returning a datetime column). Since there may be user code relying 
> on the old behavior, I'd suggest reverting this specific part of this change. 
> Also note that the NOTE on the docstring for the "_to_corrected_pandas_type" 
> seems to be off, referring to the old behavior and not the current one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe

2018-02-01 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348223#comment-16348223
 ] 

Nick Pentreath commented on SPARK-23290:


cc [~bryanc]

> inadvertent change in handling of DateType when converting to pandas dataframe
> --
>
> Key: SPARK-23290
> URL: https://issues.apache.org/jira/browse/SPARK-23290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andre Menck
>Priority: Major
>
> In [this 
> PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
>  there was a change in how `DateType` is being returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 02015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> dateobject
> num  int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> datedatetime64[ns]
> num  int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2355, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in 
> pandas._libs.lib.map_infer
>   File "", line 1, in 
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> Above we show both the old behavior (returning an "object" col) and the new 
> behavior (returning a datetime column). Since there may be user code relying 
> on the old behavior, I'd suggest reverting this specific part of this change. 
> Also note that the NOTE on the docstring for the "_to_corrected_pandas_type" 
> seems to be off, referring to the old behavior and not the current one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org