[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353519#comment-16353519 ]

Apache Spark commented on SPARK-23290:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20515

> inadvertent change in handling of DateType when converting to pandas dataframe
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-23290
>                 URL: https://issues.apache.org/jira/browse/SPARK-23290
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Andre Menck
>            Priority: Blocker
>
> In [this PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968] there was a change in how `DateType` is returned to users (line 1968 in dataframe.py). This can cause client code to fail, as in the following example from a Python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01', 1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
> 0    2015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01', 1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> date    datetime64[ns]
> num              int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", line 2355, in apply
>     mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
>   File "<stdin>", line 1, in <lambda>
> TypeError: strptime() argument 1 must be string, not Timestamp
> {code}
> Above we show both the old behavior (returning an "object" column) and the new behavior (returning a datetime column). Since there may be user code relying on the old behavior, I'd suggest reverting this specific part of the change. Also note that the NOTE in the docstring of "_to_corrected_pandas_type" seems to be off, referring to the old behavior rather than the current one.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
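The behavior difference quoted in the description can be reproduced in pandas alone, without Spark. The following is a minimal sketch (not the Spark code path itself): string dates in an object-dtype column parse fine with {{strptime}}, while after conversion to datetime64[ns] the elements come back as {{pd.Timestamp}} and the same apply raises:

```python
import datetime as dt

import pandas as pd

# Object-dtype column of date strings: each element is a str, so strptime works.
pdf = pd.DataFrame([['2015-01-01', 1]], columns=['date', 'num'])
parsed = pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
assert parsed.iloc[0] == dt.date(2015, 1, 1)

# datetime64[ns] column: elements are now pd.Timestamp, and the same
# lambda raises TypeError, matching the traceback in the description.
pdf['date'] = pd.to_datetime(pdf['date'])
try:
    pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
    raised = False
except TypeError:
    raised = True
print(raised)
```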
[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352140#comment-16352140 ]

Apache Spark commented on SPARK-23290:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20506
[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352120#comment-16352120 ]

Takuya Ueshin commented on SPARK-23290:
---------------------------------------

Thanks [~amenck] for clarifying. I'll submit a PR modifying the conversion to use {{datetime.date}} for date type, and let's see feedback from the community.
[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350917#comment-16350917 ]

Andre Menck commented on SPARK-23290:
-------------------------------------

Hey [~ueshin], apologies - I tried to come up with a simpler example of the failure I saw and ended up with an incorrect one! Here is a more straightforward example of the failure in 2.3, specifically due to joining on columns with different (but similar) types:
{code}
>>> pdf = df.toPandas()
>>> pdf.dtypes
date    datetime64[ns]
num              int64
dtype: object
>>> type(pdf['date'][0])
<class 'pandas._libs.tslib.Timestamp'>
>>> user_provided_pdf
         date  num
0  2015-01-01    1
>>> user_provided_pdf.dtypes
date    object
num      int64
dtype: object
>>> type(user_provided_pdf['date'][0])
<type 'datetime.date'>
{code}
At this point, a simple example of the change in functionality is checking equality:
{code}
>>> pdf.loc[0, 'date'] == user_provided_pdf.loc[0, 'date']
False
{code}
In reality, I hit this when executing a join with a pandas dataframe obtained from another source:
{code}
>>> pdf.merge(user_provided_pdf, on=['date'], how='inner')
Empty DataFrame
Columns: [date, num_x, num_y]
Index: []
{code}
In 2.2, the equality above would hold and this join would produce a non-trivial output.
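The join failure above can likewise be sketched without Spark, by hand-building a frame shaped like what {{toPandas()}} returns in 2.3 (an assumption - the real frame comes from Spark). Recent pandas versions refuse outright to merge a datetime64[ns] key against an object key, while older versions silently returned an empty frame; either way the join breaks:

```python
import datetime as dt

import pandas as pd

# Stand-in for Spark 2.3's df.toPandas(): 'date' is datetime64[ns] (assumption).
spark_like = pd.DataFrame({'date': pd.to_datetime(['2015-01-01']), 'num': [1]})

# User-provided frame holding plain datetime.date values in an object column.
user_pdf = pd.DataFrame({'date': [dt.date(2015, 1, 1)], 'num': [1]})

# The key columns hold different element types...
assert type(spark_like.loc[0, 'date']) is not type(user_pdf.loc[0, 'date'])

# ...so the inner join breaks: recent pandas raises on the dtype mismatch,
# while older pandas returned an empty frame.
try:
    merged = spark_like.merge(user_pdf, on=['date'], how='inner')
    print('rows:', len(merged))
except ValueError as e:
    print('merge refused:', e)
```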
[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350860#comment-16350860 ]

Sameer Agarwal commented on SPARK-23290:
----------------------------------------

[~amenck] [~aash] any updates here?
[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349796#comment-16349796 ]

Takuya Ueshin commented on SPARK-23290:
---------------------------------------

Thanks for the report! I'm afraid I couldn't figure out what's going on, because your example has something wrong. In your first example, the dtype of {{pdf['date']}} seems to be {{object}}, but the actual element type is {{str}}:
{code:python}
>>> pdf = pd.DataFrame([['2015-01-01', 1]], columns=['date', 'num'])
>>> pdf.dtypes
date    object
num      int64
dtype: object
>>> type(pdf['date'][0])
<type 'str'>
{code}
So the lambda should work, because the function in the lambda is for string type:
{code:python}
>>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
0    2015-01-01
Name: date, dtype: object
{code}
Whereas Spark returns {{datetime.date}} in 2.2 and {{pd.Timestamp}} in 2.3:
{code:python}
>>> df = spark.createDataFrame([('2015-01-01', 1)], ['date', 'num']).selectExpr("cast(date as date)", "num")
>>> df.printSchema()
root
 |-- date: date (nullable = true)
 |-- num: long (nullable = true)
>>> df.show()
+----------+---+
|      date|num|
+----------+---+
|2015-01-01|  1|
+----------+---+
{code}
in 2.2:
{code:python}
>>> pdf = df.toPandas()
>>> pdf.dtypes
date    object
num      int64
dtype: object
>>> type(pdf['date'][0])
<type 'datetime.date'>
{code}
in 2.3:
{code:python}
>>> pdf = df.toPandas()
>>> pdf.dtypes
date    datetime64[ns]
num              int64
dtype: object
>>> type(pdf['date'][0])
<class 'pandas._libs.tslib.Timestamp'>
{code}
In this case, the lambda shouldn't work anyway. Could you provide some other example to elaborate on the problem? IIUC, {{datetime.date}} and {{pd.Timestamp}} are kind of compatible, so we can handle them in the same way.
cc: [~bryanc] Thanks!
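For user code that needs the 2.2-style column back, one workaround on the user side (a sketch only, not part of any PR here) is to convert the datetime64[ns] column through {{Series.dt.date}}, which restores an object column of plain {{datetime.date}} values:

```python
import datetime as dt

import pandas as pd

# A datetime64[ns] column, shaped like Spark 2.3's toPandas() output (assumption).
pdf = pd.DataFrame({'date': pd.to_datetime(['2015-01-01', '2015-01-02']),
                    'num': [1, 2]})

# Series.dt.date extracts plain datetime.date objects, giving back the
# 2.2-style object-dtype column.
pdf['date'] = pdf['date'].dt.date
print(pdf['date'].dtype)          # object
print(type(pdf['date'].iloc[0]))  # <class 'datetime.date'>
```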
[jira] [Commented] (SPARK-23290) inadvertent change in handling of DateType when converting to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348223#comment-16348223 ]

Nick Pentreath commented on SPARK-23290:
----------------------------------------

cc [~bryanc]