[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-07 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356189#comment-16356189 ]

Apache Spark commented on SPARK-23314:
--

User 'icexelloss' has created a pull request for this issue:
https://github.com/apache/spark/pull/20537

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Felix Cheung
> Priority: Major
>
> Under SPARK-22216:
> When testing pandas_udf on group bys, I saw this error with the timestamp column:
> File "pandas/_libs/tslib.pyx", line 3593, in pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:29:30'), try using the 'ambiguous' argument
> For details, see the comment box. I'm able to reproduce this on the latest branch-2.3 (last change from Feb 1 UTC).
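For context, a minimal pandas sketch reproducing the quoted error, assuming a session time zone with a DST fall-back such as US/Eastern (the exact zone in the reporter's environment isn't stated); the error message's own suggestion, the 'ambiguous' argument, is shown as well:

{code:python}
import pandas as pd

# 2015-11-01 01:29:30 occurs twice in US/Eastern local time (DST fall-back
# hour), so pandas cannot infer which UTC instant it denotes.
s = pd.Series(pd.to_datetime(["2015-11-01 01:29:30"]))

try:
    s.dt.tz_localize("US/Eastern")
except Exception as e:  # pytz.exceptions.AmbiguousTimeError
    print(type(e).__name__, ":", e)

# Supplying an explicit `ambiguous` flag (here: treat ambiguous times as
# standard, non-DST time) makes the localization deterministic.
print(s.dt.tz_localize("US/Eastern", ambiguous=False).dt.tz_convert("UTC"))
{code}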






[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-06 Thread Li Jin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354665#comment-16354665 ]

Li Jin commented on SPARK-23314:


I figured out what the issue is. Will have a patch soon.




[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-06 Thread Li Jin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354045#comment-16354045 ]

Li Jin commented on SPARK-23314:


I think this is related to how Pandas deals with timestamp localization. I will spend some more time on it today.
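For reference, the failing call (per the traceback later in this thread, from pyspark/sql/types.py line 1733) is the localize-then-convert step; a minimal sketch of that path, with the session time zone as an assumed placeholder:

{code:python}
import pandas as pd

tz = "US/Eastern"  # placeholder; Spark passes the session time zone here
s = pd.Series(pd.to_datetime(["2015-11-01 01:29:30"]))

try:
    # The localize-then-convert step from _check_series_convert_timestamps_internal;
    # the default ambiguous='raise' is what triggers AmbiguousTimeError.
    utc = s.dt.tz_localize(tz).dt.tz_convert("UTC")
except Exception as e:
    print(type(e).__name__, ":", e)
{code}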




[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-03 Thread Felix Cheung (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351333#comment-16351333 ]

Felix Cheung commented on SPARK-23314:
--

I've isolated this down to this particular file

[https://raw.githubusercontent.com/BuzzFeedNews/2016-04-federal-surveillance-planes/master/data/feds/feds3.csv]

Without converting to pandas it seems to read fine, so I'm not sure if it's a data problem.




[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-02 Thread Felix Cheung (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351188#comment-16351188 ]

Felix Cheung commented on SPARK-23314:
--

Thanks. I have isolated this to a different subset of the data, but I am not yet able to pinpoint the exact row (mostly because the value displayed is local while the underlying data is UTC, and there is no match after adjusting for the time zone). It might be an issue with the data; in that case, is there a way to help debug this?
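On the debugging question: a hypothetical helper (the function name, column name, and time zone below are assumptions, not from this issue) that flags rows whose naive timestamps cannot be localized unambiguously, after loading the CSV into pandas:

{code:python}
import pandas as pd
import pytz

def ambiguous_rows(pdf, col="timestamp", tz="US/Eastern"):
    """Return the rows of pdf whose naive timestamps are ambiguous or
    nonexistent in the given zone (i.e. fall in a DST transition)."""
    zone = pytz.timezone(tz)

    def is_problematic(ts):
        try:
            # is_dst=None makes pytz raise instead of guessing.
            zone.localize(ts.to_pydatetime(), is_dst=None)
            return False
        except (pytz.exceptions.AmbiguousTimeError,
                pytz.exceptions.NonExistentTimeError):
            return True

    return pdf[pdf[col].map(is_problematic)]
{code}

Running something like this over each failing group should isolate rows such as the 2015-11-01 01:29:30 one in the error.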





[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-02 Thread Li Jin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350971#comment-16350971 ]

Li Jin commented on SPARK-23314:


Hi [~felixcheung]

Thanks for the information. However, I still cannot reproduce with Python 2, pandas 0.22.0, and pyarrow 0.8.0 ...

(Although I do have to drop the "flight_id" column because its type is parsed as decimal.)

Is it possible you have more than one pandas on your path?
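One quick way to check which installations actually get imported (a minimal sketch; prints the version and on-disk location of each module):

{code:python}
import pandas as pd
import pyarrow as pa

# __file__ reveals which installation is first on sys.path.
print(pd.__version__, pd.__file__)
print(pa.__version__, pa.__file__)
{code}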

 
{code:java}
>>> flights.printSchema()
root
 |-- adshex: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- altitude: integer (nullable = true)
 |-- speed: integer (nullable = true)
 |-- track: integer (nullable = true)
 |-- squawk: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- name: string (nullable = true)
 |-- other_names1: string (nullable = true)
 |-- other_names2: string (nullable = true)
 |-- n_number: string (nullable = true)
 |-- serial_number: string (nullable = true)
 |-- mfr_mdl_code: integer (nullable = true)
 |-- mfr: string (nullable = true)
 |-- model: string (nullable = true)
 |-- year_mfr: integer (nullable = true)
 |-- type_aircraft: integer (nullable = true)
 |-- agency: string (nullable = true)

>>> flights.show()
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|adshex|latitude| longitude|altitude|speed|track|squawk|type|          timestamp|                name|        other_names1|        other_names2|n_number|serial_number|mfr_mdl_code|                 mfr|model|year_mfr|type_aircraft|agency|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|A72AA1| 33.2552|-117.91699|    5499|  111|  137|  4401|B350|2015-08-18 03:58:54|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1| 33.2659|  -117.928|    5500|  109|  138|  4401|B350|2015-08-18 03:58:39|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1| 33.2741|-117.93599|    5500|  109|  137|  4401|B350|2015-08-18 03:58:28|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1|33.28251|  -117.945|    5500|  112|  138|  4401|B350|2015-08-18 03:58:13|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
|A72AA1|33.29341|-117.95699|    5500|  102|  134|  4401|B350|2015-08-18 03:57:58|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|    2010|            5|   dhs|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+

>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
...     return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>> g.show()
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|adshex|latitude| longitude|altitude|speed|track|squawk|type|          timestamp|                name|        other_names1|        other_names2|n_number|serial_number|mfr_mdl_code|                 mfr|model|year_mfr|type_aircraft|agency|
+------+--------+----------+--------+-----+-----+------+----+-------------------+--------------------+--------------------+--------------------+--------+-------------+------------+--------------------+-----+--------+-------------+------+
|A72AA1| 33.2552|-117.91699|    5499|  111|  137|  4401|B350|2015-08-18 03:58:54|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|        FM-36|     4220012|HAWKER BEECHCRAFT...|B300C|       0|            5|   dhs|
|A72AA1| 33.2659|  -117.928|    5500|  109|  138|  4401|B350|2015-08-18 03:58:39|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|    561A|
{code}

[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-02 Thread Felix Cheung (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350819#comment-16350819 ]

Felix Cheung commented on SPARK-23314:
--

I'm running Python 2
pandas 0.22.0
pyarrow 0.8.0






[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-02 Thread Li Jin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350599#comment-16350599 ]

Li Jin commented on SPARK-23314:


[~felixcheung], what's the version of pandas you are using in your environment? I cannot seem to reproduce with pandas 0.19.2 or 0.21.0 (Python 3).




[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-02 Thread Li Jin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350568#comment-16350568 ]

Li Jin commented on SPARK-23314:


I am taking a look at this.




[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349899#comment-16349899 ]

Felix Cheung commented on SPARK-23314:
--

[~icexelloss] [~bryanc]




[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349898#comment-16349898 ]

Felix Cheung commented on SPARK-23314:
--

log


[Stage 3:=> (195 + 5) / 200]18/02/01 19:17:26 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 205)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
    process()
  File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 257, in dump_stream
    batch = _create_batch(series, self._timezone)
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 235, in _create_batch
    arrs = [create_array(s, t) for s, t in series]
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 230, in create_array
    s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
  File "/Users/felixcheung/spark/python/pyspark/sql/types.py", line 1733, in _check_series_convert_timestamps_internal
    return s.dt.tz_localize(tz).dt.tz_convert('UTC')
  File "/usr/local/lib/python2.7/site-packages/pandas/core/accessor.py", line 115, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/accessors.py", line 131, in _delegate_method
    result = method(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pandas/util/_decorators.py", line 118, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.py", line 1858, in tz_localize
    errors=errors)
  File "pandas/_libs/tslib.pyx", line 3593, in pandas._libs.tslib.tz_localize_to_utc
AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:29:30'), try using the 'ambiguous' argument

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:164)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:114)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.agg_doAggregateWithoutKey$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/02/01 19:17:26 WARN TaskSetManager: Lost task 7.0 in stage 3.0 (TID 205, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
    process()
  File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 257, in dump_stream
    batch = _create_batch(series, self._timezone)
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 235, in _create_batch
    arrs = [create_array(s, t) for s, t in series]
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 230, in create_array
    s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
  File "/Users/felixcheung/spark/python/pyspark/sql/types.py", line 1733, in _check_series_convert_timestamps_internal
    return s.dt.tz_localize(tz).dt.tz_convert('UTC')
  File "/usr/local/lib/python2.7/site-packages/pandas/core/accessor.py", line 115, in f
    return

[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349897#comment-16349897 ]

Felix Cheung commented on SPARK-23314:
--

code

 

>>> flights = spark.read.option("inferSchema", True).option("header", True).option("dateFormat", "yyyy-MM-dd HH:mm:ss").csv("data*.csv")
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
...     return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>> g.count()




[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-01 Thread Felix Cheung (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349896#comment-16349896 ]

Felix Cheung commented on SPARK-23314:
--

data sample

adshex,flight_id,latitude,longitude,altitude,speed,track,squawk,type,timestamp,name,other_names1,other_names2,n_number,serial_number,mfr_mdl_code,mfr,model,year_mfr,type_aircraft,agency
A72AA1,72791e8,33.2552,-117.91699,5499,111,137,4401,B350,2015-08-18T07:58:54Z,US DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2659,-117.928,5500,109,138,4401,B350,2015-08-18T07:58:39Z,US DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2741,-117.93599,5500,109,137,4401,B350,2015-08-18T07:58:28Z,US DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.28251,-117.945,5500,112,138,4401,B350,2015-08-18T07:58:13Z,US DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.29341,-117.95699,5500,102,134,4401,B350,2015-08-18T07:57:58Z,US DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
