[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356189#comment-16356189 ]

Apache Spark commented on SPARK-23314:
--------------------------------------

User 'icexelloss' has created a pull request for this issue:
https://github.com/apache/spark/pull/20537

> Pandas grouped udf on dataset with timestamp column error
> ---------------------------------------------------------
>
>                 Key: SPARK-23314
>                 URL: https://issues.apache.org/jira/browse/SPARK-23314
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Felix Cheung
>            Priority: Major
>
> Under SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp column.
> File "pandas/_libs/tslib.pyx", line 3593, in pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest branch-2.3 (last change from Feb 1 UTC)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
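The AmbiguousTimeError quoted in the description can be reproduced in plain pandas, independent of Spark. The sketch below mirrors the localize-then-convert step that PySpark's Arrow serializer performs; the `US/Eastern` timezone is an assumption for illustration (any zone in which 2015-11-01 01:29:30 falls inside the repeated DST fall-back hour will do):

```python
import pandas as pd

# Naive wall-clock time inside the hour that repeats when US/Eastern
# falls back from DST on 2015-11-01.
s = pd.Series(pd.to_datetime(["2015-11-01 01:29:30"]))

try:
    # Same idiom as _check_series_convert_timestamps_internal in
    # pyspark/sql/types.py: attach the session timezone, then convert to UTC.
    s.dt.tz_localize("US/Eastern").dt.tz_convert("UTC")
except Exception as e:
    # pandas cannot tell which of the two occurrences of 01:29:30 is meant.
    print(type(e).__name__)
```

Because the series is localized without an `ambiguous` argument, a single row in the repeated hour aborts the conversion of the whole batch, which is consistent with one bad row failing the entire task here.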
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354665#comment-16354665 ]

Li Jin commented on SPARK-23314:
--------------------------------

I figured out what the issue is. Will have a patch soon.
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354045#comment-16354045 ]

Li Jin commented on SPARK-23314:
--------------------------------

I think this is related to how Pandas deals with timestamp localization. I will spend some more time today.
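For reference, plain pandas lets the caller resolve this localization ambiguity explicitly through the `ambiguous` argument of `tz_localize`. This is only an illustration of the pandas API, not necessarily the fix that went into Spark; the `US/Eastern` timezone is chosen for illustration:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2015-11-01 01:29:30"]))

# ambiguous=True  -> take the first (DST, EDT) occurrence of the wall time
# ambiguous=False -> take the second (standard, EST) occurrence
# ambiguous="NaT" -> mark ambiguous rows as NaT instead of raising
first = s.dt.tz_localize("US/Eastern", ambiguous=True)
second = s.dt.tz_localize("US/Eastern", ambiguous=False)

# The two readings of the same wall-clock time are one hour apart in UTC.
print(second.dt.tz_convert("UTC")[0] - first.dt.tz_convert("UTC")[0])  # 0 days 01:00:00
```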
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351333#comment-16351333 ]

Felix Cheung commented on SPARK-23314:
--------------------------------------

I've isolated this down to this particular file:
[https://raw.githubusercontent.com/BuzzFeedNews/2016-04-federal-surveillance-planes/master/data/feds/feds3.csv]

Without converting to pandas it seems to read fine, so I'm not sure if it's a data problem.
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351188#comment-16351188 ]

Felix Cheung commented on SPARK-23314:
--------------------------------------

Thanks. I have isolated this to a different subset of data, but I'm not yet able to pinpoint the exact row (mostly because the value displayed is local time while the data is UTC, and there is no match even after adjusting for the time zone). It might be an issue with the data; in that case, is there a way to help debug this?
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350971#comment-16350971 ]

Li Jin commented on SPARK-23314:
--------------------------------

Hi [~felixcheung], thanks for the information. However, I still cannot reproduce with Python 2, pandas 0.22.0 and pyarrow 0.8.0... (although I do have to drop the "flight_id" column, because its type is parsed as decimal). Is it possible you have more than one pandas on your path?

{code:java}
>>> flights.printSchema()
root
 |-- adshex: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- altitude: integer (nullable = true)
 |-- speed: integer (nullable = true)
 |-- track: integer (nullable = true)
 |-- squawk: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- name: string (nullable = true)
 |-- other_names1: string (nullable = true)
 |-- other_names2: string (nullable = true)
 |-- n_number: string (nullable = true)
 |-- serial_number: string (nullable = true)
 |-- mfr_mdl_code: integer (nullable = true)
 |-- mfr: string (nullable = true)
 |-- model: string (nullable = true)
 |-- year_mfr: integer (nullable = true)
 |-- type_aircraft: integer (nullable = true)
 |-- agency: string (nullable = true)

>>> flights.show()
|adshex|latitude|longitude|altitude|speed|track|squawk|type|timestamp|name|other_names1|other_names2|n_number|serial_number|mfr_mdl_code|mfr|model|year_mfr|type_aircraft|agency|
|A72AA1|33.2552|-117.91699|5499|111|137|4401|B350|2015-08-18 03:58:54|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|561A|FM-36|4220012|HAWKER BEECHCRAFT...|B300C|2010|5|dhs|
|A72AA1|33.2659|-117.928|5500|109|138|4401|B350|2015-08-18 03:58:39|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|561A|FM-36|4220012|HAWKER BEECHCRAFT...|B300C|2010|5|dhs|
|A72AA1|33.2741|-117.93599|5500|109|137|4401|B350|2015-08-18 03:58:28|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|561A|FM-36|4220012|HAWKER BEECHCRAFT...|B300C|2010|5|dhs|
|A72AA1|33.28251|-117.945|5500|112|138|4401|B350|2015-08-18 03:58:13|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|561A|FM-36|4220012|HAWKER BEECHCRAFT...|B300C|2010|5|dhs|
|A72AA1|33.29341|-117.95699|5500|102|134|4401|B350|2015-08-18 03:57:58|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|561A|FM-36|4220012|HAWKER BEECHCRAFT...|B300C|2010|5|dhs|

>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
...     return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>> g.show()
|adshex|latitude|longitude|altitude|speed|track|squawk|type|timestamp|name|other_names1|other_names2|n_number|serial_number|mfr_mdl_code|mfr|model|year_mfr|type_aircraft|agency|
|A72AA1|33.2552|-117.91699|5499|111|137|4401|B350|2015-08-18 03:58:54|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|561A|FM-36|4220012|HAWKER BEECHCRAFT...|B300C|0|5|dhs|
|A72AA1|33.2659|-117.928|5500|109|138|4401|B350|2015-08-18 03:58:39|US DEPARTMENT OF ...|US CUSTOMS & BORD...|OFFICE OF AIR & M...|561A|
{code}
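Outside of Spark, the grouped-map UDF in the snippet above corresponds to an ordinary pandas groupby/apply. A minimal sketch with made-up values (the column names follow the dataset; the data itself is invented for illustration):

```python
import pandas as pd

# Tiny stand-in for the flights data: two manufacturers, invented years.
pdf = pd.DataFrame({
    "mfr": ["HAWKER", "HAWKER", "CESSNA"],
    "year_mfr": [2010, 2012, 2000],
})

def subtract_mean_year_mfr(g):
    # Same body as the grouped-map UDF: demean year_mfr within each group.
    return g.assign(year_mfr=g.year_mfr - g.year_mfr.mean())

out = pdf.groupby("mfr", group_keys=False).apply(subtract_mean_year_mfr)
print(sorted(out.year_mfr.tolist()))  # [-1.0, 0.0, 1.0]
```

Spark's GROUPED_MAP pandas_udf runs the same function per group, but each group is first serialized to a pandas DataFrame through Arrow, which is where the timestamp localization in this issue happens.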
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350819#comment-16350819 ]

Felix Cheung commented on SPARK-23314:
--------------------------------------

I'm running Python 2, pandas 0.22.0, pyarrow 0.8.0.
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350599#comment-16350599 ]

Li Jin commented on SPARK-23314:
--------------------------------

[~felixcheung], what's the version of pandas you are using in your environment? I cannot seem to reproduce with pandas 0.19.2 and 0.21.0 (python3).
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350568#comment-16350568 ]

Li Jin commented on SPARK-23314:
--------------------------------

I am taking a look at this.
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349899#comment-16349899 ]

Felix Cheung commented on SPARK-23314:
--------------------------------------

[~icexelloss] [~bryanc]
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349898#comment-16349898 ]

Felix Cheung commented on SPARK-23314:
--------------------------------------

log

{code}
[Stage 3:=>  (195 + 5) / 200]
18/02/01 19:17:26 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 205)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
    process()
  File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 257, in dump_stream
    batch = _create_batch(series, self._timezone)
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 235, in _create_batch
    arrs = [create_array(s, t) for s, t in series]
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 230, in create_array
    s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
  File "/Users/felixcheung/spark/python/pyspark/sql/types.py", line 1733, in _check_series_convert_timestamps_internal
    return s.dt.tz_localize(tz).dt.tz_convert('UTC')
  File "/usr/local/lib/python2.7/site-packages/pandas/core/accessor.py", line 115, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/accessors.py", line 131, in _delegate_method
    result = method(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pandas/util/_decorators.py", line 118, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/indexes/datetimes.py", line 1858, in tz_localize
    errors=errors)
  File "pandas/_libs/tslib.pyx", line 3593, in pandas._libs.tslib.tz_localize_to_utc
AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:29:30'), try using the 'ambiguous' argument
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:164)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:114)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.agg_doAggregateWithoutKey$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
18/02/01 19:17:26 WARN TaskSetManager: Lost task 7.0 in stage 3.0 (TID 205, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 229, in main
    process()
  File "/Users/felixcheung/spark/python/lib/pyspark.zip/pyspark/worker.py", line 224, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 257, in dump_stream
    batch = _create_batch(series, self._timezone)
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 235, in _create_batch
    arrs = [create_array(s, t) for s, t in series]
  File "/Users/felixcheung/spark/python/pyspark/serializers.py", line 230, in create_array
    s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
  File "/Users/felixcheung/spark/python/pyspark/sql/types.py", line 1733, in _check_series_convert_timestamps_internal
    return s.dt.tz_localize(tz).dt.tz_convert('UTC')
  File "/usr/local/lib/python2.7/site-packages/pandas/core/accessor.py", line 115, in f
    return
{code}
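The frame the traceback points at, `s.dt.tz_localize(tz).dt.tz_convert('UTC')` in `pyspark/sql/types.py`, is a standard pandas idiom that only fails for ambiguous (or nonexistent) local times; for any unambiguous timestamp it simply rebases the instant to UTC. A sketch of that normal path (the `US/Pacific` session timezone is assumed for illustration):

```python
import pandas as pd

# An unambiguous local time, matching the data sample in this issue.
s = pd.Series(pd.to_datetime(["2015-08-18 03:58:54"]))

# Attach the (assumed) session timezone, then express the instant in UTC,
# which is what PySpark does before handing the column to Arrow.
utc = s.dt.tz_localize("US/Pacific").dt.tz_convert("UTC")
print(utc[0])  # 2015-08-18 10:58:54+00:00
```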
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349897#comment-16349897 ]

Felix Cheung commented on SPARK-23314:
--------------------------------------

code

{code}
>>> flights = spark.read.option("inferSchema", True).option("header", True).option("dateFormat", "yyyy-MM-dd HH:mm:ss").csv("data*.csv")
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> @pandas_udf(flights.schema, PandasUDFType.GROUPED_MAP)
... def subtract_mean_year_mfr(pdf):
...     return pdf.assign(year_mfr=pdf.year_mfr - pdf.year_mfr.mean())
...
>>> g = flights.groupby('mfr').apply(subtract_mean_year_mfr)
>>> g.count()
{code}
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349896#comment-16349896 ]

Felix Cheung commented on SPARK-23314:
--------------------------------------

data sample

{code}
adshex,flight_id,latitude,longitude,altitude,speed,track,squawk,type,timestamp,name,other_names1,other_names2,n_number,serial_number,mfr_mdl_code,mfr,model,year_mfr,type_aircraft,agency
A72AA1,72791e8,33.2552,-117.91699,5499,111,137,4401,B350,2015-08-18T07:58:54Z,US DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2659,-117.928,5500,109,138,4401,B350,2015-08-18T07:58:39Z,US DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.2741,-117.93599,5500,109,137,4401,B350,2015-08-18T07:58:28Z,US DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.28251,-117.945,5500,112,138,4401,B350,2015-08-18T07:58:13Z,US DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
A72AA1,72791e8,33.29341,-117.95699,5500,102,134,4401,B350,2015-08-18T07:57:58Z,US DEPARTMENT OF HOMELAND SECURITY,US CUSTOMS & BORDER PROTECTION,OFFICE OF AIR & MARINE,561A,FM-36,4220012,HAWKER BEECHCRAFT CORP,B300C,2010,5,dhs
{code}