[jira] [Comment Edited] (SPARK-27519) Pandas udf corrupting data

Bryan Cutler (JIRA) Tue, 30 Apr 2019 15:50:10 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-27519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830743#comment-16830743
 ]


Bryan Cutler edited comment on SPARK-27519 at 4/30/19 10:49 PM:
----------------------------------------------------------------

The problem does not occur when running the latest master. Marking as resolved.


was (Author: bryanc):
Problem does not happen when running the latest master.

> Pandas udf corrupting data
> --------------------------
>
>                 Key: SPARK-27519
>                 URL: https://issues.apache.org/jira/browse/SPARK-27519
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Jeff gold
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: Pandas UDF Bug.py
>
>
> While using a pandas UDF, I passed it two columns: a string and a
> list of lists of strings. An example of the second argument's structure:
> [['1'],['2'],['3']]
> But inside the UDF, the same value arrives as something like this:
> [['1','2'],['3'],[]]
> I checked, and the same row in the table holds the list with the correct
> structure; it changes only inside the UDF.
>
> I don't know why this happens, but I do know it has something to do with the
> fact that the row was the 10,001st row and the last row in its partition.
> The pandas batch size is 10,000, so that row was sent alone as a second batch,
> and that is the only thing that seems to cause it: having 1 or 2 rows in a
> second batch of the partition. I was also able to reproduce this with a second
> batch of 2 rows; in that case the list was unchanged except that an empty list
> was appended to the end.
> I hope you can help me understand what is going on. Thanks!
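
To make the batching arithmetic in the report concrete, here is a minimal sketch in plain Python of how a partition's rows would be divided into batches of at most 10,000 rows. It assumes the default value of spark.sql.execution.arrow.maxRecordsPerBatch (10,000) and models only the row counts. The helper name split_into_batches is hypothetical and is not part of Spark. It does not touch Arrow serialization and does not reproduce the corruption itself.

```python
def split_into_batches(num_rows, max_records_per_batch=10_000):
    """Return the row count of each batch for a single partition.

    Hypothetical helper for illustration only: it mirrors the row-count
    arithmetic described in the report, not Spark's actual code path.
    """
    if max_records_per_batch <= 0:
        raise ValueError("max_records_per_batch must be positive")
    full_batches, remainder = divmod(num_rows, max_records_per_batch)
    sizes = [max_records_per_batch] * full_batches
    if remainder:
        # The leftover rows form a short trailing batch.
        sizes.append(remainder)
    return sizes


# A partition of 10,001 rows: the last row travels alone in a second batch.
print(split_into_batches(10_001))
# A partition of 10,002 rows: the second batch holds 2 rows.
print(split_into_batches(10_002))
```

Under these assumptions, a partition with 10,001 or 10,002 rows ends with a second batch of 1 or 2 rows, which are the two cases the reporter describes.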



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

