[ https://issues.apache.org/jira/browse/SPARK-27519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830743#comment-16830743 ]
Bryan Cutler edited comment on SPARK-27519 at 4/30/19 10:49 PM:
----------------------------------------------------------------

Problem does not happen when running the latest master. Marking resolved.


was (Author: bryanc):
Problem does not happen when running the latest master.

> Pandas udf corrupting data
> --------------------------
>
>                 Key: SPARK-27519
>                 URL: https://issues.apache.org/jira/browse/SPARK-27519
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Jeff gold
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: Pandas UDF Bug.py
>
>
> While trying to use a pandas UDF, I passed it two columns: a string and a
> list of lists of strings. The second argument has a structure like this:
> [['1'],['2'],['3']]
> But inside the UDF, the same value arrives looking like this:
> [['1','2'],['3'],[]]
> I checked, and the same row in the table holds the list with the correct
> structure; it changes only inside the UDF.
>
> I don't know why this happens, but I do know it has something to do with the
> fact that the row was the 10,001st row and the last row in its partition.
> The pandas batch size is 10,000, so that row was sent alone as a second batch,
> and that is the only thing that seems to cause it: having 1 or 2 rows in a
> second batch of the partition. I was also able to reproduce this with a second
> batch of 2 rows; in that case the list was not changed, except that an empty
> list was added to the end.
> Hope you can help me understand what is going on, thanks!
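
For readers following the batch-size reasoning in the report, here is a minimal sketch of the arithmetic only. It assumes a partition is cut into consecutive batches of at most 10,000 rows, which is the default of Spark's `spark.sql.execution.arrow.maxRecordsPerBatch` setting. The helper `batch_sizes` is a hypothetical name introduced for illustration; it is not part of Spark or of the attached script, and it does not reproduce the corruption itself.

```python
def batch_sizes(num_rows, max_records_per_batch=10_000):
    """Return the row count of each batch for one partition.

    Hypothetical helper: models only the splitting arithmetic described
    in the report, not Spark's actual Arrow serialization code.
    """
    if num_rows < 0 or max_records_per_batch <= 0:
        raise ValueError("num_rows must be >= 0 and batch size must be > 0")
    full_batches, remainder = divmod(num_rows, max_records_per_batch)
    sizes = [max_records_per_batch] * full_batches
    if remainder:
        # The trailing rows form a short final batch of their own.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    # A partition of 10,001 rows: the last row ends up alone in a second batch.
    print(batch_sizes(10_001))  # [10000, 1]
    # A partition of 10,002 rows: the second batch holds 2 rows.
    print(batch_sizes(10_002))  # [10000, 2]
```

Under this assumption, the two cases from the report (a second batch of 1 row and of 2 rows) correspond to partitions of 10,001 and 10,002 rows.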