Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-14 Thread Micah Kornfield
Hi, from the error it looks like this might potentially be some sort of integer overflow, but it is hard to say. Could you try to get a minimal reproduction of the error [1], and open a JIRA issue [2] with it? Thanks, Micah [1] https://stackoverflow.com/help/mcve [2]
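
In that spirit, a sketch of what a minimal reproduction could look like. Everything here is illustrative rather than taken from the thread: the column names, the row count, and the no-op UDF are assumptions, sized so that a single group's string column approaches the ~2 GB Java-side buffer limit discussed further down the thread.

    # Minimal-reproduction sketch: one group key (so the whole frame becomes a
    # single Arrow RecordBatch on the Java side) and one wide string column.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit, pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    n_rows = 21 * 1000 * 1000                      # adjust to available memory
    df = (spark.range(n_rows)                      # "id" is just a row number here
          .withColumn("grp", lit(1))               # a single group key
          .withColumn("txt", lit("x" * 80)))       # ~80 bytes per row in one column

    @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
    def identity(pdf):
        return pdf                                 # no-op: the failure happens in the Arrow (de)serialization

    df.groupby("grp").apply(identity).count()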

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-10 Thread Abdeali Kothari
Hi, any help on this would be much appreciated. I've not been able to figure out any reason for this to happen yet. On Sat, Mar 2, 2019, 11:50 Abdeali Kothari wrote: > Hi Li Jin, thanks for the note. > I get this error only for larger data - when I reduce the number of records or the number

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
Hi Li Jin, thanks for the note. I get this error only for larger data - when I reduce the number of records or the number of columns in my data it all works fine - so if it is a binary incompatibility, it should be something related to large data. I am using Spark 2.3.1 on Amazon EMR for this
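
Since the Python-side libraries are installed on the cluster while the Arrow Java version is fixed by the Spark release, a quick sketch for recording the Python-side versions when reporting this kind of issue:

    # Environment dump for a bug report: Spark plus the Python-side Arrow/pandas versions.
    import pandas as pd
    import pyarrow as pa
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print("spark  :", spark.version)
    print("pyarrow:", pa.__version__)
    print("pandas :", pd.__version__)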

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Li Jin
The 2G limit that Uwe mentioned definitely exists; Spark currently serializes each group as a single RecordBatch. The "pyarrow.lib.ArrowIOError: read length must be positive or -1" is strange. I think Spark is on an older Arrow version on the Java side (0.10 for Spark 2.4 and 0.8 for Spark 2.3). I
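
Given that each group becomes one RecordBatch, a rough way to see how close a single group gets to the per-buffer ceiling is to convert it with pyarrow outside Spark. A sketch, with `group_pdf` standing in for one group's rows as a pandas DataFrame; note the reallocation in the stack trace grows the buffer in steps, so it can hit the ceiling before the raw data itself reaches 2 GB.

    # Measure the Arrow buffer sizes of each column for one (hypothetical) group
    # and compare against the ~2 GB (2**31 bytes) Java-side ceiling.
    import pyarrow as pa

    LIMIT = 2 ** 31
    for name in group_pdf.columns:
        arr = pa.Array.from_pandas(group_pdf[name])
        size = sum(buf.size for buf in arr.buffers() if buf is not None)
        flag = "  <-- at/over the Java limit" if size >= LIMIT else ""
        print("%-20s %12d bytes%s" % (name, size, flag))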

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
Forgot to mention: the above testing is with 0.11.1. I tried 0.12.1 as you suggested - and am getting the OversizedAllocationException with the 80-char column, and getting "read length must be positive or -1" without that. So, both the issues are reproducible with pyarrow 0.12.1. On Sat, Mar 2, 2019

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
That was spot on! I had 3 columns with 80 characters => 80 * 21*10^6 rows ~ 1.56 GB per column. I removed these columns and replaced each with 10 doubleType columns (so it would still be 80 bytes of data per row) - and this error didn't come up anymore. I also removed all the other columns and just kept 1 column with
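
For reference, the arithmetic above spelled out, assuming the ~21 million rows implied by the figures quoted:

    # Back-of-the-envelope sizes for the columns described above (21M rows assumed).
    rows = 21 * 1000 * 1000
    bytes_per_string_col = 80 * rows     # 1,680,000,000 bytes ~= 1.56 GiB of character data
    bytes_per_double_col = 8 * rows      #   168,000,000 bytes ~= 160 MiB per float64 column
    java_per_buffer_cap = 2 ** 31        # 2,147,483,648 bytes ~= 2 GiB
    print(bytes_per_string_col, bytes_per_double_col, java_per_buffer_cap)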

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Uwe L. Korn
There is currently a limitation that a column in a single RecordBatch can only hold 2GB on the Java side. We work around this by splitting the DataFrame under the hood into multiple RecordBatches. I'm not familiar with the Spark<->Arrow code but I guess that in this case, the Spark code can
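
On the pyarrow side, the kind of splitting Uwe describes looks roughly like this (a sketch; `big_pdf` is a hypothetical pandas DataFrame): a large Table can be handed out as several smaller RecordBatches instead of one, so that each batch sent over IPC stays well under the per-column limit on the receiving side.

    # Split one large Table into several RecordBatches, capping rows per batch.
    import pyarrow as pa

    table = pa.Table.from_pandas(big_pdf)
    batches = table.to_batches(1 * 1000 * 1000)   # at most ~1M rows per RecordBatch
    print(len(batches), "record batches,", table.num_rows, "rows total")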

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
Is there a limitation that a single column cannot be more than 1-2G? One of my columns definitely would be around 1.5GB of memory. I cannot split my DF into more partitions as I have only 1 ID and I'm grouping by that ID. So, the UDAF would only run on a single pandas DF. I do have a requirement

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Uwe L. Korn
Hello Abdeali, a problem here could be that a single column of your dataframe is using more than 2GB of RAM (possibly also just 1G). Try splitting your DataFrame into more partitions before applying the UDAF. Cheers, Uwe. On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote: > I was using
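
A sketch of that suggestion in PySpark, with hypothetical names. As Abdeali's follow-up above notes, each group is still handed to the UDF as one pandas DataFrame, so this only helps when no single group is itself near the limit.

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Hypothetical grouped-map "UDAF": one summary row per group
    # (df is assumed to exist and to have a long "id" column).
    @pandas_udf("id long, n long", PandasUDFType.GROUPED_MAP)
    def summarize(pdf):
        return pdf.groupby("id").size().reset_index(name="n")

    # More, smaller partitions before the grouped apply.
    result = df.repartition(200).groupby("id").apply(summarize)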

OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
I was using Arrow with Spark+Python, and when trying some pandas UDAF functions I am getting this error:
org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer
    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)