[jira] [Created] (ARROW-4890) Spark+Arrow Grouped pandas UDAF - read length must be positive or -1

2019-03-15 Thread Abdeali Kothari (JIRA)
Abdeali Kothari created ARROW-4890:
-----------------------------------
Summary: Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
Key: ARROW-4890
URL: https://issues.apache.org/jira/browse/ARROW-4890
Project

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-10 Thread Abdeali Kothari
Hi, any help on this would be much appreciated. I've not been able to figure out any reason for this to happen yet. On Sat, Mar 2, 2019, 11:50 Abdeali Kothari wrote: > Hi Li Jin, thanks for the note. > > I get this error only for larger data - when I reduce the number of > records o

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
k 2.3). I forgot whether there is binary > incompatibility between these versions and pyarrow 0.12. > > On Fri, Mar 1, 2019 at 3:32 PM Abdeali Kothari > wrote: > > > Forgot to mention: The above testing is with 0.11.1 > > I tried 0.12.1 as you suggested - and am getting the

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
at 1:57 AM Abdeali Kothari wrote: > That was spot on! > I had 3 columns with 80 characters => 80 * 21*10^6 bytes ≈ 1.56 GB per column. > I removed these columns and replaced each with 10 doubleType columns (so > it would still be 80 bytes of data per row) - and this error didn't come up anymore. >
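[For reference, a quick back-of-envelope check of that arithmetic, assuming roughly 21 million rows and 80 single-byte characters per value, and ignoring the offset/validity buffers Arrow also allocates for a string column:]

    # Rough size of one 80-character string column over ~21 million rows,
    # as discussed above (data buffer only, no offset/validity buffers).
    rows = 21 * 10**6
    bytes_per_value = 80
    total_bytes = rows * bytes_per_value
    print(total_bytes)          # 1,680,000,000 bytes
    print(total_bytes / 2**30)  # ~1.56 GiB -- close to the 2 GB per-column limit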

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
exact size of your columns. We support 2G > per column; if it is only 1.5G, then there is probably a rounding error in > Arrow. Alternatively, you might also be in luck that the following > patch > https://github.com/apache/arrow/commit/bfe6865ba8087a46bd7665679e48af3a77987cef >
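[One rough way to check the per-column Arrow footprint against that 2G limit is to convert a sample pandas DataFrame with pyarrow and inspect the buffer sizes. A minimal sketch with a recent pyarrow; the column name and data are made up for illustration:]

    import pandas as pd
    import pyarrow as pa

    # Small sample with the same shape as the problematic data
    # (80-character strings); scale the row count mentally to the real size.
    pdf = pd.DataFrame({"long_text": ["x" * 80] * 1_000_000})

    table = pa.Table.from_pandas(pdf)
    # nbytes counts the data, offset, and validity buffers for the column.
    print(table.column("long_text").nbytes)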

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
Try splitting your DataFrame > into more partitions before applying the UDAF. > > Cheers > Uwe > > On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote: > > I was using arrow with spark+python and when I'm trying some pandas-UDAF &
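[A minimal sketch of that suggestion, assuming an existing DataFrame `df`, a grouping column `group_id`, and a grouped pandas UDF `summarize`; all of these names and the partition count are illustrative, not from the original job:]

    # Increase the number of partitions before the grouped UDF so each Arrow
    # batch handed to pandas stays well below the per-column buffer limit.
    df_repartitioned = df.repartition(200, "group_id")
    result = df_repartitioned.groupby("group_id").apply(summarize)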

OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
I was using Arrow with Spark+Python, and when I try some pandas UDAF functions I get this error:

org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer
at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)
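[For context, a minimal sketch of the kind of grouped pandas UDF (GROUPED_MAP) usage that exercises this code path in Spark 2.3/2.4 with Arrow; the schema, column names, and group logic are illustrative assumptions, not taken from the original job:]

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = (SparkSession.builder
             .appName("grouped-pandas-udf-sketch")
             .config("spark.sql.execution.arrow.enabled", "true")
             .getOrCreate())

    # Toy data; in the report each group held millions of rows with wide
    # (80-character) string columns, which is what grew the Arrow buffers.
    df = spark.createDataFrame(
        [(1, "a" * 80, 1.0), (1, "b" * 80, 2.0), (2, "c" * 80, 3.0)],
        ["group_id", "long_text", "value"])

    @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
    def summarize(pdf):
        # All rows of a group arrive as one pandas DataFrame via Arrow,
        # so very large groups mean very large per-column Arrow buffers.
        pdf["value"] = pdf["value"] - pdf["value"].mean()
        return pdf

    df.groupby("group_id").apply(summarize).show()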