Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
Hi Li Jin, thanks for the note. I get this error only for larger data - when I reduce the number of records or the number of columns in my data it all works fine - so if it is a binary incompatibility it should be something related to large data. I am using Spark 2.3.1 on Amazon EMR for this

Re: [C++] Help with windows build failure

2019-03-01 Thread Micah Kornfield
Just to finish off this thread: Antoine's advice was spot on (need to pass Debug and Static to b2). There was still another build issue with double precision, but I was able to bypass it by making the specific test that was failing. On Tue, Feb 26, 2019 at 3:49 AM Antoine Pitrou wrote: > > Le

Re: Boost and manylinux CI builds

2019-03-01 Thread Ravindra Pindikura
Thanks Uwe. For the record (in case someone needs to do it again), these are the steps: 1. Make the change in build_boost.sh 2. Set up an account on quay.io and link it to your GitHub account 3. In quay.io, add a new repository using: A. Link to GitHub

Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
Could someone give me write/edit access to confluence? Thank you, François On Fri, Mar 1, 2019 at 3:55 PM Francois Saint-Jacques < fsaintjacq...@gmail.com> wrote: > I'll take this. > > On Fri, Mar 1, 2019 at 3:55 PM Wes McKinney wrote: > >> We could create a page on the wiki that shows all

Re: [Format] Redundant information in Time type?

2019-03-01 Thread Wes McKinney
As I recall there might have been the desire to permit 64-bit representation of SECOND and MILLI time values, but I would opt for YAGNI (until we actually do) and deprecate this bit width field in Schema.fbs (we shouldn't outright remove it -- for backwards compatibility -- unless it's actively

[jira] [Created] (ARROW-4738) [JS] NullVector should include a null data buffer

2019-03-01 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4738: -- Summary: [JS] NullVector should include a null data buffer Key: ARROW-4738 URL: https://issues.apache.org/jira/browse/ARROW-4738 Project: Apache Arrow Issue

Re: Flaky Travis CI builds on master

2019-03-01 Thread Wes McKinney
We could create a page on the wiki that shows all open and resolved issues relating to unexpected CI / build failures. Would someone like to give this a go? There are probably many historical issues that can be tagged with the label On Fri, Mar 1, 2019 at 12:45 PM Francois Saint-Jacques wrote: >

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Li Jin
The 2G limit that Uwe mentioned definitely exists; Spark currently serializes each group as a single RecordBatch. The "pyarrow.lib.ArrowIOError: read length must be positive or -1" error is strange. I think Spark is on an older version of the Java side (0.10 for Spark 2.4 and 0.8 for Spark 2.3). I

Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
I agree with adding a tag/label for this and even marking the failure as critical. On Fri, Mar 1, 2019 at 12:18 PM Micah Kornfield wrote: > Moving away from the tactical for a minute, I think being able to track > these over time would be useful. I can think of a couple of high level >

[jira] [Created] (ARROW-4737) [C#] tests are not running in CI

2019-03-01 Thread Eric Erhardt (JIRA)
Eric Erhardt created ARROW-4737: --- Summary: [C#] tests are not running in CI Key: ARROW-4737 URL: https://issues.apache.org/jira/browse/ARROW-4737 Project: Apache Arrow Issue Type: Bug

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
Forgot to mention: the above testing is with 0.11.1. I tried 0.12.1 as you suggested and am getting the OversizedAllocationException with the 80-character column, and "read length must be positive or -1" without it. So, both issues are reproducible with pyarrow 0.12.1. On Sat, Mar 2, 2019

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
That was spot on! I had 3 columns with 80 characters => 80*21*10^6 bytes = 1.68*10^9 bytes, about 1.56 GiB. I removed these columns and replaced each with 10 doubleType columns (so it would still be 80 bytes of data) - and this error didn't come up anymore. I also removed all the other columns and just kept 1 column with
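
The arithmetic in the message above can be checked with a short sketch. It assumes that 21*10^6 is the row count, that each character occupies one byte (ASCII), and that "2G" means 2 GiB; offset and validity buffers are ignored. The names column_data_bytes, GIB and JAVA_BUFFER_LIMIT are hypothetical and are not part of Arrow or Spark.

```python
# Rough size of the data buffer of one string column.
# Assumptions: 1 byte per character, no offset/validity overhead counted.

GIB = 2 ** 30
JAVA_BUFFER_LIMIT = 2 * GIB  # the "2G" per-column limit from the thread, read as 2 GiB


def column_data_bytes(chars_per_value: int, num_values: int) -> int:
    """Return the raw payload size of a fixed-length string column in bytes."""
    return chars_per_value * num_values


size = column_data_bytes(80, 21 * 10**6)
print(size)                       # 1680000000
print(round(size / GIB, 2))       # 1.56
print(size > GIB)                 # True: above the "possibly also just 1G" mark
print(size > JAVA_BUFFER_LIMIT)   # False: still below 2 GiB
```

Under these assumptions a single 80-character column sits between the 1G and 2G figures mentioned in the thread, which fits the report that the error disappears once the wide string columns are removed.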

[jira] [Created] (ARROW-4736) [Go] Optimize memory usage for CSV writer

2019-03-01 Thread Anson Qian (JIRA)
Anson Qian created ARROW-4736: - Summary: [Go] Optimize memory usage for CSV writer Key: ARROW-4736 URL: https://issues.apache.org/jira/browse/ARROW-4736 Project: Apache Arrow Issue Type: Bug

[jira] [Created] (ARROW-4735) [Go] Benchmark strconv.Format vs. fmt.Sprintf for CSV writer

2019-03-01 Thread Anson Qian (JIRA)
Anson Qian created ARROW-4735: - Summary: [Go] Benchmark strconv.Format vs. fmt.Sprintf for CSV writer Key: ARROW-4735 URL: https://issues.apache.org/jira/browse/ARROW-4735 Project: Apache Arrow

[jira] [Created] (ARROW-4734) [Go] Add option to write a header for CSV writer

2019-03-01 Thread Anson Qian (JIRA)
Anson Qian created ARROW-4734: - Summary: [Go] Add option to write a header for CSV writer Key: ARROW-4734 URL: https://issues.apache.org/jira/browse/ARROW-4734 Project: Apache Arrow Issue Type:

[jira] [Created] (ARROW-4733) [C++] Add CI entry that builds without the conda-forge toolchain but with system packages

2019-03-01 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4733: -- Summary: [C++] Add CI entry that builds without the conda-forge toolchain but with system packages Key: ARROW-4733 URL: https://issues.apache.org/jira/browse/ARROW-4733

[jira] [Created] (ARROW-4732) [C++] Add docker-compose entry for testing Debian Testing build with system packages

2019-03-01 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4732: -- Summary: [C++] Add docker-compose entry for testing Debian Testing build with system packages Key: ARROW-4732 URL: https://issues.apache.org/jira/browse/ARROW-4732

[jira] [Created] (ARROW-4731) [C++] Add docker-compose entry for testing Ubuntu Xenial build with system packages

2019-03-01 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4731: -- Summary: [C++] Add docker-compose entry for testing Ubuntu Xenial build with system packages Key: ARROW-4731 URL: https://issues.apache.org/jira/browse/ARROW-4731

[jira] [Created] (ARROW-4730) [C++] Add docker-compose entry for testing Fedora build with system packages

2019-03-01 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4730: -- Summary: [C++] Add docker-compose entry for testing Fedora build with system packages Key: ARROW-4730 URL: https://issues.apache.org/jira/browse/ARROW-4730 Project:

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Uwe L. Korn
There is currently the limitation that a column in a single RecordBatch can only hold 2G on the Java side. We work around this by splitting the DataFrame under the hood into multiple RecordBatches. I'm not familiar with the Spark<->Arrow code but I guess that in this case, the Spark code can
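
The idea of splitting data into several batches so that no single column buffer exceeds a byte budget can be sketched in plain Python. This is only an illustration of the principle, not the code that pyarrow or Spark actually use; the function name split_into_batches and the max_bytes parameter are hypothetical.

```python
from typing import Iterator, List


def split_into_batches(values: List[str], max_bytes: int) -> Iterator[List[str]]:
    """Yield consecutive slices whose UTF-8 payload stays at or under max_bytes.

    A single value larger than max_bytes is still emitted, alone in its own batch.
    """
    batch: List[str] = []
    used = 0
    for value in values:
        n = len(value.encode("utf-8"))
        # Close the current batch before it would exceed the budget.
        if batch and used + n > max_bytes:
            yield batch
            batch, used = [], 0
        batch.append(value)
        used += n
    if batch:
        yield batch


# Ten 80-byte strings with a 240-byte budget give batches of 3, 3, 3 and 1 values.
batches = list(split_into_batches(["a" * 80] * 10, max_bytes=240))
print([len(b) for b in batches])  # [3, 3, 3, 1]
```

With a budget close to the 2G limit, each batch would hold at most that many bytes of string data per column. The sketch does not address the grouped case raised in this thread, where one group is sent as a single batch.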

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
Is there a limitation that a single column cannot be more than 1-2G? One of my columns definitely would be around 1.5GB of memory. I cannot split my DF into more partitions as I have only 1 ID and I'm grouping by that ID. So, the UDAF would only run on a single pandasDF. I do have a requirement

Re: Flaky Travis CI builds on master

2019-03-01 Thread Micah Kornfield
Moving away from the tactical for a minute, I think being able to track these over time would be useful. I can think of a couple of high level approaches and I was wondering what others think. 1. Use tags appropriately in JIRA and try to generate a report from that. 2. Create a new confluence

[jira] [Created] (ARROW-4729) [C++] Improve buffer symbolic index

2019-03-01 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4729: - Summary: [C++] Improve buffer symbolic index Key: ARROW-4729 URL: https://issues.apache.org/jira/browse/ARROW-4729 Project: Apache Arrow

Re: Flaky Travis CI builds on master

2019-03-01 Thread Francois Saint-Jacques
Also just created https://issues.apache.org/jira/browse/ARROW-4728 On Thu, Feb 28, 2019 at 3:53 AM Ravindra Pindikura wrote: > > > > On Feb 28, 2019, at 2:10 PM, Antoine Pitrou wrote: > > > > > > Le 28/02/2019 à 07:53, Ravindra Pindikura a écrit : > >> > >> > >>> On Feb 27, 2019, at 1:48 AM,

[jira] [Created] (ARROW-4728) [Javascript] Failing test Table#assign with a zero-length Null column round-trips through serialization

2019-03-01 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4728: - Summary: [Javascript] Failing test Table#assign with a zero-length Null column round-trips through serialization Key: ARROW-4728 URL:

Re: OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Uwe L. Korn
Hello Abdeali, a problem here could be that a single column of your DataFrame is using more than 2GB of RAM (possibly also just 1G). Try splitting your DataFrame into more partitions before applying the UDAF. Cheers Uwe On Fri, Mar 1, 2019, at 9:09 AM, Abdeali Kothari wrote: > I was using

Re: Nightly binary packages

2019-03-01 Thread Krisztián Szűcs
On Wed, Feb 27, 2019 at 9:30 PM Kouhei Sutou wrote: > Hi, > > > - How should we handle the signing procedure? Simply omit? > > For .deb and .rpm, we need to sign them to install them by > apt/yum. > > We should use a GPG key only for nightly for this > purpose. We should not use GPG keys in >

Re: [ANNOUNCE] New Arrow committer: Paddy Horan

2019-03-01 Thread Andy Grove
Congratulations, Paddy! Great to have you here. On Fri, Mar 1, 2019 at 8:45 AM Krisztián Szűcs wrote: > Congrats Paddy! > > On Fri, Mar 1, 2019 at 3:19 AM Chao Sun wrote: > > > Congratulations Paddy! > > > > On Thu, Feb 28, 2019 at 5:52 PM paddy horan > > wrote: > > > > > Thanks All, > > > >

Re: [ANNOUNCE] New Arrow committer: Chao Sun

2019-03-01 Thread Andy Grove
Congratulations, Chao! Great to have you here. On Fri, Mar 1, 2019 at 8:45 AM Krisztián Szűcs wrote: > Congrats Chao! > > On Fri, Mar 1, 2019 at 3:19 AM Chao Sun wrote: > > > Thanks everyone. Looking forward to contributing more! > > > > Chao > > > > On Thu, Feb 28, 2019 at 4:24 PM Renjie Liu

Re: [ANNOUNCE] New Arrow committer: Chao Sun

2019-03-01 Thread Krisztián Szűcs
Congrats Chao! On Fri, Mar 1, 2019 at 3:19 AM Chao Sun wrote: > Thanks everyone. Looking forward to contributing more! > > Chao > > On Thu, Feb 28, 2019 at 4:24 PM Renjie Liu > wrote: > > > Congrats! > > > > Micah Kornfield 于 2019年3月1日周五 上午7:26写道: > > > > > Congrats! > > > > > > On Thu, Feb

Re: [ANNOUNCE] New Arrow committer: Paddy Horan

2019-03-01 Thread Krisztián Szűcs
Congrats Paddy! On Fri, Mar 1, 2019 at 3:19 AM Chao Sun wrote: > Congratulations Paddy! > > On Thu, Feb 28, 2019 at 5:52 PM paddy horan > wrote: > > > Thanks All, > > > > I am honored to be a part of such a great, talented community. > > > > P > > > > > > From:

OversizedAllocationException for pandas_udf in pyspark

2019-03-01 Thread Abdeali Kothari
I was using arrow with spark+python and when I'm trying some pandas-UDAF functions I am getting this error: org.apache.arrow.vector.util.OversizedAllocationException: Unable to expand the buffer at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:457)