Problem with master build failing

2020-07-02 Thread Fan Liya
Dear all, Currently, the master build is failing occasionally. After investigation, we found it was caused by a cyclic dependency during class loading. We have provided a patch for it [1]. Please take a look. Best, Liya Fan [1] https://github.com/apache/arrow/pull/7628

Re: [RESULT] [VOTE] Add a "Feature" enum to Schema.fbs

2020-07-02 Thread Micah Kornfield
I added JIRAs for incorporating this into implementations. On Thu, Jul 2, 2020 at 6:25 AM Wes McKinney wrote: > Forwarding with [RESULT] subject line > > On Wed, Jul 1, 2020 at 1:24 AM Micah Kornfield > wrote: > > > > The vote carries with 4 binding +1 votes and 0 non-binding +1. I will > > mer

RE: [CI] Reliability of s390x Travis CI build

2020-07-02 Thread Kazuaki Ishizaki
I have seen this failure multiple times. However, it is not addressed yet. https://travis-ci.community/t/s390x-no-space-left-on-device/8953 It is fine with me until we see more stable results. Regards, Kazuaki Ishizaki From: Wes McKinney To: dev Date: 2020/07/03 05:32 Subject:

[RESULT] [VOTE] Increment MetadataVersion in Schema.fbs from V4 to V5 for 1.0.0 release

2020-07-02 Thread Wes McKinney
The vote carries with 6 binding +1 votes and 2 non-binding +1 On Tue, Jun 30, 2020 at 4:03 PM Sutou Kouhei wrote: > > +1 (binding) > > In > "[VOTE] Increment MetadataVersion in Schema.fbs from V4 to V5 for 1.0.0 > release" on Mon, 29 Jun 2020 16:42:45 -0500, > Wes McKinney wrote: > > > Hi,

[RESULT] [VOTE] Permitting unsigned integers for Arrow dictionary indices

2020-07-02 Thread Wes McKinney
The vote carries with 6 binding +1 and 1 non-binding +1. Thanks all On Tue, Jun 30, 2020 at 10:07 AM Francois Saint-Jacques wrote: > > +1 (binding) > > On Tue, Jun 30, 2020 at 10:55 AM Neal Richardson > wrote: > > > > +1 (binding) > > > > On Tue, Jun 30, 2020 at 2:52 AM Antoine Pitrou wrote: >

Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Joaquin Vanschoren
Thanks! > You should be able to store different length vectors in Parquet. Think of > strings simply as an array of bytes, and those are variable length. You > would want to make sure you don’t use DICTIONARY_ENCODING in that case. > Interesting. We'll look at that. > No, I'm not aware of any

[RESULT] [VOTE] Removing validity bitmap from Arrow union types

2020-07-02 Thread Wes McKinney
The vote carries with 3 binding +1 votes, 2 non-binding +1, and 1 +0 Thanks all for voting. I will update the Format PR and plan to merge the C++ PR soon thereafter On Tue, Jun 30, 2020 at 4:00 PM Sutou Kouhei wrote: > > +1 (binding) > > In > "[VOTE] Removing validity bitmap from Arrow union

Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Nicholas Poorman
Joaquin, > Do you know whether there is any activity on supporting partial read/writes in arrow or fastparquet? I’m not entirely sure about the status of partial read/writes in Arrow’s Parquet implementation but https://github.com/xitongsys/parquet-go for example has this capability. > Even then, t

Re: [CI] Reliability of s390x Travis CI build

2020-07-02 Thread Wes McKinney
Just looking at https://travis-ci.org/github/apache/arrow/builds the failure rate on master (which should be green > 95% of the time) is really high. I'm going to open a patch adding to allow_failures until we see this become less flaky On Thu, Jul 2, 2020 at 8:39 AM Antoine Pitrou wrote: > > > I

Re: Timeline for next major Arrow release (1.0.0)

2020-07-02 Thread Wes McKinney
hi folks, I hope you and your families are all well. We're heading into a holiday weekend here in the US -- I would guess given the state of the backlog and nightly builds that the earliest we could contemplate making the release will be the week of July 13. That should give enough time next week

Re: Developing a C++ Python extension

2020-07-02 Thread Maarten Breddels
I can confirm what Uwe said, manylinux doesn't cause issues. Here I've built inside a manylinux2010 docker a C++ Python extension (using the C++ of Arrow): https://github.com/vaexio/vaex-arrow-ext/runs/831763024?check_suite_focus=true It's built with the manylinux1 and manylinux2010 pyarrow wheel

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-02 Thread Antoine Pitrou
Well, it depends how important speed is, but LZ4 has extremely fast decompression, even compared to Snappy: https://github.com/lz4/lz4#benchmarks Regards Antoine. Le 02/07/2020 à 19:47, Christian Hudon a écrit : > At least for us, the advantages of Parquet are speed and interoperability > in

Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-02 Thread Christian Hudon
At least for us, the advantages of Parquet are speed and interoperability in the context of longer-term data storage, so I would tend to say "reasonably conservative". Le mer. 1 juill. 2020, à 09 h 32, Antoine Pitrou a écrit : > > I don't have a sense of how conservative Parquet users generally

Re: Arrow for low-latency streaming of small batches?

2020-07-02 Thread Christian Hudon
Very interesting. This is something that I would potentially also be interested in, so if there were some code available out there, I could potentially contribute or at least use. At least, I'd love for something that allows Arrow to work with both larger and very small record batches (a few rows)

Re: Upcoming JS fixes and release timeline

2020-07-02 Thread Wes McKinney
Since publishing artifacts to NPM is somewhat independent from the Apache source release, if you aren't ready to push to NPM then the release manager can just not push the artifacts Note that the plan hasn't been to go from 1.0.0 to 1.1.0, rather that almost every Apache release (aside from patch

Re: Developing a C++ Python extension

2020-07-02 Thread Tim Paine
We build pyarrow in the docker image because auditwheel complains about pyarrow otherwise which causes our wheels to fail auditwheel and not allow the manylinux tag. But assuming we build pyarrow in the docker image, our manylinux wheels that result are then compatible with the pyarrow manylinux

Re: Developing a C++ Python extension

2020-07-02 Thread Uwe L. Korn
I did try the approach to not link against pyarrow but leave out the symbols, just ensure pyarrow is imported before the vaex extension. This works out-of-the-box on macOS but fails on Linux as symbols have a scope there. Adding the following lines to load Arrow into the global scope made it wor

Re: Developing a C++ Python extension

2020-07-02 Thread Uwe L. Korn
Hello Tim, thanks for the hint. I see that you build arrow yourselves in the Dockerfile. Could it be that in the end you statically link the arrow libraries? As there are no wheels on PyPI, I couldn't verify whether that assumption is true. Best Uwe On Thu, Jul 2, 2020, at 4:53 PM, Tim Pain

Re: Performance of ArrowJS in the DOM

2020-07-02 Thread Tim Paine
The virtual table sounds a lot like regular-table: https://github.com/jpmorganchase/regular-table Used in perspective: https://perspective.finos.org/ We use arrow c++ compiled with webassembly and some front end grid and chart plugins, perspective can run in a client server fashion and only se

Re: [Discuss] Extremely dubious Python equality semantics

2020-07-02 Thread Wes McKinney
On Wed, Jul 1, 2020 at 9:52 AM Joris Van den Bossche wrote: > > I am personally fine with removing the compute dunder methods again (i.e. > Array.__richcmp__), if that resolves the ambiguity. Although they *are* > convenient IMO, even for developers (question might also come up if we want > to add

Re: Developing a C++ Python extension

2020-07-02 Thread Tim Paine
We spent a ton of time on this for perspective, the end result is a mostly compatible set of wheels for most platforms, I believe we skipped py2 but nobody cares about those anyway. We link against libarrow and libarrow_python on Linux, on windows we vendor them all into our library. Feel free t

Re: Decimal128 scale limits

2020-07-02 Thread Wes McKinney
I think the intention so far has been to support precision between 0 and 38 and scale <= precision. 128-bit integers max out at 38 digits, I think that's the rationale for the limit. See e.g. the Impala docs (also uses 128-bit decimals) [1] [1]: https://impala.apache.org/docs/build/html/topics/imp
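The 38-digit figure mentioned above can be checked with plain integer arithmetic. The following is a minimal sketch, not code from Arrow or Impala; the helper name `max_decimal_digits` is hypothetical and exists only for this illustration. It finds the largest digit count d such that every d-digit decimal integer fits below a given signed maximum.

```python
# Largest value of a signed 128-bit two's-complement integer.
INT128_MAX = 2**127 - 1


def max_decimal_digits(max_value: int) -> int:
    """Return the largest d such that 10**d - 1 (the biggest d-digit
    decimal integer) is still <= max_value.

    Hypothetical helper for illustration only; not an Arrow API.
    """
    digits = 0
    # Keep adding digits while the all-nines number of that width still fits.
    while 10 ** (digits + 1) - 1 <= max_value:
        digits += 1
    return digits


# 10**38 - 1 fits in a signed 128-bit integer, 10**39 - 1 does not.
print(max_decimal_digits(INT128_MAX))
```

Under this reading, a precision of 38 is the widest for which every unscaled value is representable in 128 bits.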

Re: Developing a C++ Python extension

2020-07-02 Thread Uwe L. Korn
I had so much fun with the wheels in the past, I'm now a happy member of conda-forge core instead :D The good thing first: * The C++ ABI didn't change between the manylinux versions, it is the old one in all cases. So you can mix & match manylinux versions. The sad things: * The manylinuxX standa

Performance of ArrowJS in the DOM

2020-07-02 Thread Matthias Vallentin
Hi folks, We are reaching out to better understand the performance of ArrowJS when it comes to viewing large amounts of data (> 1M records) in the browser’s DOM. Our backend (https://github.com/tenzir/vast) spits out record batches, which we are accumulating in the frontend with a RecordBatchReade

Re: Arrow as a common open standard for machine learning data

2020-07-02 Thread Joaquin Vanschoren
Hi Nick, all, Thanks! I updated the blog post to specify the requirements better. First, we plan to store the datasets in S3 (on min.io). I agree this works nicely with Parquet. Do you know whether there is any activity on supporting partial read/writes in arrow or fastparquet? That would change th

Re: [CI] Reliability of s390x Travis CI build

2020-07-02 Thread Antoine Pitrou
In my experience, both the s390x and ARM builds are flaky on Travis CI, for reasons which seem unrelated to Arrow. The infrastructure seems a bit unreliable. Regards Antoine. Le 02/07/2020 à 15:15, Wes McKinney a écrit : > I would be interested to know the empirical reliability of the s390x

[RESULT] [VOTE] Add a "Feature" enum to Schema.fbs

2020-07-02 Thread Wes McKinney
Forwarding with [RESULT] subject line On Wed, Jul 1, 2020 at 1:24 AM Micah Kornfield wrote: > > The vote carries with 4 binding +1 votes and 0 non-binding +1. I will > merge the change and open some JIRAs about reading/writing the new field > from reference implementations (hopefully tomorrow). >

[CI] Reliability of s390x Travis CI build

2020-07-02 Thread Wes McKinney
I would be interested to know the empirical reliability of the s390x Travis CI build, but my guess is that it is flaking at least 20% of the time, maybe more than that. If that's the case, then I think it should be added back to allow_failures and at best we can look at it periodically to make sur

Re: Sharing our experience adopting (py) Arrow in Vaex

2020-07-02 Thread Wes McKinney
On Thu, Jul 2, 2020 at 3:32 AM Maarten Breddels wrote: > > Hi, > > in the process of adding Arrow support in Vaex (natively, not converting to > Numpy as we did before), one of our biggest pain points is (surprisingly) > the name mismatch between NumPy's .tolist() and Arrow's .to_pylist(). > Espec

Re: Developing a C++ Python extension

2020-07-02 Thread Maarten Breddels
Ok, thanks! I'm setting up a repo with an example here, using pybind11: https://github.com/vaexio/vaex-arrow-ext and I'll just try all possible combinations and report back. cheers, Maarten Breddels Software engineer / consultant / data scientist Python / C++ / Javascript / Jupyter www.maartenb

Re: Developing a C++ Python extension

2020-07-02 Thread Joris Van den Bossche
Also no concrete answer, but one such example is turbodbc, I think. But it seems they only have conda binary packages, and don't distribute wheels .. (https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html), so not that relevant as comparison (they also need to build against an odbc d

[NIGHTLY] Arrow Build Report for Job nightly-2020-07-02-0

2020-07-02 Thread Crossbow
Arrow Build Report for Job nightly-2020-07-02-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0 Failed Tasks: - test-conda-cpp-valgrind: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-02-0-github-test-conda-cpp-valgrind

Re: Developing a C++ Python extension

2020-07-02 Thread Antoine Pitrou
Hi Maarten, Le 02/07/2020 à 10:53, Maarten Breddels a écrit : > > Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a vaex > extension distributed as a 2010 wheel, and build with the pyarrow 2010 > wheel, work in an environment where someone installed a pyarrow 2014 > wheel, or

Developing a C++ Python extension

2020-07-02 Thread Maarten Breddels
Hi, again, in the process of adopting Arrow in Vaex, we need to have some legacy c++ code in Vaex itself, and we might want to add some new functions in c++ that might not be suitable for core Apache Arrow, or we need to ship ourselves due to time constraints. I am a bit worried about the C++ ABI

Sharing our experience adopting (py) Arrow in Vaex

2020-07-02 Thread Maarten Breddels
Hi, in the process of adding Arrow support in Vaex (natively, not converting to Numpy as we did before), one of our biggest pain points is (surprisingly) the name mismatch between NumPy's .tolist() and Arrow's .to_pylist(). Especially in code that deals with both types of arrays, this is a bit of
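For code that handles both kinds of arrays, one way to work around the `.tolist()` / `.to_pylist()` mismatch is a small duck-typed helper. This is only a sketch under assumptions: the name `to_python_list` is hypothetical and is not part of NumPy, Arrow, or Vaex. It relies solely on the two method names quoted in the message above.

```python
def to_python_list(values):
    """Convert an array-like object to a plain Python list.

    Hypothetical helper for illustration; it is not part of NumPy,
    pyarrow, or Vaex. It dispatches on method names only, so it needs
    neither library to be imported.
    """
    # Arrow arrays expose .to_pylist()
    if hasattr(values, "to_pylist"):
        return values.to_pylist()
    # NumPy arrays expose .tolist()
    if hasattr(values, "tolist"):
        return values.tolist()
    # Fall back to generic iteration for other sequences.
    return list(values)
```

Calling `to_python_list(x)` then works the same way whether `x` is a NumPy array, an Arrow array, or an ordinary Python sequence, at the cost of one extra attribute lookup per call.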