[Rust] [Discuss] Generalising Eq, Neq Kernel Functions Beyond Numeric Types

2019-05-13 Thread Neville Dipale
Hi Arrow[Rust] developers, I came across an instance where I wanted to compare 2 arrays that aren't numeric (bool, string, list?), and couldn't conveniently leverage the comparison array_ops for this. This is due to the trait bounds that require that PrimitiveArray satisfy T: ArrowNumericType. Us

Re: [ANNOUNCE] New Arrow committer: Neville Dipale

2019-05-13 Thread Bryan Cutler
Congratulations Neville! On Mon, May 13, 2019, 8:33 AM Neville Dipale wrote: > Thanks everyone for the invite and privilege. > > Neville > > On Mon, 13 May 2019 at 15:19, Wes McKinney wrote: > > > Congrats! > > > > On Mon, May 13, 2019 at 4:25 AM Krisztián Szűcs > > wrote: > > > > > > Congrats

Re: Pyarrow filter/sort/bsearch

2019-05-13 Thread John Muehlhausen
Thanks Ted! Assuming this works I can probably move sorting out of my fast path. My pressing need would then be to slice pre-sorted record batches using binary search. On Mon, May 13, 2019 at 1:39 PM Ted Gooch wrote: > At least for the filtering part, isn't it already possible via gandiva > fi

[jira] [Created] (ARROW-5314) [Go] Incorrect Printing for String Arrays with Offsets

2019-05-13 Thread James Walker (JIRA)
James Walker created ARROW-5314: Summary: [Go] Incorrect Printing for String Arrays with Offsets Key: ARROW-5314 URL: https://issues.apache.org/jira/browse/ARROW-5314 Project: Apache Arrow Is

[jira] [Created] (ARROW-5313) [Format] Comments on Field table are a bit confusing

2019-05-13 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-5313: Summary: [Format] Comments on Field table are a bit confusing Key: ARROW-5313 URL: https://issues.apache.org/jira/browse/ARROW-5313 Project: Apache Arrow Iss

[jira] [Created] (ARROW-5312) [C++] Move JSON integration testing utilities to arrow/testing and libarrow_testing.so

2019-05-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5312: Summary: [C++] Move JSON integration testing utilities to arrow/testing and libarrow_testing.so Key: ARROW-5312 URL: https://issues.apache.org/jira/browse/ARROW-5312 Pr

Re: Pyarrow filter/sort/bsearch

2019-05-13 Thread Ted Gooch
At least for the filtering part, isn't it already possible via gandiva filters[1]? I had a similar question about pushing record-level filtering into the parquet reader. [1] https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100 On Mon, May 13, 2019 at 8:51 AM W

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-05-13 Thread Wes McKinney
Both of those require the use of a schema compiler (like Protocol Buffers) which may or may not be practical for many applications On Thu, May 9, 2019 at 6:29 PM Tim Swast wrote: > > I just remembered two serialization formats that are very similar to what > you describe: > > Flatbuffers > https:

Re: Pyarrow filter/sort/bsearch

2019-05-13 Thread Wes McKinney
https://issues.apache.org/jira/browse/ARROW-1558 On Mon, May 13, 2019 at 10:47 AM Micah Kornfield wrote: > > There are also some open JIRA issues for these sorting in > cpp/src/arrow/compute [1][2]. I couldn't find one for filtering but I'm > surprised one doesn't exist. > > [1] https://issue

Re: Pyarrow filter/sort/bsearch

2019-05-13 Thread Micah Kornfield
There are also some open JIRA issues for sorting in cpp/src/arrow/compute [1][2]. I couldn't find one for filtering, but I'm surprised one doesn't exist. [1] https://issues.apache.org/jira/browse/ARROW-4631

Re: Pyarrow filter/sort/bsearch

2019-05-13 Thread Wes McKinney
hi John -- I'd recommend implementing these capabilities as Kernel functions under cpp/src/arrow/compute, then they can be exposed in Python easily. - Wes On Mon, May 13, 2019 at 9:01 AM John Muehlhausen wrote: > > Does pyarrow currently support filter/sort/search without conversion to > pandas?

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Wes McKinney
On Mon, May 13, 2019 at 10:28 AM John Muehlhausen wrote: > > ``perhaps the right way forward is to start by gathering a > number of interested parties and start designing a proposal'' > > YES! How do we go about this? > I'd recommend writing a proposal document (using Google Docs or whatever too

Re: [ANNOUNCE] New Arrow committer: Neville Dipale

2019-05-13 Thread Neville Dipale
Thanks everyone for the invite and privilege. Neville On Mon, 13 May 2019 at 15:19, Wes McKinney wrote: > Congrats! > > On Mon, May 13, 2019 at 4:25 AM Krisztián Szűcs > wrote: > > > > Congrats Neville! > > > > On Mon, May 13, 2019 at 11:02 AM Fan Liya wrote: > > > > > Congrats!!! > > > > > >

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
``perhaps the right way forward is to start by gathering a number of interested parties and start designing a proposal'' YES! How do we go about this? ``There are some early experiments to populate Arrow nodes in microbatches from Kafka'' (cf link in thread) Who did this? -John On Mon, May 13

[jira] [Created] (ARROW-5311) [C++] Return more specific invalid Status in Take kernel

2019-05-13 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5311: Summary: [C++] Return more specific invalid Status in Take kernel Key: ARROW-5311 URL: https://issues.apache.org/jira/browse/ARROW-5311 Project: Apache

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-05-13 Thread Wes McKinney
As I've ventured further in working on this I've realized that it's not practical (or even a good idea) to continue to maintain the "fixed dictionary" path. Since the IPC protocol can have evolving dictionaries, nearly all code paths in the codebase have to change to work for the variable case, whi

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Antoine Pitrou
Hi John, We are strongly committed to backwards compatibility in the Arrow format specification. You should not fear any compatibility-breaking changes in the future. People sometimes express uncertainty because we have not reached 1.0 yet, but that's because we have not yet implemented all th

[jira] [Created] (ARROW-5310) [Python] better error message on creating ParquetDataset from empty directory

2019-05-13 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5310: Summary: [Python] better error message on creating ParquetDataset from empty directory Key: ARROW-5310 URL: https://issues.apache.org/jira/browse/ARROW-5310

Pyarrow filter/sort/bsearch

2019-05-13 Thread John Muehlhausen
Does pyarrow currently support filter/sort/search without conversion to pandas? I don’t see anything but want to be sure. Sorry if I overlooked it. Specific needs: 1- filter an arrow record batch and sort the results into a new batch 2- find slice locations for a sorted batch using binary search
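The two needs listed above can be sketched without any Arrow API at all. This is a minimal illustration in plain Python, assuming a record batch can be modeled as a list of tuples and the sort key as its first field; `filter_and_sort` and `slice_bounds` are hypothetical helper names, not pyarrow functions.

```python
from bisect import bisect_left, bisect_right


def filter_and_sort(rows, predicate, key):
    """Keep the rows matching predicate, then return them sorted by key."""
    return sorted((row for row in rows if predicate(row)), key=key)


def slice_bounds(sorted_keys, low, high):
    """Return (start, stop) such that sorted_keys[start:stop] lies in [low, high]."""
    return bisect_left(sorted_keys, low), bisect_right(sorted_keys, high)


# Hypothetical stand-in for a record batch: (key, payload) tuples.
rows = [(5, "e"), (1, "a"), (3, "c"), (9, "x"), (3, "d")]

# Need 1: filter, then sort the survivors into a new "batch".
batch = filter_and_sort(rows, lambda row: row[0] < 9, key=lambda row: row[0])

# Need 2: binary-search the sorted key column for slice locations.
keys = [row[0] for row in batch]
start, stop = slice_bounds(keys, 3, 5)
window = batch[start:stop]
```

Once a batch is already sorted, only the two binary searches are needed per lookup, each taking a logarithmic number of steps, which is why sorting can move off the fast path.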

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
Micah, yes, it all works at the moment. How have we staked out that it will always work in the future as people continue to work on the spec? That is my concern. Also, it would be extremely useful if someone opening a file had my nil rows hidden from them without needing to analyze the app-specif

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Micah Kornfield
This is already implicit in the spec because it requires 8-byte alignment and padding, but recommends 64. I'd be OK updating the spec to explicitly state that buffers might be oversized, but I agree with Wes: I don't think a format change is warranted. On Mon, May 13, 2019 at 6:29 AM John Muehlhaus
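To make the padding point concrete, here is a small sketch of rounding a buffer size up to an alignment boundary. The 8-byte and 64-byte figures come from the message above; the function name `padded_size` is an assumption for illustration and is not part of any Arrow library.

```python
def padded_size(nbytes, alignment=64):
    """Round nbytes up to the next multiple of alignment."""
    if nbytes < 0 or alignment <= 0:
        raise ValueError("nbytes must be >= 0 and alignment must be > 0")
    return (nbytes + alignment - 1) // alignment * alignment


# A 10-byte payload occupies a larger buffer once padded.
required = padded_size(10, alignment=8)      # minimum padding from the message
recommended = padded_size(10, alignment=64)  # recommended padding from the message
slack = recommended - 10                     # bytes in the buffer beyond the data
```

Because padding already makes a buffer larger than the data it holds, a reader has to rely on the stated length rather than the buffer size, which is the sense in which oversized buffers are implicit.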

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Wes McKinney
Furthermore, we already have a "custom_metadata" field on Message where you could indicate that a RecordBatch is underfilled; there's no need to change the protocol https://github.com/apache/arrow/blob/master/format/Message.fbs#L98 On Mon, May 13, 2019 at 8:30 AM Micah Kornfield wrote: > > Hi Jo

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Micah Kornfield
Hi John, To expand on this, I don't think there is anything in the current spec preventing you from over-provisioning the underlying buffers. So you can effectively split "capacity" from "length" by subtracting the size of the buffer from the amount of space taken by the rows indicated in the batc
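One way to read the capacity/length split is as simple arithmetic on an over-provisioned buffer. This sketch assumes a single fixed-width column and ignores validity bitmaps and offset buffers; `spare_rows` and its parameters are hypothetical names, not Arrow APIs.

```python
def spare_rows(buffer_nbytes, length, value_width):
    """Rows that still fit in a fixed-width buffer already holding length rows."""
    used = length * value_width
    if used > buffer_nbytes:
        raise ValueError("buffer is smaller than the rows it claims to hold")
    return (buffer_nbytes - used) // value_width


# A 1024-byte buffer of 8-byte values that currently holds 100 rows.
remaining = spare_rows(1024, length=100, value_width=8)
capacity = 100 + remaining
```

Here "length" is what the batch metadata reports, while "capacity" is derived from the buffer size, so no extra field is needed to recover it for fixed-width data.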

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
Thanks Wes, do you have any comment on the following from the zdnet story I linked? ``But the missing piece is streaming, where the velocity of incoming data poses a special challenge. There are some early experiments to populate Arrow nodes in microbatches from Kafka. And, as the edge gets smarte

Re: [ANNOUNCE] New Arrow committer: Neville Dipale

2019-05-13 Thread Wes McKinney
Congrats! On Mon, May 13, 2019 at 4:25 AM Krisztián Szűcs wrote: > > Congrats Neville! > > On Mon, May 13, 2019 at 11:02 AM Fan Liya wrote: > > > Congrats!!! > > > > On Sun, May 12, 2019 at 10:10 AM Philipp Moritz > > wrote: > > > > > Congrats Neville! > > > > > > On Sat, May 11, 2019 at 6:09 P

[jira] [Created] (ARROW-5309) [Python] Add clarifications to Python "append" methods that return new objects

2019-05-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5309: Summary: [Python] Add clarifications to Python "append" methods that return new objects Key: ARROW-5309 URL: https://issues.apache.org/jira/browse/ARROW-5309 Project: A

Re: Read Arrow 0.9.0 output using newer pyarrow version

2019-05-13 Thread Wes McKinney
hi Rares, Like I said the files should be forward compatible. Can you open a JIRA issue and give code to reproduce the issue? Thanks On Mon, May 13, 2019 at 7:43 AM Rares Vernica wrote: > > Hi Wes, > > Thanks for your answer. I finally got to test this out. To recap, I'm > writing Arrow files f

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread Wes McKinney
hi John, Sorry, there's a number of fairly long e-mails in this thread; I'm having a hard time following all of the details. I suspect the most parsimonious thing would be to have some "sidecar" metadata that tracks the state of your writes into pre-allocated Arrow blocks so that readers know to

[jira] [Created] (ARROW-5308) [Go] remove deprecated Feather format

2019-05-13 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5308: Summary: [Go] remove deprecated Feather format Key: ARROW-5308 URL: https://issues.apache.org/jira/browse/ARROW-5308 Project: Apache Arrow Issue Type: Bu

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
Any thoughts on a RecordBatch distinguishing size from capacity? (To borrow std::vector terminology) Thanks, John On Thu, May 9, 2019 at 2:46 PM John Muehlhausen wrote: > Wes et al, I think my core proposal is that Message.fbs:RecordBatch split > the "length" parameter into "theoretical max len

Re: Read Arrow 0.9.0 output using newer pyarrow version

2019-05-13 Thread Rares Vernica
Hi Wes, Thanks for your answer. I finally got to test this out. To recap, I'm writing Arrow files from C++ using Arrow 0.9.0. Then, I'm trying to read these files from Python. I tried Python 2.7.15 and PyArrow 0.10.0 to 0.13.0. In all these cases I get an error. (PyArrow 0.9.0 works fine, as expe

[jira] [Created] (ARROW-5307) [CI] [GLib] Enable GTK-Doc

2019-05-13 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-5307: Summary: [CI] [GLib] Enable GTK-Doc Key: ARROW-5307 URL: https://issues.apache.org/jira/browse/ARROW-5307 Project: Apache Arrow Issue Type: New Feature

[jira] [Created] (ARROW-5306) [CI] [GLib] Disable GTK-Doc

2019-05-13 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-5306: Summary: [CI] [GLib] Disable GTK-Doc Key: ARROW-5306 URL: https://issues.apache.org/jira/browse/ARROW-5306 Project: Apache Arrow Issue Type: New Feature

Re: [ANNOUNCE] New Arrow committer: Neville Dipale

2019-05-13 Thread Krisztián Szűcs
Congrats Neville! On Mon, May 13, 2019 at 11:02 AM Fan Liya wrote: > Congrats!!! > > On Sun, May 12, 2019 at 10:10 AM Philipp Moritz > wrote: > > > Congrats Neville! > > > > On Sat, May 11, 2019 at 6:09 PM Renjie Liu > > wrote: > > > > > Congrats! > > > > > > On Sun, May 12, 2019 at 12:38 AM Chao Sun wrote

Re: [ANNOUNCE] New Arrow committer: Neville Dipale

2019-05-13 Thread Fan Liya
Congrats!!! On Sun, May 12, 2019 at 10:10 AM Philipp Moritz wrote: > Congrats Neville! > > On Sat, May 11, 2019 at 6:09 PM Renjie Liu > wrote: > > > Congrats! > > > > On Sun, May 12, 2019 at 12:38 AM Chao Sun wrote: > > > > > Congrats Neville! > > > > > > On Sat, May 11, 2019 at 9:36 AM Micah Kornfield > >