Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2022-02-09 Thread Phillip Cloud
I don't think memcpy is feasible. The bytes may be different for different languages. Many languages' compilers reorder struct fields and pad structs for efficiency reasons so the bytes in metadata coming from language X may be meaningless to language Y. On Wed, Feb 9, 2022, 08:41 Dewey
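
To illustrate the padding and layout concern raised in this message, here is a minimal sketch that uses Python's standard `struct` module as a stand-in for a compiler's layout choices. The two-field layout (`FIELDS`) is hypothetical and not taken from the thread; the native size is platform-dependent, so it is not stated here.

```python
import struct

# Hypothetical record: a 1-byte signed char followed by a 4-byte int.
FIELDS = "bi"

# "@" uses native size and alignment, so padding may be inserted
# between the fields (the exact amount depends on the platform).
native_size = struct.calcsize("@" + FIELDS)

# "<" uses standard sizes, little-endian, with no padding at all.
packed_size = struct.calcsize("<" + FIELDS)

# The same two values serialize to different bytes under
# little-endian and big-endian layouts.
little = struct.pack("<" + FIELDS, 1, 2)
big = struct.pack(">" + FIELDS, 1, 2)

print(native_size, packed_size)
print(little.hex(), big.hex())
```

If two writers disagree on any of these choices, a raw copy of one writer's bytes cannot be interpreted by the other without an agreed serialization format.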

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2022-02-09 Thread Dewey Dunnington
I'm new to this discussion and Arrow generally and appreciate Joris bringing this up. While the forward- and backward-compatibility of this is over my head, I think that it is important to provide a path for extension types to have serialized metadata as binary because the alternatives are (in my

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2022-02-04 Thread Joris Van den Bossche
Reviving this thread, I don't think anything happened in the meantime on this topic? From rereading the thread, it seems David mentioned two possible ideas: - A new [byte] binary_value field in the existing KeyValue type, next to the existing string value field. And if you have valid string

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-09-01 Thread Micah Kornfield
It still sounds like adding a new type might be the safest approach (and marking the old type as discouraged). On Mon, Aug 23, 2021 at 11:18 AM David Li wrote: > I believe so. > > The encoding of a string in Flatbuffers is [byte] with a null terminator > not included in the length, so old files

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-08-23 Thread David Li
I believe so. The encoding of a string in Flatbuffers is [byte] with a null terminator not included in the length, so old files should still be readable (they would simply not see the terminator anymore). And conversely, continuing to write the null terminator means new files should still be
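
The layout described in this message can be sketched by hand as follows. This is an illustration only, not the Flatbuffers library: the function names are hypothetical, and alignment padding and table offsets are ignored. It assumes the length prefix is a little-endian unsigned 32-bit integer.

```python
import struct

def encode_fb_string(payload: bytes) -> bytes:
    """Hand-rolled sketch: uint32 length prefix, the bytes, then a null
    terminator that is NOT counted in the length."""
    return struct.pack("<I", len(payload)) + payload + b"\x00"

def read_as_byte_vector(buf: bytes) -> bytes:
    """Read the same buffer as a [byte] vector: only the length prefix
    is consulted, so the trailing terminator is simply never seen."""
    (length,) = struct.unpack_from("<I", buf, 0)
    return buf[4:4 + length]

encoded = encode_fb_string(b"abc")
print(encoded)
print(read_as_byte_vector(encoded))
```

Under these assumptions, a reader that treats the field as a byte vector recovers the payload from data written as a string, because the terminator sits past the declared length.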

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-08-23 Thread Antoine Pitrou
Le 23/08/2021 à 17:52, David Li a écrit : Another way forward might be to relax the value type to [byte], but also require implementations to null-terminate binary values regardless. The C++ Flatbuffers implementation does this already [1] (though not the Java one [2]). Old implementations

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-08-23 Thread David Li
Another way forward might be to relax the value type to [byte], but also require implementations to null-terminate binary values regardless. The C++ Flatbuffers implementation does this already [1] (though not the Java one [2]). Old implementations validating UTF8-ness would still be unable to

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-08-18 Thread David Li
This isn't too thought out yet but: 1. Any file which stuffs binary data into the value is already unreadable for anyone directly using Flatbuffers. So we can specify that the field must be valid UTF-8, but implementations can permit relaxed validation/reading as binary data instead in order

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-08-16 Thread Micah Kornfield
I agree with you. Any thoughts on a way forward for at least hardening the spec (or should this be done at the same time as adding the new field)? On Mon, Aug 16, 2021 at 1:45 AM Wes McKinney wrote: > I've been poking around the project, and I'm growing concerned that > our use of the KeyValue

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-08-16 Thread Wes McKinney
I've been poking around the project, and I'm growing concerned that our use of the KeyValue field has already been non-compliant in many cases since we do not validate UTF8-ness. Since we also use KeyValue to handle opaque data serialization for extension types [1], the fact that the specification

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-08-10 Thread Wes McKinney
Ah, that's definitely a no-go then (I believe we verify messages unconditionally in C++). That's unfortunate (and I feel responsible for missing this, but I suppose we had a lot of opportunities to fix it prior to the 1.0.0 format version) — so to have actual binary values (which was the intention

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-08-09 Thread Micah Kornfield
One issue with changing it to byte is it would effectively break any reader that is validating flatbuffer data, because flatbuffers verifies null termination [1]. While this might comply with forward compatibility guarantees it seems like a pretty large blast radius. [1]
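
A rough sketch of why a verifying reader would reject such data: the check below is hypothetical and hand-written (it is not the Flatbuffers verifier API), and it assumes a little-endian unsigned 32-bit length prefix followed by the payload.

```python
import struct

def looks_like_valid_string(buf: bytes) -> bool:
    """Hypothetical string check: the byte immediately after the
    declared payload must exist and must be zero."""
    if len(buf) < 4:
        return False
    (length,) = struct.unpack_from("<I", buf, 0)
    end = 4 + length
    return len(buf) > end and buf[end] == 0

# Written as a string: length prefix, payload, null terminator.
as_string = struct.pack("<I", 3) + b"abc" + b"\x00"

# Written as a plain byte vector: no terminator follows the payload.
as_bytes = struct.pack("<I", 3) + b"abc"

print(looks_like_valid_string(as_string))
print(looks_like_valid_string(as_bytes))
```

A writer that switches to a plain byte vector and omits the terminator would fail this kind of check in readers that still treat the field as a string.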

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-16 Thread Micah Kornfield
> > It sounds like we may want to discuss some potential evolutions of the > Arrow binary protocol (for example: new Message types). This sounds like a good path forward. On Fri, Jul 9, 2021 at 8:26 AM Wes McKinney wrote: > It sounds like we may want to discuss some potential evolutions of

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-12 Thread Wes McKinney
pyarrow at least treats the KeyValue values as binary and not UTF-8. On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield wrote: > > I think other languages (e.g. java, python) might make more of distinction > between utf-8 compatible strings and raw bytes. For python it might be less > of a

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-11 Thread Micah Kornfield
I think other languages (e.g. Java, Python) might make more of a distinction between UTF-8 compatible strings and raw bytes. For Python it might be less of a concern if the C++ wrapper already makes the value field look like a bytes field. On Sunday, July 11, 2021, Wes McKinney wrote: > We could

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-11 Thread Wes McKinney
We could certainly "upgrade" KeyValue to have a binary value field everywhere KeyValue is used, but there is some risk of code in the wild expecting there to be a null terminator after the string data. The Flatbuffers-generated accessor APIs do not depend on the existence of the null terminator,

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-09 Thread Wes McKinney
The cost of an empty vector in Flatbuffers appears to be 4 bytes. On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield wrote: > > Retitling and forking the discussion to talk about key value pairs. > > What is the byte cost of an empty list? Another option would be to > introduce a new BinaryKeyValue

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-09 Thread Wes McKinney
It sounds like we may want to discuss some potential evolutions of the Arrow binary protocol (for example: new Message types). Certainly a can of worms but rather than trying to bolt some new functionality onto the existing structures, it might be better to support the new use cases through some

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread David Li
To summarize so far, it sounds like schema evolution is neither sufficient nor necessary for either Gosh or Nate's use-cases here? It could be useful for FlightSQL but even there I don't think it's a requirement. For Nate - it almost sounds like what you need is some way to slice up a record

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Nate Bauernfeind
> Flatbuffers does not support modifying structs > in any forwards or backwards compatible way > (only tables support evolution). Bah. I did not realize that. To reiterate the feature that would be ideal: I realize the specific feature I am missing is the ability to encode that a field (i.e. its

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Micah Kornfield
> > Might there be interest in adding a "field_id" to the FieldNode (which is > encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible > upgrade (by either keying off of 0, or explicitly set the field default to > -1) which would allow the sender to "skip" fields that have 1)

[DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Micah Kornfield
Retitling and forking the discussion to talk about key value pairs. What is the byte cost of an empty list? Another option would be to introduce a new BinaryKeyValue table and add binary metadata. On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind < natebauernfe...@deephaven.io> wrote: >

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Nate Bauernfeind
Deephaven and I are very supportive of "upgrading" the value half of the kv pair to a byte vector. What is the best way to find out if there is sufficient interest? I've been stewing on the ideas here around schema evolution, and I realize the specific feature I am missing is the ability to

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Wes McKinney
On Wed, Jul 7, 2021 at 2:53 PM David Li wrote: > > From the Flatbuffers internals doc[1] it appears they are the same: "Strings > are simply a vector of bytes, and are always null-terminated." I see. I took a look at flatbuffers.h, and it appears that changing this field from string to [byte]

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread David Li
From the Flatbuffers internals doc [1] it appears they are the same: "Strings are simply a vector of bytes, and are always null-terminated." [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html -David On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote: > On Tue, Jul 6, 2021 at

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-07 Thread Wes McKinney
On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield wrote: > > > > > Right, I had wanted to focus the discussion on Flight as I think schema > > evolution or multiplexing streams (more so the latter) is a property of the > > transport and not the stream format itself. If we are leaning towards just >

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-07-06 Thread Micah Kornfield
> > Right, I had wanted to focus the discussion on Flight as I think schema > evolution or multiplexing streams (more so the latter) is a property of the > transport and not the stream format itself. If we are leaning towards just > schema evolution then maybe it makes sense to discuss it for the

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-28 Thread David Li
Right, I had wanted to focus the discussion on Flight as I think schema evolution or multiplexing streams (more so the latter) is a property of the transport and not the stream format itself. If we are leaning towards just schema evolution then maybe it makes sense to discuss it for the IPC

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-27 Thread Gosh Arzumanyan
Hi guys, 1. Regarding IPC vs Flight: in fact my initial suggestion was to add this feature starting from the IPC (I moved the initial write-up steps to the bottom of the doc). Afterwards David suggested focusing on Flight and that's how we ended up with the protobufs change in the proposal. This being

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-26 Thread Nate Bauernfeind
> > > > makes it more difficult to bring schema evolution back into the > > > IPC Stream format (i.e. it would live only in flight) > > > > Gosh's proposal extends the flatbuffer structures not the protobufs. Can > > you help me understand how difficult it would be to bring the `schema_id` > >

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-26 Thread David Li
Replies inline: On Fri, Jun 25, 2021, at 20:18, Nate Bauernfeind wrote: > > makes it more difficult to bring schema evolution back into the > > IPC Stream format (i.e. it would live only in flight) > > Gosh's proposal extends the flatbuffer structures not the protobufs. Can > you help me

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-25 Thread Nate Bauernfeind
> makes it more difficult to bring schema evolution back into the > IPC Stream format (i.e. it would live only in flight) Gosh's proposal extends the flatbuffer structures not the protobufs. Can you help me understand how difficult it would be to bring the `schema_id` approach to the IPC stream

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-25 Thread Micah Kornfield
Sorry for the second reply: >1. In our case we do expect relatively frequent changes in the schema >of the batch being sent out. I don't see that pattern changing in the mid >term for a good reason. However long term maybe it will be possible to >leverage separate RPC calls. I

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-25 Thread Micah Kornfield
> > >1. Re complexity of "one schema at a time" vs "schema id based": i >think they are not much different, right? In fact the second one is more of >an optimization to the first one which is beneficial to us. Anyways even >with the first approach you need to add some logic of

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-25 Thread Gosh Arzumanyan
Hi Micah, Sure, let me do it here: 1. In our case we do expect relatively frequent changes in the schema of the batch being sent out. I don't see that pattern changing in the mid term for a good reason. However long term maybe it will be possible to leverage separate RPC calls. I

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-25 Thread Micah Kornfield
> > 1. It seems like renaming stream_id to schema_id and delegating "logical > stream" distinction to app_metadata mitigates the "multiplexing" point > while at the same time it gives enough flexibility to address both Nate's > and our use cases. I don't think this is the case. It seems that

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-25 Thread Gosh Arzumanyan
Hi guys, Thanks for sharing your insights/concerns! I also left some comments based on the discussion we had. Briefly: 1. It seems like renaming stream_id to schema_id and delegating "logical stream" distinction to app_metadata mitigates the "multiplexing" point while at the same time it gives

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread David Li
Thanks for chiming in - I've replied in the doc. Scoping it to just schema evolution would be preferable, but I'm not sure if Gosh's usecase requires more flexibility than that or not. Again, though, given that 1) gRPC recycles a connection, so repeated calls aren't necessarily expensive and

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread Nate Bauernfeind
Thanks for writing this up! I added a few general comments, but have a question on the approach because it's not quite what I was expecting. I am slightly concerned that the proposal looks more like support for "multiplexing" IPC streams into a single RPC stream rather than support for a changing

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread David Li
Ah to be clear, the API is indeed inconsistent - DoExchange was added some time later (and by its nature returning a FlightDataStream would not have been possible, since it's meant to be able to interleave reading/writing). But really, DoGet is indeed the odd one out in the C++ API and it may

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread Gosh Arzumanyan
Hi David, Got you. In fact I was looking at this more from the point of view of consistency of the API in terms of "inputs" and thought DoExchange is kind of a DoGet+ so might make sense to have the same classes being utilized in both places. But again, I might be missing something and I get the

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread David Li
It's mostly a quirk of implementation (and just for clarification, they're all nearly identical on the format/protocol level). DoGet is conceptualized as your application returning a readable stream of batches, instead of your application imperatively writing batches to the client. (This is

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-23 Thread Gosh Arzumanyan
Hi David, Going through the ArrowFlight API: I got a bit confused by the DoGet and DoPut/DoExchange APIs: the former expects a FlightDataStream, which talks in terms of already serialized messages, while the latter accept FlightMessageReader/Writer, which expect the user to pass in RecordBatches etc. Is

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-21 Thread Gosh Arzumanyan
Thanks David! I also responded/added more suggestions/questions to the doc. I think it makes sense to have two sections: one purely protocol oriented and a second API oriented (examples in C++ or in any other language should make the idea easier to digest). Thanks for the reference too! Cheers,

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-21 Thread David Li
Thanks! I've left some initial comments/suggestions to expand it in terms of the format definitions and not the C++ APIs. I'll also note something like this was proposed a long time ago - there's not very much discussion about it there but for reference:

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-21 Thread Gosh Arzumanyan
Ah sorry, comments should work now. Cheers, Gosh On Mon., 21 Jun. 2021, 14:18 David Li, wrote: > Thanks! Will give it a look. > > Would you mind opening it up for comments? > > -David > > On 2021/06/21 11:56:24, Gosh Arzumanyan wrote: > > Hi folks, > > > > Started putting some thoughts

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-21 Thread David Li
Thanks! Will give it a look. Would you mind opening it up for comments? -David On 2021/06/21 11:56:24, Gosh Arzumanyan wrote: > Hi folks, > > Started putting some thoughts together here: > https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?usp=sharing > Any

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-21 Thread Gosh Arzumanyan
Hi folks, Started putting some thoughts together here: https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?usp=sharing Any feedback is welcome! Cheers, Gosh

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-18 Thread Gosh Arzumanyan
Hi David, Thanks for poking me on this. I have been thinking it over but have not gotten to crafting a doc. Let me put together a rough proposal this weekend. Afterwards I'll need your help bringing it to a reviewable state. Cheers, Gosh On Fri., 18 Jun. 2021, 18:11 David Li, wrote: >

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-06-18 Thread David Li
Following up here - Gosh, did you get a chance to put something together? Do you need/want help on this? This would also potentially be useful for FlightSQL. (See the discussion on GitHub: https://github.com/apache/arrow/pull/9368#discussion_r572941765) Best, David On Fri, Apr 16, 2021, at

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-04-16 Thread Gosh Arzumanyan
Hi guys! Thanks for the feedback/info. Let me try to put a proposal together. Though I guess I'll need some assistance on crafting it both in terms of the structure of a proposal expected in the Arrow community as well as technical guidance. Will share a doc with some ideas shortly so that we

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-04-13 Thread Nate Bauernfeind
> possibly in coordination with the Deephaven/Barrage team, if they're also still interested Good opportunity for me to chime in =). I think we still have interest in this feature. On the other thread, it took a little cajoling, but I've come around to agree with the conclusions of taking a

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-04-13 Thread David Li
Thanks for the details. I'll note a few things, but adding schema evolution to Flight is reasonable, if you'd like to put together a proposal for discussion (possibly in coordination with the Deephaven/Barrage team, if they're also still interested). >3. Assume that there is a strong reason

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-04-13 Thread Gosh Arzumanyan
Hi David, Thanks for sharing the link! Here is what a potential use case might look like: 1. Assume that we have a service S which accepts expressions in some language X. 2. Assume that a typical query to this service requests entities A_1, A_2,..,A_K. Each of those entities

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-04-12 Thread David Li
Hi Gosh, There was indeed a discussion where schema evolution was proposed as a solution for another use case: https://lists.apache.org/thread.html/re800c63f0eb08022c8cd5e1b2236fd69a2e85afdc34daf6b75e3b7b3%40%3Cdev.arrow.apache.org%3E I am curious though, what is your use case here? Best,

[INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

2021-04-12 Thread Gosh Arzumanyan
Hi guys, hope you are well! Judging from the Flight API and from the documentation/examples out there, it seems like data schema is supposed to be fixed per stream in