Hey Matt,

It's not supported out of the box, but I think you're on the right track. That
said, I'm curious about your use case: why do you have pre-serialized bytes? Is
this to avoid using the Arrow reader at some point?

Descriptor can indeed be ignored here. app_metadata is optional.

The first message in a stream should be an IPC schema message. It should then 
be followed by DictionaryBatch messages, then RecordBatch messages. All of 
these follow the "encapsulated message format" [1] where the metadata 
flatbuffer goes in metadata and the message body goes in body_buffers. (The 
continuation token is omitted, but not the length, IIRC.)

So for schema, you would have the IPC schema flatbuffer in "metadata" and no
body. For RecordBatch/DictionaryBatch, you would have the IPC record
batch/dictionary batch flatbuffer in "metadata" and the data in "body_buffers".
This means that you may need to parse your pre-serialized bytes (to some extent)
to separate the two.
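
As a rough sketch of that parsing step, here is some plain standard C++ (no
Arrow calls) that walks a pre-serialized stream and records where each
message's metadata and body start. Frame, ReadLE, BodyLengthOf and
SplitIpcStream are names made up for illustration. It assumes a little-endian
host, and that bodyLength is the fourth vtable slot of the Message flatbuffer
(after version, header_type and header). In real code you would probably rely
on Arrow's own IPC message reader or the generated flatbuffers accessors
rather than parsing by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// One encapsulated message, split into the parts IpcPayload wants.
// Pointers alias the input stream; nothing is copied.
struct Frame {
  const uint8_t* metadata;  // Message flatbuffer; its 4-byte length sits just before it
  int32_t metadata_size;    // as written in the stream (includes padding)
  const uint8_t* body;      // message body
  int64_t body_length;      // Message.bodyLength
};

// Read a fixed-width integer from unaligned memory (little-endian host assumed).
template <typename T>
T ReadLE(const uint8_t* p) {
  assert(p != nullptr);
  T v;
  std::memcpy(&v, p, sizeof(T));
  return v;
}

// Pull Message.bodyLength out of the flatbuffer by hand.
// Assumption: bodyLength is vtable slot 3, i.e. byte offset 4 + 2 * 3 in the vtable.
int64_t BodyLengthOf(const uint8_t* fb, int64_t size) {
  if (size < 8) throw std::runtime_error("metadata too short");
  const int64_t table = ReadLE<uint32_t>(fb);  // offset of the root table
  if (table + 4 > size) throw std::runtime_error("bad root offset");
  const int64_t vtable = table - ReadLE<int32_t>(fb + table);
  if (vtable < 0 || vtable + 4 > size) throw std::runtime_error("bad vtable offset");
  const int64_t vtable_size = ReadLE<uint16_t>(fb + vtable);
  if (vtable + vtable_size > size) throw std::runtime_error("bad vtable size");
  const int64_t slot = 4 + 2 * 3;
  if (vtable_size < slot + 2) return 0;  // field absent: flatbuffers default is 0
  const int64_t field = ReadLE<uint16_t>(fb + vtable + slot);
  if (field == 0) return 0;  // field absent: flatbuffers default is 0
  if (table + field + 8 > size) throw std::runtime_error("bad field offset");
  return ReadLE<int64_t>(fb + table + field);
}

// Split a serialized IPC stream into (metadata, body) frames.
std::vector<Frame> SplitIpcStream(const uint8_t* data, size_t size) {
  std::vector<Frame> frames;
  size_t pos = 0;
  while (size - pos >= 4) {
    uint32_t word = ReadLE<uint32_t>(data + pos);
    pos += 4;
    if (word == 0xFFFFFFFFu) {  // continuation marker: the length follows
      if (size - pos < 4) throw std::runtime_error("truncated length prefix");
      word = ReadLE<uint32_t>(data + pos);
      pos += 4;
    }
    if (word == 0) break;  // end-of-stream marker
    if (word > 0x7FFFFFFFu || word > size - pos)
      throw std::runtime_error("truncated metadata");
    Frame f;
    f.metadata = data + pos;
    f.metadata_size = static_cast<int32_t>(word);
    f.body_length = BodyLengthOf(f.metadata, f.metadata_size);
    pos += word;
    if (f.body_length < 0 || static_cast<uint64_t>(f.body_length) > size - pos)
      throw std::runtime_error("truncated body");
    f.body = data + pos;
    pos += static_cast<size_t>(f.body_length);
    frames.push_back(f);
  }
  return frames;
}
```

Each Frame then maps onto one payload: frame.metadata would back "metadata",
and frame.body (when body_length is non-zero) would back the body. A schema
message should come out with body_length == 0.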

There are multiple body buffers so that we don't require concatenation. For 
instance, a RecordBatch in memory might be backed by multiple allocations. Only 
taking one body buffer would mean that we would have to concatenate everything 
before sending, which defeats the zero-copy goal. But if what you have is truly 
pre-serialized, then you can pass just the one buffer.
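
To illustrate the difference, here is a small sketch using stand-in types
rather than the real Arrow classes: Buffer, MessageType and IpcPayload below
only mirror the definitions quoted further down, and MakeSingleBufferPayload
and MakeMultiBufferPayload are hypothetical helpers. It also assumes that
each piece is already padded the way the IPC format expects, so that
body_length is simply the sum of the buffer sizes.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Stand-ins for the Arrow types quoted below; these are NOT the real classes.
using Buffer = std::vector<uint8_t>;
enum class MessageType { NONE, SCHEMA, DICTIONARY_BATCH, RECORD_BATCH };

struct IpcPayload {
  MessageType type = MessageType::NONE;
  std::shared_ptr<Buffer> metadata;
  std::vector<std::shared_ptr<Buffer>> body_buffers;
  int64_t body_length = 0;
};

// Pre-serialized case: the whole body is already one contiguous buffer,
// so only the first slot of body_buffers is used. A schema message has no body.
IpcPayload MakeSingleBufferPayload(MessageType type, std::shared_ptr<Buffer> metadata,
                                   std::shared_ptr<Buffer> body) {
  IpcPayload p;
  p.type = type;
  p.metadata = std::move(metadata);
  if (body && !body->empty()) {
    p.body_length = static_cast<int64_t>(body->size());
    p.body_buffers.push_back(std::move(body));
  }
  return p;
}

// In-memory case: the batch is backed by several allocations. Each one is
// referenced as-is, so no bytes are copied or concatenated.
// Assumption: every piece is already padded as the IPC format requires.
IpcPayload MakeMultiBufferPayload(MessageType type, std::shared_ptr<Buffer> metadata,
                                  const std::vector<std::shared_ptr<Buffer>>& pieces) {
  IpcPayload p;
  p.type = type;
  p.metadata = std::move(metadata);
  for (const auto& piece : pieces) {
    assert(piece != nullptr);
    p.body_buffers.push_back(piece);  // shares ownership, no byte copy
    p.body_length += static_cast<int64_t>(piece->size());
  }
  return p;
}
```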

[1]: https://arrow.apache.org/docs/format/Columnar.html#encapsulated-message-format

-David

On Mon, Jan 10, 2022, at 02:32, Matt Youill wrote:
> Hi,
> 
> Have been hacking on this for a while, but wanted to make sure I'm on 
> the right track.
> 
> Is it possible to supply a pre-serialized IPC stream of data from a 
> Flight server's DoGet function? It looks like a table *object* (or 
> schema + record batches) can be supplied to the FlightDataStream 
> parameter (using a RecordBatchStream) but not plain bytes.
> 
> I've had a look at implementing a FlightDataStream for plain bytes. I 
> can see the byte stream needs to be split up into FlightPayloads, but 
> it's not clear what goes where in each one.
> 
> Given the following defs for FlightPayloads...
> 
> struct ARROW_FLIGHT_EXPORT FlightPayload {
>   std::shared_ptr<Buffer> descriptor;
>   std::shared_ptr<Buffer> app_metadata;
>   ipc::IpcPayload ipc_message;
> };
> 
> struct IpcPayload {
>   MessageType type = MessageType::NONE;
>   std::shared_ptr<Buffer> metadata;
>   std::vector<std::shared_ptr<Buffer>> body_buffers;
>   int64_t body_length = 0;
> };
> 
> AFAICT it looks like:
> 
> "descriptor" is ignored for DoGet
> 
> "app_metadata" can be ignored
> 
> "type" is set to whatever message it is (e.g. schema, batch, etc)
> 
> "metadata" buffer should contain a schema IPC bytes?
> 
> "body_buffers" should contain IPC bytes for each batch? (Why are there 
> multiple buffers? Is there any reason not to just use the first slot of 
> the buffers vector?)
> 
> Any advice appreciated.
> 
> Thanks, Matt
> 
> 
