I could be wrong (there may have been many changes since I last experimented with the IPC API), but in my experience this issue happens when I have mixed up the IPC formats (the file/Feather format vs the streaming format).
I believe the key distinction is between the Arrow IPC streaming format and the IPC file format (the file format is what Feather V2 is based on). pyarrow.ipc.new_stream and open_stream are convenience constructors for pyarrow.ipc.RecordBatchStreamWriter and pyarrow.ipc.RecordBatchStreamReader, which handle the streaming format; pyarrow.ipc.new_file and open_file handle the file format. The basic difference is that the two formats have different framing: the file format adds extra structure (a footer) on top of the message stream. The main point is that whichever method you're using for writing the stream, you should double-check that you're using the associated method for reading the same type of stream. For sending over sockets I recommend using the RecordBatchStream equivalents. If you're writing to (or reading from) files on either end, then you'll need to do a bit of playing around to see which functions are ergonomic and efficient.

# ------------------------------
# Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene

On Monday, April 15th, 2024 at 13:42, Kevin Liu <[email protected]> wrote:

> What is the code used to send bytes over the wire?
> My hunch is that there's an issue from the sending side which caused the
> bytes to be smaller than expected.
>
> The example in the doc constructed a writer using a provided Schema.
>
> ```
> sink = pa.BufferOutputStream()
>
> with pa.ipc.new_stream(sink, batch.schema) as writer:
>     for i in range(5):
>         writer.write_batch(batch)
> ```
>
> On Mon, Apr 15, 2024 at 10:56 AM Amanda Weirich <[email protected]> wrote:
>
> > When I try this I receive the following error:
> >
> > Expected to be able to read 824 bytes for message body, got 384
> >
> > I'm assuming this is because the expected schema is missing?
> >
> > On Mon, Apr 15, 2024 at 1:49 PM Kevin Liu <[email protected]> wrote:
> >
> > > From the example in the Streaming, Serialization, and IPC doc, it looks
> > > like you don't need to create/open a stream with a schema; the schema can
> > > be inferred from the RecordBatchStreamReader object.
> > >
> > > ```
> > > with pa.ipc.open_stream(buf) as reader:
> > >     schema = reader.schema
> > >     batches = [b for b in reader]
> > > ```
> > >
> > > On Mon, Apr 15, 2024 at 10:35 AM Amanda Weirich <[email protected]> wrote:
> > >
> > > > Hello,
> > > > I have an incoming arrow record batch without a schema attached coming
> > > > in over a UDP port as buf.to_pybytes. We don't want to attach the schema
> > > > because the schema is already known. So in my receive script I create
> > > > my schema, and I am trying to create an arrow stream reader where I
> > > > pass in the batch and the schema, but it says only one argument can be
> > > > accepted (meaning it only expected to see the stream and not the
> > > > schema):
> > > >
> > > > # Receive data
> > > > received_data, addr = sock.recvfrom(1024)
> > > > # make byte array
> > > > byte_stream = bytearray(received_data)
> > > > # Create a BytesIO object from the received data
> > > > stream = io.BytesIO(byte_stream)
> > > >
> > > > schema = create_schema()
> > > > reader = pa.ipc.open_stream(stream, schema=schema)
> > > >
> > > > How can I create an arrow stream reader for this use case?
> > > >
> > > > Thank you in advance!
> > > >
> > > > Amanda
