Re: Schemaless serialization

2020-02-17 Thread Antoine Pitrou
Hi Tewfik, It would be good to step back a bit and explain what your data is, and what the consumer is going to do with it. Regards Antoine. On Fri, 14 Feb 2020 15:08:57 -0800 Tewfik Zeghmi wrote: > Hi Micah, > > The primary language is Python. I'm hoping that the small overhead of >

Re: Schemaless serialization

2020-02-17 Thread Wes McKinney
Hi Micah and Tewfik, The functionality is exposed in Python, see e.g. https://github.com/apache/arrow/blob/apache-arrow-0.16.0/python/pyarrow/tests/test_ipc.py#L685 As Micah said, very small batches aren't necessarily optimized for compactness (for example, buffers are padded to multiples of 8).
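To make the padding remark concrete, here is a minimal sketch of the arithmetic, assuming each buffer is rounded up to the next multiple of 8 bytes as described above. The helper name `padded_size` and the single-row column layout are hypothetical, chosen only for illustration; this is not Arrow's own code.

```python
def padded_size(nbytes: int, alignment: int = 8) -> int:
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


# Hypothetical single-row int32 column: a 1-byte validity bitmap
# plus a 4-byte data buffer, each padded separately.
raw_bytes = 1 + 4
padded_bytes = padded_size(1) + padded_size(4)

print(f"raw: {raw_bytes} bytes, padded: {padded_bytes} bytes")
```

For a single row, the padding alone can exceed the payload, which is one reason a one-row batch compares poorly with a compact row format.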

Re: Schemaless serialization

2020-02-16 Thread Micah Kornfield
I should note, it isn't necessarily just the extra metadata. For single row values, there is also an overhead for padding requirements. You should be able to measure this by looking at the size of the buffer you are using before writing any batches to the stream (I believe the schema is written e

Re: Schemaless serialization

2020-02-14 Thread Tewfik Zeghmi
Hi Micah, The primary language is Python. I'm hoping that the overhead of metadata is small compared to the schema information. Thank you! On Fri, Feb 14, 2020 at 3:07 PM Micah Kornfield wrote: > Hi Tewfik, > What language? It is possible to serialize them separately but the right

Re: Schemaless serialization

2020-02-14 Thread Micah Kornfield
Hi Tewfik, What language? It is possible to serialize them separately, but the right hooks might not be exposed in all languages. There is still going to be a higher overhead for single-row values in Arrow compared to Avro due to metadata requirements. Thanks, Micah On Fri, Feb 14, 2020 at 1:33

Schemaless serialization

2020-02-14 Thread Tewfik Zeghmi
Hi, I have a use case of creating a feature store to serve low-latency traffic. Given a key, we need the ability to save and read a feature vector in a low-latency key-value store. Serializing an Arrow table with one row takes 1344 bytes, while the same single row serialized with Avro without