GAH! It looks like it might be my problem, not pyarrow; type code S is null-terminated data:
https://numpy.org/doc/stable/reference/arrays.dtypes.html

    'S', 'a'    zero-terminated bytes (not recommended)

Now I have to figure out why I'm getting that S code (it's generated through some sort of operation via numpy).

On 2020/11/04 23:05:13, Jason Sachs <[email protected]> wrote:
> It looks like pyarrow.Table.from_pydict() cuts off binary data after an
> embedded 00 byte. Is this a known bug?
>
> (py3) C:\>python
> Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] ::
> Anaconda, Inc. on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pyarrow as pa
> >>>
> >>> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
> ...     b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
> >>> t = pa.Table.from_pydict({'data':data})
> >>> t.to_pandas()
>        data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> >>> import pandas as pd
> >>> pd.DataFrame(data)
>                   0
> 0               b''
> 1               b''
> 2               b''
> 3          b'Foo!!'
> 4          b'Bar!!'
> 5       b'\x00Baz!'
> 6  b'half\x00baked'
> 7               b''
> >>>
>
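
For anyone who lands on this thread later, here is a minimal sketch of what the S dtype does on the NumPy side, using some of the values from the session above. The extra b'tail\x00' item is my own addition for illustration and is not part of the original data. As far as I can tell, NumPy keeps embedded zero bytes (which matches the pd.DataFrame output above) and only strips trailing ones. Converting to Python bytes objects first is an assumption about a possible workaround; I have not verified how pyarrow treats the result.

```python
import numpy as np

# Values from the session above, stored as fixed-width 'S13' bytes.
# b'tail\x00' is a hypothetical extra item, added only to show trailing NULs.
data = np.array([b'', b'Foo!!', b'\x00Baz!', b'half\x00baked', b'tail\x00'],
                dtype='|S13')

# Each item is padded to 13 bytes with NULs in memory; on access NumPy
# strips the trailing padding, so a genuine trailing NUL is lost too.
as_list = data.tolist()            # plain Python bytes objects in a list
as_objects = data.astype(object)   # same values in an object-dtype array

for raw in as_list:
    print(repr(raw))
```

Either as_list or as_objects could then be handed to pa.Table.from_pydict() in place of the S13 array, assuming pyarrow infers a variable-length binary column from Python bytes objects.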
