The IPC API seems to work for the most part; however, is there a way to specify the compression level with IpcWriteOptions? It doesn't seem to be exposed, and since I'm currently using zstd, I'm not sure what level it defaults to otherwise. Additionally, should I be enabling the allow_64bit bool? I have nanosecond timestamps which would be truncated if this option acts the way I think it does.
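On the compression-level side, this is roughly what I was hoping to be able to write (just a sketch, assuming a pyarrow version where IpcWriteOptions.compression accepts a pyarrow.Codec rather than only a codec name; the level and output path below are placeholders):

```python3
import pyarrow as pa
import pyarrow.ipc

table = pa.table({'a': [1, 2, 3]})  # placeholder table

# Assumption: compression can be given as a pyarrow.Codec, which carries
# an explicit compression_level, instead of just the string "zstd".
codec = pa.Codec('zstd', compression_level=9)  # 9 is an arbitrary level
options = pa.ipc.IpcWriteOptions(compression=codec)

writer = pa.ipc.RecordBatchFileWriter('/tmp/foo_zstd.feather',
                                      schema=table.schema, options=options)
writer.write_table(table)
writer.close()
```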
``` """ Serialization options for the IPC format. Parameters ---------- metadata_version : MetadataVersion, default MetadataVersion.V5 The metadata version to write. V5 is the current and latest, V4 is the pre-1.0 metadata version (with incompatible Union layout). allow_64bit: bool, default False If true, allow field lengths that don't fit in a signed 32-bit int. use_legacy_format : bool, default False Whether to use the pre-Arrow 0.15 IPC format. compression: str or None If not None, compression codec to use for record batch buffers. May only be "lz4", "zstd" or None. use_threads: bool Whether to use the global CPU thread pool to parallelize any computational tasks like compression. emit_dictionary_deltas: bool Whether to emit dictionary deltas. Default is false for maximum stream compatibility. """ On Tue, Jul 13, 2021 at 2:41 PM Weston Pace <[email protected]> wrote: > I can't speak to the intent. Adding a feather.write_table version > (equivalent to feather.read_table) seems like it would be reasonable. > > > Is the best way around this to do the following? > > What you have written does not work for me. This slightly different > version does: > > ```python3 > import pyarrow as pa > import pyarrow._feather as _feather > > table = pa.Table.from_pandas(df) > _feather.write_feather(table, '/tmp/foo.feather', > compression=compression, > compression_level=compression_level, > chunksize=chunksize, version=version) > ``` > > I'm not sure it's a great practice to be relying on pyarrow._feather > though as it is meant to be internal and subject to change without > much consideration. > > You might want to consider using the newer IPC API which should be > equivalent (write_feather is indirectly using a RecordBatchFileWriter > under the hood although it is buried in the C++[1]). A complete > example: > > ```python3 > import pandas as pd > import pyarrow as pa > import pyarrow.ipc > > df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']}) > compression = None > > options = pyarrow.ipc.IpcWriteOptions() > options.compression = compression > writer = pyarrow.ipc.RecordBatchFileWriter('/tmp/foo2.feather', > schema=table.schema, options=options) > writer.write_table(table) > writer.close() > ``` > > If you need chunks it is slightly more work: > > ```python3 > options = pyarrow.ipc.IpcWriteOptions() > options.compression = compression > writer = pyarrow.ipc.RecordBatchFileWriter('/tmp/foo3.feather', > schema=table.schema, options=options) > batches = table.to_batches(chunksize) > for batch in batches: > writer.write_batch(batch) > writer.close() > ``` > > All three versions should be readable by pyarrow.feather.read_feather > and should yield the exact same dataframe. > > [1] > https://github.com/apache/arrow/blob/81ff679c47754692224f655dab32cc0936bb5f55/cpp/src/arrow/ipc/feather.cc#L796 > > On Tue, Jul 13, 2021 at 7:06 AM Arun Joseph <[email protected]> wrote: > > > > Hi, > > > > I've noticed that if I pass a pandas dataframe to write_feather > (hyperlink to relevant part of code), it will automatically drop the index. > Was this behavior intentionally chosen to only drop the index and not to > allow the user to specify? I assumed the behavior would match the default > behavior of converting from a pandas dataframe to an arrow table as > mentioned in the docs. > > > > Is the best way around this to do the following? 
> >
> > ```python3
> > import pyarrow.lib as ext
> > from pyarrow.lib import Table
> >
> > table = Table.from_pandas(df)
> > ext.write_feather(table, dest,
> >                   compression=compression,
> >                   compression_level=compression_level,
> >                   chunksize=chunksize, version=version)
> > ```
> >
> > Thank You,
> >
> > --
> > Arun Joseph

--
Arun Joseph
