[pyarrow] Table.from_pandas fails with TypeError: infer_dtype() takes no keyword arguments
Hi,

trying the DataFrames example from https://arrow.apache.org/docs/python/pandas.html with Python 2.7.17, numpy==1.16.2, pandas==0.20.3, and pyarrow==0.16.0 (also tried pyarrow==0.15.1):

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"a": [1, 2, 3]})
    table = pa.Table.from_pandas(df)

fails with the following stack trace:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "pyarrow/table.pxi", line 1177, in pyarrow.lib.Table.from_pandas
      File "/home/ubuntu/.pyenv/versions/2.7.17/envs/py27/lib/python2.7/site-packages/pyarrow/pandas_compat.py", line 593, in dataframe_to_arrays
        types)
      File "/home/ubuntu/.pyenv/versions/2.7.17/envs/py27/lib/python2.7/site-packages/pyarrow/pandas_compat.py", line 234, in construct_metadata
        metadata = _get_simple_index_descriptor(level, name)
      File "/home/ubuntu/.pyenv/versions/2.7.17/envs/py27/lib/python2.7/site-packages/pyarrow/pandas_compat.py", line 255, in _get_simple_index_descriptor
        pandas_type = get_logical_type_from_numpy(level)
      File "/home/ubuntu/.pyenv/versions/2.7.17/envs/py27/lib/python2.7/site-packages/pyarrow/pandas_compat.py", line 113, in get_logical_type_from_numpy
        result = _pandas_api.infer_dtype(pandas_collection)
      File "pyarrow/pandas-shim.pxi", line 131, in pyarrow.lib._PandasAPIShim.infer_dtype
      File "pyarrow/pandas-shim.pxi", line 134, in pyarrow.lib._PandasAPIShim.infer_dtype
    TypeError: infer_dtype() takes no keyword arguments

Is this a known bug, an incompatibility between these versions, or a misconfiguration on my side?

Thank you very much,
Filippo Medri
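For what it's worth, the traceback suggests pyarrow's pandas shim is calling pandas' infer_dtype with a keyword argument that this pandas version does not accept. A standalone call that I assume reproduces the same failure (the skipna keyword is my guess from reading the traceback, not a quote of pyarrow's source):

    import pandas.api.types as pdt

    # On pandas 0.20.3 infer_dtype is a C-level function that only accepts
    # positional arguments, so passing any keyword raises the TypeError above.
    pdt.infer_dtype([1, 2, 3], skipna=False)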
[pyarrow] How to enable memory mapping in pyarrow.parquet.read_table
Hi,

experimenting with:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pq.read_table(source, memory_map=True)
    mem_bytes = pa.total_allocated_bytes()

I have observed that mem_bytes is about the size of the parquet file on disk. If I remove the assignment and just execute

    pq.read_table(source, memory_map=True)
    mem_bytes = pa.total_allocated_bytes()

then mem_bytes is 0.

The environment is Ubuntu 16, Python 2.7.17, pyarrow 0.16.0 installed with pip. The parquet file was made by saving 4 numpy arrays of doubles to an arrow table and then writing it to parquet with the write_table function.

My goal is to read the parquet file into a memory-mapped table and then process it one record batch at a time, with:

    batches = table.to_batches()
    for batch in batches:
        # do something with the batch, then save it to disk
        ...

At present I am able to load a parquet file into an arrow table, split it into batches, add columns, and write each RecordBatch to a parquet file, but the read_table function seems to load all the data into memory. Is there a way to load a parquet file into a table one record batch at a time? Or to stream RecordBatches from a parquet file without loading the whole content into memory?

Thanks in advance,
Filippo Medri
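For concreteness, this is roughly the flow I have working today; the file names and the added column are only placeholders:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Works, but pq.read_table appears to materialize the whole file in memory.
    table = pq.read_table("input.parquet", memory_map=True)  # placeholder path

    for i, batch in enumerate(table.to_batches()):
        # add a derived column to the batch (name and values are placeholders)
        arrays = batch.columns + [pa.array(range(batch.num_rows))]
        names = batch.schema.names + ["row_id"]
        new_batch = pa.RecordBatch.from_arrays(arrays, names)
        # write each enriched batch to its own parquet file
        pq.write_table(pa.Table.from_batches([new_batch]),
                       "output_%d.parquet" % i)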
Reading a large CSV file with pyarrow
Hi,

while experimenting with arrow's read_csv function to convert CSV files into parquet, I found that it reads all the data into memory. The ReadOptions class does allow a block_size parameter to limit how many bytes are processed at a time, but judging by memory usage my understanding is that the underlying Table is still filled with all of the data. Is there a way to at least pass a parameter that limits the read to a batch of rows? I see that I can skip rows from the beginning, but I cannot find a way to limit how many rows are read. What is the intended way to read a CSV file that does not fit into memory?

Thanks in advance,
Filippo Medri
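For reference, this is what I tried with the block size (the path and the block size value are placeholders); the call succeeds, but the Table that comes back still seems to hold the entire file:

    import pyarrow as pa
    import pyarrow.csv as pcsv

    # block_size limits how many bytes are parsed per block, but the result
    # is still a single, fully materialized Table.
    # ReadOptions also has skip_rows, but I see nothing to cap the row count.
    opts = pcsv.ReadOptions(block_size=1 << 20)  # 1 MiB per block (arbitrary)
    table = pcsv.read_csv("big.csv", read_options=opts)  # placeholder path
    print(pa.total_allocated_bytes())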