[pyarrow] Table.from_pandas fails with TypeError: infer_dtype() takes no keyword arguments

2020-02-26 Thread filippo medri
Hi,
trying the example from the DataFrames section of
https://arrow.apache.org/docs/python/pandas.html with Python 2.7.17,
numpy==1.16.2, pandas==0.20.3, and pyarrow==0.16.0 or pyarrow==0.15.1,
I get the error below.

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})
table = pa.Table.from_pandas(df)

fails with the following stacktrace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1177, in pyarrow.lib.Table.from_pandas
  File "/home/ubuntu/.pyenv/versions/2.7.17/envs/py27/lib/python2.7/site-packages/pyarrow/pandas_compat.py", line 593, in dataframe_to_arrays
    types)
  File "/home/ubuntu/.pyenv/versions/2.7.17/envs/py27/lib/python2.7/site-packages/pyarrow/pandas_compat.py", line 234, in construct_metadata
    metadata = _get_simple_index_descriptor(level, name)
  File "/home/ubuntu/.pyenv/versions/2.7.17/envs/py27/lib/python2.7/site-packages/pyarrow/pandas_compat.py", line 255, in _get_simple_index_descriptor
    pandas_type = get_logical_type_from_numpy(level)
  File "/home/ubuntu/.pyenv/versions/2.7.17/envs/py27/lib/python2.7/site-packages/pyarrow/pandas_compat.py", line 113, in get_logical_type_from_numpy
    result = _pandas_api.infer_dtype(pandas_collection)
  File "pyarrow/pandas-shim.pxi", line 131, in pyarrow.lib._PandasAPIShim.infer_dtype
  File "pyarrow/pandas-shim.pxi", line 134, in pyarrow.lib._PandasAPIShim.infer_dtype
TypeError: infer_dtype() takes no keyword arguments

Is this a known bug, an incompatibility, or a misconfiguration?
Thank you very much,
Filippo Medri
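
A quick way to narrow this down is to check which versions pyarrow actually
sees and to call pandas' infer_dtype directly. Recent pandas releases accept
a skipna keyword there, which pyarrow's pandas shim relies on; the sketch
below assumes that pandas 0.20.3 predates that keyword and that upgrading
pandas resolves the error.

import pandas as pd
import pyarrow as pa

# Versions as seen by the running interpreter.
print(pd.__version__)   # 0.20.3 in the report above
print(pa.__version__)   # 0.16.0 / 0.15.1

# On a pandas version new enough for pyarrow this succeeds; on a pandas
# whose infer_dtype takes no keyword arguments it raises the same
# TypeError as in the stack trace above.
print(pd.api.types.infer_dtype(pd.Series([1, 2, 3]), skipna=True))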


[pyarrow] How to enable memory mapping in pyarrow.parquet.read_table

2020-02-27 Thread filippo medri
Hi,
experimenting with:

import pyarrow as pa
import pyarrow.parquet as pq
table = pq.read_table(source, memory_map=True)
mem_bytes = pa.total_allocated_bytes()

I have observed that mem_bytes is about the size of the parquet file on
disk.
If I remove the assignment to table and execute

pq.read_table(source, memory_map=True)
mem_bytes = pa.total_allocated_bytes()

then mem_bytes is 0.

The environment is Ubuntu 16, Python 2.7.17 and pyarrow 0.16.0 installed
with pip. The parquet file is made by storing 4 numpy arrays of doubles in
an Arrow table and then saving it to parquet with the write_table function.
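
For reference, a rough sketch of how a file like that might be produced;
the column names, array sizes and row_group_size below are made up for
illustration, and writing with an explicit row_group_size is an assumption
that pays off when reading the file back in chunks:

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Four columns of doubles with invented names and sizes.
arrays = [np.random.rand(1000000) for _ in range(4)]
table = pa.Table.from_arrays([pa.array(a) for a in arrays],
                             names=['c0', 'c1', 'c2', 'c3'])
# Splitting the file into row groups makes it possible to read it back
# one row group at a time later on.
pq.write_table(table, 'data.parquet', row_group_size=100000)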

My goal is to read the parquet file into a memory-mapped table and then
read it one record batch at a time, with:
batches = table.to_batches()
for batch in batches:
    # do something with the batch, then save it to disk

At present I am able to load a parquet file into an Arrow table, split it
into batches, add columns and then write each RecordBatch to a parquet
file, but the read_table function seems to load all the data into memory.

Is there a way to load a parquet file into a table one record batch at a
time? Or to just stream RecordBatches from a parquet file without loading
all the content into memory?

Thanks in advance,
Filippo Medri
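
One way to get close to this, sketched under the assumption that the
parquet file was written with more than one row group (for example via the
row_group_size shown above): pyarrow.parquet.ParquetFile can read a single
row group at a time, and each piece can then be split into record batches.

import pyarrow.parquet as pq

# Read the file one row group at a time instead of all at once.
# 'data.parquet' is the hypothetical file from the sketch above.
pf = pq.ParquetFile('data.parquet')
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i)   # an Arrow Table holding just this group
    for batch in chunk.to_batches():
        pass  # do something with the batch, then let it go out of scope

Only one row group is materialized at a time, so peak memory is roughly
bounded by the largest row group rather than by the whole file.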


Reading large csv file with pyarrow

2020-02-14 Thread filippo medri
Hi,
by experimenting with the arrow read_csv function to convert a CSV file
into parquet, I found that it reads all the data into memory.
The ReadOptions class does allow specifying a block_size parameter to limit
how many bytes are processed at a time, but judging from the memory usage,
my understanding is that the underlying Table still ends up holding all the
data.
Is there a way to at least specify a parameter that limits the read to a
batch of rows? I see that I can skip rows from the beginning, but I am not
finding a way to limit how many rows are read.
What is the intended way to read a CSV file that does not fit into memory?
Thanks in advance,
Filippo Medri
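
For what it is worth, here is a sketch of incremental CSV-to-Parquet
conversion. It relies on pyarrow.csv.open_csv, the streaming CSV reader,
which as far as I can tell is only available in pyarrow releases newer than
the 0.16 used here; the file names and block_size value are placeholders.

import pyarrow as pa
import pyarrow.csv as pcsv
import pyarrow.parquet as pq

# Stream the CSV in blocks so only one block is held in memory at a time.
read_opts = pcsv.ReadOptions(block_size=64 * 1024 * 1024)  # 64 MB, arbitrary
reader = pcsv.open_csv('big.csv', read_options=read_opts)  # placeholder path

writer = None
for batch in reader:  # each item is a RecordBatch covering one block
    if writer is None:
        writer = pq.ParquetWriter('big.parquet', batch.schema)
    writer.write_table(pa.Table.from_batches([batch]))
if writer is not None:
    writer.close()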