Re: Reading large csv file with pyarrow

2020-02-18 Thread Daniel Nugent
Exposing streaming CSV reads would be useful independent of the datasets API for ETL processes.

On Feb 18, 2020, 03:25 -0500, Wes McKinney wrote:
> Yes, that looks right. There will need to be corresponding work in
> Python to make this available (probably through the datasets API) …
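One shape such a streaming ETL pass can take, as a rough sketch: pair a streaming CSV reader with an incremental Parquet writer so that only one record batch is resident at a time. This assumes a pyarrow release that includes pyarrow.csv.open_csv (not yet available when this thread was written) and uses placeholder file names.

```python
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

reader = csv.open_csv("big_input.csv")                 # streaming reader over record batches
writer = pq.ParquetWriter("big_output.parquet", reader.schema)
try:
    for batch in reader:
        # Append each batch as it is read, so only one chunk
        # of the CSV is held in memory at a time.
        writer.write_table(pa.Table.from_batches([batch]))
finally:
    writer.close()
```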

Re: Reading large csv file with pyarrow

2020-02-18 Thread Wes McKinney
Yes, that looks right. There will need to be corresponding work in Python to make this available (probably through the datasets API).

On Mon, Feb 17, 2020 at 12:35 PM Daniel Nugent wrote:
> Arrow-3410 maybe?
> On Feb 17, 2020, 07:47 -0500, Wes McKinney wrote:
> > I seem to recall discussions …
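The datasets API did eventually gain CSV support in Python. A rough sketch of incremental scanning through it, assuming a pyarrow version where pyarrow.dataset accepts format="csv" and using a placeholder directory path:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("csv_dir/", format="csv")  # directory of CSV files (placeholder)
total_rows = 0
for batch in dataset.to_batches():              # batches are produced incrementally
    total_rows += batch.num_rows
print(total_rows)
```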

Re: Reading large csv file with pyarrow

2020-02-17 Thread Daniel Nugent
Arrow-3410 maybe?

On Feb 17, 2020, 07:47 -0500, Wes McKinney wrote:
> I seem to recall discussions about chunk-at-a-time reading of CSV
> files. Such an API is not yet available in Python. This is also
> required for the C++ Datasets API. If there are not one or more JIRA
> issues about this, I suggest that we open some to capture the use cases.

Re: Reading large csv file with pyarrow

2020-02-17 Thread Wes McKinney
I seem to recall discussions about chunk-at-a-time reading of CSV files. Such an API is not yet available in Python. This is also required for the C++ Datasets API. If there are not one or more JIRA issues about this, I suggest that we open some to capture the use cases.
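For reference, a chunk-at-a-time reader did later land in Python as pyarrow.csv.open_csv, which returns a streaming reader. A minimal sketch, assuming a pyarrow version that includes it and a placeholder file name:

```python
import pyarrow.csv as csv

reader = csv.open_csv("large.csv")         # streaming reader, not a fully loaded Table
while True:
    try:
        batch = reader.read_next_batch()   # one RecordBatch per CSV block
    except StopIteration:
        break
    # process each chunk independently, e.g. inspect batch.num_rows
```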

Reading large csv file with pyarrow

2020-02-14 Thread filippo medri
Hi, by experimenting with Arrow's read_csv function to convert a CSV file into Parquet, I found that it reads the data into memory. The ReadOptions class allows specifying a block_size parameter to limit how many bytes are processed at a time, but from looking at the memory usage my understanding is that …
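A minimal sketch of the pattern being described, with a placeholder file name: block_size bounds how many bytes the parser handles per chunk, but read_csv still returns a fully materialized Table before anything can be written out.

```python
import pyarrow.csv as csv
import pyarrow.parquet as pq

# block_size limits the bytes parsed per chunk, but the result of
# read_csv is still the entire table held in memory.
read_opts = csv.ReadOptions(block_size=1 << 20)  # ~1 MiB blocks
table = csv.read_csv("input.csv", read_options=read_opts)
pq.write_table(table, "output.parquet")
```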