Re: Reading large csv file with pyarrow

2020-02-18 Thread Daniel Nugent
Exposing streaming CSV reads would be useful for ETL processes, independently
of the datasets API (see the sketch after this message).
On Feb 18, 2020, 03:25 -0500, Wes McKinney wrote:
> Yes, that looks right. There will need to be corresponding work in
> Python to make this available (probably through the datasets API)
>
> On Mon, Feb 17, 2020 at 12:35 PM Daniel Nugent wrote:
> >
> > ARROW-3410, maybe?
> > On Feb 17, 2020, 07:47 -0500, Wes McKinney wrote:
> >
> > I seem to recall discussions about chunk-at-a-time reading of CSV
> > files. Such an API is not yet available in Python. This is also
> > required for the C++ Datasets API. If there are not already one or more
> > JIRA issues about this, I suggest we open some to capture the use cases.
> >
> > On Fri, Feb 14, 2020 at 3:16 PM filippo medri wrote:
> >
> >
> > Hi,
> > while experimenting with Arrow's read_csv function to convert a CSV file
> > into Parquet, I found that it reads all of the data into memory.
> > The ReadOptions class does allow specifying a block_size parameter to limit
> > how many bytes are processed at a time, but judging from the memory usage,
> > the underlying Table still ends up holding all of the data.
> > Is there at least a way to limit the read to a batch of rows? I see that I
> > can skip rows at the beginning, but I am not finding a way to limit how
> > many rows are read.
> > What is the intended way to read a CSV file that does not fit into memory?
> > Thanks in advance,
> > Filippo Medri
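
For reference, a streaming read of the kind described at the top of this
message could look roughly like the sketch below. It assumes a pyarrow
release that exposes pyarrow.csv.open_csv and its record-batch reader, an
API that was not yet available at the time of this thread; the file names
and block size are placeholders.

    # Sketch only: incremental CSV-to-Parquet conversion via a streaming reader.
    # Assumes a pyarrow release that ships pyarrow.csv.open_csv; block_size
    # bounds how much of the file is parsed into each record batch.
    import pyarrow as pa
    import pyarrow.csv as pacsv
    import pyarrow.parquet as pq

    read_options = pacsv.ReadOptions(block_size=64 * 1024 * 1024)  # ~64 MB per batch
    reader = pacsv.open_csv("large_input.csv", read_options=read_options)

    writer = None
    try:
        for batch in reader:  # yields RecordBatch objects, not one big Table
            if writer is None:
                writer = pq.ParquetWriter("large_output.parquet", batch.schema)
            writer.write_table(pa.Table.from_batches([batch]))
    finally:
        if writer is not None:
            writer.close()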


Re: Reading large csv file with pyarrow

2020-02-18 Thread Wes McKinney
Yes, that looks right. There will need to be corresponding work in
Python to make this available (probably through the datasets API; a
sketch of that route follows this message).

On Mon, Feb 17, 2020 at 12:35 PM Daniel Nugent wrote:
>
> ARROW-3410, maybe?
> On Feb 17, 2020, 07:47 -0500, Wes McKinney wrote:
>
> I seem to recall discussions about chunk-at-a-time reading of CSV
> files. Such an API is not yet available in Python. This is also
> required for the C++ Datasets API. If there are not already one or more
> JIRA issues about this, I suggest we open some to capture the use cases.
>
> On Fri, Feb 14, 2020 at 3:16 PM filippo medri wrote:
>
>
> Hi,
> while experimenting with Arrow's read_csv function to convert a CSV file into
> Parquet, I found that it reads all of the data into memory.
> The ReadOptions class does allow specifying a block_size parameter to limit
> how many bytes are processed at a time, but judging from the memory usage,
> the underlying Table still ends up holding all of the data.
> Is there at least a way to limit the read to a batch of rows? I see that I can
> skip rows at the beginning, but I am not finding a way to limit how many rows
> are read.
> What is the intended way to read a CSV file that does not fit into memory?
> Thanks in advance,
> Filippo Medri
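
As a rough illustration of the datasets route mentioned above: in later
pyarrow releases, pyarrow.dataset can treat a CSV file as a dataset and scan
it batch by batch, so memory use is bounded by the batch size rather than the
file size. The sketch below assumes such a release; the file name is a
placeholder.

    # Sketch only: batch-wise scanning of a CSV file through the datasets API.
    # Assumes a pyarrow release where pyarrow.dataset accepts format="csv".
    import pyarrow.dataset as ds

    dataset = ds.dataset("large_input.csv", format="csv")

    # to_batches() yields RecordBatches lazily instead of building one big Table,
    # so memory use is bounded by the batch size rather than the file size.
    total_rows = 0
    for batch in dataset.to_batches():
        total_rows += batch.num_rows
    print(total_rows)

Under the same assumption, the dataset object can also be handed to
pyarrow.dataset.write_dataset to write Parquet without first materializing
the whole table.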


Re: Reading large csv file with pyarrow

2020-02-17 Thread Daniel Nugent
ARROW-3410, maybe?
On Feb 17, 2020, 07:47 -0500, Wes McKinney wrote:
> I seem to recall discussions about chunk-at-a-time reading of CSV
> files. Such an API is not yet available in Python. This is also
> required for the C++ Datasets API. If there are not already one or more
> JIRA issues about this, I suggest we open some to capture the use cases.
>
> On Fri, Feb 14, 2020 at 3:16 PM filippo medri wrote:
> >
> > Hi,
> > while experimenting with Arrow's read_csv function to convert a CSV file
> > into Parquet, I found that it reads all of the data into memory.
> > The ReadOptions class does allow specifying a block_size parameter to limit
> > how many bytes are processed at a time, but judging from the memory usage,
> > the underlying Table still ends up holding all of the data.
> > Is there at least a way to limit the read to a batch of rows? I see that I
> > can skip rows at the beginning, but I am not finding a way to limit how
> > many rows are read.
> > What is the intended way to read a CSV file that does not fit into memory?
> > Thanks in advance,
> > Filippo Medri


Re: Reading large csv file with pyarrow

2020-02-17 Thread Wes McKinney
I seem to recall discussions about chunk-at-a-time reading of CSV
files. Such an API is not yet available in Python. This is also
required for the C++ Datasets API. If there are not already one or more
JIRA issues about this, I suggest we open some to capture the use cases.

On Fri, Feb 14, 2020 at 3:16 PM filippo medri wrote:
>
> Hi,
> while experimenting with Arrow's read_csv function to convert a CSV file into
> Parquet, I found that it reads all of the data into memory.
> The ReadOptions class does allow specifying a block_size parameter to limit
> how many bytes are processed at a time, but judging from the memory usage,
> the underlying Table still ends up holding all of the data.
> Is there at least a way to limit the read to a batch of rows? I see that I can
> skip rows at the beginning, but I am not finding a way to limit how many rows
> are read.
> What is the intended way to read a CSV file that does not fit into memory?
> Thanks in advance,
> Filippo Medri
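
The behaviour described in the quoted question can be seen directly:
block_size only changes how the file is split into parse chunks, so the
resulting Table has chunked columns but is still fully materialized. A
minimal sketch, with a placeholder file name:

    # Sketch only: block_size changes how the file is split into parse chunks,
    # not how much of it ends up in memory.
    import pyarrow.csv as pacsv

    read_options = pacsv.ReadOptions(block_size=1 * 1024 * 1024)  # ~1 MB parse blocks
    table = pacsv.read_csv("large_input.csv", read_options=read_options)

    # The columns come back as ChunkedArrays, roughly one chunk per parsed block,
    # but every chunk is already materialized at this point.
    print(table.num_rows, table.column(0).num_chunks)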


Reading large csv file with pyarrow

2020-02-14 Thread filippo medri
Hi,
while experimenting with Arrow's read_csv function to convert a CSV file into
Parquet, I found that it reads all of the data into memory.
The ReadOptions class does allow specifying a block_size parameter to limit
how many bytes are processed at a time, but judging from the memory usage,
the underlying Table still ends up holding all of the data.
Is there at least a way to limit the read to a batch of rows? I see that I can
skip rows at the beginning, but I am not finding a way to limit how many rows
are read.
What is the intended way to read a CSV file that does not fit into memory?
Thanks in advance,
Filippo Medri
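
One workaround that was possible with the libraries available at the time,
offered here as a sketch rather than as the thread's recommendation, is to
convert the file chunk by chunk: read it with pandas' chunked CSV reader and
append each chunk to a single Parquet file with pyarrow.parquet.ParquetWriter,
so only one chunk is in memory at a time. File names and the chunk size are
placeholders.

    # Sketch only: chunk-by-chunk CSV-to-Parquet conversion with pandas and
    # pyarrow, keeping a single chunk in memory at a time.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    try:
        for chunk in pd.read_csv("large_input.csv", chunksize=1_000_000):
            table = pa.Table.from_pandas(chunk, preserve_index=False)
            if writer is None:
                # The first chunk's schema is reused for the whole file; with messy
                # data you may need to pin column dtypes so later chunks match it.
                writer = pq.ParquetWriter("large_output.parquet", table.schema)
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()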