Re: What is the most memory-efficient technique for selecting several million records from a CSV file

Gareth Western Fri, 23 Oct 2020 02:54:48 -0700

Thanks Paul and Nitin.

Yes, we are currently using the REST API, so I guess that caveat is the main 
issue. I am experimenting with JDBC and ODBC, but haven't made a successfully 
connection with those from our Python apps yet (issues not related to Drill but 
with the libraries I'm trying to use).


Our use case for Drill is using it to expose some source data files directly 
with the least amount of "preparation" possible (e.g. converting to Parquet 
before working with the data). Read performance isn't a priority yet just as 
long as we can actually get to the data.

I guess I'll port the app over to Java and try again with JDBC first.

Kind regards,
Gareth

On 23/10/2020, 09:08, "Paul Rogers" <[email protected]> wrote:

    Hi Gareth,

    As it turns out, SELECT * by itself should use a fixed amount of memory
    regardless of table size. (With two caveats.) Drill, as with most query
    engines, reads data in batches, then returns each batch to the client. So,
    if you do SELECT * FROM yourfile.csv, the execution engine will use only
    enough memory for one batch of data (which is likely to be in the 10s of
    meg in size.)

    The first caveat is if you do a "buffering" operation, such as a sort.
    SELECT * FROM yourfile.csv ORDER BY someCol will need to hold all data.
    But, Drill spills to disk to relieve memory pressure.

    The other caveat is if you use the REST API to fetch data. Drill's REST API
    is not scalable. It buffers all data in memory in an extremely inefficient
    manner. If you use the JDBC, ODBC or native APIs, then you won't have this
    problem. (There is a pending fix we can do for a future release.) Are you
    using the REST API?

    Note that the above is just as true of Parquet as it is with CSV. However,
    as Nitin notes, Parquet is more efficient to read.

    Thanks,

    - Paul


    On Thu, Oct 22, 2020 at 11:30 PM Nitin Pawar <[email protected]>
    wrote:

    > Please convert CSV to parquet first and while doing so make sure you cast
    > each column to correct datatype
    >
    > once you have in paraquet, your queries should be bit faster.
    >
    > On Fri, Oct 23, 2020, 11:57 AM Gareth Western <[email protected]>
    > wrote:
    >
    > > I have a very large CSV file (nearly 13 million records) stored in Azure
    > > Storage and read via the Azure Storage plugin. The drillbit 
configuration
    > > has a modest 4GB heap size. Is there an effective way to select all the
    > > records from the file without running out of resources in Drill?
    > >
    > > SELECT * … is too big
    > >
    > > SELECT * with OFFSET and LIMIT sounds like the right approach, but 
OFFSET
    > > still requires scanning through the offset records, and this seems to 
hit
    > > the same memory issues even with small LIMITs once the offset is large
    > > enough.
    > >
    > > Would it help to switch the format to something other than CSV? Or move
    > it
    > > to a different storage mechanism? Or something else?
    > >
    >

Re: What is the most memory-efficient technique for selecting several million records from a CSV file

Reply via email to