Yes, after the second call it seems to work fine. If you point me to the PRs I can have a look as well.
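
For reference, a minimal sketch of the "ignore the first error and call connect() again" workaround, using the same placeholder driver class, URL and jar path as in the snippet quoted further down (the broad except is deliberate; depending on how JPype surfaces the Java error, a more specific exception type may work):

import jaydebeapi

DRIVER = "org.apache.drill.jdbc.Driver"
URL = "jdbc:drill:drillbit=localhost"
JAR = "/path/to/drill-jdbc-all-1.17.0.jar"

def connect_with_retry():
    try:
        return jaydebeapi.connect(DRIVER, URL, [], JAR)
    except Exception as first_error:
        # The first call may raise the NoClassDefFoundError described below;
        # the second attempt then appears to succeed.
        print("First connect attempt failed (%s); retrying once" % first_error)
        return jaydebeapi.connect(DRIVER, URL, [], JAR)

drill_conn = connect_with_retry()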
Mvh,
Gareth

On Mon, 26 Oct 2020, 14:31 Charles Givre, <[email protected]> wrote:

> Hi Gareth,
> Once it is called and the error is ignored, does the connection work?
> There were a bunch of PRs to fix this issue. I was supposed to test them,
> but haven't had time.
> --C
>
> > On Oct 26, 2020, at 9:29 AM, Gareth Western <[email protected]> wrote:
> >
> > Hi Charles,
> >
> > JPype 1.1.2
> > JayDeBeApi 1.2.3
> >
> > The connection code now looks like this:
> >
> > drill_conn = jaydebeapi.connect(
> >     "org.apache.drill.jdbc.Driver",
> >     "jdbc:drill:drillbit=localhost",
> >     [],
> >     "/path/to/drill-jdbc-all-1.17.0.jar"
> > )
> >
> > The first time this is called it throws a NoClassDefFoundError:
> >
> > java.lang.NoClassDefFoundError: oadd/org/apache/drill/exec/store/StoragePluginRegistry
> >
> > I think this is related to https://github.com/jpype-project/jpype/issues/869, or is it something else?
> >
> > Mvh,
> > Gareth
> >
> > On 26/10/2020, 12:57, "Charles Givre" <[email protected]> wrote:
> >
> > Hey Gareth,
> > There were some PRs floating about due to some issues with JayDeBeApi and its
> > dependency on JPype. Do you happen to know what versions of JayDeBeApi and
> > JPype you are using? Also, would you mind posting the connection code?
> > Thanks!
> > -- C
> >
> >> On Oct 26, 2020, at 5:10 AM, Gareth Western <[email protected]> wrote:
> >>
> >> Hi Paul,
> >>
> >> What is the "partial fix" related to in the REST API? The API has worked fine
> >> for our needs, except in the case I mentioned where we would like to select
> >> 12 million records all at once. I don't think this type of query will ever
> >> work with the REST API until the API supports a streaming protocol (e.g.
> >> gRPC or RSocket), right?
> >>
> >> Regarding the cleaning, I found out that there is actually a small cleaning
> >> step when the CSV is first created, so it should be possible to use this
> >> stage to convert the data to a format such as Parquet.
> >>
> >> Regarding the immediate solution for my problem, I got the JDBC driver
> >> working with Python using the JayDeBeApi library, and can keep the memory
> >> usage down by using the fetchmany() method to stream batches of results from
> >> the server: https://gist.github.com/gdubya/a2489e4b9451720bb2be996725ce35bb
> >>
> >> Mvh,
> >> Gareth
> >>
> >> On 23/10/2020, 22:44, "Paul Rogers" <[email protected]> wrote:
> >>
> >> Hi Gareth,
> >>
> >> The REST API is handy. We do have a partial fix queued up, but it got stalled
> >> a bit because of a lack of reviewers for the tricky part of the code that is
> >> impacted. If the REST API will help your use case, perhaps you can help with
> >> a review of the fix, or try it out in your environment. You'd need a custom
> >> Drill build, but doing that is pretty easy.
> >>
> >> One other thing to keep in mind: Drill will read many kinds of "raw" data.
> >> But Drill does require that the data be clean. For CSV, that means consistent
> >> columns and consistent formatting. (A column cannot be a number in one place
> >> and a string in another. If using headers, a column cannot be called "foo" in
> >> one file and "bar" in another.) If your files are messy, it is very helpful
> >> to run an ETL step to clean up the data so you don't end up with random
> >> failed queries. Since the data is rewritten for cleaning, you might as well
> >> write the output to Parquet as Nitin suggests.
> >>
> >> - Paul
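
(For anyone following along, the fetchmany() batching mentioned above looks roughly like the sketch below; the query, path and batch size are only illustrative, and handle_row() is a placeholder for whatever processes each record. The full version is in the gist linked above.)

curs = drill_conn.cursor()
curs.execute("SELECT * FROM dfs.`/path/to/data.csv`")  # illustrative path

while True:
    batch = curs.fetchmany(10000)  # fetch a bounded number of rows per round trip
    if not batch:
        break
    for row in batch:
        handle_row(row)  # placeholder per-record handler

curs.close()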
> >>
> >> On Fri, Oct 23, 2020 at 2:54 AM Gareth Western <[email protected]> wrote:
> >>
> >>> Thanks Paul and Nitin.
> >>>
> >>> Yes, we are currently using the REST API, so I guess that caveat is the main
> >>> issue. I am experimenting with JDBC and ODBC, but haven't made a successful
> >>> connection with those from our Python apps yet (issues not related to Drill
> >>> but with the libraries I'm trying to use).
> >>>
> >>> Our use case for Drill is to expose some source data files directly with the
> >>> least amount of "preparation" possible (e.g. converting to Parquet before
> >>> working with the data). Read performance isn't a priority yet, just as long
> >>> as we can actually get to the data.
> >>>
> >>> I guess I'll port the app over to Java and try again with JDBC first.
> >>>
> >>> Kind regards,
> >>> Gareth
> >>>
> >>> On 23/10/2020, 09:08, "Paul Rogers" <[email protected]> wrote:
> >>>
> >>> Hi Gareth,
> >>>
> >>> As it turns out, SELECT * by itself should use a fixed amount of memory
> >>> regardless of table size (with two caveats). Drill, as with most query
> >>> engines, reads data in batches, then returns each batch to the client. So,
> >>> if you do SELECT * FROM yourfile.csv, the execution engine will use only
> >>> enough memory for one batch of data (which is likely to be in the tens of
> >>> MB in size).
> >>>
> >>> The first caveat is if you do a "buffering" operation, such as a sort.
> >>> SELECT * FROM yourfile.csv ORDER BY someCol will need to hold all the data.
> >>> But Drill spills to disk to relieve memory pressure.
> >>>
> >>> The other caveat is if you use the REST API to fetch data. Drill's REST API
> >>> is not scalable. It buffers all data in memory in an extremely inefficient
> >>> manner. If you use the JDBC, ODBC or native APIs, then you won't have this
> >>> problem. (There is a pending fix we can do for a future release.) Are you
> >>> using the REST API?
> >>>
> >>> Note that the above is just as true of Parquet as it is of CSV. However, as
> >>> Nitin notes, Parquet is more efficient to read.
> >>>
> >>> Thanks,
> >>>
> >>> - Paul
> >>>
> >>> On Thu, Oct 22, 2020 at 11:30 PM Nitin Pawar <[email protected]> wrote:
> >>>
> >>>> Please convert the CSV to Parquet first, and while doing so make sure you
> >>>> cast each column to the correct datatype.
> >>>>
> >>>> Once you have it in Parquet, your queries should be a bit faster.
> >>>>
> >>>> On Fri, Oct 23, 2020, 11:57 AM Gareth Western <[email protected]> wrote:
> >>>>
> >>>>> I have a very large CSV file (nearly 13 million records) stored in Azure
> >>>>> Storage and read via the Azure Storage plugin. The drillbit configuration
> >>>>> has a modest 4GB heap size. Is there an effective way to select all the
> >>>>> records from the file without running out of resources in Drill?
> >>>>>
> >>>>> SELECT * … is too big.
> >>>>>
> >>>>> SELECT * with OFFSET and LIMIT sounds like the right approach, but OFFSET
> >>>>> still requires scanning through the offset records, and this seems to hit
> >>>>> the same memory issues even with small LIMITs once the offset is large
> >>>>> enough.
> >>>>>
> >>>>> Would it help to switch the format to something other than CSV? Or move it
> >>>>> to a different storage mechanism? Or something else?
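
(And, for completeness, a rough sketch of Nitin's "convert to Parquet with explicit casts" suggestion expressed as a Drill CTAS issued over the same JDBC connection. The dfs.tmp workspace is writable by default and CTAS output defaults to Parquet, but the table name, source path and column types below are made up for the example.)

curs = drill_conn.cursor()
curs.execute("""
    CREATE TABLE dfs.tmp.`records_parquet` AS
    SELECT CAST(columns[0] AS INT)  AS id,
           CAST(columns[1] AS DATE) AS event_date,
           columns[2]               AS description
    FROM dfs.`/path/to/source.csv`
""")
curs.close()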
