Hi Gareth,
Once it is called and the error is ignored, does the connection work? There were a bunch of PRs to fix this issue. I was supposed to test them, but haven't had time.
--C
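In case it's not clear what I mean by "ignored": something along the lines of the sketch below, which just swallows the first failure and calls connect() again. The helper name and retry count are mine, not something from the PRs, so treat it as a rough test rather than a fix.

    import jaydebeapi

    DRIVER = "org.apache.drill.jdbc.Driver"
    URL = "jdbc:drill:drillbit=localhost"
    JAR = "/path/to/drill-jdbc-all-1.17.0.jar"

    def connect_with_retry(attempts=2):
        # Swallow the NoClassDefFoundError raised on the first call and try
        # again; if every attempt fails, re-raise the last error.
        last_error = None
        for _ in range(attempts):
            try:
                return jaydebeapi.connect(DRIVER, URL, [], JAR)
            except Exception as err:  # the Java error surfaces as a Python exception via JPype
                last_error = err
        raise last_error

    drill_conn = connect_with_retry()

If the second call succeeds, that would at least confirm the error only bites on the first classload.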
> On Oct 26, 2020, at 9:29 AM, Gareth Western <[email protected]> wrote:
>
> Hi Charles,
>
> JPype 1.1.2
> JayDeBeApi 1.2.3
>
> The connection code now looks like this:
>
>     drill_conn = jaydebeapi.connect(
>         "org.apache.drill.jdbc.Driver",
>         "jdbc:drill:drillbit=localhost",
>         [],
>         "/path/to/drill-jdbc-all-1.17.0.jar"
>     )
>
> The first time this is called it throws a "NoClassDefFoundError":
>
>     java.lang.NoClassDefFoundError: oadd/org/apache/drill/exec/store/StoragePluginRegistry
>
> I think this is related to https://github.com/jpype-project/jpype/issues/869,
> or is it something else?
>
> Mvh,
> Gareth
>
> On 26/10/2020, 12:57, "Charles Givre" <[email protected]> wrote:
>
> Hey Gareth,
> There were some PRs floating about due to some issues with JayDeBeApi and
> its dependency on JPype. Do you happen to know what version of JayDeBeApi
> and JPype you are using? Also, would you mind posting the connection code?
> Thanks!
> -- C
>
>> On Oct 26, 2020, at 5:10 AM, Gareth Western <[email protected]> wrote:
>>
>> Hi Paul,
>>
>> What is the "partial fix" related to in the REST API? The API has worked
>> fine for our needs, except in the case I mentioned where we would like to
>> select 12 million records all at once. I don't think this type of query
>> will ever work with the REST API until the API supports a streaming
>> protocol (e.g. gRPC or rsocket), right?
>>
>> Regarding the cleaning, I found out that there is actually a small cleaning
>> step when the CSV is first created, so it should be possible to use this
>> stage to convert the data to a format such as Parquet.
>>
>> Regarding the immediate solution for my problem, I got the JDBC driver
>> working with Python using the JayDeBeApi library, and can keep the memory
>> usage down by using the fetchmany method to stream batches of results from
>> the server: https://gist.github.com/gdubya/a2489e4b9451720bb2be996725ce35bb
>>
>> Mvh,
>> Gareth
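For anyone following along, the fetchmany() streaming pattern Gareth describes above looks roughly like the sketch below; the query, batch size, and row handling are placeholders, and his gist has the working version.

    import jaydebeapi

    conn = jaydebeapi.connect(
        "org.apache.drill.jdbc.Driver",
        "jdbc:drill:drillbit=localhost",
        [],
        "/path/to/drill-jdbc-all-1.17.0.jar"
    )
    cursor = conn.cursor()

    # Placeholder query -- point it at whatever workspace/file you are reading.
    cursor.execute("SELECT * FROM dfs.`/data/big_file.csv`")

    batch_size = 10000  # only one batch is held in Python at a time
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            pass  # process each row here

    cursor.close()
    conn.close()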
>> On 23/10/2020, 22:44, "Paul Rogers" <[email protected]> wrote:
>>
>> Hi Gareth,
>>
>> The REST API is handy. We do have a partial fix queued up, but it got
>> stalled a bit because of a lack of reviewers for the tricky part of the
>> code that is impacted. If the REST API will help your use case, perhaps
>> you can help with review of the fix, or try it out in your environment.
>> You'd need a custom Drill build, but doing that is pretty easy.
>>
>> One other thing to keep in mind: Drill will read many kinds of "raw" data.
>> But Drill does require that the data be clean. For CSV, that means
>> consistent columns and consistent formatting. (A column cannot be a number
>> in one place and a string in another. If using headers, a column cannot be
>> called "foo" in one file and "bar" in another.) If your files are messy,
>> it is very helpful to run an ETL step to clean up the data so you don't
>> end up with random failed queries. Since the data is rewritten for
>> cleaning, you might as well write the output to Parquet as Nitin suggests.
>>
>> - Paul
>>
>> On Fri, Oct 23, 2020 at 2:54 AM Gareth Western <[email protected]>
>> wrote:
>>
>>> Thanks Paul and Nitin.
>>>
>>> Yes, we are currently using the REST API, so I guess that caveat is the
>>> main issue. I am experimenting with JDBC and ODBC, but haven't made a
>>> successful connection with those from our Python apps yet (issues not
>>> related to Drill but with the libraries I'm trying to use).
>>>
>>> Our use case for Drill is using it to expose some source data files
>>> directly with the least amount of "preparation" possible (e.g. converting
>>> to Parquet before working with the data). Read performance isn't a
>>> priority yet, just as long as we can actually get to the data.
>>>
>>> I guess I'll port the app over to Java and try again with JDBC first.
>>>
>>> Kind regards,
>>> Gareth
>>>
>>> On 23/10/2020, 09:08, "Paul Rogers" <[email protected]> wrote:
>>>
>>> Hi Gareth,
>>>
>>> As it turns out, SELECT * by itself should use a fixed amount of memory
>>> regardless of table size. (With two caveats.) Drill, as with most query
>>> engines, reads data in batches, then returns each batch to the client.
>>> So, if you do SELECT * FROM yourfile.csv, the execution engine will use
>>> only enough memory for one batch of data (which is likely to be in the
>>> 10s of meg in size).
>>>
>>> The first caveat is if you do a "buffering" operation, such as a sort.
>>> SELECT * FROM yourfile.csv ORDER BY someCol will need to hold all the
>>> data. But Drill spills to disk to relieve memory pressure.
>>>
>>> The other caveat is if you use the REST API to fetch data. Drill's REST
>>> API is not scalable. It buffers all data in memory in an extremely
>>> inefficient manner. If you use the JDBC, ODBC or native APIs, then you
>>> won't have this problem. (There is a pending fix we can do for a future
>>> release.) Are you using the REST API?
>>>
>>> Note that the above is just as true of Parquet as it is of CSV. However,
>>> as Nitin notes, Parquet is more efficient to read.
>>>
>>> Thanks,
>>>
>>> - Paul
>>>
>>> On Thu, Oct 22, 2020 at 11:30 PM Nitin Pawar <[email protected]>
>>> wrote:
>>>
>>>> Please convert the CSV to Parquet first, and while doing so make sure
>>>> you cast each column to the correct datatype.
>>>>
>>>> Once you have it in Parquet, your queries should be a bit faster.
>>>>
>>>> On Fri, Oct 23, 2020, 11:57 AM Gareth Western <[email protected]>
>>>> wrote:
>>>>
>>>>> I have a very large CSV file (nearly 13 million records) stored in
>>>>> Azure Storage and read via the Azure Storage plugin. The drillbit
>>>>> configuration has a modest 4GB heap size. Is there an effective way to
>>>>> select all the records from the file without running out of resources
>>>>> in Drill?
>>>>>
>>>>> SELECT * … is too big.
>>>>>
>>>>> SELECT * with OFFSET and LIMIT sounds like the right approach, but
>>>>> OFFSET still requires scanning through the offset records, and this
>>>>> seems to hit the same memory issues even with small LIMITs once the
>>>>> offset is large enough.
>>>>>
>>>>> Would it help to switch the format to something other than CSV? Or
>>>>> move it to a different storage mechanism? Or something else?
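And since the thread keeps coming back to converting the CSV to Parquet: if staying in Python is easier, the conversion can be driven through the same JDBC connection with a CTAS statement. The sketch below is only illustrative -- the workspace, paths, column names, and types are all made up, so swap in your own schema and storage plugin.

    import jaydebeapi

    conn = jaydebeapi.connect(
        "org.apache.drill.jdbc.Driver",
        "jdbc:drill:drillbit=localhost",
        [],
        "/path/to/drill-jdbc-all-1.17.0.jar"
    )
    cursor = conn.cursor()

    # Write the new table as Parquet (the default, but be explicit).
    cursor.execute("ALTER SESSION SET `store.format` = 'parquet'")

    # CTAS with explicit casts so the Parquet files carry real types.
    # Drill exposes headerless CSV fields as the `columns` array; use the
    # actual column names instead if the file has headers.
    cursor.execute("""
        CREATE TABLE dfs.tmp.`big_file_parquet` AS
        SELECT
            CAST(columns[0] AS INT)     AS id,
            CAST(columns[1] AS VARCHAR) AS name,
            CAST(columns[2] AS DOUBLE)  AS amount
        FROM dfs.`/data/big_file.csv`
    """)

    cursor.close()
    conn.close()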
