Hi Gareth,
Once it is called and the error is ignored, does the connection work? There were a bunch of PRs to fix this issue. I was supposed to test them, but haven't had time.
--C
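In case it's not clear what I mean by "ignored": something along the lines of the sketch below, which just swallows the first failure and calls connect() again. The helper name and retry count are mine, not something from the PRs, so treat it as a rough test rather than a fix.

    import jaydebeapi

    DRIVER = "org.apache.drill.jdbc.Driver"
    URL = "jdbc:drill:drillbit=localhost"
    JAR = "/path/to/drill-jdbc-all-1.17.0.jar"

    def connect_with_retry(attempts=2):
        # Swallow the NoClassDefFoundError raised on the first call and try
        # again; if every attempt fails, re-raise the last error.
        last_error = None
        for _ in range(attempts):
            try:
                return jaydebeapi.connect(DRIVER, URL, [], JAR)
            except Exception as err:  # the Java error surfaces as a Python exception via JPype
                last_error = err
        raise last_error

    drill_conn = connect_with_retry()

If the second call succeeds, that would at least confirm the error only bites on the first classload.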
> On Oct 26, 2020, at 9:29 AM, Gareth Western <[email protected]> wrote:
>
> Hi Charles,
>
> JPype 1.1.2
> JayDeBeApi 1.2.3
>
> The connection code now looks like this:
>
>     drill_conn = jaydebeapi.connect(
>         "org.apache.drill.jdbc.Driver",
>         "jdbc:drill:drillbit=localhost",
>         [],
>         "/path/to/drill-jdbc-all-1.17.0.jar"
>     )
>
> The first time this is called it throws a "NoClassDefFoundError":
>
>     java.lang.NoClassDefFoundError: oadd/org/apache/drill/exec/store/StoragePluginRegistry
>
> I think this is related to https://github.com/jpype-project/jpype/issues/869,
> or is it something else?
>
> Mvh,
> Gareth
>
> On 26/10/2020, 12:57, "Charles Givre" <[email protected]> wrote:
>
> Hey Gareth,
> There were some PRs floating about due to some issues with JayDeBeApi and
> its dependency on JPype. Do you happen to know what version of JayDeBeApi
> and JPype you are using? Also, would you mind posting the connection code?
> Thanks!
> -- C
>
>> On Oct 26, 2020, at 5:10 AM, Gareth Western <[email protected]> wrote:
>>
>> Hi Paul,
>>
>> What is the "partial fix" related to in the REST API? The API has worked
>> fine for our needs, except in the case I mentioned where we would like to
>> select 12 million records all at once. I don't think this type of query
>> will ever work with the REST API until the API supports a streaming
>> protocol (e.g. gRPC or rsocket), right?
>>
>> Regarding the cleaning, I found out that there is actually a small cleaning
>> step when the CSV is first created, so it should be possible to use this
>> stage to convert the data to a format such as Parquet.
>>
>> Regarding the immediate solution for my problem, I got the JDBC driver
>> working with Python using the JayDeBeApi library, and can keep the memory
>> usage down by using the fetchmany method to stream batches of results from
>> the server: https://gist.github.com/gdubya/a2489e4b9451720bb2be996725ce35bb
>>
>> Mvh,
>> Gareth
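For anyone following along, the fetchmany() streaming pattern Gareth describes above looks roughly like the sketch below; the query, batch size, and row handling are placeholders, and his gist has the working version.

    import jaydebeapi

    conn = jaydebeapi.connect(
        "org.apache.drill.jdbc.Driver",
        "jdbc:drill:drillbit=localhost",
        [],
        "/path/to/drill-jdbc-all-1.17.0.jar"
    )
    cursor = conn.cursor()

    # Placeholder query -- point it at whatever workspace/file you are reading.
    cursor.execute("SELECT * FROM dfs.`/data/big_file.csv`")

    batch_size = 10000  # only one batch is held in Python at a time
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            pass  # process each row here

    cursor.close()
    conn.close()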
>> On 23/10/2020, 22:44, "Paul Rogers" <[email protected]> wrote:
>>
>> Hi Gareth,
>>
>> The REST API is handy. We do have a partial fix queued up, but it got
>> stalled a bit because of a lack of reviewers for the tricky part of the
>> code that is impacted. If the REST API will help your use case, perhaps
>> you can help with review of the fix, or try it out in your environment.
>> You'd need a custom Drill build, but doing that is pretty easy.
>>
>> One other thing to keep in mind: Drill will read many kinds of "raw" data.
>> But Drill does require that the data be clean. For CSV, that means
>> consistent columns and consistent formatting. (A column cannot be a number
>> in one place and a string in another. If using headers, a column cannot be
>> called "foo" in one file and "bar" in another.) If your files are messy,
>> it is very helpful to run an ETL step to clean up the data so you don't
>> end up with random failed queries. Since the data is rewritten for
>> cleaning, you might as well write the output to Parquet as Nitin suggests.
>>
>> - Paul
>>
>> On Fri, Oct 23, 2020 at 2:54 AM Gareth Western <[email protected]>
>> wrote:
>>
>>> Thanks Paul and Nitin.
>>>
>>> Yes, we are currently using the REST API, so I guess that caveat is the
>>> main issue. I am experimenting with JDBC and ODBC, but haven't made a
>>> successful connection with those from our Python apps yet (issues not
>>> related to Drill but with the libraries I'm trying to use).
>>>
>>> Our use case for Drill is using it to expose some source data files
>>> directly with the least amount of "preparation" possible (e.g. converting
>>> to Parquet before working with the data). Read performance isn't a
>>> priority yet, just as long as we can actually get to the data.
>>>
>>> I guess I'll port the app over to Java and try again with JDBC first.
>>>
>>> Kind regards,
>>> Gareth
>>>
>>> On 23/10/2020, 09:08, "Paul Rogers" <[email protected]> wrote:
>>>
>>> Hi Gareth,
>>>
>>> As it turns out, SELECT * by itself should use a fixed amount of memory
>>> regardless of table size. (With two caveats.) Drill, as with most query
>>> engines, reads data in batches, then returns each batch to the client.
>>> So, if you do SELECT * FROM yourfile.csv, the execution engine will use
>>> only enough memory for one batch of data (which is likely to be in the
>>> 10s of meg in size).
>>>
>>> The first caveat is if you do a "buffering" operation, such as a sort.
>>> SELECT * FROM yourfile.csv ORDER BY someCol will need to hold all the
>>> data. But Drill spills to disk to relieve memory pressure.
>>>
>>> The other caveat is if you use the REST API to fetch data. Drill's REST
>>> API is not scalable. It buffers all data in memory in an extremely
>>> inefficient manner. If you use the JDBC, ODBC or native APIs, then you
>>> won't have this problem. (There is a pending fix we can do for a future
>>> release.) Are you using the REST API?
>>>
>>> Note that the above is just as true of Parquet as it is of CSV. However,
>>> as Nitin notes, Parquet is more efficient to read.
>>>
>>> Thanks,
>>>
>>> - Paul
>>>
>>> On Thu, Oct 22, 2020 at 11:30 PM Nitin Pawar <[email protected]>
>>> wrote:
>>>
>>>> Please convert the CSV to Parquet first, and while doing so make sure
>>>> you cast each column to the correct datatype.
>>>>
>>>> Once you have it in Parquet, your queries should be a bit faster.
>>>>
>>>> On Fri, Oct 23, 2020, 11:57 AM Gareth Western <[email protected]>
>>>> wrote:
>>>>
>>>>> I have a very large CSV file (nearly 13 million records) stored in
>>>>> Azure Storage and read via the Azure Storage plugin. The drillbit
>>>>> configuration has a modest 4GB heap size. Is there an effective way to
>>>>> select all the records from the file without running out of resources
>>>>> in Drill?
>>>>>
>>>>> SELECT * … is too big.
>>>>>
>>>>> SELECT * with OFFSET and LIMIT sounds like the right approach, but
>>>>> OFFSET still requires scanning through the offset records, and this
>>>>> seems to hit the same memory issues even with small LIMITs once the
>>>>> offset is large enough.
>>>>>
>>>>> Would it help to switch the format to something other than CSV? Or
>>>>> move it to a different storage mechanism? Or something else?
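And since the thread keeps coming back to converting the CSV to Parquet: if staying in Python is easier, the conversion can be driven through the same JDBC connection with a CTAS statement. The sketch below is only illustrative -- the workspace, paths, column names, and types are all made up, so swap in your own schema and storage plugin.

    import jaydebeapi

    conn = jaydebeapi.connect(
        "org.apache.drill.jdbc.Driver",
        "jdbc:drill:drillbit=localhost",
        [],
        "/path/to/drill-jdbc-all-1.17.0.jar"
    )
    cursor = conn.cursor()

    # Write the new table as Parquet (the default, but be explicit).
    cursor.execute("ALTER SESSION SET `store.format` = 'parquet'")

    # CTAS with explicit casts so the Parquet files carry real types.
    # Drill exposes headerless CSV fields as the `columns` array; use the
    # actual column names instead if the file has headers.
    cursor.execute("""
        CREATE TABLE dfs.tmp.`big_file_parquet` AS
        SELECT
            CAST(columns[0] AS INT)     AS id,
            CAST(columns[1] AS VARCHAR) AS name,
            CAST(columns[2] AS DOUBLE)  AS amount
        FROM dfs.`/data/big_file.csv`
    """)

    cursor.close()
    conn.close()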
