Yes, after the second call it seems to work fine. If you point me to the PRs I can have a look as well.
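
For reference, a minimal sketch of the "ignore the first error and call connect() again" workaround, using the same placeholder driver class, URL and jar path as in the snippet quoted further down (the broad except is deliberate; depending on how JPype surfaces the Java error, a more specific exception type may work):

import jaydebeapi

DRIVER = "org.apache.drill.jdbc.Driver"
URL = "jdbc:drill:drillbit=localhost"
JAR = "/path/to/drill-jdbc-all-1.17.0.jar"

def connect_with_retry():
    try:
        return jaydebeapi.connect(DRIVER, URL, [], JAR)
    except Exception as first_error:
        # The first call may raise the NoClassDefFoundError described below;
        # the second attempt then appears to succeed.
        print("First connect attempt failed (%s); retrying once" % first_error)
        return jaydebeapi.connect(DRIVER, URL, [], JAR)

drill_conn = connect_with_retry()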
Mvh,
Gareth

On Mon, 26 Oct 2020, 14:31 Charles Givre, <[email protected]> wrote:

> Hi Gareth,
> Once it is called and the error is ignored, does the connection work?
> There were a bunch of PRs to fix this issue. I was supposed to test them,
> but haven't had time.
> --C
>
> > On Oct 26, 2020, at 9:29 AM, Gareth Western <[email protected]> wrote:
> >
> > Hi Charles,
> >
> > JPype 1.1.2
> > JayDeBeApi 1.2.3
> >
> > The connection code now looks like this:
> >
> > drill_conn = jaydebeapi.connect(
> >     "org.apache.drill.jdbc.Driver",
> >     "jdbc:drill:drillbit=localhost",
> >     [],
> >     "/path/to/drill-jdbc-all-1.17.0.jar"
> > )
> >
> > The first time this is called it throws a NoClassDefFoundError:
> >
> > java.lang.NoClassDefFoundError: oadd/org/apache/drill/exec/store/StoragePluginRegistry
> >
> > I think this is related to https://github.com/jpype-project/jpype/issues/869, or is it something else?
> >
> > Mvh,
> > Gareth
> >
> > On 26/10/2020, 12:57, "Charles Givre" <[email protected]> wrote:
> >
> > Hey Gareth,
> > There were some PRs floating about due to some issues with JayDeBeApi and its
> > dependency on JPype. Do you happen to know what versions of JayDeBeApi and
> > JPype you are using? Also, would you mind posting the connection code?
> > Thanks!
> > -- C
> >
> >> On Oct 26, 2020, at 5:10 AM, Gareth Western <[email protected]> wrote:
> >>
> >> Hi Paul,
> >>
> >> What is the "partial fix" related to in the REST API? The API has worked fine
> >> for our needs, except in the case I mentioned where we would like to select
> >> 12 million records all at once. I don't think this type of query will ever
> >> work with the REST API until the API supports a streaming protocol (e.g.
> >> gRPC or RSocket), right?
> >>
> >> Regarding the cleaning, I found out that there is actually a small cleaning
> >> step when the CSV is first created, so it should be possible to use this
> >> stage to convert the data to a format such as Parquet.
> >>
> >> Regarding the immediate solution for my problem, I got the JDBC driver
> >> working with Python using the JayDeBeApi library, and can keep the memory
> >> usage down by using the fetchmany() method to stream batches of results from
> >> the server: https://gist.github.com/gdubya/a2489e4b9451720bb2be996725ce35bb
> >>
> >> Mvh,
> >> Gareth
> >>
> >> On 23/10/2020, 22:44, "Paul Rogers" <[email protected]> wrote:
> >>
> >> Hi Gareth,
> >>
> >> The REST API is handy. We do have a partial fix queued up, but it got stalled
> >> a bit because of a lack of reviewers for the tricky part of the code that is
> >> impacted. If the REST API will help your use case, perhaps you can help with
> >> a review of the fix, or try it out in your environment. You'd need a custom
> >> Drill build, but doing that is pretty easy.
> >>
> >> One other thing to keep in mind: Drill will read many kinds of "raw" data.
> >> But Drill does require that the data be clean. For CSV, that means consistent
> >> columns and consistent formatting. (A column cannot be a number in one place
> >> and a string in another. If using headers, a column cannot be called "foo" in
> >> one file and "bar" in another.) If your files are messy, it is very helpful
> >> to run an ETL step to clean up the data so you don't end up with random
> >> failed queries. Since the data is rewritten for cleaning, you might as well
> >> write the output to Parquet as Nitin suggests.
> >>
> >> - Paul
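
(For anyone following along, the fetchmany() batching mentioned above looks roughly like the sketch below; the query, path and batch size are only illustrative, and handle_row() is a placeholder for whatever processes each record. The full version is in the gist linked above.)

curs = drill_conn.cursor()
curs.execute("SELECT * FROM dfs.`/path/to/data.csv`")  # illustrative path

while True:
    batch = curs.fetchmany(10000)  # fetch a bounded number of rows per round trip
    if not batch:
        break
    for row in batch:
        handle_row(row)  # placeholder per-record handler

curs.close()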
> >>
> >> On Fri, Oct 23, 2020 at 2:54 AM Gareth Western <[email protected]> wrote:
> >>
> >>> Thanks Paul and Nitin.
> >>>
> >>> Yes, we are currently using the REST API, so I guess that caveat is the main
> >>> issue. I am experimenting with JDBC and ODBC, but haven't made a successful
> >>> connection with those from our Python apps yet (issues not related to Drill
> >>> but with the libraries I'm trying to use).
> >>>
> >>> Our use case for Drill is to expose some source data files directly with the
> >>> least amount of "preparation" possible (e.g. converting to Parquet before
> >>> working with the data). Read performance isn't a priority yet, just as long
> >>> as we can actually get to the data.
> >>>
> >>> I guess I'll port the app over to Java and try again with JDBC first.
> >>>
> >>> Kind regards,
> >>> Gareth
> >>>
> >>> On 23/10/2020, 09:08, "Paul Rogers" <[email protected]> wrote:
> >>>
> >>> Hi Gareth,
> >>>
> >>> As it turns out, SELECT * by itself should use a fixed amount of memory
> >>> regardless of table size (with two caveats). Drill, as with most query
> >>> engines, reads data in batches, then returns each batch to the client. So,
> >>> if you do SELECT * FROM yourfile.csv, the execution engine will use only
> >>> enough memory for one batch of data (which is likely to be in the tens of
> >>> MB in size).
> >>>
> >>> The first caveat is if you do a "buffering" operation, such as a sort.
> >>> SELECT * FROM yourfile.csv ORDER BY someCol will need to hold all the data.
> >>> But Drill spills to disk to relieve memory pressure.
> >>>
> >>> The other caveat is if you use the REST API to fetch data. Drill's REST API
> >>> is not scalable. It buffers all data in memory in an extremely inefficient
> >>> manner. If you use the JDBC, ODBC or native APIs, then you won't have this
> >>> problem. (There is a pending fix we can do for a future release.) Are you
> >>> using the REST API?
> >>>
> >>> Note that the above is just as true of Parquet as it is of CSV. However, as
> >>> Nitin notes, Parquet is more efficient to read.
> >>>
> >>> Thanks,
> >>>
> >>> - Paul
> >>>
> >>> On Thu, Oct 22, 2020 at 11:30 PM Nitin Pawar <[email protected]> wrote:
> >>>
> >>>> Please convert the CSV to Parquet first, and while doing so make sure you
> >>>> cast each column to the correct datatype.
> >>>>
> >>>> Once you have it in Parquet, your queries should be a bit faster.
> >>>>
> >>>> On Fri, Oct 23, 2020, 11:57 AM Gareth Western <[email protected]> wrote:
> >>>>
> >>>>> I have a very large CSV file (nearly 13 million records) stored in Azure
> >>>>> Storage and read via the Azure Storage plugin. The drillbit configuration
> >>>>> has a modest 4GB heap size. Is there an effective way to select all the
> >>>>> records from the file without running out of resources in Drill?
> >>>>>
> >>>>> SELECT * … is too big.
> >>>>>
> >>>>> SELECT * with OFFSET and LIMIT sounds like the right approach, but OFFSET
> >>>>> still requires scanning through the offset records, and this seems to hit
> >>>>> the same memory issues even with small LIMITs once the offset is large
> >>>>> enough.
> >>>>>
> >>>>> Would it help to switch the format to something other than CSV? Or move it
> >>>>> to a different storage mechanism? Or something else?
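
(And, for completeness, a rough sketch of Nitin's "convert to Parquet with explicit casts" suggestion expressed as a Drill CTAS issued over the same JDBC connection. The dfs.tmp workspace is writable by default and CTAS output defaults to Parquet, but the table name, source path and column types below are made up for the example.)

curs = drill_conn.cursor()
curs.execute("""
    CREATE TABLE dfs.tmp.`records_parquet` AS
    SELECT CAST(columns[0] AS INT)  AS id,
           CAST(columns[1] AS DATE) AS event_date,
           columns[2]               AS description
    FROM dfs.`/path/to/source.csv`
""")
curs.close()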
