And boom! With just three extra lines of code adjusting the CBO's row-count estimate to be inversely proportional to the number of pushed predicates, my little PoC works :-)
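For illustration, a minimal sketch of that kind of adjustment, assuming a hypothetical MyDbGroupScan that keeps its pushed-down filters in a "predicates" list. This shows the shape of the change, not Andy's actual patch; ScanStats is org.apache.drill.exec.physical.base.ScanStats:

    // Inside the hypothetical MyDbGroupScan; "predicates" holds the
    // filters already pushed into this scan.
    @Override
    public ScanStats getScanStats() {
      long estimatedRows = 1_000_000L;  // no real statistics available
      // Shrink the estimate as predicates are pushed so that Calcite
      // sees a strictly lower cost for the pushed-down alternative.
      long rows = estimatedRows / (predicates.size() + 1);
      // NO_EXACT_ROW_COUNT marks this as an estimate, not a true count.
      return new ScanStats(ScanStats.GroupScanProperty.NO_EXACT_ROW_COUNT,
          rows, 1, rows);
    }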
Now that I've achieved the instant gratification (relatively speaking!) of making something work, I think it's time to step back and start doing this the right way with the PR you mentioned. I would not have been able to get this working at all without all the fantastic support!

Thanks,

Andy.

On Tue, Jan 14, 2020 at 11:43 PM Paul Rogers <[email protected]> wrote:

Hi Andy,

Congratulations on making such fast progress!

The code to do filter push-downs is rather complex and, it seems, most plugins copy/paste the same wad of code (with the same bugs). PR 1914 provides a layer that converts the messy Drill logical plan into a nice, simple set of predicates. You can then pick and choose which to push down, allowing the framework to do the rest.

Note that most of the plugins do push-down as part of physical planning. While this works in most cases, it WILL NOT work if you are doing push-down in order to shard the scan, for example, to divide a time range into pieces for a time-series scan. The PR thus does push-down in the logical phase so that we can "do the right thing."

When you say that getNewWithChildren() is called on an earlier instance, it is very likely because Calcite gave up on your filter-push-down version because there was no cost reduction.

The Wiki page mentioned earlier explains all the copies a bit. Basically, Drill creates many copies of your GroupScan as it proceeds: first a "blank" one, then another with projected columns, then another full copy as Calcite explores planning options, and so on.

One key trick is that if you implement filter push-down, you MUST return a lower cost estimate after the push-down than before. Otherwise, Calcite decides that it is not worth the hassle of doing the push-down if the costs remain the same. See the Wiki for details. This is what getScanStats() does: report stats that must get lower as you improve the scan.

That is, one cost at the start, a lower cost after projection push-down (reflecting the fact that we presumably now read less data per row), and a lower cost again after filter push-down (because we read fewer rows). There is a "Dummy" storage plugin in PR 1914 that illustrates all of this.

Don't worry about getDigest(); it is just Calcite trying to get a label to use for its internal objects. You will need to implement toString() using Drill's "EXPLAIN PLAN" format, so your scan can appear in the text plan output. EXPLAIN PLAN output is:

    ClassName [field1=x, field2=y]

There is a little builder in PR 1914 to do this for you.

Thanks,
- Paul
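To make Paul's points concrete, here is a rough sketch of that plumbing with hypothetical names (MyDbGroupScan and its predicates field are invented for illustration; PlanStringBuilder is the little builder from PR 1914). The key point is that every copy must carry the pushed-down predicates along, or they vanish from later planning stages:

    import java.util.List;
    import org.apache.drill.common.PlanStringBuilder;
    import org.apache.drill.common.expression.SchemaPath;
    import org.apache.drill.exec.physical.base.AbstractGroupScan;
    import org.apache.drill.exec.physical.base.PhysicalOperator;

    // Hypothetical plugin class; only the copy-related methods are shown,
    // and the other required GroupScan methods are omitted for brevity.
    public class MyDbGroupScan extends AbstractGroupScan {
      private final List<SchemaPath> columns;
      private final List<String> predicates;  // filters already pushed down

      // Copy constructor: preserves the predicates across the many
      // copies Drill makes during planning.
      private MyDbGroupScan(MyDbGroupScan that) {
        super(that);
        this.columns = that.columns;
        this.predicates = that.predicates;
      }

      @Override
      public PhysicalOperator getNewWithChildren(List<PhysicalOperator> children) {
        // A scan is a leaf operator: ignore children, return a full copy.
        return new MyDbGroupScan(this);
      }

      @Override
      public String toString() {
        // Renders in EXPLAIN PLAN as: MyDbGroupScan [columns=..., predicates=...]
        return new PlanStringBuilder(this)
            .field("columns", columns)
            .field("predicates", predicates)
            .toString();
      }
    }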
On Tuesday, January 14, 2020, 7:07:58 PM PST, Andy Grove <[email protected]> wrote:

With some extra debugging I can see that the getNewWithChildren call is made to an earlier instance of GroupScan and not the instance created by the filter push-down rule. I'm wondering if this is some kind of hashCode/equals/toString/getDigest issue?

On Tue, Jan 14, 2020 at 7:52 PM Andy Grove <[email protected]> wrote:

I'm now working on predicate push-down. I have a filter rule that is correctly extracting the predicates that the backend database supports, and I am creating a new GroupScan containing these predicates, using the Kafka plugin as a reference. I see the GroupScan constructor being called after this, with the predicates populated. So far so good ... but then I see calls to getDigest, getScanStats, and getNewWithChildren, and then I see calls to the GroupScan constructor with the predicates missing.

Any pointers on what I might be missing? Is there more magic I need to know?

Thanks!

On Sun, Jan 12, 2020 at 5:34 PM Paul Rogers <[email protected]> wrote:

Hi Andy,

Congrats! You are making good progress. Yes, the BatchCreator is a bit of magic: Drill looks for a subclass that has your SubScan subclass as the second parameter. Looks like you figured that out.

Thanks,
- Paul

On Sunday, January 12, 2020, 1:45:16 PM PST, Andy Grove <[email protected]> wrote:

Actually, I managed to get past that error with an educated guess that if I created a BatchCreator class, it would automagically be picked up somehow. I'm now at the point where my RecordReader is being invoked!

On Sun, Jan 12, 2020 at 2:03 PM Andy Grove <[email protected]> wrote:

Between reading the tutorial and copying and pasting code from the Kudu storage plugin, I've been making reasonable progress with this, but am confused by one error I'm now hitting:

    ExecutionSetupException: Failure finding OperatorCreator constructor for config com.mydb.MyDbSubScan

Prior to this, Drill had called getSpecificScan and then called a few of the methods on my subscan object. I wasn't sure what to return for getOperatorType, so I just returned the Kudu subscan operator type, and I'm wondering if the issue is related to that somehow?

Thanks.
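The discovery Paul describes looks roughly like the sketch below, modeled on the Kudu plugin's batch creator; MyDbSubScan and MyDbRecordReader are placeholder names. Drill scans the classpath for a BatchCreator whose getBatch() takes your SubScan subclass as its second parameter, which is also what resolves the "Failure finding OperatorCreator constructor" error above:

    import java.util.Collections;
    import java.util.List;
    import org.apache.drill.common.exceptions.ExecutionSetupException;
    import org.apache.drill.exec.ops.ExecutorFragmentContext;
    import org.apache.drill.exec.physical.impl.BatchCreator;
    import org.apache.drill.exec.physical.impl.ScanBatch;
    import org.apache.drill.exec.record.RecordBatch;
    import org.apache.drill.exec.store.RecordReader;

    // Found via classpath scanning; no explicit registration is needed.
    public class MyDbScanBatchCreator implements BatchCreator<MyDbSubScan> {

      @Override
      public ScanBatch getBatch(ExecutorFragmentContext context,
          MyDbSubScan subScan, List<RecordBatch> children)
          throws ExecutionSetupException {
        // A single reader here; a real plugin may create one per shard.
        List<RecordReader> readers = Collections.singletonList(
            new MyDbRecordReader(subScan));
        return new ScanBatch(subScan, context, readers);
      }
    }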
On Sat, Jan 11, 2020 at 10:13 PM Andy Grove <[email protected]> wrote:

Thank you both for those responses. This is very helpful. I have ordered a copy of the book too. I'm using Drill 1.17.0.

I'll take a look at the JDBC storage plugin code and see if it would be feasible to add the logic I need there. In parallel, I've started implementing a new storage plugin. I'll be working on this more tomorrow and I'm sure I'll be back with more questions soon.

Thanks again for your help!

Andy.

On Sat, Jan 11, 2020 at 6:03 PM Charles Givre <[email protected]> wrote:

Hi Andy,
Thanks for your interest in Drill. I'm glad to see that Paul wrote you back as well. I was going to say that I thought the JDBC storage plugin did in fact push down columns and filters to the source system.

Also, what version of Drill are you using?

Writing a storage plugin for Drill is not trivial, and I'd definitely recommend using the code from Paul's PR as that greatly simplifies things. Here is a tutorial as well:
https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin

If you need additional help, please let us know.
-- C

On Jan 11, 2020, at 5:57 PM, Andy Grove <[email protected]> wrote:

Hi,

I'd like to use Apache Drill with a custom data source that supports a subset of SQL.

My goal is to have Drill push selection and predicates down to my data source, but the rest of the query processing should take place in Drill.

I started out by writing a JDBC driver for the data source and registering that with Drill using the JDBC storage plugin, but it seems to just pass the whole query through to my data source, so that approach isn't going to work unless I'm missing something?

Is there any way to configure the JDBC storage plugin to only push certain parts of the query to the data source?

If this isn't a good approach, do I need to write a custom storage plugin? Can these be added on the classpath, or would that require me maintaining a fork of the project?

I appreciate any pointers anyone can give me.

Thanks,

Andy.
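As a closing illustration of the custom-plugin route the thread converges on: the filter rule Andy describes typically follows the shape of the Kafka plugin's push-down rule. A skeleton under that assumption (class and type names here are hypothetical; Paul's PR 1914 hides most of this behind its framework and does the work in the logical phase instead):

    import org.apache.calcite.plan.RelOptRuleCall;
    import org.apache.calcite.rex.RexNode;
    import org.apache.drill.exec.planner.logical.RelOptHelper;
    import org.apache.drill.exec.planner.physical.FilterPrel;
    import org.apache.drill.exec.planner.physical.ScanPrel;
    import org.apache.drill.exec.store.StoragePluginOptimizerRule;

    public class MyDbPushFilterIntoScan extends StoragePluginOptimizerRule {

      public static final MyDbPushFilterIntoScan INSTANCE =
          new MyDbPushFilterIntoScan();

      private MyDbPushFilterIntoScan() {
        super(RelOptHelper.some(FilterPrel.class, RelOptHelper.any(ScanPrel.class)),
            "MyDbPushFilterIntoScan:Filter_On_Scan");
      }

      @Override
      public boolean matches(RelOptRuleCall call) {
        // Fire only for scans that belong to this plugin.
        final ScanPrel scan = call.rel(1);
        return scan.getGroupScan() instanceof MyDbGroupScan;
      }

      @Override
      public void onMatch(RelOptRuleCall call) {
        final FilterPrel filter = call.rel(0);
        final ScanPrel scan = call.rel(1);
        final RexNode condition = filter.getCondition();
        // 1. Translate "condition" into predicates the backend supports.
        // 2. Build a new MyDbGroupScan carrying those predicates; its
        //    getScanStats() must now report a lower cost, or Calcite
        //    discards this alternative (Paul's point above).
        // 3. call.transformTo(...) with the new scan, keeping the filter
        //    on top for any predicates that could not be pushed.
      }
    }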
