And boom! With just three extra lines of code adjusting the CBO's row-count estimate to be inversely proportional to the number of pushed predicates, my little PoC works :-)
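For illustration, a minimal sketch of that kind of adjustment, assuming a hypothetical MyDbGroupScan that keeps its pushed-down filters in a "predicates" list. This shows the shape of the change, not Andy's actual patch; ScanStats is org.apache.drill.exec.physical.base.ScanStats:

    // Inside the hypothetical MyDbGroupScan; "predicates" holds the
    // filters already pushed into this scan.
    @Override
    public ScanStats getScanStats() {
      long estimatedRows = 1_000_000L;  // no real statistics available
      // Shrink the estimate as predicates are pushed so that Calcite
      // sees a strictly lower cost for the pushed-down alternative.
      long rows = estimatedRows / (predicates.size() + 1);
      // NO_EXACT_ROW_COUNT marks this as an estimate, not a true count.
      return new ScanStats(ScanStats.GroupScanProperty.NO_EXACT_ROW_COUNT,
          rows, 1, rows);
    }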
Now that I've achieved the instant gratification (relatively speaking!) of making something work, I think it's time to step back and start doing this the right way with the PR you mentioned. I would not have been able to get this working at all without all the fantastic support!

Thanks,

Andy.

On Tue, Jan 14, 2020 at 11:43 PM Paul Rogers <[email protected]> wrote:

Hi Andy,

Congratulations on making such fast progress!

The code to do filter push-downs is rather complex and, it seems, most plugins copy/paste the same wad of code (with the same bugs). PR 1914 provides a layer that converts the messy Drill logical plan into a nice, simple set of predicates. You can then pick and choose which to push down, allowing the framework to do the rest.

Note that most of the plugins do push-down as part of physical planning. While this works in most cases, it WILL NOT work if you are doing push-down in order to shard the scan, for example, to divide a time range into pieces for a time-series scan. The PR thus does push-down in the logical phase so that we can "do the right thing."

When you say that getNewWithChildren() is called on an earlier instance, it is very likely because Calcite gave up on your filter-push-down version because there was no cost reduction.

The Wiki page mentioned earlier explains all the copies a bit. Basically, Drill creates many copies of your GroupScan as it proceeds: first a "blank" one, then another with projected columns, then another full copy as Calcite explores planning options, and so on.

One key trick is that if you implement filter push-down, you MUST return a lower cost estimate after the push-down than before. Otherwise, Calcite decides that it is not worth the hassle of doing the push-down if the costs remain the same. See the Wiki for details. This is what getScanStats() does: report stats that must get lower as you improve the scan.

That is, one cost at the start, a lower cost after projection push-down (reflecting the fact that we presumably now read less data per row), and a lower cost again after filter push-down (because we read fewer rows). There is a "Dummy" storage plugin in PR 1914 that illustrates all of this.

Don't worry about getDigest(); it is just Calcite trying to get a label to use for its internal objects. You will need to implement toString() using Drill's "EXPLAIN PLAN" format, so your scan can appear in the text plan output. EXPLAIN PLAN output is:

    ClassName [field1=x, field2=y]

There is a little builder in PR 1914 to do this for you.

Thanks,
- Paul
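To make Paul's points concrete, here is a rough sketch of that plumbing with hypothetical names (MyDbGroupScan and its predicates field are invented for illustration; PlanStringBuilder is the little builder from PR 1914). The key point is that every copy must carry the pushed-down predicates along, or they vanish from later planning stages:

    import java.util.List;
    import org.apache.drill.common.PlanStringBuilder;
    import org.apache.drill.common.expression.SchemaPath;
    import org.apache.drill.exec.physical.base.AbstractGroupScan;
    import org.apache.drill.exec.physical.base.PhysicalOperator;

    // Hypothetical plugin class; only the copy-related methods are shown,
    // and the other required GroupScan methods are omitted for brevity.
    public class MyDbGroupScan extends AbstractGroupScan {
      private final List<SchemaPath> columns;
      private final List<String> predicates;  // filters already pushed down

      // Copy constructor: preserves the predicates across the many
      // copies Drill makes during planning.
      private MyDbGroupScan(MyDbGroupScan that) {
        super(that);
        this.columns = that.columns;
        this.predicates = that.predicates;
      }

      @Override
      public PhysicalOperator getNewWithChildren(List<PhysicalOperator> children) {
        // A scan is a leaf operator: ignore children, return a full copy.
        return new MyDbGroupScan(this);
      }

      @Override
      public String toString() {
        // Renders in EXPLAIN PLAN as: MyDbGroupScan [columns=..., predicates=...]
        return new PlanStringBuilder(this)
            .field("columns", columns)
            .field("predicates", predicates)
            .toString();
      }
    }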
On Tuesday, January 14, 2020, 7:07:58 PM PST, Andy Grove <[email protected]> wrote:

With some extra debugging I can see that the getNewWithChildren call is made to an earlier instance of GroupScan and not the instance created by the filter push-down rule. I'm wondering if this is some kind of hashCode/equals/toString/getDigest issue?

On Tue, Jan 14, 2020 at 7:52 PM Andy Grove <[email protected]> wrote:

I'm now working on predicate push-down. I have a filter rule that is correctly extracting the predicates that the backend database supports, and I am creating a new GroupScan containing these predicates, using the Kafka plugin as a reference. I see the GroupScan constructor being called after this, with the predicates populated. So far so good ... but then I see calls to getDigest, getScanStats, and getNewWithChildren, and then I see calls to the GroupScan constructor with the predicates missing.

Any pointers on what I might be missing? Is there more magic I need to know?

Thanks!

On Sun, Jan 12, 2020 at 5:34 PM Paul Rogers <[email protected]> wrote:

Hi Andy,

Congrats! You are making good progress. Yes, the BatchCreator is a bit of magic: Drill looks for a subclass that has your SubScan subclass as the second parameter. Looks like you figured that out.

Thanks,
- Paul

On Sunday, January 12, 2020, 1:45:16 PM PST, Andy Grove <[email protected]> wrote:

Actually, I managed to get past that error with an educated guess that if I created a BatchCreator class, it would automagically be picked up somehow. I'm now at the point where my RecordReader is being invoked!

On Sun, Jan 12, 2020 at 2:03 PM Andy Grove <[email protected]> wrote:

Between reading the tutorial and copying and pasting code from the Kudu storage plugin, I've been making reasonable progress with this, but am confused by one error I'm now hitting:

    ExecutionSetupException: Failure finding OperatorCreator constructor for config com.mydb.MyDbSubScan

Prior to this, Drill had called getSpecificScan and then called a few of the methods on my subscan object. I wasn't sure what to return for getOperatorType, so I just returned the Kudu subscan operator type, and I'm wondering if the issue is related to that somehow?

Thanks.
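The discovery Paul describes looks roughly like the sketch below, modeled on the Kudu plugin's batch creator; MyDbSubScan and MyDbRecordReader are placeholder names. Drill scans the classpath for a BatchCreator whose getBatch() takes your SubScan subclass as its second parameter, which is also what resolves the "Failure finding OperatorCreator constructor" error above:

    import java.util.Collections;
    import java.util.List;
    import org.apache.drill.common.exceptions.ExecutionSetupException;
    import org.apache.drill.exec.ops.ExecutorFragmentContext;
    import org.apache.drill.exec.physical.impl.BatchCreator;
    import org.apache.drill.exec.physical.impl.ScanBatch;
    import org.apache.drill.exec.record.RecordBatch;
    import org.apache.drill.exec.store.RecordReader;

    // Found via classpath scanning; no explicit registration is needed.
    public class MyDbScanBatchCreator implements BatchCreator<MyDbSubScan> {

      @Override
      public ScanBatch getBatch(ExecutorFragmentContext context,
          MyDbSubScan subScan, List<RecordBatch> children)
          throws ExecutionSetupException {
        // A single reader here; a real plugin may create one per shard.
        List<RecordReader> readers = Collections.singletonList(
            new MyDbRecordReader(subScan));
        return new ScanBatch(subScan, context, readers);
      }
    }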
On Sat, Jan 11, 2020 at 10:13 PM Andy Grove <[email protected]> wrote:

Thank you both for those responses. This is very helpful. I have ordered a copy of the book too. I'm using Drill 1.17.0.

I'll take a look at the JDBC storage plugin code and see if it would be feasible to add the logic I need there. In parallel, I've started implementing a new storage plugin. I'll be working on this more tomorrow and I'm sure I'll be back with more questions soon.

Thanks again for your help!

Andy.

On Sat, Jan 11, 2020 at 6:03 PM Charles Givre <[email protected]> wrote:

Hi Andy,
Thanks for your interest in Drill. I'm glad to see that Paul wrote you back as well. I was going to say that I thought the JDBC storage plugin did in fact push down columns and filters to the source system.

Also, what version of Drill are you using?

Writing a storage plugin for Drill is not trivial, and I'd definitely recommend using the code from Paul's PR as that greatly simplifies things. Here is a tutorial as well:
https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin

If you need additional help, please let us know.
-- C

On Jan 11, 2020, at 5:57 PM, Andy Grove <[email protected]> wrote:

Hi,

I'd like to use Apache Drill with a custom data source that supports a subset of SQL.

My goal is to have Drill push selection and predicates down to my data source, but the rest of the query processing should take place in Drill.

I started out by writing a JDBC driver for the data source and registering that with Drill using the JDBC storage plugin, but it seems to just pass the whole query through to my data source, so that approach isn't going to work unless I'm missing something?

Is there any way to configure the JDBC storage plugin to only push certain parts of the query to the data source?

If this isn't a good approach, do I need to write a custom storage plugin? Can these be added on the classpath, or would that require me maintaining a fork of the project?

I appreciate any pointers anyone can give me.

Thanks,

Andy.
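As a closing illustration of the custom-plugin route the thread converges on: the filter rule Andy describes typically follows the shape of the Kafka plugin's push-down rule. A skeleton under that assumption (class and type names here are hypothetical; Paul's PR 1914 hides most of this behind its framework and does the work in the logical phase instead):

    import org.apache.calcite.plan.RelOptRuleCall;
    import org.apache.calcite.rex.RexNode;
    import org.apache.drill.exec.planner.logical.RelOptHelper;
    import org.apache.drill.exec.planner.physical.FilterPrel;
    import org.apache.drill.exec.planner.physical.ScanPrel;
    import org.apache.drill.exec.store.StoragePluginOptimizerRule;

    public class MyDbPushFilterIntoScan extends StoragePluginOptimizerRule {

      public static final MyDbPushFilterIntoScan INSTANCE =
          new MyDbPushFilterIntoScan();

      private MyDbPushFilterIntoScan() {
        super(RelOptHelper.some(FilterPrel.class, RelOptHelper.any(ScanPrel.class)),
            "MyDbPushFilterIntoScan:Filter_On_Scan");
      }

      @Override
      public boolean matches(RelOptRuleCall call) {
        // Fire only for scans that belong to this plugin.
        final ScanPrel scan = call.rel(1);
        return scan.getGroupScan() instanceof MyDbGroupScan;
      }

      @Override
      public void onMatch(RelOptRuleCall call) {
        final FilterPrel filter = call.rel(0);
        final ScanPrel scan = call.rel(1);
        final RexNode condition = filter.getCondition();
        // 1. Translate "condition" into predicates the backend supports.
        // 2. Build a new MyDbGroupScan carrying those predicates; its
        //    getScanStats() must now report a lower cost, or Calcite
        //    discards this alternative (Paul's point above).
        // 3. call.transformTo(...) with the new scan, keeping the filter
        //    on top for any predicates that could not be pushed.
      }
    }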
