Re: Looking for advice on integrating with a custom data source

Andy Grove Thu, 16 Jan 2020 19:03:16 -0800

Hi Charles,

I would like to be able to contribute something out of this effort. The PoC
I am working on is quite fluid at the moment but one possible outcome is
that this storage engine ends up supporting Arrow Flight, but I'm not sure
yet.


Andy.

On Wed, Jan 15, 2020 at 7:19 AM Charles Givre <[email protected]> wrote:

> Andy,
> Glad to hear you got it working!!   Can you share what data source you are
> working with?  Is it completely custom to your organization?  If not, would
> you consider submitting this as a pull request?
> Best,
> -- C
>
>
>
> > On Jan 15, 2020, at 9:07 AM, Andy Grove <[email protected]> wrote:
> >
> > And boom! With just 3 extra lines of code to adjust the CBO to make the row
> > count inversely proportional to the number of predicates, my little PoC
> > works :-)
> >
> > Now that I've achieved the instant gratification (relatively speaking!) of
> > making something work, I think it's time to step back and start doing this
> > the right way with the PR you mentioned.
> >
> > I would not have been able to get this working at all without all the
> > fantastic support!
> >
> > Thanks,
> >
> > Andy.
> >
> >
> >
> > On Tue, Jan 14, 2020 at 11:43 PM Paul Rogers <[email protected]>
> > wrote:
> >
> >> Hi Andy,
> >>
> >> Congratulations on making such fast progress!
> >>
> >> The code to do filter pushdowns is rather complex and, it seems, most
> >> plugins copy/paste the same wad of code (with the same bugs). PR 1914
> >> provides a layer that converts the messy Drill logical plan into a nice,
> >> simple set of predicates. You can then pick and choose which to push down,
> >> allowing the framework to do the rest.
> >>
> >> Note that most of the plugins do push-down as part of physical planning.
> >> While this works in most cases, it WILL NOT work if you are doing push-down
> >> in order to shard the scan, for example, to divide a time range up into
> >> pieces for a time series scan. The PR thus does push-down in the
> >> logical phase so that we can "do the right thing."
> >>
> >> When you say that getNewWithChildren() is for an earlier instance, it is
> >> very likely because Calcite gave up on your filter-push-down version
> >> because there was no cost reduction.
> >>
> >>
> >> The Wiki page mentioned earlier explains all the copies a bit. Basically,
> >> Drill creates many copies of your GroupScan as it proceeds. First a "blank"
> >> one, then another with projected columns, then another full copy as Calcite
> >> explores planning options, and so on.
> >>
> >> One key trick is that if you implement filter push down, you MUST return a
> >> lower cost estimate after the push-down than before. Otherwise, Calcite
> >> decides that the push-down is not worth the hassle because the costs remain
> >> the same. See the Wiki for details. This is what getScanStats() does:
> >> report stats that must get lower as you improve the scan.
> >>
> >> That is, one cost at the start, a lower cost after projection push down
> >> (reflecting the fact that we presumably now read less data per row) and a
> >> lower cost again after filter-push down (because we read fewer rows.) There
> >> is a "Dummy" storage plugin in PR 1914 that illustrates all of this.
> >>
> >> Don't worry about getDigest(), it is just Calcite trying to get a label to
> >> use for its internal objects. You will need to implement getString(), using
> >> Drill's "EXPLAIN PLAN" format, so your scan can appear in the text plan
> >> output. EXPLAIN PLAN output is:
> >>
> >> ClassName [field1=x, field2=y]
> >>
> >> There is a little builder in PR 1914 to do this for you.
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >>
> >>    On Tuesday, January 14, 2020, 7:07:58 PM PST, Andy Grove <
> >> [email protected]> wrote:
> >>
> >> With some extra debugging I can see that the getNewWithChildren call is
> >> made to an earlier instance of GroupScan and not the instance created by
> >> the filter push-down rule. I'm wondering if this is some kind of
> >> hashCode/equals/toString/getDigest issue?
> >>
> >> On Tue, Jan 14, 2020 at 7:52 PM Andy Grove <[email protected]>
> wrote:
> >>
> >>> I'm now working on predicate push down ... I have a filter rule that is
> >>> correctly extracting the predicates that the backend database supports and
> >>> I am creating a new GroupScan containing these predicates, using the Kafka
> >>> plugin as a reference. I see the GroupScan constructor being called after
> >>> this, with the predicates populated. So far so good ... but then I see calls
> >>> to getDigest, getScanStats, and getNewWithChildren, and then I see calls to
> >>> the GroupScan constructor with the predicates missing.
> >>>
> >>> Any pointers on what I might be missing? Is there more magic I need to
> >>> know?
> >>>
> >>> Thanks!
> >>>
> >>> On Sun, Jan 12, 2020 at 5:34 PM Paul Rogers <[email protected]
> >
> >>> wrote:
> >>>
> >>>> Hi Andy,
> >>>>
> >>>> Congrats! You are making good progress. Yes, the BatchCreator is a bit of
> >>>> magic: Drill looks for a subclass that has your SubScan subclass as the
> >>>> second parameter. Looks like you figured that out.
> >>>>
> >>>> Thanks,
> >>>> - Paul
> >>>>
> >>>>
> >>>>
> >>>>   On Sunday, January 12, 2020, 1:45:16 PM PST, Andy Grove <
> >>>> [email protected]> wrote:
> >>>>
> >>>> Actually I managed to get past that error with an educated guess that if I
> >>>> created a BatchCreator class, it would automagically be picked up somehow.
> >>>> I'm now at the point where my RecordReader is being invoked!
> >>>>
> >>>> On Sun, Jan 12, 2020 at 2:03 PM Andy Grove <[email protected]>
> >> wrote:
> >>>>
> >>>>> Between reading the tutorial and copying and pasting code from the Kudu
> >>>>> storage plugin, I've been making reasonable progress with this but am a
> >>>>> bit confused by one error I'm now hitting.
> >>>>> ExecutionSetupException: Failure finding OperatorCreator constructor for
> >>>>> config com.mydb.MyDbSubScan
> >>>>> Prior to this, Drill had called getSpecificScan and then called a few of
> >>>>> the methods on my subscan object. I wasn't sure what to return for
> >>>>> getOperatorType so just returned the Kudu subscan operator type and I'm
> >>>>> wondering if the issue is related to that somehow?
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>
> >>>>> On Sat, Jan 11, 2020 at 10:13 PM Andy Grove <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>>> Thank you both for those responses. This is very helpful. I have
> >>>>>> ordered a copy of the book too. I'm using Drill 1.17.0.
> >>>>>>
> >>>>>> I'll take a look at the Jdbc Storage Plugin code and see if it would be
> >>>>>> feasible to add the logic I need there. In parallel, I've started
> >>>>>> implementing a new storage plugin. I'll be working on this more tomorrow
> >>>>>> and I'm sure I'll be back with more questions soon.
> >>>>>>
> >>>>>> Thanks again for your help!
> >>>>>>
> >>>>>> Andy.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Jan 11, 2020 at 6:03 PM Charles Givre <[email protected]>
> >>>> wrote:
> >>>>>>
> >>>>>>> Hi Andy,
> >>>>>>> Thanks for your interest in Drill.  I'm glad to see that Paul wrote you
> >>>>>>> back as well.  I was going to say I thought the JDBC storage plugin did in
> >>>>>>> fact push down columns and filters to the source system.
> >>>>>>>
> >>>>>>> Also, what version of Drill are you using?
> >>>>>>>
> >>>>>>> Writing a storage plugin for Drill is not trivial and I'd definitely
> >>>>>>> recommend using the code from Paul's PR as that greatly simplifies things.
> >>>>>>> Here is a tutorial as well:
> >>>>>>> https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin
> >>>>>>>
> >>>>>>> If you need additional help, please let us know.
> >>>>>>> -- C
> >>>>>>>
> >>>>>>>
> >>>>>>> On Jan 11, 2020, at 5:57 PM, Andy Grove <[email protected]>
> >>>> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'd like to use Apache Drill with a custom data source that supports a
> >>>>>>> subset of SQL.
> >>>>>>>
> >>>>>>> My goal is to have Drill push selection and predicates down to my data
> >>>>>>> source but the rest of the query processing should take place in Drill.
> >>>>>>>
> >>>>>>> I started out by writing a JDBC driver for the data source and registering
> >>>>>>> that with Drill using the Jdbc Storage Plugin but it seems to just pass the
> >>>>>>> whole query through to my data source, so that approach isn't going to work
> >>>>>>> unless I'm missing something?
> >>>>>>>
> >>>>>>> Is there any way to configure the JDBC storage plugin to only push certain
> >>>>>>> parts of the query to the data source?
> >>>>>>>
> >>>>>>> If this isn't a good approach, do I need to write a custom storage plugin?
> >>>>>>> Can these be added on the classpath or would that require me maintaining a
> >>>>>>> fork of the project?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> I appreciate any pointers anyone can give me.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Andy.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>
> >>>
> >>>
> >>
>
>
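
The cost rule discussed in the thread (one estimate at the start, a lower one
after projection push-down, a lower one again after filter push-down) can be
sketched in plain Java. This is only an illustration under stated assumptions,
not Drill's API: the class ScanCostSketch, its method names, and the
"rows times columns" cost formula are all hypothetical. The divisor
(1 + pushedPredicates) is one possible reading of "row count inversely
proportional to the number of predicates" that stays defined when no
predicate has been pushed.

```java
/**
 * Hypothetical cost model for a scan. None of these names come from Drill;
 * a real plugin would report comparable numbers from getScanStats().
 */
final class ScanCostSketch {

    private ScanCostSketch() {
    }

    /**
     * Estimated rows after filter push-down. Assumption: each pushed
     * predicate shrinks the estimate, using baseRows / (1 + n) so that
     * n = 0 returns the unfiltered row count.
     */
    static double estimateRows(double baseRows, int pushedPredicates) {
        if (pushedPredicates < 0) {
            throw new IllegalArgumentException("pushedPredicates must be >= 0");
        }
        return baseRows / (1 + pushedPredicates);
    }

    /**
     * Estimated cost as rows times columns read. Projection push-down
     * lowers the column count (less data per row); filter push-down
     * lowers the row count.
     */
    static double estimateCost(double baseRows, int totalColumns,
                               int projectedColumns, int pushedPredicates) {
        if (totalColumns <= 0 || projectedColumns <= 0) {
            throw new IllegalArgumentException("column counts must be > 0");
        }
        int columnsRead = Math.min(projectedColumns, totalColumns);
        return estimateRows(baseRows, pushedPredicates) * columnsRead;
    }
}
```

With 1000 base rows and 10 columns, reading every column costs more than
reading 4 projected columns, which in turn costs more than reading 4 columns
with 2 predicates pushed. That strictly decreasing sequence is the property
the planner needs in order to prefer the pushed-down scan.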

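The "ClassName [field1=x, field2=y]" plan format quoted in the thread can be
produced with a small helper. The sketch below is not the builder from
PR 1914, whose API is not shown in this thread; PlanStringSketch and its
field() method are hypothetical names, and MyDbGroupScan is an illustrative
class name modelled on the com.mydb.MyDbSubScan mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/**
 * Hypothetical helper that renders "ClassName [field1=x, field2=y]".
 * Fields are kept in insertion order so the plan text is stable.
 */
final class PlanStringSketch {

    private final String className;
    private final Map<String, Object> fields = new LinkedHashMap<>();

    PlanStringSketch(String className) {
        this.className = className;
    }

    /** Adds one name=value pair and returns this helper for chaining. */
    PlanStringSketch field(String name, Object value) {
        fields.put(name, value);
        return this;
    }

    @Override
    public String toString() {
        // Prefix and suffix supply the class name and the brackets.
        StringJoiner joiner = new StringJoiner(", ", className + " [", "]");
        fields.forEach((name, value) -> joiner.add(name + "=" + value));
        return joiner.toString();
    }
}
```

A scan class could then build its plan text with something like
new PlanStringSketch("MyDbGroupScan").field("table", "t1").field("predicates", 2).toString().
Including the pushed predicates in that text is a design choice that makes it
visible in the plan output whether the push-down actually happened.
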