Hi Andy,
Congratulations on making such fast progress!
The code to do filter pushdowns is rather complex and, it seems, most plugins
copy/paste the same wad of code (with the same bugs). PR 1914 provides a layer
that converts the messy Drill logical plan into a nice, simple set of
predicates. You can then pick and choose which to push down, allowing the
framework to do the rest.
Note that most of the plugins do push-down as part of physical planning. While
this works in most cases, it WILL NOT work if you are doing push-down in order
to shard the scan: for example, dividing a time range into pieces for a time
series scan. The PR therefore does push-down in the logical phase so that we
can "do the right thing."
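To make the sharding case concrete, here is a toy sketch (every name in it is
invented for illustration; none of this is a Drill or PR 1914 API): a
pushed-down time-range predicate gets split into slices, one per sub-scan.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a pushed-down time-range predicate into
// per-shard sub-ranges so each sub-scan reads one slice. This is only
// possible if the filter is pushed down before the physical plan is fixed.
public class TimeRangeSharder {
  static class Range {
    final long start; // inclusive
    final long end;   // exclusive
    Range(long start, long end) { this.start = start; this.end = end; }
  }

  // Divide [start, end) into at most maxShards roughly equal slices.
  static List<Range> shard(long start, long end, int maxShards) {
    List<Range> shards = new ArrayList<>();
    long span = end - start;
    long step = Math.max(1, (span + maxShards - 1) / maxShards);
    for (long s = start; s < end; s += step) {
      shards.add(new Range(s, Math.min(s + step, end)));
    }
    return shards;
  }
}
```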
When you say that getNewWithChildren() is called on an earlier instance, it is
very likely because Calcite gave up on your filter-push-down version: there was
no cost reduction.
The Wiki page mentioned earlier explains all the copies a bit. Basically, Drill
creates many copies of your GroupScan as it proceeds. First a "blank" one, then
another with projected columns, then another full copy as Calcite explores
planning options, and so on.
One key trick: if you implement filter push-down, you MUST return a lower cost
estimate after the push-down than before. Otherwise, Calcite decides the
push-down is not worth the hassle, since the costs are the same either way.
See the Wiki for details. This is what getScanStats() does: report stats that
must get lower as you improve the scan.
That is, one cost at the start, a lower cost after projection push-down
(reflecting the fact that we presumably now read less data per row), and a
lower cost again after filter push-down (because we read fewer rows). There is
a "Dummy" storage plugin in PR 1914 that illustrates all of this.
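In rough pseudo-numbers, the progression looks like the sketch below. This is
an invented cost model, not Drill's actual ScanStats class; the only point is
that each stage must report a strictly lower number than the one before.

```java
// Hypothetical cost model sketch -- not Drill's real ScanStats API.
// The point: estimated cost must drop at each push-down stage, or
// Calcite will prefer the original (un-pushed) plan.
public class ScanCostSketch {
  static final double ROW_COUNT = 1_000_000;
  static final int TOTAL_COLUMNS = 20;

  // Stage 1: nothing pushed down; assume full rows, full columns.
  static double baseCost() {
    return ROW_COUNT * TOTAL_COLUMNS;
  }

  // Stage 2: projection push-down; fewer columns read per row.
  static double afterProjection(int projectedColumns) {
    return ROW_COUNT * projectedColumns;
  }

  // Stage 3: filter push-down; an assumed selectivity shrinks the
  // row count, so the cost drops again.
  static double afterFilter(int projectedColumns, double selectivity) {
    return ROW_COUNT * selectivity * projectedColumns;
  }
}
```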
Don't worry about getDigest(); it is just Calcite trying to get a label to use
for its internal objects. You will need to implement getString(), using Drill's
"EXPLAIN PLAN" format, so your scan can appear in the text plan output. EXPLAIN
PLAN output is:
ClassName [field1=x, field2=y]
There is a little builder in PR 1914 to do this for you.
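If you want to see roughly what that builder does, here is a hypothetical
stand-in (the real class in PR 1914 may differ in name and details):

```java
import java.util.StringJoiner;

// Hypothetical stand-in for the plan-string builder in PR 1914:
// produces the "ClassName [field1=x, field2=y]" EXPLAIN PLAN format.
public class PlanString {
  private final StringJoiner fields = new StringJoiner(", ");
  private final String className;

  public PlanString(Object node) {
    // Label the entry with the plan node's simple class name.
    this.className = node.getClass().getSimpleName();
  }

  public PlanString field(String name, Object value) {
    fields.add(name + "=" + value);
    return this;
  }

  @Override
  public String toString() {
    return className + " [" + fields + "]";
  }
}
```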
Thanks,
- Paul
On Tuesday, January 14, 2020, 7:07:58 PM PST, Andy Grove
<[email protected]> wrote:
With some extra debugging I can see that the getNewWithChildren call is
made to an earlier instance of GroupScan and not the instance created by
the filter push-down rule. I'm wondering if this is some kind of
hashCode/equals/toString/getDigest issue?
On Tue, Jan 14, 2020 at 7:52 PM Andy Grove <[email protected]> wrote:
> I'm now working on predicate push down ... I have a filter rule that is
> correctly extracting the predicates that the backend database supports and
> I am creating a new GroupScan containing these predicates, using the Kafka
> plugin as a reference. I see the GroupScan constructor being called after
> this, with the predicates populated. So far so good ... but then I see calls
> to getDigest, getScanStats, and getNewWithChildren, and then I see calls to
> the GroupScan constructor with the predicates missing.
>
> Any pointers on what I might be missing? Is there more magic I need to
> know?
>
> Thanks!
>
> On Sun, Jan 12, 2020 at 5:34 PM Paul Rogers <[email protected]>
> wrote:
>
>> Hi Andy,
>>
>> Congrats! You are making good progress. Yes, the BatchCreator is a bit of
>> magic: Drill looks for a subclass that has your SubScan subclass as the
>> second parameter. Looks like you figured that out.
>>
>> Thanks,
>> - Paul
>>
>>
>>
>> On Sunday, January 12, 2020, 1:45:16 PM PST, Andy Grove <
>> [email protected]> wrote:
>>
>> Actually I managed to get past that error with an educated guess that if I
>> created a BatchCreator class, it would automagically be picked up somehow.
>> I'm now at the point where my RecordReader is being invoked!
>>
>> On Sun, Jan 12, 2020 at 2:03 PM Andy Grove <[email protected]> wrote:
>>
>> > Between reading the tutorial and copying and pasting code from the Kudu
>> > storage plugin, I've been making reasonable progress with this but I am
>> > confused by one error I'm now hitting.
>> > ExecutionSetupException: Failure finding OperatorCreator constructor for
>> > config com.mydb.MyDbSubScan
>> > Prior to this, Drill had called getSpecificScan and then called a few of
>> > the methods on my subscan object. I wasn't sure what to return for
>> > getOperatorType so just returned the kudu subscan operator type and I'm
>> > wondering if the issue is related to that somehow?
>> >
>> > Thanks.
>> >
>> >
>> > On Sat, Jan 11, 2020 at 10:13 PM Andy Grove <[email protected]>
>> wrote:
>> >
>> >> Thank you both for the those responses. This is very helpful. I have
>> >> ordered a copy of the book too. I'm using Drill 1.17.0.
>> >>
>> >> I'll take a look at the Jdbc Storage Plugin code and see if it would be
>> >> feasible to add the logic I need there. In parallel, I've started
>> >> implementing a new storage plugin. I'll be working on this more tomorrow
>> >> and I'm sure I'll be back with more questions soon.
>> >>
>> >> Thanks again for your help!
>> >>
>> >> Andy.
>> >>
>> >> On Sat, Jan 11, 2020 at 6:03 PM Charles Givre <[email protected]>
>> wrote:
>> >>
>> >>> HI Andy,
>> >>> Thanks for your interest in Drill. I'm glad to see that Paul wrote you
>> >>> back as well. I was going to say I thought the JDBC storage plugin did
>> >>> in fact push down columns and filters to the source system.
>> >>>
>> >>> Also, what version of Drill are you using?
>> >>>
>> >>> Writing a storage plugin for Drill is not trivial and I'd definitely
>> >>> recommend using the code from Paul's PR as that greatly simplifies
>> >>> things.
>> >>> Here is a tutorial as well:
>> >>> https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin
>> >>>
>> >>> If you need additional help, please let us know.
>> >>> -- C
>> >>>
>> >>>
>> >>> On Jan 11, 2020, at 5:57 PM, Andy Grove <[email protected]>
>> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I'd like to use Apache Drill with a custom data source that supports a
>> >>> subset of SQL.
>> >>>
>> >>> My goal is to have Drill push selection and predicates down to my data
>> >>> source but the rest of the query processing should take place in Drill.
>> >>>
>> >>> I started out by writing a JDBC driver for the data source and
>> >>> registering that with Drill using the Jdbc Storage Plugin, but it seems
>> >>> to just pass the whole query through to my data source, so that
>> >>> approach isn't going to work unless I'm missing something?
>> >>>
>> >>> Is there any way to configure the JDBC storage plugin to only push
>> >>> certain parts of the query to the data source?
>> >>>
>> >>> If this isn't a good approach, do I need to write a custom storage
>> >>> plugin? Can these be added on the classpath or would that require me
>> >>> maintaining a fork of the project?
>> >>>
>> >>>
>> >>>
>> >>> I appreciate any pointers anyone can give me.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Andy.
>> >>>
>> >>>
>> >>>
>>
>
>