Re: Looking for advice on integrating with a custom data source

2020-01-14 Thread Paul Rogers
Hi Andy,

Congratulations on making such fast progress!

The code to do filter pushdowns is rather complex and, it seems, most plugins 
copy/paste the same wad of code (with the same bugs). PR 1914 provides a layer 
that converts the messy Drill logical plan into a nice, simple set of 
predicates. You can then pick and choose which to push down, allowing the 
framework to do the rest.
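In spirit, the layer hands you a plain list of simple predicates and you choose which to accept. Here is a toy, Drill-independent sketch of that partitioning step (the class and method names are hypothetical, not the PR's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a simplified predicate: column OP literal.
record Predicate(String column, String op, String value) {}

public class PushdownSketch {
    // Pretend the external source can only handle comparisons
    // on the "ts" and "id" columns.
    static boolean canPushDown(Predicate p) {
        return (p.column().equals("ts") || p.column().equals("id"))
                && List.of("=", "<", ">", "<=", ">=").contains(p.op());
    }

    // Partition the predicate list the framework hands us: pushed
    // ones go to the data source, the rest stay in Drill's Filter.
    static List<List<Predicate>> partition(List<Predicate> all) {
        List<Predicate> pushed = new ArrayList<>();
        List<Predicate> retained = new ArrayList<>();
        for (Predicate p : all) {
            (canPushDown(p) ? pushed : retained).add(p);
        }
        return List.of(pushed, retained);
    }

    public static void main(String[] args) {
        List<Predicate> preds = List.of(
                new Predicate("ts", ">", "2020-01-01"),
                new Predicate("name", "LIKE", "%drill%"));
        List<List<Predicate>> parts = partition(preds);
        System.out.println("pushed=" + parts.get(0).size()
                + " retained=" + parts.get(1).size());
    }
}
```

The framework then rebuilds the Filter with whatever you did not accept.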

Note that most of the plugins do push-down as part of physical planning. While 
this works in most cases, it WILL NOT work if you are doing push-down in order 
to shard the scan, for example, to divide a time range up into pieces for a 
time series scan. The PR thus does push-down in the logical phase so that we 
can "do the right thing."
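The sharding itself is simple arithmetic once the time-range predicate is in hand; a standalone sketch (nothing here is Drill API) of splitting a range into N sub-scans:

```java
import java.util.ArrayList;
import java.util.List;

public class TimeShards {
    // Split [start, end) into n near-equal sub-ranges, one per scan
    // fragment; any remainder is spread over the first few shards.
    static List<long[]> split(long start, long end, int n) {
        List<long[]> shards = new ArrayList<>();
        long span = end - start, base = span / n, rem = span % n;
        long cursor = start;
        for (int i = 0; i < n; i++) {
            long width = base + (i < rem ? 1 : 0);
            shards.add(new long[] {cursor, cursor + width});
            cursor += width;
        }
        return shards;
    }

    public static void main(String[] args) {
        // One day of epoch-millis split across 4 scan fragments.
        for (long[] r : split(0L, 86_400_000L, 4)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

The point of doing the push-down in logical planning is that these shard boundaries are known before Drill decides how to parallelize the scan.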

When you say that getNewWithChildren() is called on an earlier instance, it is 
very likely because Calcite gave up on your filter-push-down version: there 
was no cost reduction.


The Wiki page mentioned earlier explains all the copies a bit. Basically, Drill 
creates many copies of your GroupScan as it proceeds. First a "blank" one, then 
another with projected columns, then another full copy as Calcite explores 
planning options, and so on.

One key trick is that if you implement filter push-down, you MUST return a 
lower cost estimate after the push-down than before. Otherwise, Calcite decides 
that the push-down is not worth the hassle if the costs remain the same. 
See the Wiki for details. This is what getScanStats() does: report stats that 
must get lower as you improve the scan.

That is, one cost at the start, a lower cost after projection push-down 
(reflecting the fact that we presumably now read less data per row), and a 
lower cost again after filter push-down (because we read fewer rows). There is 
a "Dummy" storage plugin in PR 1914 that illustrates all of this.
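A toy illustration of that costing contract (the numbers and the class here are made up; in Drill the analogous hook is your GroupScan's getScanStats()):

```java
public class CostSketch {
    // Simplified scan cost: rows * bytesPerRow. Drill's ScanStats is
    // richer than this, but the monotonicity requirement is the same.
    static double cost(long rows, int bytesPerRow) {
        return (double) rows * bytesPerRow;
    }

    public static void main(String[] args) {
        double initial   = cost(1_000_000, 200); // blank scan: all columns
        double projected = cost(1_000_000, 40);  // after projection push-down
        double filtered  = cost(100_000, 40);    // after filter push-down

        // Calcite keeps the pushed-down plan only if it looks cheaper.
        System.out.println(initial > projected && projected > filtered);
    }
}
```

If the three estimates came out equal, Calcite would keep the original plan and you would see exactly the "earlier instance" behavior described above.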

Don't worry about getDigest(); it is just Calcite trying to get a label to use 
for its internal objects. You will need to implement getString(), using Drill's 
"EXPLAIN PLAN" format, so your scan can appear in the text plan output. EXPLAIN 
PLAN output looks like:

ClassName [field1=x, field2=y]

There is a little builder in PR 1914 to do this for you.
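The format is easy to produce by hand if you are not using the PR's builder; a minimal standalone version (the builder's real name in the PR may differ):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class PlanString {
    // Render "ClassName [field1=x, field2=y]" from an ordered field map.
    static String planString(String className, Map<String, Object> fields) {
        return className + " [" + fields.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(", ")) + "]";
    }

    public static void main(String[] args) {
        Map<String, Object> f = new LinkedHashMap<>();
        f.put("columns", "[a, b]");
        f.put("predicates", "[ts > 5]");
        System.out.println(planString("MyDbGroupScan", f));
    }
}
```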

Thanks,
- Paul

 

On Tuesday, January 14, 2020, 7:07:58 PM PST, Andy Grove 
 wrote:  
 
 With some extra debugging I can see that the getNewWithChildren call is
made to an earlier instance of GroupScan and not the instance created by
the filter push-down rule. I'm wondering if this is some kind of
hashCode/equals/toString/getDigest issue?

On Tue, Jan 14, 2020 at 7:52 PM Andy Grove  wrote:

> I'm now working on predicate push down ... I have a filter rule that is
> correctly extracting the predicates that the backend database supports and
> I am creating a new GroupScan containing these predicates, using the Kafka
> plugin as a reference. I see the GroupScan constructor being called after
> this, with the predicates populated. So far so good ... but then I see calls
> to getDigest, getScanStats, and getNewWithChildren, and then I see calls to
> the GroupScan constructor with the predicates missing.
>
> Any pointers on what I might be missing? Is there more magic I need to
> know?
>
> Thanks!
>
> On Sun, Jan 12, 2020 at 5:34 PM Paul Rogers 
> wrote:
>
>> Hi Andy,
>>
>> Congrats! You are making good progress. Yes, the BatchCreator is a bit of
>> magic: Drill looks for a subclass that has your SubScan subclass as the
>> second parameter. Looks like you figured that out.
>>
>> Thanks,
>> - Paul
>>
>>
>>
>>    On Sunday, January 12, 2020, 1:45:16 PM PST, Andy Grove <
>> andygrov...@gmail.com> wrote:
>>
>>  Actually I managed to get past that error with an educated guess that if
>> I
>> created a BatchCreator class, it would automagically be picked up somehow.
>> I'm now at the point where my RecordReader is being invoked!
>>
>> On Sun, Jan 12, 2020 at 2:03 PM Andy Grove  wrote:
>>
>> > Between reading the tutorial and copying and pasting code from the Kudu
>> > storage plugin, I've been making reasonable progress with this but am a
>> > bit confused by one error I'm now hitting:
>> > ExecutionSetupException: Failure finding OperatorCreator constructor for
>> > config com.mydb.MyDbSubScan
>> > Prior to this, Drill had called getSpecificScan and then called a few of
>> > the methods on my subscan object. I wasn't sure what to return for
>> > getOperatorType so just returned the kudu subscan operator type and I'm
>> > wondering if the issue is related to that somehow?
>> >
>> > Thanks.
>> >
>> >
>> > On Sat, Jan 11, 2020 at 10:13 PM Andy Grove 
>> wrote:
>> >
>> >> Thank you both for those responses. This is very helpful. I have
>> >> ordered a copy of the book too. I'm using Drill 1.17.0.
>> >>
>> >> I'll take a look at the Jdbc Storage Plugin code and see if it would be
>> >> feasible to add the logic I need there. In parallel, I've started
>> >> implementing a new storage plugin. I'll be working on this more
>> tomorrow
>> >> and I'm sure I'll be back with more questions soon.

Re: Looking for advice on integrating with a custom data source

2020-01-14 Thread Andy Grove
With some extra debugging I can see that the getNewWithChildren call is
made to an earlier instance of GroupScan and not the instance created by
the filter push-down rule. I'm wondering if this is some kind of
hashCode/equals/toString/getDigest issue?

>> > On Sat, Jan 11, 2020 at 10:13 PM Andy Grove 
>> wrote:
>> >
>> >> Thank you both for those responses. This is very helpful. I have
>> >> ordered a copy of the book too. I'm using Drill 1.17.0.
>> >>
>> >> I'll take a look at the Jdbc Storage Plugin code and see if it would be
>> >> feasible to add the logic I need there. In parallel, I've started
>> >> implementing a new storage plugin. I'll be working on this more
>> tomorrow
>> >> and I'm sure I'll be back with more questions soon.
>> >>
>> >> Thanks again for your help!
>> >>
>> >> Andy.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Sat, Jan 11, 2020 at 6:03 PM Charles Givre 
>> wrote:
>> >>
>> >>> Hi Andy,
>> >>> Thanks for your interest in Drill.  I'm glad to see that Paul wrote
>> you
>> >>> back as well.  I was going to say I thought the JDBC storage plugin
>> did in
>> >>> fact push down columns and filters to the source system.
>> >>>
>> >>> Also, what version of Drill are you using?
>> >>>
>> >>> Writing a storage plugin for Drill is not trivial and I'd definitely
>> >>> recommend using the code from Paul's PR as that greatly simplifies
>> things.
>> >>> Here is a tutorial as well:
>> >>> https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin
>> >>>
>> >>> If you need additional help, please let us know.
>> >>> -- C
>> >>>
>> >>>
>> >>> On Jan 11, 2020, at 5:57 PM, Andy Grove 
>> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> I'd like to use Apache Drill with a custom data source that supports a
>> >>> subset of SQL.
>> >>>
>> >>> My goal is to have Drill push selection and predicates down to my data
>> >>> source but the rest of the query processing should take place in
>> Drill.
>> >>>
>> >>> I started out by writing a JDBC driver for the data source and
>> >>> registering
>> >>> that with Drill using the Jdbc Storage Plugin but it seems to just
>> pass
>> >>> the
>> >>> whole query through to my data source, so that approach isn't going to
>> >>> work
>> >>> unless I'm missing something?
>> >>>
>> >>> Is there any way to configure the JDBC storage plugin to only push
>> >>> certain
>> >>> parts of the query to the data source?
>> >>>
>> >>> If this isn't a good approach, do I need to write a custom storage
>> >>> plugin?
>> >>> Can these be added on the classpath or would that require me
>> maintaining
>> >>> a
>> >>> fork of the project?
>> >>>
>> >>>
>> >>>
>> >>> I appreciate any pointers anyone can give me.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Andy.
>> >>>
>> >>>
>> >>>
>>
>
>


Re: Looking for advice on integrating with a custom data source

2020-01-14 Thread Andy Grove
I'm now working on predicate push down ... I have a filter rule that is
correctly extracting the predicates that the backend database supports and
I am creating a new GroupScan containing these predicates, using the Kafka
plugin as a reference. I see the GroupScan constructor being called after
this, with the predicates populated. So far so good ... but then I see calls
to getDigest, getScanStats, and getNewWithChildren, and then I see calls to
the GroupScan constructor with the predicates missing.

Any pointers on what I might be missing? Is there more magic I need to know?

Thanks!



Re: Apache Drill documentation updates

2020-01-14 Thread Paul Rogers
Hi Bridget,

Thanks much for the update and for all your efforts on Drill documentation over 
the last several years. Thanks for squeezing in time to handle the release 
updates. Also, thanks for documenting how to update the documentation: the team 
should be able to keep information updated within the structure you created for 
us.


- Paul

 

On Monday, January 13, 2020, 11:38:06 AM PST, Bevens, Bridget 
 wrote:  
 
 Hi,



I wanted to let everyone know that I won’t be working regularly on the Apache 
Drill documentation, but I’ll be available to update the website for the 
releases, which includes:

  *  Generating the release notes, blog, What’s New pages
  *  Updating files with the new release version/pointing files to the correct 
mirror site
  *  Publishing the Apache Drill website for the release


If you have documentation updates or feature documentation that you want to add 
to the Apache Drill project, please submit your updates through a pull request. 
I am planning to review and publish content at the end of each month and before 
each release.

For information about how to update the documentation, refer to the 
Documentation Guidelines in the README.md file in the Apache Drill gh-pages 
branch.

Thanks,

Bridget
  

Re: querying json from multiple subdirectories

2020-01-14 Thread Charles Givre
Hi Prabhakar, 
I would think that the following query would work:

SELECT * 
FROM dfs..`transactions/`

That should merge everything into one table and you should get a dir0 column 
with the directory names.
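For instance, with the directory layout from the question (the workspace name is elided here exactly as in the query above), dir0 can also restrict which year directories are read; a sketch, assuming the trans.json files share a schema:

```sql
-- dir0 holds the first-level directory name (Year2012, Year2013, ...)
SELECT t.dir0 AS yr, t.*
FROM dfs..`transactions` t
WHERE t.dir0 IN ('Year2012', 'Year2013');
```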

--C

> On Jan 14, 2020, at 4:56 AM, Prabhakar Bhosaale  wrote:
> 
> Hi All,
> 
> I am new to apache drill and trying to retrieve data from json files by
> querying the directories.
> 
> The directory structure is
> 
>                |-->Year2012--->trans.json
>                |
> transactions-->|
>                |
>                |-->Year2013--->trans.json
> 
> I would like to query trans.json from both the sub-directories as one table
> and then join the resultant table with another table in a single query.
> Please help with possible options. thx
> 
> Regards
> Prabhakar



Re: querying json from multiple subdirectories

2020-01-14 Thread Arina Yelchiyeva
Hi, 

Drill can easily query directories including subdirectories and then join data 
with other directories, tables etc.
Please refer to Drill documentation for more details.
For example, you can start from this article: 
https://drill.apache.org/docs/querying-directories/ 
 

Kind regards,
Arina

> On Jan 14, 2020, at 11:56 AM, Prabhakar Bhosaale  
> wrote:
> 
> Hi All,
> 
> I am new to apache drill and trying to retrieve data from json files by
> querying the directories.
> 
> The directory structure is
> 
>                |-->Year2012--->trans.json
>                |
> transactions-->|
>                |
>                |-->Year2013--->trans.json
> 
> I would like to query trans.json from both the sub-directories as one table
> and then join the resultant table with another table in a single query.
> Please help with possible options. thx
> 
> Regards
> Prabhakar



querying json from multiple subdirectories

2020-01-14 Thread Prabhakar Bhosaale
Hi All,

I am new to apache drill and trying to retrieve data from json files by
querying the directories.

The directory structure is

               |-->Year2012--->trans.json
               |
transactions-->|
               |
               |-->Year2013--->trans.json

I would like to query trans.json from both the sub-directories as one table
and then join the resultant table with another table in a single query.
Please help with possible options. thx

Regards
Prabhakar