Re: Drill for Data Virtualization

Kunal Khatua Thu, 11 Apr 2019 13:59:24 -0700

On 4/11/2019 12:39:24 PM, Sarnath K <[email protected]> wrote:
Thank you Kunal.


>>>You could try creating views for each source and then doing a group by
on the union of those views... that *might* get you the results you want

When you mention views, do you mean that each view would be a GROUP BY
statement for its particular source, and that we then union the views and
group again? That way the query itself explicitly forces the pushdown. That's
the idea you are referring to, right?
Kunal Khatua: That is correct. Worth a try. Start with querying the views 
individually to see if the pushdown occurs in the first place. 
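
A minimal sketch of the idea above (a GROUP BY per source, then a second
GROUP BY over the union of the partial results). It does not use Drill or
Calcite: two in-memory SQLite databases stand in for sources A and B, and a
third stands in for the Drill side. The `sales` table, its columns, and the
function names are hypothetical, chosen only to illustrate why MAX, MIN and
SUM can be re-aggregated from per-source partial results.

```python
import sqlite3


def make_source(rows):
    """Build an in-memory SQLite database standing in for one JDBC source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# The per-source "view": this GROUP BY runs inside each source, so only one
# row per region leaves the source instead of every raw row.
PARTIAL_SQL = """
    SELECT region, MAX(amount), MIN(amount), SUM(amount)
    FROM sales
    GROUP BY region
"""

# The second-level GROUP BY over the union of the partial results.
FINAL_SQL = """
    SELECT region, MAX(mx), MIN(mn), SUM(total)
    FROM partials
    GROUP BY region
    ORDER BY region
"""


def federated_group_by(sources):
    """Collect per-source partial aggregates, then aggregate them again."""
    merger = sqlite3.connect(":memory:")  # stand-in for the Drill side
    merger.execute(
        "CREATE TABLE partials (region TEXT, mx INTEGER, mn INTEGER, total INTEGER)"
    )
    for conn in sources:
        # Equivalent to a UNION ALL of the per-source views.
        merger.executemany(
            "INSERT INTO partials VALUES (?, ?, ?, ?)", conn.execute(PARTIAL_SQL)
        )
    return merger.execute(FINAL_SQL).fetchall()


source_a = make_source([("east", 10), ("east", 30), ("west", 5)])   # hot store A
source_b = make_source([("east", 50), ("west", 7), ("north", 2)])   # cold store B
result = federated_group_by([source_a, source_b])
print(result)
```

This works for MAX, MIN and SUM because each can be recomputed from partial
results. An aggregate such as AVG would have to be carried as a SUM and COUNT
pair to be merged the same way. Whether Drill actually pushes the inner
GROUP BY down to a JDBC source still has to be checked in the query plan.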



Btw, Calcite (possibly) not recognizing the pushdown opportunity would be
a letdown, especially for flexible frameworks like Drill, in my
opinion.
Kunal Khatua: Yes, but again... Calcite is an independent open-source project
in use by many other OSS and commercial vendors. Considering many such projects
are driven by volunteer contributions, it's a miracle in my opinion that
open-source software is able to achieve so much (and, sometimes, put
commercial offerings to shame) without charging a penny to the end users.

Developers behind Drill have made contributions to Calcite in their limited
capacity, as have developers from other projects. So Drill has actually
benefited from Calcite far more than it could have by implementing
its own Calcite substitute. Hopefully, someone in the community can take a look
at enhancing this feature as well.

Thanks for your time. Much appreciated. I will keep you posted.

Best,
Sarnath

On Thu, Apr 11, 2019, 23:25 Kunal Khatua wrote:

> Hi Sarnath
>
> I haven't tried your specific requirement, and it is possible that if you
> are querying only A or only B, Drill would be able to push it down to the
> source.
>
> However, it gets tricky when you are querying 2 or more sources in the
> same query, because (from my limited knowledge of Calcite) the Calcite
> parser needs to be aware that it can push filters down to both sources.
> With GROUP BY, multiple groupings across a single source versus across
> multiple sources are not semantically the same.
>
> You could try creating views for each source and then doing a group by on
> the union of those views... that *might* get you the results you want.
>
> You can give it a shot, but I suspect it won't be as performant. Let us
> know if you find it otherwise.
>
> ~ Kunal
>
> On 4/10/2019 9:02:24 PM, Sarnath K wrote:
> Hi Kunal,
>
> Thank you for your response. But what I read in this URL says it can be
> done (though my own interpretation is muddled)
> https://drill.apache.org/docs/rdbms-storage-plugin/
>
> There is a statement in the documentation that says:
>
> As with any source, Drill supports joins within and between all systems.
> Drill additionally has powerful pushdown capabilities with RDBMS sources.
> This includes support to push down join, where, group by, intersect and
> other SQL operations into a particular RDBMS source (as appropriate).
>
>
> >> That said, even if the feature existed, by design, only one fragment can
> >> read from a JDBC storage plugin, as it uses a single connection to stream
> >> out the resultset.
>
> I did not understand this. Say I GROUP BY a particular column and perform
> "max", "min" and "sum" aggregations. These are all associative group summary
> operations. So I can send a MAX query to A and then a MAX query to B, get the
> results from both into the Drill cluster, and then perform a MAX on the
> partially reduced results. This will be cheaper than loading all data from A
> and B into Drill and then performing the GROUP BY operation.
>
> Can Drill do these smart group-by operations as of today? The documentation
> I read above is encouraging (it's pretty recent - Dec 2018).
>
> Thanks for your time,
> Best,
> Sarnath
>
>
>
> On Thu, Apr 11, 2019 at 1:54 AM Kunal Khatua wrote:
>
> > Hi Sarnath
> >
> > From what I understand by your description, you are looking to see if
> > Drill can push down the GROUP BY clause to the underlying JDBC sources A
> > and B.
> >
> > Unfortunately, Drill does not support pushdown for the JDBC storage
> > plugin as yet. That said, even if the feature existed, by design, only
> > one fragment can read from a JDBC storage plugin, as it uses a single
> > connection to stream out the resultset.
> >
> > ~ Kunal
> >
> > On 4/9/2019 8:59:49 AM, Sarnath K wrote:
> > Hi,
> >
> > I have a requirement where I need to split data between a fast RDBMS
> > system (A) that will have HOT data and a slower cold storage (B)
> >
> > Both A and B provide JDBC drivers
> >
> > I am looking to see if Drill will help me in coming up with a JDBC URL
> > (C) which will hide the fact that data is split between A and B, i.e. can
> > Drill be used to implement Data Virtualization?
> >
> > As much as I can read about Drill, I can definitely create 2 tables in
> > Drill one pointing to A and another to B.
> > However, when I do GROUP BY queries or FILTER queries, does Drill take
> > advantage of the existing JDBC systems by actually sending a part of the
> > GROUP BY to A and another to B and then reducing the results again? i.e.
> > some kind of smart predicate push-down for analytical queries?
> >
> > Hope I sound clear to you. Appreciate your response much.
> >
> > Thank you,
> >
> > Best,
> > Sarnath
> >
>
