Re: Drill for Data Virtualization

Sarnath K Sat, 13 Apr 2019 02:30:04 -0700

Hi Kunal,
I examined the plan for a simple GROUP BY. The GROUP BY is pushed into the
JDBC step, whose output feeds the project step, so pushdown appears to be
working fine.
We are trying other cases. I will keep you posted.
Thank you for your help and time.
I understand Calcite is a great community effort...I have been following it
for quite some time. Thanks!!
Best,
Sarnath


On Fri, Apr 12, 2019, 02:28 Kunal Khatua wrote:

> On 4/11/2019 12:39:24 PM, Sarnath K wrote:
> Thank you Kunal.
>
> >>>You could try creating views for each source and then doing a group by
> on the union of those views... that *might* get you the results you want
>
> When you mention views, do you mean that each view would be a GROUP BY
> statement for its particular source, and that we then union the views and
> group again? That way the query itself is written to force the pushdown.
> Is that the idea you are referring to?
> Kunal Khatua: That is correct. Worth a try. Start with querying the views
> individually to see if the pushdown occurs in the first place.
>
>
>
> Btw, Calcite (possibly) not recognizing the pushdown opportunity would be
> a letdown, especially for flexible frameworks like Drill, in my opinion.
> Kunal Khatua: Yes, but again... Calcite is an independent open-source
> project in use by many other OSS and commercial vendors. Considering many
> such projects are driven by volunteer contributions, it's a miracle in my
> opinion that the open-source software is able to achieve so much (and,
> sometimes, putting commercial offerings to shame) without charging a penny
> to the end users.
>
> Developers behind Drill have made contributions to Calcite in their
> limited capacity, as have developers from other projects, so Drill has
> actually benefited from Calcite far more than it could have by
> implementing its own Calcite substitute. Hopefully, someone in the
> community can take a look at enhancing this feature as well.
>
> Thanks for your time. Much appreciated. I will keep you posted.
>
> Best,
> Sarnath
>
> On Thu, Apr 11, 2019, 23:25 Kunal Khatua wrote:
>
> > Hi Sarnath
> >
> > I haven't tried your specific requirement, and it is possible that if you
> > are querying only A or only B, Drill would be able to push it down to the
> > source.
> >
> > However, it gets tricky when you are querying 2 or more sources in the
> > same query, because (from my limited knowledge of Calcite) the Calcite
> > parser needs to be aware that it can push filters down to both sources.
> > With GROUP BY, multiple groupings across a single source versus across
> > multiple sources are not semantically the same.
> >
> > You could try creating views for each source and then doing a group by on
> > the union of those views... that *might* get you the results you want.
> >
> > You can give it a shot, but I suspect it won't be as performant. Let us
> > know if you find it otherwise.
> >
> > ~ Kunal
> >
> > On 4/10/2019 9:02:24 PM, Sarnath K wrote:
> > Hi Kunal,
> >
> > Thank you for your response. But what I read in this URL says it can be
> > done (though my own interpretation is muddled)
> > https://drill.apache.org/docs/rdbms-storage-plugin/
> >
> > There is a statement in the documentation that says:
> >
> > As with any source, Drill supports joins within and between all systems.
> > Drill additionally has powerful pushdown capabilities with RDBMS sources.
> > This includes support to push down join, where, group by, intersect and
> > other SQL operations into a particular RDBMS source (as appropriate).
> >
> >
> > >> That said, even if the feature existed, by design, only one fragment
> > can read from a JDBC storage plugin, as it uses a single connection to
> > stream out the resultset.
> >
> > I did not understand this. Say I GROUP BY a particular column and perform
> > "max", "min" and "sum" aggregations. These are all associative group
> > summary operations. So I can send a MAX query to A and a MAX query to B,
> > bring both results into the Drill cluster, and then perform a MAX on the
> > partially reduced results. This will be cheaper than loading all data from
> > A and B into Drill and then performing the GROUP BY operation.
> >
> > Can Drill do these smart group-by operations as of today? The
> > documentation I read above is encouraging (it's pretty recent - Dec 2018).
> >
> > Thanks for your time,
> > Best,
> > Sarnath
> >
> >
> >
> > On Thu, Apr 11, 2019 at 1:54 AM Kunal Khatua wrote:
> >
> > > Hi Sarnath
> > >
> > > From what I understand of your description, you are looking to see if
> > > Drill can push down the GROUP BY clause to the underlying JDBC sources A
> > > and B.
> > >
> > > Unfortunately, Drill does not support pushdown for the JDBC storage
> > > plugin as yet. That said, even if the feature existed, by design, only
> > > one fragment can read from a JDBC storage plugin, as it uses a single
> > > connection to stream out the resultset.
> > >
> > > ~ Kunal
> > >
> > > On 4/9/2019 8:59:49 AM, Sarnath K wrote:
> > > Hi,
> > >
> > > I have a requirement where I need to split data between a fast RDBMS
> > > system (A) that will hold HOT data and a slower cold storage system (B).
> > >
> > > Both A and B provide JDBC drivers.
> > >
> > > I am looking to see if Drill can help me come up with a JDBC URL (C)
> > > that hides the fact that the data is split between A and B, i.e. can
> > > Drill be used to implement Data Virtualization?
> > >
> > > From what I have read about Drill, I can definitely create 2 tables in
> > > Drill, one pointing to A and another to B.
> > > However, when I run GROUP BY or FILTER queries, does Drill take
> > > advantage of the existing JDBC systems by actually sending one part of
> > > the GROUP BY to A and another to B and then reducing the results again?
> > > I.e., some kind of smart predicate push-down for analytical queries?
> > >
> > > Hope I sound clear to you. Appreciate your response much.
> > >
> > > Thank you,
> > >
> > > Best,
> > > Sarnath
> > >
> >
>
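
The two-stage aggregation discussed in this thread (run a GROUP BY at each
source, then re-aggregate the partial results) can be sketched in a few lines
of Python. This is only an illustration of the idea, not how Drill or Calcite
plan the query: two in-memory SQLite databases stand in for the JDBC sources A
and B, and the table name "sales", the columns "region" and "amount", and the
helper functions are all hypothetical.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one JDBC source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# The GROUP BY that is "pushed down": each source reduces its own rows first.
PARTIAL_SQL = """
    SELECT region, MAX(amount), MIN(amount), SUM(amount)
    FROM sales
    GROUP BY region
"""


def federated_group_by(sources):
    """Merge per-source partial aggregates into one result per group."""
    merged = {}
    for conn in sources:
        for region, mx, mn, total in conn.execute(PARTIAL_SQL):
            if region in merged:
                prev_mx, prev_mn, prev_total = merged[region]
                # MAX of maxes, MIN of mins, SUM of sums.
                merged[region] = (max(prev_mx, mx),
                                  min(prev_mn, mn),
                                  prev_total + total)
            else:
                merged[region] = (mx, mn, total)
    return merged


hot = make_source([("east", 10), ("east", 40), ("west", 5)])    # source A
cold = make_source([("east", 25), ("west", 70), ("north", 3)])  # source B

result = federated_group_by([hot, cold])
for region in sorted(result):
    print(region, result[region])
```

Only one row per group leaves each source, which is the saving described
above. The view-based workaround suggested in the thread follows the same
shape: one GROUP BY view per source, then a GROUP BY over the union of the
views.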
