Re: Drill for Data Virtualization

Kunal Khatua Wed, 17 Apr 2019 11:02:21 -0700

I was expecting that #1 would not do any push down at all, because it involves 
Calcite to get smart about doing query re-writes and pushing down aggregation 
as part of separate queries into the JDBC sources A & B, before reapplying the 
aggregation again within the UNIONized data.


If #2 partially works, I'd recommend you pick Case 1 as your candidate query 
(assuming that pushing down GROUP BY on MySQL is more beneficial than pushing 
it only on SQLLite).
Your snippet is truncated, but my suspicion is that it might have to do with 
the row-count estimate for Source B being considered much greater than Source 
A; causing Calcite to recommend pushing down the aggregation to only one of the 
2 sources.

Unfortunately, there is no flag to push it down to both and for something like 
this, one would need to dig into Calcite to understand why it didn't apply the 
push-down to both sides before a code change can be suggested.

Glad you've liked the WebUI. Apache Drill 1.16 will get further improvements in 
the WebUI, though nothing that will specifically address your current problem.

~ Kunal  
On 4/17/2019 7:20:38 AM, Sarnath K <[email protected]> wrote:
Hi Kunal,

Few experiments that we did and the results:
1)
UNION from A and B and then have a Single GROUP BY
select GK, count(*) from ((select * from A) UNION ALL (select * from B))
GROUP BY GK
This one did the HashAggregate only in Drill. There was absolutely NO
PUSHDOWN to JDBC sources.

2)
GROUP BY from A UNION GROUP BY from B and then GROUP AGAIN
select GK, count(*) from ((select GK, count(*) as cnt from A GROUP BY GK)
UNION ALL (select GK, count(*) as cnt from B GROUP BY GK)) GROUP BY GK
For this one, the Predicate Push Down is happening ONLY for 'A' (regardless
of which data source it is).
There is no Predicate Push Down for 'B' (no matter which Data source it is
in)
Case1) We tested with A as MySQL and B as SQLite
Case2) We tested with A as SQLite and B as MySQL
In the first case the Predicate Pushdown was for 'A' alone i.e. the first
table which is Grouped.

Here is a physical plan of such an execution:
00-00 Screen : rowType = RecordType(VARCHAR(65536) empname, BIGINT cnt):
rowcount = 2.0, cumula
00-01 Project(empname=[$0], cnt=[$1]) : rowType =
RecordType(VARCHAR(65536) empname, BIGINT c
00-02 HashAgg(group=[{0}], cnt=[COUNT()]) : rowType =
RecordType(VARCHAR(65536) empname, BI
00-03 UnionAll(all=[true]) : rowType = RecordType(VARCHAR(65536)
empname): rowcount = 20.
00-05 Jdbc(sql=[SELECT `EmpName` AS `empname` FROM
`test`.`emp` GROUP BY `EmpName` ]) : rowType = RecordType(VARCHAR(255)
empname): rowcount = 10.0, cumulative c
00-04 Project(empname=[$0]) : rowType =
RecordType(VARCHAR(65536) empname): rowcount =
00-06 HashAgg(group=[{0}], cnt=[COUNT()]) : rowType =
RecordType(VARCHAR(65536) empna
00-07 Project(empname=[$1]) : rowType =
RecordType(VARCHAR(65536) empname): rowcoun
00-08 Jdbc(sql=[SELECT * FROM "Emp" ]) : rowType =
RecordType(INTEGER EmpId, VARCHAR(65536) EmpName, VARCHAR(65536) Phone,

One can see that the JDBC at 00-05 has GROUP BY pushed down
Whereas the JDBC at 00-08 does not have the GROUP BY pushed down. It is
rather handled after Union at 00-02

Is there any option that we could enable to make them GROUP BY at both
places?

btw... I am very impressed with Drill. It has a very simple and elegant
interface. A place where we can see all Queries that are getting executed,
see logs from web-app itself and beautifully vertically rendered Plans that
are so easily understandable.

This is exactly how intuitive and easy a user-interface has to be. Well
done, Drill Team!
Of course, we are impressed with reading multiple sources and joining them.
Looks like Magic!
Great job!

Thanks,
Best,
Sarnath





On Sat, Apr 13, 2019 at 2:59 PM Sarnath K wrote:

> Hi Kunal,
> I tried examining the plan for a simple group by. I see that the group by
> is pushed to JDBC step whose output goes to the project ... Which seems
> like pushdown is working fine ...
> We are trying other cases. I will keep posted.
> Thank you for your help and time.
> I understand Calcite is a great community effort...I have been following
> it for quite some time. Thanks!!
> Best,
> Sarnath
>
> On Fri, Apr 12, 2019, 02:28 Kunal Khatua wrote:
>
>> On 4/11/2019 12:39:24 PM, Sarnath K wrote:
>> Thank you Kunal.
>>
>> >>>You could try creating views for each source and then doing a group by
>> on the union of those views... that *might* get you the results you want
>>
>> When you mention views, do you mean to say each view will be a group by
>> statement for that particular source....And we try to union them and group
>> again...This way, explicitly making up the Query do the pushdown....
>> That's
>> the idea you are referring to. Right!??
>> Kunal Khatua: That is correct. Worth a try. Start with querying the views
>> individually to see if the pushdown occurs in the first place.
>>
>>
>>
>> Btw....Calcite (possibly) not recognizing the pushdown opportunity would
>> be
>> a let down ... especially for flexible frameworks like Drill... In my
>> opinion...
>> Kunal Khatua: Yes, but again... Calcite is an independent open-source
>> project in use by many other OSS and commercial vendors. Considering many
>> such projects are driven by volunteer contributions, it's a miracle in my
>> opinion that the open-source software is able to achieve so much (and,
>> sometimes, putting commercial offerings to shame) without charging a penny
>> to the end users.
>>
>> Developers behind Drill have made contributions to Calcite in their
>> limited capacity, as have developers from other projects.. so, in many
>> ways, Drill has actually benefited from Calcite in more ways than it could
>> have by implementing its own Calcite substitute. Hopefully, someone in the
>> community can take a look at enhancing this feature as well.
>>
>> Thanks for your time. Appreciate much. I will keep posted.
>>
>> Best,
>> Satnath
>>
>> On Thu, Apr 11, 2019, 23:25 Kunal Khatua wrote:
>>
>> > Hi Sarnath
>> >
>> > I haven't tried your specific requirement, and it is possible that if
>> you
>> > are querying only A or only B, Drill would be able to push it down to
>> the
>> > source.
>> >
>> > However, it gets tricky when you are querying 2 or more sources in the
>> > same query, because (from my limited knowledge of Calcite) the Calcite
>> > parser needs to be aware that it can push filters down to both sources.
>> > With GROUP BY, multiple groupings across a single source versus across
>> > multiple sources are not semantically the same.
>> >
>> > You could try creating views for each source and then doing a group by
>> on
>> > the union of those views... that *might* get you the results you want.
>> >
>> > You can give it a shot, but I suspect it won't be as performant. Let us
>> > know if you find it otherwise.
>> >
>> > ~ Kunal
>> >
>> > On 4/10/2019 9:02:24 PM, Sarnath K wrote:
>> > Hi Kunal,
>> >
>> > Thank you for your response. But what I read in this URL says it can be
>> > done (though my own interpretation is muddled)
>> > https://drill.apache.org/docs/rdbms-storage-plugin/
>> >
>> > There is a statement in the documentation that says:
>> >
>> > As with any source, Drill supports joins within and between all systems.
>> > Drill additionally has powerful pushdown capabilities with RDBMS
>> sources.
>> > This includes support to push down join, where, group by, intersect and
>> > other SQL operations into a particular RDBMS source (as appropriate).
>> >
>> >
>> > >> That said, even if the feature existed, by design, only one fragment
>> can
>> > read from a JDBC storage plugin, as it uses a single connection to
>> stream
>> > out the resultset.
>> >
>> > I did not understand this. Say, I GROUP BY a particular column and
>> perform
>> > "max", "min" and "sum" aggregation. These are all associative group
>> summary
>> > operations. So, I have send MAX Query to A and then MAX query to B. Get
>> the
>> > results from both into Drill cluster and then perform a MAX on the
>> > partially reduced result. This will be cheaper than loading all data
>> from A
>> > and B into Drill and then performing the GROUP BY operation.
>> >
>> > Can Drill do these smart group-by operations as on today? The
>> documentation
>> > I read above is encouraging (its pretty recent - Dec 2018).
>> >
>> > Thanks for your time,
>> > Best,
>> > Sarnath
>> >
>> >
>> >
>> > On Thu, Apr 11, 2019 at 1:54 AM Kunal Khatua wrote:
>> >
>> > > Hi Sarnath
>> > >
>> > > From what I understand by your description, you are looking to see if
>> > > Drill can push down the GROUP BY clause to the underlying JDBC
>> sources A
>> > > and B.
>> > >
>> > > Unfortunately, Drill does not support pushdown for the JDBC storage
>> > plugin
>> > > as yet. That said, even if the feature existed, by design, only one
>> > > fragment can read from a JDBC storage plugin, as it uses a single
>> > > connection to stream out the resultset.
>> > >
>> > > ~ Kunal
>> > >
>> > > On 4/9/2019 8:59:49 AM, Sarnath K wrote:
>> > > Hi,
>> > >
>> > > I have a requirement where I need to split data between a fast RDBMS
>> > system
>> > > (A) that will have HOT data and a slower cold storage (B)
>> > >
>> > > Both A and B provide JDBC drivers
>> > >
>> > > I am looking to see if Drill will help me in coming with a JDBC URL
>> (C)
>> > > which will hide the fact that data is split between A and B. i.e. Can
>> > Drill
>> > > be used to implement Data Virtualization?
>> > >
>> > > As much as I can read about Drill, I can definitely create 2 tables in
>> > > Drill one pointing to A and another to B.
>> > > However when I do GROUP BY queries or FILTER queries -- Does Drill
>> take
>> > > advantage of the existing JDBC systems by actually sending a part of
>> the
>> > > GROUP BY to A and another to B and then reduce the result again? i.e.
>> > Some
>> > > kind of smart predicate push-down for Analytical queries?
>> > >
>> > > Hope I sound clear to you. Appreciate your response much.
>> > >
>> > > Thank you,
>> > >
>> > > Best,
>> > > Sarnath
>> > >
>> >
>>
>

Re: Drill for Data Virtualization

Reply via email to