Hello!
I have a setup where I'm querying Impala server through JDBC connection
using Drill and I noticed that when I'm doing a simple GROUP BY query
with LIMIT with a small value the query runs for a very long time. And
in the query profile I see that hundreds of thousands of rows are being
fetched from the Impala server.
The plan looks like this:
00-00 Screen : rowType = RecordType(DATE payment_date, ANY total_sum):
rowcount = 10.0, cumulative cost = {331.0 rows, 2671.0 cpu, 0.0 io, 0.0
network, 1760.0000000000002 memory}, id = 67521
00-01 Project(payment_date=[$0], total_sum=[$1]) : rowType =
RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative cost
= {330.0 rows, 2670.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id
= 67520
00-02 SelectionVectorRemover : rowType = RecordType(DATE payment_date,
ANY total_sum): rowcount = 10.0, cumulative cost = {320.0 rows, 2650.0 cpu, 0.0
io, 0.0 network, 1760.0000000000002 memory}, id = 67519
00-03 Limit(fetch=[10]) : rowType = RecordType(DATE payment_date, ANY
total_sum): rowcount = 10.0, cumulative cost = {310.0 rows, 2640.0 cpu, 0.0 io,
0.0 network, 1760.0000000000002 memory}, id = 67518
00-04 HashAgg(group=[{0}], total_sum=[SUM($1)]) : rowType =
RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative cost
= {300.0 rows, 2600.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id
= 67517
00-05 Project(payment_date=[TO_DATE($0)], payed_in_usd_amt=[$11])
: rowType = RecordType(DATE date_field, REAL value_field): rowcount = 100.0,
cumulative cost = {200.0 rows, 600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id
= 67516
00-06 Jdbc(sql=[SELECT * FROM `Impala`.`<my schema>`.`<my table>` ]) :
rowType = RecordType(<my fields>): rowcount = 100.0, cumulative cost = {100.0 rows, 100.0
cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 67435
It seems like Drill issues SELECT * FROM <Impala> query to perform
group-by and limit on the host which seems terribly inefficient and
could be performed by Impala itself with much less effort.
I wonder - is it possible to tune Drill somehow to perform
limit/group-by using the storage capabilities (Impala in my case)? Or if
such optimization is something do be developed - is it absent only for
JDBC connections or for all storage types?
--
With kind regards, Andrew Pashkin.
cell phone - +375 (44) 492-16-85
Skype - waves_in_fluids
e-mail - [email protected]