How to force Drill to offload performing LIMIT onto the target storage?

Andrew Pashkin Wed, 18 Dec 2019 06:26:49 -0800

Hello!

I have a setup where I'm querying Impala server through JDBC connection
using Drill and I noticed that when I'm doing a simple GROUP BY query
with LIMIT with a small value the query runs for a very long time. And
in the query profile I see that hundreds of thousands of rows are being
fetched from the Impala server.


The plan looks like this:

00-00    Screen : rowType = RecordType(DATE payment_date, ANY total_sum): 
rowcount = 10.0, cumulative cost = {331.0 rows, 2671.0 cpu, 0.0 io, 0.0 
network, 1760.0000000000002 memory}, id = 67521
00-01      Project(payment_date=[$0], total_sum=[$1]) : rowType = 
RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative cost 
= {330.0 rows, 2670.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id 
= 67520
00-02        SelectionVectorRemover : rowType = RecordType(DATE payment_date, 
ANY total_sum): rowcount = 10.0, cumulative cost = {320.0 rows, 2650.0 cpu, 0.0 
io, 0.0 network, 1760.0000000000002 memory}, id = 67519
00-03          Limit(fetch=[10]) : rowType = RecordType(DATE payment_date, ANY 
total_sum): rowcount = 10.0, cumulative cost = {310.0 rows, 2640.0 cpu, 0.0 io, 
0.0 network, 1760.0000000000002 memory}, id = 67518
00-04            HashAgg(group=[{0}], total_sum=[SUM($1)]) : rowType = 
RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative cost 
= {300.0 rows, 2600.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id 
= 67517
00-05              Project(payment_date=[TO_DATE($0)], payed_in_usd_amt=[$11]) 
: rowType = RecordType(DATE date_field, REAL value_field): rowcount = 100.0, 
cumulative cost = {200.0 rows, 600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id 
= 67516
00-06                Jdbc(sql=[SELECT * FROM `Impala`.`<my schema>`.`<my table>` ]) : 
rowType = RecordType(<my fields>): rowcount = 100.0, cumulative cost = {100.0 rows, 100.0 
cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 67435

It seems like Drill issues SELECT * FROM <Impala> query to perform
group-by and limit on the host which seems terribly inefficient and
could be performed by Impala itself with much less effort.

I wonder - is it possible to tune Drill somehow to perform
limit/group-by using the storage capabilities (Impala in my case)? Or if
such optimization is something do be developed - is it absent only for
JDBC connections or for all storage types?

--
With kind regards, Andrew Pashkin.
cell phone - +375 (44) 492-16-85
Skype - waves_in_fluids
e-mail - [email protected]

How to force Drill to offload performing LIMIT onto the target storage?

Reply via email to