Hi Andrew,

Drill in general case pushes aggregations into the JDBC storage, but Drill
doesn't push functions into JDBC storage since some databases may not have
specific function implementations.

In the plan, you have provided, the grouping is produced on top of
TO_DATE() function call. If it is possible, please rewrite the query to do
aggregations, and then call functions. In this case, Drill should be able
to push aggregation into JDBC storage.

Regarding pushing limit into the JDBC storage, it is a bug which should be
fixed, I have created a Jira ticket for this: DRILL-7490
<https://issues.apache.org/jira/browse/DRILL-7490>.

Kind regards,
Volodymyr Vysotskyi


On Wed, Dec 18, 2019 at 4:26 PM Andrew Pashkin <andrew.pash...@gmx.co.uk>
wrote:

> Hello!
>
> I have a setup where I'm querying Impala server through JDBC connection
> using Drill and I noticed that when I'm doing a simple GROUP BY query
> with LIMIT with a small value the query runs for a very long time. And
> in the query profile I see that hundreds of thousands of rows are being
> fetched from the Impala server.
>
> The plan looks like this:
>
> 00-00    Screen : rowType = RecordType(DATE payment_date, ANY total_sum):
> rowcount = 10.0, cumulative cost = {331.0 rows, 2671.0 cpu, 0.0 io, 0.0
> network, 1760.0000000000002 memory}, id = 67521
> 00-01      Project(payment_date=[$0], total_sum=[$1]) : rowType =
> RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative
> cost = {330.0 rows, 2670.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002
> memory}, id = 67520
> 00-02        SelectionVectorRemover : rowType = RecordType(DATE
> payment_date, ANY total_sum): rowcount = 10.0, cumulative cost = {320.0
> rows, 2650.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id =
> 67519
> 00-03          Limit(fetch=[10]) : rowType = RecordType(DATE payment_date,
> ANY total_sum): rowcount = 10.0, cumulative cost = {310.0 rows, 2640.0 cpu,
> 0.0 io, 0.0 network, 1760.0000000000002 memory}, id = 67518
> 00-04            HashAgg(group=[{0}], total_sum=[SUM($1)]) : rowType =
> RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative
> cost = {300.0 rows, 2600.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002
> memory}, id = 67517
> 00-05              Project(payment_date=[TO_DATE($0)],
> payed_in_usd_amt=[$11]) : rowType = RecordType(DATE date_field, REAL
> value_field): rowcount = 100.0, cumulative cost = {200.0 rows, 600.0 cpu,
> 0.0 io, 0.0 network, 0.0 memory}, id = 67516
> 00-06                Jdbc(sql=[SELECT * FROM `Impala`.`<my schema>`.`<my
> table>` ]) : rowType = RecordType(<my fields>): rowcount = 100.0,
> cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory},
> id = 67435
>
> It seems like Drill issues SELECT * FROM <Impala> query to perform
> group-by and limit on the host which seems terribly inefficient and
> could be performed by Impala itself with much less effort.
>
> I wonder - is it possible to tune Drill somehow to perform
> limit/group-by using the storage capabilities (Impala in my case)? Or if
> such optimization is something do be developed - is it absent only for
> JDBC connections or for all storage types?
>
> --
> With kind regards, Andrew Pashkin.
> cell phone - +375 (44) 492-16-85
> Skype - waves_in_fluids
> e-mail - andrew.pash...@gmx.co.uk
>
>

Reply via email to