Re: How to force Drill to offload performing LIMIT onto the target storage?

Vova Vysotskyi Wed, 18 Dec 2019 10:29:35 -0800

>From the code, it looks like it happens to all functions that are present
in Drill even if such functions are part of the SQL standard (it was done
since function behavior may not be strictly described by standard and
therefore will be implementation-dependent).
But functions that absent in Drill and are present in SqlStdOperatorTable
<https://github.com/apache/calcite/blob/52a57078ba081b24b9d086ed363c715485d1a519/core/src/main/java/org/apache/calcite/sql/fun/SqlStdOperatorTable.java>
(if such exist) will be pushed to JDBC.


Kind regards,
Volodymyr Vysotskyi


On Wed, Dec 18, 2019 at 8:09 PM Andrew Pashkin <andrew.pash...@gmx.co.uk>
wrote:

> You are right! Group by is being pushed to the storage if I remove
> TO_DATE(). Does it happen with all functions or only with some (perhaps
> the ones that are not in the SQL standard)?
>
> On 18.12.19 20:08, Vova Vysotskyi wrote:
> > Hi Andrew,
> >
> > Drill in general case pushes aggregations into the JDBC storage, but
> Drill
> > doesn't push functions into JDBC storage since some databases may not
> have
> > specific function implementations.
> >
> > In the plan, you have provided, the grouping is produced on top of
> > TO_DATE() function call. If it is possible, please rewrite the query to
> do
> > aggregations, and then call functions. In this case, Drill should be able
> > to push aggregation into JDBC storage.
> >
> > Regarding pushing limit into the JDBC storage, it is a bug which should
> be
> > fixed, I have created a Jira ticket for this: DRILL-7490
> > <https://issues.apache.org/jira/browse/DRILL-7490>.
> >
> > Kind regards,
> > Volodymyr Vysotskyi
> >
> >
> > On Wed, Dec 18, 2019 at 4:26 PM Andrew Pashkin <andrew.pash...@gmx.co.uk
> >
> > wrote:
> >
> >> Hello!
> >>
> >> I have a setup where I'm querying Impala server through JDBC connection
> >> using Drill and I noticed that when I'm doing a simple GROUP BY query
> >> with LIMIT with a small value the query runs for a very long time. And
> >> in the query profile I see that hundreds of thousands of rows are being
> >> fetched from the Impala server.
> >>
> >> The plan looks like this:
> >>
> >> 00-00    Screen : rowType = RecordType(DATE payment_date, ANY
> total_sum):
> >> rowcount = 10.0, cumulative cost = {331.0 rows, 2671.0 cpu, 0.0 io, 0.0
> >> network, 1760.0000000000002 memory}, id = 67521
> >> 00-01      Project(payment_date=[$0], total_sum=[$1]) : rowType =
> >> RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0,
> cumulative
> >> cost = {330.0 rows, 2670.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002
> >> memory}, id = 67520
> >> 00-02        SelectionVectorRemover : rowType = RecordType(DATE
> >> payment_date, ANY total_sum): rowcount = 10.0, cumulative cost = {320.0
> >> rows, 2650.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id =
> >> 67519
> >> 00-03          Limit(fetch=[10]) : rowType = RecordType(DATE
> payment_date,
> >> ANY total_sum): rowcount = 10.0, cumulative cost = {310.0 rows, 2640.0
> cpu,
> >> 0.0 io, 0.0 network, 1760.0000000000002 memory}, id = 67518
> >> 00-04            HashAgg(group=[{0}], total_sum=[SUM($1)]) : rowType =
> >> RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0,
> cumulative
> >> cost = {300.0 rows, 2600.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002
> >> memory}, id = 67517
> >> 00-05              Project(payment_date=[TO_DATE($0)],
> >> payed_in_usd_amt=[$11]) : rowType = RecordType(DATE date_field, REAL
> >> value_field): rowcount = 100.0, cumulative cost = {200.0 rows, 600.0
> cpu,
> >> 0.0 io, 0.0 network, 0.0 memory}, id = 67516
> >> 00-06                Jdbc(sql=[SELECT * FROM `Impala`.`<my schema>`.`<my
> >> table>` ]) : rowType = RecordType(<my fields>): rowcount = 100.0,
> >> cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0
> memory},
> >> id = 67435
> >>
> >> It seems like Drill issues SELECT * FROM <Impala> query to perform
> >> group-by and limit on the host which seems terribly inefficient and
> >> could be performed by Impala itself with much less effort.
> >>
> >> I wonder - is it possible to tune Drill somehow to perform
> >> limit/group-by using the storage capabilities (Impala in my case)? Or if
> >> such optimization is something do be developed - is it absent only for
> >> JDBC connections or for all storage types?
> >>
> >> --
> >> With kind regards, Andrew Pashkin.
> >> cell phone - +375 (44) 492-16-85
> >> Skype - waves_in_fluids
> >> e-mail - andrew.pash...@gmx.co.uk
> >>
> >>
>

Re: How to force Drill to offload performing LIMIT onto the target storage?

Reply via email to