[ https://issues.apache.org/jira/browse/CALCITE-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574173#comment-15574173 ]
Jiarong Wei edited comment on CALCITE-1428 at 10/14/16 4:40 AM:
----------------------------------------------------------------
Thanks for your suggestion! I looked into the code and found a naive solution for pushing {{LIMIT}}.

In {{DruidQuery}}, the [cost|https://github.com/apache/calcite/blob/master/druid/src/main/java/org/apache/calcite/adapter/druid/DruidQuery.java#L325] depends on the last {{RelNode}} it "ate". And the [cost|https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/rel/core/Sort.java#L139] of {{Sort}} is much higher when rows are wider. The comment there says this prevents pushing a project through a sort when rows are wide. That's a reasonable "rule", but I think creating an actual {{RelOptRule}} might be better for that. It's also why {{LIMIT}} is pushed when only a few columns are selected, but not for {{SELECT *}}.

In my naive solution, assuming it's cheaper to let Druid do the work for us, I simply made {{DruidQuery}} zero-cost, which makes the planner choose the {{DruidQuery}} with {{LIMIT}} pushed in. What do you think of it?
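The width-sensitive cost comparison described above can be sketched with a toy model. This is not Calcite's actual code; the constants and the scan-cost term are invented purely for illustration of why a {{Sort}} over wide rows (as in {{SELECT *}}) can make the LIMIT-pushed plan look more expensive than limiting outside Druid, and why a zero-cost {{DruidQuery}} flips the comparison:

```python
import math

def sort_cost(rows, cols):
    """Toy stand-in for a width-sensitive sort cost: n*log(n) scaled by row width."""
    return rows * math.log(rows, 2) * cols * 4

ROWS, COLS, LIMIT = 100_000, 26, 5  # wikiticker-like: 26 columns

# Plan where LIMIT stays outside Druid: a full scan plus a trivial
# EnumerableLimit on top.
scan_cost = ROWS * COLS            # toy full-scan cost
plan_unpushed = scan_cost + LIMIT  # the limit itself is nearly free

# Plan where DruidQuery "ate" the Sort: in this toy model it inherits
# the Sort's width-sensitive cost, so for SELECT * it looks expensive.
plan_pushed = sort_cost(ROWS, COLS)

print(plan_pushed > plan_unpushed)  # True: the pushed plan loses

# The naive fix from the comment: make DruidQuery zero-cost, so the
# LIMIT-pushed plan always wins the comparison.
plan_pushed_zero = 0.0
print(plan_pushed_zero < plan_unpushed)  # True
```

With narrow rows (say, {{cols=1}}) the sort cost shrinks enough that pushing becomes attractive even without the zero-cost hack, which matches the observed behavior that {{LIMIT}} is pushed when only a few columns are selected.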
> Inefficient execution plan of SELECT and LIMIT for Druid
> --------------------------------------------------------
>
> Key: CALCITE-1428
> URL: https://issues.apache.org/jira/browse/CALCITE-1428
> Project: Calcite
> Issue Type: Bug
> Components: core, druid
> Reporter: Jiarong Wei
> Assignee: Julian Hyde
>
> For SQLs like:
> 1. {{SELECT * FROM <table> LIMIT <row_count>}}
> 2. {{SELECT <all_columns_specified_explicitly> FROM <table> LIMIT <row_count>}}
> {{DruidSortRule}} in the Druid adapter does take effect and {{LIMIT}} is pushed into {{DruidQuery}}. However, the corresponding execution plan isn't chosen as the best one. Thus Calcite will retrieve all data from Druid and purge all unnecessary columns.
> These are three SQLs and their corresponding execution plans below for dataset {{wikiticker}} in the Druid quickstart:
> 1. {{SELECT "cityName" FROM "wikiticker" LIMIT 5}}
> {code}
> rel#27:EnumerableInterpreter.ENUMERABLE.[](input=rel#26:Subset#2.BINDABLE.[])
>   rel#85:DruidQuery.BINDABLE.[](table=[default, wikiticker],intervals=[1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z],projects=[$3],fetch=5)
> {code}
> 2. {{SELECT * FROM "wikiticker" LIMIT 5}}
> {code}
> rel#52:EnumerableLimit.ENUMERABLE.[](input=rel#36:Subset#0.ENUMERABLE.[],fetch=5)
>   rel#79:EnumerableInterpreter.ENUMERABLE.[](input=rel#4:Subset#0.BINDABLE.[])
>     rel#1:DruidQuery.BINDABLE.[](table=[default, wikiticker],intervals=[1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z])
> {code}
> 3.
> {{SELECT "__time", "added", "channel", "cityName", "comment", "commentLength", "count", "countryIsoCode", "countryName", "deleted", "delta", "deltaBucket", "diffUrl", "flags", "isAnonymous", "isMinor", "isNew", "isRobot", "isUnpatrolled", "metroCode", "namespace", "page", "regionIsoCode", "regionName", "user", "user_unique" FROM "wikiticker" LIMIT 5}}
> {code}
> rel#42:EnumerableLimit.ENUMERABLE.[](input=rel#41:Subset#1.ENUMERABLE.[],fetch=5)
>   rel#113:EnumerableInterpreter.ENUMERABLE.[](input=rel#34:Subset#1.BINDABLE.[])
>     rel#52:BindableProject.BINDABLE.[](input=rel#4:Subset#0.BINDABLE.[],__time=$0,added=$1,channel=$2,cityName=$3,comment=$4,commentLength=$5,count=$6,countryIsoCode=$7,countryName=$8,deleted=$9,delta=$10,deltaBucket=$11,diffUrl=$12,flags=$13,isAnonymous=$14,isMinor=$15,isNew=$16,isRobot=$17,isUnpatrolled=$18,metroCode=$19,namespace=$20,page=$21,regionIsoCode=$22,regionName=$23,user=USER,user_unique=$25)
>       rel#1:DruidQuery.BINDABLE.[](table=[default, wikiticker],intervals=[1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z])
> {code}
> Notice that 2 and 3 should have {{LIMIT}} pushed to {{DruidQuery}} like 1 (and should not have {{EnumerableLimit}})

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
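For context on what pushing the {{LIMIT}} buys: when the fetch makes it into the {{DruidQuery}} (as in plan 1), the row bound can travel to Druid in the generated native query rather than being applied after a full scan. The sketch below is a hypothetical illustration, not the exact JSON Calcite emits; it assumes a Druid "select"-type query where the limit is expressed as the paging threshold:

```python
import json

# Hypothetical Druid native query for plan 1: the LIMIT 5 becomes the
# paging threshold, so Druid itself returns at most 5 rows instead of
# the entire datasource being streamed back to Calcite.
query = {
    "queryType": "select",
    "dataSource": "wikiticker",
    "intervals": ["1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"],
    "dimensions": ["cityName"],
    "pagingSpec": {"pagingIdentifiers": {}, "threshold": 5},
}
print(json.dumps(query, indent=2))
```

In plans 2 and 3 no such bound reaches Druid, which is exactly the inefficiency this issue describes.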