[jira] [Updated] (DRILL-7227) TPCDS queries 47, 57, 59 fail to run with Statistics enabled at sf100

2019-05-09 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-7227:

Labels:   (was: ready-to-commit)

> TPCDS queries 47, 57, 59 fail to run with Statistics enabled at sf100
> ---------------------------------------------------------------------
>
> Key: DRILL-7227
> URL: https://issues.apache.org/jira/browse/DRILL-7227
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.17.0
>
> Attachments: 23387ab0-cb1c-cd5e-449a-c9bcefc901c1.sys.drill, 
> 2338ae93-155b-356d-382e-0da949c6f439.sys.drill
>
>
> Here is query 78:
> {noformat}
> WITH ws 
>  AS (SELECT d_year AS ws_sold_year, 
> ws_item_sk, 
> ws_bill_customer_sk    ws_customer_sk, 
> Sum(ws_quantity)   ws_qty, 
> Sum(ws_wholesale_cost) ws_wc, 
> Sum(ws_sales_price)ws_sp 
>  FROM   web_sales 
> LEFT JOIN web_returns 
>ON wr_order_number = ws_order_number 
>   AND ws_item_sk = wr_item_sk 
> JOIN date_dim 
>   ON ws_sold_date_sk = d_date_sk 
>  WHERE  wr_order_number IS NULL 
>  GROUP  BY d_year, 
>ws_item_sk, 
>ws_bill_customer_sk), 
>  cs 
>  AS (SELECT d_year AS cs_sold_year, 
> cs_item_sk, 
> cs_bill_customer_sk    cs_customer_sk, 
> Sum(cs_quantity)   cs_qty, 
> Sum(cs_wholesale_cost) cs_wc, 
> Sum(cs_sales_price)cs_sp 
>  FROM   catalog_sales 
> LEFT JOIN catalog_returns 
>ON cr_order_number = cs_order_number 
>   AND cs_item_sk = cr_item_sk 
> JOIN date_dim 
>   ON cs_sold_date_sk = d_date_sk 
>  WHERE  cr_order_number IS NULL 
>  GROUP  BY d_year, 
>cs_item_sk, 
>cs_bill_customer_sk), 
>  ss 
>  AS (SELECT d_year AS ss_sold_year, 
> ss_item_sk, 
> ss_customer_sk, 
> Sum(ss_quantity)   ss_qty, 
> Sum(ss_wholesale_cost) ss_wc, 
> Sum(ss_sales_price)ss_sp 
>  FROM   store_sales 
> LEFT JOIN store_returns 
>ON sr_ticket_number = ss_ticket_number 
>   AND ss_item_sk = sr_item_sk 
> JOIN date_dim 
>   ON ss_sold_date_sk = d_date_sk 
>  WHERE  sr_ticket_number IS NULL 
>  GROUP  BY d_year, 
>ss_item_sk, 
>ss_customer_sk) 
> SELECT ss_item_sk, 
>Round(ss_qty / ( COALESCE(ws_qty + cs_qty, 1) ), 2) ratio, 
>ss_qty  store_qty, 
>ss_wc 
>store_wholesale_cost, 
>ss_sp 
>store_sales_price, 
>COALESCE(ws_qty, 0) + COALESCE(cs_qty, 0) 
>other_chan_qty, 
>COALESCE(ws_wc, 0) + COALESCE(cs_wc, 0) 
>other_chan_wholesale_cost, 
>COALESCE(ws_sp, 0) + COALESCE(cs_sp, 0) 
>other_chan_sales_price 
> FROM   ss 
>LEFT JOIN ws 
>   ON ( ws_sold_year = ss_sold_year 
>AND ws_item_sk = ss_item_sk 
>AND ws_customer_sk = ss_customer_sk ) 
>LEFT JOIN cs 
>   ON ( cs_sold_year = ss_sold_year 
>                AND cs_item_sk = ss_item_sk 
>AND cs_customer_sk = ss_customer_sk ) 
> WHERE  COALESCE(ws_qty, 0) > 0 
>AND COALESCE(cs_qty, 0) > 0 
>AND ss_sold_year = 1999 
> ORDER  BY ss_item_sk, 
>   ss_qty DESC, 
>   ss_wc DESC, 
>   ss_sp DESC, 
>   other_chan_qty, 
>   other_chan_wholesale_cost, 
>   other_chan_sales_price, 
>   Round(ss_qty / ( COALESCE(ws_qty + cs_qty, 1) ), 2)
> LIMIT 100; 
> {noformat}
> The profile for the new plan is 2338ae93-155b-356d-382e-0da949c6f439.  Hash 
> partition sender operator (10-00) takes 10-15 minutes.  I am not sure why it 
> takes so long.  It has 10 minor fragments sending to receiver (06-05), which 
> has 62 minor fragments.  But hash partition sender (16-00) has 10 minor 
> fragments sending to receiver (12-06), which has 220 minor fragments, and 
> there is no performance issue.
> The profile for the old plan is 23387ab0-cb1c-cd5e-449a-c9bcefc901c1.  Both 
> plans use the same commit.  
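
A note on reproducing the two plans: they should correspond to running the same 
query with statistics toggled on and off. A sketch, assuming the Drill 1.16 
{{planner.statistics.use}} session option and ANALYZE TABLE syntax; the 
dfs.tpcds table paths are placeholders:
{noformat}
-- Collect statistics once per table (placeholder workspace dfs.tpcds):
ANALYZE TABLE dfs.tpcds.web_sales COMPUTE STATISTICS;

-- New plan: statistics enabled.
ALTER SESSION SET `planner.statistics.use` = true;

-- Old plan: statistics disabled, same commit, same query.
ALTER SESSION SET `planner.statistics.use` = false;
{noformat}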

[jira] [Created] (DRILL-7245) TPCDS queries 1, 45, 65, 97 are 3x slower when Statistics is enabled at sf 100

2019-05-07 Thread Gautam Parai (JIRA)
Gautam Parai created DRILL-7245:
---

 Summary: TPCDS queries 1, 45, 65, 97 are 3x slower when Statistics 
is enabled at sf 100
 Key: DRILL-7245
 URL: https://issues.apache.org/jira/browse/DRILL-7245
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Gautam Parai
Assignee: Gautam Parai
 Fix For: 1.17.0


Here is query 65:
{noformat}
SELECT s_store_name, 
       i_item_desc, 
       sc.revenue, 
       i_current_price, 
       i_wholesale_cost, 
       i_brand 
FROM   store, 
       item, 
       (SELECT ss_store_sk, 
               Avg(revenue) AS ave 
        FROM   (SELECT ss_store_sk, 
                       ss_item_sk, 
                       Sum(ss_sales_price) AS revenue 
                FROM   store_sales, 
                       date_dim 
                WHERE  ss_sold_date_sk = d_date_sk 
                       AND d_month_seq BETWEEN 1199 AND 1199 + 11 
                GROUP  BY ss_store_sk, 
                          ss_item_sk) sa 
        GROUP  BY ss_store_sk) sb, 
       (SELECT ss_store_sk, 
               ss_item_sk, 
               Sum(ss_sales_price) AS revenue 
        FROM   store_sales, 
               date_dim 
        WHERE  ss_sold_date_sk = d_date_sk 
               AND d_month_seq BETWEEN 1199 AND 1199 + 11 
        GROUP  BY ss_store_sk, 
                  ss_item_sk) sc 
WHERE  sb.ss_store_sk = sc.ss_store_sk 
       AND sc.revenue <= 0.1 * sb.ave 
       AND s_store_sk = sc.ss_store_sk 
       AND i_item_sk = sc.ss_item_sk 
ORDER  BY s_store_name, 
          i_item_desc
LIMIT 100; 
{noformat}
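
(The cost-annotated plans below can also be captured without executing the 
query, e.g. via Drill's EXPLAIN. A sketch; substitute query 65 from above for 
the trivial SELECT used here as a stand-in:)
{noformat}
ALTER SESSION SET `planner.statistics.use` = true;
-- Prints the physical plan with cost attributes, without running the query:
EXPLAIN PLAN INCLUDING ALL ATTRIBUTES FOR SELECT s_store_sk FROM store LIMIT 1;
{noformat}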
Here is the new plan.
{noformat}
00-00Screen : rowType = RecordType(ANY s_store_name, ANY i_item_desc, ANY 
revenue, ANY i_current_price, ANY i_wholesale_cost, ANY i_brand): rowcount = 
100.0, cumulative cost = {1.170402776556728E9 rows, 9.49229008144672E9 cpu, 
2.305289192E9 io, 5.9855814764732E10 network, 1.3320421632282749E8 memory}, id 
= 45433
00-01  Project(s_store_name=[$0], i_item_desc=[$1], revenue=[$2], 
i_current_price=[$3], i_wholesale_cost=[$4], i_brand=[$5]) : rowType = 
RecordType(ANY s_store_name, ANY i_item_desc, ANY revenue, ANY i_current_price, 
ANY i_wholesale_cost, ANY i_brand): rowcount = 100.0, cumulative cost = 
{1.170402766556728E9 rows, 9.49229007144672E9 cpu, 2.305289192E9 io, 
5.9855814764732E10 network, 1.3320421632282749E8 memory}, id = 45432
00-02SelectionVectorRemover : rowType = RecordType(ANY s_store_name, 
ANY i_item_desc, ANY revenue, ANY i_current_price, ANY i_wholesale_cost, ANY 
i_brand): rowcount = 100.0, cumulative cost = {1.170402666556728E9 rows, 
9.49228947144672E9 cpu, 2.305289192E9 io, 5.9855814764732E10 network, 
1.3320421632282749E8 memory}, id = 45431
00-03  Limit(fetch=[100]) : rowType = RecordType(ANY s_store_name, ANY 
i_item_desc, ANY revenue, ANY i_current_price, ANY i_wholesale_cost, ANY 
i_brand): rowcount = 100.0, cumulative cost = {1.170402566556728E9 rows, 
9.49228937144672E9 cpu, 2.305289192E9 io, 5.9855814764732E10 network, 
1.3320421632282749E8 memory}, id = 45430
00-04SingleMergeExchange(sort0=[0], sort1=[1]) : rowType = 
RecordType(ANY s_store_name, ANY i_item_desc, ANY revenue, ANY i_current_price, 
ANY i_wholesale_cost, ANY i_brand): rowcount = 100.0, cumulative cost = 
{1.170402466556728E9 rows, 9.49228897144672E9 cpu, 2.305289192E9 io, 
5.9855814764732E10 network, 1.3320421632282749E8 memory}, id = 45429
01-01  OrderedMuxExchange(sort0=[0], sort1=[1]) : rowType = 
RecordType(ANY s_store_name, ANY i_item_desc, ANY revenue, ANY i_current_price, 
ANY i_wholesale_cost, ANY i_brand): rowcount = 100.0, cumulative cost = 
{1.170402366556728E9 rows, 9.492286842675482E9 cpu, 2.305289192E9 io, 
5.9853357164732E10 network, 1.3320421632282749E8 memory}, id = 45428
02-01SelectionVectorRemover : rowType = RecordType(ANY 
s_store_name, ANY i_item_desc, ANY revenue, ANY i_current_price, ANY 
i_wholesale_cost, ANY i_brand): rowcount = 100.0, cumulative cost = 
{1.170402266556728E9 rows, 9.492286742675482E9 cpu, 2.305289192E9 io, 
5.9853357164732E10 network, 1.3320421632282749E8 memory}, id = 45427
02-02  Limit(fetch=[100]) : rowType = RecordType(ANY 
s_store_name, ANY i_item_desc, ANY revenue, ANY i_current_price, ANY 
i_wholesale_cost, ANY i_brand): rowcount = 100.0, cumulative cost = 
{1.170402166556728E9 rows, 9.492286642675482E9 cpu, 2.305289192E9 io, 
5.9853357164732E10 network, 1.3320421632282749E8 memory}, id = 45426
02-03SelectionVectorRemover : rowType = RecordType(ANY 
s_store_name, ANY i_item_desc, ANY revenue, ANY i_current_price, ANY 
i_wholesale_cost, ANY i_brand): rowcount = 416478.4359081372, cumulative cost = 
{1.170402066556728E9 rows, 9.492286242675482E9 cpu, 2.305289192E9 io, 
5.9853357164732E10 network, 

[jira] [Updated] (DRILL-7227) TPCDS queries 47, 57, 59 fail to run with Statistics enabled at sf100

2019-05-07 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-7227:

Labels: ready-to-commit  (was: )

> TPCDS queries 47, 57, 59 fail to run with Statistics enabled at sf100
> ---------------------------------------------------------------------
>
> Key: DRILL-7227
> URL: https://issues.apache.org/jira/browse/DRILL-7227
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Gautam Parai
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
> Attachments: 23387ab0-cb1c-cd5e-449a-c9bcefc901c1.sys.drill, 
> 2338ae93-155b-356d-382e-0da949c6f439.sys.drill
>
>
> Here is query 78:
> {noformat}
> WITH ws 
>  AS (SELECT d_year AS ws_sold_year, 
> ws_item_sk, 
> ws_bill_customer_sk    ws_customer_sk, 
> Sum(ws_quantity)   ws_qty, 
> Sum(ws_wholesale_cost) ws_wc, 
> Sum(ws_sales_price)ws_sp 
>  FROM   web_sales 
> LEFT JOIN web_returns 
>ON wr_order_number = ws_order_number 
>   AND ws_item_sk = wr_item_sk 
> JOIN date_dim 
>   ON ws_sold_date_sk = d_date_sk 
>  WHERE  wr_order_number IS NULL 
>  GROUP  BY d_year, 
>ws_item_sk, 
>ws_bill_customer_sk), 
>  cs 
>  AS (SELECT d_year AS cs_sold_year, 
> cs_item_sk, 
> cs_bill_customer_sk    cs_customer_sk, 
> Sum(cs_quantity)   cs_qty, 
> Sum(cs_wholesale_cost) cs_wc, 
> Sum(cs_sales_price)cs_sp 
>  FROM   catalog_sales 
> LEFT JOIN catalog_returns 
>ON cr_order_number = cs_order_number 
>   AND cs_item_sk = cr_item_sk 
> JOIN date_dim 
>   ON cs_sold_date_sk = d_date_sk 
>  WHERE  cr_order_number IS NULL 
>  GROUP  BY d_year, 
>cs_item_sk, 
>cs_bill_customer_sk), 
>  ss 
>  AS (SELECT d_year AS ss_sold_year, 
> ss_item_sk, 
> ss_customer_sk, 
> Sum(ss_quantity)   ss_qty, 
> Sum(ss_wholesale_cost) ss_wc, 
> Sum(ss_sales_price)ss_sp 
>  FROM   store_sales 
> LEFT JOIN store_returns 
>ON sr_ticket_number = ss_ticket_number 
>   AND ss_item_sk = sr_item_sk 
> JOIN date_dim 
>   ON ss_sold_date_sk = d_date_sk 
>  WHERE  sr_ticket_number IS NULL 
>  GROUP  BY d_year, 
>ss_item_sk, 
>ss_customer_sk) 
> SELECT ss_item_sk, 
>Round(ss_qty / ( COALESCE(ws_qty + cs_qty, 1) ), 2) ratio, 
>ss_qty  store_qty, 
>ss_wc 
>store_wholesale_cost, 
>ss_sp 
>store_sales_price, 
>COALESCE(ws_qty, 0) + COALESCE(cs_qty, 0) 
>other_chan_qty, 
>COALESCE(ws_wc, 0) + COALESCE(cs_wc, 0) 
>other_chan_wholesale_cost, 
>COALESCE(ws_sp, 0) + COALESCE(cs_sp, 0) 
>other_chan_sales_price 
> FROM   ss 
>LEFT JOIN ws 
>   ON ( ws_sold_year = ss_sold_year 
>AND ws_item_sk = ss_item_sk 
>AND ws_customer_sk = ss_customer_sk ) 
>LEFT JOIN cs 
>   ON ( cs_sold_year = ss_sold_year 
>                AND cs_item_sk = ss_item_sk 
>AND cs_customer_sk = ss_customer_sk ) 
> WHERE  COALESCE(ws_qty, 0) > 0 
>AND COALESCE(cs_qty, 0) > 0 
>AND ss_sold_year = 1999 
> ORDER  BY ss_item_sk, 
>   ss_qty DESC, 
>   ss_wc DESC, 
>   ss_sp DESC, 
>   other_chan_qty, 
>   other_chan_wholesale_cost, 
>   other_chan_sales_price, 
>   Round(ss_qty / ( COALESCE(ws_qty + cs_qty, 1) ), 2)
> LIMIT 100; 
> {noformat}
> The profile for the new plan is 2338ae93-155b-356d-382e-0da949c6f439.  Hash 
> partition sender operator (10-00) takes 10-15 minutes.  I am not sure why it 
> takes so long.  It has 10 minor fragments sending to receiver (06-05), which 
> has 62 minor fragments.  But hash partition sender (16-00) has 10 minor 
> fragments sending to receiver (12-06), which has 220 minor fragments, and 
> there is no performance issue.
> The profile for the old plan is 

[jira] [Commented] (DRILL-7148) TPCH query 17 increases execution time with Statistics enabled because join order is changed

2019-05-06 Thread Gautam Parai (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834153#comment-16834153
 ] 

Gautam Parai commented on DRILL-7148:
-------------------------------------

[~anishareddy] / [~agirish] these 2 tests require a baseline update. Could you 
please update the baseline? Thanks!

> TPCH query 17 increases execution time with Statistics enabled because join 
> order is changed
> --------------------------------------------------------------------------------------------
>
> Key: DRILL-7148
> URL: https://issues.apache.org/jira/browse/DRILL-7148
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.17.0
>
>
> TPCH query 17 with sf 1000 runs 45% slower. One issue is that the join order 
> has flipped the build side and the probe side in Major Fragment 01.
> Here is the query:
> select
>  sum(l.l_extendedprice) / 7.0 as avg_yearly
> from
>  lineitem l,
>  part p
> where
>  p.p_partkey = l.l_partkey
>  and p.p_brand = 'Brand#13'
>  and p.p_container = 'JUMBO CAN'
>  and l.l_quantity < (
>  select
>  0.2 * avg(l2.l_quantity)
>  from
>  lineitem l2
>  where
>  l2.l_partkey = p.p_partkey
>  );
> Here is original plan:
> {noformat}
> 00-00 Screen : rowType = RecordType(ANY avg_yearly): rowcount = 1.0, 
> cumulative cost = \{7.853786601428E10 rows, 6.6179786770537E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489493
> 00-01 Project(avg_yearly=[/($0, 7.0)]) : rowType = RecordType(ANY 
> avg_yearly): rowcount = 1.0, cumulative cost = \{7.853786601418E10 rows, 
> 6.6179786770527E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489492
> 00-02 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601318E10 rows, 
> 6.6179786770127E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489491
> 00-03 UnionExchange : rowType = RecordType(ANY $f0): rowcount = 1.0, 
> cumulative cost = \{7.853786601218E10 rows, 6.6179786768927E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489490
> 01-01 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601118E10 rows, 
> 6.6179786768127E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489489
> 01-02 Project(l_extendedprice=[$1]) : rowType = RecordType(ANY 
> l_extendedprice): rowcount = 2.948545E9, cumulative cost = 
> \{7.553787115668E10 rows, 6.2579792942727E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489488
> 01-03 SelectionVectorRemover : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): rowcount = 
> 2.948545E9, cumulative cost = \{7.253787630218E10 rows, 
> 6.2279793457277E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489487
> 01-04 Filter(condition=[<($0, *(0.2, $4))]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): 
> rowcount = 2.948545E9, cumulative cost = \{6.953788144768E10 rows, 
> 6.1979793971827E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489486
> 01-05 HashJoin(condition=[=($2, $3)], joinType=[inner], semi-join: =[false]) 
> : rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey, 
> ANY l_partkey, ANY $f1): rowcount = 5.89709E9, cumulative cost = 
> \{6.353789173867999E10 rows, 5.8379800146427E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489485
> 01-07 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2]) : 
> rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey): 
> rowcount = 5.89709E9, cumulative cost = \{4.2417927963E10 rows, 
> 2.71618536905E11 cpu, 1.8599969127E10 io, 9.8471562592256E13 network, 7.92E7 
> memory}, id = 489476
> 01-09 HashToRandomExchange(dist0=[[$2]]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY 
> E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 5.89709E9, cumulative cost = 
> \{3.6417938254E10 rows, 2.53618567778E11 cpu, 1.8599969127E10 io, 
> 9.8471562592256E13 network, 7.92E7 memory}, id = 489475
> 02-01 UnorderedMuxExchange : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
> 5.89709E9, cumulative cost = \{3.0417948545E10 rows, 1.57618732434E11 
> cpu, 1.8599969127E10 io, 1.677312E11 network, 7.92E7 memory}, id = 489474
> 04-01 

[jira] [Updated] (DRILL-7148) TPCH query 17 increases execution time with Statistics enabled because join order is changed

2019-05-06 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-7148:

Labels: ready-to-commit  (was: )

> TPCH query 17 increases execution time with Statistics enabled because join 
> order is changed
> --------------------------------------------------------------------------------------------
>
> Key: DRILL-7148
> URL: https://issues.apache.org/jira/browse/DRILL-7148
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> TPCH query 17 with sf 1000 runs 45% slower. One issue is that the join order 
> has flipped the build side and the probe side in Major Fragment 01.
> Here is the query:
> select
>  sum(l.l_extendedprice) / 7.0 as avg_yearly
> from
>  lineitem l,
>  part p
> where
>  p.p_partkey = l.l_partkey
>  and p.p_brand = 'Brand#13'
>  and p.p_container = 'JUMBO CAN'
>  and l.l_quantity < (
>  select
>  0.2 * avg(l2.l_quantity)
>  from
>  lineitem l2
>  where
>  l2.l_partkey = p.p_partkey
>  );
> Here is original plan:
> {noformat}
> 00-00 Screen : rowType = RecordType(ANY avg_yearly): rowcount = 1.0, 
> cumulative cost = \{7.853786601428E10 rows, 6.6179786770537E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489493
> 00-01 Project(avg_yearly=[/($0, 7.0)]) : rowType = RecordType(ANY 
> avg_yearly): rowcount = 1.0, cumulative cost = \{7.853786601418E10 rows, 
> 6.6179786770527E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489492
> 00-02 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601318E10 rows, 
> 6.6179786770127E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489491
> 00-03 UnionExchange : rowType = RecordType(ANY $f0): rowcount = 1.0, 
> cumulative cost = \{7.853786601218E10 rows, 6.6179786768927E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489490
> 01-01 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601118E10 rows, 
> 6.6179786768127E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489489
> 01-02 Project(l_extendedprice=[$1]) : rowType = RecordType(ANY 
> l_extendedprice): rowcount = 2.948545E9, cumulative cost = 
> \{7.553787115668E10 rows, 6.2579792942727E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489488
> 01-03 SelectionVectorRemover : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): rowcount = 
> 2.948545E9, cumulative cost = \{7.253787630218E10 rows, 
> 6.2279793457277E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489487
> 01-04 Filter(condition=[<($0, *(0.2, $4))]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): 
> rowcount = 2.948545E9, cumulative cost = \{6.953788144768E10 rows, 
> 6.1979793971827E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489486
> 01-05 HashJoin(condition=[=($2, $3)], joinType=[inner], semi-join: =[false]) 
> : rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey, 
> ANY l_partkey, ANY $f1): rowcount = 5.89709E9, cumulative cost = 
> \{6.353789173867999E10 rows, 5.8379800146427E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489485
> 01-07 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2]) : 
> rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey): 
> rowcount = 5.89709E9, cumulative cost = \{4.2417927963E10 rows, 
> 2.71618536905E11 cpu, 1.8599969127E10 io, 9.8471562592256E13 network, 7.92E7 
> memory}, id = 489476
> 01-09 HashToRandomExchange(dist0=[[$2]]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY 
> E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 5.89709E9, cumulative cost = 
> \{3.6417938254E10 rows, 2.53618567778E11 cpu, 1.8599969127E10 io, 
> 9.8471562592256E13 network, 7.92E7 memory}, id = 489475
> 02-01 UnorderedMuxExchange : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
> 5.89709E9, cumulative cost = \{3.0417948545E10 rows, 1.57618732434E11 
> cpu, 1.8599969127E10 io, 1.677312E11 network, 7.92E7 memory}, id = 489474
> 04-01 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2], 
> 

[jira] [Updated] (DRILL-7231) TPCDS-21 regresses after fix for DRILL-7148

2019-04-30 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-7231:

Description: 
The join rowcount regresses a lot after changes made for DRILL-7148. This 
affects several TPC-DS queries.

One of the fixes for DRILL-7148 introduced a change in 
DrillRelMDDistinctRowcount to use the guess of 0.1 * input_row_count only when 
not all columns in the group-by key have NDV statistics. However, the fix was 
incorrect and instead caused the guesstimate NDV to be used even when 
statistics were present.

Since the NDV was estimated as 0.1 * input_row_count because of the 
regression, the join cardinality for TPCDS-21 was severely underestimated: 
400M * 15 / Max(400K, 15) = 150.

  was:The join rowcount regresses a lot after changes made for DRILL-7148. This 
affects several TPC-DS queries.


> TPCDS-21 regresses after fix for DRILL-7148
> -------------------------------------------
>
> Key: DRILL-7231
> URL: https://issues.apache.org/jira/browse/DRILL-7231
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The join rowcount regresses a lot after changes made for DRILL-7148. This 
> affects several TPC-DS queries.
> One of the fixes for DRILL-7148 introduced a change in 
> DrillRelMDDistinctRowcount to use the guess of 0.1 * input_row_count only 
> when not all columns in the group-by key have NDV statistics. However, the 
> fix was incorrect and instead caused the guesstimate NDV to be used even 
> when statistics were present.
> Since the NDV was estimated as 0.1 * input_row_count because of the 
> regression, the join cardinality for TPCDS-21 was severely underestimated: 
> 400M * 15 / Max(400K, 15) = 150.
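
For reference, a minimal sketch of the fallback logic the description intends 
(not Drill's actual DrillRelMDDistinctRowcount code; the method signature, the 
null-means-missing convention, and the independence assumption across key 
columns are illustrative):
{noformat}
import java.util.List;
import java.util.function.Function;

class DistinctRowCountSketch {
  // Intended behavior per DRILL-7231: fall back to the 0.1 * input_row_count
  // guess only when some group-by column has no NDV statistic; the buggy fix
  // applied the guess even when statistics were present.
  static double estimate(List<String> groupByCols, double inputRowCount,
                         Function<String, Double> ndvStat) {
    double ndv = 1.0;
    for (String col : groupByCols) {
      Double colNdv = ndvStat.apply(col);   // null => no statistic for col
      if (colNdv == null) {
        return 0.1 * inputRowCount;         // guess only when stats are missing
      }
      ndv *= colNdv;                        // independence assumption
    }
    return Math.min(ndv, inputRowCount);
  }
}
{noformat}
The quoted join estimate follows the textbook form 
|L| * |R| / Max(NDV(L.key), NDV(R.key)), so an NDV wrongly pinned to 
0.1 * input_row_count feeds directly into the cardinality above.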





[jira] [Created] (DRILL-7231) TPCDS-21 regresses after fix for DRILL-7148

2019-04-30 Thread Gautam Parai (JIRA)
Gautam Parai created DRILL-7231:
---

 Summary: TPCDS-21 regresses after fix for DRILL-7148
 Key: DRILL-7231
 URL: https://issues.apache.org/jira/browse/DRILL-7231
 Project: Apache Drill
  Issue Type: Bug
Reporter: Gautam Parai
Assignee: Gautam Parai


The join rowcount regresses a lot after changes made for DRILL-7148. This 
affects several TPC-DS queries.





[jira] [Commented] (DRILL-7148) TPCH query 17 increases execution time with Statistics enabled because join order is changed

2019-04-29 Thread Gautam Parai (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829718#comment-16829718
 ] 

Gautam Parai commented on DRILL-7148:
-------------------------------------

PR link is present in the Issue Links section

> TPCH query 17 increases execution time with Statistics enabled because join 
> order is changed
> --------------------------------------------------------------------------------------------
>
> Key: DRILL-7148
> URL: https://issues.apache.org/jira/browse/DRILL-7148
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> TPCH query 17 with sf 1000 runs 45% slower. One issue is that the join order 
> has flipped the build side and the probe side in Major Fragment 01.
> Here is the query:
> select
>  sum(l.l_extendedprice) / 7.0 as avg_yearly
> from
>  lineitem l,
>  part p
> where
>  p.p_partkey = l.l_partkey
>  and p.p_brand = 'Brand#13'
>  and p.p_container = 'JUMBO CAN'
>  and l.l_quantity < (
>  select
>  0.2 * avg(l2.l_quantity)
>  from
>  lineitem l2
>  where
>  l2.l_partkey = p.p_partkey
>  );
> Here is original plan:
> {noformat}
> 00-00 Screen : rowType = RecordType(ANY avg_yearly): rowcount = 1.0, 
> cumulative cost = \{7.853786601428E10 rows, 6.6179786770537E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489493
> 00-01 Project(avg_yearly=[/($0, 7.0)]) : rowType = RecordType(ANY 
> avg_yearly): rowcount = 1.0, cumulative cost = \{7.853786601418E10 rows, 
> 6.6179786770527E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489492
> 00-02 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601318E10 rows, 
> 6.6179786770127E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489491
> 00-03 UnionExchange : rowType = RecordType(ANY $f0): rowcount = 1.0, 
> cumulative cost = \{7.853786601218E10 rows, 6.6179786768927E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489490
> 01-01 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601118E10 rows, 
> 6.6179786768127E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489489
> 01-02 Project(l_extendedprice=[$1]) : rowType = RecordType(ANY 
> l_extendedprice): rowcount = 2.948545E9, cumulative cost = 
> \{7.553787115668E10 rows, 6.2579792942727E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489488
> 01-03 SelectionVectorRemover : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): rowcount = 
> 2.948545E9, cumulative cost = \{7.253787630218E10 rows, 
> 6.2279793457277E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489487
> 01-04 Filter(condition=[<($0, *(0.2, $4))]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): 
> rowcount = 2.948545E9, cumulative cost = \{6.953788144768E10 rows, 
> 6.1979793971827E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489486
> 01-05 HashJoin(condition=[=($2, $3)], joinType=[inner], semi-join: =[false]) 
> : rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey, 
> ANY l_partkey, ANY $f1): rowcount = 5.89709E9, cumulative cost = 
> \{6.353789173867999E10 rows, 5.8379800146427E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489485
> 01-07 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2]) : 
> rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey): 
> rowcount = 5.89709E9, cumulative cost = \{4.2417927963E10 rows, 
> 2.71618536905E11 cpu, 1.8599969127E10 io, 9.8471562592256E13 network, 7.92E7 
> memory}, id = 489476
> 01-09 HashToRandomExchange(dist0=[[$2]]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY 
> E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 5.89709E9, cumulative cost = 
> \{3.6417938254E10 rows, 2.53618567778E11 cpu, 1.8599969127E10 io, 
> 9.8471562592256E13 network, 7.92E7 memory}, id = 489475
> 02-01 UnorderedMuxExchange : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
> 5.89709E9, cumulative cost = \{3.0417948545E10 rows, 1.57618732434E11 
> cpu, 1.8599969127E10 io, 1.677312E11 network, 7.92E7 memory}, id = 489474
> 04-01 Project(l_quantity=[$0], l_extendedprice=[$1], 

[jira] [Commented] (DRILL-7148) TPCH query 17 increases execution time with Statistics enabled because join order is changed

2019-04-29 Thread Gautam Parai (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829717#comment-16829717
 ] 

Gautam Parai commented on DRILL-7148:
-------------------------------------

[https://github.com/apache/drill/pull/1744]

> TPCH query 17 increases execution time with Statistics enabled because join 
> order is changed
> --------------------------------------------------------------------------------------------
>
> Key: DRILL-7148
> URL: https://issues.apache.org/jira/browse/DRILL-7148
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> TPCH query 17 with sf 1000 runs 45% slower. One issue is that the join order 
> has flipped the build side and the probe side in Major Fragment 01.
> Here is the query:
> select
>  sum(l.l_extendedprice) / 7.0 as avg_yearly
> from
>  lineitem l,
>  part p
> where
>  p.p_partkey = l.l_partkey
>  and p.p_brand = 'Brand#13'
>  and p.p_container = 'JUMBO CAN'
>  and l.l_quantity < (
>  select
>  0.2 * avg(l2.l_quantity)
>  from
>  lineitem l2
>  where
>  l2.l_partkey = p.p_partkey
>  );
> Here is original plan:
> {noformat}
> 00-00 Screen : rowType = RecordType(ANY avg_yearly): rowcount = 1.0, 
> cumulative cost = \{7.853786601428E10 rows, 6.6179786770537E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489493
> 00-01 Project(avg_yearly=[/($0, 7.0)]) : rowType = RecordType(ANY 
> avg_yearly): rowcount = 1.0, cumulative cost = \{7.853786601418E10 rows, 
> 6.6179786770527E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489492
> 00-02 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601318E10 rows, 
> 6.6179786770127E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489491
> 00-03 UnionExchange : rowType = RecordType(ANY $f0): rowcount = 1.0, 
> cumulative cost = \{7.853786601218E10 rows, 6.6179786768927E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489490
> 01-01 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601118E10 rows, 
> 6.6179786768127E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489489
> 01-02 Project(l_extendedprice=[$1]) : rowType = RecordType(ANY 
> l_extendedprice): rowcount = 2.948545E9, cumulative cost = 
> \{7.553787115668E10 rows, 6.2579792942727E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489488
> 01-03 SelectionVectorRemover : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): rowcount = 
> 2.948545E9, cumulative cost = \{7.253787630218E10 rows, 
> 6.2279793457277E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489487
> 01-04 Filter(condition=[<($0, *(0.2, $4))]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): 
> rowcount = 2.948545E9, cumulative cost = \{6.953788144768E10 rows, 
> 6.1979793971827E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489486
> 01-05 HashJoin(condition=[=($2, $3)], joinType=[inner], semi-join: =[false]) 
> : rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey, 
> ANY l_partkey, ANY $f1): rowcount = 5.89709E9, cumulative cost = 
> \{6.353789173867999E10 rows, 5.8379800146427E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489485
> 01-07 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2]) : 
> rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey): 
> rowcount = 5.89709E9, cumulative cost = \{4.2417927963E10 rows, 
> 2.71618536905E11 cpu, 1.8599969127E10 io, 9.8471562592256E13 network, 7.92E7 
> memory}, id = 489476
> 01-09 HashToRandomExchange(dist0=[[$2]]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY 
> E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 5.89709E9, cumulative cost = 
> \{3.6417938254E10 rows, 2.53618567778E11 cpu, 1.8599969127E10 io, 
> 9.8471562592256E13 network, 7.92E7 memory}, id = 489475
> 02-01 UnorderedMuxExchange : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
> 5.89709E9, cumulative cost = \{3.0417948545E10 rows, 1.57618732434E11 
> cpu, 1.8599969127E10 io, 1.677312E11 network, 7.92E7 memory}, id = 489474
> 04-01 Project(l_quantity=[$0], l_extendedprice=[$1], 

[jira] [Issue Comment Deleted] (DRILL-7148) TPCH query 17 increases execution time with Statistics enabled because join order is changed

2019-04-29 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-7148:

Comment: was deleted

(was: [https://github.com/apache/drill/pull/1744])

> TPCH query 17 increases execution time with Statistics enabled because join 
> order is changed
> --------------------------------------------------------------------------------------------
>
> Key: DRILL-7148
> URL: https://issues.apache.org/jira/browse/DRILL-7148
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> TPCH query 17 with sf 1000 runs 45% slower. One issue is that the join order 
> has flipped the build side and the probe side in Major Fragment 01.
> Here is the query:
> select
>  sum(l.l_extendedprice) / 7.0 as avg_yearly
> from
>  lineitem l,
>  part p
> where
>  p.p_partkey = l.l_partkey
>  and p.p_brand = 'Brand#13'
>  and p.p_container = 'JUMBO CAN'
>  and l.l_quantity < (
>  select
>  0.2 * avg(l2.l_quantity)
>  from
>  lineitem l2
>  where
>  l2.l_partkey = p.p_partkey
>  );
> Here is original plan:
> {noformat}
> 00-00 Screen : rowType = RecordType(ANY avg_yearly): rowcount = 1.0, 
> cumulative cost = \{7.853786601428E10 rows, 6.6179786770537E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489493
> 00-01 Project(avg_yearly=[/($0, 7.0)]) : rowType = RecordType(ANY 
> avg_yearly): rowcount = 1.0, cumulative cost = \{7.853786601418E10 rows, 
> 6.6179786770527E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489492
> 00-02 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601318E10 rows, 
> 6.6179786770127E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489491
> 00-03 UnionExchange : rowType = RecordType(ANY $f0): rowcount = 1.0, 
> cumulative cost = \{7.853786601218E10 rows, 6.6179786768927E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489490
> 01-01 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601118E10 rows, 
> 6.6179786768127E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489489
> 01-02 Project(l_extendedprice=[$1]) : rowType = RecordType(ANY 
> l_extendedprice): rowcount = 2.948545E9, cumulative cost = 
> \{7.553787115668E10 rows, 6.2579792942727E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489488
> 01-03 SelectionVectorRemover : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): rowcount = 
> 2.948545E9, cumulative cost = \{7.253787630218E10 rows, 
> 6.2279793457277E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489487
> 01-04 Filter(condition=[<($0, *(0.2, $4))]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): 
> rowcount = 2.948545E9, cumulative cost = \{6.953788144768E10 rows, 
> 6.1979793971827E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489486
> 01-05 HashJoin(condition=[=($2, $3)], joinType=[inner], semi-join: =[false]) 
> : rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey, 
> ANY l_partkey, ANY $f1): rowcount = 5.89709E9, cumulative cost = 
> \{6.353789173867999E10 rows, 5.8379800146427E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489485
> 01-07 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2]) : 
> rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey): 
> rowcount = 5.89709E9, cumulative cost = \{4.2417927963E10 rows, 
> 2.71618536905E11 cpu, 1.8599969127E10 io, 9.8471562592256E13 network, 7.92E7 
> memory}, id = 489476
> 01-09 HashToRandomExchange(dist0=[[$2]]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY 
> E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 5.89709E9, cumulative cost = 
> \{3.6417938254E10 rows, 2.53618567778E11 cpu, 1.8599969127E10 io, 
> 9.8471562592256E13 network, 7.92E7 memory}, id = 489475
> 02-01 UnorderedMuxExchange : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
> 5.89709E9, cumulative cost = \{3.0417948545E10 rows, 1.57618732434E11 
> cpu, 1.8599969127E10 io, 1.677312E11 network, 7.92E7 memory}, id = 489474
> 04-01 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2], 

[jira] [Updated] (DRILL-7148) TPCH query 17 increases execution time with Statistics enabled because join order is changed

2019-04-29 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-7148:

Labels: ready-to-commit  (was: )

> TPCH query 17 increases execution time with Statistics enabled because join 
> order is changed
> --------------------------------------------------------------------------------------------
>
> Key: DRILL-7148
> URL: https://issues.apache.org/jira/browse/DRILL-7148
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> TPCH query 17 with sf 1000 runs 45% slower. One issue is that the join order 
> has flipped the build side and the probe side in Major Fragment 01.
> Here is the query:
> select
>  sum(l.l_extendedprice) / 7.0 as avg_yearly
> from
>  lineitem l,
>  part p
> where
>  p.p_partkey = l.l_partkey
>  and p.p_brand = 'Brand#13'
>  and p.p_container = 'JUMBO CAN'
>  and l.l_quantity < (
>  select
>  0.2 * avg(l2.l_quantity)
>  from
>  lineitem l2
>  where
>  l2.l_partkey = p.p_partkey
>  );
> Here is original plan:
> {noformat}
> 00-00 Screen : rowType = RecordType(ANY avg_yearly): rowcount = 1.0, 
> cumulative cost = \{7.853786601428E10 rows, 6.6179786770537E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489493
> 00-01 Project(avg_yearly=[/($0, 7.0)]) : rowType = RecordType(ANY 
> avg_yearly): rowcount = 1.0, cumulative cost = \{7.853786601418E10 rows, 
> 6.6179786770527E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489492
> 00-02 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601318E10 rows, 
> 6.6179786770127E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489491
> 00-03 UnionExchange : rowType = RecordType(ANY $f0): rowcount = 1.0, 
> cumulative cost = \{7.853786601218E10 rows, 6.6179786768927E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489490
> 01-01 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601118E10 rows, 
> 6.6179786768127E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489489
> 01-02 Project(l_extendedprice=[$1]) : rowType = RecordType(ANY 
> l_extendedprice): rowcount = 2.948545E9, cumulative cost = 
> \{7.553787115668E10 rows, 6.2579792942727E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489488
> 01-03 SelectionVectorRemover : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): rowcount = 
> 2.948545E9, cumulative cost = \{7.253787630218E10 rows, 
> 6.2279793457277E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489487
> 01-04 Filter(condition=[<($0, *(0.2, $4))]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): 
> rowcount = 2.948545E9, cumulative cost = \{6.953788144768E10 rows, 
> 6.1979793971827E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489486
> 01-05 HashJoin(condition=[=($2, $3)], joinType=[inner], semi-join: =[false]) 
> : rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey, 
> ANY l_partkey, ANY $f1): rowcount = 5.89709E9, cumulative cost = 
> \{6.353789173867999E10 rows, 5.8379800146427E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489485
> 01-07 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2]) : 
> rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey): 
> rowcount = 5.89709E9, cumulative cost = \{4.2417927963E10 rows, 
> 2.71618536905E11 cpu, 1.8599969127E10 io, 9.8471562592256E13 network, 7.92E7 
> memory}, id = 489476
> 01-09 HashToRandomExchange(dist0=[[$2]]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY 
> E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 5.89709E9, cumulative cost = 
> \{3.6417938254E10 rows, 2.53618567778E11 cpu, 1.8599969127E10 io, 
> 9.8471562592256E13 network, 7.92E7 memory}, id = 489475
> 02-01 UnorderedMuxExchange : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
> 5.89709E9, cumulative cost = \{3.0417948545E10 rows, 1.57618732434E11 
> cpu, 1.8599969127E10 io, 1.677312E11 network, 7.92E7 memory}, id = 489474
> 04-01 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2], 
> 

[jira] [Updated] (DRILL-7148) TPCH query 17 increases execution time with Statistics enabled because join order is changed

2019-04-07 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-7148:

Fix Version/s: (was: 1.16.0)
   1.17.0

> TPCH query 17 increases execution time with Statistics enabled because join 
> order is changed
> --------------------------------------------------------------------------------------------
>
> Key: DRILL-7148
> URL: https://issues.apache.org/jira/browse/DRILL-7148
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.17.0
>
>
> TPCH query 17 with sf 1000 runs 45% slower. One issue is that the join order 
> has flipped the build side and the probe side in Major Fragment 01.
> Here is the query:
> select
>  sum(l.l_extendedprice) / 7.0 as avg_yearly
> from
>  lineitem l,
>  part p
> where
>  p.p_partkey = l.l_partkey
>  and p.p_brand = 'Brand#13'
>  and p.p_container = 'JUMBO CAN'
>  and l.l_quantity < (
>  select
>  0.2 * avg(l2.l_quantity)
>  from
>  lineitem l2
>  where
>  l2.l_partkey = p.p_partkey
>  );
> Here is original plan:
> {noformat}
> 00-00 Screen : rowType = RecordType(ANY avg_yearly): rowcount = 1.0, 
> cumulative cost = \{7.853786601428E10 rows, 6.6179786770537E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489493
> 00-01 Project(avg_yearly=[/($0, 7.0)]) : rowType = RecordType(ANY 
> avg_yearly): rowcount = 1.0, cumulative cost = \{7.853786601418E10 rows, 
> 6.6179786770527E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489492
> 00-02 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601318E10 rows, 
> 6.6179786770127E11 cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 
> 1.17294998955024E11 memory}, id = 489491
> 00-03 UnionExchange : rowType = RecordType(ANY $f0): rowcount = 1.0, 
> cumulative cost = \{7.853786601218E10 rows, 6.6179786768927E11 cpu, 
> 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
> memory}, id = 489490
> 01-01 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
> rowcount = 1.0, cumulative cost = \{7.853786601118E10 rows, 
> 6.6179786768127E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489489
> 01-02 Project(l_extendedprice=[$1]) : rowType = RecordType(ANY 
> l_extendedprice): rowcount = 2.948545E9, cumulative cost = 
> \{7.553787115668E10 rows, 6.2579792942727E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489488
> 01-03 SelectionVectorRemover : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): rowcount = 
> 2.948545E9, cumulative cost = \{7.253787630218E10 rows, 
> 6.2279793457277E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489487
> 01-04 Filter(condition=[<($0, *(0.2, $4))]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): 
> rowcount = 2.948545E9, cumulative cost = \{6.953788144768E10 rows, 
> 6.1979793971827E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
> 1.17294998955024E11 memory}, id = 489486
> 01-05 HashJoin(condition=[=($2, $3)], joinType=[inner], semi-join: =[false]) 
> : rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey, 
> ANY l_partkey, ANY $f1): rowcount = 5.89709E9, cumulative cost = 
> \{6.353789173867999E10 rows, 5.8379800146427E11 cpu, 3.0599948545E10 io, 
> 1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489485
> 01-07 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2]) : 
> rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey): 
> rowcount = 5.89709E9, cumulative cost = \{4.2417927963E10 rows, 
> 2.71618536905E11 cpu, 1.8599969127E10 io, 9.8471562592256E13 network, 7.92E7 
> memory}, id = 489476
> 01-09 HashToRandomExchange(dist0=[[$2]]) : rowType = RecordType(ANY 
> l_quantity, ANY l_extendedprice, ANY p_partkey, ANY 
> E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 5.89709E9, cumulative cost = 
> \{3.6417938254E10 rows, 2.53618567778E11 cpu, 1.8599969127E10 io, 
> 9.8471562592256E13 network, 7.92E7 memory}, id = 489475
> 02-01 UnorderedMuxExchange : rowType = RecordType(ANY l_quantity, ANY 
> l_extendedprice, ANY p_partkey, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
> 5.89709E9, cumulative cost = \{3.0417948545E10 rows, 1.57618732434E11 
> cpu, 1.8599969127E10 io, 1.677312E11 network, 7.92E7 memory}, id = 489474
> 04-01 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2], 
> E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($2, 

[jira] [Commented] (DRILL-7114) ANALYZE command generates warnings for stats file and materialization

2019-04-05 Thread Gautam Parai (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811085#comment-16811085
 ] 

Gautam Parai commented on DRILL-7114:
-------------------------------------

Fixed with commit ID: e477480e7b9626abc8efd70914d3bfd4321b7258

> ANALYZE command generates warnings for stats file and materialization
> ----------------------------------------------------------------------
>
> Key: DRILL-7114
> URL: https://issues.apache.org/jira/browse/DRILL-7114
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Reporter: Aman Sinha
>Assignee: Gautam Parai
>Priority: Minor
> Fix For: 1.16.0
>
>
> When I run ANALYZE, I see warnings in the log file as shown below. The 
> ANALYZE command should not try to read the stats file or materialize the 
> stats.  
> {noformat}
> 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file.
> java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does 
> not exist
> 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. 
> Continuing without stats.
> java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does 
> not exist
> {noformat}
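
A minimal sketch of the kind of existence check that would avoid these 
warnings during ANALYZE (not the actual DrillStatsTable code; the Hadoop 
FileSystem handle and the stats-file path are taken from the log above):
{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class StatsFileGuard {
  // Skip reading/materializing statistics when the stats file does not exist
  // yet (e.g. while ANALYZE itself is running), instead of attempting the
  // read and logging a FileNotFoundException.
  static boolean statsAvailable(FileSystem fs, Path statsFile) throws IOException {
    return fs.exists(statsFile);   // e.g. /tmp/orders3/.stats.drill/0_0.json
  }
}
{noformat}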





[jira] [Resolved] (DRILL-7114) ANALYZE command generates warnings for stats file and materialization

2019-04-05 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai resolved DRILL-7114.
---------------------------------
Resolution: Fixed

> ANALYZE command generates warnings for stats file and materialization
> ----------------------------------------------------------------------
>
> Key: DRILL-7114
> URL: https://issues.apache.org/jira/browse/DRILL-7114
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Reporter: Aman Sinha
>Assignee: Gautam Parai
>Priority: Minor
> Fix For: 1.16.0
>
>
> When I run ANALYZE, I see warnings in the log file as shown below. The 
> ANALYZE command should not try to read the stats file or materialize the 
> stats.  
> {noformat}
> 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file.
> java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does 
> not exist
> 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. 
> Continuing without stats.
> java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does 
> not exist
> {noformat}





[jira] [Assigned] (DRILL-7136) Num_buckets for HashAgg in profile may be inaccurate

2019-04-04 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai reassigned DRILL-7136:
---

Assignee: Boaz Ben-Zvi  (was: Gautam Parai)

> Num_buckets for HashAgg in profile may be inaccurate
> ----------------------------------------------------
>
> Key: DRILL-7136
> URL: https://issues.apache.org/jira/browse/DRILL-7136
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Tools, Build & Test
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Boaz Ben-Zvi
>Priority: Major
> Fix For: 1.16.0
>
> Attachments: 23650ee5-6721-8a8f-7dd3-f5dd09a3a7b0.sys.drill
>
>
> I ran TPCH query 17 with sf 1000.  Here is the query:
> {noformat}
> select
>   sum(l.l_extendedprice) / 7.0 as avg_yearly
> from
>   lineitem l,
>   part p
> where
>   p.p_partkey = l.l_partkey
>   and p.p_brand = 'Brand#13'
>   and p.p_container = 'JUMBO CAN'
>   and l.l_quantity < (
> select
>   0.2 * avg(l2.l_quantity)
> from
>   lineitem l2
> where
>   l2.l_partkey = p.p_partkey
>   );
> {noformat}
> One of the hash agg operators has resized 6 times.  It should have 4M 
> buckets.  But the profile shows it has 64K buckets.
> I have attached a sample profile.  In this profile, the hash agg operator is 
> (04-02).
> {noformat}
> Operator Metrics
> Minor FragmentNUM_BUCKETS NUM_ENTRIES NUM_RESIZING
> RESIZING_TIME_MSNUM_PARTITIONS  SPILLED_PARTITIONS  SPILL_MB  
>   SPILL_CYCLE INPUT_BATCH_COUNT   AVG_INPUT_BATCH_BYTES   
> AVG_INPUT_ROW_BYTES INPUT_RECORD_COUNT  OUTPUT_BATCH_COUNT  
> AVG_OUTPUT_BATCH_BYTES  AVG_OUTPUT_ROW_BYTESOUTPUT_RECORD_COUNT
> 04-00-02  65,536 748,746  6   364 1   
> 582 0   813 582,653 18  26,316,456  401 1,631,943 
>   25  26,176,350
> {noformat}
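
For reference, the expected bucket count follows from the reported resize 
count, assuming the hash table starts at the reported 64K buckets and doubles 
on each resize (a sanity check of the report's arithmetic, not Drill code):
{noformat}
class BucketMathCheck {
  public static void main(String[] args) {
    int reportedBuckets = 65_536;   // NUM_BUCKETS in the profile (64K)
    int numResizings = 6;           // NUM_RESIZING in the profile
    // Six doublings of 64K: 65,536 * 2^6 = 4,194,304. The profile should
    // therefore show ~4M buckets, not the initial 64K.
    System.out.println(reportedBuckets << numResizings);
  }
}
{noformat}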





[jira] [Created] (DRILL-7148) TPCH query 17 increases execution time with Statistics enabled because join order is changed

2019-04-01 Thread Gautam Parai (JIRA)
Gautam Parai created DRILL-7148:
---

 Summary: TPCH query 17 increases execution time with Statistics 
enabled because join order is changed
 Key: DRILL-7148
 URL: https://issues.apache.org/jira/browse/DRILL-7148
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Gautam Parai
Assignee: Gautam Parai
 Fix For: 1.16.0


TPCH query 17 with sf 1000 runs 45% slower. One issue is that the join order 
has flipped the build side and the probe side in Major Fragment 01.

Here is the query:
select
 sum(l.l_extendedprice) / 7.0 as avg_yearly
from
 lineitem l,
 part p
where
 p.p_partkey = l.l_partkey
 and p.p_brand = 'Brand#13'
 and p.p_container = 'JUMBO CAN'
 and l.l_quantity < (
 select
 0.2 * avg(l2.l_quantity)
 from
 lineitem l2
 where
 l2.l_partkey = p.p_partkey
 );

Here is original plan:
{noformat}
00-00 Screen : rowType = RecordType(ANY avg_yearly): rowcount = 1.0, cumulative 
cost = \{7.853786601428E10 rows, 6.6179786770537E11 cpu, 3.0599948545E10 io, 
1.083019457355776E14 network, 1.17294998955024E11 memory}, id = 489493
00-01 Project(avg_yearly=[/($0, 7.0)]) : rowType = RecordType(ANY avg_yearly): 
rowcount = 1.0, cumulative cost = \{7.853786601418E10 rows, 6.6179786770527E11 
cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
memory}, id = 489492
00-02 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
rowcount = 1.0, cumulative cost = \{7.853786601318E10 rows, 6.6179786770127E11 
cpu, 3.0599948545E10 io, 1.083019457355776E14 network, 1.17294998955024E11 
memory}, id = 489491
00-03 UnionExchange : rowType = RecordType(ANY $f0): rowcount = 1.0, cumulative 
cost = \{7.853786601218E10 rows, 6.6179786768927E11 cpu, 3.0599948545E10 io, 
1.083019457355776E14 network, 1.17294998955024E11 memory}, id = 489490
01-01 StreamAgg(group=[{}], agg#0=[SUM($0)]) : rowType = RecordType(ANY $f0): 
rowcount = 1.0, cumulative cost = \{7.853786601118E10 rows, 6.6179786768127E11 
cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 1.17294998955024E11 
memory}, id = 489489
01-02 Project(l_extendedprice=[$1]) : rowType = RecordType(ANY 
l_extendedprice): rowcount = 2.948545E9, cumulative cost = 
\{7.553787115668E10 rows, 6.2579792942727E11 cpu, 3.0599948545E10 io, 
1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489488
01-03 SelectionVectorRemover : rowType = RecordType(ANY l_quantity, ANY 
l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): rowcount = 
2.948545E9, cumulative cost = \{7.253787630218E10 rows, 6.2279793457277E11 
cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 1.17294998955024E11 
memory}, id = 489487
01-04 Filter(condition=[<($0, *(0.2, $4))]) : rowType = RecordType(ANY 
l_quantity, ANY l_extendedprice, ANY p_partkey, ANY l_partkey, ANY $f1): 
rowcount = 2.948545E9, cumulative cost = \{6.953788144768E10 rows, 
6.1979793971827E11 cpu, 3.0599948545E10 io, 1.083019457314816E14 network, 
1.17294998955024E11 memory}, id = 489486
01-05 HashJoin(condition=[=($2, $3)], joinType=[inner], semi-join: =[false]) : 
rowType = RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey, ANY 
l_partkey, ANY $f1): rowcount = 5.89709E9, cumulative cost = 
\{6.353789173867999E10 rows, 5.8379800146427E11 cpu, 3.0599948545E10 io, 
1.083019457314816E14 network, 1.17294998955024E11 memory}, id = 489485
01-07 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2]) : rowType 
= RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey): rowcount = 
5.89709E9, cumulative cost = \{4.2417927963E10 rows, 2.71618536905E11 cpu, 
1.8599969127E10 io, 9.8471562592256E13 network, 7.92E7 memory}, id = 489476
01-09 HashToRandomExchange(dist0=[[$2]]) : rowType = RecordType(ANY l_quantity, 
ANY l_extendedprice, ANY p_partkey, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
5.89709E9, cumulative cost = \{3.6417938254E10 rows, 2.53618567778E11 cpu, 
1.8599969127E10 io, 9.8471562592256E13 network, 7.92E7 memory}, id = 489475
02-01 UnorderedMuxExchange : rowType = RecordType(ANY l_quantity, ANY 
l_extendedprice, ANY p_partkey, ANY E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 
5.89709E9, cumulative cost = \{3.0417948545E10 rows, 1.57618732434E11 cpu, 
1.8599969127E10 io, 1.677312E11 network, 7.92E7 memory}, id = 489474
04-01 Project(l_quantity=[$0], l_extendedprice=[$1], p_partkey=[$2], 
E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($2, 1301011)]) : rowType = 
RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey, ANY 
E_X_P_R_H_A_S_H_F_I_E_L_D): rowcount = 5.89709E9, cumulative cost = 
\{2.4417958836E10 rows, 1.51618742725E11 cpu, 1.8599969127E10 io, 1.677312E11 
network, 7.92E7 memory}, id = 489473
04-02 Project(l_quantity=[$1], l_extendedprice=[$2], p_partkey=[$3]) : rowType 
= RecordType(ANY l_quantity, ANY l_extendedprice, ANY p_partkey): rowcount = 
5.89709E9, cumulative 

[jira] [Commented] (DRILL-7076) NPE is logged when querying postgres tables

2019-03-28 Thread Gautam Parai (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804510#comment-16804510
 ] 

Gautam Parai commented on DRILL-7076:
-

[~vvysotskyi] could you please let me know how to get/create the data source, 
i.e., the Postgres table?
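
For reference, a minimal reproduction sketch (assumptions: any table shape 
should do, since the failing query is just {{select 1}}; host, database, and 
credentials below are placeholders):

{code:sql}
-- in Postgres: a hypothetical minimal table matching the name in the report
CREATE TABLE public.tdt (id INTEGER);
INSERT INTO public.tdt VALUES (1);
{code}

with a JDBC storage plugin registered in Drill along these lines:

{code:java}
{
  "type": "jdbc",
  "driver": "org.postgresql.Driver",
  "url": "jdbc:postgresql://localhost:5432/postgres",
  "username": "postgres",
  "password": "postgres",
  "enabled": true
}
{code}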

> NPE is logged when querying postgres tables
> ---
>
> Key: DRILL-7076
> URL: https://issues.apache.org/jira/browse/DRILL-7076
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Volodymyr Vysotskyi
>Assignee: Gautam Parai
>Priority: Blocker
> Fix For: 1.16.0
>
>
> NPE is seen in logs when querying Postgres table:
> {code:sql}
> select 1 from postgres.public.tdt
> {code}
> Stack trace from {{sqlline.log}}:
> {noformat}
> 2019-03-05 13:49:19,395 [23819dc0-abf8-24f3-ea81-6ced1b6e11af:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. 
> Continuing without stats.
> java.lang.NullPointerException: null
>   at 
> org.apache.drill.exec.planner.common.DrillStatsTable$StatsMaterializationVisitor.visit(DrillStatsTable.java:189)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at org.apache.calcite.rel.SingleRel.childrenAccept(SingleRel.java:72) 
> [calcite-core-1.18.0-drill-r0.jar:1.18.0-drill-r0]
>   at org.apache.calcite.rel.RelVisitor.visit(RelVisitor.java:44) 
> [calcite-core-1.18.0-drill-r0.jar:1.18.0-drill-r0]
>   at 
> org.apache.drill.exec.planner.common.DrillStatsTable$StatsMaterializationVisitor.visit(DrillStatsTable.java:202)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at org.apache.calcite.rel.RelVisitor.go(RelVisitor.java:61) 
> [calcite-core-1.18.0-drill-r0.jar:1.18.0-drill-r0]
>   at 
> org.apache.drill.exec.planner.common.DrillStatsTable$StatsMaterializationVisitor.materialize(DrillStatsTable.java:177)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToRawDrel(DefaultSqlHandler.java:235)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:331)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan(DefaultSqlHandler.java:178)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan(DrillSqlWorker.java:204)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.planner.sql.DrillSqlWorker.convertPlan(DrillSqlWorker.java:114)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:80)
>  [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:584) 
> [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:272) 
> [drill-java-exec-1.16.0-SNAPSHOT.jar:1.16.0-SNAPSHOT]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [na:1.8.0_191]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [na:1.8.0_191]
>   at java.lang.Thread.run(Thread.java:748) [na:1.8.0_191]
> {noformat}
> But the query runs and returns the correct result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (DRILL-7123) TPCDS query 83 runs slower when Statistics is disabled

2019-03-28 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai resolved DRILL-7123.
-
Resolution: Fixed

> TPCDS query 83 runs slower when Statistics is disabled
> --
>
> Key: DRILL-7123
> URL: https://issues.apache.org/jira/browse/DRILL-7123
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Gautam Parai
>Priority: Blocker
> Fix For: 1.16.0
>
>
> Query is TPCDS 83 with sf 100:
> {noformat}
> WITH sr_items 
>  AS (SELECT i_item_id   item_id, 
> Sum(sr_return_quantity) sr_item_qty 
>  FROM   store_returns, 
> item, 
> date_dim 
>  WHERE  sr_item_sk = i_item_sk 
> AND d_date IN (SELECT d_date 
>FROM   date_dim 
>WHERE  d_week_seq IN (SELECT d_week_seq 
>  FROM   date_dim 
>  WHERE 
>   d_date IN ( '1999-06-30', 
>   '1999-08-28', 
>   '1999-11-18' 
> ))) 
> AND sr_returned_date_sk = d_date_sk 
>  GROUP  BY i_item_id), 
>  cr_items 
>  AS (SELECT i_item_id   item_id, 
> Sum(cr_return_quantity) cr_item_qty 
>  FROM   catalog_returns, 
> item, 
> date_dim 
>  WHERE  cr_item_sk = i_item_sk 
> AND d_date IN (SELECT d_date 
>FROM   date_dim 
>WHERE  d_week_seq IN (SELECT d_week_seq 
>  FROM   date_dim 
>  WHERE 
>   d_date IN ( '1999-06-30', 
>   '1999-08-28', 
>   '1999-11-18' 
> ))) 
> AND cr_returned_date_sk = d_date_sk 
>  GROUP  BY i_item_id), 
>  wr_items 
>  AS (SELECT i_item_id   item_id, 
> Sum(wr_return_quantity) wr_item_qty 
>  FROM   web_returns, 
> item, 
> date_dim 
>  WHERE  wr_item_sk = i_item_sk 
> AND d_date IN (SELECT d_date 
>FROM   date_dim 
>WHERE  d_week_seq IN (SELECT d_week_seq 
>  FROM   date_dim 
>  WHERE 
>   d_date IN ( '1999-06-30', 
>   '1999-08-28', 
>   '1999-11-18' 
> ))) 
> AND wr_returned_date_sk = d_date_sk 
>  GROUP  BY i_item_id) 
> SELECT sr_items.item_id, 
>sr_item_qty, 
>sr_item_qty / ( sr_item_qty + cr_item_qty + wr_item_qty ) / 
> 3.0 * 
>100 sr_dev, 
>cr_item_qty, 
>cr_item_qty / ( sr_item_qty + cr_item_qty + wr_item_qty ) / 
> 3.0 * 
>100 cr_dev, 
>wr_item_qty, 
>wr_item_qty / ( sr_item_qty + cr_item_qty + wr_item_qty ) / 
> 3.0 * 
>100 wr_dev, 
>( sr_item_qty + cr_item_qty + wr_item_qty ) / 3.0 
>average 
> FROM   sr_items, 
>cr_items, 
>wr_items 
> WHERE  sr_items.item_id = cr_items.item_id 
>AND sr_items.item_id = wr_items.item_id 
> ORDER  BY sr_items.item_id, 
>   sr_item_qty
> LIMIT 100; 
> {noformat}
> The number of threads for major fragments 1 and 2 has changed when Statistics 
> is disabled.  The number of minor fragments has been reduced from 10 and 15, 
> respectively, down to 3.  The rowcount for major fragment 2 has changed from 
> 1439754.0 down to 287950.8.
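> A quick way to inspect the parallelism limits in play (a sketch; sys.options 
> is Drill's real options table, the LIKE filter is just illustrative):
> {code:sql}
> select * from sys.options where name like 'planner.width%';
> {code}
> The fragment counts above follow from the planner's rowcount-based width 
> computation, bounded by these options.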



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (DRILL-7120) Query fails with ChannelClosedException when Statistics is disabled

2019-03-28 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai resolved DRILL-7120.
-
Resolution: Fixed

> Query fails with ChannelClosedException when Statistics is disabled
> ---
>
> Key: DRILL-7120
> URL: https://issues.apache.org/jira/browse/DRILL-7120
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Gautam Parai
>Priority: Blocker
> Fix For: 1.16.0
>
>
> TPCH query 5 fails at sf100 when Statistics is disabled.  Here is the query:
> {noformat}
> select
>   n.n_name,
>   sum(l.l_extendedprice * (1 - l.l_discount)) as revenue
> from
>   customer c,
>   orders o,
>   lineitem l,
>   supplier s,
>   nation n,
>   region r
> where
>   c.c_custkey = o.o_custkey
>   and l.l_orderkey = o.o_orderkey
>   and l.l_suppkey = s.s_suppkey
>   and c.c_nationkey = s.s_nationkey
>   and s.s_nationkey = n.n_nationkey
>   and n.n_regionkey = r.r_regionkey
>   and r.r_name = 'EUROPE'
>   and o.o_orderdate >= date '1997-01-01'
>   and o.o_orderdate < date '1997-01-01' + interval '1' year
> group by
>   n.n_name
> order by
>   revenue desc;
> {noformat}
> This is the error from drillbit.log:
> {noformat}
> 2019-03-04 17:46:38,684 [23822b0a-b7bd-0b79-b905-1438f5b1d039:frag:6:64] INFO 
>  o.a.d.e.w.fragment.FragmentExecutor - 
> 23822b0a-b7bd-0b79-b905-1438f5b1d039:6:64: State change requested RUNNING --> 
> FINISHED
> 2019-03-04 17:46:38,684 [23822b0a-b7bd-0b79-b905-1438f5b1d039:frag:6:64] INFO 
>  o.a.d.e.w.f.FragmentStatusReporter - 
> 23822b0a-b7bd-0b79-b905-1438f5b1d039:6:64: State to report: FINISHED
> 2019-03-04 18:17:51,454 [BitServer-13] WARN  
> o.a.d.exec.rpc.ProtobufLengthDecoder - Failure allocating buffer on incoming 
> stream due to memory limits.  Current Allocation: 262144.
> 2019-03-04 18:17:51,454 [BitServer-13] ERROR 
> o.a.drill.exec.rpc.data.DataServer - Out of memory in RPC layer.
> 2019-03-04 18:17:51,463 [BitServer-13] ERROR 
> o.a.d.exec.rpc.RpcExceptionHandler - Exception in RPC communication.  
> Connection: /10.10.120.104:31012 <--> /10.10.120.106:53048 (data server).  
> Closing connection.
> io.netty.handler.codec.DecoderException: 
> org.apache.drill.exec.exception.OutOfMemoryException: Failure allocating 
> buffer.
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:271)
>  ~[netty-codec-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:335)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:356)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>  [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) 
> [netty-transport-4.0.48.Final.jar:4.0.48.Final]
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  

[jira] [Resolved] (DRILL-7122) TPCDS queries 29 25 17 are slower when Statistics is disabled.

2019-03-28 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai resolved DRILL-7122.
-
Resolution: Fixed

> TPCDS queries 29 25 17 are slower when Statistics is disabled.
> --
>
> Key: DRILL-7122
> URL: https://issues.apache.org/jira/browse/DRILL-7122
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Gautam Parai
>Priority: Blocker
> Fix For: 1.16.0
>
>
> Here is query 29 with sf 100:
> {noformat}
> SELECT i_item_id, 
>i_item_desc, 
>s_store_id, 
>s_store_name, 
>Avg(ss_quantity)AS store_sales_quantity, 
>Avg(sr_return_quantity) AS store_returns_quantity, 
>Avg(cs_quantity)AS catalog_sales_quantity 
> FROM   store_sales, 
>store_returns, 
>catalog_sales, 
>date_dim d1, 
>date_dim d2, 
>date_dim d3, 
>store, 
>item 
> WHERE  d1.d_moy = 4 
>AND d1.d_year = 1998 
>AND d1.d_date_sk = ss_sold_date_sk 
>AND i_item_sk = ss_item_sk 
>AND s_store_sk = ss_store_sk 
>AND ss_customer_sk = sr_customer_sk 
>AND ss_item_sk = sr_item_sk 
>AND ss_ticket_number = sr_ticket_number 
>AND sr_returned_date_sk = d2.d_date_sk 
>AND d2.d_moy BETWEEN 4 AND 4 + 3 
>AND d2.d_year = 1998 
>AND sr_customer_sk = cs_bill_customer_sk 
>AND sr_item_sk = cs_item_sk 
>AND cs_sold_date_sk = d3.d_date_sk 
>AND d3.d_year IN ( 1998, 1998 + 1, 1998 + 2 ) 
> GROUP  BY i_item_id, 
>   i_item_desc, 
>   s_store_id, 
>   s_store_name 
> ORDER  BY i_item_id, 
>   i_item_desc, 
>   s_store_id, 
>   s_store_name
> LIMIT 100; 
> {noformat}
> The hash join order has changed.  As a result, one of the hash joins does not 
> seem to reduce the number of rows significantly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7109) Statistics adds external sort, which spills to disk

2019-03-22 Thread Gautam Parai (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799449#comment-16799449
 ] 

Gautam Parai commented on DRILL-7109:
-

[~rhou] can you please create another JIRA for the issue where we see filter 
predicates of the form $0=$0 in TPCH query 4? That is a separate issue which 
should be investigated outside the scope of statistics.

> Statistics adds external sort, which spills to disk
> ---
>
> Key: DRILL-7109
> URL: https://issues.apache.org/jira/browse/DRILL-7109
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.16.0
>
>
> TPCH query 4 with sf 100 runs many times slower.  One issue is that an extra 
> external sort has been added, and both external sorts spill to disk.
> Also, the hash join sees 100x more data.
> Here is the query:
> {noformat}
> select
>   o.o_orderpriority,
>   count(*) as order_count
> from
>   orders o
> where
>   o.o_orderdate >= date '1996-10-01'
>   and o.o_orderdate < date '1996-10-01' + interval '3' month
>   and 
>   exists (
> select
>   *
> from
>   lineitem l
> where
>   l.l_orderkey = o.o_orderkey
>   and l.l_commitdate < l.l_receiptdate
>   )
> group by
>   o.o_orderpriority
> order by
>   o.o_orderpriority;
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-7129) Join with more than 1 condition is not using stats to compute row count estimate

2019-03-21 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai reassigned DRILL-7129:
---

Assignee: Gautam Parai

> Join with more than 1 condition is not using stats to compute row count 
> estimate
> 
>
> Key: DRILL-7129
> URL: https://issues.apache.org/jira/browse/DRILL-7129
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Anisha Reddy
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.17.0
>
>
> Below are the details: 
>  
> {code:java}
> 0: jdbc:drill:drillbit=10.10.101.108> select count(*) from 
> `table_stats/Tpch0.01/parquet/lineitem`;
> +---------+
> | EXPR$0  |
> +---------+
> | 57068   |
> +---------+
> 1 row selected (0.179 seconds)
> 0: jdbc:drill:drillbit=10.10.101.108> select count(*) from 
> `table_stats/Tpch0.01/parquet/partsupp`;
> +---------+
> | EXPR$0  |
> +---------+
> | 7474    |
> +---------+
> 1 row selected (0.171 seconds)
> 0: jdbc:drill:drillbit=10.10.101.108> select count(*) from 
> `table_stats/Tpch0.01/parquet/lineitem` l, 
> `table_stats/Tpch0.01/parquet/partsupp` ps where l.l_partkey = ps.ps_partkey 
> and l.l_suppkey = ps.ps_suppkey;
> +---------+
> | EXPR$0  |
> +---------+
> | 53401   |
> +---------+
> 1 row selected (0.769 seconds)
>  0: jdbc:drill:drillbit=10.10.101.108> explain plan including all attributes 
> for select * from `table_stats/Tpch0.01/parquet/lineitem` l, 
> `table_stats/Tpch0.01/parquet/partsupp` ps where l.l_partkey = ps.ps_partkey 
> and l.l_suppkey = ps.ps_suppkey; 
> +------+------+
> | text | json |
> +------+------+
> | 00-00 Screen : rowType = RecordType(DYNAMIC_STAR **, DYNAMIC_STAR **0): rowcount = 57068.0, cumulative cost = {313468.8 rows, 2110446.8 cpu, 193626.0 io, 0.0 network, 197313.6 memory}, id = 107578
> 00-01 ProjectAllowDup(**=[$0], **0=[$1]) : rowType = RecordType(DYNAMIC_STAR **, DYNAMIC_STAR **0): rowcount = 57068.0, cumulative cost = {307762.0 rows, 2104740.0 cpu, 193626.0 io, 0.0 network, 197313.6 memory}, id = 107577
> 00-02 Project(T10¦¦**=[$0], T11¦¦**=[$3]) : rowType = RecordType(DYNAMIC_STAR T10¦¦**, DYNAMIC_STAR T11¦¦**): rowcount = 57068.0, cumulative cost = {250694.0 rows, 1990604.0 cpu, 193626.0 io, 0.0 network, 197313.6 memory}, id = 107576
> 00-03 HashJoin(condition=[AND(=($1, $4), =($2, $5))], joinType=[inner], semi-join: =[false]) : rowType = RecordType(DYNAMIC_STAR T10¦¦**, ANY l_partkey, ANY l_suppkey, DYNAMIC_STAR T11¦¦**, ANY ps_partkey, ANY ps_suppkey): rowcount = 57068.0, cumulative cost = {193626.0 rows, 1876468.0 cpu, 193626.0 io, 0.0 network, 197313.6 memory}, id = 107575
> 00-05 Project(T10¦¦**=[$0], l_partkey=[$1], l_suppkey=[$2]) : rowType = RecordType(DYNAMIC_STAR T10¦¦**, ANY l_partkey, ANY l_suppkey): rowcount = 57068.0, cumulative cost = {114136.0 rows, 342408.0 cpu, 171204.0 io, 0.0 network, 0.0 memory}, id = 107572
> 00-07 Scan(table=[[dfs, drilltestdir, table_stats/Tpch0.01/parquet/lineitem]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/table_stats/Tpch0.01/parquet/lineitem]], selectionRoot=maprfs:/drill/testdata/table_stats/Tpch0.01/parquet/lineitem, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`, `l_partkey`, `l_suppkey`]]]) : rowType = RecordType(DYNAMIC_STAR **, ANY l_partkey, ANY l_suppkey): rowcount = 57068.0, cumulative cost = {57068.0 rows, 171204.0 cpu, 171204.0 io, 0.0 network, 0.0 memory}, id = 107571
> 00-04 Project(T11¦¦**=[$0], ps_partkey=[$1], ps_suppkey=[$2]) : rowType = RecordType(DYNAMIC_STAR T11¦¦**, ANY ps_partkey, ANY ps_suppkey): rowcount = 7474.0, cumulative cost = {14948.0 rows, 44844.0 cpu, 22422.0 io, 0.0 network, 0.0 memory}, id = 107574
> 00-06 Scan(table=[[dfs, drilltestdir, table_stats/Tpch0.01/parquet/partsupp]], groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/table_stats/Tpch0.01/parquet/partsupp]], selectionRoot=maprfs:/drill/testdata/table_stats/Tpch0.01/parquet/partsupp, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`, `ps_partkey`, `ps_suppkey`]]]) : rowType = RecordType(DYNAMIC_STAR **, ANY ps_partkey, ANY ps_suppkey): rowcount = 7474.0, cumulative cost = {7474.0 rows, 22422.0 cpu, 22422.0 io, 0.0 network, 0.0 memory}, id = 107573
> {code}
> The NDVs are: l_partkey = 2000, ps_partkey = 1817, l_suppkey = 100, 
> ps_suppkey = 100.
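> As a worked sanity check (a sketch using the textbook join-cardinality 
> estimate rows(L) * rows(PS) / (maxNDV(partkey) * maxNDV(suppkey)), not 
> necessarily Drill's exact estimator), the two-condition join should be costed 
> well below the input size:
> {noformat}
> 57068 * 7474 / (2000 * 100) ≈ 2133 rows
> {noformat}
> Yet the HashJoin above is costed at rowcount = 57068.0, i.e. the estimate 
> ignores the stats for the compound join key.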
> We see that such joins is 

[jira] [Updated] (DRILL-7108) With statistics enabled TPCH 16 has two additional exchange operators

2019-03-20 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-7108:

Labels: ready-to-commit  (was: )

> With statistics enabled TPCH 16 has two additional exchange operators
> -
>
> Key: DRILL-7108
> URL: https://issues.apache.org/jira/browse/DRILL-7108
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Affects Versions: 1.16.0
>Reporter: Robert Hou
>Assignee: Gautam Parai
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> TPCH 16 with sf 100 runs 14% slower.  Here is the query:
> {noformat}
> select
>   p.p_brand,
>   p.p_type,
>   p.p_size,
>   count(distinct ps.ps_suppkey) as supplier_cnt
> from
>   partsupp ps,
>   part p
> where
>   p.p_partkey = ps.ps_partkey
>   and p.p_brand <> 'Brand#21'
>   and p.p_type not like 'MEDIUM PLATED%'
>   and p.p_size in (38, 2, 8, 31, 44, 5, 14, 24)
>   and ps.ps_suppkey not in (
> select
>   s.s_suppkey
> from
>   supplier s
> where
>   s.s_comment like '%Customer%Complaints%'
>   )
> group by
>   p.p_brand,
>   p.p_type,
>   p.p_size
> order by
>   supplier_cnt desc,
>   p.p_brand,
>   p.p_type,
>   p.p_size;
> {noformat}
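> A hedged experiment for isolating the exchange overhead (the option named 
> here is a real planner option; whether disabling it removes these particular 
> exchanges is an assumption):
> {code:sql}
> set `planner.enable_mux_exchange` = false;
> explain plan for select ... ;  -- re-run the query above and compare
> {code}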



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7114) ANALYZE command generates warnings for stats file and materialization

2019-03-19 Thread Gautam Parai (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796306#comment-16796306
 ] 

Gautam Parai commented on DRILL-7114:
-

[~vitalii] yes you are right - this happens for all queries. I plan to address 
both issues as part of this JIRA.
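
For context, both warnings can be reproduced by a plain ANALYZE over a table 
that has no stats file yet (a sketch; {{dfs.tmp}} is an assumed workspace 
mapping for the {{/tmp/orders3}} path in the log):

{code:sql}
analyze table dfs.tmp.`orders3` compute statistics;
-- the command itself should not try to read
-- /tmp/orders3/.stats.drill/0_0.json before writing it
{code}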

> ANALYZE command generates warnings for stats file and materialization
> -
>
> Key: DRILL-7114
> URL: https://issues.apache.org/jira/browse/DRILL-7114
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization
>Reporter: Aman Sinha
>Assignee: Gautam Parai
>Priority: Minor
> Fix For: 1.16.0
>
>
> When I run ANALYZE, I see warnings in the log file as shown below. The 
> ANALYZE command should not try to read the stats file or materialize the 
> stats.  
> {noformat}
> 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to read the stats file.
> java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does 
> not exist
> 12:04:32.939 [2370143e-c419-f33c-d879-84989712bc85:foreman] WARN  
> o.a.d.e.p.common.DrillStatsTable - Failed to materialize the stats. 
> Continuing without stats.
> java.io.FileNotFoundException: File /tmp/orders3/.stats.drill/0_0.json does 
> not exist
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-7085) Drill Statistics: analyze table cmd fails

2019-03-08 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai reassigned DRILL-7085:
---

Assignee: Gautam Parai

> Drill Statistics: analyze table cmd fails
> -
>
> Key: DRILL-7085
> URL: https://issues.apache.org/jira/browse/DRILL-7085
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Anisha Reddy
>Assignee: Gautam Parai
>Priority: Blocker
> Fix For: 1.16.0
>
>
> On the latest build, the regression results show that the ANALYZE command is 
> failing.
> This is the dir:
> {code:java}
> [root@qa101-108 bin]# hadoop fs -ls /drill/testdata/table_stats/data_with_0 
> Found 1 items -rwxr-xr-x 3 root root 487 2019-03-05 18:26 
> /drill/testdata/table_stats/data_with_0/0_0_0.parquet
> {code}
> {code:java}
> 0: jdbc:drill:drillbit=10.10.101.108> set `planner.statistics.use`=true;
> +---+--+
> |  ok   | summary  |
> +---+--+
> | true  | planner.statistics.use updated.  |
> +---+--+
> 1 row selected (0.164 seconds)
> {code}
> {code:java}
> 0: jdbc:drill:drillbit=10.10.101.108> analyze table `table_stats/data_with_0` 
> compute statistics;
> ++--+
> |   ok   | summary
>   |
> ++--+
> | false  | Table table_stats/data_with_0 is not supported by ANALYZE. Support 
> is currently limited to directory-based Parquet tables. |
> ++--+
> 1 row selected (0.901 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (DRILL-1328) Support table statistics

2019-02-28 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai resolved DRILL-1328.
-
Resolution: Fixed

Merged with commit IDs 

469be17597e7b7c6bc1de9863dcb6c5604a55f0c

3233d8aaff57ac71bd3b726efcd5fdaa92aef861

> Support table statistics
> 
>
> Key: DRILL-1328
> URL: https://issues.apache.org/jira/browse/DRILL-1328
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Cliff Buchanan
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.16.0
>
> Attachments: 0001-PRE-Set-value-count-in-splitAndTransfer.patch
>
>
> This consists of several subtasks
> * implement operators to generate statistics
> * add "analyze table" support to parser/planner
> * create a metadata provider to allow statistics to be used by optiq in 
> planning optimization
> * implement statistics functions
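> A usage sketch of that support (syntax as exercised in DRILL-7085 in this 
> thread; the table path is a placeholder):
> {code:sql}
> set `planner.statistics.use` = true;
> analyze table dfs.`/path/to/parquet_table` compute statistics;
> {code}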
> Right now, the bulk of this functionality is implemented, but it hasn't been 
> rigorously tested and needs to have some definite answers for some of the 
> parts "around the edges" (how analyze table figures out where the table 
> statistics are located, how a table "append" should work in a read-only file 
> system).
> Also, here are a few known caveats:
> * table statistics are collected by creating a SQL query based on the string 
> path of the table. This should probably be done with a Table reference.
> * Case sensitivity for column statistics is probably iffy
> * Math for combining two column NDVs into a joint NDV should be checked.
> * Schema changes aren't really being considered yet.
> * adding getDrillTable is probably unnecessary; it might be better to do 
> getTable().unwrap(DrillTable.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-1328) Support table statistics

2019-02-28 Thread Gautam Parai (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780922#comment-16780922
 ] 

Gautam Parai commented on DRILL-1328:
-

Reopened the PR to address the most critical review comments. I will merge the 
PR based upon approval from the reviewer(s) - this will avoid the costly rebase 
work which comes with keeping the PR open for such a long time. More work will 
be done as part of subsequent JIRAs.

> Support table statistics
> 
>
> Key: DRILL-1328
> URL: https://issues.apache.org/jira/browse/DRILL-1328
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Cliff Buchanan
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.16.0
>
> Attachments: 0001-PRE-Set-value-count-in-splitAndTransfer.patch
>
>
> This consists of several subtasks
> * implement operators to generate statistics
> * add "analyze table" support to parser/planner
> * create a metadata provider to allow statistics to be used by optiq in 
> planning optimization
> * implement statistics functions
> Right now, the bulk of this functionality is implemented, but it hasn't been 
> rigorously tested and needs to have some definite answers for some of the 
> parts "around the edges" (how analyze table figures out where the table 
> statistics are located, how a table "append" should work in a read-only file 
> system).
> Also, here are a few known caveats:
> * table statistics are collected by creating a SQL query based on the string 
> path of the table. This should probably be done with a Table reference.
> * Case sensitivity for column statistics is probably iffy
> * Math for combining two column NDVs into a joint NDV should be checked.
> * Schema changes aren't really being considered yet.
> * adding getDrillTable is probably unnecessary; it might be better to do 
> getTable().unwrap(DrillTable.class)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6899) Fix timestamp issues in unit tests ignored with DRILL-6833

2018-12-12 Thread Gautam Parai (JIRA)
Gautam Parai created DRILL-6899:
---

 Summary: Fix timestamp issues in unit tests ignored with DRILL-6833
 Key: DRILL-6899
 URL: https://issues.apache.org/jira/browse/DRILL-6899
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.15.0
Reporter: Gautam Parai
Assignee: Gautam Parai
 Fix For: 1.16.0


{{The following tests were disabled in the PR for DRILL-6833}}

{{IndexPlanTest.testCastTimestampPlan() - Re-enable after the MapRDB format 
plugin issue is fixed.}}
{{IndexPlanTest.testRowkeyJoinPushdown_13() - Re-enable the testcase after 
fixing the execution issue with HashJoin used as Rowkeyjoin.}}
{{IndexPlanTest.testRowkeyJoinPushdown_12() - Remove the testcase since the 
SemiJoin transformation makes the rowkeyjoinpushdown transformation invalid.}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6501) Revert/modify fix for DRILL-6212 after CALCITE-2223 is fixed

2018-12-11 Thread Gautam Parai (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718497#comment-16718497
 ] 

Gautam Parai commented on DRILL-6501:
-

This may also be the root cause of 
https://issues.apache.org/jira/browse/DRILL-6897. Needs further investigation.

> Revert/modify fix for DRILL-6212 after CALCITE-2223 is fixed
> 
>
> Key: DRILL-6501
> URL: https://issues.apache.org/jira/browse/DRILL-6501
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.14.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.16.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> DRILL-6212 is a temporary fix to alleviate issues due to CALCITE-2223. Once 
> CALCITE-2223 is fixed, this change needs to be reverted, which would require 
> DrillProjectMergeRule to go back to extending the ProjectMergeRule. Please 
> take a look at how CALCITE-2223 is eventually fixed (as of now it is still not 
> clear which fix is the way to go). Depending on the fix, we may need to do 
> additional work to integrate these changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6501) Revert/modify fix for DRILL-6212 after CALCITE-2223 is fixed

2018-12-11 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-6501:

Fix Version/s: 1.16.0

> Revert/modify fix for DRILL-6212 after CALCITE-2223 is fixed
> 
>
> Key: DRILL-6501
> URL: https://issues.apache.org/jira/browse/DRILL-6501
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.14.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.16.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> DRILL-6212 is a temporary fix to alleviate issues due to CALCITE-2223. Once 
> CALCITE-2223 is fixed, this change needs to be reverted, which would require 
> DrillProjectMergeRule to go back to extending the ProjectMergeRule. Please 
> take a look at how CALCITE-2223 is eventually fixed (as of now it is still not 
> clear which fix is the way to go). Depending on the fix, we may need to do 
> additional work to integrate these changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6882) Handle the cases where RowKeyJoin's left pipeline is called multiple times.

2018-12-05 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-6882:

Description: 
RowKeyJoin operator's leftStream pipeline uses the IterOutcome.NONE state to 
repeat the pipeline multiple times. This should be avoided as IterOutcome.NONE 
will trip assertion checks in the validator. We ran into this issue when trying 
to fix DRILL-6878.

Sample unit test failure from DRILL-6878

select t1.id.ssn as ssn from hbase.`index_test_primary` t1, 
hbase.`index_test_primary` t2 where t1._id = t2.rowid and 
cast(t2.activity.irs.firstlogin as timestamp) = to_timestamp('2013-02-04 
22:34:38.0', 'yyyy-MM-dd HH:mm:ss.S')

 at 
org.apache.drill.test.DrillTestWrapper.compareMergedOnHeapVectors(DrillTestWrapper.java:630)
 ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]

 at 
org.apache.drill.test.DrillTestWrapper.compareOrderedResults(DrillTestWrapper.java:584)
 ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]

 at org.apache.drill.test.DrillTestWrapper.run(DrillTestWrapper.java:157) 
~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]

 at org.apache.drill.test.TestBuilder.go(TestBuilder.java:139) 
~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]

 at 
com.mapr.drill.maprdb.tests.index.IndexPlanTest.testRowkeyJoinPushdown_7(IndexPlanTest.java:1828)
 ~[test-classes/:na]

 at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_141]

Caused by: org.apache.drill.exec.rpc.RpcException: 
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
IllegalStateException: next() [on #851, ScanBatch] called again after it 
returned NONE.  Caller should not have called next() again.

 

  was:
RowKeyJoin operator's leftStream pipeline uses the IterOutcome.NONE state to 
repeat the pipeline multiple times. This should be avoided as IterOutcome.NONE 
will trip assertion checks in the validator. We ran into this issue when trying 
to fix DRILL-6878.

 


> Handle the cases where RowKeyJoin's left pipeline is called multiple times.
> --
>
> Key: DRILL-6882
> URL: https://issues.apache.org/jira/browse/DRILL-6882
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Blocker
> Fix For: 1.15.0
>
>
> RowKeyJoin operator's leftStream pipeline uses the IterOutcome.NONE state to 
> repeat the pipeline multiple times. This should be avoided as 
> IterOutcome.NONE will trip assertion checks in the validator. We ran into this 
> issue when trying to fix DRILL-6878.
> Sample unit test failure from DRILL-6878
> select t1.id.ssn as ssn from hbase.`index_test_primary` t1, 
> hbase.`index_test_primary` t2 where t1._id = t2.rowid and 
> cast(t2.activity.irs.firstlogin as timestamp) = to_timestamp('2013-02-04 
> 22:34:38.0', 'yyyy-MM-dd HH:mm:ss.S')
>  at 
> org.apache.drill.test.DrillTestWrapper.compareMergedOnHeapVectors(DrillTestWrapper.java:630)
>  ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]
>  at 
> org.apache.drill.test.DrillTestWrapper.compareOrderedResults(DrillTestWrapper.java:584)
>  ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]
>  at org.apache.drill.test.DrillTestWrapper.run(DrillTestWrapper.java:157) 
> ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]
>  at org.apache.drill.test.TestBuilder.go(TestBuilder.java:139) 
> ~[drill-java-exec-1.15.0-SNAPSHOT-tests.jar:1.15.0-SNAPSHOT]
>  at 
> com.mapr.drill.maprdb.tests.index.IndexPlanTest.testRowkeyJoinPushdown_7(IndexPlanTest.java:1828)
>  ~[test-classes/:na]
>  at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_141]
> Caused by: org.apache.drill.exec.rpc.RpcException: 
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
> IllegalStateException: next() [on #851, ScanBatch] called again after it 
> returned NONE.  Caller should not have called next() again.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6882) Handle the cases where RowKeyJoin's left pipeline is called multiple times.

2018-12-05 Thread Gautam Parai (JIRA)
Gautam Parai created DRILL-6882:
---

 Summary: Handle the cases where RowKeyJoin's left pipeline is called 
multiple times.
 Key: DRILL-6882
 URL: https://issues.apache.org/jira/browse/DRILL-6882
 Project: Apache Drill
  Issue Type: Bug
Reporter: Gautam Parai
Assignee: Gautam Parai


RowKeyJoin operator's leftStream pipeline uses the IterOutcome.NONE state to 
repeat the pipeline multiple times. This should be avoided as IterOutcome.NONE 
will trip assertion checks in the validator.
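
As an illustration of the contract the validator enforces, a self-contained toy 
model (hypothetical classes; Drill's real interface is 
org.apache.drill.exec.record.RecordBatch with its IterOutcome enum):

{code:java}
enum IterOutcome { OK, NONE }

class ToyBatch {
  private int remaining = 2;
  private boolean done = false;

  IterOutcome next() {
    if (done) {
      // the condition the iterator validator asserts on
      throw new IllegalStateException(
          "next() called again after it returned NONE");
    }
    if (remaining-- > 0) {
      return IterOutcome.OK;
    }
    done = true;
    return IterOutcome.NONE;
  }
}

public class IterContractDemo {
  public static void main(String[] args) {
    ToyBatch left = new ToyBatch();
    while (left.next() != IterOutcome.NONE) {
      // process the batch
    }
    left.next();  // repeating the left pipeline after NONE trips the check
  }
}
{code}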

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6882) Handle the cases where RowKeyJoin's left pipeline is called multiple times.

2018-12-05 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-6882:

Description: 
RowKeyJoin operator's leftStream pipeline uses the IterOutcome.NONE state to 
repeat the pipeline multiple times. This should be avoided as IterOutcome.NONE 
will trip assertion checks in the validator. We ran into this issue when trying 
to fix DRILL-6878.

 

  was:
RowKeyJoin operator's leftStream pipeline uses the IterOutcome.NONE state to 
repeat the pipeline multiple times. This should be avoided as IterOutcome.NONE 
will trip assertion checks in the validator.

 


> Handle the cases where RowKeyJoin's left pipeline is called multiple times.
> --
>
> Key: DRILL-6882
> URL: https://issues.apache.org/jira/browse/DRILL-6882
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Blocker
> Fix For: 1.15.0
>
>
> RowKeyJoin operator's leftStream pipeline uses the IterOutcome.NONE state to 
> repeat the pipeline multiple times. This should be avoided as 
> IterOutcome.NONE will trip assertion checks in the validator. We ran into this 
> issue when trying to fix DRILL-6878.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6882) Handle the cases where RowKeyJoin's left pipeline is called multiple times.

2018-12-05 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-6882:

Fix Version/s: 1.15.0

> Handle the cases where RowKeyJoin's left pipeline is called multiple times.
> --
>
> Key: DRILL-6882
> URL: https://issues.apache.org/jira/browse/DRILL-6882
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Blocker
> Fix For: 1.15.0
>
>
> RowKeyJoin operator's leftStream pipeline uses the IterOutcome.NONE state to 
> repeat the pipeline multiple times. This should be avoided as 
> IterOutcome.NONE will trip assertion checks in the validator.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6882) Handle the cases where RowKeyJoin's left pipeline is called multiple times.

2018-12-05 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-6882:

Affects Version/s: 1.15.0

> Handle the cases where RowKeyJoin's left pipeline is called multiple times.
> --
>
> Key: DRILL-6882
> URL: https://issues.apache.org/jira/browse/DRILL-6882
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Blocker
> Fix For: 1.15.0
>
>
> RowKeyJoin operator's leftStream pipeline uses the IterOutcome.NONE state to 
> repeat the pipeline multiple times. This should be avoided as 
> IterOutcome.NONE will trip assertion checks in the validator.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6878) Index Plan Test failures after DRILL-6833

2018-12-03 Thread Gautam Parai (JIRA)
Gautam Parai created DRILL-6878:
---

 Summary: Index Plan Test failures after DRILL-6833
 Key: DRILL-6878
 URL: https://issues.apache.org/jira/browse/DRILL-6878
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.15.0
Reporter: Gautam Parai
Assignee: Gautam Parai
 Fix For: 1.15.0


[ERROR] Failures: 
[ERROR] IndexPlanTest.testCastTimestampPlan:1148 Incorrect number of rows 
returned by query. expected:<1> but was:<0>
[ERROR] IndexPlanTest.testRowkeyJoinPushdown_1:1702 Did not find expected 
pattern in plan: RowKeyJoin
00-00 Screen
00-01 Project(ssn=[$0])
00-02 Project(ssn=[$1])
00-03 HashJoin(condition=[=($0, $2)], joinType=[inner], semi-join: =[true])
00-05 Project(_id=[$0], ITEM=[ITEM($1, 'ssn')])
00-06 Scan(table=[[hbase, index_test_primary]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec [tableName=maprfs:///tmp/index_test_primary, 
condition=null], columns=[`_id`, `id`.`ssn`], maxwidth=2]])
00-04 Scan(table=[[hbase, index_test_primary]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec [tableName=maprfs:///tmp/index_test_primary, 
condition=($0 =

{"$date":"2013-02-04T22:34:38.000Z"}

), indexName=hash_i_cast_timestamp_firstlogin], columns=[`_id`], maxwidth=10]])

[ERROR] IndexPlanTest.testRowkeyJoinPushdown_10:1883 Incorrect number of rows 
returned by query. expected:<1> but was:<0>
[ERROR] IndexPlanTest.testRowkeyJoinPushdown_12:1918 Did not find expected 
pattern in plan: HashJoin(.*[
])+.*Scan.*indexName=i_state_city_dl(.*[
])+.*RowKeyJoin(.*[
])+.*RestrictedJsonTableGroupScan(.*[
])+.*HashAgg(group=[\{0}])(.*[
])+.*HashJoin
00-00 Screen
00-01 Project(ssn=[$0])
00-02 Project(ssn=[$1])
00-03 HashJoin(condition=[=($0, $5)], joinType=[inner], semi-join: =[true])
00-05 HashJoin(condition=[=($2, $3)], joinType=[inner], semi-join: =[false])
00-08 Project(_id=[$0], ITEM=[ITEM($1, 'ssn')], ITEM2=[ITEM($2, 'city')])
00-11 Scan(table=[[hbase, index_test_primary]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec [tableName=maprfs:///tmp/index_test_primary, 
condition=null], columns=[`_id`, `id`.`ssn`, `address`.`city`], maxwidth=2]])
00-07 Project(ITEM0=[$0], ITEM1=[$1])
00-10 Project(ITEM=[ITEM($0, 'city')], ITEM1=[ITEM($0, 'state')])
00-14 Scan(table=[[hbase, index_test_primary]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec [tableName=maprfs:///tmp/index_test_primary, 
condition=(address.state = "pc"), indexName=i_state_city_dl], 
columns=[`address`.`city`, `address`.`state`], maxwidth=1]])
00-04 Project(EXPR$0=[CAST(CAST($0):INTEGER):VARCHAR(10) CHARACTER SET 
"UTF-16LE" COLLATE "UTF-16LE$en_US$primary"])
00-06 Project(rowid=[$1], ITEM=[$2], ITEM2=[$3], ITEM0=[$0])
00-09 HashJoin(condition=[=($2, $0)], joinType=[inner], semi-join: =[false])
00-13 Project(ITEM=[ITEM($0, 'city')])
00-16 Scan(table=[[hbase, index_test_primary]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec [tableName=maprfs:///tmp/index_test_primary, 
condition=null], columns=[`address`.`city`], maxwidth=2]])
00-12 Project(rowid=[$0], ITEM0=[$1], ITEM2=[$2])
00-15 SelectionVectorRemover
00-17 Filter(condition=[=(CAST($2):TIMESTAMP(0), 2013-02-04 22:34:38)])
00-18 Project(rowid=[$0], ITEM=[ITEM($1, 'city')], ITEM2=[ITEM(ITEM($2, 'irs'), 
'firstlogin')])
00-19 Scan(table=[[hbase, index_test_primary]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec [tableName=maprfs:///tmp/index_test_primary, 
condition=null], columns=[`rowid`, `address`.`city`, 
`activity`.`irs`.`firstlogin`], maxwidth=2]])

[ERROR] IndexPlanTest.testRowkeyJoinPushdown_6:1799 Did not find expected 
pattern in plan: RowKeyJoin
00-00 Screen
00-01 Project(ssn=[$0])
00-02 Project(ssn=[$1])
00-03 HashJoin(condition=[=($0, $2)], joinType=[inner], semi-join: =[true])
00-05 Project(_id=[$0], ITEM=[ITEM($1, 'ssn')])
00-07 Scan(table=[[hbase, index_test_primary]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec [tableName=maprfs:///tmp/index_test_primary, 
condition=null], columns=[`_id`, `id`.`ssn`], maxwidth=2]])
00-04 Project(EXPR$0=[CAST(CAST($0):INTEGER):VARCHAR(10) CHARACTER SET 
"UTF-16LE" COLLATE "UTF-16LE$en_US$primary"])
00-06 Project(rowid=[$1], ITEM=[$2], ITEM2=[$3], ITEM0=[$0])
00-08 HashJoin(condition=[=($2, $0)], joinType=[inner], semi-join: =[false])
00-10 Project(ITEM=[ITEM($0, 'city')])
00-12 Scan(table=[[hbase, index_test_primary]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec [tableName=maprfs:///tmp/index_test_primary, 
condition=null], columns=[`address`.`city`], maxwidth=2]])
00-09 Project(rowid=[$0], ITEM0=[$1], ITEM2=[$2])
00-11 SelectionVectorRemover
00-13 Filter(condition=[=(CAST($2):TIMESTAMP(0), 2013-02-04 22:34:38)])
00-14 Project(rowid=[$0], ITEM=[ITEM($1, 'city')], ITEM2=[ITEM(ITEM($2, 'irs'), 
'firstlogin')])
00-15 Scan(table=[[hbase, index_test_primary]], groupscan=[JsonTableGroupScan 
[ScanSpec=JsonScanSpec 

[jira] [Updated] (DRILL-6833) MapRDB queries with Run Time Filters with row_key/Secondary Index Should Support Pushdown

2018-11-12 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-6833:

Labels: ready-to-commit  (was: )

> MapRDB queries with Run Time Filters with row_key/Secondary Index Should 
> Support Pushdown
> -
>
> Key: DRILL-6833
> URL: https://issues.apache.org/jira/browse/DRILL-6833
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.15.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.15.0
>
>
> Drill should push down all row key filters to MapR-DB for queries that only 
> have WHERE conditions on row_keys. In the following example, the query only 
> has a WHERE clause on row_keys:
> select t.mscIdentities from dfs.root.`/user/mapr/MixTable` t where t.row_key=
> (select max(convert_fromutf8(i.KeyA.ENTRY_KEY)) from 
> dfs.root.`/user/mapr/TableIMSI` i where i.row_key='460021050005636')
> A row_key lookup can return at most 1 row, so the physical plan must leverage 
> the MapR-DB row_key pushdown to execute the subquery and then use its result 
> to execute the outer query. Currently only the inner query is pushed down; 
> the outer query requires a table scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6770) Queries on MapR-DB JSON tables fail with UnsupportedOperationException: Getting number of rows for tablet not supported

2018-11-12 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-6770:

Labels: ready-to-commit  (was: )

> Queries on MapR-DB JSON tables fail with UnsupportedOperationException: 
> Getting number of rows for tablet not supported
> ---
>
> Key: DRILL-6770
> URL: https://issues.apache.org/jira/browse/DRILL-6770
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning  Optimization, Storage - MapRDB
>Affects Versions: 1.15.0
> Environment: MapR 6.1.0
> Drill 1.15.0
>Reporter: Abhishek Girish
>Assignee: Gautam Parai
>Priority: Critical
>  Labels: ready-to-commit
> Fix For: 1.15.0
>
>
> Create a simple MapR-DB JSON table
> {code}
> $ mapr dbshell 
> MapR-DB Shell
> maprdb root:> create /tmp/t1
> Table /tmp/t1 created.
> maprdb root:> insert /tmp/t1 --id '1' --v '{"a":1}'
> Document with id: "1" inserted.
> maprdb root:> find /tmp/t1
> {"_id":"1","a":1}
> 1 document(s) found.
> {code}
> Querying this from Drill fails:
> {code}
> > select * from mfs.`/tmp/t1`;
> Error: SYSTEM ERROR: UnsupportedOperationException: Getting number of rows 
> for tablet not supported
> {code}
> Stack Trace:
> {code}
>   (org.apache.drill.exec.work.foreman.ForemanException) Unexpected exception 
> during fragment initialization: Error while applying rule DrillTableRule, 
> args [rel#1400499:EnumerableTableScan.ENUMERABLE.ANY([]).[](table=[mfs, 
> /tmp/t1])]
> org.apache.drill.exec.work.foreman.Foreman.run():300
> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> java.lang.Thread.run():748
>   Caused By (java.lang.RuntimeException) Error while applying rule 
> DrillTableRule, args 
> [rel#1400499:EnumerableTableScan.ENUMERABLE.ANY([]).[](table=[mfs, /tmp/t1])]
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch():236
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp():648
> org.apache.calcite.tools.Programs$RuleSetProgram.run():339
> 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform():425
> 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform():365
> 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToRawDrel():252
> 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel():314
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan():179
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan():145
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan():83
> org.apache.drill.exec.work.foreman.Foreman.runSQL():584
> org.apache.drill.exec.work.foreman.Foreman.run():272
> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> java.lang.Thread.run():748
>   Caused By (org.apache.drill.common.exceptions.DrillRuntimeException) Error 
> getting region info for table: maprfs:///tmp/t1
> org.apache.drill.exec.store.mapr.db.json.JsonTableGroupScan.init():161
> org.apache.drill.exec.store.mapr.db.json.JsonTableGroupScan.<init>():83
> org.apache.drill.exec.store.mapr.db.MapRDBFormatPlugin.getGroupScan():81
> org.apache.drill.exec.store.dfs.FileSystemPlugin.getPhysicalScan():170
> org.apache.drill.exec.store.AbstractStoragePlugin.getPhysicalScan():117
> org.apache.drill.exec.store.AbstractStoragePlugin.getPhysicalScan():112
> org.apache.drill.exec.planner.logical.DrillTable.getGroupScan():99
> org.apache.drill.exec.planner.logical.DrillScanRel.<init>():90
> org.apache.drill.exec.planner.logical.DrillScanRel.<init>():70
> org.apache.drill.exec.planner.logical.DrillScanRel.<init>():63
> org.apache.drill.exec.planner.logical.DrillScanRule.onMatch():38
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch():212
> org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp():648
> org.apache.calcite.tools.Programs$RuleSetProgram.run():339
> 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform():425
> 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform():365
> 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToRawDrel():252
> 
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel():314
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.getPlan():179
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan():145
> org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan():83
> org.apache.drill.exec.work.foreman.Foreman.runSQL():584
> 

[jira] [Created] (DRILL-6841) Statistics ANALYZE statement should be more flexible

2018-11-09 Thread Gautam Parai (JIRA)
Gautam Parai created DRILL-6841:
---

 Summary: Statistics ANALYZE statement should be more flexible
 Key: DRILL-6841
 URL: https://issues.apache.org/jira/browse/DRILL-6841
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Gautam Parai


Investigate and implement other alternatives to the ANALYZE statement, e.g. 
SELECT ndv(id) FROM <table>, or enhance the ANALYZE statement to support such 
options. This may help users experiment easily and quickly.
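
A sketch of the two shapes being compared (the ndv() form is the proposal here, 
not an existing statement; table and column names are placeholders):

{code:sql}
-- today: whole-table statistics collection
analyze table dfs.`/path/to/table` compute statistics;

-- proposed: ad-hoc, per-column experimentation
select ndv(id) from dfs.`/path/to/table`;
{code}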



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6833) MapRDB queries with Run Time Filters with row_key/Secondary Index Should Support Pushdown

2018-11-06 Thread Gautam Parai (JIRA)
Gautam Parai created DRILL-6833:
---

 Summary: MapRDB queries with Run Time Filters with 
row_key/Secondary Index Should Support Pushdown
 Key: DRILL-6833
 URL: https://issues.apache.org/jira/browse/DRILL-6833
 Project: Apache Drill
  Issue Type: New Feature
Affects Versions: 1.15.0
Reporter: Gautam Parai
Assignee: Gautam Parai
 Fix For: 1.15.0


Drill should push down all row key filters to MapR-DB for queries that only 
have WHERE conditions on row_keys. In the following example, the query only has 
a WHERE clause on row_keys:

select t.mscIdentities from dfs.root.`/user/mapr/MixTable` t where t.row_key=
(select max(convert_fromutf8(i.KeyA.ENTRY_KEY)) from 
dfs.root.`/user/mapr/TableIMSI` i where i.row_key='460021050005636')

A row_key lookup can return at most 1 row, so the physical plan must leverage 
the MapR-DB row_key pushdown to execute the subquery and then use its result to 
execute the outer query. Currently only the inner query is pushed down; the 
outer query requires a table scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6824) Drill Query on MapRDB JSON table failing with SchemaChangeException, the only distinct Values are NULL and Text

2018-10-31 Thread Gautam Parai (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautam Parai updated DRILL-6824:

Description: 
Drill query on a MapR DB JSON Table or its View fails with a 
SchemaChangeException. The only distinct values are NULL and some Text.

The issue is that when Drill sees a NULL, or does not see any values for a 
column in the first batch, it assumes the type is nullable INT. Subsequently, 
when data shows up that differs from nullable INT, there is a schema change. 
Operators (e.g. aggregators) cannot handle such a schema change and throw a 
SchemaChangeException.

 

The short-term solution implemented in this fix: add a CAST expression that 
casts null values to the target type, so we never see a schema change due to 
NULLs. However, the MapRDB reader code was written differently from the other 
readers and was still hitting a SchemaChangeException. The code was changed to 
make it behave like the other scans, i.e. emit a new batch whenever a schema 
change is encountered.
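
A sketch of the CAST shape described above (hypothetical table and column 
names):

{code:sql}
-- pin a concrete type so an all-NULL first batch is not typed as nullable INT
select cast(t.`a` as varchar(100)) as a
from mfs.`/tmp/some_json_table` t;
{code}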

  was:Drill query on a MapR DB JSON Table or its View fails with a 
SchemaChangeException. The only distinct values are NULL and some Text.


> Drill Query on MapRDB JSON table failing with SchemaChangeException, the 
> only distinct Values are NULL and Text
> 
>
> Key: DRILL-6824
> URL: https://issues.apache.org/jira/browse/DRILL-6824
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.14.0, 1.15.0
>Reporter: Gautam Parai
>Assignee: Gautam Parai
>Priority: Major
> Fix For: 1.15.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Drill query on a MapR DB JSON Table or its View fails with a 
> SchemaChangeException. The only distinct values are NULL and some Text.
> The issue is that when Drill sees a NULL, or does not see any values for a 
> column in the first batch, it assumes the type is nullable INT. Subsequently, 
> when data shows up that differs from nullable INT, there is a schema change. 
> Operators (e.g. aggregators) cannot handle such a schema change and throw a 
> SchemaChangeException.
>  
> The short-term solution implemented in this fix: add a CAST expression that 
> casts null values to the target type, so we never see a schema change due to 
> NULLs. However, the MapRDB reader code was written differently from the other 
> readers and was still hitting a SchemaChangeException. The code was changed 
> to make it behave like the other scans, i.e. emit a new batch whenever a 
> schema change is encountered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6824) Drill Query on MapRDB JSON table failing with SchemaChangeException, the only distinct Values are NULL and Text

2018-10-31 Thread Gautam Parai (JIRA)
Gautam Parai created DRILL-6824:
---

 Summary: Drill Query on MapRDB JSON table failing with 
SchemaChangeException, the only distinct Values are NULL and Text
 Key: DRILL-6824
 URL: https://issues.apache.org/jira/browse/DRILL-6824
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.14.0, 1.15.0
Reporter: Gautam Parai
Assignee: Gautam Parai
 Fix For: 1.15.0


Drill query on a MapR DB JSON Table or its View fails with a 
SchemaChangeException. The only distinct values are NULL and some Text.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)