Rahul Challapalli created DRILL-5604:
----------------------------------------

             Summary: Possible performance degradation with hash aggregate when 
number of distinct keys increases
                 Key: DRILL-5604
                 URL: https://issues.apache.org/jira/browse/DRILL-5604
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Relational Operators
    Affects Versions: 1.11.0
            Reporter: Rahul Challapalli


git.commit.id.abbrev=90f43bf

I tracked the runtime while gradually increasing the number of distinct keys 
without increasing the total number of records. Below is one such test on the 
TPC-DS SF1000 dataset:

{code}
0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_list_price) from store_sales;
+---------+
| EXPR$0  |
+---------+
| 19736   |
+---------+
1 row selected (163.345 seconds)
0: jdbc:drill:zk=10.10.100.190:5181> select count(distinct ss_net_profit) from store_sales;
+----------+
|  EXPR$0  |
+----------+
| 1525675  |
+----------+
1 row selected (2094.962 seconds)
{code}

In both of the above queries, the hash agg code processed 2879987999 records, 
so the time difference comes from overheads such as hash table resizing. The 
second query took ~32 minutes longer than the first (2094.962 s vs. 163.345 s, 
roughly 13x) while the number of distinct keys grew ~77x, which raises doubts 
about whether there is an issue somewhere.

The dataset is too large to attach to a JIRA, and so are the logs.
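
Since the real data cannot be attached, the shape of the experiment (fixed 
total record count, varying number of distinct keys) can be sketched at a much 
smaller scale. The snippet below is only an illustration, not Drill's hash 
aggregate code: it uses a Python set as a stand-in hash table, the function 
names count_distinct and time_run are hypothetical, and the record and key 
counts are scaled-down assumptions rather than the TPC-DS values.

```python
import time


def count_distinct(total_records, distinct_keys):
    """Hash-based COUNT(DISTINCT): insert every key into a set, return its size."""
    seen = set()
    for i in range(total_records):
        # Synthetic key (assumption): cycles through `distinct_keys` values,
        # so the total number of records stays fixed while cardinality varies.
        seen.add(i % distinct_keys)
    return len(seen)


def time_run(total_records, distinct_keys):
    """Return (distinct count, elapsed seconds) for one run."""
    start = time.perf_counter()
    result = count_distinct(total_records, distinct_keys)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    TOTAL = 2_000_000  # scaled-down stand-in for the 2879987999 records
    for keys in (20_000, 1_500_000):  # low vs. high cardinality
        distinct, seconds = time_run(TOTAL, keys)
        print(f"distinct={distinct} elapsed={seconds:.3f}s")
```

Comparing the two elapsed times gives a rough feel for how much extra cost 
comes purely from a larger hash table when the input size is unchanged. The 
absolute numbers say nothing about Drill itself.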



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
