[jira] [Updated] (DRILL-5293) Poor performance of Hash Table due to same hash value as distribution below

2017-03-27 Thread Suresh Ollala (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Ollala updated DRILL-5293:
-
Reviewer: Kunal Khatua  (was: Chunhui Shi)

> Poor performance of Hash Table due to same hash value as distribution below
> ---
>
> Key: DRILL-5293
> URL: https://issues.apache.org/jira/browse/DRILL-5293
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Codegen
> Affects Versions: 1.8.0
> Reporter: Boaz Ben-Zvi
> Assignee: Boaz Ben-Zvi
> Labels: ready-to-commit
> Fix For: 1.10.0
>
>
> The hash value is computed in essentially the same way for the Hash Table
> (used by Hash Agg and Hash Join) and for the distribution of rows at the
> exchange. As a result, a given Hash Table (in a parallel minor fragment)
> receives only the rows that the partition below ("upstream") routed to that
> fragment, and the pattern of that routing leads to non-uniform usage of the
> hash buckets in the table.
>   Here is a simplified example: an exchange partitions into TWO minor
> fragments, each running a Hash Agg. The partition sends rows with EVEN hash
> values to the first fragment and rows with ODD hash values to the second.
> The first fragment then recomputes the _same_ hash value for its Hash Table,
> so only the even buckets get used. (With a partition into EIGHT, possibly
> only one eighth of the buckets would be used.)
> This leads to longer hash chains and thus to _poor performance_.
> A possible solution: add a distribution function distFunc (used only for
> partitioning) that takes the hash value and "scrambles" it so that the
> entropy in all the bits affects the low bits of the output. This function
> should be applied (in HashPrelUtil) over the generated code that produces the
> hash value, like:
>     distFunc( hash32(field1, hash32(field2, hash32(field3, 0))) );
> Tested with a huge hash aggregate (64 M rows) and a parallelism of 8
> (planner.width.max_per_node = 8): minor fragments 0 and 4 used only 1/8 of
> their buckets, while the others used 1/4 of their buckets. The variance may
> arise because distribution uses "hash32AsDouble" whereas hash agg uses
> "hash32".
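
The effect described above can be reproduced with a small standalone simulation. This is only a sketch under stated assumptions, not Drill code: hash32 here is a stand-in (the MurmurHash3 finalizer), distFunc is one hypothetical scrambler among many possible ones, and both the partition and the bucket are assumed to be chosen by masking the low bits of a power-of-two-sized range. The class name, constants, and helper method are invented for illustration.

```java
import java.util.BitSet;

class HashSkewDemo {
    static final int PARTITIONS = 8;   // minor fragments fed by the exchange (assumed power of two)
    static final int BUCKETS = 1024;   // hash table buckets per fragment (assumed power of two)
    static final int ROWS = 1 << 20;   // synthetic keys 0 .. ROWS-1

    // Stand-in 32-bit hash (the MurmurHash3 finalizer); NOT Drill's hash32.
    static int hash32(int key) {
        int h = key;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Hypothetical distFunc: folds entropy from the high bits into the low bits.
    static int distFunc(int h) {
        h *= 0x9E3779B9;        // odd multiplier, so the mapping stays one-to-one
        return h ^ (h >>> 16);  // low bits now also depend on the upper half
    }

    // Counts the distinct buckets that fragment 0's hash table ends up using.
    static int bucketsUsedInFragment0(boolean scramble) {
        BitSet used = new BitSet(BUCKETS);
        for (int key = 0; key < ROWS; key++) {
            int h = hash32(key);
            // Partitioning uses either the raw hash or the scrambled one.
            int partition = (scramble ? distFunc(h) : h) & (PARTITIONS - 1);
            if (partition == 0) {
                // The hash table always indexes with the plain hash value.
                used.set(h & (BUCKETS - 1));
            }
        }
        return used.cardinality();
    }

    public static void main(String[] args) {
        System.out.println("same hash: " + bucketsUsedInFragment0(false) + " of " + BUCKETS);
        System.out.println("scrambled: " + bucketsUsedInFragment0(true) + " of " + BUCKETS);
    }
}
```

Without scrambling, every row that reaches fragment 0 has its three low hash bits equal to zero, so the table can touch at most 1024 / 8 = 128 buckets. With the scrambler applied only on the partitioning side, the low bits seen by the hash table are no longer fixed by the partition choice.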



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (DRILL-5293) Poor performance of Hash Table due to same hash value as distribution below

2017-03-03 Thread Jinfeng Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinfeng Ni updated DRILL-5293:
--
Labels: ready-to-commit  (was: )



[jira] [Updated] (DRILL-5293) Poor performance of Hash Table due to same hash value as distribution below

2017-02-27 Thread Zelaine Fong (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zelaine Fong updated DRILL-5293:

Reviewer: Chunhui Shi

Assigned Reviewer to [~cshi]
