Chao Sun created HIVE-15796:
-------------------------------

             Summary: HoS: poor reducer parallelism when operator stats are not accurate
                 Key: HIVE-15796
                 URL: https://issues.apache.org/jira/browse/HIVE-15796
             Project: Hive
          Issue Type: Improvement
          Components: Statistics
    Affects Versions: 2.2.0
            Reporter: Chao Sun
            Assignee: Chao Sun

In HoS we currently use operator stats to determine reducer parallelism. However, operator stats are often inaccurate, especially when column stats are not available. This can sometimes produce extremely poor reducer parallelism and cause HoS queries to run forever. This JIRA proposes an alternative way to compute reducer parallelism, similar to how MR does it. Here's the approach we are suggesting:

1. when computing the parallelism for a MapWork, use the stats associated with its TableScan operator;
2. when computing the parallelism for a ReduceWork, use the *maximum* parallelism from all its parents.
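
The two rules above could be sketched roughly as follows. This is only an illustration, not Hive's actual code: the `Work` class, its fields, and the `parallelism` method are hypothetical stand-ins, and the sizing formula (input bytes divided by a bytes-per-reducer target, capped at a maximum, in the spirit of `hive.exec.reducers.bytes.per.reducer` and `hive.exec.reducers.max`) is an assumption about what "similar to how MR does it" could look like.

```java
import java.util.Collections;
import java.util.List;

class ParallelismSketch {
    // Hypothetical stand-in for a MapWork/ReduceWork node; not Hive's real classes.
    static final class Work {
        final boolean isMap;
        final long tableScanBytes;  // map works only: data size taken from TableScan stats
        final List<Work> parents;   // reduce works only: upstream works

        private Work(boolean isMap, long tableScanBytes, List<Work> parents) {
            this.isMap = isMap;
            this.tableScanBytes = tableScanBytes;
            this.parents = parents;
        }

        static Work map(long tableScanBytes) {
            return new Work(true, tableScanBytes, Collections.<Work>emptyList());
        }

        static Work reduce(List<Work> parents) {
            return new Work(false, 0L, parents);
        }
    }

    // Rule 1: a map work is sized from its TableScan stats only.
    // Rule 2: a reduce work takes the maximum parallelism among its parents.
    static int parallelism(Work w, long bytesPerReducer, int maxReducers) {
        if (w.isMap) {
            // Ceiling division without risking overflow on very large inputs.
            long n = w.tableScanBytes / bytesPerReducer
                    + (w.tableScanBytes % bytesPerReducer == 0 ? 0 : 1);
            // Assumed clamp: at least 1, at most maxReducers.
            return (int) Math.max(1L, Math.min((long) maxReducers, n));
        }
        int best = 1;
        for (Work parent : w.parents) {
            best = Math.max(best, parallelism(parent, bytesPerReducer, maxReducers));
        }
        return best;
    }
}
```

Under these assumptions, a ReduceWork fed by two MapWorks that would get 4 and 2 tasks respectively ends up with 4, regardless of what the intermediate operator stats claim. A reduce work whose parent is another reduce work inherits that value recursively.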