[ https://issues.apache.org/jira/browse/DRILL-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dechang Gu closed DRILL-4194.
-----------------------------

Verified with the HIVE test case in the Drill perf test framework. Comparing this commit (bc74629) with the commit right before it (2953fe5), the planning time is improved ~20x:

[root@ucs-node1 private-drill-perf-test-framework]# grep TOTAL log/49_2953fe5_HIVE_20160413_094226/*/*log
log/49_2953fe5_HIVE_20160413_094226/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 185351 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 163891 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 165902 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 137410 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 165910 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 164624 msec

with this fix:

[root@ucs-node1 private-drill-perf-test-framework]# grep TOTAL log/50_bc74629_HIVE_20160413_100625/*/*log
log/50_bc74629_HIVE_20160413_100625/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 8094 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 6801 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 6166 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 11229 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 7533 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 7764 msec

> Improve the performance of metadata fetch operation in HiveScan
> ---------------------------------------------------------------
>
>                 Key: DRILL-4194
>                 URL: https://issues.apache.org/jira/browse/DRILL-4194
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive
>    Affects Versions: 1.4.0
>            Reporter: Venki Korukanti
>            Assignee: Venki Korukanti
>             Fix For: 1.5.0
>
>
> Current HiveScan fetches the InputSplits for all partitions when {{HiveScan}}
> is created. This causes long delays when the table contains a large number of
> partitions. If we end up pruning the majority of partitions, this delay is
> unnecessary.
> We need the InputSplits info from the beginning of planning because
> * it is used in calculating the cost of the {{HiveScan}}. Currently, when
> calculating the cost, we first look at the rowCount (from Hive MetaStore); if
> it is available, we use it in the cost calculation. Otherwise we estimate the
> rowCount from the InputSplits.
> * we also need the InputSplits for determining whether {{HiveScan}} is a
> singleton or distributed, for adding appropriate traits in {{ScanPrule}}.
> The fix is to delay the loading of the InputSplits until we need them. If we
> end up fetching the InputSplits, store them until the query completes. There
> are two cases where we need them:
> * If the stats are not available, then we need the InputSplits.
> * If the partition is not pruned, we need them for parallelization purposes.
> Regarding getting the parallelization info in {{ScanPrule}}: Had a discussion
> with [~amansinha100]. All we need at this point is whether the data is
> distributed or singleton. Added a method {{isSingleton()}} to
> GroupScan. Returning {{false}} seems to work fine for HiveScan, but I am not
> sure of the implications here. We also have {{ExcessiveExchangeIdentifier}},
> which removes unnecessary exchanges by looking at the parallelization info. I
> think it is ok to return the parallelization info here as the pruning must
> have already completed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
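
The lazy-loading idea described in the issue can be sketched roughly as follows. This is only an illustration of the general pattern (defer an expensive fetch, cache the result once fetched, and prefer metastore stats for costing), not the actual Drill change. The class LazySplitScan, its fields, the String stand-in for an InputSplit, and the rowsPerSplitEstimate parameter are all hypothetical names introduced for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;
import java.util.function.Supplier;

/**
 * Hypothetical stand-in for a scan whose split metadata is expensive to fetch.
 * Not Drill's HiveScan; names and types here are assumptions for illustration.
 */
final class LazySplitScan {
    private final OptionalLong metastoreRowCount;      // row count from table stats, if present
    private final Supplier<List<String>> splitLoader;  // the expensive fetch (splits modelled as strings)
    private final long rowsPerSplitEstimate;           // assumed fallback estimate per split
    private List<String> cachedSplits;                 // null until the splits are first needed
    private int loadCount;                             // how many times the loader actually ran

    LazySplitScan(OptionalLong metastoreRowCount,
                  Supplier<List<String>> splitLoader,
                  long rowsPerSplitEstimate) {
        this.metastoreRowCount = metastoreRowCount;
        this.splitLoader = splitLoader;
        this.rowsPerSplitEstimate = rowsPerSplitEstimate;
    }

    /** Loads the splits on first use and keeps them for later callers. */
    synchronized List<String> getSplits() {
        if (cachedSplits == null) {
            cachedSplits = new ArrayList<>(splitLoader.get());
            loadCount++;
        }
        return cachedSplits;
    }

    /** Cost input: use stats when available; fall back to the splits otherwise. */
    long estimateRowCount() {
        if (metastoreRowCount.isPresent()) {
            return metastoreRowCount.getAsLong();
        }
        return getSplits().size() * rowsPerSplitEstimate;
    }

    synchronized int getLoadCount() {
        return loadCount;
    }
}
```

With stats present, estimateRowCount() never triggers the loader; without stats, the first call loads the splits and later calls reuse the cached list.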