[ https://issues.apache.org/jira/browse/DRILL-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dechang Gu closed DRILL-4194.
-----------------------------

Verified with the HIVE test case in the Drill perf test framework. Comparing
this commit (bc74629) with the commit right before it (2953fe5), the planning
time improves by ~20x. Without the fix (2953fe5):

[root@ucs-node1 private-drill-perf-test-framework]# grep TOTAL log/49_2953fe5_HIVE_20160413_094226/*/*log
log/49_2953fe5_HIVE_20160413_094226/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 185351 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 163891 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 165902 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 137410 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 165910 msec
log/49_2953fe5_HIVE_20160413_094226/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 164624 msec



With this fix (bc74629):

[root@ucs-node1 private-drill-perf-test-framework]# grep TOTAL log/50_bc74629_HIVE_20160413_100625/*/*log
log/50_bc74629_HIVE_20160413_100625/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 8094 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 6801 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_expPlan_01/HIVE_expPlan_01.log:[STAT] TOTAL TIME : 6166 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 11229 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 7533 msec
log/50_bc74629_HIVE_20160413_100625/HIVE_limit1_02/HIVE_limit1_02.log:[STAT] TOTAL TIME : 7764 msec
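
For anyone who wants to re-check the ~20x figure, here is a small sketch that
averages the TOTAL TIME values from the two grep outputs and takes the ratio.
The class and method names are made up for this illustration and are not part
of the perf test framework; the numbers are copied from the logs as printed.

```java
import java.util.Arrays;

public class PlanningSpeedup {
    // TOTAL TIME values in msec, copied from the 2953fe5 run (without the fix).
    static final long[] BEFORE = {185351, 163891, 165902, 137410, 165910, 164624};
    // TOTAL TIME values in msec, copied from the bc74629 run (with the fix).
    static final long[] AFTER = {8094, 6801, 6166, 11229, 7533, 7764};

    // Arithmetic mean of the samples; NaN for an empty array.
    static double mean(long[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    // Ratio of the mean time without the fix to the mean time with the fix.
    static double speedup() {
        return mean(BEFORE) / mean(AFTER);
    }

    public static void main(String[] args) {
        System.out.printf("mean before: %.0f msec%n", mean(BEFORE));
        System.out.printf("mean after:  %.0f msec%n", mean(AFTER));
        System.out.printf("speedup:     %.1fx%n", speedup());
    }
}
```

This pools both test cases (HIVE_expPlan_01 and HIVE_limit1_02) into one
average; per-run ratios vary, so treat the result as a rough check only.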

> Improve the performance of metadata fetch operation in HiveScan
> ---------------------------------------------------------------
>
>                 Key: DRILL-4194
>                 URL: https://issues.apache.org/jira/browse/DRILL-4194
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive
>    Affects Versions: 1.4.0
>            Reporter: Venki Korukanti
>            Assignee: Venki Korukanti
>             Fix For: 1.5.0
>
>
> Currently, {{HiveScan}} fetches the InputSplits for all partitions when it is
> created. This causes long delays when the table contains a large number of
> partitions. If we end up pruning the majority of the partitions, this delay
> is unnecessary.
> We need the InputSplits info from the beginning of planning because:
>  * It is used in calculating the cost of the {{HiveScan}}. Currently, when
> calculating the cost, we first look at the rowCount (from the Hive
> MetaStore); if it is available, we use it in the cost calculation.
> Otherwise, we estimate the rowCount from the InputSplits.
>  * We also need the InputSplits to determine whether {{HiveScan}} is a
> singleton or distributed, so that the appropriate traits can be added in
> {{ScanPrule}}.
> The fix is to delay loading the InputSplits until we need them. If we end up
> fetching the InputSplits, we store them until the query completes. There are
> two cases where we need them:
>  * If the stats are not available, we need the InputSplits.
>  * If the partition is not pruned, we need them for parallelization purposes.
> Regarding getting the parallelization info in {{ScanPrule}}: I had a
> discussion with [~amansinha100]. All we need at this point is whether the
> data is distributed or singleton. I added a method {{isSingleton()}} to
> GroupScan. Returning {{false}} seems to work fine for HiveScan, but I am not
> sure of the implications here. We also have {{ExcessiveExchangeIdentifier}},
> which removes unnecessary exchanges by looking at the parallelization info. I
> think it is OK to return the parallelization info here, as the pruning must
> have already completed.
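
The delayed loading described in the issue can be sketched roughly as below.
This is a simplified illustration, not Drill's actual HiveScan code: the
classes Lazy and Scan, the bytesPerRow estimate, and the use of plain split
lengths in place of real InputSplits are all assumptions made for the example.
It shows the two ideas from the description: use the metastore rowCount when
it is available, and otherwise load the splits once and keep them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;
import java.util.function.Supplier;

public class LazySplitsSketch {
    /** Loads a value on first use and keeps it afterwards (not thread-safe). */
    static final class Lazy<T> {
        private final Supplier<T> loader;
        private T value;
        private boolean loaded;
        int loadCount; // how many times the expensive load actually ran

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        T get() {
            if (!loaded) {
                value = loader.get();
                loaded = true;
                loadCount++;
            }
            return value;
        }
    }

    /** Hypothetical stand-in for a scan; split lengths are in bytes. */
    static final class Scan {
        final OptionalLong metastoreRowCount;
        final Lazy<List<Long>> splitLengths;
        final long bytesPerRow; // assumed average row size for the estimate

        Scan(OptionalLong metastoreRowCount, Supplier<List<Long>> splitLoader, long bytesPerRow) {
            this.metastoreRowCount = metastoreRowCount;
            this.splitLengths = new Lazy<>(splitLoader);
            this.bytesPerRow = bytesPerRow;
        }

        /** Prefers metastore stats; touches the splits only when stats are missing. */
        long estimateRowCount() {
            if (metastoreRowCount.isPresent()) {
                return metastoreRowCount.getAsLong();
            }
            long totalBytes = 0;
            for (long length : splitLengths.get()) {
                totalBytes += length;
            }
            return totalBytes / bytesPerRow;
        }
    }

    public static void main(String[] args) {
        Supplier<List<Long>> loader = () -> Arrays.asList(1000L, 2000L, 3000L);

        Scan withStats = new Scan(OptionalLong.of(500), loader, 100);
        System.out.println(withStats.estimateRowCount() + " rows, loads: "
            + withStats.splitLengths.loadCount);

        Scan withoutStats = new Scan(OptionalLong.empty(), loader, 100);
        withoutStats.estimateRowCount();
        System.out.println(withoutStats.estimateRowCount() + " rows, loads: "
            + withoutStats.splitLengths.loadCount);
    }
}
```

In this sketch the scan with stats never triggers the loader, and the scan
without stats triggers it only once even when the estimate is requested
repeatedly, which is the behavior the description asks for.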



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
