[jira] [Reopened] (KYLIN-2165) Use hive table statistics data to get the total count

Dayue Gao (JIRA) Wed, 05 Apr 2017 21:28:58 -0700

     [ 
https://issues.apache.org/jira/browse/KYLIN-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dayue Gao reopened KYLIN-2165:
------------------------------

Hi [~Shaofengshi],

After deploying this feature in our production env, I found two bugs

# JobEngineConfig should use cube specific configurations, otherwise hive set 
statements won't contain overrides (such as queue settings)
# For cube that contains upper case letters in its name, `numRows` is always 0. 
We use Hive 1.2.1 and "fs" StatsDB, and I can reproduce the bug in Hive. To 
handle this situation, I think we can use lowercase name for intermediate table.

> Use hive table statistics data to get the total count
> -----------------------------------------------------
>
>                 Key: KYLIN-2165
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2165
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>            Reporter: Shaofeng SHI
>            Assignee: Shaofeng SHI
>             Fix For: v2.0.0
>
>
> Kylin will count on the intermediate flat hive table to get the total row 
> number, then to redistribute that.
> From hive's wiki, hive will automatically collect the table statistics when 
> run a "insert overwrite" statement, then the subsequent "select count(*)" 
> will be very fast (see 
> https://cwiki.apache.org/confluence/display/Hive/StatsDev). While, Kylin is 
> executing "INSERT OVERWRITE DIRECTORY '/kylin/row_count' SELECT count(*) 
> from", which still cause MR/Tez job be started, this will cause the step take 
> longer time.
> Just change the SQL to "select count(*)" or using Hive API to get the 
> statistic, the cost will be saved. 
> Here is a sample, the table 
> 'kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5' is an 
> intermediate table :
> If directly run "count(*)", it is pretty fast:
> hive> select count(*) from 
> kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
> OK
> 970033
> Time taken: 0.112 seconds, Fetched: 1 row(s)
> While today Kylin's SQL will cause a job be started:
> hive> INSERT OVERWRITE DIRECTORY '/kylin/row_count' select count(*) from 
> kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
> Query ID = root_20161106080808_0099b622-c0bd-41da-aee5-2321adf7bdda
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Job = job_1463701915919_46208, Tracking URL =



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Reopened] (KYLIN-2165) Use hive table statistics data to get the total count

Reply via email to