[ https://issues.apache.org/jira/browse/KYLIN-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dayue Gao reopened KYLIN-2165:
------------------------------

Hi [~Shaofengshi],

After deploying this feature in our production env, I found two bugs:
# JobEngineConfig should use cube-specific configurations; otherwise the hive set statements won't contain overrides (such as queue settings).
# For a cube that contains upper case letters in its name, `numRows` is always 0. We use Hive 1.2.1 and the "fs" StatsDB, and I can reproduce the bug in Hive.

To handle this situation, I think we can use a lowercase name for the intermediate table.

> Use hive table statistics data to get the total count
> -----------------------------------------------------
>
>                 Key: KYLIN-2165
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2165
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine
>            Reporter: Shaofeng SHI
>            Assignee: Shaofeng SHI
>             Fix For: v2.0.0
>
>
> Kylin runs a count on the intermediate flat hive table to get the total row
> number, and then redistributes the table.
> According to hive's wiki, hive automatically collects the table statistics when
> it runs an "insert overwrite" statement, so a subsequent "select count(*)"
> will be very fast (see
> https://cwiki.apache.org/confluence/display/Hive/StatsDev). However, Kylin is
> executing "INSERT OVERWRITE DIRECTORY '/kylin/row_count' SELECT count(*)
> from", which still causes an MR/Tez job to be started, so the step takes
> longer.
> Changing the SQL to a plain "select count(*)", or using the Hive API to get the
> statistics, would save that cost.
> Here is a sample; the table
> 'kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5' is an
> intermediate table.
> If "count(*)" is run directly, it is pretty fast:
> hive> select count(*) from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
> OK
> 970033
> Time taken: 0.112 seconds, Fetched: 1 row(s)
> Whereas the SQL Kylin uses today causes a job to be started:
> hive> INSERT OVERWRITE DIRECTORY '/kylin/row_count' select count(*) from kylin_intermediate_qq_dbe874d2_bb9a_4375_ba50_3dcf096a13c5;
> Query ID = root_20161106080808_0099b622-c0bd-41da-aee5-2321adf7bdda
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Job = job_1463701915919_46208, Tracking URL =
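
To make the lowercase-name suggestion and the statistics lookup concrete, here is a minimal sketch in Python. It is not Kylin's actual code. The helper names `intermediate_table_name` and `row_count_from_stats` are hypothetical, the naming pattern is only inferred from the sample table name in the description, and the table parameters are assumed to arrive as a plain string-to-string map containing `numRows`. Treating a `numRows` of 0 as "statistics not usable" is also an assumption: it covers the upper case bug, but a genuinely empty table would look the same and would take the fallback path.

```python
from typing import Mapping, Optional


def intermediate_table_name(cube_name: str, segment_id: str) -> str:
    """Build an all-lowercase intermediate table name.

    Hypothetical helper: the pattern is inferred from the sample name in
    the issue description, not taken from Kylin's source.
    """
    raw = "kylin_intermediate_{}_{}".format(cube_name, segment_id.replace("-", "_"))
    # Lowercasing avoids the case where stats are reported as numRows = 0
    # for tables whose name contains upper case letters.
    return raw.lower()


def row_count_from_stats(params: Mapping[str, str]) -> Optional[int]:
    """Return the row count from table parameters, or None if unusable.

    None tells the caller to fall back to an explicit count(*) query.
    """
    value = params.get("numRows")
    if value is None:
        return None
    try:
        count = int(value)
    except ValueError:
        return None
    # Assumption: 0 (or a negative value) means the stats were not collected.
    return count if count > 0 else None
```

With the cube name "QQ" and the segment id "dbe874d2-bb9a-4375-ba50-3dcf096a13c5", the first helper yields the table name used in the sample above; the second returns 970033 for `{"numRows": "970033"}` and None for `{"numRows": "0"}`.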