query resulting in many small output files causes timeout error in Hue

Eric Chu Thu, 21 Nov 2013 10:17:24 -0800

Hi,

We often have map-only queries that result in a large number of small
output files (in the thousands). Although this doesn't affect CLI, when
users try to view/download the query result in Hue, Hue would time out in
trying to read all these small files. We tried to set the following
properties that supposedly will make Hive launch an extra MR job to merge
these files when the average file size is smaller than some threshold, but
it's not working:


   1. hive.merge.mapfiles = true
   2. hive.merge.mapredfiles = true
   3. hive.merge.smallfiles.avgsize = 32000000 (Default is 16000000)
   4. In Hive 10, we used to have hive.mergejob.maponly set to true, but
   this property does not exist in Hive 11 and 12. What's the story behind
   this?

For example, in the following select-from-where query on a partitioned
table in RCFile, there would be two root stages - one doing a scan with
filter and the other doing a fetch.

*Query*:

select data_date as date, ID, if(col_10=1, "yes","no") as answer
from table_1
where arr[4] <> "0"
and lookup("table_1", x,"action_id")=20519251
and data_date>=20131014

*Query Plan:*

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        table_1
          TableScan
            alias: table_1
            Filter Operator
              predicate:
                  expr: ((arr[4] <> '0') and (dim_lookup('table_1', x,
'action_id') = 20519251))
                  type: boolean
              Select Operator
                expressions:
                      expr: data_date
                      type: string
                      expr: ID
                      type: string
                      expr: if((col_10= 1), 'yes', 'no')
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: true
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

The query leads to 6253 output files, and the total size is 86427 bytes.
Many of the files have 8 bytes and the ones that have more than 8 bytes
usually have ~30 bytes. With the aforementioned settings, I'd expect an
extra MR job to merge the files, but that didn't happen.

If anyone has some insights please let me know.

Thanks,
Eric

query resulting in many small output files causes timeout error in Hue

Reply via email to