> We are using a query with union all and groupby and same table is read > multiple times in the union all subquery. … > When run with Mapreduce, the job is run in one stage consuming n mappers and > m reducers and all union all scans are done with the same job.
The logical plans are identical btw - MR effectively reads the same table again and again, unless the correlation optimizer is folding this. I doubt that due to the unix_timestamp(). An explain would be useful. > Hence if there are 50 union alls in a query, the 50n map vertex tasks are > launched which is huge. Tez lets you scale the mappers up/down using split grouping parameters, so you can tweak it to scale down if you want to. set tez.grouping.split-waves=0.1; would try to shrink the width of the mappers. An alternative is to use a CTE + materialization (HIVE-11752), but for that you need Hive2. > http://pastebin.com/u7Rw6Hag You can probably get a ~2x speedup by removing the UNIX_TIMESTAMP() and using CURRENT_TIMESTAMP instead. Cheers, Gopal