Hello,
I'm trying to optimize some queries in Hive that were recently switched to
Tez.. had a general question regarding reducer parallelism ..
A lot of our queries do the following style of simultaneous windowing ..
SELECT
row_number() OVER( PARTITION BY app, user, type ORDER BY ts ) as
a_number,
row_number() OVER( PARTITION BY day, app, user, type ORDER BY ts ) as
type_rank,
row_number() OVER( PARTITION BY day, app, user ORDER BY ts ) as
dau_rank,
FROM messages
WHERE ...
Since each OVER / PARTITION-By clause is independent they can the put into
parallelized Reducer phases. But what I see is that these get serialized
into M1 -> R1 -> R2 -> R3 .. instead of M1 -> [ R1, R2, R3 ]
Is this something that Tez tries to do at all or an optimization that I can
use to my benefit ?
Cheers,
-Gautam.