Hi,
We've noticed a difference in behaviour in doing our insert overwrite
queries when moving to Hive3.
We are using dynamic partitions where our queries look something like this:
insert overwrite tableX partition('col0', 'col1') select * from t1
distribute by col0, col1, hash(columns)%100
Where 100 is the number of files we want in the partition folder. So we
would generate 100 1GB files per partition.
In Hive2 the 'distribute by' clause caused the data to get nicely
distributed to the number of configured reducers. In Hive3 running with the
same amount of reducers the data gets distributed to just 1 reducer.
Generating 1 very large 100 GB file.
The plan for the queries also differ in the 'Map-reduce partition columns'
in the last reducer stage.
For Hive 2 it says: Map-reduce partition columns:_col0, col1,
hash(cols)%100
For Hive 3 it says: Map-reduce partition columns:_col0, col1
Which corresponds with the behaviour we see.
We've been trying every possible conf that we can find but we can't find a
way to reproduce our hive2 behaviour. Any suggestions?
Thanks,
Patrick