I wanted to look into improving the performance of my Hive cluster, and from what I read, turning on compression of intermediate data could help. As I understand it, this would help because it reduces the amount of data written to disk between jobs.
I looked at the documentation and applied the following settings:

SET hive.exec.compress.intermediate=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

I ran some queries to see how compression affects performance, but it usually made the query time worse. I also tried a query whose intermediate data was close in size to its input data, and compression made that query slower too.

Question 1: Are the above settings the correct ones for compressing intermediate data?

Question 2: Are there use cases in which compressing intermediate data would not help performance? Why would someone not always keep it turned on?

Thanks
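For reference, here is a minimal sketch of how the values can be checked in the same session before running a query, assuming the Hive CLI or Beeline is being used. Running SET with only a property name prints the value currently in effect; the three property names are the ones from my settings above.

```sql
-- SET with only a property name prints the value currently in effect
-- for this session, so a typo or an overridden value shows up here.
SET hive.exec.compress.intermediate;
SET mapred.output.compression.type;
SET mapred.map.output.compression.codec;
```

This only confirms what the session sees. It does not by itself show whether the intermediate files were actually written compressed.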