Hi Bejoy, Thanks. Following your instructions, I also enabled map output compression.
I tried different queries, but none of them benefited from compression. I also tried queries that produce large intermediate data, but their performance did not improve either.

I should also note that our Hadoop cluster is set up on a few Amazon EC2 m2.2xlarge instances.

The question is: in which scenarios can compression improve performance?

Thanks,
-- Hadi

On Sat, Oct 6, 2012 at 6:32 PM, Bejoy KS <bejoy...@yahoo.com> wrote:

> Hi Hadi
>
> The properties you specified don't enable compression of map output. To
> enable map output compression you need to enable the following properties:
>
> SET hive.exec.compress.output=true;
> SET mapred.map.output.compression=true;
> SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
>
> The property 'hive.exec.compress.intermediate' is used to enable
> compression of data in between multiple mapreduce jobs generated by a
> hive query.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
> ------------------------------
> From: Hadi Moshayedi <h...@moshayedi.net>
> Date: Sat, 6 Oct 2012 16:55:47 +0300
> To: <user@hive.apache.org>
> ReplyTo: user@hive.apache.org
> Subject: Compression of Intermediate Data
>
> I wanted to look into improving the performance of my Hive cluster, and
> from what I read, turning on compression of intermediate data could help.
> As I understand it, this would help because it would reduce the amount of
> data written to disk in between jobs.
>
> I looked at the documentation and set the following settings:
>
> SET hive.exec.compress.intermediate=true;
> SET mapred.output.compression.type=BLOCK;
> SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
>
> I ran some queries to see how compression impacts performance, but it
> usually made the query time worse. I also had a query whose intermediate
> data was close to the size of the input data, but compression made the
> performance worse for this query too.
>
> Question 1: Are the above settings the correct ones for using compression
> of intermediate data?
>
> Question 2: Are there use cases in which compression of intermediate data
> would not help performance? Why would someone not keep it turned on always?
>
> Thanks
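
For anyone following along, here is a rough sketch of how the settings discussed in this thread might be combined in one Hive session. It is not a verified configuration. One assumption to flag: the thread writes the map output switch as mapred.map.output.compression, while the name I know from Hadoop 1.x is mapred.compress.map.output, so that is what the sketch uses. Please check the property names against your own Hadoop and Hive versions, and note that SnappyCodec needs the native Snappy library available on the cluster nodes.

```sql
-- Compress the data Hive writes between the MapReduce jobs of a single query.
SET hive.exec.compress.intermediate=true;

-- Compress map output (the shuffle data) inside each MapReduce job.
-- Assumption: Hadoop 1.x property name; the thread spells it differently.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compression type from the original message; as far as I know this applies
-- to SequenceFile output rather than to map output.
SET mapred.output.compression.type=BLOCK;
```

Running the same query with and without these lines, and comparing the job counters for bytes written and shuffled, should show whether the intermediate data is actually being compressed.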