Hey Hivers,

I am trying to understand the obvious and not-so-obvious optimizations I 
could make for a Hive query on an AWS EMR cluster. I know the answers to some 
of these questions, but I want to know what you think and by what factor each 
approach affects performance compared with the alternative.


1.       Having my external table data gzipped and reading it into the table 
vs. no compression at all.

2.       Having the external table data on S3 v/s having it on HDFS?

3.       Creating intermediate external tables vs. non-external (managed) 
tables vs. creating views?

4.       Storing the external table as a text file vs. a sequence file. I 
understand a sequence file can compress the data, but in what format? I have 
read about RC files and how efficient they are; how do I use them?

5.       How does the number of reducers get set for a Hive query (the way 
GROUP BY and ORDER BY set the number of reducers to 1)? If I am not changing it 
explicitly, does Hive pick it up from the underlying Hadoop cluster? I am 
trying to understand where the bottleneck lies between the query and the 
cluster size.

6.       Any other optimizations/ best practices?
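
On question 5, my rough understanding, which I would like confirmed or 
corrected, is that when the reducer count is not set explicitly 
(mapred.reduce.tasks = -1), Hive estimates it from the input size using 
hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max. Below is a 
minimal Python sketch of that estimate. The function name and the default 
values are placeholders I picked for illustration; the real defaults differ 
between Hive versions, and this ignores special cases such as ORDER BY.

```python
def estimate_reducers(input_bytes,
                      bytes_per_reducer=1_000_000_000,  # assumed value of hive.exec.reducers.bytes.per.reducer
                      max_reducers=999,                 # assumed value of hive.exec.reducers.max
                      forced_reducers=-1):              # stands in for mapred.reduce.tasks
    """Hypothetical helper approximating Hive's reducer-count heuristic."""
    # An explicit positive setting overrides the estimate entirely.
    if forced_reducers > 0:
        return forced_reducers
    # Ceiling division on integers: one reducer per bytes_per_reducer of input.
    estimate = -(-input_bytes // bytes_per_reducer)
    # Clamp to at least one reducer and at most the configured maximum.
    return max(1, min(max_reducers, estimate))


if __name__ == "__main__":
    # A 10 GB input with the assumed settings above.
    print(estimate_reducers(10 * 1_000_000_000))
```

If this is right, the reducer count depends on input size and these Hive 
settings rather than on the cluster size, and the cluster only limits how many 
of those reducers can run at the same time.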

Thanks a lot in advance.
Richin
