Hey Hivers, I am trying to understand some of the obvious and not-so-obvious optimizations I could apply to a Hive query on an AWS EMR cluster. I know the answers to some of these questions, but I would like to know what you think and by roughly what factor each choice affects performance compared with the alternative.
1. Having my external table data gzipped and reading it into the table vs. no compression at all?
2. Having the external table data on S3 vs. having it on HDFS?
3. Creating intermediate external tables vs. non-external tables vs. creating views?
4. Storing the external table as a text file vs. a sequence file? I know a sequence file compresses the data, but in what format? I have also read about RC files and how efficient they are; how do I use them?
5. How does the number of reducers get set for a Hive query (for example, the way GROUP BY and ORDER BY set the number of reducers to 1)? If I do not change it explicitly, is it picked up from the underlying Hadoop cluster? I am trying to understand where the bottleneck lies between the query and the cluster size.
6. Any other optimizations or best practices?

Thanks a lot in advance.
Richin