Re: Obvious and not so obvious query optimzations in Hive

Bejoy KS Wed, 27 Jun 2012 11:46:45 -0700

Hi Richin

I'm not an AWS guy but still lemme try answering a few questions in general 
(not wrt AWS EMR)


1.       Having my external table data gzipped and reading it in the table v/s 
no compression at all.

Bejoy: When the data volume is large compression saves up the disk space and 
hence it is recommended. Gzip is not splittable, means one file can't be 
distributed across map tasks. Go in with Lzo or Snappy.

2.       Having the external table data on S3 v/s having it on HDFS?

3.       Creating intermediate external tables v/s non external tables v/s 
creating views?

Bejoy: External Tables are not much different than managed tables. In managed 
tables when you drop the table using HQL the underlying data in hdfs is also 
deleted but in case of External tables only the table defn is dropped and the 
data in hdfs is preserved. Views are not evaluated on its creation time, but 
when it is used MR jobs are triggered and the required data is extracted out of 
source tables. So if you are planning to reuse a view n number of times it'll 
be better creating a table and using it else the view query will be evaluated n 
times.

4.       Storing the external table as Textfile v/s Sequence file. I know 
sequence file compresses the data, but in what format? I read about RC files 
and how efficient they are, how to use them?

Bejoy: Sequence files are splitable on its own so it is a good choice when 
using with Snappy. In sequence file, compression happens either at record level 
or block level which is configurable at the time of compressing. If you are 
using Gzip, TextFiles offer more compression ratio as the whole data is 
compressed in a go compared with Sequence files where it happens per 
record/block. But then you compromise on splitability. RC files are a good 
choice when your queries involve querying only a few columns rather than all 
columns in a row. 

5.       How are number of reducers get set for a Hive query (The way group by 
and order by sets the number of reducers to 1) ? If I am not changing it 
explicitly does it pick it from the underlying Hadoop cluster? I am trying to 
understand the bottleneck between query and cluster size.

Bejoy: Some queries in hive are forced to have just a single reducer like Order 
By. In case of other queries hive determines the number of reducers. However 
you can always specify the number of reducer on a per query basis based on the 
data it process. 

6.       Any other optimizations/ best practices?

Bejoy: There are lots of other optimizations in hive which can be injected 
based on the query. There are various join optimizations, group by 
optimizations etc suited for specific needs.



Regards
Bejoy KS

Sent from handheld, please excuse typos.

-----Original Message-----
From: <[email protected]>
Date: Wed, 27 Jun 2012 15:47:54 
To: <[email protected]>
Reply-To: [email protected]
Subject: Obvious and not so obvious query optimzations in Hive

Hey Hivers,

I am trying to understand what are some of the obvious and not so obvious 
optimization I could do for a Hive Query on AWS EMR cluster. I know the answer 
for some of these questions but want to know what do you guys think and by what 
factor it affects the performance over the other approach.


1.       Having my external table data gzipped and reading it in the table v/s 
no compression at all.

2.       Having the external table data on S3 v/s having it on HDFS?

3.       Creating intermediate external tables v/s non external tables v/s 
creating views?

4.       Storing the external table as Textfile v/s Sequence file. I know 
sequence file compresses the data, but in what format? I read about RC files 
and how efficient they are, how to use them?

5.       How are number of reducers get set for a Hive query (The way group by 
and order by sets the number of reducers to 1) ? If I am not changing it 
explicitly does it pick it from the underlying Hadoop cluster? I am trying to 
understand the bottleneck between query and cluster size.

6.       Any other optimizations/ best practices?

Thanks a lot in advance.
Richin

Re: Obvious and not so obvious query optimzations in Hive

Reply via email to