table design and performance questions

Avdeev V. M.  Sun, 27 May 2012 22:17:41 -0700

A question from a novice.

Where can I read about table design best practices? I have a measure table 
with millions of rows and many dimension tables with fewer than 1000 rows 
each. I cannot work out how to arrive at an optimal design for both kinds of 
tables. Are there any performance tuning guides or a performance FAQ?


Specifically:
1) The PARTITIONED BY, CLUSTERED BY and SORTED BY clauses. In which cases 
does using these clauses make sense?
2) The DDL language manual says 'This can improve performance on certain kinds 
of queries.' about the CLUSTERED BY clause. What kinds of queries can be 
improved?
3) Which is preferable in terms of performance: SEQUENCEFILE, RCFILE or 
TEXTFILE? What aspects should be taken into account when choosing a file 
format?
4) The compressed storage article says 'Keeping data compressed in Hive tables 
has, in some cases, known to give better performance that uncompressed 
storage;' and again: what are these cases?
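
For concreteness, here is a minimal sketch of the kind of DDL that questions 1 
and 3 are about. The table name, column names, partition column and bucket 
count are all hypothetical and chosen only for illustration; this is not a 
recommended design.

```sql
-- Hypothetical measure table; all names and the bucket count are made up.
CREATE TABLE measures (
  event_time STRING,
  dim_id     INT,
  value      DOUBLE
)
-- One directory per value of the partition column.
PARTITIONED BY (dt STRING)
-- Rows are hashed on dim_id into a fixed number of bucket files,
-- and each bucket is kept sorted by dim_id.
CLUSTERED BY (dim_id) SORTED BY (dim_id ASC) INTO 32 BUCKETS
-- One of the three file formats from question 3.
STORED AS RCFILE;
```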

Thanks!
Vyacheslav
