Question from a novice. Where can I read about table design best practices? I have a measure (fact) table with millions of rows and many dimension tables with fewer than 1,000 rows each. I can't figure out how to arrive at an optimal design for both kinds of tables. Are there any performance tuning guides or a performance FAQ?
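To make the question concrete, here is a rough sketch of the kind of DDL I am asking about. The table name, column names, partition column, and bucket count are all hypothetical placeholders, not my real schema, and I am not claiming this is a good design:

```sql
-- Hypothetical measure (fact) table; every name and number here is made up for illustration.
CREATE TABLE measures (
  user_id    BIGINT,
  product_id INT,
  amount     DOUBLE
)
PARTITIONED BY (dt STRING)                -- one partition per day (assumed)
CLUSTERED BY (user_id)                    -- bucketing column (assumed)
SORTED BY (user_id ASC) INTO 32 BUCKETS   -- bucket count chosen arbitrarily
STORED AS SEQUENCEFILE;                   -- or RCFILE / TEXTFILE
```

The small dimension tables would be plain tables without any of these clauses, unless there is a reason to design them differently.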
Specifically:
1) The PARTITIONED BY, CLUSTERED BY, and SORTED BY clauses. In which cases does using them make sense?
2) The DDL language manual says 'This can improve performance on certain kinds of queries.' about the CLUSTERED BY clause. Which kinds of queries can be improved?
3) Which is preferable in terms of performance: SEQUENCEFILE, RCFILE, or TEXTFILE? What aspects should be taken into account when choosing a file format?
4) The Compressed storage article says 'Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage;' and again, what are these cases?

Thanks!
Vyacheslav