> What was the type (Parquet, text, ORC etc) and row count for each three
> tables above?

I always use ORC for flat columnar data. ORC is designed to be ideal if you have measure/dimensions normalized into tables - most SQL workloads don't start with an indefinite depth tree.

hive> select count(1) from store_sales;
OK
2879987999
Time taken: 2.603 seconds, Fetched: 1 row(s)
hive> select count(1) from store;
OK
1002
Time taken: 0.213 seconds, Fetched: 1 row(s)
hive> select count(1) from date_dim;
OK
73049
Time taken: 0.186 seconds, Fetched: 1 row(s)
hive>

The DPP semi-join for date_dim is very fast, so out of the ~2.8 billion records only 93 million are read into the cache.

Standard TPC-DS data-set at 1000 scale - same layout you can get from hive-testbench && ./tpcds-setup.sh 1000;

Cheers,
Gopal
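
For a rough sense of how selective the date_dim pruning is, the figures above can be turned into a fraction. This is only a back-of-the-envelope sketch: the store_sales count comes from the `select count(1)` output, while the 93 million is the approximate number quoted in the text, not a measured value, and the function name is made up for illustration.

```python
def surviving_fraction(total_rows: int, rows_read: int) -> float:
    """Return the share of fact-table rows still read after pruning."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return rows_read / total_rows


# Row count reported by `select count(1) from store_sales`.
STORE_SALES_ROWS = 2_879_987_999
# Approximate figure quoted in the message (assumption: exactly 93 million).
ROWS_READ_AFTER_DPP = 93_000_000

fraction = surviving_fraction(STORE_SALES_ROWS, ROWS_READ_AFTER_DPP)
reduction = STORE_SALES_ROWS / ROWS_READ_AFTER_DPP

print(f"rows read after pruning: {fraction:.1%} of store_sales")
print(f"reduction factor: about {reduction:.0f}x")
```

Under those assumptions roughly 3% of store_sales is read, i.e. about a 31x reduction in rows scanned.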