> What was the type (Parquet, text, ORC etc) and row count for each of the
> three tables above?

I always use ORC for flat columnar data.

ORC is designed to be ideal if you have measures and dimensions normalized
into tables - most SQL workloads don't start with a tree of indefinite depth.

hive> select count(1) from store_sales;
OK
2879987999
Time taken: 2.603 seconds, Fetched: 1 row(s)
hive> select count(1) from store;
OK
1002
Time taken: 0.213 seconds, Fetched: 1 row(s)
hive> select count(1) from date_dim;
OK
73049
Time taken: 0.186 seconds, Fetched: 1 row(s)
hive> 

The dynamic partition pruning (DPP) semi-join against date_dim is very fast,
so out of the ~2.88 billion records in store_sales only 93 million are read
into the cache.
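
For reference, the kind of join where DPP kicks in looks roughly like the
sketch below. This is not the exact query from earlier in the thread: the
d_year filter is a made-up example, and it assumes store_sales is partitioned
by ss_sold_date_sk, which is how hive-testbench lays it out. How many rows
actually get read depends on the filter you use.

```sql
-- DPP on Tez is normally enabled by default; set here only to be explicit.
set hive.tez.dynamic.partition.pruning=true;

-- Assumption: store_sales is partitioned by ss_sold_date_sk.
-- The d_year predicate is an illustrative filter, not the original one.
explain
select count(1)
from store_sales ss
join date_dim d
  on ss.ss_sold_date_sk = d.d_date_sk
where d.d_year = 2000;
```

If pruning is active, the plan should show the date_dim side feeding a
dynamic partition pruning event into the store_sales scan, so only the
matching date partitions are opened.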

This is the standard TPC-DS data set at scale 1000 - the same layout you get
from hive-testbench with ./tpcds-setup.sh 1000.

Cheers,
Gopal

