Hi, > Using CDH 5.3 - Hive 0.13. Does a view help here? Does how i format >the table help in reducing size?
No, a view does not help - they are not materialized and you need hive-1.0 to have temporary table support. The only way out is if you only have 1 filter column in the system. I assume your data is not in ORC (because of CDH), which prevents any speedups due to the ORC row-index filters. In ORC, you can reorganize data to get min-max pruning IO savings by reinserting data via ³insert overwrite table <x> select * from <x> sort by <filter-col>². In my point lookup queries, when looking for ~3 rows in 6 billion unique values, ORC ends up reading only 54,000 rows into memory thanks to ORC indexes, even in MapReduce. But if your case is much simpler & you have a low-cardinality column (i.e <100 unique items), you can use that column as a partition column for your table, so that it is pre-filtered during planning time. Outside of those scenarios, a scalable distributed solution exists in Tez¹s broadcast JOIN - you can test Tez using the open-source Apache hive-0.13.1 (the one Yahoo uses), because the CDH version had Tez removed from the package. Cheers, Gopal