Re: map side join

Gopal Vijayaraghavan Thu, 30 Apr 2015 09:55:01 -0700

Hi,

> Using CDH 5.3 - Hive 0.13.  Does a view help here?  Does how i format
>the table help in reducing size?


No, a view does not help - they are not materialized and you need hive-1.0
to have temporary table support.

The only way out is if you only have 1 filter column in the system.

I assume your data is not in ORC (because of CDH), which prevents any
speedups due to the ORC row-index filters.

In ORC, you can reorganize data to get min-max pruning IO savings by
reinserting data via ³insert overwrite table <x> select * from <x> sort by
<filter-col>².

In my point lookup queries, when looking for ~3 rows in 6 billion unique
values, ORC ends up reading only 54,000 rows into memory thanks to ORC
indexes, even in MapReduce.

But if your case is much simpler & you have a low-cardinality column (i.e
<100 unique items), you can use that column as a partition column for your
table, so that it is pre-filtered during planning time.

Outside of those scenarios, a scalable distributed solution exists in
Tez¹s broadcast JOIN - you can test Tez using the open-source Apache
hive-0.13.1 (the one Yahoo uses), because the CDH version had Tez removed
from the package.

Cheers,
Gopal

Re: map side join

Reply via email to