Hi Varun,

Yes both of these are valid ways of filtering data before join in Hive.

As long as the join is not outer and the ON condition is not on non-null
generating side of join Hive planner will try to push the predicate down to
table scan.
In fact Hive goes one step ahead and also generate IS NOT NULL predicate on
join keys (e.g. a.some_id IS NOT NULL, b.some_other_id IS NOT NULL) and
push is down to table scan if possible.

Regards,
Vineet Garg


On Sun, Apr 28, 2019 at 11:54 AM Varun Rao <va...@cloudera.com> wrote:

> When performing a join in Hive and then filtering the output with a where
> clause, the Hive compiler will try to filter data before the tables are
> joined. This is known as predicate pushdown (
> http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/)
>
> For example:
>
> SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6
>
> Rows from table a which have some_name = 6 will be filtered before
> performing the join, if push down predicates are enabled(hive.optimize.ppd).
>
> However, I have also learned recently that there is another way of
> filtering data from a table before joining it with another table(
> https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/
> ).
>
> One can provide the condition in the ON clause, and table a will be
> filtered before the join is performed
>
> For example:
>
> SELECT * FROM a JOIN b  ON a.some_id=b.some_other_id AND a.some_name=6
>
> Are these both valid ways of filtering data before joins?
>
> Thank you
>
> Yours Truly,
> Varun Rao
>

Reply via email to