When performing a join in Hive and then filtering the output with a where
clause, the Hive compiler will try to filter data before the tables are
joined. This is known as predicate pushdown (
http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/)

For example:

SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6

Rows from table a which have some_name = 6 will be filtered before
performing the join, if push down predicates are enabled(hive.optimize.ppd).

However, I have also learned recently that there is another way of
filtering data from a table before joining it with another table(
https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/
).

One can provide the condition in the ON clause, and table a will be
filtered before the join is performed

For example:

SELECT * FROM a JOIN b  ON a.some_id=b.some_other_id AND a.some_name=6

Are these both valid ways of filtering data before joins?

Thank you

Yours Truly,
Varun Rao

Reply via email to