2 tables join happens at Hive but not in spark

Sandeep Khurana Sat, 27 Feb 2016 02:11:12 -0800

Hello

We have 2 tables (tab1, tab2) exposed through Hive. The data is in different
HDFS folders. We are trying to join these 2 tables on a single column
using the SparkR join. But in spite of the join columns having the same
values, it returns zero rows.


But when I run the same join SQL in Hive, from the Hive console, to get the
count(*), I do get millions of records meeting the join criteria.

The join columns are of 'int' type. Also, when I separately join 'tab1' (one
of the 2 tables for which the join is not working) with a third table,
'tab3', that join works.

To debug, we selected just 1 row from tab1 in the SparkR script, and also 1
row from tab2 having the same value in the join column. We used the SparkR
'select' function for this. Now our dataframes for tab1 and tab2 have a
single row each, and the join columns have the same value in both, but
joining these 2 single-row dataframes on that column still returned zero
rows.
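
Roughly, the single-row check looks like the sketch below. This is only an
illustration, not the exact script: the column name 'id' and the value 42 are
placeholders, it uses 'filter' and 'limit' to pick the row, and it assumes the
Spark 1.6-style SparkR API (sparkR.init / sparkRHive.init).

```r
library(SparkR)

# Assumption: Spark 1.6-style initialisation with Hive support.
sc <- sparkR.init()
hiveContext <- sparkRHive.init(sc)

# Load both Hive tables as SparkR DataFrames.
tab1 <- sql(hiveContext, "SELECT * FROM tab1")
tab2 <- sql(hiveContext, "SELECT * FROM tab2")

# Compare the column types as Spark sees them (Hive reports 'int').
printSchema(tab1)
printSchema(tab2)

# Keep one row per table with the same join value.
# 'id' and 42 are placeholders for the real column and value.
one1 <- limit(filter(tab1, tab1$id == 42), 1)
one2 <- limit(filter(tab2, tab2$id == 42), 1)

# Inner join on the single column, then count the matching rows.
joined <- join(one1, one2, one1$id == one2$id, "inner")
count(joined)
```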


We are running the script from RStudio. It runs fine and does not give any
error, but it returns zero join results, whereas in Hive I do get many rows
for the same join. Any idea what might be the cause of this?



-- 
Architect
Infoworks.io
http://Infoworks.io
