Adesh Kumar Rao created HIVE-16498:
--------------------------------------

             Summary: [Tez] ReduceRecordProcessor has no check to see if all 
the operators are done or not and is reading complete data
                 Key: HIVE-16498
                 URL: https://issues.apache.org/jira/browse/HIVE-16498
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.2.0, 1.3.0
            Reporter: Adesh Kumar Rao


ReduceRecordProcessor does not check whether the reducer (Operator) is done, 
so it keeps reading input that can no longer affect the result.

It can be reproduced by a reduce side join.

The data for large_table is generated by the following shell script; a table 
can then be created from the resulting file `large.txt`.
{code:java}
for (( j=1 ; j <=20; j++))
do
  for (( i=1; i <= 1000000; i++ ))
  do
    echo "$i,$j" >> large.txt
  done
done
{code}

{code:sql}
create external table large_table ( i int, j int) row format delimited fields 
terminated by ',' location "hdfs://<some-hdfs-location>";

set hive.auto.convert.join=false; -- so that a reduce-side join is used instead of MapJoin

select * from large_table a join large_table b on a.j = b.j limit 100;
{code}
The above join query keeps reading all of the data from the table (because 
the check is missing) and does not finish in a reasonable time, unlike MR or 
even Tez with MapJoin enabled.

For reference, the same query takes around 5-6 minutes on MR and 2-3 minutes 
with MapJoin on Tez.
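
To illustrate the kind of check that appears to be missing, here is a minimal, 
self-contained sketch. It is not Hive code: `SketchOperator`, `LimitOperator` 
and `run` are hypothetical stand-ins, and the real fix in ReduceRecordProcessor 
may look quite different. The sketch only shows that once a downstream operator 
reports it is done (for example after `limit 100` is satisfied), the read loop 
can stop instead of draining the remaining input.

```java
import java.util.Iterator;
import java.util.stream.IntStream;

// Hypothetical stand-ins for illustration only; these are NOT Hive's real classes.
class EarlyExitSketch {

    /** Minimal operator contract: consume a row, report when no more input is needed. */
    interface SketchOperator {
        void process(int row);
        boolean isDone();
    }

    /** Mimics a LIMIT: done once `limit` rows have been forwarded. */
    static final class LimitOperator implements SketchOperator {
        private final int limit;
        private int forwarded = 0;

        LimitOperator(int limit) {
            this.limit = limit;
        }

        @Override
        public void process(int row) {
            if (forwarded < limit) {
                forwarded++; // a real operator would forward the row downstream
            }
        }

        @Override
        public boolean isDone() {
            return forwarded >= limit;
        }
    }

    /** Feeds rows to the operator and returns how many rows were read from the input. */
    static long run(Iterator<Integer> input, SketchOperator op, boolean checkDone) {
        long rowsRead = 0;
        while (input.hasNext()) {
            // The check under discussion: stop reading once the operator is done.
            if (checkDone && op.isDone()) {
                break;
            }
            op.process(input.next());
            rowsRead++;
        }
        return rowsRead;
    }

    public static void main(String[] args) {
        int totalRows = 1_000_000;
        long withoutCheck = run(IntStream.range(0, totalRows).iterator(),
                new LimitOperator(100), false);
        long withCheck = run(IntStream.range(0, totalRows).iterator(),
                new LimitOperator(100), true);
        System.out.println("rows read without check: " + withoutCheck);
        System.out.println("rows read with check:    " + withCheck);
    }
}
```

Without the check the loop reads every input row even though only the first 
100 can contribute to the output; with the check it stops as soon as the 
limit is reached.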



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
