Thomas Poepping created HIVE-15852:
--------------------------------------

             Summary: Tablesampling on Tez in low-record case throws 
ArrayIndexOutOfBoundsException
                 Key: HIVE-15852
                 URL: https://issues.apache.org/jira/browse/HIVE-15852
             Project: Hive
          Issue Type: Bug
          Components: Tez
    Affects Versions: 2.1.1
            Reporter: Thomas Poepping


HIVE-13040 ( https://issues.apache.org/jira/browse/HIVE-13040 ) stopped 
creating empty files to represent empty buckets when Hive runs on Tez. That 
change breaks a couple of things.

First, if there are empty buckets (which is possible with large datasets 
in the partitioned-bucketed case), tablesampling fails when the query 
references a bucket number higher than the number of files.
For example: some partition 'p' holds three rows, and the table 't' is 
clustered into ten buckets. Even with maximal hashing (each row landing in a 
different bucket), only three bucket files will be created. Running 
select * from t tablesample (bucket x out of 10) where <selecting from p> 
with x > 3 then throws an ArrayIndexOutOfBoundsException, because Hive 
assumes there are only three buckets.
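
The failure mode in this example can be sketched in a few lines of Java. This 
is only an illustration under stated assumptions, not Hive's actual code: the 
class BucketSampleFailure, the method fileForBucket, and the file names are 
hypothetical, and the sketch assumes the lookup indexes an array of bucket 
files by position.

```java
public class BucketSampleFailure {
    // Hypothetical positional lookup: assumes bucketFiles[i] holds bucket i + 1,
    // i.e. that there is exactly one file per bucket.
    static String fileForBucket(String[] bucketFiles, int bucketNumber) {
        return bucketFiles[bucketNumber - 1];
    }

    public static void main(String[] args) {
        // Three rows in partition 'p' produced only three files for ten buckets.
        // The file names are made up for this sketch.
        String[] files = {"000001_0", "000004_0", "000008_0"};

        // bucket 1 out of 10: index 0 exists, so the lookup succeeds.
        System.out.println(fileForBucket(files, 1));

        // bucket 7 out of 10: index 6 is past the end of a three-element array.
        try {
            fileForBucket(files, 7);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("bucket 7 out of 10 failed: " + e.getClass().getSimpleName());
        }
    }
}
```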

Second, other applications (such as Pig) may assume that the number of files 
equals the number of buckets.

Possible fixes:
* Revert HIVE-13040
* Change how tablesampling is implemented to accept the possibility that the 
number of files != the number of buckets
** Would require coordination across projects to change assumptions
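
One way the second option could look, as a rough sketch only: resolve the 
sampled bucket by the bucket id encoded in the file name instead of by array 
position, and treat a missing file as an empty bucket. The class 
SparseBucketLookup and its methods are hypothetical, and the sketch assumes 
file names begin with a zero-padded, zero-based bucket id followed by '_' 
(for example 000003_0); that naming is an assumption here, not something this 
issue confirms.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SparseBucketLookup {
    // Assumption: the text before the first '_' is the zero-based bucket id.
    static int bucketIdOf(String fileName) {
        return Integer.parseInt(fileName.substring(0, fileName.indexOf('_')));
    }

    // Returns the file for a 1-based bucket number, or empty if no file was
    // written for that bucket (an empty bucket).
    static Optional<String> fileForBucket(String[] bucketFiles, int bucketNumber, int numBuckets) {
        if (bucketNumber < 1 || bucketNumber > numBuckets) {
            throw new IllegalArgumentException("bucket " + bucketNumber + " out of " + numBuckets);
        }
        Map<Integer, String> byId = new HashMap<>();
        for (String file : bucketFiles) {
            byId.put(bucketIdOf(file), file);
        }
        return Optional.ofNullable(byId.get(bucketNumber - 1));
    }

    public static void main(String[] args) {
        // Only three of the ten buckets received rows; names are made up.
        String[] files = {"000000_0", "000003_0", "000007_0"};
        System.out.println(fileForBucket(files, 4, 10)); // file with bucket id 3
        System.out.println(fileForBucket(files, 5, 10)); // no file: empty bucket
    }
}
```

With a lookup like this, sampling an empty bucket yields no rows instead of an 
exception, but other readers of the table would still need the same 
convention.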

Things to consider:
* what performance gains are there from not creating empty files?
* if the gains are large, are we willing to lose them? (by reverting HIVE-13040)
* _how else can we avoid creating unnecessary files, while still maintaining 
invariants other applications expect?_



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
