Implicitly CLUSTER BY when dynamically partitioning
---------------------------------------------------

                 Key: HIVE-2363
                 URL: https://issues.apache.org/jira/browse/HIVE-2363
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor
            Reporter: Adam Kramer
            Priority: Critical


When a query creates partitions dynamically, the underlying implementation 
reads the output data in order and writes it to one file for as long as the 
partition column values stay the same. When a partition column value changes, 
it closes that file and opens a new one. If the rows for a partition are not 
contiguous, this can generate far too many files.
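
To make the file-count problem concrete, here is a minimal sketch in Python. It 
is a toy model of the behavior described above, not Hive's actual writer code. 
The function name count_files_opened, the partition column "ds", and the sample 
values are all hypothetical.

```python
def count_files_opened(rows, partition_key):
    """Count the files a writer opens if it starts a new file on every key change."""
    files_opened = 0
    previous = object()  # sentinel that compares unequal to any real key value
    for row in rows:
        current = row[partition_key]
        if current != previous:
            # Partition value changed: close the old file, open a new one.
            files_opened += 1
            previous = current
    return files_opened


# Hypothetical output rows whose partition values are interleaved.
rows = [{"ds": d} for d in
        ["2011-08-01", "2011-08-02", "2011-08-01", "2011-08-02", "2011-08-01"]]

interleaved_count = count_files_opened(rows, "ds")
sorted_count = count_files_opened(sorted(rows, key=lambda r: r["ds"]), "ds")
print(interleaved_count, sorted_count)
```

With interleaved input, the toy writer opens one file per row. With the same 
rows sorted by the partition column, it opens one file per distinct value.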

The solution is to ensure that all rows for a given partition value arrive 
contiguously and on the same reducer, i.e., to cluster by the partitioning 
columns on the way out.
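
The effect of clustering can be sketched the same way. The Python below is an 
illustration under stated assumptions, not Hive's implementation: rows are 
routed to a reducer by a hash of the partition value (zlib.crc32 stands in for 
whatever hash Hive uses), and each reducer sorts its rows by that value. The 
names cluster_by and files_per_reducer, the "country" column, and the reducer 
count are hypothetical.

```python
import zlib
from collections import defaultdict


def cluster_by(rows, key, num_reducers):
    """Route rows to reducers by a hash of the key, then sort each reducer's rows."""
    reducers = defaultdict(list)
    for row in rows:
        # Equal key values always hash to the same reducer.
        reducer_id = zlib.crc32(row[key].encode("utf-8")) % num_reducers
        reducers[reducer_id].append(row)
    return {rid: sorted(rs, key=lambda r: r[key]) for rid, rs in reducers.items()}


def files_per_reducer(reducer_rows, key):
    """Files one reducer writes: one per contiguous run of equal key values."""
    runs = 0
    previous = object()  # sentinel that compares unequal to any real key value
    for row in reducer_rows:
        if row[key] != previous:
            runs += 1
            previous = row[key]
    return runs


values = ["us", "de", "us", "fr", "de", "us", "fr", "jp"]
rows = [{"country": v} for v in values]

clustered = cluster_by(rows, "country", num_reducers=3)
total_files = sum(files_per_reducer(rs, "country") for rs in clustered.values())
distinct_partitions = len(set(values))
print(total_files, distinct_partitions)
```

Because every row with a given partition value lands on one reducer and is 
contiguous there, the total number of files equals the number of distinct 
partition values, however the hash happens to spread them across reducers.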

This improvement is to detect whether a query already clusters by the eventual 
partition columns and, if it does not, to do so as an additional step at the 
end of the query. This could greatly reduce the number of output files and 
potentially save a lot of space.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
