Steven Cardella created SPARK-25774:
---------------------------------------

             Summary: Eliminate query anomalies with empty partitions - TRUNCATE, SELECT DISTINCT, etc.
                 Key: SPARK-25774
                 URL: https://issues.apache.org/jira/browse/SPARK-25774
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
         Environment: Right now, I'm using Cloudera with Spark 2.2.0, but I 
understand the behavior is widespread.
            Reporter: Steven Cardella


If you run a Spark SQL TRUNCATE TABLE command on a managed table in Hive, it 
deletes the data files in HDFS but leaves the partitions and the partition 
folder structure in place. If you then run SELECT DISTINCT on the partition 
columns, it returns the values of all the empty partitions. So SELECT DISTINCT 
can return rows while SELECT * on the same table returns 0 rows.
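
The sequence described above can be sketched as a short Spark SQL session. 
This is only an illustrative reproduction: the table name `events`, its 
columns, and the partition values are hypothetical, and the comments restate 
the behavior described in this report rather than verified output.

```sql
-- Hypothetical managed, partitioned Hive table (names are illustrative only).
CREATE TABLE events (id INT, payload STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- Create two partitions, each holding one row.
INSERT INTO events PARTITION (dt = '2018-10-01') VALUES (1, 'a');
INSERT INTO events PARTITION (dt = '2018-10-02') VALUES (2, 'b');

-- Remove the data; per the report, the partitions themselves remain.
TRUNCATE TABLE events;

-- Per the report: the partitions are still listed after the truncate.
SHOW PARTITIONS events;

-- Per the report: still returns the values of the now-empty partitions.
SELECT DISTINCT dt FROM events;

-- Per the report: returns 0 rows.
SELECT * FROM events;
```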

Coming from SQL Server and the like, I expect SELECT DISTINCT to always reflect 
the ROWS, and Impala works like that as well.

I'd like SELECT DISTINCT to reflect rows, not partitions, TRUNCATE TABLE to 
have the option to drop partitions, and MSCK REPAIR TABLE to have the option to 
drop empty partitions.
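
Until such options exist, a manual workaround might look like the sketch 
below. It assumes a hypothetical table `events` partitioned by a column `dt`, 
with illustrative partition values. It also assumes that the SELECT DISTINCT 
result comes from Spark answering partition-column queries from metastore 
metadata; if that is the cause, turning off `spark.sql.optimizer.metadataOnly` 
may make the query reflect rows instead, but that is not confirmed here.

```sql
-- Drop each leftover empty partition explicitly
-- (table name and partition values are hypothetical).
ALTER TABLE events DROP IF EXISTS PARTITION (dt = '2018-10-01');
ALTER TABLE events DROP IF EXISTS PARTITION (dt = '2018-10-02');

-- Assumption: the anomaly stems from the metadata-only query optimization.
-- Disabling it for the session forces partition-column queries to scan data.
SET spark.sql.optimizer.metadataOnly=false;
SELECT DISTINCT dt FROM events;
```

Dropping partitions one by one does not scale to tables with many partitions, 
which is why an option on TRUNCATE TABLE or MSCK REPAIR TABLE would help.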



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
