Kernel Force created SPARK-32228: ------------------------------------ Summary: Partition column of hive table was capitalized while stored on HDFS Key: SPARK-32228 URL: https://issues.apache.org/jira/browse/SPARK-32228 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: Hadoop 2.7.7 Hive 2.3.6 Spark 3.0.0 Reporter: Kernel Force
Suppose we have a target hive table to be insert by spark with dynamic partition feature on. {code:sql} CREATE TABLE DEMO_PART ( ID VARCHAR(10), NAME VARCHAR(10) ) PARTITIONED BY (BATCH DATE, TEAM VARCHAR(10)) STORED AS ORC; {code} And have a source data table like: {code:sql} 0: jdbc:hive2://HOSTNAME:10000> SELECT T.* FROM DEMO_DATA T; +-------+---------+-------------+---------+ | t.id | t.name | t.batch | t.team | +-------+---------+-------------+---------+ | 1 | mike | 2020-07-08 | A | | 2 | john | 2020-07-07 | B | +-------+---------+-------------+---------+ 2 rows selected (0.177 seconds) {code} Then doing join operation against an exploded view and insert the result into DEMO_PART table: {code:sql} sql(""" WITH VA AS ( SELECT ARRAY_REPEAT(1,10) A ), VB AS ( SELECT EXPLODE(T.A) IDX FROM VA T ), VC AS ( SELECT ROW_NUMBER() OVER(ORDER BY NULL) RN FROM VB T ), VD AS ( SELECT T.RN, DATE_ADD(TO_DATE('2020-07-01','yyyy-MM-dd'),T.RN) DT FROM VC T ), VE AS ( SELECT T.DT BATCH, T.RN ID, CASE WHEN T.RN > 5 THEN 'A' ELSE 'B' END TEAM FROM VD T ) SELECT T.BATCH BATCH, S.ID ID, S.NAME NAME, S.TEAM TEAM FROM VE T INNER JOIN DEMO_DATA S ON T.TEAM = S.TEAM """). selectExpr(spark.table("DEMO_PART").columns:_*). write.mode("overwrite").insertInto("DEMO_PART") {code} The result could NOT be read by hive beeline: {code:sql} 0: jdbc:hive2://HOSTNAME:10000> SELECT T.* FROM DEMO_PART T; +-------+---------+----------+---------+ | t.id | t.name | t.batch | t.team | +-------+---------+----------+---------+ +-------+---------+----------+---------+ No rows selected (0.268 seconds) {code} Because the underlying data stored in HDFS was uncorrect: {code:bash} [user@HOSTNAME ~]$ dfs -ls /user/hive/warehouse/demo_part/ Found 21 items /user/hive/warehouse/demo_part/BATCH=2020-07-02 /user/hive/warehouse/demo_part/BATCH=2020-07-03 /user/hive/warehouse/demo_part/BATCH=2020-07-04 /user/hive/warehouse/demo_part/BATCH=2020-07-05 /user/hive/warehouse/demo_part/BATCH=2020-07-06 /user/hive/warehouse/demo_part/BATCH=2020-07-07 /user/hive/warehouse/demo_part/BATCH=2020-07-08 /user/hive/warehouse/demo_part/BATCH=2020-07-09 /user/hive/warehouse/demo_part/BATCH=2020-07-10 /user/hive/warehouse/demo_part/BATCH=2020-07-11 /user/hive/warehouse/demo_part/_SUCCESS /user/hive/warehouse/demo_part/batch=2020-07-02 /user/hive/warehouse/demo_part/batch=2020-07-03 /user/hive/warehouse/demo_part/batch=2020-07-04 /user/hive/warehouse/demo_part/batch=2020-07-05 /user/hive/warehouse/demo_part/batch=2020-07-06 /user/hive/warehouse/demo_part/batch=2020-07-07 /user/hive/warehouse/demo_part/batch=2020-07-08 /user/hive/warehouse/demo_part/batch=2020-07-09 /user/hive/warehouse/demo_part/batch=2020-07-10 /user/hive/warehouse/demo_part/batch=2020-07-11 {code} Both "BATCH=XXXX" and "batch=XXXX" directories appeared, and the data files was stored in "BATCH" directories but not "batch" The result will be correct if I change the SQL statement, simply change the column alias to lower case in the last select, like: {code:sql} SELECT T.BATCH batch, S.ID id, S.NAME name, S.TEAM team FROM VE T INNER JOIN DEMO_DATA S ON T.TEAM = S.TEAM {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org