[ https://issues.apache.org/jira/browse/SPARK-44473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-44473:
-----------------------------------
    Labels: pull-request-available  (was: )

> Overwriting the same partition of a partitioned table multiple times with empty data yields non-idempotent results
> ------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-44473
> URL: https://issues.apache.org/jira/browse/SPARK-44473
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.3, 3.2.4, 3.3.2, 3.4.1
> Environment: spark : 3.x
> Reporter: chris Yu
> Priority: Major
> Labels: pull-request-available
>
>
> Preparation:
> Create a simple partitioned table using Spark 3.x, for example:
>
> {code:java}
> spark-sql> create table test1 (a int) partitioned by (dt string);
> Time taken: 0.219 seconds{code}
>
>
> * Overwrite a new partition with empty data. Both the partition metadata and the corresponding HDFS path are created, for example:
> {code:java}
> spark-sql> insert overwrite table test1 partition(dt='20230702') select 2 where 1 <> 1;
> Time taken: 0.992 seconds
> spark-sql> dfs -ls /user/hive/warehouse/test1;
> Found 2 items
> -rw-r--r-- 2 hadoop hadoop 0 2023-07-18 14:41 /user/hive/warehouse/test1/_SUCCESS
> drwxrwxrwx - hadoop hadoop 0 2023-07-18 14:41 /user/hive/warehouse/test1/dt=20230702
> spark-sql> show partitions test1;
> dt=20230702
> Time taken: 0.162 seconds, Fetched 1 row(s)
> {code}
> * After re-running the same insert overwrite statement, the HDFS path for this partition no longer exists, although the partition is still listed:
>
> {code:java}
> spark-sql> insert overwrite table test1 partition(dt='20230702') select 2 where 1 <> 1;
> Time taken: 0.706 seconds
> spark-sql> dfs -ls /user/hive/warehouse/test1;
> Found 1 items
> -rw-r--r-- 2 hadoop hadoop 0 2023-07-18 14:45 /user/hive/warehouse/test1/_SUCCESS
> spark-sql> show partitions test1;
> dt=20230702
> Time taken: 0.183 seconds, Fetched 1 row(s){code}
> Subsequent tasks that read this HDFS path then fail with a path-does-not-exist exception, which has caused problems for us.
>
> I expected that executing the same statement multiple times would give the same result, i.e. that the operation would be {*}idempotent{*}. Thanks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org