LeoHsu0802 opened a new issue #2184: URL: https://github.com/apache/hudi/issues/2184
**Describe the problem you faced**

Partition values are duplicated after an UPSERT.

**Settings in Jupyter Notebook**

```
%%configure -f
{
  "conf": {
    "spark.jars": "hdfs:///user/hadoop/httpclient-4.5.9.jar, hdfs:///user/hadoop/httpcore-4.4.11.jar, hdfs:///user/hadoop/hudi-spark-bundle.jar, hdfs:///user/hadoop/spark-avro.jar",
    "spark.sql.hive.convertMetastoreParquet": "false",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.dynamicAllocation.executorIdleTimeout": 3600
  }
}
```

**To Reproduce**

1. Load the raw data into a DataFrame and write it to S3 as a Hudi dataset:

```python
(df.write.format("org.apache.hudi")
  .option("hoodie.datasource.write.precombine.field", "tstamp")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.table.name", config['table_name'])
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.bulkinsert.shuffle.parallelism", 6)
  .option("hoodie.consistency.check.enabled", "true")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.ComplexKeyGenerator")
  .option("hoodie.datasource.write.partitionpath.field", 'country, gateway, year, month, day')
  .option("hoodie.datasource.hive_sync.table", config['table_name'])
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option("hoodie.datasource.hive_sync.partition_fields", 'country, gateway, year, month, day')
  .option("hoodie.datasource.hive_sync.database", 'hudi_dev')
  .mode("Overwrite")
  .save(config['target']))
```

2. The write succeeds and the table can be queried via spark.sql.
3. Change the country value for ids 101-105:

![image](https://user-images.githubusercontent.com/38006639/96225748-2002e780-0fc4-11eb-959c-dcee462dd22c.png)
4. Upsert the DataFrame:

```python
(upsert_df.write.format("org.apache.hudi")
  .option("hoodie.datasource.write.precombine.field", "tstamp")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.table.name", config['table_name'])
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.upsert.shuffle.parallelism", 20)
  .option("hoodie.consistency.check.enabled", "true")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option("hoodie.datasource.write.partitionpath.field", 'country, year, month, day')
  .option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS")
  .option("hoodie.cleaner.commits.retained", "10")
  .option("hoodie.datasource.hive_sync.table", config['table_name'])
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option("hoodie.datasource.hive_sync.partition_fields", 'country, year, month, day')
  .option("hoodie.datasource.hive_sync.database", 'hudi_dev')
  .mode("Append")
  .save(config['target']))
```

5. Query the upsert result:

![image](https://user-images.githubusercontent.com/38006639/96223799-175ce200-0fc1-11eb-880a-93dd9a2c3af4.png)

**Expected behavior**

I expected no duplicate records in the query result.

**Environment Description**

* EMR version : 5.31.0
* Hudi version : 0.6.0
* Spark version : 2.4.6
* Hive version : 2.3.7
* Hadoop version : Amazon 2.10.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

Jars on the classpath:

* httpclient-4.5.9.jar
* httpcore-4.4.11.jar
* hudi-spark-bundle.jar
* spark-avro.jar

**Stacktrace**

None.
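For illustration, a minimal pure-Python sketch of one plausible mechanism behind the duplicates (this is an assumption, not a confirmed diagnosis of the issue): Hudi's default index is scoped to a partition path, so when an upserted record's partition value changes (here, `country`), the lookup in the new partition finds nothing, the record is written there as a fresh insert, and the old copy survives in the original partition. The `upsert_partition_scoped` helper and the toy table below are hypothetical names used only for this sketch.

```python
def upsert_partition_scoped(table, records):
    """Toy upsert into {(partition, record_key): row}.
    Lookup happens only within the record's (new) partition,
    mimicking a non-global, partition-scoped index."""
    for row in records:
        key = (row["country"], row["id"])  # partition path + record key
        table[key] = row                   # update-or-insert within that partition
    return table

table = {}
upsert_partition_scoped(table, [{"id": 101, "country": "TW", "v": 1}])
# Change the partition value (country) and upsert the same record key again:
upsert_partition_scoped(table, [{"id": 101, "country": "US", "v": 2}])

# id 101 now exists in two partitions -> the duplicate seen in the query
dup_ids = [i for i in {k[1] for k in table}
           if sum(1 for k in table if k[1] == i) > 1]
print(dup_ids)
```

Under this toy model, making the index global (so the lookup spans all partitions) would find and move the old row instead of leaving it behind, which matches the partition-change symptom described above.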