jvaesteves opened a new issue #1488: [SUPPORT] Hudi table has only five rows when record key is binary
URL: https://github.com/apache/incubator-hudi/issues/1488

I was trying Hudi on some ORC backup files from my Kafka broker, to see whether it would work well as a deduplication process for the messages. The source has 16767835 rows with 4049589 unique keys, but when I first create the Hudi table from it, the table contains only 5 rows. When I cast the key to string it worked and dropped the duplicate keys as expected, but I want to know whether this cast is a requirement for Hudi to work properly.

Here is the snippet that I used, followed by a sample of the data.

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig

// Set up various input values as variables
val inputDataPath = ""
val hudiTableName = "test"
val hudiTablePath = "/tmp/kafka-hudi"

// Set up our Hudi Data Source Options
val hudiOptions = Map[String,String](
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "key",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "dt",
  HoodieWriteConfig.TABLE_NAME -> hudiTableName,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "timestamp"
)

// Read data from S3 and create a DataFrame with Partition and Record Key
val inputDF = spark.read.orc(inputDataPath).withColumn("dt", to_date($"timestamp"))

// Write data into the Hudi dataset
inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiTablePath)
```

```
+-------------------------------------------------------------------------+-----------------------+----------+
|key                                                                      |timestamp              |dt        |
+-------------------------------------------------------------------------+-----------------------+----------+
|[38 33 39 36 35 34 32 38 30 7C 33 32 7C 31 31 37 36 31 35 37 34 35 32 37]|2019-11-28 13:24:07.792|2019-11-28|
|[38 34 34 39 34 33 34 37 32 7C 34 34 7C 31 31 37 36 33 37 39 31 34 36 38]|2019-11-28 13:24:07.792|2019-11-28|
|[38 34 32 37 39 38 34 37 38 7C 32 7C 31 31 38 33 31 39 33 37 30 31 35]   |2019-11-28 13:24:07.793|2019-11-28|
|[38 36 36 32 30 31 37 33 36 7C 32 7C 31 31 38 33 32 37 30 39 34 38 35]   |2019-11-28 13:24:07.793|2019-11-28|
|[38 36 34 34 30 33 36 36 38 7C 31 7C 31 31 37 37 30 37 37 35 30 39 39]   |2019-11-28 13:24:07.798|2019-11-28|
|[38 36 34 34 32 31 33 38 33 7C 31 7C 31 31 37 37 31 30 36 38 30 30 30]   |2019-11-28 13:24:07.798|2019-11-28|
|[38 36 33 33 38 38 32 34 32 7C 35 7C 31 31 37 39 34 37 36 39 31 38 38]   |2019-11-28 13:24:07.809|2019-11-28|
|[38 35 34 34 35 31 31 31 31 7C 33 7C 31 31 38 33 32 30 31 33 33 32 31]   |2019-11-28 13:24:07.81 |2019-11-28|
|[38 36 35 36 37 38 38 33 35 7C 31 7C 31 31 37 39 33 34 36 38 31 31 32]   |2019-11-28 13:24:07.823|2019-11-28|
|[38 35 34 38 34 39 38 30 36 7C 32 7C 31 31 38 33 32 32 35 32 38 38 36]   |2019-11-28 13:24:07.823|2019-11-28|
|[36 39 35 36 32 37 32 32 33 7C 38 39 7C 31 31 37 36 37 39 30 32 31 38 38]|2019-11-28 13:24:05.651|2019-11-28|
|[38 36 36 33 31 30 39 36 36 7C 31 7C 31 31 38 30 38 38 36 31 31 37 33]   |2019-11-28 13:24:05.653|2019-11-28|
|[37 39 39 30 35 32 31 36 31 7C 34 7C 31 31 37 36 37 37 30 35 30 32 33]   |2019-11-28 13:24:05.653|2019-11-28|
|[36 39 35 36 31 38 32 36 38 7C 38 36 7C 31 31 37 36 37 38 39 30 37 31 37]|2019-11-28 13:24:05.655|2019-11-28|
|[38 33 35 33 34 34 38 38 38 7C 35 39 7C 31 31 37 36 31 36 37 30 37 35 30]|2019-11-28 13:24:05.949|2019-11-28|
|[38 36 34 38 38 39 31 33 39 7C 31 7C 31 31 37 38 30 30 35 30 35 31 33]   |2019-11-28 13:24:05.951|2019-11-28|
|[38 36 33 36 38 37 34 33 37 7C 31 7C 31 31 37 35 31 32 37 35 33 31 34]   |2019-11-28 13:24:05.951|2019-11-28|
|[38 34 36 35 36 30 33 39 33 7C 34 32 7C 31 31 37 36 31 33 30 30 38 36 37]|2019-11-28 13:24:05.952|2019-11-28|
|[38 36 34 34 35 39 38 35 33 7C 31 7C 31 31 37 37 31 35 32 30 39 37 39]   |2019-11-28 13:24:05.952|2019-11-28|
|[38 36 30 39 36 33 37 37 33 7C 31 7C 31 31 36 38 38 30 33 36 33 33 30]   |2019-11-28 13:24:05.952|2019-11-28|
+-------------------------------------------------------------------------+-----------------------+----------+
```

Also, if I use the **dt** column as the **PARTITIONPATH_FIELD_OPT_KEY**, the partition name is /18228 when I ls the output directory. Is this the expected behaviour?
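
The two observations above can be checked in plain Scala, without Spark or Hudi. This is only a sketch: the object and helper names (`BinaryKeyDemo`, `parseHex`, `decodeKey`) are made up for illustration, and the key bytes are copied from two of the sample rows. It decodes the binary keys as UTF-8 text, which is roughly what casting the column to string does in Spark. It also shows that `java.nio.ByteBuffer.toString` reports only position, limit, and capacity, so two different keys of the same length yield the same string. Whether Hudi derives the record key from such a string representation is an assumption that this issue does not confirm, but it would be consistent with a table that collapses to a handful of rows. The final line assumes, also unconfirmed, that the partition name 18228 is the epoch-day number of `dt`.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import java.time.LocalDate

// Hypothetical helper object, for illustration only (not part of Hudi or Spark).
object BinaryKeyDemo {
  // Parse a space-separated hex dump such as "38 33 7C" into raw bytes.
  def parseHex(dump: String): Array[Byte] =
    dump.trim.split("\\s+").map(h => Integer.parseInt(h, 16).toByte)

  // Decode key bytes as UTF-8 text, similar to $"key".cast("string") in Spark.
  def decodeKey(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  // Two distinct 23-byte keys copied from the sample rows.
  val keyA: Array[Byte] =
    parseHex("38 34 32 37 39 38 34 37 38 7C 32 7C 31 31 38 33 31 39 33 37 30 31 35")
  val keyB: Array[Byte] =
    parseHex("38 36 36 32 30 31 37 33 36 7C 32 7C 31 31 38 33 32 37 30 39 34 38 35")

  def main(args: Array[String]): Unit = {
    // The decoded keys are distinct, readable strings.
    println(decodeKey(keyA)) // 842798478|2|11831937015
    println(decodeKey(keyB)) // 866201736|2|11832709485

    // ByteBuffer.toString ignores the content, so same-length keys look identical.
    println(ByteBuffer.wrap(keyA).toString == ByteBuffer.wrap(keyB).toString) // true

    // 18228 days after 1970-01-01 is the date shown in the dt column.
    println(LocalDate.ofEpochDay(18228L)) // 2019-11-28
  }
}
```

In spark-shell, the equivalent of the cast workaround would be something like `inputDF.withColumn("key", $"key".cast("string"))` applied before the write.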