jvaesteves opened a new issue #1488: [SUPPORT] Hudi table has only five rows 
when record key is binary
URL: https://github.com/apache/incubator-hudi/issues/1488
 
 
   I was trying Hudi on some ORC backup files from my Kafka broker, to see if 
it would be a nice deduplication process for the messages.
   
   The source has 16767835 rows with 4049589 unique keys, but when I first 
create the Hudi table from it, the table contains only 5 rows. Casting the key 
to string fixed this and the duplicate keys were dropped as expected, but I 
want to know whether this cast is a requirement for Hudi to work properly.
   
   Here is a sample from the data, and the snippet that I used.
   
   ```scala
   import org.apache.spark.sql.SaveMode
   import org.apache.spark.sql.functions._
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   
   //Set up various input values as variables
   val inputDataPath = ""
   val hudiTableName = "test"
   val hudiTablePath = "/tmp/kafka-hudi"
   
   // Set up our Hudi Data Source Options
   val hudiOptions = Map[String,String](
       DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "key",
       DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "dt", 
       HoodieWriteConfig.TABLE_NAME -> hudiTableName,
       DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "timestamp"
   )
   
   // Read data from S3 and create a DataFrame with Partition and Record Key
   val inputDF = spark.read.orc(inputDataPath).withColumn("dt", to_date($"timestamp"))
   
   // Write data into the Hudi dataset
   inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiTablePath)
   ```
   ```
   +-------------------------------------------------------------------------+-----------------------+----------+
   |key                                                                      |timestamp              |dt        |
   +-------------------------------------------------------------------------+-----------------------+----------+
   |[38 33 39 36 35 34 32 38 30 7C 33 32 7C 31 31 37 36 31 35 37 34 35 32 37]|2019-11-28 13:24:07.792|2019-11-28|
   |[38 34 34 39 34 33 34 37 32 7C 34 34 7C 31 31 37 36 33 37 39 31 34 36 38]|2019-11-28 13:24:07.792|2019-11-28|
   |[38 34 32 37 39 38 34 37 38 7C 32 7C 31 31 38 33 31 39 33 37 30 31 35]   |2019-11-28 13:24:07.793|2019-11-28|
   |[38 36 36 32 30 31 37 33 36 7C 32 7C 31 31 38 33 32 37 30 39 34 38 35]   |2019-11-28 13:24:07.793|2019-11-28|
   |[38 36 34 34 30 33 36 36 38 7C 31 7C 31 31 37 37 30 37 37 35 30 39 39]   |2019-11-28 13:24:07.798|2019-11-28|
   |[38 36 34 34 32 31 33 38 33 7C 31 7C 31 31 37 37 31 30 36 38 30 30 30]   |2019-11-28 13:24:07.798|2019-11-28|
   |[38 36 33 33 38 38 32 34 32 7C 35 7C 31 31 37 39 34 37 36 39 31 38 38]   |2019-11-28 13:24:07.809|2019-11-28|
   |[38 35 34 34 35 31 31 31 31 7C 33 7C 31 31 38 33 32 30 31 33 33 32 31]   |2019-11-28 13:24:07.81 |2019-11-28|
   |[38 36 35 36 37 38 38 33 35 7C 31 7C 31 31 37 39 33 34 36 38 31 31 32]   |2019-11-28 13:24:07.823|2019-11-28|
   |[38 35 34 38 34 39 38 30 36 7C 32 7C 31 31 38 33 32 32 35 32 38 38 36]   |2019-11-28 13:24:07.823|2019-11-28|
   |[36 39 35 36 32 37 32 32 33 7C 38 39 7C 31 31 37 36 37 39 30 32 31 38 38]|2019-11-28 13:24:05.651|2019-11-28|
   |[38 36 36 33 31 30 39 36 36 7C 31 7C 31 31 38 30 38 38 36 31 31 37 33]   |2019-11-28 13:24:05.653|2019-11-28|
   |[37 39 39 30 35 32 31 36 31 7C 34 7C 31 31 37 36 37 37 30 35 30 32 33]   |2019-11-28 13:24:05.653|2019-11-28|
   |[36 39 35 36 31 38 32 36 38 7C 38 36 7C 31 31 37 36 37 38 39 30 37 31 37]|2019-11-28 13:24:05.655|2019-11-28|
   |[38 33 35 33 34 34 38 38 38 7C 35 39 7C 31 31 37 36 31 36 37 30 37 35 30]|2019-11-28 13:24:05.949|2019-11-28|
   |[38 36 34 38 38 39 31 33 39 7C 31 7C 31 31 37 38 30 30 35 30 35 31 33]   |2019-11-28 13:24:05.951|2019-11-28|
   |[38 36 33 36 38 37 34 33 37 7C 31 7C 31 31 37 35 31 32 37 35 33 31 34]   |2019-11-28 13:24:05.951|2019-11-28|
   |[38 34 36 35 36 30 33 39 33 7C 34 32 7C 31 31 37 36 31 33 30 30 38 36 37]|2019-11-28 13:24:05.952|2019-11-28|
   |[38 36 34 34 35 39 38 35 33 7C 31 7C 31 31 37 37 31 35 32 30 39 37 39]   |2019-11-28 13:24:05.952|2019-11-28|
   |[38 36 30 39 36 33 37 37 33 7C 31 7C 31 31 36 38 38 30 33 36 33 33 30]   |2019-11-28 13:24:05.952|2019-11-28|
   +-------------------------------------------------------------------------+-----------------------+----------+
   ```
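
The byte arrays in the sample above decode to plain ASCII text, which may help explain the symptom. The sketch below is plain Scala with no Spark or Hudi involved. `hexToBytes` is a hypothetical helper, and the idea that the record key is derived by calling `toString` on a byte buffer is an assumption, not something confirmed here. It shows that decoding two of the sample keys as UTF-8 (which is what casting a binary column to string does in Spark) keeps them distinct, while `ByteBuffer.toString` produces the same text for any two keys of equal length.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical helper: parse a space-separated hex dump such as "38 33 39" into raw bytes.
def hexToBytes(hex: String): Array[Byte] =
  hex.trim.split("\\s+").map(h => Integer.parseInt(h, 16).toByte)

// The first two keys from the sample output above.
val key1 = hexToBytes("38 33 39 36 35 34 32 38 30 7C 33 32 7C 31 31 37 36 31 35 37 34 35 32 37")
val key2 = hexToBytes("38 34 34 39 34 33 34 37 32 7C 34 34 7C 31 31 37 36 33 37 39 31 34 36 38")

// Decoding the bytes as UTF-8 keeps the two keys distinct.
val str1 = new String(key1, StandardCharsets.UTF_8) // 839654280|32|11761574527
val str2 = new String(key2, StandardCharsets.UTF_8) // 844943472|44|11763791468

// ByteBuffer.toString reports only position, limit and capacity,
// so two different keys of the same length render identically.
// ASSUMPTION: this is only relevant if the key is stringified this way.
val buf1 = ByteBuffer.wrap(key1).toString // java.nio.HeapByteBuffer[pos=0 lim=24 cap=24]
val buf2 = ByteBuffer.wrap(key2).toString

println(str1)
println(str2)
println(buf1 == buf2) // true: both keys are 24 bytes long
```

If that assumption holds, a table keyed on binary values would collapse to one row per distinct key length, which would be consistent with ending up with only a handful of rows.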
   
   Also, when I use the **dt** column as the **PARTITIONPATH_FIELD_OPT_KEY** 
and list the output directory, the partition is named /18228. Is this the 
expected behaviour?
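
For reference, 18228 is the number of days from 1970-01-01 to 2019-11-28, so the partition name looks like the `dt` date written in its epoch-day form rather than a corrupted value. The sketch below checks that arithmetic with `java.time` only. Why the date ends up serialized this way is not established here, and formatting the column as a string before writing (for example with Spark's `date_format`) is only a guess at a workaround.

```scala
import java.time.LocalDate

// The dt value shown in the sample data.
val partitionDate = LocalDate.parse("2019-11-28")

// Days elapsed since the Unix epoch (1970-01-01).
val epochDay: Long = partitionDate.toEpochDay // 18228

// Converting the directory name back yields the original date.
val roundTrip = LocalDate.ofEpochDay(18228L) // 2019-11-28

println(epochDay)
println(roundTrip)
```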
   
   
   

