We have a use case for which we were planning to use Hudi tables for CDC purposes. Basically, my intention is to perform upserts along with deletes: if a record is deleted in my source system, it should be deleted from my target as well.
I went through this article where a user performs CDC using Hudi: https://towardsdatascience.com/data-lake-change-data-capture-cdc-using-apache-hudi-on-amazon-emr-part-2-process-65e4662d7b4b

My question is: how does Hudi internally recognize the records in an incremental data load? In other words, how should the incremental file be structured so that we can tell which records are meant to be appended, updated, or deleted?

I am confused by this part in particular:

```python
S3_INCR_RAW_DATA = "s3://aws-analytics-course/raw/dms/fossil/coal_prod/20200808-*.csv"
df_coal_prod_incr = spark.read.csv(S3_INCR_RAW_DATA, header=False, schema=coal_prod_schema)
df_coal_prod_incr_u_i = df_coal_prod_incr.filter("Mode IN ('U', 'I')")
```

Here the user filters directly on Mode. Is "Mode" a column inside the dataset? If so, where does it come from? I am a newbie to Hudi.

Thanks,
Sid
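To clarify what I mean by "recognizing" records, here is a minimal pure-Python sketch of my mental model (no Hudi or Spark involved; the `op`/`id`/`value` field names are hypothetical, loosely modeled on a DMS-style change file with a per-row operation flag). Is this roughly the semantics Hudi applies during an upsert/delete?

```python
# Hypothetical sketch of CDC apply semantics: each incremental row carries an
# operation flag ('I' = insert, 'U' = update, 'D' = delete). This is only my
# assumption of the behavior, not actual Hudi internals.

def apply_cdc(target, incremental_rows):
    """Apply insert/update/delete operations to a dict keyed by record id."""
    for row in incremental_rows:
        op, key = row["op"], row["id"]
        if op in ("I", "U"):
            # Upsert: insert a new record or overwrite the existing one
            target[key] = row.get("value")
        elif op == "D":
            # Delete: drop the record from the target if present
            target.pop(key, None)
    return target

# Example incremental batch, similar in spirit to a DMS change file
incr = [
    {"op": "I", "id": 1, "value": "coal"},
    {"op": "U", "id": 2, "value": "gas-updated"},
    {"op": "D", "id": 3},
]
target = {2: "gas", 3: "oil"}
print(apply_cdc(target, incr))  # {2: 'gas-updated', 1: 'coal'}
```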