We have a use case for which we were planning to use Hudi tables for CDC
purposes. Basically, my whole intention is to perform upserts along with
deletes. So, if a record in my source system is deleted, it should be
deleted from my target as well.
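To make my intent concrete, here is a plain-Python sketch of the upsert-plus-delete semantics I want Hudi to apply (the function and variable names are my own, just for illustration):

```python
# Sketch of CDC apply semantics: the target is keyed by a record key;
# incoming rows either upsert that key or remove it entirely.

def apply_cdc(target, batch):
    """Apply a CDC batch to a keyed target table (dict of key -> row)."""
    for key, row, op in batch:
        if op in ("I", "U"):      # insert or update -> upsert the row
            target[key] = row
        elif op == "D":           # delete -> drop the key from the target
            target.pop(key, None)
    return target

target = {1: {"name": "old"}}
batch = [
    (1, {"name": "new"}, "U"),    # update existing key 1
    (2, {"name": "added"}, "I"),  # insert new key 2
    (1, None, "D"),               # delete key 1
]
apply_cdc(target, batch)
# target is now {2: {"name": "added"}}
```

That is the end state I need Hudi to produce on the target table for each incremental batch.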

I went through this link where a user is performing CDC using Hudi.
https://towardsdatascience.com/data-lake-change-data-capture-cdc-using-apache-hudi-on-amazon-emr-part-2-process-65e4662d7b4b

My question is: how does Hudi internally recognize the records in the
incremental data load? In other words, how should the incremental file be
structured so that we can tell which records are meant to be
inserted, updated, or deleted?
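My rough mental model (please correct me if this is wrong) is that Hudi matches incoming rows to existing ones by the configured record key, and uses the precombine field (e.g. a timestamp) to pick the winner when the same key appears more than once. A toy sketch of that matching logic, with made-up field names:

```python
# Toy model of record-key matching with a precombine field:
# keep only the latest row per record key.

def precombine(rows, key_field, precombine_field):
    """Deduplicate rows, keeping the one with the largest precombine value."""
    latest = {}
    for row in rows:
        k = row[key_field]
        if k not in latest or row[precombine_field] > latest[k][precombine_field]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"id": 1, "ts": 1, "val": "a"},
    {"id": 1, "ts": 3, "val": "b"},   # newer row for id=1 should win
    {"id": 2, "ts": 2, "val": "c"},
]
precombine(rows, "id", "ts")
# -> id=1 keeps val "b", id=2 keeps val "c"
```

Is that roughly what Hudi does under the hood when it sees an incremental load?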

I am actually confused with this part:

# Read the incremental (CDC) extract and keep only updates and inserts
S3_INCR_RAW_DATA = "s3://aws-analytics-course/raw/dms/fossil/coal_prod/20200808-*.csv"
df_coal_prod_incr = spark.read.csv(S3_INCR_RAW_DATA, header=False,
                                   schema=coal_prod_schema)
df_coal_prod_incr_u_i = df_coal_prod_incr.filter("Mode IN ('U', 'I')")

Here the user is filtering directly on Mode. Is "Mode" a column inside
the dataset, or where does it come from?
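My guess is that it is the operation flag that AWS DMS writes as the first column of each CDC file (I/U/D), which the article's schema apparently names "Mode" -- but I would like confirmation. A toy example of a file in that shape (the data here is made up by me):

```python
import csv
import io

# Toy CDC file in the shape I *think* DMS produces: the first column is
# an operation flag (I = insert, U = update, D = delete), and the schema
# applied at read time names it "Mode". The rows below are made up.
raw = """I,1,Australia,2020
U,2,India,2020
D,3,Poland,2019
"""

rows = list(csv.reader(io.StringIO(raw)))
upserts = [r for r in rows if r[0] in ("U", "I")]  # what the filter keeps
deletes = [r for r in rows if r[0] == "D"]         # handled separately
# upserts has 2 rows, deletes has 1
```

If that is right, then the filter in the article is just splitting the batch into the upsert path and the delete path.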

I am a newbie to Hudi.

Thanks,
Sid
