Hi Danny,

Thank you for your response.
The file is actually used by the user who wrote that blog. In my actual
dataset, I have a schema like customerid, customername, effective_date,
customer_mob. In this case, how will Hudi manage CDC?

Thanks,
Sid

On Mon, Feb 21, 2022 at 8:06 AM Danny Chan <danny0...@apache.org> wrote:

> Hello, what is the schema of the reading file: S3_INCR_RAW_DATA ?
>
> Best,
> Danny
>
> Sid Kal <flinkbyhe...@gmail.com> wrote on Mon, Feb 21, 2022 at 03:49:
>
> > We have a use case for which we were planning to use Hudi tables for
> > CDC purposes. Basically, my whole intention is to perform upserts along
> > with deletes: if a record is deleted in my source system, it should be
> > deleted from my target as well.
> >
> > I went through this link, where a user performs CDC using Hudi:
> >
> > https://towardsdatascience.com/data-lake-change-data-capture-cdc-using-apache-hudi-on-amazon-emr-part-2-process-65e4662d7b4b
> >
> > My question is: how does Hudi internally recognize the records in the
> > incremental data load? How should the incremental file be structured so
> > that we can recognize which records are meant to be
> > appended/deleted/updated?
> >
> > I am actually confused by this part:
> >
> > S3_INCR_RAW_DATA = \
> >     "s3://aws-analytics-course/raw/dms/fossil/coal_prod/20200808-*.csv"
> > df_coal_prod_incr = spark.read.csv(S3_INCR_RAW_DATA, header=False,
> >                                    schema=coal_prod_schema)
> > df_coal_prod_incr_u_i = df_coal_prod_incr.filter("Mode IN ('U', 'I')")
> >
> > Here the user is filtering directly on Mode. Is "Mode" a column inside
> > the dataset? Or how does it work?
> >
> > I am a newbie to Hudi.
> >
> > Thanks,
> > Sid
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@hudi.apache.org
> For additional commands, e-mail: users-h...@hudi.apache.org
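For what it's worth, here is a plain-Python sketch (not actual Hudi or Spark code) of what the quoted filter is doing: a change-flag column such as "Mode" is emitted by the CDC extractor itself (e.g. 'I' for insert, 'U' for update, 'D' for delete), and the job routes records on that flag before handing them to Hudi. The customer rows below are made up to match the schema in the question.

```python
# Hypothetical incremental batch with a DMS-style "Mode" change flag.
# In the blog's Spark code the same split is done with
#   df.filter("Mode IN ('U', 'I')")  and a corresponding filter on 'D'.
incremental_rows = [
    {"Mode": "I", "customerid": 1, "customername": "Alice"},
    {"Mode": "U", "customerid": 2, "customername": "Bob"},
    {"Mode": "D", "customerid": 3, "customername": "Carol"},
]

# Records flagged I/U would be upserted into the Hudi table.
upserts = [r for r in incremental_rows if r["Mode"] in ("U", "I")]

# Records flagged D would be issued to Hudi as deletes.
deletes = [r for r in incremental_rows if r["Mode"] == "D"]

print([r["customerid"] for r in upserts])  # [1, 2]
print([r["customerid"] for r in deletes])  # [3]
```

In a real pipeline the two sets would be written with Hudi's upsert and delete write operations respectively, keyed on the record key field (customerid in this hypothetical schema).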