Re: accuracy validation of streaming pipeline

2022-05-24 Thread Leonard Xu
Hi, vtygoss

> I'm working on migrating from a full-data pipeline (with Spark) to an 
> incremental-data pipeline (with Flink CDC), and I've met a problem about accuracy 
> validation between the Flink-based and Spark-based pipelines.

Glad to hear that!



> For bounded data, it's simple to validate whether the two result sets are 
> consistent. 
> But for unbounded data and event-driven applications, how can we make sure the 
> data stream produced is correct, especially when there are some retract 
> functions with a high impact, e.g. row_number? 
> 
> Is there any document for this problem?  Thanks for any suggestions or 
> replies. 

From my understanding, this validation feature belongs to the data quality 
scope; it's usually provided by the platform, e.g. a Data Integration Platform. 
As the underlying pipeline engine/tool, Flink CDC should expose more metrics and 
data-quality checking abilities, but we don't offer them yet; these 
enhancements are on our roadmap.  Currently, you can use the Flink source/sink 
operators' metrics as a rough validation, and you can also compare the record 
counts in your source database and sink system multiple times for a more 
accurate validation.
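Leonard's count-comparison suggestion can be sketched as below. This is a
minimal sketch, not a Flink CDC API: `run_query` stands in for whatever client
(JDBC, REST, etc.) each system actually exposes, and the table name is a
placeholder. Because the sink lags the source while the pipeline catches up,
the comparison is retried several times and only a persistent mismatch is
reported.

```python
import time

def count_rows(run_query, table):
    """Run a COUNT(*) against one system. `run_query` is a hypothetical
    stand-in for the system's real query client."""
    return run_query(f"SELECT COUNT(*) FROM {table}")

def counts_converge(source_query, sink_query, table, attempts=5, delay_s=10):
    """Compare source and sink row counts several times.

    Counts may differ transiently while the pipeline catches up, so a
    mismatch only counts if it persists across every attempt."""
    for _ in range(attempts):
        if count_rows(source_query, table) == count_rows(sink_query, table):
            return True
        time.sleep(delay_s)
    return False
```

Note this only checks cardinality; rows could still differ in content, which is
why Leonard describes it as a rough validation.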

Best,
Leonard



Re: accuracy validation of streaming pipeline

2022-05-24 Thread Shengkai Fang
Hi, all.

From my understanding, accuracy validation for the sync pipeline requires
snapshotting the source and sink at some point.  It is just like having a
checkpoint that contains all the data at some time for both the sink and the
source. Then we can compare the contents of the two snapshots and find the
difference.

The main problem is how we can snapshot the data in the source/sink, or
provide some meaningful metrics to compare, at those points.
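The snapshot-and-diff idea above can be sketched as follows. This assumes each
side can be dumped to a keyed mapping at the chosen point; how such a snapshot
is actually captured (the hard part Shengkai raises) is left open, so the
snapshots here are plain dicts.

```python
def diff_snapshots(source_snapshot, sink_snapshot):
    """Compare two keyed snapshots taken at the same logical point.

    Returns keys missing from the sink, keys only in the sink, and
    keys present on both sides whose values disagree."""
    src_keys, snk_keys = set(source_snapshot), set(sink_snapshot)
    missing = src_keys - snk_keys
    extra = snk_keys - src_keys
    mismatched = {k for k in src_keys & snk_keys
                  if source_snapshot[k] != sink_snapshot[k]}
    return missing, extra, mismatched
```

An empty result on all three sets means the two snapshots agree at that point.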

Best,
Shengkai

Xuyang wrote on Tue, May 24, 2022, at 21:32:

> I think for unbounded data, we can only check the result at one point
> in time; that is the kind of work Watermark[1] does. What about tagging one
> point in time and validating the data accuracy at that moment?
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/table/sql/create/#watermark
>
> On 2022-05-20 16:02:39, "vtygoss" wrote:
>
> Hi community!
>
>
> I'm working on migrating from a full-data pipeline (with Spark) to an
> incremental-data pipeline (with Flink CDC), and I've met a problem about
> accuracy validation between the Flink-based and Spark-based pipelines.
>
>
> For bounded data, it's simple to validate whether the two result sets are
> consistent.
>
> But for unbounded data and event-driven applications, how can we make sure the
> data stream produced is correct, especially when there are some retract
> functions with a high impact, e.g. row_number?
>
>
> Is there any document for this problem?  Thanks for any suggestions
> or replies.
>
>
> Best Regards!
>
>


Re: accuracy validation of streaming pipeline

2022-05-23 Thread Shengkai Fang
It's a good question. Let me ping @Leonard to share more thoughts.

Best,
Shengkai

vtygoss wrote on Fri, May 20, 2022, at 16:04:

> Hi community!
>
>
> I'm working on migrating from a full-data pipeline (with Spark) to an
> incremental-data pipeline (with Flink CDC), and I've met a problem about
> accuracy validation between the Flink-based and Spark-based pipelines.
>
>
> For bounded data, it's simple to validate whether the two result sets are
> consistent.
>
> But for unbounded data and event-driven applications, how can we make sure the
> data stream produced is correct, especially when there are some retract
> functions with a high impact, e.g. row_number?
>
>
> Is there any document for this problem?  Thanks for any suggestions
> or replies.
>
>
> Best Regards!
>


accuracy validation of streaming pipeline

2022-05-20 Thread vtygoss
Hi community!


I'm working on migrating from a full-data pipeline (with Spark) to an 
incremental-data pipeline (with Flink CDC), and I've met a problem about accuracy 
validation between the Flink-based and Spark-based pipelines.


For bounded data, it's simple to validate whether the two result sets are 
consistent. 
But for unbounded data and event-driven applications, how can we make sure the data 
stream produced is correct, especially when there are some retract functions 
with a high impact, e.g. row_number? 


Is there any document for this problem?  Thanks for any suggestions or 
replies. 


Best Regards!
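For pipelines with retracting operators like row_number, one cross-check (a
sketch, not a Flink API, though the changelog entry kinds mirror Flink's
+I/-U/+U/-D row kinds) is to replay the emitted changelog into a materialized
table and compare the final state against the batch (Spark) result computed
over the same bounded input:

```python
def materialize(changelog):
    """Replay a keyed changelog into final state.

    Each entry is (kind, key, value), with kind one of:
      +I insert, -U retract old value, +U new value, -D delete.
    After a full replay, state should equal the batch result for
    the same bounded input."""
    state = {}
    for kind, key, value in changelog:
        if kind in ("+I", "+U"):
            state[key] = value           # insert, or value after update
        elif kind in ("-U", "-D"):
            state.pop(key, None)         # retraction; -U precedes a +U
    return state
```

If the materialized streaming state and the batch result disagree, the
difference localizes which keys the retracting operator handled incorrectly.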