Hi, vtygoss

> I'm working on migrating from full-data-pipeline(with spark) to 
> incremental-data-pipeline(with flink cdc), and i met a problem about accuracy 
> validation between pipeline based flink and spark.

Glad to hear that!



> For bounded data, it's simple to validate whether the two result sets are 
> consistent or not. 
> But for unbounded data and event-driven applications, how can we make sure the 
> data stream produced is correct, especially when there are some retract 
> functions with high impact, e.g. row_number? 
> 
> Is there any document for this problem? Thanks for any suggestions or 
> replies. 

The validation feature belongs to the data quality scope, in my understanding; it's 
usually provided by the platform, e.g. a Data Integration Platform. As the 
underlying pipeline engine/tool, Flink CDC should expose more metrics and data 
quality checking abilities, but we don't offer them yet; these enhancements are 
on our roadmap. Currently, you can use the Flink source/sink operators' metrics 
as a rough validation, and you can also compare the record counts in your source 
database and sink system multiple times for a more accurate validation.
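
As a minimal sketch of the count comparison (assuming both your source database 
and your sink system happen to be reachable over JDBC; the URLs, credentials and 
the "orders" table name below are placeholders, not part of Flink CDC itself):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CountValidation {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC endpoints -- replace with your real source and sink.
        String sourceUrl = "jdbc:mysql://source-host:3306/db?user=app&password=secret";
        String sinkUrl   = "jdbc:mysql://sink-host:3306/db?user=app&password=secret";

        long sourceCount = count(sourceUrl, "SELECT COUNT(*) FROM orders");
        long sinkCount   = count(sinkUrl,   "SELECT COUNT(*) FROM orders");

        // Since the pipeline runs continuously, the counts may differ transiently;
        // repeat the comparison several times and check whether the gap converges.
        System.out.printf("source=%d, sink=%d, diff=%d%n",
                sourceCount, sinkCount, sourceCount - sinkCount);
    }

    private static long count(String url, String query) throws Exception {
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            rs.next();
            return rs.getLong(1);
        }
    }
}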

Best,
Leonard
