Hi Peter,

I think what you are referring to is a typical amendment process, where some or all
of the results need to be modified. It is definitely an interesting topic! Here are
my two cents.

In an ideal world, the reference data source can emit its updated values as events,
which are then joined with buffered events in windows. (It's a bit counter-intuitive,
but imagine a magic function where we ingest all reference data as a stream instead
of doing on-demand RPC.)
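
In Flink terms, one way to sketch that "magic function" is a connected stream where
the latest reference value per key is kept in keyed state (rough, untested sketch;
Event, RefUpdate, EnrichedEvent and the refKey/newValue fields are made-up names):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Main events keyed by the reference key, connected with a stream of
// reference data updates keyed the same way.
DataStream<Event> events = ...;          // your main data
DataStream<RefUpdate> refUpdates = ...;  // reference data changes as a stream

events
    .keyBy(e -> e.refKey)
    .connect(refUpdates.keyBy(u -> u.refKey))
    .flatMap(new RichCoFlatMapFunction<Event, RefUpdate, EnrichedEvent>() {
        private transient ValueState<String> refValue;

        @Override
        public void open(Configuration parameters) {
            refValue = getRuntimeContext().getState(
                new ValueStateDescriptor<>("refValue", String.class));
        }

        @Override
        public void flatMap1(Event e, Collector<EnrichedEvent> out) throws Exception {
            // Enrich each event with the latest reference value for its key.
            out.collect(new EnrichedEvent(e, refValue.value()));
        }

        @Override
        public void flatMap2(RefUpdate u, Collector<EnrichedEvent> out) throws Exception {
            // Apply the reference data change; later events see the new value.
            refValue.update(u.newValue);
        }
    });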

Unfortunately, in many use cases it is hard to know exactly how the reference data
was used, and dumping the full reference data set costs too much. So a replay
pipeline might be the cheapest way to get things done in general.

In some cases, results are partitioned and bounded. That makes it possible to
recompute them within bounded windows; it may take a bit of work to customize the
window so that its state is held after the watermark passes its end time. I remember
there was a JIRA discussion about retraction.
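For example, something along these lines could work (rough sketch; MyAggregate and
UpsertSink are placeholders for your own AggregateFunction and an upsert-capable
sink):

import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

events
    .keyBy(e -> e.partitionKey)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    // keep window state after the watermark passes the window end, so
    // late or re-submitted records make the window fire again with a
    // corrected result
    .allowedLateness(Time.days(7))
    .aggregate(new MyAggregate())
    // downstream has to treat each firing as an upsert of the previous result
    .addSink(new UpsertSink());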
In other cases, results are derived from a long history that is not practical to
keep around. A side pipeline that captures those events with late-arrival handling
could interact with external storage and amend the results.
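
Roughly like this, using the late-data side output (ResultStore, MyAggregate and
mergedWith are made-up placeholders for your external store and merge logic):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

final OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

SingleOutputStreamOperator<Result> results = events
    .keyBy(e -> e.partitionKey)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .sideOutputLateData(lateTag)      // events too late for the window go here
    .aggregate(new MyAggregate());

// The side pipeline: read the previously emitted result from external
// storage, merge the late event in, and write the amended result back.
results.getSideOutput(lateTag)
    .addSink(new RichSinkFunction<Event>() {
        private transient ResultStore store;   // made-up external client

        @Override
        public void open(Configuration parameters) {
            store = ResultStore.connect();
        }

        @Override
        public void invoke(Event late) throws Exception {
            Result current = store.read(late.partitionKey);
            store.write(late.partitionKey, current.mergedWith(late));
        }
    });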

Thanks,
Chen
 

From: Peter Lappo
Sent: Sunday, September 10, 2017 3:00 PM
To: user@flink.apache.org
Subject: ETL with changing reference data

hi,
We are building an ETL style application in Flink that consumes records from a 
file or a message bus as a DataStream. We are transforming records using SQL 
and UDFs. The UDF loads reference data in the open method and currently the 
data loaded remains in memory until the job is cancelled. The eval method of 
the UDF is used to do the actual transformation on a particular field.
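For context, the UDF looks roughly like this (simplified, names made up):

import java.util.Map;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;

public class EnrichField extends ScalarFunction {

    private transient Map<String, String> referenceData;

    @Override
    public void open(FunctionContext context) throws Exception {
        // Loaded once when the task starts; stays in memory until the job is cancelled.
        referenceData = ReferenceDataLoader.loadAll();   // made-up loader
    }

    public String eval(String key) {
        // The actual transformation on a particular field.
        return referenceData.getOrDefault(key, key);
    }
}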
So of course the reference data changes and data will need to be reprocessed. Let's 
assume we can identify and resubmit records for reprocessing; what is the best 
design that
* keeps the Flink job running
* reloads the changed reference data
so that records are reprocessed in a deterministic fashion

Two options spring to mind:
1) send a control record to the stream that reloads reference data or part of 
it and ensure resubmitted records are processed after the reload message
2) use a separate thread to poll the reference data source and reload any 
changed data which will of course suffer from race conditions

Or is there a better way of solving this type of problem with Flink?

Thanks
Peter
