Best practices to maintain reference data for Flink Jobs

Sand Stone Thu, 18 May 2017 16:11:29 -0700

Hi. Say I have a few reference data sets that need to be used in a
streaming job. The sizes range from 10M to 10GB. The data is not
static; it will be refreshed at intervals of minutes and/or days.


With the new advancements in Flink, it seems there are quite a few options.
   A. Store all the data in an external key-value database cluster and
query it with async I/O calls.
          * Data refresh can be done in a few different ways.
   B. Use the new Queryable State feature.
          * It seems there is no "easy" API to discover the queryable
state at the moment; one needs to use the REST API to figure out the
job ID.
   C. Ingest the reference data into the job and cache it in memory.
Any other options?
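
To make option C a bit more concrete, here is a minimal sketch of the caching side only, in plain Java with no Flink dependency. The class name ReferenceDataCache and the loader supplier are hypothetical names I made up for illustration. The idea is that each refresh builds a complete new immutable snapshot and swaps it in atomically, so a lookup never sees a half-loaded data set. How the refresh gets triggered inside a job (a timer, a second input stream, etc.) is left out, and this assumes the data set fits in the memory of each task, which may not hold at the 10GB end.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical helper: holds an immutable snapshot of the reference data
// and replaces it as a whole on every refresh.
class ReferenceDataCache<K, V> {
    // Assumed to return the full, current reference data set on each call.
    private final Supplier<Map<K, V>> loader;

    // Readers always see either the old or the new snapshot, never a mix.
    private final AtomicReference<Map<K, V>> snapshot =
            new AtomicReference<>(Collections.<K, V>emptyMap());

    ReferenceDataCache(Supplier<Map<K, V>> loader) {
        this.loader = loader;
        refresh(); // load once up front so lookups work immediately
    }

    // Reload everything into a private copy, then publish it atomically.
    void refresh() {
        Map<K, V> copy = new HashMap<>(loader.get());
        snapshot.set(Collections.unmodifiableMap(copy));
    }

    // Returns the cached value, or null if the key is not in the snapshot.
    V lookup(K key) {
        return snapshot.get().get(key);
    }

    int size() {
        return snapshot.get().size();
    }
}
```

With this, changes in the source only become visible after refresh() is called, so the refresh interval bounds how stale the lookups can be. It does not address recovery by itself: after a restart the cache would simply be reloaded from the source.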

On paper, it seems option B with the Queryable State is the cleanest solution.

Any comments or suggestions are greatly appreciated, in particular in
terms of robustness and consistent recovery.

Thanks much!
