Re: Best practices to maintain reference data for Flink Jobs

2017-05-20 Thread Fabian Hueske
Hi, if you need range queries for the lookups, you can only use Option A (async calls to an external store). Queryable State only supports key lookups but no range queries. Since version 1.2.0, Flink has a dedicated function type for async calls [1]. This might be helpful to implement your

Re: Best practices to maintain reference data for Flink Jobs

2017-05-19 Thread Sand Stone
Also, took a quick read on side input. it's unclear to me how side input could solve this issue better. At a high level, this is what I have in mind: flatmap(byte[] value, Collector<> output) { var iter = someStoreStateObject.seek(akeyprefix); //or

Re: Best practices to maintain reference data for Flink Jobs

2017-05-19 Thread Sand Stone
Thanks Gordon and Fabian. The enriching data is really reference data, e.g. the reverseIP database. It's hard to be keyed as the main data stream as the "ip address" in the event is not a primary key in the main data stream. QueryableState is close, but it does not support range scan as far as I

Re: Best practices to maintain reference data for Flink Jobs

2017-05-19 Thread Fabian Hueske
+1 to what Gordon said. Queryable state is rather meant as an external interface to streaming jobs than for lookups within jobs. Accessing co-located state should give you better performance and is probably easier to implement and maintain. Cheers, Fabian 2017-05-19 7:43 GMT+02:00 Tzu-Li

Re: Best practices to maintain reference data for Flink Jobs

2017-05-18 Thread Tzu-Li (Gordon) Tai
Hi, Can the enriching data be keyed? Or is it something that has to be broadcasted to each operator? Either way, I think Side Inputs (an upcoming feature in the future) is the best fit for this. You can take a look at  https://issues.apache.org/jira/browse/FLINK-6131. Regarding the 3 options

Best practices to maintain reference data for Flink Jobs

2017-05-18 Thread Sand Stone
Hi. Say I have a few reference data sets need to be used for a streaming job. The sizes range between 10M-10GB. The data is not static, will be refreshed at minutes and/or day intervals. With the new advancements in Flink, it seems there are quite a few options. A. Store all the data in an