Ok, thanks for the clarification, Kostas. What about multiple jobs running at the same time?
On Wed, 9 May 2018, 14:39 Kostas Kloudas, <k.klou...@data-artisans.com> wrote:

> Hi Flavio,
>
> Flink has no inherent limitations as far as state size is concerned, apart
> from the fact that the state associated with a *single key*
> (not the total state) should fit in memory. For production use, it is also
> advisable to use the RocksDB state backend, as this will
> allow you to spill to disk when the state grows too large.
>
> As for a recommended DB/NoSQL store, I have no particular recommendation.
> It depends on what you and your team are more familiar with. I suppose
> you are talking about the sink, right? In that case, it also depends on
> what will optimize the batch jobs that will read the updated dataset.
>
> Thanks,
> Kostas
>
> On May 8, 2018, at 10:40 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>
> Thanks! Both solutions are reasonable, but what about the max state size
> (per key)? Is there any suggested database/NoSQL store to use?
>
> On Tue, 8 May 2018, 18:09 TechnoMage, <mla...@technomage.com> wrote:
>
>> If you use a KeyedStream you can group records by key (city) and then use
>> a RichFlatMap to aggregate state in a MapState or ListState per key. You
>> can then have that operator publish the updated results as a new
>> aggregated record, or send them to a database or similar as you see fit.
>>
>> Michael
>>
>> On May 8, 2018, at 4:22 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>> Hi all,
>> I'd like to introduce into our pipeline an efficient way to aggregate
>> incoming data around an entity.
>>
>> Basically, we have new incoming facts that are added to (but potentially
>> also removed from) an entity, by id. For example, when we receive a new
>> name for a city we add this name to the known names of that city id (if
>> the first field of the tuple is ADD; if it is DEL we remove it).
>> At the moment we use a batch job to generate an initial version of the
>> entities, another job that adds facts to this initial version of the
>> entities, and another one that merges the base and the computed data.
>> This is very inefficient in terms of both speed and disk space, because
>> every step requires materializing the data on disk.
>>
>> I was wondering whether Flink could help here or not... there are a
>> couple of requirements that make things very complicated:
>>
>> - states could potentially be large (a lot of data related to an
>> entity). Is there any limitation on the size of the states?
>> - data must be readable by a batch job. If I'm not wrong, this could
>> easily be solved by flushing data periodically to an external sink
>> (like HBase or similar)
>> - how to keep the long-running streaming job up and run a batch job at
>> the same time? Will this be possible after FLIP-6?
>> - how to ingest new data? Do I really need Kafka, or can I just add new
>> datasets to a staging HDFS dir (and move them to another dir once
>> ingested)?
>>
>> Best,
>> Flavio
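
For reference, a minimal sketch of what Michael describes above (a KeyedStream feeding a RichFlatMap that keeps a MapState per entity id, with the RocksDB backend Kostas mentions) might look roughly like the following. The Tuple3<op, entityId, name> input shape, the checkpoint path and all class/field names are illustrative assumptions, not something prescribed in this thread:

import java.util.HashSet;
import java.util.Set;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class EntityAggregationSketch {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // RocksDB backend so that per-key state can spill to disk
    // (needs the flink-statebackend-rocksdb dependency; the path is an assumption)
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

    // (op, entityId, name) facts; in practice these would come from Kafka or a file source
    DataStream<Tuple3<String, String, String>> facts = env.fromElements(
        Tuple3.of("ADD", "city-1", "Roma"),
        Tuple3.of("ADD", "city-1", "Rome"),
        Tuple3.of("DEL", "city-1", "Roma"));

    facts
        .keyBy(1) // group by entity id
        .flatMap(new RichFlatMapFunction<Tuple3<String, String, String>,
                                         Tuple2<String, Set<String>>>() {

          // one MapState per key (entity id), holding the currently known names
          private transient MapState<String, Boolean> names;

          @Override
          public void open(Configuration parameters) {
            names = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("names", String.class, Boolean.class));
          }

          @Override
          public void flatMap(Tuple3<String, String, String> fact,
                              Collector<Tuple2<String, Set<String>>> out) throws Exception {
            if ("ADD".equals(fact.f0)) {
              names.put(fact.f2, Boolean.TRUE);
            } else if ("DEL".equals(fact.f0)) {
              names.remove(fact.f2);
            }
            // emit the updated view of the entity; a sink (HBase or similar)
            // could persist it so that batch jobs can read it
            Set<String> current = new HashSet<>();
            for (String n : names.keys()) {
              current.add(n);
            }
            out.collect(Tuple2.of(fact.f1, current));
          }
        })
        .print();

    env.execute("entity aggregation sketch");
  }
}

Each update emits the current set of names for the entity, so a sink such as HBase (or files written periodically to HDFS) could keep an externally readable copy for the batch jobs, along the lines discussed above.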