Thanks! Both solutions are reasonable, but what about the maximum state size (per
key)? Is there any suggested database/NoSQL store to use?

On Tue, 8 May 2018, 18:09 TechnoMage, <mla...@technomage.com> wrote:

> If you use a KeyedStream you can group records by key (city) and then use
> a RichFlatMap to aggregate state in a MapState or ListState per key.  You
> can then have that operator publish the updated results as a new aggregated
> record, or send it to a database or similar, as you see fit.
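A minimal sketch of the per-key aggregation such an operator would perform, assuming the (ADD|DEL, cityId, name) tuples from the original mail below. A plain HashMap stands in for Flink's keyed MapState here so the sketch runs standalone; in a real RichFlatMapFunction the state would be obtained via getRuntimeContext() in open():

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the per-key state logic a RichFlatMap would hold.
// A HashMap stands in for Flink's MapState/ListState; op, cityId
// and name model the (ADD|DEL, cityId, name) tuples from the mail.
public class CityNameAggregator {
    // keyed state: city id -> currently known names
    private final Map<String, Set<String>> state = new HashMap<>();

    // Returns the updated name set for the key, i.e. the aggregated
    // record the operator would emit (or flush to a sink) downstream.
    public Set<String> apply(String op, String cityId, String name) {
        Set<String> names = state.computeIfAbsent(cityId, k -> new HashSet<>());
        if ("ADD".equals(op)) {
            names.add(name);
        } else if ("DEL".equals(op)) {
            names.remove(name);
        }
        return names;
    }
}
```

In an actual Flink job the map would be keyed state registered against a StateDescriptor, so it is checkpointed and, with the RocksDB state backend, can grow beyond heap size.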
>
> Michael
>
> On May 8, 2018, at 4:22 AM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
> Hi all,
> I'd like to introduce in our pipeline an efficient way to aggregate
> incoming data around an entity.
>
> Basically, we have new incoming facts that are added (but potentially also
> removed) to an entity (by id). For example, when we receive a new name for
> a city we add it to the known names of that city id (if the first field of
> the tuple is ADD; if it is DEL we remove it).
> At the moment we use a batch job to generate an initial version of the
> entities, another job that adds facts to this initial version of the
> entities, and another one that merges the base and the computed data. This
> is quite inefficient in terms of both speed and disk space (because every
> step requires materializing the data on disk).
>
> I was wondering whether Flink could help here or not... there are a couple
> of requirements that make things very complicated:
>
>    - states could be potentially large (a lot of data related to an
>    entity). Is there any limitation about the size of the states?
>    - data must be readable by a batch job. If I'm not wrong this could be
>    easily solved by flushing data periodically to an external sink (like
>    HBase or similar)
>    - how to keep the long-running stream job up and run a batch job at
>    the same time? Will this be possible after FLIP-6?
>    - how to ingest new data? Do I really need Kafka, or can I just add
>    new datasets to a staging HDFS dir (and move them to another dir once
>    ingested)?
>
> Best,
> Flavio
>
>
>
