Ok, thanks for the clarification, Kostas. What about multiple jobs running
at the same time?

On Wed, 9 May 2018, 14:39 Kostas Kloudas, <k.klou...@data-artisans.com>
wrote:

> Hi Flavio,
>
> Flink has no inherent limitations as far as state size is concerned, apart
> from the fact that the state associated with a *single key*
> (not the total state) should fit in memory. For production use, it is also
> advisable to use the RocksDB state backend, as this will
> allow you to spill to disk when the state grows too large.
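>
> For illustration, a minimal sketch of enabling the RocksDB backend (the
> checkpoint URI is just a placeholder, and this assumes the
> flink-statebackend-rocksdb dependency is on the classpath):
>
> import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> public class RocksDBSetup {
>   public static void main(String[] args) throws Exception {
>     StreamExecutionEnvironment env =
>         StreamExecutionEnvironment.getExecutionEnvironment();
>     // Working state can spill to local disk; checkpoints go to the
>     // (placeholder) URI below.
>     env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
>     env.enableCheckpointing(60_000L); // checkpoint every 60 seconds
>     // ... build the pipeline on env, then call env.execute(...)
>   }
> }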
>
> As for a recommended DB/NoSQL store, I have no particular recommendation
> on my part. It depends on what you and your team are
> more familiar with. I suppose you are talking about a sink, right? In that
> case, it also depends on what will best serve the batch jobs
> that will read the updated dataset.
>
> Thanks,
> Kostas
>
> On May 8, 2018, at 10:40 PM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
> Thanks! Both solutions are reasonable, but what about max state size (per
> key)? Is there any suggested database/NoSQL store to use?
>
> On Tue, 8 May 2018, 18:09 TechnoMage, <mla...@technomage.com> wrote:
>
>> If you use a KeyedStream you can group records by key (city) and then use
>> a RichFlatMap to aggregate state in a MapState or ListState per key. You
>> can then have that operator publish the updated results as a new
>> aggregated record, or send it to a database or similar, as you see fit.
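>>
>> A rough sketch of what I mean (the Tuple3<op, cityId, name> layout and all
>> names below are just assumptions based on your city example, not a
>> definitive implementation):
>>
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.flink.api.common.functions.RichFlatMapFunction;
>> import org.apache.flink.api.common.state.MapState;
>> import org.apache.flink.api.common.state.MapStateDescriptor;
>> import org.apache.flink.api.java.tuple.Tuple2;
>> import org.apache.flink.api.java.tuple.Tuple3;
>> import org.apache.flink.configuration.Configuration;
>> import org.apache.flink.util.Collector;
>>
>> // Sketch only: facts are assumed to arrive as Tuple3<op, cityId, name>.
>> public class CityNamesAggregator
>>     extends RichFlatMapFunction<Tuple3<String, String, String>,
>>                                 Tuple2<String, List<String>>> {
>>
>>   // Known names of the current city key, kept in keyed MapState.
>>   private transient MapState<String, Boolean> names;
>>
>>   @Override
>>   public void open(Configuration parameters) {
>>     names = getRuntimeContext().getMapState(
>>         new MapStateDescriptor<>("names", String.class, Boolean.class));
>>   }
>>
>>   @Override
>>   public void flatMap(Tuple3<String, String, String> fact,
>>                       Collector<Tuple2<String, List<String>>> out)
>>       throws Exception {
>>     if ("ADD".equals(fact.f0)) {
>>       names.put(fact.f2, Boolean.TRUE);
>>     } else if ("DEL".equals(fact.f0)) {
>>       names.remove(fact.f2);
>>     }
>>     // Emit the updated aggregate for this city.
>>     List<String> current = new ArrayList<>();
>>     for (String name : names.keys()) {
>>       current.add(name);
>>     }
>>     out.collect(Tuple2.of(fact.f1, current));
>>   }
>> }
>>
>> // Usage: facts.keyBy(1).flatMap(new CityNamesAggregator()).addSink(...);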
>>
>> Michael
>>
>> On May 8, 2018, at 4:22 AM, Flavio Pompermaier <pomperma...@okkam.it>
>> wrote:
>>
>> Hi all,
>> I'd like to introduce in our pipeline an efficient way to aggregate
>> incoming data around an entity.
>>
>> We basically have new incoming facts that are added to (but potentially
>> also removed from) an entity (by id). For example, when we receive a new
>> name for a city we add that name to the known names of that city id (if
>> the first field of the tuple is ADD; if it is DEL we remove it).
>> At the moment we use a batch job to generate an initial version of the
>> entities, another job that adds facts to this initial version, and another
>> one that merges the base and the computed data. This is quite inefficient
>> in terms of both speed and disk space (because every step requires
>> materializing the data on disk).
>>
>> I was wondering whether Flink could help here or not... there are a couple
>> of requirements that make things very complicated:
>>
>>    - states could potentially be large (a lot of data related to an
>>    entity). Is there any limitation on the size of the state?
>>    - data must be readable by a batch job. If I'm not wrong this could
>>    be easily solved by flushing data periodically to an external sink
>>    (like HBase or similar)
>>    - how to keep the long-running streaming job up and run a batch job at
>>    the same time? Will this be possible after FLIP-6?
>>    - how to ingest new data? Do I really need Kafka or can I just add new
>>    datasets to a staging HDFS dir (and move them to another dir once
>>    ingested)? See the sketch after this list for the option I have in mind.
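>>
>> For illustration, this is the kind of directory-monitoring source I have
>> in mind instead of Kafka (the staging path and the text format are
>> placeholders, and this is only a sketch):
>>
>> import org.apache.flink.api.java.io.TextInputFormat;
>> import org.apache.flink.core.fs.Path;
>> import org.apache.flink.streaming.api.datastream.DataStream;
>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>> import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
>>
>> public class StagingDirIngest {
>>   public static void main(String[] args) throws Exception {
>>     StreamExecutionEnvironment env =
>>         StreamExecutionEnvironment.getExecutionEnvironment();
>>     String stagingDir = "hdfs:///staging/facts"; // placeholder path
>>     TextInputFormat format = new TextInputFormat(new Path(stagingDir));
>>     // Re-scan the staging dir every 10s and emit the lines of newly
>>     // appearing files; files modified later are reprocessed in full.
>>     DataStream<String> facts = env.readFile(
>>         format, stagingDir, FileProcessingMode.PROCESS_CONTINUOUSLY, 10_000L);
>>     facts.print(); // replace with the keyed aggregation pipeline
>>     env.execute("staging-dir-ingest");
>>   }
>> }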
>>
>> Best,
>> Flavio
>>
>>
>>
>
