Hi all,
I'd like to introduce an efficient way of aggregating incoming data around
an entity into our pipeline.

Basically, we have new incoming facts that are added to (or potentially
removed from) an entity, identified by id. For example, when we receive a
new name for a city we add it to the known names of that city id (if the
first field of the tuple is ADD; if it is DEL we remove it).
At the moment we use a batch job to generate an initial version of the
entities, another job that adds facts to this initial version, and a third
one that merges the base and the computed data. This is quite inefficient
in terms of both speed and disk space, because every step requires
materializing the data on disk.
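
Something like the following is what I have in mind for the ADD/DEL
handling with Flink keyed state. It's only a rough sketch, assuming a
recent Flink version (MapState needs 1.3+), input tuples shaped as
Tuple3<op, cityId, name>, and made-up names like FactAggregator:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Input tuples are (op, cityId, name); the function runs after keyBy(cityId).
public class FactAggregator
        extends RichFlatMapFunction<Tuple3<String, String, String>, Tuple3<String, String, String>> {

    // Keyed state: the known names of the current city id.
    private transient MapState<String, Boolean> knownNames;

    @Override
    public void open(Configuration parameters) {
        knownNames = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("known-names", String.class, Boolean.class));
    }

    @Override
    public void flatMap(Tuple3<String, String, String> fact,
                        Collector<Tuple3<String, String, String>> out) throws Exception {
        if ("ADD".equals(fact.f0)) {
            knownNames.put(fact.f2, Boolean.TRUE);
        } else if ("DEL".equals(fact.f0)) {
            knownNames.remove(fact.f2);
        }
        // emit the fact (or the updated entity) so it can be flushed to an external store
        out.collect(fact);
    }
}

The function would run after keying the stream by the city id, e.g.
facts.keyBy(1).flatMap(new FactAggregator()).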

I was wondering whether Flink could help here or not... there are a few
requirements that make things quite complicated:

   - state could potentially be large (a lot of data related to a single
   entity). Is there any limitation on state size?
   - data must be readable by a batch job. If I'm not wrong, this could
   easily be solved by flushing data periodically to an external sink (like
   HBase or similar); see the sink sketch below this list
   - how can we keep the long-running streaming job up and run a batch job
   at the same time? Will this be possible after FLIP-6?
   - how should we ingest new data? Do I really need Kafka, or can I just
   add new datasets to a staging HDFS dir (and move them to another dir once
   ingested)? See the file-monitoring sketch below this list.
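
On the second point (flushing to HBase or similar), what I have in mind is
a custom sink roughly like the one below. It's only a skeleton: the actual
HBase calls depend on the schema, and BatchingEntitySink / FLUSH_SIZE are
made-up names:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Buffers entity updates and flushes them in small batches to the external store.
public class BatchingEntitySink extends RichSinkFunction<Tuple3<String, String, String>> {

    private static final int FLUSH_SIZE = 1000;
    private transient List<Tuple3<String, String, String>> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = new ArrayList<>();
        // open the HBase connection/table here
    }

    @Override
    public void invoke(Tuple3<String, String, String> update) {
        buffer.add(update);
        if (buffer.size() >= FLUSH_SIZE) {
            flush();
        }
    }

    @Override
    public void close() {
        flush();
        // close the HBase connection here
    }

    private void flush() {
        // e.g. turn the buffered updates into HBase Puts/Deletes keyed by entity id
        buffer.clear();
    }
}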
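
On the last point (ingesting without Kafka), Flink's continuous file
monitoring might be enough. A minimal sketch, with a made-up staging path:

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class StagingDirIngest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // example staging directory, to be adapted
        String stagingDir = "hdfs:///staging/facts";

        // re-scan the directory every 60 seconds and read any new file as lines
        DataStream<String> rawFacts = env.readFile(
                new TextInputFormat(new Path(stagingDir)),
                stagingDir,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                60_000L);

        // parse the lines into (op, cityId, name) tuples and plug in the keyed aggregation here
        rawFacts.print();

        env.execute("staging-dir-ingest");
    }
}

As far as I understand, PROCESS_CONTINUOUSLY re-reads a file when its
modification time changes, so moving files out of the staging dir once
ingested (as proposed above) should play nicely with it.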

Best,
Flavio
