Hi to all,
we have a particular use case where we have a tabular dataset on HDFS (e.g. a CSV) that we want to enrich, filling some cells with the content returned by a query to an inverted index (e.g. Solr/Elasticsearch). Since we want this process to be resilient and scalable, we thought that Flink streaming could be a good fit: we can control the "pressure" on the index by adding/removing consumers dynamically, and there is automatic error recovery.
Right now we have developed 2 different solutions to the problem:

1. *Move the dataset from HDFS to a queue/topic* (like Kafka or RabbitMQ) and then let the queue consumers do the real job (pull Rows from the queue, enrich them and then persist the enriched Rows). The questions here are:
   1. How to properly manage writing to HDFS? If we read a set of rows, enrich them and need to write the result back to HDFS, is it possible to automatically compact files in order to avoid the "too many small files" problem on HDFS? And how do we avoid file name collisions (i.e. put each batch of rows into a different file)? (See the first sketch in the P.S. below.)
   2. How to control the number of consumers dynamically? Is it possible to change the parallelism once the job has started?
2. In order to avoid useless data transfer from HDFS to the queue/topic (since we don't need all the Row fields to create the query; usually only 2-5 fields are needed), we can create a Flink job that puts the *queries into a queue/topic* and waits for the results. The problem with this approach is:
   1. How to correlate queries with their responses? Creating a single response queue/topic implies that all consumers read all messages (and discard those that are not directed to them), while creating a queue/topic for each sub-task could be expensive (in terms of resources and management; we don't have any evidence/experience of this, it's just a possible problem). (See the second sketch below.)
3. Maybe we can exploit *Flink async I/O* somehow...? But how? (The third sketch below is our current understanding.)

Any suggestions on, or drawbacks of, the 2 approaches?

Thanks in advance,
Flavio
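P.S. To make the questions more concrete, here are three rough sketches of what we currently have in mind. All class names, paths and thresholds are placeholders we made up.

Sketch 1 (question 1.1): as far as we understand, Flink's BucketingSink (from flink-connector-filesystem) appends many rows to a few large, rolling part files instead of writing one file per batch, and the part files are named per sub-task, so parallel writers should never collide:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class EnrichedRowsSink {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Stand-in for the real stream of already-enriched rows, serialized as CSV lines.
    DataStream<String> enrichedRows = env.fromElements("a,b,enriched1", "c,d,enriched2");

    BucketingSink<String> sink = new BucketingSink<>("hdfs:///data/enriched");
    // One bucket (directory) per hour; inside a bucket the part files are named
    // part-<subtaskIndex>-<rollingCounter>, so parallel writers never collide.
    sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HH"));
    sink.setBatchSize(128L * 1024 * 1024);       // roll a new part file at ~128 MB
    sink.setInactiveBucketThreshold(60 * 1000L); // flush buckets idle for 1 minute

    enrichedRows.addSink(sink);
    env.execute("enrich-and-write-back");
  }
}

As far as we can tell this side-steps the small-files problem for the new output, but it doesn't compact files that already exist, so maybe that part still needs a separate job?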
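Sketch 2 (question 2.1): one idea we are evaluating is a single shared response topic plus a correlation id, where each Flink sub-task is manually assigned exactly one partition of the response topic (Kafka allows assigning partitions explicitly), so no consumer has to read and discard messages directed to others. All names here are hypothetical:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Envelope we would put on the query topic (all names hypothetical). */
public class QueryEnvelope {
  public String correlationId;  // e.g. "<subtaskIndex>-<rowOffset>", unique per query
  public int replyToPartition;  // response-topic partition owned by the asking sub-task
  public String[] queryFields;  // only the 2-5 fields needed to build the query

  /** Called by the index-side consumer after running the query: the response is
   *  sent back to exactly the partition the asking sub-task is assigned to, so a
   *  single response topic can be shared without broadcast-and-discard reads. */
  public static void reply(KafkaProducer<String, String> producer,
                           QueryEnvelope query, String enrichedJson) {
    producer.send(new ProducerRecord<>("index-responses",
        query.replyToPartition, query.correlationId, enrichedJson));
  }
}

This would give us one topic regardless of the parallelism, but we have no experience with how well manual partition assignment plays with rescaling.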
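Sketch 3 (point 3): if we read the docs correctly, Flink's async I/O operator would let us keep the rows entirely inside the Flink job, with the operator capacity bounding the number of in-flight index queries per sub-task. The lookup below is only simulated, since we haven't picked an async Solr/Elasticsearch client yet, and this assumes the ResultFuture-based API of newer Flink versions:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncEnrichmentJob {

  /** Enriches one CSV line per call; the index lookup is only simulated here. */
  public static class EnrichFunction extends RichAsyncFunction<String, String> {
    @Override
    public void asyncInvoke(String csvLine, ResultFuture<String> resultFuture) {
      // In the real job this would be a non-blocking Solr/Elasticsearch call
      // built from the 2-5 query fields of the row.
      CompletableFuture
          .supplyAsync(() -> lookupInIndex(csvLine))
          .thenAccept(hit -> resultFuture.complete(
              Collections.singleton(csvLine + "," + hit)));
    }

    private String lookupInIndex(String csvLine) {
      return "enriched-value"; // placeholder for the index response
    }
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<String> rows = env.readTextFile("hdfs:///data/input.csv");

    // At most 100 queries in flight per sub-task: this would be the knob
    // that controls the "pressure" on the index.
    DataStream<String> enriched = AsyncDataStream.unorderedWait(
        rows, new EnrichFunction(), 30, TimeUnit.SECONDS, 100);

    enriched.print();
    env.execute("async-enrichment");
  }
}

If this works, both the queue round-trip and the correlation problem of approach 2 would disappear, which is why we are especially interested in its drawbacks.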