Hi,

   I need a streaming pipeline Kafka -> Storm -> Elasticsearch. The volume of
messages produced to Kafka is in the order of millions, so I need maximum
throughput on the Elasticsearch writes. Each message has an id which is mapped
to an Elasticsearch index. The number of possible message ids is less than 50
(which is therefore the maximum number of ES indices created). I would like to
batch the ES writes so that messages are grouped by index *as much as
possible*.
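
To make the batching part concrete, this is roughly what I have in mind for
the writer bolt: a minimal sketch that buffers documents per target index and
flushes each buffer as one bulk request. EsClients.create is a placeholder
for whatever builds the RestHighLevelClient, the field names are made up, and
the ack handling is deliberately simplified.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

// Sketch of the writer bolt: one buffered BulkRequest per target index.
public class BulkIndexBolt extends BaseRichBolt {
    private static final int BATCH_SIZE = 1000;        // to be tuned for the cluster

    private transient RestHighLevelClient client;
    private transient Map<String, BulkRequest> buffers;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
        this.buffers = new HashMap<>();
        this.client = EsClients.create(conf);           // placeholder client factory
    }

    @Override
    public void execute(Tuple tuple) {
        String index = tuple.getStringByField("id");    // the message id doubles as the index name
        String json = tuple.getStringByField("doc");

        BulkRequest bulk = buffers.computeIfAbsent(index, k -> new BulkRequest());
        bulk.add(new IndexRequest(index).source(json, XContentType.JSON));

        if (bulk.numberOfActions() >= BATCH_SIZE) {
            try {
                client.bulk(bulk, RequestOptions.DEFAULT);   // one bulk call per index
            } catch (IOException e) {
                // simplified: real code would fail/replay the whole batch here
            }
            buffers.remove(index);
        }
        collector.ack(tuple);   // simplified: immediate ack means at-most-once for buffered docs
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt, no output stream
    }
}

The more each task of this bolt receives a mix of ids, the smaller and more
fragmented the per-index batches get, which is why the grouping below matters
to me.
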
The problem is that message counts per id are dynamic, and certain ids can
have very large message inflows compared to others: the largest message id can
have > 10x the inflow of the smallest. Hence, grouping on the ids (fields
grouping) doesn't work here, and partial key grouping also won't work, as I
need more output tasks for the largest message ids than it gives.

e.g., I have 10 tasks that write to ES:


Say all my messages are spread across 2 message ids, ID1 and ID2, which I have
to write to 2 separate indices in ES.

Say at one instant, ID1 has 4 times as many messages as ID2.

So, the best possible assignment would be:
the first 8 tasks write messages with ID1 to ES,
the last 2 tasks write messages with ID2 to ES.


Say at a different instant, ID1 has the same number of messages as ID2.

So, the best possible assignment would be:
the first 5 tasks write messages with ID1 to ES,
the last 5 tasks write messages with ID2 to ES.
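
To make the grouping part concrete, this is the rough direction I am
considering: a sketch of a CustomStreamGrouping that counts recent messages
per id and routes each id to a window of target tasks whose width is
proportional to that id's share of the traffic. The decay policy and the
window placement are placeholders, the windows of different ids can overlap,
and the counts are local to each upstream task, so treat it as a sketch of
the idea rather than a worked-out design.

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.generated.GlobalStreamId;
import org.apache.storm.grouping.CustomStreamGrouping;
import org.apache.storm.task.WorkerTopologyContext;

// Sketch: each id owns a window of tasks sized by its recent traffic share,
// and tuples of the same id are spread round-robin inside that window.
public class ProportionalIdGrouping implements CustomStreamGrouping {

    private List<Integer> targetTasks;
    private final Map<String, Long> counts = new HashMap<>();  // recent volume per id
    private long total = 0;                                    // sum of all counts
    private long rr = 0;                                       // round-robin cursor

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        String id = values.get(0).toString();    // assumes the id is the first tuple field
        counts.merge(id, 1L, Long::sum);
        total++;

        // crude decay so the split follows the current inflow instead of all history
        if (total % 100_000 == 0) {
            counts.replaceAll((k, v) -> v / 2);
            total = counts.values().stream().mapToLong(Long::longValue).sum();
        }

        int n = targetTasks.size();
        double share = (double) counts.get(id) / Math.max(1, total);
        int width = Math.max(1, (int) Math.round(share * n));   // tasks this id is "owed"
        int start = Math.abs(id.hashCode() % n);                // stable window start per id
        int chosen = (start + (int) (rr++ % width)) % n;
        return Collections.singletonList(targetTasks.get(chosen));
    }
}

I would wire it in with something like
builder.setBolt("es-writer", new BulkIndexBolt(), 10)
       .customGrouping("parser", new ProportionalIdGrouping());
(component names made up), but given the local-counts problem above I am
wondering whether there is a better or more standard way to get this kind of
dynamic, volume-aware grouping.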


This grouping is just an optimization, not a hard requirement. What is the
best way to group messages *dynamically* when the input streams have hugely
varying per-id message counts? Also, I have control over how the message ids
are created, if that helps the data distribution.
