Hi, I need to build a streaming pipeline Kafka -> Storm -> Elasticsearch. The volume of messages produced to Kafka is on the order of millions, so I need maximum throughput on the Elasticsearch writes. Each message has an id which maps to an Elasticsearch index. There are fewer than 50 possible message ids (so at most 50 ES indices get created). I would like to batch the ES writes so that messages are grouped by index *as much as possible*. The problem is that message counts per id are dynamic, and certain ids can have very large inflows compared to the others: the largest message id can see more than 10x the inflow of the smallest. Hence, fields grouping on ids doesn't work here (the largest ids would overload a single task). Partial key grouping also won't work, as I need more than two output streams for the largest message ids.
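For context, the per-task write path I have in mind is simply one bulk request per index per flush, so the better the upstream grouping, the fewer and fatter the bulks. A minimal sketch below, assuming the Elasticsearch high-level REST client; the class name and the `buffered` map (index name -> pending JSON docs) are placeholders for illustration:

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

class EsIndexBatcher {
    // One bulk request per index per flush: a task that receives mostly
    // one id ends up sending one large bulk instead of many small ones.
    static void flush(RestHighLevelClient client,
                      Map<String, List<String>> buffered) throws IOException {
        for (Map.Entry<String, List<String>> entry : buffered.entrySet()) {
            BulkRequest bulk = new BulkRequest();
            for (String json : entry.getValue()) {
                bulk.add(new IndexRequest(entry.getKey()) // index derived from message id
                        .source(json, XContentType.JSON));
            }
            client.bulk(bulk, RequestOptions.DEFAULT);
        }
    }
}
```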
E.g., I have 10 tasks that write to ES. Say all my messages are spread across 2 message ids, ID1 and ID2, which I have to write to 2 separate indices in ES.

Say at one instant ID1 has 4 times more messages than ID2. The best possible assignment would be:

- the first 8 tasks write messages with ID1 to ES
- the last 2 tasks write messages with ID2 to ES

Say at a different instant ID1 has the same number of messages as ID2. Then the best possible assignment would be:

- the first 5 tasks write messages with ID1 to ES
- the last 5 tasks write messages with ID2 to ES

This grouping is an optimization, not a hard requirement. What is the best way to group messages *dynamically* across input streams whose message counts vary hugely? Also, I have control over how message ids are created, if that helps the data distribution.
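The direction I've been considering is a custom grouping that re-weights each id's slice of tasks from observed frequencies. A rough sketch below; the class name is mine, and the non-decaying counters are a simplification (a real version would want a sliding window so old traffic stops influencing the split):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.storm.generated.GlobalStreamId;
import org.apache.storm.grouping.CustomStreamGrouping;
import org.apache.storm.task.WorkerTopologyContext;

// Routes each message id to a slice of the target tasks whose size is
// proportional to that id's observed share of traffic, and round-robins
// inside the slice. Assumes the id is the first field of the tuple.
public class FrequencyWeightedGrouping implements CustomStreamGrouping {

    private List<Integer> targetTasks;
    private final Map<Object, AtomicLong> counts = new ConcurrentHashMap<>();
    private final AtomicLong total = new AtomicLong();
    private final AtomicLong rr = new AtomicLong(); // round-robin cursor

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        Object id = values.get(0);
        long idCount = counts.computeIfAbsent(id, k -> new AtomicLong()).incrementAndGet();
        long seen = total.incrementAndGet();

        // Slice size proportional to this id's share of traffic,
        // clamped to [1, number of tasks].
        int slice = (int) Math.max(1,
                Math.min(targetTasks.size(), (idCount * targetTasks.size()) / seen));

        // Anchor the slice by the id's hash so the same id keeps hitting
        // the same tasks, which keeps per-index batches large.
        int start = Math.floorMod(id.hashCode(), targetTasks.size());
        int offset = (int) (rr.getAndIncrement() % slice);
        return Collections.singletonList(
                targetTasks.get((start + offset) % targetTasks.size()));
    }
}
```

It would be wired in with `builder.setBolt("es-writer", new EsWriterBolt(), 10).customGrouping("kafka-spout", new FrequencyWeightedGrouping())` (component names hypothetical). One caveat I already see: slices of different ids can overlap, so a task may still receive a mix of indices; it only bounds the mixing rather than eliminating it.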