I am in the process of optimizing my stream. Currently I expect 5,000,000 tuples per minute to come out of my spout. I am trying to beef up my topology so that it can process this in real time without falling behind.
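To put numbers on the gap, here is a rough back-of-the-envelope sketch in plain Java (no Storm dependency). The class and method names are made up for illustration. It assumes one batch completes at a time, which matches maxSpoutPending = 1, and it uses the best case from the measurements below (about 83K tuples in 3 seconds), which works out to roughly 27,700 tuples per second.

```java
public class ThroughputCheck {

    // Tuples per second needed to keep up with a given per-minute input rate.
    static double requiredPerSecond(long tuplesPerMinute) {
        return tuplesPerMinute / 60.0;
    }

    // Effective throughput if one batch of batchTuples finishes every batchSeconds.
    // Assumption: only one batch is in flight at a time (maxSpoutPending = 1).
    static double achievedPerSecond(long batchTuples, double batchSeconds) {
        return batchTuples / batchSeconds;
    }

    public static void main(String[] args) {
        // 5,000,000 tuples per minute is the expected spout output.
        double required = requiredPerSecond(5_000_000L);

        // 83K tuples per batch at 3 seconds per batch is the best measured case.
        double achieved = achievedPerSecond(83_000L, 3.0);

        System.out.printf("required: %.0f tuples/s%n", required);
        System.out.printf("achieved: %.0f tuples/s%n", achieved);
        System.out.printf("shortfall factor: %.1f%n", required / achieved);
    }
}
```

Under those assumptions, either the batches need to get several times larger or the per-batch time needs to drop well below one second's worth of input.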
For some reason my batch size is capping out at 83 thousand tuples, and I can't seem to make it any bigger. The processing time doesn't seem to get any lower than 2-3 seconds either. I'm not sure how to configure the topology to make it any faster or more efficient. Currently all the topology does is a groupBy on time and an aggregation (Count) to aggregate everything.

Here are some data points I've collected. Each row is batch size, num-workers, parallelismHint (so "5mb, 1, 2" means batch size 5mb, num-workers 1, parallelismHint 2), followed by tuples per batch and processing time per batch:

5mb, 1, 2 = 83K tuples / 6s
10mb, 1, 2 = 83K tuples / 7s
5mb, 1, 4 = 83K tuples / 6s
5mb, 2, 4 = 83K tuples / 3s
5mb, 3, 6 = 83K tuples / 3s
10mb, 3, 6 = 83K tuples / 3s

Can anybody help me figure out how to get it to process things faster?

My maxSpoutPending is at 1, but when I increased it to 2 the results were the same. MessageTimeoutSec = 100.

I've been following this blog post to an extent, though not word for word: https://gist.github.com/mrflip/5958028

I need to be able to process around 66,000 tuples per second and I'm starting to run out of ideas.

Thanks
--
Raphael Hsieh