7 million/minute with 200 instances, implies a latency of 200*1000/(7*1000^2/60) ~ 1.71 ms for bolt A. For throughput calculations, I would normalize it and visualize a single instance of the bolt running. So for 1 both instance, the latency would be 1.71/200.
By adding bolt B with 50 instances, the throughput came to 900000/min, which means 900000= 1000*60 / ((1.71/200)+(B/50)), solving this B would be around 2.9 ms. Even though you mentioned there is nothing in the Bolt B impl, the overall latency is 2.9ms, this means there is a lot of overhead in the network. Are you using fields grouping or because of the number of instances the tuples are sent over the network. Try using local grouping. Bumping bolt B to 100 instances, you got 1.1 million/minute. But this equation should give us: 1000*60 / ((1.71/200)+(2.9/100)) around 1.6 million/min. This means adding more instances added more overhead, could be that more number of tuples are going over the network (because they went in to different worker processes). Calculating the bolt b latency for 1.1 million gives = 4.6 ms of overhead for Bolt B. Adding more instances can be costly because of the IPC. The key would be to get more local shuffling. Check if you have any failures in the overall system and also check the Storm UI which will show the bottleneck in your topology (the capacity of the bolts and the latencies etc) On Fri, Aug 14, 2015 at 2:43 PM Javier Gonzalez <[email protected]> wrote: > Do you actually have 170 machines? Try sticking to one worker per machine > (tweak memory parameters in storm.yaml), makes inter bolt traffic much > faster. > On Aug 14, 2015 5:28 PM, "John Yost" <[email protected]> wrote: > >> Hey Javier, >> >> Cool, thanks for your response! I have 50 workers for 200 Bolt A/5 Bolt >> B and 120 workers for 400 Bolt A/100 Bolt B (this latter config is optimal, >> but cluster resources make it tricky to actually launch this). >> >> I will up the number of Ackers and see if that helps. If not, then I will >> try to vary the number of B bolts beyond 100. >> >> Thanks Again! >> >> --John >> >> On Fri, Aug 14, 2015 at 2:59 PM, Javier Gonzalez <[email protected]> >> wrote: >> >>> You will have a detrimental effect to wiring in boltB, even if it does >>> nothing but ack. Every tuple you have processed from A has to travel to a B >>> bolt, and the ack has to travel back. >>> >>> You could try modifying the number of ackers, and playing with the >>> number of A and B bolts. How many workers do you have for the topology? >>> >>> Regards, >>> JG >>> On Aug 14, 2015 12:31 PM, "John Yost" <[email protected]> wrote: >>> >>>> Hi Everyone, >>>> >>>> I have a topology where a highly CPU-intensive bolt (Bolt A) requires a >>>> much higher degree of parallelism than the bolt it emits tuples to (Bolt B) >>>> (200 Bolt A executors vs <= 100 Bolt B executors). >>>> >>>> I find that the throughput, as measured in number of tuples acked, goes >>>> from 7 million/minute to ~ 1 million/minute when I wire in Bolt B--even if >>>> all of the logic within the Bolt B execute method is disabled and the Bolt >>>> B is therefore simply acking the input tuples from Bolt A. In addition, I >>>> find that, going from 50 to 100 Bolt B executors causes the throughput to >>>> go from 900K/minute to ~ 1.1 million/minute. >>>> >>>> Is the fact that I am going from 200 bolt instances to 100 or less the >>>> problem? I've already experimented with executor.send.buffer.size and >>>> executor.receive.buffer.size, which helped drive throughput from 800K to >>>> 900K. I will try topology.transfer.buffer.size, perhaps set that higher to >>>> 2048. Any other ideas? >>>> >>>> Thanks >>>> >>>> --John >>>> >>>> >>
