I'm writing a fairly basic Trident topology as follows:

- 4 spouts of events
- merged into one stream
- serializes each object as an event string
- saves to the DB
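For reference, this is roughly the shape of the topology against the Trident Java API (the spout, function, and state-factory class names here are placeholders, not my actual classes):

```java
// Sketch of the topology shape; EventSpout, SerializeEvent, DbStateFactory,
// and DbUpdater stand in for my real implementations.
TridentTopology topology = new TridentTopology();

Stream s1 = topology.newStream("events-1", new EventSpout());
Stream s2 = topology.newStream("events-2", new EventSpout());
Stream s3 = topology.newStream("events-3", new EventSpout());
Stream s4 = topology.newStream("events-4", new EventSpout());

topology.merge(s1, s2, s3, s4)
        // CPU-intensive serialization, split out of the spout
        .each(new Fields("event"), new SerializeEvent(), new Fields("payload"))
        .shuffle()
        // persist the serialized string to the DB
        .partitionPersist(new DbStateFactory(), new Fields("payload"),
                          new DbUpdater());
```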
I split the serialization task out of the spout to speed things up, since it was CPU-intensive. The problem I have is that after 10 minutes there are over 910k tuples emitted/transferred but only 193k records saved. The overall load on the topology seems fine:

- 536.404 ms complete latency at the topology level
- the highest capacity of any bolt is 0.3, and that's the serialization one
- each bolt task has sub-20 ms execute latency and sub-40 ms process latency

So it seems Trident is holding the records internally, but I need these events as close to real time as possible. Does anyone have any guidance on how to increase the throughput? Is it simply a matter of tweaking max spout pending and the batch size?

I'm running it on two m1.smalls for now; I don't see the need to upgrade until the demand on the boxes is higher. That said, CPU usage on the Nimbus box is pinned at around 99%, even when all the topologies are killed. Why would that be?

We're currently targeting 200 million records per day, which seems like it should be quite easy based on what I've read others have achieved. I realize better hardware would help too, but my first goal is to get Trident to push the records to the DB quicker.

Thanks in advance,
Sean
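In case it helps to be concrete, these are the knobs I'm assuming I'd be tweaking, as a storm.yaml-style fragment (the values are guesses on my part, not recommendations I've found anywhere):

```yaml
# Settings I'm considering tweaking (values are guesses):
topology.max.spout.pending: 10                     # max in-flight batches per spout
topology.trident.batch.emit.interval.millis: 500   # how often Trident emits a new batch
```

My understanding is that max spout pending caps how many batches are in flight at once, and the batch emit interval (together with how much the spout emits per batch) effectively sets the batch size, but I'd welcome corrections on that.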
