Hi,

Find a value that suits your needs, but don't hardcode it: put it in a property file so you can change it easily. The setting does not automatically adjust to changes downstream; it merely limits the number of tuples currently "in transit". Without it you are practically guaranteed to hit an OOM, since Kafka will always be faster than your database.
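A minimal sketch of the property-file approach, assuming a hypothetical property name `topology.max.spout.pending` and file `topology.properties` (both names are illustrative, not Storm defaults):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class SpoutPendingConfig {
    static final int DEFAULT_MAX_SPOUT_PENDING = 1000; // fallback if the property is absent

    // Reads the hypothetical property "topology.max.spout.pending" from
    // already-loaded Properties, falling back to a default value.
    static int maxSpoutPending(Properties props) {
        return Integer.parseInt(
                props.getProperty("topology.max.spout.pending",
                                  String.valueOf(DEFAULT_MAX_SPOUT_PENDING)));
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // In a real deployment you would load from a file instead:
        //   props.load(new FileInputStream("topology.properties"));
        props.load(new StringReader("topology.max.spout.pending=2000\n"));

        int pending = maxSpoutPending(props);
        System.out.println(pending); // prints 2000

        // With storm-core on the classpath you would then apply it:
        //   backtype.storm.Config conf = new backtype.storm.Config();
        //   conf.setMaxSpoutPending(pending);
        // which is equivalent to conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, pending).
    }
}
```

The Storm-specific calls are kept in comments so the sketch stays self-contained; only the property-loading part is shown live.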
"Without controlling our ingest rate, it's possible to swamp our topology so that it collapses under the weight of incoming data. Max spout pending lets us erect a dam in front of our topology, apply back pressure, and avoid being overwhelmed. We recommend that, despite the optional nature of max spout pending, you always set it." (see "Storm Applied: Strategies for real-time event processing", chapter 6)

Regards,
Tom

From: Jakes John [mailto:[email protected]]
Sent: Wednesday, 2 September 2015 17:22
To: [email protected]
Subject: Re: Kafka Spout rate

So, you are asking me to hardcode a value of, say, 1000. That is still a fixed value, right? How does it automatically adjust to the database flush rate? If the backend systems get slow, how does the topology automatically adjust and throttle the rate? I once saw a case where the actual error was in the database writes, but the Storm topology stalled because of an OOM exception. I feel this should be a common problem for everyone. Also, how can I view the number of pending spout messages at any given time?

Thanks,
Johnu

On Wed, Sep 2, 2015 at 12:05 AM, Ziemer, Tom <[email protected]> wrote:

Hi,

have a look at Config.TOPOLOGY_MAX_SPOUT_PENDING. This should take care of the OOM if set to a prudent value, since it determines "The maximum number of tuples that can be pending on a spout task at any given time". (https://nathanmarz.github.io/storm/doc/backtype/storm/Config.html)

Regards,
Tom

From: Jakes John [mailto:[email protected]]
Sent: Wednesday, 2 September 2015 07:57
To: [email protected]
Subject: Kafka Spout rate

Hey,
I have a 5-node Storm cluster. I have deployed a topology with a Kafka spout that reads from a Kafka cluster and a bolt that writes to a database. When I tested the Java Kafka consumer independently, I got a throughput of around 1M messages per second. When I tested my database independently, I got a maximum throughput of around 100k messages per second.
Since my database is much slower at consuming messages, I need to reduce the intake of messages by the Kafka spout. Adding more parallelism to the DB bolt doesn't help, as I have already reached the database's maximum throughput. Periodically I see an "Out of memory" exception in the Kafka spout and processing stops.

1. How can I reduce the rate at which the Kafka spout reads messages? I assume the reason for the OOM exceptions is that the Kafka spout reads messages from Kafka faster than the DB bolt can flush them to the database. Is that the reason? I tried playing around with a reduced fetch size, but it didn't help.
2. Suppose my DB bolt is somehow able to flush all messages to the database at the same rate as the Kafka spout reads them, but the database gets slow in the future. Will the message intake rate be reduced dynamically to ensure that the OOM exception doesn't happen? How can I proactively take measures?
3. What is the best way to tune my system parameters? Also, how do I test the performance (throughput) of my Storm topology?

I would like to see how the current Storm community deals with my problem.

Thanks for your time
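The advice in the replies above comes down to bounding the number of in-flight tuples. The failure mode in the question (fast producer, slow consumer, unbounded buffering) can be illustrated with a plain-Java sketch that stands in for max spout pending: a bounded queue makes the fast producer block instead of accumulating messages in memory. This is an analogy, not Storm's actual implementation.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class BackpressureDemo {
    // Pushes `total` messages through a queue bounded at `maxPending`
    // and returns how many the consumer processed. Memory use stays
    // proportional to maxPending no matter how fast the producer is.
    static int run(int maxPending, int total) throws InterruptedException {
        BlockingQueue<Integer> inFlight = new ArrayBlockingQueue<>(maxPending);
        AtomicInteger consumed = new AtomicInteger();

        Thread producer = new Thread(() -> {
            for (int i = 0; i < total; i++) {
                try {
                    inFlight.put(i); // blocks when maxPending tuples are in flight
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Thread consumer = new Thread(() -> {
            for (int i = 0; i < total; i++) {
                try {
                    inFlight.take(); // stand-in for a slow database write
                    consumed.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return consumed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Every message still gets through; the producer is simply throttled.
        System.out.println(run(1000, 100_000)); // prints 100000
    }
}
```

With an unbounded queue the producer would never block and pending messages would grow until the heap is exhausted, which is the OOM described in the question; the bound plays the role of TOPOLOGY_MAX_SPOUT_PENDING.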
