Hi,

We are upgrading our Storm production cluster from 1.2.1 to 2.2.0, and we are
observing that the performance of our topologies has degraded compared to the
1.2.1 cluster.

Just to give some insight into the performance impact: the processing QPS per
worker used to be 400 and is now below 100 on the 2.2.0 cluster. The topology
code hasn't been changed, except for some type-casting changes (the code is in
Scala and the build was failing without them). The storm.yaml config is also
mostly the same as that of the 1.2.1 cluster; we have only updated the
following (a rough sketch of both settings follows the list):

   - topology.spout.wait.strategy: from *SleepSpoutWaitStrategy* to
   *WaitStrategyProgressive*
   - Added *topology.disable.loadaware.messaging: true* (based on this
   Stack Overflow post:
   <https://stackoverflow.com/questions/64689408/apache-strom-upgrade-from-1-0-3-to-2-2-0-and-not-all-workers-are-used>)
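
For reference, here is roughly how those two settings look when set
programmatically on the topology Config (we actually set them cluster-wide in
storm.yaml; the fully-qualified class name for WaitStrategyProgressive is our
assumption, and the snippet is illustrative only):

   import org.apache.storm.Config

   // Sketch of the two settings we changed; the keys match our storm.yaml
   // entries, and we assume the wait-strategy class lives under
   // org.apache.storm.policy in Storm 2.x.
   val conf = new Config()
   conf.put("topology.spout.wait.strategy",
     "org.apache.storm.policy.WaitStrategyProgressive")
   conf.put("topology.disable.loadaware.messaging", Boolean.box(true))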

The parallelism hints and the number of workers for the different bolts are
also the same as they were on the 1.2.1 cluster, and the machines the workers
run on are the same as well (in terms of cores, memory, etc.).

We've spent the whole week trying to figure this out. Most recently, we tried
tweaking the batch configs (topology.transfer.batch.size and
topology.producer.batch.size), but with those changes the topologies get stuck
(a sketch of what we tried is below). We will spend more time on this, but
ideally the topology should scale as well as it did on the 1.2.1 cluster, so
we need some other direction here. Please suggest potential issues you've
observed or other configs we can tweak.
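
For reference, this is roughly what we tried for the batch settings (shown per
topology for illustration; the values here are examples rather than our exact
production numbers, and as far as we understand both keys default to 1, i.e.
no batching, in Storm 2.x):

   import org.apache.storm.Config

   // Illustrative values only; we experimented with a few different sizes.
   val conf = new Config()
   // batching on the worker-to-worker transfer path (as we understand it)
   conf.put("topology.transfer.batch.size", Int.box(10))
   // batching when publishing into the internal executor queues
   conf.put("topology.producer.batch.size", Int.box(10))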

Some of our observations:

   - Even after increasing the parallelism hints for bolts/spouts, the
   overall QPS remains the same. With 3 workers (and reduced parallelism) we
   get ~400 QPS per slot, i.e. ~1200 QPS overall; with 9 workers (and
   increased parallelism) we only get ~140 QPS per slot, which again comes to
   roughly ~1200 overall.
   - We have checked the logs (application-generated, nimbus, supervisor,
   and worker) and don't see any issues there. Let us know if there are
   particular logs we should look at.
   - The load average has increased on the new cluster: it used to vary
   between 1 and 2.5 per machine, and now it varies between 1 and 10 per
   machine. Given that we have 12-core machines, we figured this shouldn't be
   an issue.


Thanks

-- 
Regards,
Shubham Kumar Jha
