Hi,

We are upgrading our Storm production cluster from 1.2.1 to 2.2.0, and we are observing that the performance of our topologies has gotten worse compared to the 1.2.1 cluster.
Just to give some insight into the performance impact: the processing QPS per worker used to be ~400 and is now below 100 on the 2.2.0 cluster. The topology code hasn't been changed, except for some type-casting changes (the code is in Scala and the build was failing without them). The storm.yaml config is also mostly the same as on the 1.2.1 cluster; we have only updated the following:

- topology.spout.wait.strategy: from *SleepSpoutWaitStrategy* to *WaitStrategyProgressive*
- Added *topology.disable.loadaware.messaging: true* (from this Stack Overflow post: <https://stackoverflow.com/questions/64689408/apache-strom-upgrade-from-1-0-3-to-2-2-0-and-not-all-workers-are-used>)

The parallelism hints and the number of workers for the different bolts are also the same as they were on the 1.2.1 cluster, and the machines the workers run on are identical (cores, memory, etc.).

We have spent the whole week trying to figure this out. Finally, we tried tweaking the batch configs (topology.transfer.batch.size and topology.producer.batch.size), but with those the topologies get stuck. We will spend more time on this, but ideally the topologies should scale as well as they did on the 1.2.1 cluster, so we need some other direction here. Please suggest potential issues you have observed or other configs we can tweak (a sketch of our current overrides is at the end of this mail).

Some of our observations:

- Even after increasing the parallelism hints for bolts/spouts, the overall QPS remains the same. With 3 workers (and reduced parallelism) we get ~400 qps / slot => ~1200 qps overall, but with 9 workers (increased parallelism) we only reach ~140 qps / slot => again ~1200 qps overall.
- We have checked the logs (application-generated, nimbus, supervisor and worker) and don't see any issues there. Let us know if there are particular logs we should look for.
- The load average has increased on the new cluster. Earlier it used to vary from 1 to 2.5 per machine; now it varies from 1 to 10 per machine. Given that these are 12-core machines, we figured it shouldn't be an issue.

Thanks

--
Regards,
Shubham Kumar Jha
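P.S. For reference, here is a sketch of the storm.yaml overrides on the 2.2.0 cluster that differ from our 1.2.1 config. The numeric batch values are illustrative only (the topologies get stuck once we enable batching, so they are shown commented out), and the fully qualified class name is what we believe is correct for 2.2.0:

  # changed from SleepSpoutWaitStrategy (our 1.2.1 setting)
  topology.spout.wait.strategy: "org.apache.storm.policy.WaitStrategyProgressive"
  # per the Stack Overflow post linked above
  topology.disable.loadaware.messaging: true
  # batch configs we have been tweaking (example values only; topologies get stuck with these enabled)
  # topology.producer.batch.size: 100
  # topology.transfer.batch.size: 100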
