Adding to the information above: we are using KafkaSpout as the input source,
and one observation we have made is that the consumption rate differs across
partitions. Usually one or two partitions are consumed very slowly.

We currently suspect this has something to do with the KafkaSpout. Let us
know if you have made similar observations, and how you fixed them.
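
In case the spout wiring matters, ours looks roughly like the sketch below
(broker, topic, and tuning values are placeholders, not our production
settings):

    import org.apache.kafka.clients.consumer.ConsumerConfig
    import org.apache.storm.kafka.spout.{KafkaSpout, KafkaSpoutConfig}

    // Placeholder broker list and topic name.
    val spoutConfig = KafkaSpoutConfig
      .builder("broker1:9092,broker2:9092", "events-topic")
      // Upper bound on records returned by one poll across the assigned partitions.
      .setProp(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500")
      // Cap on emitted-but-not-yet-committed offsets per partition.
      .setMaxUncommittedOffsets(10000)
      // How often consumed offsets are committed back to Kafka.
      .setOffsetCommitPeriodMs(30000L)
      .build()

    val kafkaSpout = new KafkaSpout[String, String](spoutConfig)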

Thanks

On Sat, Nov 19, 2022 at 11:38 AM Shubham Kumar Jha <[email protected]>
wrote:

> Hi,
>
> We are upgrading our Storm production cluster from 1.2.1 to 2.2.0, and we
> are observing that the performance of our topologies has gotten worse
> compared to what it was on the 1.2.1 cluster.
>
> To give a sense of the performance impact: the processing QPS per worker
> used to be 400, and now it is below 100 on the 2.2.0 cluster. The topology
> code hasn't been changed, except for some type-casting changes (the code is
> in Scala and the build was failing). The storm.yaml config is also mostly
> the same as that of the 1.2.1 cluster; we have only updated the following
> (the equivalent per-topology settings are sketched after the list):
>
>    - topology.spout.wait.strategy: from *SleepSpoutWaitStrategy* to
>    *WaitStrategyProgressive*
>    - Added *topology.disable.loadaware.messaging: true* (from this
>    stackoverflow post:
>    <https://stackoverflow.com/questions/64689408/apache-strom-upgrade-from-1-0-3-to-2-2-0-and-not-all-workers-are-used>)
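>
> Roughly, those per-topology equivalents would be the following (a sketch
> using the standard Config keys; illustrative, not our exact setup):
>
>     import org.apache.storm.Config
>
>     val conf = new Config()
>     // Same effect as the storm.yaml entries above, scoped to one topology.
>     conf.put(Config.TOPOLOGY_SPOUT_WAIT_STRATEGY,
>       "org.apache.storm.policy.WaitStrategyProgressive")
>     conf.put(Config.TOPOLOGY_DISABLE_LOADAWARE_MESSAGING, true)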
>
> The parallelism hints and the number of workers for the different bolts are
> also the same as they were on the 1.2.1 cluster, and the machines the
> workers run on are the same (in terms of cores, memory, etc.).
>
> We've spent the whole week trying to figure this out. Finally, we tried
> tweaking the batch configs (topology.transfer.batch.size
> and topology.producer.batch.size), but with those changes the topologies get
> stuck. We will spend more time on this, but ideally a topology should scale
> as well as it did on the 1.2.1 cluster, so we need some other direction
> here. Please suggest potential issues that you've observed or other configs
> that we can tweak.
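>
> Concretely, the tweak we tried looks roughly like this (the batch-size
> values shown are placeholders, not a recommendation):
>
>     import org.apache.storm.Config
>
>     val conf = new Config()
>     // Batch-size knobs we experimented with; values here are illustrative.
>     conf.put(Config.TOPOLOGY_TRANSFER_BATCH_SIZE, 10)
>     conf.put(Config.TOPOLOGY_PRODUCER_BATCH_SIZE, 10)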
>
> Some of our observations:
>
>    - Even after increasing the parallelism hints for bolts/spouts, the
>    overall QPS remains the same. With 3 workers (and reduced parallelism) we
>    get ~400 qps per slot, i.e. ~1200 qps overall; with 9 workers (increased
>    parallelism) we only reach ~140 qps per bolt, which again comes to ~1200
>    overall.
>    - We have checked the logs (application-generated, nimbus, supervisor,
>    and worker) and don't see any issues there. Let us know if there are
>    particular logs we should look at.
>    - The load average has increased on the new cluster: earlier it varied
>    between 1 and 2.5 per machine, now it varies between 1 and 10 per
>    machine. Given that we have 12-core machines, we figured this shouldn't
>    be an issue.
>
>
> Thanks
>
> --
> Regards,
> Shubham Kumar Jha
>


-- 
Regards,
Shubham Kumar Jha
