Hi Isha,

Without a deeper understanding of the situation I cannot be sure the load comes entirely from this part of the batch processing, but for the scope of this discussion I will assume it does, and also that it contrasts sharply with the same measurements taken with DistributedMapCache as the cache.

The EmbeddedHazelcastCacheManager was primarily added as an out-of-the-box solution for simpler scenarios, something that can be "dragged to the canvas" without much fuss. Because of this it has very limited customisation capabilities. As your scenario looks to use Hazelcast heavily, it might not be the ideal tool. It is also important to mention that with the embedded approach the Hazelcast instances run on the same servers as NiFi, so they add to the load already produced by other parts of the flow.

Using ExternalHazelcastCacheManager can provide much more flexibility: as it works with standalone Hazelcast instances, this approach opens up the whole range of Hazelcast's performance optimization capabilities. You can use either a single instance reached by all the NiFi nodes (which involves no synchronization between Hazelcast members but might become a bottleneck at some point) or even build up a separate cluster. Of course, the results depend heavily on network topology and other factors specific to your use case.
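To give a rough idea of what the standalone setup involves, here is a sketch only; the class name, cluster name and addresses are placeholders, and I am assuming the Hazelcast 4.x API that the 4.2.5 entries in your logs point to. A dedicated member could be started outside NiFi roughly like this, and the ExternalHazelcastCacheManager would then be pointed at those addresses:

import com.hazelcast.config.Config;
import com.hazelcast.config.JoinConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class StandaloneCacheMember {
    public static void main(String[] args) {
        Config config = new Config();
        // Cluster name the NiFi side has to match (placeholder value).
        config.setClusterName("nifi-cache");

        // Explicit TCP/IP join between the standalone members; multicast disabled.
        JoinConfig join = config.getNetworkConfig().getJoin();
        join.getMulticastConfig().setEnabled(false);
        join.getTcpIpConfig()
            .setEnabled(true)
            .addMember("10.0.0.11")   // placeholder addresses of the dedicated cache VMs
            .addMember("10.0.0.12");

        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
        System.out.println("Cache member started: " + member.getCluster().getLocalMember());
    }
}

With a single member you avoid Hazelcast-to-Hazelcast synchronization entirely but keep a single point of failure; with two or more members you regain redundancy at the price of partition replication traffic between them, which then happens on dedicated machines instead of the NiFi nodes.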
I also do not know the details of your flows, or whether you prefer processing time over throughput, but another possible optimization is to distribute the batch over time, resulting in smaller peaks.

Best regards,
Bence

> On 2023. Feb 21., at 21:45, Isha Lamboo <[email protected]> wrote:
> 
> Hi Simon,
> 
> The Hazelcast cache is being used by a DetectDuplicate processor to cache and
> eliminate message ids. These arrive in large daily batches with 300-500k
> messages, most (90+%) of which are actually duplicates. This was previously
> done with a DistributedMapCacheServer, but that involved using only one of
> the nodes (hardcoded in the MapCacheClient controller), giving us a single
> point of failure for the flow. We had hoped to use Hazelcast to have a
> redundant cacheserver, but I’m starting to think that this scenario causes
> too many concurrent updates of the cache, on top of the already heavy load
> from other processing on the batch.
> 
> What was new to me is the CPU load on the cluster in question going through
> the roof, on all 3 nodes. I have no idea how a 16 vCPU server gets to a load
> of 100+.
> 
> The start roughly coincides with the arrival of the daily batch, though there
> may have been other batch processes going on since it’s a Sunday. However,
> the queues were pretty much empty again in an hour and yet the craziness kept
> going until I finally decided to restart all nodes.
> <image001.png>
> 
> The hazelcast troubles might well be a side-effect of the NiFi servers being
> overloaded. There could have been issues at the Azure VM level etc. But
> activating the Hazelcast controller is the only change I *know* about. And it
> doesn’t seem farfetched that it got into a loop trying to migrate/copy
> partitions “lost” on other nodes.
> 
> I’ve attached a file with selected hazelcast warnings and errors from the
> nifi-app.log files, trying to include as many unique ones as possible.
> 
> The errors that kept repeating were these (always together):
> 
> 2023-02-19 08:58:39,899Z (UTC+0) ERROR
> [hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
> c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
> [nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
> New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
> newState=FROZEN}, current state: ACTIVE
> 2023-02-19 08:58:39,900Z (UTC+0) WARN
> [hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
> c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701
> [nifi] [4.2.5] While changing cluster state to FROZEN!
> java.lang.IllegalStateException: Still have pending migration tasks, cannot
> lock cluster state! New state: ClusterStateChange{type=class
> com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
> 
> Thanks,
> 
> Isha
> 
> From: Simon Bence <[email protected]>
> Sent: Tuesday, 21 February 2023 08:52
> To: [email protected]
> Subject: Re: Embedded Hazelcast Cachemanager
> 
> Hi Isha,
> 
> Could you please share the error messages? It might shed light on something
> that might affect the performance.
> 
> On the other hand, I am not aware of exhaustive performance tests for the
> Hazelcast Cache. In general it should not be the bottleneck, but if you could
> please give some details about the error and possibly the intended way of
> usage, it could help to find a more specific answer.
> 
> Best regards,
> Bence Simon
> 
> 
> On 2023. Feb 20., at 15:19, Isha Lamboo <[email protected]> wrote:
> 
> Hi all,
> 
> This morning I had to fix up a cluster of NiFi 1.18.0 servers where the
> primary was constantly crashing and moving to the next server.
> 
> One of the recent changes was activating an Embedded Hazelcast Cache, and I
> did see errors reported with promotions going wrong. I can’t tell if this is
> cause or effect, so I’m trying to get a feeling for the performance demands
> of Hazelcast, but there is nothing to tune, only a time to live for cache
> items. The diagnostics dump also didn’t give me anything on this controller
> service.
> 
> Does anyone have experience with tuning/diagnosing the Hazelcast components
> within NiFi?
> 
> Kind regards,
> 
> Isha Lamboo
> Data Engineer
> <image001.png>
> 
> <nifi_hazelcast_log.txt>
