Hi Isha,

We are using Redis in a 3-node Redis Sentinel cluster for HA purposes. It works fine.
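For reference, a minimal Sentinel setup along the lines Jens describes might look like the sketch below. The master name `mymaster`, the hostnames, and the quorum value are illustrative assumptions, not his actual configuration:

```
# sentinel.conf (one copy per Sentinel node; hostnames/ports are illustrative)
# Monitor the master "mymaster"; a quorum of 2 suits a 3-node deployment.
sentinel monitor mymaster redis-node-1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
```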
Kind regards,
Jens M. Kofoed

On Wed, 22 Feb 2023 at 11:36, Isha Lamboo <[email protected]> wrote:

> Hi Simon,
>
> Thanks for your explanation. It will help me manage expectations with the
> team that developed the flow. We were hoping to do exactly as you suggest:
> drop in a redundant cache without the time and resource investment of
> setting up an external cluster like Redis or Hazelcast. And in fact, it
> runs fine on most days, but as currently set up it doesn't play nice when
> the load on the cluster gets too high or nodes disconnect.
>
> If I get the time to run some tests I'll share the results, but for now
> I'll advise the devs either to accept a longer run and schedule
> DetectDuplicate less often, or to revert to using the
> DistributedMapCacheServer on a single node again. If neither is
> acceptable, they can request an external cache service cluster.
>
> Thank you very much,
>
> Isha
>
> -----Original message-----
> From: Simon Bence <[email protected]>
> Sent: Wednesday, 22 February 2023 10:47
> To: [email protected]
> Subject: Re: Embedded Hazelcast Cachemanager
>
> Hi Isha,
>
> Without a deeper understanding of the situation I am not sure whether the
> load comes entirely from this part of the batch processing, but for the
> scope of this discussion I will assume it does, and that these
> measurements contrast sharply with the same measurements using
> DistributedMapCache as the cache.
>
> The EmbeddedHazelcastCacheManager was primarily added as an
> out-of-the-box solution for simpler scenarios, something that can be
> "dropped onto the canvas" without much fuss. Because of this, it has very
> limited customisation capabilities. As your scenario appears to use
> Hazelcast heavily, it might not be the ideal tool. It is also important
> to mention that with the embedded approach, the Hazelcast instances run
> on the same servers as NiFi, so they add to the load already produced by
> other parts of the flow.
> Using ExternalHazelcastCacheManager can provide much more flexibility: as
> it works with standalone Hazelcast instances, this approach opens up the
> whole range of Hazelcast's performance-optimisation capabilities. You can
> use either a single instance shared by all the NiFi nodes (which involves
> no synchronisation between Hazelcast nodes but might become a bottleneck
> at some point) or build up a separate cluster. Of course, the results
> depend highly on network topology and other factors specific to your use
> case.
>
> Also, I do not know the details of your flows, or whether you prefer
> processing time over throughput, but distributing the batch over time to
> produce smaller peaks is another possible optimisation opportunity.
>
> Best regards,
> Bence
>
> > On 2023. Feb 21., at 21:45, Isha Lamboo <[email protected]>
> > wrote:
> >
> > Hi Simon,
> >
> > The Hazelcast cache is being used by a DetectDuplicate processor to
> > cache and eliminate message IDs. These arrive in large daily batches of
> > 300-500k messages, most (90+%) of which are actually duplicates. This
> > was previously done with a DistributedMapCacheServer, but that involved
> > using only one of the nodes (hardcoded in the MapCacheClient
> > controller), giving us a single point of failure for the flow. We had
> > hoped to use Hazelcast to get a redundant cache server, but I'm starting
> > to think that this scenario causes too many concurrent updates of the
> > cache, on top of the already heavy load from other processing on the
> > batch.
> >
> > What was new to me is the CPU load on the cluster in question going
> > through the roof, on all 3 nodes. I have no idea how a 16 vCPU server
> > gets to a load of 100+.
> >
> > The start roughly coincides with the arrival of the daily batch, though
> > there may have been other batch processes going on since it's a Sunday.
> > However, the queues were pretty much empty again within an hour and yet
> > the craziness kept going until I finally decided to restart all nodes.
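The dedup pattern Isha describes (cache each message ID, drop IDs already seen, expire entries after a time-to-live) can be sketched in plain Python. This is a toy stand-in, not NiFi's API: a local dict with timestamps plays the role of the distributed cache, and `put_if_absent` mimics the atomic check-and-insert that DistributedMapCache or Hazelcast provide in the real flow. All names here are illustrative.

```python
import time


class TtlDedupCache:
    """Toy stand-in for a distributed cache: remembers keys for ttl seconds."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._seen = {}  # message id -> insertion time

    def put_if_absent(self, key, now=None):
        """Return True if the key was newly inserted (i.e. not a duplicate)."""
        now = time.monotonic() if now is None else now
        inserted_at = self._seen.get(key)
        if inserted_at is not None and now - inserted_at < self.ttl:
            return False  # duplicate within the TTL window
        self._seen[key] = now
        return True


def deduplicate(message_ids, cache):
    """Keep only the IDs not already seen within the cache's TTL."""
    return [mid for mid in message_ids if cache.put_if_absent(mid)]


cache = TtlDedupCache(ttl_seconds=86400)  # one-day window, matching a daily batch
batch = ["a", "b", "a", "c", "b", "a"]
unique = deduplicate(batch, cache)
print(unique)  # -> ['a', 'b', 'c']
```

In the clustered setup the critical part is that `put_if_absent` is atomic on the shared cache, so two NiFi nodes processing the same ID concurrently cannot both treat it as new; with 90+% duplicates, almost every call is a read-then-reject, which is why the write contention Isha suspects would come from the remaining fraction plus the TTL churn.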
> > <image001.png>
> >
> > The Hazelcast troubles might well be a side effect of the NiFi servers
> > being overloaded. There could have been issues at the Azure VM level,
> > etc. But activating the Hazelcast controller is the only change I
> > *know* about. And it doesn't seem far-fetched that it got into a loop
> > trying to migrate/copy partitions "lost" on other nodes.
> >
> > I've attached a file with selected Hazelcast warnings and errors from
> > the nifi-app.log files, trying to include as many unique ones as
> > possible.
> >
> > The errors that kept repeating were these (always together):
> >
> > 2023-02-19 08:58:39,899Z (UTC+0) ERROR
> > [hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
> > c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
> > [nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster
> > state! New state: ClusterStateChange{type=class
> > com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
> >
> > 2023-02-19 08:58:39,900Z (UTC+0) WARN
> > [hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
> > c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701
> > [nifi] [4.2.5] While changing cluster state to FROZEN!
> > java.lang.IllegalStateException: Still have pending migration tasks,
> > cannot lock cluster state! New state: ClusterStateChange{type=class
> > com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
> >
> > Thanks,
> >
> > Isha
> >
> > From: Simon Bence <[email protected]>
> > Sent: Tuesday, 21 February 2023 08:52
> > To: [email protected]
> > Subject: Re: Embedded Hazelcast Cachemanager
> >
> > Hi Isha,
> >
> > Could you please share the error messages? They might shed light on
> > something that affects the performance.
> >
> > On the other hand, I am not aware of exhaustive performance tests for
> > the Hazelcast cache.
> > In general it should not be the bottleneck, but if you could give some
> > details about the error and possibly the intended way of usage, it
> > could help to find a more specific answer.
> >
> > Best regards,
> > Bence Simon
> >
> > On 2023. Feb 20., at 15:19, Isha Lamboo <[email protected]>
> > wrote:
> >
> > Hi all,
> >
> > This morning I had to fix up a cluster of NiFi 1.18.0 servers where the
> > primary node was constantly crashing, with the role moving to the next
> > server.
> >
> > One of the recent changes was activating an Embedded Hazelcast Cache,
> > and I did see errors reported about promotions going wrong. I can't
> > tell whether this is cause or effect, so I'm trying to get a feel for
> > the performance demands of Hazelcast, but there is nothing to tune,
> > only a time to live for cache items. The diagnostics dump also didn't
> > give me anything on this controller service.
> >
> > Does anyone have experience with tuning/diagnosing the Hazelcast
> > components within NiFi?
> >
> > Kind regards,
> >
> > Isha Lamboo
> > Data Engineer
> > <image001.png>
> >
> > <nifi_hazelcast_log.txt>
