Hi Simon,
The Hazelcast cache is being used by a DetectDuplicate processor to cache and
eliminate message ids. These arrive in large daily batches with 300-500k
messages, most (90+%) of which are actually duplicates. This was previously
done with a DistributedMapCacheServer, but that involved using only one of the
nodes (hardcoded in the MapCacheClient controller), giving us a single point of
failure for the flow. We had hoped to use Hazelcast to get a redundant cache
server, but I’m starting to think that this scenario causes too many concurrent
updates to the cache, on top of the already heavy load from other processing on
the batch.
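To make the access pattern concrete: for every message id the flow effectively does a put-if-absent with a TTL against the shared cache, so one daily batch means a few hundred thousand near-simultaneous cache writes coming from all three nodes. Below is a minimal sketch of that pattern against a plain Hazelcast map; the map name, TTL and types are placeholders rather than our actual configuration, and in NiFi this of course goes through the DetectDuplicate processor and its cache client service rather than direct API calls.

import java.util.concurrent.TimeUnit;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class DedupSketch {
    public static void main(String[] args) {
        // Hypothetical standalone member; in NiFi the embedded cache manager owns the instance.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Boolean> seen = hz.getMap("message-ids"); // placeholder map name

        String messageId = "example-id-123"; // placeholder key
        // putIfAbsent returns null only for the first writer of a key within the TTL,
        // so a non-null result marks the message as a duplicate.
        boolean duplicate = seen.putIfAbsent(messageId, Boolean.TRUE, 24, TimeUnit.HOURS) != null;
        System.out.println(duplicate ? "duplicate" : "first occurrence");

        hz.shutdown();
    }
}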
What was new to me was the CPU load on the cluster in question going through
the roof on all 3 nodes. I have no idea how a 16 vCPU server gets to a load of
100+.
The start roughly coincides with the arrival of the daily batch, though there
may have been other batch processes going on since it’s a Sunday. However, the
queues were pretty much empty again in an hour and yet the craziness kept going
until I finally decided to restart all nodes.
[cid:[email protected]]
The Hazelcast troubles might well be a side effect of the NiFi servers being
overloaded, and there could have been issues at the Azure VM level as well. But
activating the Hazelcast controller service is the only change I *know* about,
and it doesn’t seem far-fetched that it got into a loop trying to migrate/copy
partitions “lost” on other nodes.
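For reference, my mental model of the embedded member is roughly the sketch below. It is a plain Hazelcast 4.x example, not necessarily how NiFi’s EmbeddedHazelcastCacheManager builds its configuration, and the map name and backup count are assumptions on my part. As far as I understand Hazelcast, a partition is only reported as completely lost once the owner *and* all of its backups are gone, which in a 3-node cluster with one backup should take more than a single node dropping out.

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class EmbeddedMemberSketch {
    public static void main(String[] args) {
        // Hypothetical standalone configuration; NiFi assembles its own Config internally.
        Config config = new Config();
        config.setClusterName("nifi");

        MapConfig mapConfig = new MapConfig("message-ids");   // assumed map name
        mapConfig.setBackupCount(1);                          // one synchronous backup per partition
        mapConfig.setTimeToLiveSeconds(24 * 3600);            // comparable to the controller service TTL
        config.addMapConfig(mapConfig);

        // When a member leaves, backups are promoted and partitions migrate to the
        // remaining members; "completely lost" should only happen when no replica survives.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        System.out.println("members: " + hz.getCluster().getMembers());
    }
}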
I’ve attached a file with selected Hazelcast warnings and errors from the
nifi-app.log files, trying to include as many unique ones as possible.
The errors that kept repeating were these (always together):
2023-02-19 08:58:39,899Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-19 08:58:39,900Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] While changing cluster state to FROZEN!
java.lang.IllegalStateException: Still have pending migration tasks, cannot
lock cluster state! New state: ClusterStateChange{type=class
com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
Thanks,
Isha
From: Simon Bence <[email protected]>
Sent: Tuesday, 21 February 2023 08:52
To: [email protected]
Subject: Re: Embedded Hazelcast Cachemanager
Hi Isha,
Could you please share the error messages? It might shed light on something
that might affect the performance.
On the other hand, I am not aware of exhaustive performance tests for the
Hazelcast cache. In general it should not be the bottleneck, but if you could
give some more details about the error and possibly the intended way of usage,
it would help to find a more specific answer.
Best regards,
Bence Simon
On 2023. Feb 20., at 15:19, Isha Lamboo
<[email protected]> wrote:
Hi all,
This morning I had to fix up a cluster of NiFi 1.18.0 servers where the primary
node was constantly crashing and the role kept moving to the next server.
One of the recent changes was activating an Embedded Hazelcast Cache, and I did
see errors reported about promotions going wrong. I can’t tell if this is cause
or effect, so I’m trying to get a feeling for the performance demands of
Hazelcast, but there is nothing to tune, only a time to live for cache items.
The diagnostics dump also didn’t give me anything on this controller service.
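For what it’s worth, the kind of information I was hoping to get is basic cluster and partition health for the embedded instance. Something like the hypothetical probe below (run from inside the NiFi JVM, e.g. via a scripting processor) is what I had in mind; the class and method names are my own, not anything NiFi provides:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class EmbeddedCacheProbe {
    // Hypothetical helper: it has to run in the same JVM as NiFi to see the embedded instance(s).
    public static void probe() {
        for (HazelcastInstance hz : Hazelcast.getAllHazelcastInstances()) {
            System.out.println("instance:      " + hz.getName());
            System.out.println("members:       " + hz.getCluster().getMembers());
            System.out.println("cluster state: " + hz.getCluster().getClusterState());
            System.out.println("cluster safe:  " + hz.getPartitionService().isClusterSafe());
        }
    }
}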
Does anyone have experience with tuning/diagnosing the Hazelcast components
within NiFi?
Kind regards,
Isha Lamboo
Data Engineer
# First there are warnings about slow operations and time jumps, probably
because the system is overloaded
2023-02-19 01:51:22,847Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.SlowOperationDetectorThread]
c.h.s.i.o.s.SlowOperationDetector [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Slow operation detected: com.hazelcast.internal.nio.Packet
2023-02-19 01:51:24,077Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-36]
c.h.i.c.impl.ClusterHeartbeatManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Resetting heartbeat timestamps because of huge system clock jump!
Clock-Jump: 35000 ms, Heartbeat-Timeout: 60000 ms
2023-02-19 01:53:13,991Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.priority-generic-operation.thread-0]
c.h.i.c.impl.ClusterHeartbeatManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Ignoring heartbeat from Member [su20cnifi101-ap.REDACTED.nl]:5701 -
91c9a123-a1b9-4db5-9a87-9bb4a229d5bd since it is expired (now: 2023-02-19
02:51:34.755, timestamp: 2023-02-19 02:50:59.107)
# This continues for several hours until stuff gets worse:
2023-02-19 07:18:25,367Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.partition-operation.thread-7]
c.h.i.server.tcp.TcpServerConnection [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Connection[id=27,
/10.41.29.6:48586->su20cnifi102-ap.REDACTED.nl/10.41.29.7:5701, qualifier=null,
endpoint=[su20cnifi102-ap.REDACTED.nl]:5701, alive=false,
connectionType=MEMBER, planeIndex=0] closed. Reason: Exception in
Connection[id=27,
/10.41.29.6:48586->su20cnifi102-ap.REDACTED.nl/10.41.29.7:5701, qualifier=null,
endpoint=[su20cnifi102-ap.REDACTED.nl]:5701, alive=true, connectionType=MEMBER,
planeIndex=0],
thread=hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.partition-operation.thread-7
java.io.IOException: Broken pipe
2023-02-19 07:22:59,313Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.priority-generic-operation.thread-0]
c.h.i.cluster.impl.MembershipManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Member [su20cnifi101-ap.REDACTED.nl]:5701 -
91c9a123-a1b9-4db5-9a87-9bb4a229d5bd is suspected to be dead for reason:
explicit suspicion
2023-02-19 07:24:28,992Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.InternalPartitionService [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Following unknown addresses are found in partition table sent
from master[[su20cnifi103-ap.REDACTED.nl]:5701]. (Probably they have recently
joined or left the cluster.)
2023-02-19 07:25:33,461Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.partition-operation.thread-2]
c.h.m.i.o.MapClearExpiredOperation [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Missing responseHandler for
com.hazelcast.map.impl.operation.MapClearExpiredOperation{serviceName='hz:impl:mapService',
identityHash=1822878528, partitionId=114, replicaIndex=0, callId=0,
invocationTime=-1 (1970-01-01 00:59:59.999), waitTimeout=-1,
callTimeout=9223372036854775807,
tenantControl=com.hazelcast.spi.impl.tenantcontrol.NoopTenantControl@0,
expirationPercentage=10}
com.hazelcast.spi.exception.PartitionMigratingException: Partition is
migrating! this: [su20cnifi103-ap.REDACTED.nl]:5701, partitionId: 114,
operation: com.hazelcast.map.impl.operation.MapClearExpiredOperation, service:
hz:impl:mapService
at
com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.ensureNoPartitionProblems(OperationRunnerImpl.java:384)
2023-02-19 07:27:24,976Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-25]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Error during join check!
java.util.concurrent.ExecutionException:
com.hazelcast.core.OperationTimeoutException: SplitBrainMergeValidationOp
invocation failed to complete due to operation-heartbeat-timeout. Current time:
2023-02-19 08:27:24.512. Start time: 2023-02-19 08:26:24.082. Total elapsed
time: 60490 ms. Last operation heartbeat: never. Last operation heartbeat from
member: never.
Invocation{op=com.hazelcast.internal.cluster.impl.operations.SplitBrainMergeValidationOp{serviceName='hz:core:clusterService',
identityHash=2069864435, partitionId=-1, replicaIndex=0, callId=-13509555,
invocationTime=1676791086295 (2023-02-19 08:18:06.295), waitTimeout=-1,
callTimeout=60000,
tenantControl=com.hazelcast.spi.impl.tenantcontrol.NoopTenantControl@0},
tryCount=1, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000,
firstInvocationTimeMs=1676791584082, firstInvocationTime='2023-02-19
08:26:24.082', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01
01:00:00.000', target=[su20cnifi101-ap.REDACTED.nl]:5701,
pendingResponse={VOID}, backupsAcksExpected=-1, backupsAcksReceived=0,
connection=Connection[id=35, /10.41.29.6:5701->/10.41.29.8:44554,
qualifier=null, endpoint=[su20cnifi101-ap.REDACTED.nl]:5701, alive=true,
connectionType=MEMBER, planeIndex=0]}
# There are many repeated lines like the above, and at some point it seems the
Hazelcast nodes can't rejoin because they keep trying to lock the cluster with
outstanding migrations that can't be resumed/finished.
2023-02-19 08:54:49,331Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.priority-generic-operation.thread-0]
c.h.i.c.impl.ClusterHeartbeatManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Cannot handle heartbeat from [su20cnifi101-ap.REDACTED.nl]:5701, could
not acquire lock in time.
2023-02-19 08:58:38,893Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-19 08:58:38,898Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] While changing cluster state to FROZEN!
java.lang.IllegalStateException: Still have pending migration tasks, cannot
lock cluster state! New state: ClusterStateChange{type=class
com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
2023-02-19 08:59:05,697Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.generic-operation.thread-3]
c.h.i.p.o.PartitionStateOperation [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] com.hazelcast.spi.exception.CallerNotMemberException: Not Member! this:
[su20cnifi103-ap.REDACTED.nl]:5701, caller: [su20cnifi101-ap.REDACTED.nl]:5701,
partitionId: -1, operation:
com.hazelcast.internal.partition.operation.PartitionStateOperation, service:
hz:core:partitionService
# For the next day, these repeat many times, interspersed with some lost
partitions. This is just a selection of the different events.
2023-02-19 23:44:04,475Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.InvocationMonitorThread]
c.h.s.i.operationservice.impl.Invocation [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Retrying invocation:
Invocation{op=com.hazelcast.map.impl.operation.PutIfAbsentWithExpiryOperation{serviceName='hz:impl:mapService',
identityHash=791912699, partitionId=42, replicaIndex=0, callId=13530057,
invocationTime=1676849589819 (2023-02-20 00:33:09.819), waitTimeout=-1,
callTimeout=60000,
tenantControl=com.hazelcast.spi.impl.tenantcontrol.NoopTenantControl@0,
name=HARP}, tryCount=250, tryPauseMillis=500, invokeCount=160,
callTimeoutMillis=60000, firstInvocationTimeMs=1676850188226,
firstInvocationTime='2023-02-20 00:43:08.226', lastHeartbeatMillis=0,
lastHeartbeatTime='1970-01-01 01:00:00.000',
target=[su20cnifi103-ap.REDACTED.nl]:5701, pendingResponse={VOID},
backupsAcksExpected=-1, backupsAcksReceived=0, connection=null}, Reason:
com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:
[su20cnifi103-ap.REDACTED.nl]:5701 - f5c0ae70-5d96-4e91-b29e-718dee7763c5,
partitionId: 42, operation:
com.hazelcast.map.impl.operation.PutIfAbsentWithExpiryOperation, service:
hz:impl:mapService
2023-02-20 03:42:44,844Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-20 03:42:45,846Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-20 03:42:45,847Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.o.RollbackClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Rolling back cluster state! Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701
2023-02-20 03:42:45,848Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] While changing cluster state to FROZEN!
java.lang.IllegalStateException: Still have pending migration tasks, cannot
lock cluster state! New state: ClusterStateChange{type=class
com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
2023-02-20 03:43:43,806Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] While changing cluster state to FROZEN!
java.lang.IllegalStateException: Still have pending migration tasks, cannot
lock cluster state! New state: ClusterStateChange{type=class
com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
2023-02-20 03:43:43,953Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.impl.MigrationManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] partitionId=0 is completely lost!
2023-02-20 03:43:43,953Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.impl.MigrationManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] partitionId=1 is completely lost!
# note: many lines with all other partitions cut from here #
2023-02-20 03:43:44,080Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.impl.MigrationManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] partitionId=44 is completely lost!
2023-02-20 03:43:44,285Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.priority-generic-operation.thread-0]
c.h.i.cluster.impl.ClusterJoinManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] We should merge to [su20cnifi101-ap.REDACTED.nl]:5701 because their
data member count is bigger than ours [2 > 1]
2023-02-20 03:43:45,120Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Locking cluster state. Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701, lease-time: 60000
2023-02-20 03:43:45,120Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
# note: many lines with all other partitions cut from here #
2023-02-20 03:43:46,320Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.impl.MigrationManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] partitionId=270 is completely lost!
2023-02-20 03:43:46,363Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.impl.MigrationManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Assigning new owners for 271 LOST partitions!
2023-02-20 03:43:47,381Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Locking cluster state. Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701, lease-time: 60000
2023-02-20 03:43:47,916Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-20 03:43:47,917Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.o.RollbackClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Rolling back cluster state! Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701
2023-02-20 03:43:47,917Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] While changing cluster state to FROZEN!
java.lang.IllegalStateException: Still have pending migration tasks, cannot
lock cluster state! New state: ClusterStateChange{type=class
com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
2023-02-20 03:43:48,923Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Locking cluster state. Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701, lease-time: 60000
2023-02-20 03:43:48,923Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-20 03:43:48,924Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.o.RollbackClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Rolling back cluster state! Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701