Hi Simon,
The Hazelcast cache is being used by a DetectDuplicate processor to cache and
eliminate message ids. These arrive in large daily batches with 300-500k
messages, most (90+%) of which are actually duplicates. This was previously
done with a DistributedMapCacheServer, but that involved using only one of the
nodes (hardcoded in the MapCacheClient controller), giving us a single point of
failure for the flow. We had hoped to use Hazelcast to get a redundant cache
server, but I’m starting to think that this scenario causes too many concurrent
updates to the cache, on top of the already heavy load from other processing on
the batch.
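To make the access pattern concrete: for every message id the flow effectively does a put-if-absent with a TTL against the shared cache, so one daily batch means a few hundred thousand near-simultaneous cache writes coming from all three nodes. Below is a minimal sketch of that pattern against a plain Hazelcast map; the map name, TTL and types are placeholders rather than our actual configuration, and in NiFi this of course goes through the DetectDuplicate processor and its cache client service rather than direct API calls.

import java.util.concurrent.TimeUnit;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class DedupSketch {
    public static void main(String[] args) {
        // Hypothetical standalone member; in NiFi the embedded cache manager owns the instance.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Boolean> seen = hz.getMap("message-ids"); // placeholder map name

        String messageId = "example-id-123"; // placeholder key
        // putIfAbsent returns null only for the first writer of a key within the TTL,
        // so a non-null result marks the message as a duplicate.
        boolean duplicate = seen.putIfAbsent(messageId, Boolean.TRUE, 24, TimeUnit.HOURS) != null;
        System.out.println(duplicate ? "duplicate" : "first occurrence");

        hz.shutdown();
    }
}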
What was new to me was the CPU load on the cluster in question going through
the roof on all 3 nodes. I have no idea how a 16 vCPU server gets to a load of
100+.
The start roughly coincides with the arrival of the daily batch, though there
may have been other batch processes going on since it’s a Sunday. However, the
queues were pretty much empty again in an hour and yet the craziness kept going
until I finally decided to restart all nodes.
[cid:[email protected]]
The Hazelcast troubles might well be a side effect of the NiFi servers being
overloaded, and there could have been issues at the Azure VM level as well. But
activating the Hazelcast controller service is the only change I *know* about,
and it doesn’t seem far-fetched that it got into a loop trying to migrate/copy
partitions “lost” on other nodes.
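For reference, my mental model of the embedded member is roughly the sketch below. It is a plain Hazelcast 4.x example, not necessarily how NiFi’s EmbeddedHazelcastCacheManager builds its configuration, and the map name and backup count are assumptions on my part. As far as I understand Hazelcast, a partition is only reported as completely lost once the owner *and* all of its backups are gone, which in a 3-node cluster with one backup should take more than a single node dropping out.

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class EmbeddedMemberSketch {
    public static void main(String[] args) {
        // Hypothetical standalone configuration; NiFi assembles its own Config internally.
        Config config = new Config();
        config.setClusterName("nifi");

        MapConfig mapConfig = new MapConfig("message-ids");   // assumed map name
        mapConfig.setBackupCount(1);                          // one synchronous backup per partition
        mapConfig.setTimeToLiveSeconds(24 * 3600);            // comparable to the controller service TTL
        config.addMapConfig(mapConfig);

        // When a member leaves, backups are promoted and partitions migrate to the
        // remaining members; "completely lost" should only happen when no replica survives.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        System.out.println("members: " + hz.getCluster().getMembers());
    }
}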
I’ve attached a file with selected Hazelcast warnings and errors from the
nifi-app.log files, trying to include as many unique ones as possible.
The errors that kept repeating were these (always together):
2023-02-19 08:58:39,899Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-19 08:58:39,900Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] While changing cluster state to FROZEN!
java.lang.IllegalStateException: Still have pending migration tasks, cannot
lock cluster state! New state: ClusterStateChange{type=class
com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
Thanks,
Isha
From: Simon Bence <[email protected]>
Sent: Tuesday, 21 February 2023 08:52
To: [email protected]
Subject: Re: Embedded Hazelcast Cachemanager
Hi Isha,
Could you please share the error messages? It might shed light on something
that might affect the performance.
On the other hand, I am not aware of exhaustive performance tests for the
Hazelcast cache. In general it should not be the bottleneck, but if you could
give some more details about the error and possibly the intended way of usage,
it would help to find a more specific answer.
Best regards,
Bence Simon
On 2023. Feb 20., at 15:19, Isha Lamboo
<[email protected]> wrote:
Hi all,
This morning I had to fix up a cluster of NiFi 1.18.0 servers where the primary
node was constantly crashing and the role kept moving to the next server.
One of the recent changes was activating an Embedded Hazelcast Cache, and I did
see errors reported about promotions going wrong. I can’t tell if this is cause
or effect, so I’m trying to get a feeling for the performance demands of
Hazelcast, but there is nothing to tune, only a time to live for cache items.
The diagnostics dump also didn’t give me anything on this controller service.
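For what it’s worth, the kind of information I was hoping to get is basic cluster and partition health for the embedded instance. Something like the hypothetical probe below (run from inside the NiFi JVM, e.g. via a scripting processor) is what I had in mind; the class and method names are my own, not anything NiFi provides:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class EmbeddedCacheProbe {
    // Hypothetical helper: it has to run in the same JVM as NiFi to see the embedded instance(s).
    public static void probe() {
        for (HazelcastInstance hz : Hazelcast.getAllHazelcastInstances()) {
            System.out.println("instance:      " + hz.getName());
            System.out.println("members:       " + hz.getCluster().getMembers());
            System.out.println("cluster state: " + hz.getCluster().getClusterState());
            System.out.println("cluster safe:  " + hz.getPartitionService().isClusterSafe());
        }
    }
}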
Does anyone have experience with tuning/diagnosing the Hazelcast components
within NiFi?
Kind regards,
Isha Lamboo
Data Engineer
# First there are warnings about slow operations and time jumps, probably
because the system is overloaded
2023-02-19 01:51:22,847Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.SlowOperationDetectorThread]
c.h.s.i.o.s.SlowOperationDetector [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Slow operation detected: com.hazelcast.internal.nio.Packet
2023-02-19 01:51:24,077Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-36]
c.h.i.c.impl.ClusterHeartbeatManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Resetting heartbeat timestamps because of huge system clock jump!
Clock-Jump: 35000 ms, Heartbeat-Timeout: 60000 ms
2023-02-19 01:53:13,991Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.priority-generic-operation.thread-0]
c.h.i.c.impl.ClusterHeartbeatManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Ignoring heartbeat from Member [su20cnifi101-ap.REDACTED.nl]:5701 -
91c9a123-a1b9-4db5-9a87-9bb4a229d5bd since it is expired (now: 2023-02-19
02:51:34.755, timestamp: 2023-02-19 02:50:59.107)
# This continues for several hours until stuff gets worse:
2023-02-19 07:18:25,367Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.partition-operation.thread-7]
c.h.i.server.tcp.TcpServerConnection [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Connection[id=27,
/10.41.29.6:48586->su20cnifi102-ap.REDACTED.nl/10.41.29.7:5701, qualifier=null,
endpoint=[su20cnifi102-ap.REDACTED.nl]:5701, alive=false,
connectionType=MEMBER, planeIndex=0] closed. Reason: Exception in
Connection[id=27,
/10.41.29.6:48586->su20cnifi102-ap.REDACTED.nl/10.41.29.7:5701, qualifier=null,
endpoint=[su20cnifi102-ap.REDACTED.nl]:5701, alive=true, connectionType=MEMBER,
planeIndex=0],
thread=hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.partition-operation.thread-7
java.io.IOException: Broken pipe
2023-02-19 07:22:59,313Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.priority-generic-operation.thread-0]
c.h.i.cluster.impl.MembershipManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Member [su20cnifi101-ap.REDACTED.nl]:5701 -
91c9a123-a1b9-4db5-9a87-9bb4a229d5bd is suspected to be dead for reason:
explicit suspicion
2023-02-19 07:24:28,992Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.InternalPartitionService [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Following unknown addresses are found in partition table sent
from master[[su20cnifi103-ap.REDACTED.nl]:5701]. (Probably they have recently
joined or left the cluster.)
2023-02-19 07:25:33,461Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.partition-operation.thread-2]
c.h.m.i.o.MapClearExpiredOperation [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Missing responseHandler for
com.hazelcast.map.impl.operation.MapClearExpiredOperation{serviceName='hz:impl:mapService',
identityHash=1822878528, partitionId=114, replicaIndex=0, callId=0,
invocationTime=-1 (1970-01-01 00:59:59.999), waitTimeout=-1,
callTimeout=9223372036854775807,
tenantControl=com.hazelcast.spi.impl.tenantcontrol.NoopTenantControl@0,
expirationPercentage=10}
com.hazelcast.spi.exception.PartitionMigratingException: Partition is
migrating! this: [su20cnifi103-ap.REDACTED.nl]:5701, partitionId: 114,
operation: com.hazelcast.map.impl.operation.MapClearExpiredOperation, service:
hz:impl:mapService
at
com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.ensureNoPartitionProblems(OperationRunnerImpl.java:384)
2023-02-19 07:27:24,976Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-25]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Error during join check!
java.util.concurrent.ExecutionException:
com.hazelcast.core.OperationTimeoutException: SplitBrainMergeValidationOp
invocation failed to complete due to operation-heartbeat-timeout. Current time:
2023-02-19 08:27:24.512. Start time: 2023-02-19 08:26:24.082. Total elapsed
time: 60490 ms. Last operation heartbeat: never. Last operation heartbeat from
member: never.
Invocation{op=com.hazelcast.internal.cluster.impl.operations.SplitBrainMergeValidationOp{serviceName='hz:core:clusterService',
identityHash=2069864435, partitionId=-1, replicaIndex=0, callId=-13509555,
invocationTime=1676791086295 (2023-02-19 08:18:06.295), waitTimeout=-1,
callTimeout=60000,
tenantControl=com.hazelcast.spi.impl.tenantcontrol.NoopTenantControl@0},
tryCount=1, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000,
firstInvocationTimeMs=1676791584082, firstInvocationTime='2023-02-19
08:26:24.082', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01
01:00:00.000', target=[su20cnifi101-ap.REDACTED.nl]:5701,
pendingResponse={VOID}, backupsAcksExpected=-1, backupsAcksReceived=0,
connection=Connection[id=35, /10.41.29.6:5701->/10.41.29.8:44554,
qualifier=null, endpoint=[su20cnifi101-ap.REDACTED.nl]:5701, alive=true,
connectionType=MEMBER, planeIndex=0]}
# There are many repeated lines like the above, and at some point it seems the
Hazelcast nodes can't rejoin because they keep trying to lock the cluster with
outstanding migrations that can't be resumed/finished.
2023-02-19 08:54:49,331Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.priority-generic-operation.thread-0]
c.h.i.c.impl.ClusterHeartbeatManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Cannot handle heartbeat from [su20cnifi101-ap.REDACTED.nl]:5701, could
not acquire lock in time.
2023-02-19 08:58:38,893Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-19 08:58:38,898Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] While changing cluster state to FROZEN!
java.lang.IllegalStateException: Still have pending migration tasks, cannot
lock cluster state! New state: ClusterStateChange{type=class
com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
2023-02-19 08:59:05,697Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.generic-operation.thread-3]
c.h.i.p.o.PartitionStateOperation [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] com.hazelcast.spi.exception.CallerNotMemberException: Not Member! this:
[su20cnifi103-ap.REDACTED.nl]:5701, caller: [su20cnifi101-ap.REDACTED.nl]:5701,
partitionId: -1, operation:
com.hazelcast.internal.partition.operation.PartitionStateOperation, service:
hz:core:partitionService
# For the next day, these repeat many times, interspersed with some lost
partitions. This is just a selection of the different events.
2023-02-19 23:44:04,475Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.InvocationMonitorThread]
c.h.s.i.operationservice.impl.Invocation [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Retrying invocation:
Invocation{op=com.hazelcast.map.impl.operation.PutIfAbsentWithExpiryOperation{serviceName='hz:impl:mapService',
identityHash=791912699, partitionId=42, replicaIndex=0, callId=13530057,
invocationTime=1676849589819 (2023-02-20 00:33:09.819), waitTimeout=-1,
callTimeout=60000,
tenantControl=com.hazelcast.spi.impl.tenantcontrol.NoopTenantControl@0,
name=HARP}, tryCount=250, tryPauseMillis=500, invokeCount=160,
callTimeoutMillis=60000, firstInvocationTimeMs=1676850188226,
firstInvocationTime='2023-02-20 00:43:08.226', lastHeartbeatMillis=0,
lastHeartbeatTime='1970-01-01 01:00:00.000',
target=[su20cnifi103-ap.REDACTED.nl]:5701, pendingResponse={VOID},
backupsAcksExpected=-1, backupsAcksReceived=0, connection=null}, Reason:
com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:
[su20cnifi103-ap.REDACTED.nl]:5701 - f5c0ae70-5d96-4e91-b29e-718dee7763c5,
partitionId: 42, operation:
com.hazelcast.map.impl.operation.PutIfAbsentWithExpiryOperation, service:
hz:impl:mapService
2023-02-20 03:42:44,844Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-20 03:42:45,846Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-20 03:42:45,847Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.o.RollbackClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Rolling back cluster state! Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701
2023-02-20 03:42:45,848Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] While changing cluster state to FROZEN!
java.lang.IllegalStateException: Still have pending migration tasks, cannot
lock cluster state! New state: ClusterStateChange{type=class
com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
2023-02-20 03:43:43,806Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] While changing cluster state to FROZEN!
java.lang.IllegalStateException: Still have pending migration tasks, cannot
lock cluster state! New state: ClusterStateChange{type=class
com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
2023-02-20 03:43:43,953Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.impl.MigrationManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] partitionId=0 is completely lost!
2023-02-20 03:43:43,953Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.impl.MigrationManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] partitionId=1 is completely lost!
# note: many lines with all other partitions cut from here #
2023-02-20 03:43:44,080Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.impl.MigrationManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] partitionId=44 is completely lost!
2023-02-20 03:43:44,285Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.priority-generic-operation.thread-0]
c.h.i.cluster.impl.ClusterJoinManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] We should merge to [su20cnifi101-ap.REDACTED.nl]:5701 because their
data member count is bigger than ours [2 > 1]
2023-02-20 03:43:45,120Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Locking cluster state. Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701, lease-time: 60000
2023-02-20 03:43:45,120Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
# note: many lines with all other partitions cut from here #
2023-02-20 03:43:46,320Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.impl.MigrationManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] partitionId=270 is completely lost!
2023-02-20 03:43:46,363Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.migration]
c.h.i.partition.impl.MigrationManager [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Assigning new owners for 271 LOST partitions!
2023-02-20 03:43:47,381Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Locking cluster state. Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701, lease-time: 60000
2023-02-20 03:43:47,916Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-20 03:43:47,917Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.o.RollbackClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Rolling back cluster state! Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701
2023-02-20 03:43:47,917Z (UTC+0) WARN
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] While changing cluster state to FROZEN!
java.lang.IllegalStateException: Still have pending migration tasks, cannot
lock cluster state! New state: ClusterStateChange{type=class
com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
2023-02-20 03:43:48,923Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Locking cluster state. Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701, lease-time: 60000
2023-02-20 03:43:48,923Z (UTC+0) ERROR
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701
[nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state!
New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState,
newState=FROZEN}, current state: ACTIVE
2023-02-20 03:43:48,924Z (UTC+0) INFO
[hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-3]
c.h.i.c.i.o.RollbackClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701 [nifi]
[4.2.5] Rolling back cluster state! Initiator:
[su20cnifi103-ap.REDACTED.nl]:5701