Mark Bean created NIFI-9433:
-------------------------------

             Summary: Load Balancer hangs
                 Key: NIFI-9433
                 URL: https://issues.apache.org/jira/browse/NIFI-9433
             Project: Apache NiFi
          Issue Type: Bug
          Components: Core Framework
    Affects Versions: 1.15.0
            Reporter: Mark Bean


Simplified scenario to demonstrate the problem:
A 2-node cluster runs a simple flow: GenerateFlowFile -> load-balanced
connection -> UpdateAttribute. Separately, and not connected to those two
processors: Funnel #1 -> non-load-balanced connection -> Funnel #2.
GenerateFlowFile is scheduled to run on the Primary Node only. Once it is
started, the load-balanced connection becomes very busy load balancing (round
robin). Then the connection between the two funnels is removed.
An error is thrown immediately, and the flow gets stuck constantly throwing
errors indicating that a connection (the one just deleted) does not exist and
cannot be balanced.
It is unclear why this connection is being considered by the load balancer at
all.

The sequence of errors includes the following:
Primary Node reports:
2021-12-02 12:20:03,812 ERROR [NiFi Web Server-1811] 
o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue Unacknowledged from 
FlowFile Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap 
Files=[0], Unacknowledged=[0, 0 Bytes] ] to FlowFile Queue Size[ 
ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], 
Unacknowledged=[-206, -20600 Bytes] ]
java.lang.RuntimeException: Cannot create negative queue size
2021-12-02 12:20:03,813 ERROR [NiFi Web Server-1811] 
o.a.n.c.queue.SwappablePriorityQueue Updated Size of Queue active from FlowFile 
Queue Size[ ActiveQueue=[0, 0 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], 
Unacknowledged=[-206, -20600 Bytes] ] to FlowFile Queue Size[ ActiveQueue=[206, 
20600 Bytes], Swap Queue=[0, 0 Bytes], Swap Files=[0], Unacknowledged=[-206, 
-20600 Bytes] ]
java.lang.RuntimeException: Cannot create negative queue size
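
For illustration only, here is a minimal Python sketch of one bookkeeping
sequence that would produce the two sizes logged above. It is not NiFi code:
the class `QueueSize` and the method `requeue_unacknowledged` are hypothetical
names, and the assumption is that 206 FlowFiles (20600 bytes) were moved from
"unacknowledged" back to "active" on a queue that had no FlowFiles in flight.

```python
from dataclasses import dataclass, field


@dataclass
class QueueSize:
    """Hypothetical stand-in for the counters shown in the log message."""
    active_count: int = 0
    active_bytes: int = 0
    unack_count: int = 0
    unack_bytes: int = 0
    errors: list = field(default_factory=list)

    def _check(self, which):
        # Record an error whenever any counter is negative after an update.
        counters = (self.active_count, self.active_bytes,
                    self.unack_count, self.unack_bytes)
        if min(counters) < 0:
            self.errors.append(
                f"Updated Size of Queue {which}: Cannot create negative queue size")

    def requeue_unacknowledged(self, count, size_bytes):
        # Step 1: take the FlowFiles out of the unacknowledged counters.
        self.unack_count -= count
        self.unack_bytes -= size_bytes
        self._check("Unacknowledged")
        # Step 2: put them back on the active counters.
        self.active_count += count
        self.active_bytes += size_bytes
        self._check("active")


# Assumption: the queue had nothing in flight when the requeue arrived.
q = QueueSize()
q.requeue_unacknowledged(206, 20600)
print(q)
```

Under that assumption the counters end at ActiveQueue=[206, 20600 Bytes] and
Unacknowledged=[-206, -20600 Bytes], with one error per update, which matches
the two log entries. Whether this is what actually happened is not confirmed.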

The above may be a symptom of subsequent errors in the log:
Primary Node reports:
2021-12-02 12:20:03,814 ERROR [Load-Balanced Client Thread-6] 
o.a.n.c.q.c.c.a.n.NioAsyncLoadBalanceClient Failed to communicate with Peer 
{host:port}
java.io.IOException: Failed to negotiate Protocol Version with Peer 
{host:port}. Recommended version 1 but instead of an ACCEPT or REJECT response 
got back a response of 33.

Non-Primary Node reports:
2021-12-02 12:20:03,828 ERROR [Load-Balance Server Thread-4] 
o.a.n.c.q.c.s.ConnectionLoadBalanceServer Failed to communicate with Peer 
{fqdn/IP:port}
java.io.IOException: Expected to receive Transaction Completion Indicator from 
Peer {fqdn} but instead received a value of 1
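
One possible reading of the two messages above (not confirmed) is that the two
nodes are out of step on the same socket: the server reports receiving a value
of 1, which equals the version the client recommends when it starts a new
negotiation. The Python sketch below only illustrates that kind of mismatch.
The byte codes and function names are hypothetical and are not NiFi's real
protocol constants or API.

```python
import io

# Hypothetical one-byte codes for illustration; NOT NiFi's real constants.
VERSION_ACCEPTED = 0x10
VERSION_REJECTED = 0x11
COMPLETE_TRANSACTION = 0x31


def client_negotiate(version, from_server):
    """Client side: has just sent `version`, now expects ACCEPT or REJECT."""
    response = from_server.read(1)[0]
    if response not in (VERSION_ACCEPTED, VERSION_REJECTED):
        return (f"Recommended version {version} but instead of an ACCEPT or "
                f"REJECT response got back a response of {response}")
    return None


def server_finish_transaction(from_client):
    """Server side: still mid-transaction, expects the completion indicator."""
    value = from_client.read(1)[0]
    if value != COMPLETE_TRANSACTION:
        return ("Expected to receive Transaction Completion Indicator "
                f"but instead received a value of {value}")
    return None


# The client restarts negotiation by sending version 1 while the server is
# still waiting for the completion indicator of the previous transaction.
client_to_server = io.BytesIO(bytes([1]))
# The next byte the client reads belongs to the old exchange (33 is copied
# from the log; its meaning in the real protocol is not assumed here).
server_to_client = io.BytesIO(bytes([33]))

server_error = server_finish_transaction(client_to_server)
client_error = client_negotiate(1, server_to_client)
print(server_error)
print(client_error)
```

Each side reads a byte meant for a different protocol step, so both report an
unexpected value even though neither byte is corrupt.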

The most concerning part is the following error, which indicates that a
Connection not configured to load balance was attempting to receive a FlowFile.
Non-Primary Node reports:
2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808] 
o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles from 
Peer {fqdn} for Connection with ID {uuid} but no connection exists with that ID.

Note that the {uuid} value in this message corresponds to the Connection whose
removal caused the errors to begin. Should the above message ever occur?
Does the load balancer ever consider Connections which are configured as "Do
not load balance"?
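
To check which Connection IDs appear in these messages, the log can be scanned
with a short script. The Python sketch below is a rough example: the regular
expression is based only on the message text quoted above, the exact layout of
a real log line is an assumption, and the host name and UUID in `sample` are
made up for the example.

```python
import re
from collections import Counter

# Built from the message wording quoted in this report; assumes the ID is a
# standard 36-character UUID.
PATTERN = re.compile(
    r"for Connection with ID (?P<id>[0-9a-fA-F-]{36}) "
    r"but no connection exists with that ID"
)


def missing_connection_ids(lines):
    """Count how often each unknown Connection ID is reported."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("id")] += 1
    return counts


# Made-up sample lines; the peer name and UUID are placeholders.
sample = [
    "2021-12-02 12:29:05,228 ERROR [Load-Balance Server Thread-808] "
    "o.a.n.c.q.c.s.StandardLoadBalanceProtocol Attempted to receive FlowFiles "
    "from Peer node2.example.com for Connection with ID "
    "01234567-89ab-cdef-0123-456789abcdef but no connection exists with that ID.",
    "2021-12-02 12:29:05,300 INFO [Timer-Driven Process Thread-1] unrelated message",
]

for connection_id, count in missing_connection_ids(sample).items():
    print(connection_id, count)
```

The resulting IDs can then be compared with the ID of the removed Connection.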

Users have also reported that FlowFiles have been load balanced from one 
Connection to another, unrelated Connection on the other Node. (This is still 
being verified.)

Finally, in the UI the load-balanced connection indicates that it is actively
load balancing some number of FlowFiles (206 in this case) currently in the
connection. However, attempts to "list queue" on this connection show no
FlowFiles. Presumably they are being held by the load balancer and are
inaccessible in the queue.





--
This message was sent by Atlassian Jira
(v8.20.1#820001)
