David Harvey created IGNITE-9026:
------------------------------------

             Summary: Two levels of Peer class loading fails in CONTINUOUS mode
                 Key: IGNITE-9026
                 URL: https://issues.apache.org/jira/browse/IGNITE-9026
             Project: Ignite
          Issue Type: Bug
    Affects Versions: 2.5
            Reporter: David Harvey


We had a seemingly functional system in SHARED mode, where we have a custom 
StreamReceiver that sometimes sends closures from the peer-class-loaded code to 
other servers.  However, we ended up running out of Metaspace, because we had 
more than 6000 class loaders!  We suspected a regression in this change 
[https://github.com/apache/ignite/commit/d2050237ee2b760d1c9cbc906b281790fd0976b4#diff-3fae20691c16a617d0c6158b0f61df3c],
 so we switched to CONTINUOUS mode.  We then started getting failures to load 
some of the classes for the closures on the second server.  Through some 
testing and code inspection, there seem to be the following flaws in the 
interaction between GridDeploymentCommunication.sendResourceRequest and its two 
callers.

The callers iterate through all the participant nodes until they find an online 
node that responds to the request with either success or failure (a timeout is 
treated as an offline node), and then the loop terminates.  The assumption is 
that all nodes are equally capable of providing the resource, so if one fails, 
then the others would also fail.
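
To make the loop semantics concrete, here is a minimal sketch of the behaviour 
described above.  It is a hypothetical model, not Ignite's actual code: the 
class ResourceLookupSketch, the Reply enum and requestFromParticipants() are 
names invented for illustration.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical model of the caller loop; not Ignite's actual code. */
final class ResourceLookupSketch {
    /** Possible outcomes of asking one node for a resource. */
    enum Reply { SUCCESS, FAILURE, TIMEOUT }

    /**
     * Asks the participants in order. A TIMEOUT is treated as an offline node
     * and the loop moves on; any SUCCESS or FAILURE reply ends the loop.
     */
    static Reply requestFromParticipants(List<String> participants,
                                         Function<String, Reply> send) {
        for (String node : participants) {
            Reply reply = send.apply(node);

            if (reply != Reply.TIMEOUT)
                return reply; // the first responding node decides the outcome
        }

        return Reply.FAILURE; // no participant responded
    }
}
```

In this model, if the participants are (A, B, C) where A times out, B answers 
with a failure and C could have answered with success, the result is a failure 
and C is never asked.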

The first flaw is that GridDeploymentCommunication.sendResourceRequest() has a 
check for a cycle, i.e., whether the destination node is one of the nodes that 
originated or forwarded this request, and in that case a failure response is 
faked.  However, that causes the caller's loop to terminate.  So depending on 
the order of the nodes in the participant list, sendResourceRequest() may fail 
before trying any nodes because one of the calling nodes is in the list.  It 
should instead skip any of the calling nodes.

Example with 1 client node and 2 server nodes:  C1 sends data to S1, which 
forwards a closure to S2.  C1 also sends to S2, which forwards to S1.  So now 
the node lists on S1 and S2 contain C1 and the other S node.  If the order of 
the node list on S1 is (S2,C1) and on S2 is (S1,C1), then when S1 tries to load 
a class, it will try S2, then S2 will try S1, but will get a fake failure 
generated, causing S2 not to try more nodes (i.e., C1), and causing S1 also not 
to try more nodes.
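
The example can be replayed with a small simulation.  This is a sketch under 
assumptions, not Ignite's implementation: CycleCheckSketch, load() and the 
skipCallers flag are hypothetical names, C1 is modelled as the only node that 
actually holds the class, and every node is assumed to be online.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the example topology; not Ignite's actual code. */
final class CycleCheckSketch {
    enum Reply { SUCCESS, FAILURE }

    /** Node id -> ordered participant list held on that node. */
    private final Map<String, List<String>> participants = new HashMap<>();
    /** Nodes that hold the class themselves (assumption: only C1). */
    private final Set<String> masters;
    /** true = skip calling nodes; false = fake a failure, as described above. */
    private final boolean skipCallers;

    CycleCheckSketch(Set<String> masters, boolean skipCallers) {
        this.masters = masters;
        this.skipCallers = skipCallers;
    }

    void participants(String node, List<String> order) {
        participants.put(node, order);
    }

    /** Loads a class on 'node'; 'callers' are the originator and forwarders. */
    Reply load(String node, Set<String> callers) {
        if (masters.contains(node))
            return Reply.SUCCESS;

        Set<String> chain = new LinkedHashSet<>(callers);
        chain.add(node);

        for (String dest : participants.getOrDefault(node, List.of())) {
            if (chain.contains(dest)) {
                if (skipCallers)
                    continue;         // proposed behaviour: try the next node
                return Reply.FAILURE; // faked failure ends the loop
            }
            return load(dest, chain); // any real response ends the loop
        }

        return Reply.FAILURE; // nobody left to ask
    }

    /** Builds the topology from the example and loads a class on S1. */
    static Reply runExample(boolean skipCallers) {
        CycleCheckSketch grid = new CycleCheckSketch(Set.of("C1"), skipCallers);
        grid.participants("S1", List.of("S2", "C1"));
        grid.participants("S2", List.of("S1", "C1"));
        return grid.load("S1", Set.of());
    }
}
```

In this model, runExample(false) returns FAILURE because S2 stops at the faked 
failure for S1, while runExample(true) returns SUCCESS because S2 skips S1 and 
reaches C1.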

The other flaw is the assumption that all participants have equal access to the 
resource.  Assume S1 knows about userVersion1 via S3 and S4, with S3 through C1 
and S4 through C2.  If C2 fails, then S4 is not capable of getting back to a 
master, but S1 has no way of knowing that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
