Hi Aaron,

Most probably this issue won't be fixed in 2.4, since work on it hasn't started yet.

As a workaround, setting the -DIGNITE_EXCHANGE_COMPATIBILITY_VER_1=true JVM argument seems to fix this problem.
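For illustration, a minimal sketch of one way to apply this when a node is started from Java; the StartNode class and the config path are hypothetical, and the property can equally be passed on the JVM command line:

    import org.apache.ignite.Ignition;

    public class StartNode {
        public static void main(String[] args) {
            // Must be set before the node starts. Equivalent to passing
            // -DIGNITE_EXCHANGE_COMPATIBILITY_VER_1=true on the JVM command line.
            System.setProperty("IGNITE_EXCHANGE_COMPATIBILITY_VER_1", "true");

            // Hypothetical Spring XML config path; use your own.
            Ignition.start("config/ignite-node.xml");
        }
    }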
Evgenii

2018-01-23 14:47 GMT+03:00 [email protected] <[email protected]>:

Hi Evgenii,

Now that error is gone, but another exception is thrown:

[ERROR] 2018-01-23 04:11:58.849 [srvc-deploy-#146] [ig] GridServiceProcessor - Error when executing service: null
java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated [locNode=TcpDiscoveryNode [id=a732a474-cd9a-4ae6-8366-0d058f1f80a8, addrs=[10.30.91.134], sockAddrs=[fx2/10.30.91.134:47500], discPort=47500, order=15, intOrder=15, lastExchangeTime=1516680718809, loc=true, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=18, minorTopVer=0], head=AffinityTopologyVersion [topVer=19, minorTopVer=0], history=[AffinityTopologyVersion [topVer=15, minorTopVer=0], AffinityTopologyVersion [topVer=16, minorTopVer=0], AffinityTopologyVersion [topVer=17, minorTopVer=0], AffinityTopologyVersion [topVer=19, minorTopVer=0]]]
    at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:514) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.nodes(GridAffinityAssignmentCache.java:419) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.nodesByPartition(GridCacheAffinityManager.java:220) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByPartition(GridCacheAffinityManager.java:256) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:247) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:271) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.service.GridServiceProcessor$TopologyListener$1.run0(GridServiceProcessor.java:1771) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.service.GridServiceProcessor$DepRunnable.run(GridServiceProcessor.java:1958) [ignite-core-2.3.0.jar!/:2.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]

Note that there is a similar JIRA: https://issues.apache.org/jira/browse/IGNITE-7366

I'm not sure whether this issue is confirmed; is there any plan to fix it in 2.4?

Regards
Aaron
------------------------------
Aaron.Kuai

From: Evgenii Zhuravlev <[email protected]>
Date: 2018-01-19 16:36
To: user <[email protected]>
Subject: Re: Re: Nodes can not join the cluster after reboot

Please let us know if this helped you.

Evgenii

2018-01-19 11:35 GMT+03:00 [email protected] <[email protected]>:

Hi Evgenii,

I'm trying to remove this part and use @LoggerResource instead; I will have a try! Thanks for your time.

Regards
Aaron
------------------------------
Aaron.Kuai

From: Evgenii Zhuravlev <[email protected]>
Date: 2018-01-19 16:28
To: user <[email protected]>
Subject: Re: Re: Nodes can not join the cluster after reboot

Aaron,

After creation, this Service instance will be serialized and deserialized on the target nodes. So the Logger field will be serialized too, and I don't think it can be properly serialized with the possibility of deserialization, since it holds context internally. It's not recommended to use such fields in a Service; you should use

@LoggerResource
private IgniteLogger log;

instead. I'm not sure if it's the root cause, but it definitely could cause some problems.
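For illustration, a minimal self-contained sketch of a service built this way; the EchoService class is hypothetical, not code from this thread:

    import org.apache.ignite.IgniteLogger;
    import org.apache.ignite.resources.LoggerResource;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    public class EchoService implements Service {
        // Injected by Ignite on the target node after deserialization,
        // so nothing logger-related travels with the serialized service.
        @LoggerResource
        private IgniteLogger log;

        @Override public void init(ServiceContext ctx) {
            log.info("Service initialized: " + ctx.name());
        }

        @Override public void execute(ServiceContext ctx) {
            // No-op for this sketch.
        }

        @Override public void cancel(ServiceContext ctx) {
            log.info("Service cancelled: " + ctx.name());
        }
    }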
Evgenii

2018-01-19 4:51 GMT+03:00 [email protected] <[email protected]>:

Hi Evgenii,

Sure, thanks for your time! This service works as a delegate: all requests are routed to a bean in our Spring context.

Thanks again!
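For illustration, a minimal sketch of what such a delegate might look like; the interface, bean name, and class names below are hypothetical, not the actual code, and the Spring context injection assumes the node is started with a Spring application context (e.g. via ignite-spring):

    import org.apache.ignite.resources.SpringApplicationContextResource;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;
    import org.springframework.context.ApplicationContext;

    interface CommandApi {
        String handle(String command);
    }

    public class DelegatingCommandService implements Service, CommandApi {
        // Injected on the target node after deserialization; transient so
        // nothing Spring-related is serialized with the service.
        @SpringApplicationContextResource
        private transient ApplicationContext springCtx;

        private transient CommandApi delegate;

        @Override public void init(ServiceContext ctx) {
            // Hypothetical bean name: resolve the real handler from Spring.
            delegate = springCtx.getBean("commandHandler", CommandApi.class);
        }

        @Override public String handle(String command) {
            // Route every request to the Spring bean.
            return delegate.handle(command);
        }

        @Override public void execute(ServiceContext ctx) { /* no-op for a proxy-style service */ }
        @Override public void cancel(ServiceContext ctx) { /* no-op */ }
    }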
Regards
Aaron
------------------------------
Aaron.Kuai

From: Evgenii Zhuravlev <[email protected]>
Date: 2018-01-18 21:59
To: user <[email protected]>
Subject: Re: Re: Nodes can not join the cluster after reboot

Aaron, could you share the code of com.tophold.trade.ignite.service.CommandRemoteService?

Thanks,
Evgenii

2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <[email protected]>:

Hi Aaron,

I think that the main problem is here:

GridServiceProcessor - Error when executing service: null

diagnostic - Pending transactions:
[WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%] [ig] diagnostic - >>> [txVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], exchWait=true, tx=GridDhtTxRemote [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1, nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313, nodeOrder=1], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter [explicitVers=null, started=true, commitAllowed=0, txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE, val=CacheObjectImpl [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15, cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService, svcCls=, nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]], prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=[], filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[], part=72, super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], val=CacheObjectImpl [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13, cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService, svcCls=, nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true], startVer=1516183996434, ver=GridCacheVersion [topVer=127663998, order=1516184119343, nodeOrder=10], hash=-1440463172, extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null, rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], reentry=null, otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false, nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]], super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion [topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false, loc=false, threadId=585, startTime=1516186483489, nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion [topVer=127664000, order=1516193739547, nodeOrder=5], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0, sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE, invalidParts=null, state=PREPARED, timedOut=false, topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], duration=36138ms, onePhaseCommit=false]]]]

You have a pending transaction in the logs related to the service deployment. Most likely your service threw an NPE in the init() (or another) method and wasn't deployed. Could you check whether it's possible for your service to throw an NPE?
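For illustration, a hypothetical init() that guards against this failure mode; the 'handler' dependency is made up. The point of the sketch is that a null dependency would otherwise surface only as an opaque NPE (note that "Error when executing service: null" is consistent with an exception whose message is null), so checking and logging explicitly makes the deployment failure diagnosable:

    import org.apache.ignite.IgniteLogger;
    import org.apache.ignite.resources.LoggerResource;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    public class GuardedInitService implements Service {
        @LoggerResource
        private IgniteLogger log;

        // Hypothetical dependency that might not be resolved on the target node.
        private transient Runnable handler;

        @Override public void init(ServiceContext ctx) throws Exception {
            // Fail with a clear, logged message instead of an opaque NPE.
            if (handler == null) {
                log.error("Dependency 'handler' not resolved; service " + ctx.name() + " cannot start.");
                throw new IllegalStateException("handler == null in " + ctx.name());
            }
        }

        @Override public void execute(ServiceContext ctx) { /* no-op sketch */ }
        @Override public void cancel(ServiceContext ctx) { /* no-op */ }
    }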
Evgenii

2018-01-17 15:40 GMT+03:00 [email protected] <[email protected]>:

Hi Evgenii,

What's more interesting: if we reboot them within a very short time, say one hour, in our monitoring log we can find the expected NODE_LEFT and NODE_JOIN events, and everything moves smoothly.

But after several hours, the problem below is sure to happen if you try to reboot any node in the cluster.

Regards
Aaron
------------------------------
Aaron.Kuai

From: [email protected]
Date: 2018-01-17 20:05
To: user <[email protected]>
Subject: Re: Re: Nodes can not join the cluster after reboot

Hi Evgenii,

Thanks! We collected some logs: one from the server that was rebooted, two from existing servers, and one from a client-only node. After the reboot:

1. The rebooted node never comes up completely; it waits forever.
2. The other server nodes, after being notified that the rebooted node went down, soon hang as well.
3. The pure client node, which only calls a remote service on the rebooted node, also hangs.

At around 2018-01-17 10:54 we rebooted the node. In the log we can find:

[WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDiscoveryEvents - All server nodes for the following caches have left the cluster: 'PortfolioCommandService_SVC_CO_DUM_CACHE', 'PortfolioSnapshotGenericDomainEventEntry', 'PortfolioGenericDomainEventEntry'

Soon after, an ERROR log (seemingly the only ERROR-level entry):

[ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] GridServiceProcessor - Error when executing service: null
java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated

Then many WARNs of

"Failed to wait for partition release future........................."

Then it loops there forever. From the diagnostics nothing seems suspicious; all nodes eventually output something very similar:

[WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] diagnostic - Pending explicit locks:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending atomic cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending data streamer futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending transaction deadlock detection futures:

Some notes on our environment:

1. We enabled the peer class loading flag, but in fact we use a fat jar, so every class is shared.
2. Some nodes deploy services, which we use as an RPC mechanism.
3. Most caches are in fact LOCAL; we make them shared only when we must.
4. We use JDBC to persist the important caches.
5. TcpDiscoveryJdbcIpFinder is used as the IP finder (a configuration sketch follows below).

All other configuration is standard.
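For reference, a minimal sketch of a node configuration along those lines; the factory class and the externally supplied data source are hypothetical, not the actual setup:

    import javax.sql.DataSource;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.jdbc.TcpDiscoveryJdbcIpFinder;

    public class NodeConfigFactory {
        public static IgniteConfiguration create(DataSource ds) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Item 1: peer class loading (harmless with a fat jar, since all
            // classes are already present on every node).
            cfg.setPeerClassLoadingEnabled(true);

            // Item 5: nodes discover each other through addresses stored in a
            // shared database table.
            TcpDiscoveryJdbcIpFinder ipFinder = new TcpDiscoveryJdbcIpFinder();
            ipFinder.setDataSource(ds); // Hypothetical externally managed pool.

            TcpDiscoverySpi disco = new TcpDiscoverySpi();
            disco.setIpFinder(ipFinder);
            cfg.setDiscoverySpi(disco);

            return cfg;
        }
    }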
Thanks for your time!

Regards
Aaron
------------------------------
Aaron.Kuai

From: Evgenii Zhuravlev <[email protected]>
Date: 2018-01-16 20:32
To: user <[email protected]>
Subject: Re: Nodes can not join the cluster after reboot

Hi,

Most likely, on one of the nodes you have a hung transaction/future/lock, or even a deadlock; that's why new nodes can't join the cluster: they can't perform the exchange due to a pending operation. Please share the full logs from all nodes with thread dumps; it will help to find the root cause.

Evgenii

2018-01-16 5:35 GMT+03:00 [email protected] <[email protected]>:

Hi All,

We have an Ignite cluster running about 20+ nodes. To guard against any JVM memory issues, we schedule a reboot of those nodes in the middle of the night.

But in order to keep the service available, we reboot them one by one, e.g. nodes A, B, C, D with a 5-minute delay between them; when we do so, the rebooted nodes can never join the cluster again.

Eventually the entire cluster stops working, forever waiting for nodes to join the topology; we have to kill everything and restart from scratch, which seems incredible.

I'm not sure whether anyone has met this issue before, or whether there is some mistake we have made; attached is the Ignite log.

Thanks for your time!

Regards
Aaron
------------------------------
Aaron.Kuai
