Hi Aaron,

Most probably this issue won't be fixed in 2.4, since work on it hasn't started yet.

As a workaround, setting the -DIGNITE_EXCHANGE_COMPATIBILITY_VER_1=true JVM argument seems to fix this problem.
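For illustration, a minimal sketch of one way to apply this when a node is started from Java; the StartNode class and the config path are hypothetical, and the property can equally be passed on the JVM command line:

    import org.apache.ignite.Ignition;

    public class StartNode {
        public static void main(String[] args) {
            // Must be set before the node starts. Equivalent to passing
            // -DIGNITE_EXCHANGE_COMPATIBILITY_VER_1=true on the JVM command line.
            System.setProperty("IGNITE_EXCHANGE_COMPATIBILITY_VER_1", "true");

            // Hypothetical Spring XML config path; use your own.
            Ignition.start("config/ignite-node.xml");
        }
    }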
Evgenii

2018-01-23 14:47 GMT+03:00 [email protected] <[email protected]>:

Hi Evgenii,

Now that error is gone, but another exception is thrown:

[ERROR] 2018-01-23 04:11:58.849 [srvc-deploy-#146] [ig] GridServiceProcessor - Error when executing service: null
java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated [locNode=TcpDiscoveryNode [id=a732a474-cd9a-4ae6-8366-0d058f1f80a8, addrs=[10.30.91.134], sockAddrs=[fx2/10.30.91.134:47500], discPort=47500, order=15, intOrder=15, lastExchangeTime=1516680718809, loc=true, ver=2.3.0#20171028-sha1:8add7fd5, isClient=false], grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=18, minorTopVer=0], head=AffinityTopologyVersion [topVer=19, minorTopVer=0], history=[AffinityTopologyVersion [topVer=15, minorTopVer=0], AffinityTopologyVersion [topVer=16, minorTopVer=0], AffinityTopologyVersion [topVer=17, minorTopVer=0], AffinityTopologyVersion [topVer=19, minorTopVer=0]]]
    at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:514) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.nodes(GridAffinityAssignmentCache.java:419) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.nodesByPartition(GridCacheAffinityManager.java:220) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByPartition(GridCacheAffinityManager.java:256) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:247) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:271) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.service.GridServiceProcessor$TopologyListener$1.run0(GridServiceProcessor.java:1771) ~[ignite-core-2.3.0.jar!/:2.3.0]
    at org.apache.ignite.internal.processors.service.GridServiceProcessor$DepRunnable.run(GridServiceProcessor.java:1958) [ignite-core-2.3.0.jar!/:2.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]

Note that there is a similar JIRA: https://issues.apache.org/jira/browse/IGNITE-7366

I'm not sure whether this issue is confirmed; is there any plan to fix it in 2.4?

Regards
Aaron
------------------------------
Aaron.Kuai

From: Evgenii Zhuravlev <[email protected]>
Date: 2018-01-19 16:36
To: user <[email protected]>
Subject: Re: Re: Nodes can not join the cluster after reboot

Please let us know if this helped you.

Evgenii

2018-01-19 11:35 GMT+03:00 [email protected] <[email protected]>:

Hi Evgenii,

I'm trying to remove this part and use @LoggerResource instead; I will have a try! Thanks for your time.

Regards
Aaron
------------------------------
Aaron.Kuai

From: Evgenii Zhuravlev <[email protected]>
Date: 2018-01-19 16:28
To: user <[email protected]>
Subject: Re: Re: Nodes can not join the cluster after reboot

Aaron,

After creation, this Service instance will be serialized and deserialized on the target nodes. So the Logger field will be serialized too, and I don't think it can be properly serialized with the possibility of deserialization, since it holds context internally. It's not recommended to use such fields in a Service; you should use

@LoggerResource
private IgniteLogger log;

instead. I'm not sure if it's the root cause, but it definitely could cause some problems.
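For illustration, a minimal self-contained sketch of a service built this way; the EchoService class is hypothetical, not code from this thread:

    import org.apache.ignite.IgniteLogger;
    import org.apache.ignite.resources.LoggerResource;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    public class EchoService implements Service {
        // Injected by Ignite on the target node after deserialization,
        // so nothing logger-related travels with the serialized service.
        @LoggerResource
        private IgniteLogger log;

        @Override public void init(ServiceContext ctx) {
            log.info("Service initialized: " + ctx.name());
        }

        @Override public void execute(ServiceContext ctx) {
            // No-op for this sketch.
        }

        @Override public void cancel(ServiceContext ctx) {
            log.info("Service cancelled: " + ctx.name());
        }
    }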
Evgenii

2018-01-19 4:51 GMT+03:00 [email protected] <[email protected]>:

Hi Evgenii,

Sure, thanks for your time! This service works as a delegate: all requests are routed to a bean in our Spring context.

Thanks again!
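For illustration, a minimal sketch of what such a delegate might look like; the interface, bean name, and class names below are hypothetical, not the actual code, and the Spring context injection assumes the node is started with a Spring application context (e.g. via ignite-spring):

    import org.apache.ignite.resources.SpringApplicationContextResource;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;
    import org.springframework.context.ApplicationContext;

    interface CommandApi {
        String handle(String command);
    }

    public class DelegatingCommandService implements Service, CommandApi {
        // Injected on the target node after deserialization; transient so
        // nothing Spring-related is serialized with the service.
        @SpringApplicationContextResource
        private transient ApplicationContext springCtx;

        private transient CommandApi delegate;

        @Override public void init(ServiceContext ctx) {
            // Hypothetical bean name: resolve the real handler from Spring.
            delegate = springCtx.getBean("commandHandler", CommandApi.class);
        }

        @Override public String handle(String command) {
            // Route every request to the Spring bean.
            return delegate.handle(command);
        }

        @Override public void execute(ServiceContext ctx) { /* no-op for a proxy-style service */ }
        @Override public void cancel(ServiceContext ctx) { /* no-op */ }
    }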
Regards
Aaron
------------------------------
Aaron.Kuai

From: Evgenii Zhuravlev <[email protected]>
Date: 2018-01-18 21:59
To: user <[email protected]>
Subject: Re: Re: Nodes can not join the cluster after reboot

Aaron, could you share the code of com.tophold.trade.ignite.service.CommandRemoteService?

Thanks,
Evgenii

2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <[email protected]>:

Hi Aaron,

I think that the main problem is here:

GridServiceProcessor - Error when executing service: null

diagnostic - Pending transactions:
[WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%] [ig] diagnostic - >>> [txVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], exchWait=true, tx=GridDhtTxRemote [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1, nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313, nodeOrder=1], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter [explicitVers=null, started=true, commitAllowed=0, txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE, val=CacheObjectImpl [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15, cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService, svcCls=, nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]], prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=[], filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[], part=72, super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], val=CacheObjectImpl [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13, cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService, svcCls=, nodeFilterCls=CommandServiceNodeFilter], assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true], startVer=1516183996434, ver=GridCacheVersion [topVer=127663998, order=1516184119343, nodeOrder=10], hash=-1440463172, extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null, rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], reentry=null, otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null, mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true], masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0, prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false, nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]], super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion [topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false, loc=false, threadId=585, startTime=1516186483489, nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion [topVer=127664000, order=1516193739547, nodeOrder=5], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0, sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE, invalidParts=null, state=PREPARED, timedOut=false, topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], duration=36138ms, onePhaseCommit=false]]]]

You have a pending transaction in the logs related to the service deployment. Most likely your service threw an NPE in the init() (or another) method and wasn't deployed. Could you check whether it's possible for your service to throw an NPE?
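For illustration, a hypothetical init() that guards against this failure mode; the 'handler' dependency is made up. The point of the sketch is that a null dependency would otherwise surface only as an opaque NPE (note that "Error when executing service: null" is consistent with an exception whose message is null), so checking and logging explicitly makes the deployment failure diagnosable:

    import org.apache.ignite.IgniteLogger;
    import org.apache.ignite.resources.LoggerResource;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    public class GuardedInitService implements Service {
        @LoggerResource
        private IgniteLogger log;

        // Hypothetical dependency that might not be resolved on the target node.
        private transient Runnable handler;

        @Override public void init(ServiceContext ctx) throws Exception {
            // Fail with a clear, logged message instead of an opaque NPE.
            if (handler == null) {
                log.error("Dependency 'handler' not resolved; service " + ctx.name() + " cannot start.");
                throw new IllegalStateException("handler == null in " + ctx.name());
            }
        }

        @Override public void execute(ServiceContext ctx) { /* no-op sketch */ }
        @Override public void cancel(ServiceContext ctx) { /* no-op */ }
    }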
Evgenii

2018-01-17 15:40 GMT+03:00 [email protected] <[email protected]>:

Hi Evgenii,

What's more interesting: if we reboot them within a very short time, say one hour, in our monitoring log we can find the expected NODE_LEFT and NODE_JOIN events, and everything moves smoothly.

But after several hours, the problem below is sure to happen if you try to reboot any node in the cluster.

Regards
Aaron
------------------------------
Aaron.Kuai

From: [email protected]
Date: 2018-01-17 20:05
To: user <[email protected]>
Subject: Re: Re: Nodes can not join the cluster after reboot

Hi Evgenii,

Thanks! We collected some logs: one from the server that was rebooted, two from existing servers, and one from a client-only node. After the reboot:

1. The rebooted node never comes up completely; it waits forever.
2. The other server nodes, after being notified that the rebooted node went down, soon hang as well.
3. The pure client node, which only calls a remote service on the rebooted node, also hangs.

At around 2018-01-17 10:54 we rebooted the node. In the log we can find:

[WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDiscoveryEvents - All server nodes for the following caches have left the cluster: 'PortfolioCommandService_SVC_CO_DUM_CACHE', 'PortfolioSnapshotGenericDomainEventEntry', 'PortfolioGenericDomainEventEntry'

Soon after, an ERROR log (seemingly the only ERROR-level entry):

[ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] GridServiceProcessor - Error when executing service: null
java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated

Then many WARNs of

"Failed to wait for partition release future........................."

Then it loops there forever. From the diagnostics nothing seems suspicious; all nodes eventually output something very similar:

[WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] diagnostic - Pending explicit locks:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending atomic cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending data streamer futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending transaction deadlock detection futures:

Some notes on our environment:

1. We enabled the peer class loading flag, but in fact we use a fat jar, so every class is shared.
2. Some nodes deploy services, which we use as an RPC mechanism.
3. Most caches are in fact LOCAL; we make them shared only when we must.
4. We use JDBC to persist the important caches.
5. TcpDiscoveryJdbcIpFinder is used as the IP finder (a configuration sketch follows below).

All other configuration is standard.
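For reference, a minimal sketch of a node configuration along those lines; the factory class and the externally supplied data source are hypothetical, not the actual setup:

    import javax.sql.DataSource;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.jdbc.TcpDiscoveryJdbcIpFinder;

    public class NodeConfigFactory {
        public static IgniteConfiguration create(DataSource ds) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Item 1: peer class loading (harmless with a fat jar, since all
            // classes are already present on every node).
            cfg.setPeerClassLoadingEnabled(true);

            // Item 5: nodes discover each other through addresses stored in a
            // shared database table.
            TcpDiscoveryJdbcIpFinder ipFinder = new TcpDiscoveryJdbcIpFinder();
            ipFinder.setDataSource(ds); // Hypothetical externally managed pool.

            TcpDiscoverySpi disco = new TcpDiscoverySpi();
            disco.setIpFinder(ipFinder);
            cfg.setDiscoverySpi(disco);

            return cfg;
        }
    }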
Thanks for your time!

Regards
Aaron
------------------------------
Aaron.Kuai

From: Evgenii Zhuravlev <[email protected]>
Date: 2018-01-16 20:32
To: user <[email protected]>
Subject: Re: Nodes can not join the cluster after reboot

Hi,

Most likely, on one of the nodes you have a hung transaction/future/lock, or even a deadlock; that's why new nodes can't join the cluster: they can't perform the exchange due to a pending operation. Please share the full logs from all nodes with thread dumps; it will help to find the root cause.

Evgenii

2018-01-16 5:35 GMT+03:00 [email protected] <[email protected]>:

Hi All,

We have an Ignite cluster running about 20+ nodes. To guard against any JVM memory issues, we schedule a reboot of those nodes in the middle of the night.

But in order to keep the service available, we reboot them one by one, e.g. nodes A, B, C, D with a 5-minute delay between them; when we do so, the rebooted nodes can never join the cluster again.

Eventually the entire cluster stops working, forever waiting for nodes to join the topology; we have to kill everything and restart from scratch, which seems incredible.

I'm not sure whether anyone has met this issue before, or whether there is some mistake we have made; attached is the Ignite log.

Thanks for your time!

Regards
Aaron
------------------------------
Aaron.Kuai
