Yeah, I restarted the server nodes. But I guess the client didn't reconnect... Hmm...
On Tue., Jul. 7, 2020, 5:52 p.m. Evgenii Zhuravlev, <[email protected]> wrote:

> John,
>
> Unfortunately, I didn't find messages about lost partitions for this cache; there is a chance that it happened before. What Partition Loss policy do you have?
>
> The logs say that there is a problem with partition distribution:
> "Local node affinity assignment distribution is not ideal [cache=cache1, expectedPrimary=512.00, actualPrimary=493, expectedBackups=512.00, actualBackups=171, warningThreshold=50.00%]"
> How do you restart nodes? Do you wait until rebalancing has completed?
>
> Evgenii
>
> On Fri, 3 Jul 2020 at 09:03, John Smith <[email protected]> wrote:
>
>> Hi Evgenii, did you have a chance to look at the latest logs?
>>
>> On Thu, 25 Jun 2020 at 11:32, John Smith <[email protected]> wrote:
>>
>>> Ok
>>>
>>> stdout.copy.zip
>>>
>>> https://www.dropbox.com/sh/ejcddp2gcml8qz2/AAD_VfUecE0hSNZX7wGbfDh3a?dl=0
>>>
>>> On Thu, 25 Jun 2020 at 11:01, John Smith <[email protected]> wrote:
>>>
>>>> Because in between it's all the business logs. Let me make sure I didn't filter anything relevant. So maybe in those 13 hours nothing happened?
>>>>
>>>> On Thu, 25 Jun 2020 at 10:53, Evgenii Zhuravlev <[email protected]> wrote:
>>>>
>>>>> This doesn't seem to be a full log.
>>>>> There is a gap of more than 13 hours in the log:
>>>>> {"appTimestamp":"2020-06-23T23:06:41.658+00:00","threadName":"ignite-update-notifier-timer","level":"WARN","loggerName":"org.apache.ignite.internal.processors.cluster.GridUpdateNotifier","message":"New version is available at ignite.apache.org: 2.8.1"}
>>>>> {"appTimestamp":"2020-06-24T12:58:42.294+00:00","threadName":"disco-event-worker-#35%xxxxxx%","level":"INFO","loggerName":"org.apache.ignite.internal.managers.discovery.GridDiscoveryManager","message":"Node left topology: TcpDiscoveryNode [id=02949ae0-4eea-4dc9-8aed-b3f50e8d7238, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, xxx.xxx.xxx.73], sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, xxxxxx-task-0003/xxx.xxx.xxx.73:0], discPort=0, order=1258, intOrder=632, lastExchangeTime=1592890182021, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=true]"}
>>>>>
>>>>> I don't see any exceptions in the log. When did the issue happen? Can you share the full log?
>>>>>
>>>>> Evgenii
>>>>>
>>>>> On Thu, 25 Jun 2020 at 07:36, John Smith <[email protected]> wrote:
>>>>>
>>>>>> Hi Evgenii, same folder shared stdout.copy
>>>>>>
>>>>>> Just in case:
>>>>>> https://www.dropbox.com/sh/ejcddp2gcml8qz2/AAD_VfUecE0hSNZX7wGbfDh3a?dl=0
>>>>>>
>>>>>> On Wed, 24 Jun 2020 at 21:23, Evgenii Zhuravlev <[email protected]> wrote:
>>>>>>
>>>>>>> No, it's not. It's not clear when it happened or what state the cluster and the client node itself were in at that moment.
>>>>>>>
>>>>>>> Evgenii
>>>>>>>
>>>>>>> On Wed, 24 Jun 2020 at 18:16, John Smith <[email protected]> wrote:
>>>>>>>
>>>>>>>> Ok I'll try... The stack trace isn't enough?
>>>>>>>>
>>>>>>>> On Wed., Jun. 24, 2020, 4:30 p.m. Evgenii Zhuravlev, <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> John, right, didn't notice them before. Can you share the full log for the client node with an issue?
>>>>>>>>>
>>>>>>>>> Evgenii
>>>>>>>>>
>>>>>>>>> On Wed, 24 Jun 2020 at 12:29, John Smith <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I thought I did! The link doesn't have them?
>>>>>>>>>>
>>>>>>>>>> On Wed., Jun. 24, 2020, 2:43 p.m. Evgenii Zhuravlev, <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Can you share full log files from server nodes?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 24 Jun 2020 at 10:47, John Smith <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The logs for server are here:
>>>>>>>>>>>> https://www.dropbox.com/sh/ejcddp2gcml8qz2/AAD_VfUecE0hSNZX7wGbfDh3a?dl=0
>>>>>>>>>>>>
>>>>>>>>>>>> The error from the client:
>>>>>>>>>>>>
>>>>>>>>>>>> javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=cache1, part=580, key=UserKeyCacheObjectImpl [part=580, val=14385045508, hasValBytes=false]]
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1337)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheFutureImpl.convertException(IgniteCacheFutureImpl.java:62)
>>>>>>>>>>>> at org.apache.ignite.internal.util.future.IgniteFutureImpl.get(IgniteFutureImpl.java:137)
>>>>>>>>>>>> at com.xxxxxx.common.vertx.ext.data.impl.IgniteCacheRepository.lambda$executeAsync$d94e711a$1(IgniteCacheRepository.java:55)
>>>>>>>>>>>> at org.apache.ignite.internal.util.future.AsyncFutureListener$1.run(AsyncFutureListener.java:53)
>>>>>>>>>>>> at com.xxxxxx.common.vertx.ext.data.impl.VertxIgniteExecutorAdapter.lambda$execute$0(VertxIgniteExecutorAdapter.java:18)
>>>>>>>>>>>> at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:369)
>>>>>>>>>>>> at io.vertx.core.impl.WorkerContext.lambda$wrapTask$0(WorkerContext.java:35)
>>>>>>>>>>>> at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76)
>>>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>>>>> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:748)
>>>>>>>>>>>> Caused by: org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=cache1, part=580, key=UserKeyCacheObjectImpl [part=580, val=14385045508, hasValBytes=false]]
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validatePartitionOperation(GridDhtTopologyFutureAdapter.java:169)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:116)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.init(GridPartitionedSingleGetFuture.java:208)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.getAsync0(GridDhtAtomicCache.java:1428)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$1600(GridDhtAtomicCache.java:135)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$16.apply(GridDhtAtomicCache.java:474)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$16.apply(GridDhtAtomicCache.java:472)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.asyncOp(GridDhtAtomicCache.java:761)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.getAsync(GridDhtAtomicCache.java:472)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.GridCacheAdapter.getAsync(GridCacheAdapter.java:4749)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.GridCacheAdapter.getAsync(GridCacheAdapter.java:1477)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.getAsync(IgniteCacheProxyImpl.java:937)
>>>>>>>>>>>> at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.getAsync(GatewayProtectedCacheProxy.java:652)
>>>>>>>>>>>> at com.xxxxxx.common.vertx.ext.data.impl.IgniteCacheRepository.lambda$get$1(IgniteCacheRepository.java:28)
>>>>>>>>>>>> at com.xxxxxx.common.vertx.ext.data.impl.IgniteCacheRepository.executeAsync(IgniteCacheRepository.java:51)
>>>>>>>>>>>> at com.xxxxxx.common.vertx.ext.data.impl.IgniteCacheRepository.get(IgniteCacheRepository.java:28)
>>>>>>>>>>>> at com.xxxxxx.impl.CarrierCodeServiceImpl.getCarrierIdOfPhone(CarrierCodeServiceImpl.java:65)
>>>>>>>>>>>> at com.xxxxxx.impl.SmppGatewayServiceImpl.sendSms(SmppGatewayServiceImpl.java:39)
>>>>>>>>>>>> at com.xxxxxx.impl.MtEventProcessor.process(MtEventProcessor.java:46)
>>>>>>>>>>>> at com.xxxxxx.common.vertx.ext.kafka.impl.KafkaProcessorImpl.lambda$null$4(KafkaProcessorImpl.java:83)
>>>>>>>>>>>> at io.reactivex.internal.operators.completable.CompletableCreate.subscribeActual(CompletableCreate.java:39)
>>>>>>>>>>>> at io.reactivex.Completable.subscribe(Completable.java:2309)
>>>>>>>>>>>> at io.reactivex.internal.operators.completable.CompletableTimeout.subscribeActual(CompletableTimeout.java:53)
>>>>>>>>>>>> at io.reactivex.Completable.subscribe(Completable.java:2309)
>>>>>>>>>>>> at io.reactivex.internal.operators.completable.CompletablePeek.subscribeActual(CompletablePeek.java:51)
>>>>>>>>>>>> at io.reactivex.Completable.subscribe(Completable.java:2309)
>>>>>>>>>>>> at io.reactivex.internal.operators.completable.CompletableResumeNext.subscribeActual(CompletableResumeNext.java:41)
>>>>>>>>>>>> at io.reactivex.Completable.subscribe(Completable.java:2309)
>>>>>>>>>>>> at io.reactivex.internal.operators.completable.CompletableToFlowable.subscribeActual(CompletableToFlowable.java:32)
>>>>>>>>>>>> at io.reactivex.Flowable.subscribe(Flowable.java:14918)
>>>>>>>>>>>> at io.reactivex.Flowable.subscribe(Flowable.java:14865)
>>>>>>>>>>>> at io.reactivex.internal.operators.flowable.FlowableFlatMap$MergeSubscriber.onNext(FlowableFlatMap.java:163)
>>>>>>>>>>>> at io.reactivex.internal.operators.flowable.FlowableFromIterable$IteratorSubscription.slowPath(FlowableFromIterable.java:236)
>>>>>>>>>>>> at io.reactivex.internal.operators.flowable.FlowableFromIterable$BaseRangeSubscription.request(FlowableFromIterable.java:124)
>>>>>>>>>>>> at io.reactivex.internal.operators.flowable.FlowableFlatMap$MergeSubscriber.drainLoop(FlowableFlatMap.java:546)
>>>>>>>>>>>> at io.reactivex.internal.operators.flowable.FlowableFlatMap$MergeSubscriber.drain(FlowableFlatMap.java:366)
>>>>>>>>>>>> at io.reactivex.internal.operators.flowable.FlowableFlatMap$InnerSubscriber.onComplete(FlowableFlatMap.java:678)
>>>>>>>>>>>> at io.reactivex.internal.observers.SubscriberCompletableObserver.onComplete(SubscriberCompletableObserver.java:33)
>>>>>>>>>>>> at io.reactivex.internal.operators.completable.CompletableResumeNext$ResumeNextObserver.onComplete(CompletableResumeNext.java:68)
>>>>>>>>>>>> at io.reactivex.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onComplete(CompletablePeek.java:115)
>>>>>>>>>>>> at io.reactivex.internal.operators.completable.CompletableTimeout$TimeOutObserver.onComplete(CompletableTimeout.java:87)
>>>>>>>>>>>> at io.reactivex.internal.operators.completable.CompletableCreate$Emitter.onComplete(CompletableCreate.java:64)
>>>>>>>>>>>> at com.xxxxxx.common.vertx.ext.kafka.impl.KafkaProcessorImpl.lambda$null$3(KafkaProcessorImpl.java:86)
>>>>>>>>>>>> at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105)
>>>>>>>>>>>> at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150)
>>>>>>>>>>>> at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:157)
>>>>>>>>>>>> at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:118)
>>>>>>>>>>>> at com.xxxxxx.impl.MtEventProcessor.lambda$process$0(MtEventProcessor.java:83)
>>>>>>>>>>>> at io.vertx.ext.web.client.impl.HttpContext.handleDispatchResponse(HttpContext.java:310)
>>>>>>>>>>>> at io.vertx.ext.web.client.impl.HttpContext.execute(HttpContext.java:297)
>>>>>>>>>>>> at io.vertx.ext.web.client.impl.HttpContext.next(HttpContext.java:272)
>>>>>>>>>>>> at io.vertx.ext.web.client.impl.predicate.PredicateInterceptor.handle(PredicateInterceptor.java:69)
>>>>>>>>>>>> at io.vertx.ext.web.client.impl.predicate.PredicateInterceptor.handle(PredicateInterceptor.java:32)
>>>>>>>>>>>> at io.vertx.ext.web.client.impl.HttpContext.next(HttpContext.java:269)
>>>>>>>>>>>> at io.vertx.ext.web.client.impl.HttpContext.fire(HttpContext.java:279)
>>>>>>>>>>>> at io.vertx.ext.web.client.impl.HttpContext.dispatchResponse(HttpContext.java:240)
>>>>>>>>>>>> at io.vertx.ext.web.client.impl.HttpContext.lambda$null$2(HttpContext.java:370)
>>>>>>>>>>>> ... 7 common frames omitted
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 24 Jun 2020 at 13:28, John Smith <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure about the wrong configuration... All the apps work; this seems to happen every few weeks. We don't have any particularly heavy load.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just bounced the client application and the errors went away.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 24 Jun 2020 at 12:57, Evgenii Zhuravlev <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It means that there are no nodes in the cluster that hold certain partitions. So you probably have a wrong configuration, or some of the nodes left the cluster and you don't have backups in the cluster for these partitions. I believe more can be found from the logs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Evgenii
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 24 Jun 2020 at 09:52, John Smith <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also I'm assuming that the thin client wouldn't be susceptible to this error?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, 24 Jun 2020 at 12:38, John Smith <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The cluster is showing active when running control.sh
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But the client is showing "all partition owners have left the grid"
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The client node is marked as client=true so it's not a server node.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is this split brain as well?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
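
For reference, the numbers in the "affinity assignment distribution is not ideal" warning quoted in this thread can be checked by hand. The sketch below is only an illustration: it assumes the warning compares the relative deviation |1 - actual/expected| against `warningThreshold`, which is a reading of the log message and not something confirmed in the thread. The class and method names (`AffinityDeviation`, `deviationPercent`, `exceedsThreshold`) are made up for the example and are not Ignite APIs.

```java
public class AffinityDeviation {
    // ASSUMPTION: deviation is the relative distance of the actual partition
    // count from the expected one, expressed as a percentage.
    public static double deviationPercent(int actual, double expected) {
        return Math.abs(1.0 - actual / expected) * 100.0;
    }

    // True when the assumed deviation is above the warning threshold (in percent).
    public static boolean exceedsThreshold(int actual, double expected, double thresholdPercent) {
        return deviationPercent(actual, expected) > thresholdPercent;
    }

    public static void main(String[] args) {
        double threshold = 50.0; // warningThreshold=50.00% from the log line

        // Values taken from the log line quoted in the thread.
        double primary = deviationPercent(493, 512.0);
        double backups = deviationPercent(171, 512.0);

        System.out.printf("primary deviation: %.1f%% (over threshold: %b)%n",
                primary, exceedsThreshold(493, 512.0, threshold));
        System.out.printf("backup deviation:  %.1f%% (over threshold: %b)%n",
                backups, exceedsThreshold(171, 512.0, threshold));
    }
}
```

Under that assumption the primary count (493 of 512) is within a few percent of the expected value, while the backup count (171 of 512) is roughly two thirds short, which would be the part that trips the 50% threshold and fits the question about whether rebalancing had finished before the next restart.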
