Hi, As this discussion was originally started in https://issues.apache.org/jira/browse/ZOOKEEPER-2172?focusedCommentId=16389390&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16389390
How can we proceed here any further? Best regards Stefan -----Original Message----- From: Yuval Dori [mailto:[email protected]] Sent: Montag, 26. März 2018 13:14 To: [email protected] Subject: Re: IllegalStateException: instance must be started before calling this method Hi Martin, Not in a Docker. We are using the word container for one of our services. Thanks, Yuval On Sun, Mar 25, 2018 at 11:57 PM, Martin Gainty <[email protected]> wrote: > > > in docker container? > > if so how do you resolve hosts (.hosts file, kubernetes, swarm)? > > M > ________________________________ > From: Yuval Dori <[email protected]> > Sent: Sunday, March 25, 2018 5:34 AM > To: [email protected] > Cc: Yuval Dori > Subject: IllegalStateException: instance must be started before calling > this method > > Hi, > > Our customer is using ZK 3.4.3 and got IllegalStateException: instance > must be started before calling this method > > > with the following stack trace: > 2018-02-15T09:58:12.094+0100; ERROR; WSOSTSLXWIT01/MANAGER; P3424/T194; > [SPACE/LearnerHandler-/10.17.46.142:49336/LearnerHandler<ht > tp://10.17.46.142:49336/LearnerHandler>]; Unexpected exception causing > shutdown while sock still open > java.net<http://java.net>.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at org.apache.zookeeper.server.quorum.QuorumPacket. > deserialize(QuorumPacket.java:83) > at org.apache.jute.BinaryInputArchive.readRecord( > BinaryInputArchive.java:99) > at org.apache.zookeeper.server.quorum.LearnerHandler.run( > LearnerHandler.java:542) > > > This is the description: > > At first ps-main.1 is started as primary on WSOSTSLXBMS01 GSC 3. > > 24.01.2018 11:31:04,947 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/1 Space instance is Primary > > After some relocation of ps-main instances a new backup instance is > started on the same GSC. > > 24.01.2018 11:32:59,886 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T146 > SPACE/GS-LRMI Connection-pool-1-thread-6/1 Space instance [ > frqMain_container1:frqMain ] has been selected as Backup > > Later a failover test is performed by shutting down WSOSTSLXBMS02 and > WSOSTSLXMSS02. > We would expect that ps-main.1 backup will be elected as primary because > the former primary was located at WSOSTSLXBMS02 but no election happend. In > the end in the whole cluster there was no primary of ps-main.1 running. > > We found suspicious exception in the logs: > 24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/frqMain_container1:frqMain Lost leadership > [sessionId=0x36128253b820001] > 24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T98 > SPACE/Curator-Framework-0/CuratorFrameworkImpl backgroundOperationsLoop > exiting > 24.01.2018 11:32:44,990 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/1 Quiesce state set to SUSPENDED > 24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 > SPACE/GS-LRMI Connection-pool-1-thread-1/ZooKeeper Session: > 0x36128253b820001 closed > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector The leader threw an > exception > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector java.lang.IllegalStateException: > instance must be started before calling this method > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > org.apache.curator.shaded.com<http://org.apache.curator.shaded.com > >.google.common.base.Preconditions.checkState(Preconditions.java:176) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > org.apache.curator.framework.imps.CuratorFrameworkImpl. > delete(CuratorFrameworkImpl.java:359) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > org.apache.curator.framework.recipes.locks.LockInternals. > deleteOurPath(LockInternals.java:339) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > org.apache.curator.framework.recipes.locks.LockInternals. > releaseLock(LockInternals.java:123) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > org.apache.curator.framework.recipes.locks.InterProcessMutex.release( > InterProcessMutex.java:154) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > org.apache.curator.framework.recipes.leader.LeaderSelector. > doWork(LeaderSelector.java:449) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > org.apache.curator.framework.recipes.leader.LeaderSelector. > doWorkLoop(LeaderSelector.java:466) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > org.apache.curator.framework.recipes.leader.LeaderSelector. > access$100(LeaderSelector.java:65) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > org.apache.curator.framework.recipes.leader.LeaderSelector$ > 2.call(LeaderSelector.java:246) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > org.apache.curator.framework.recipes.leader.LeaderSelector$ > 2.call(LeaderSelector.java:240) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent. > FutureTask.run(FutureTask.java:266) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent. > Executors$RunnableAdapter.call(Executors.java:511) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent. > FutureTask.run(FutureTask.java:266) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent. > ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent. > ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector at > java.lang.Thread.run(Thread.java:748) > 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120 > SPACE/Curator-LeaderSelector-0/LeaderSelector > 24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T97 > SPACE/GS-LRMI Connection-pool-1-thread-1-EventThread/ClientCnxn > EventThread shut down for session: 0x36128253b820001 > 24.01.2018 11:32:44,997 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 > SPACE/GS-LRMI Connection-pool-1-thread-1/frqMain1_1 Channel is closed > [target=frqMain1_1, target url=jini://*/frqMain_ > container1_1/frqMain?groups=POLBAY&ignoreValidation=true& > schema=persistent&id=1&total_members=4,1&versioned=true& > mirror=true&cluster_schema=partitioned&locators= > WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine > connection url=NIO://WSOSTSLXBMS02:11002/pid[3756]/24795468496_3_ > 2538032931939568226_details[class com.gigaspaces.internal. > cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ > ConnectionEndpoint(frqMain_container1_1:frqMain)]] > 24.01.2018 11:32:44,998 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 > SPACE/GS-LRMI Connection-pool-1-thread-1/ps-main-mirror-service Channel > is closed [target=ps-main-mirror-service, target > url=jini://*/ps-main-mirror-service_container/ps-main- > mirror-service?schema=persistent&id=1&total_members= > 4,1&versioned=true&mirror=true&groups=POLBAY&cluster_ > schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, > target machine connection url=NIO://WSOSTSLXBMS01:11002/ > pid[3220]/25906827215_3_7196393537085178059_details[class > com.gigaspaces.internal.cluster.node.impl.router. > AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(ps-main- > mirror-service_container:ps-main-mirror-service)]] > 24.01.2018 11:32:45,767 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 > SPACE/GS-LRMI Connection-pool-1-thread-1/1 Shutdown complete > 24.01.2018 11:32:45,768 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 > SPACE/GS-LRMI Connection-pool-1-thread-1/container Container > <frqMain_container1> shutdown completed > > 24.01.2018 11:32:46,037 +01:00 WARN WSOSTSLXBMS01/GSC/3 P1280/T24 > SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Failed to > delete deployed processing unit from [C:\Program > Files\Frequentis\Frequentis LifeX Platform\gigaspaces\work\ > processing-units\ps-main_1_2008048111] > 24.01.2018 11:32:46,037 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24 > SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Stopped > > Attached thread dump. Failover happend around 24.01.2018 11:40:20,893 > +01:00 > WARN WSOSTSLXBMS01/GSC/3 P1280/T59 SPACE/GS-Watchdog/watchdog The > ServerEndPoint [WSOSTSLXMSS02/10.17.46.142:11001<http://10.17.46.142: > 11001/>] is not reachable or is reachable but with no matching invocation > in progress at the server peer, closing invalid connection with local > address [/10.17.46.136:49864<http://10.17.46.136:49864/>] [ remote > invocation of: net.jini.lookup.DiscoveryAdmin.getMemberGroups] [java.net< > http://java.net>.SocketTimeoutException] > > > Could it be ZK bug? > > Thanks, > > Yuval > > >
