Hi Jorn,

Thanks for reaching out to us. This is a very important exercise to make sure the upgrade path works as expected.

- Please do an `ls -al` in your data dir to make sure you have valid snapshot files.
- It would also be useful to expose the Admin port (8080/tcp by default) and check the output of `lastSnapshotCommand`.

Regards,
Andor

> On 2019. Aug 14., at 7:13, Jörn Franke <jornfra...@gmail.com> wrote:
>
> For me the issue occurred only in standalone mode. With the ensemble I simply cleared the data directory and it received the ZooKeeper data from the quorum.
>
>> On 13.08.2019 at 15:42, Koen De Groote <koen.degro...@limecraft.com> wrote:
>>
>> I would also like to know if this is possible.
>>
>> From going over the GitHub page, it seems there is a JMX method to force the creation of a snapshot. Yet the Docker image is configured such that a port will never be assigned to the JMX process.
>>
>> Is there any way to bypass this?
>>
>>> On Tue, Jul 30, 2019 at 8:51 AM Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>> Thanks. Is it possible to force ZooKeeper to create a snapshot? I will check; I think the snapshot count is set to 1 in the cfg.
>>>
>>>> On 30.07.2019 at 08:06, Enrico Olivelli <eolive...@gmail.com> wrote:
>>>>
>>>> On Mon, Jul 29, 2019 at 23:59, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>>> OK, then let me verify tomorrow if a snapshot file is indeed there. If it is missing, then I wonder why. There was no crash or anything like that, and 3.4.14 works without issue, but of course it could have loaded them from the log files. However, then I wonder why it does not create one.
>>>>
>>>> I remember now that some other user, I think Sijie, reported a similar problem some months ago: it is not possible to upgrade from 3.4 to 3.5 if no snapshot is present.
>>>> IIRC the fix was to force the creation of at least one snapshot file and then upgrade.
>>>>
>>>> Enrico
>>>>
>>>>> On Mon, Jul 29, 2019 at 11:45 PM Michael Han <h...@apache.org> wrote:
>>>>>
>>>>>>>> I just wonder why it does not find a valid snapshot.
>>>>>>
>>>>>> If there are local snapshot files and the files are valid, then it's a bug that the server fails to load them.
>>>>>>
>>>>>>>> Is it because the format changed in 3.5.5 compared to 3.4.14?
>>>>>>
>>>>>> Not that I am aware of. There are some format changes (added compression support) in the master branch, but that's not shipped with 3.5.5.
>>>>>>
>>>>>> On Mon, Jul 29, 2019 at 2:31 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>
>>>>>>> OK, then it affects basically all standalone nodes? This is fine, although it means some extra work (for uncritical lab environments).
>>>>>>> I am not sure it is ZOOKEEPER-2325 (but I don't know the full history behind it). The logs are fine (it works in 3.4.14 without issues, even after downgrading back). There is no issue with disk space and there are no 0-byte files. I just wonder why it does not find a valid snapshot. Is it because the format changed in 3.5.5 compared to 3.4.14?
>>>>>>>
>>>>>>> On Mon, Jul 29, 2019 at 11:25 PM Michael Han <h...@apache.org> wrote:
>>>>>>>
>>>>>>>>>> java.io.IOException: No snapshot found, but there are log entries. Something is broken!
>>>>>>>>
>>>>>>>> This is expected behavior introduced in ZOOKEEPER-2325. We don't want to end up with potentially inconsistent state across the ensemble when recovering from an empty snapshot.
>>>>>>>>
>>>>>>>> To continue the upgrade, just delete all txn log files and let the node sync the snapshot from the quorum.
>>>>>>>>
>>>>>>>> On Mon, Jul 29, 2019 at 1:38 PM Enrico Olivelli <eolive...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Mon, Jul 29, 2019 at 22:32, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> It also seems that 3.5.5 does not attempt to read all of the log files (I still have to confirm), but the two it reads exist, it has access to them, and they are much larger than 0 bytes.
>>>>>>>>>
>>>>>>>>> We should have the stack trace of the EOFException.
>>>>>>>>>
>>>>>>>>> Does anyone on this list have a better idea?
>>>>>>>>>
>>>>>>>>> Enrico
>>>>>>>>>
>>>>>>>>>> On Mon, Jul 29, 2019 at 10:13 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> (Of course I do not run them at the same time.)
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 29, 2019 at 10:10 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you for the quick reply. They read from the same disk paths and have the same access rights (in fact the RHEL service executes them as the same specific user).
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jul 29, 2019 at 10:09 PM Enrico Olivelli <eolive...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jul 29, 2019 at 21:50, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried to migrate a lab environment from ZooKeeper 3.4.14 (used for Solr) to 3.5.5 and encountered an issue. It is ZooKeeper in standalone mode (other environments have a proper ensemble). I increased jute.maxbuffer beyond the default (but not excessively); this was working perfectly fine in 3.4.14.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Basically I reuse the same config files for the migration, except that I whitelist some commands (later I am also interested in adding SSL).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I get the following error message when starting ZooKeeper with 3.5.5 (basically, I just changed the symbolic link from zookeeper to point to the 3.5.5 directory instead of the 3.4.14 directory):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2019-07-29 15:16:25,217 [myid:] - DEBUG [main:FileTxnLog$FileTxnIterator@655] - Created new input stream /zookeeper/version-2/log.b34
>>>>>>>>>>>>>> 2019-07-29 15:16:25,217 [myid:] - DEBUG [main:FileTxnLog$FileTxnIterator@658] - Created new input archive /zookeeper/version-2/log.b34
>>>>>>>>>>>>>> 2019-07-29 15:16:25,222 [myid:] - DEBUG [main:FileTxnLog$FileTxnIterator@696] - EOF exception java.io.EOFException: Failed to read /zookeeper/version-2/log.b34
>>>>>>>>>>>>>> 2019-07-29 15:16:25,223 [myid:] - DEBUG [main:FileTxnLog$FileTxnIterator@655] - Created new input stream /zookeeper/version-2/log.b72
>>>>>>>>>>>>>> 2019-07-29 15:16:25,223 [myid:] - DEBUG [main:FileTxnLog$FileTxnIterator@658] - Created new input archive /zookeeper/version-2/log.b72
>>>>>>>>>>>>>> 2019-07-29 15:16:25,224 [myid:] - DEBUG [main:FileTxnLog$FileTxnIterator@696] - EOF exception java.io.EOFException: Failed to read /zookeeper/version-2/log.b72
>>>>>>>>>>>>>> 2019-07-29 15:16:25,224 [myid:] - ERROR [main:ZooKeeperServerMain@83] - Unexpected exception, exiting abnormally
>>>>>>>>>>>>>> java.io.IOException: No snapshot found, but there are log entries. Something is broken!
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:211)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:290)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:450)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:764)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.ServerCnxnFactory.startup(ServerCnxnFactory.java:98)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:144)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:106)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:64)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:128)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Strangely enough, if I switch back to 3.4.14 the issue is resolved and ZooKeeper works normally. However, I would like to leverage the new version 3.5.5.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are no 0-byte files. Plenty of disk space is available.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you compare these logs with the logs of 3.4.x? Are they reading from the same disk paths?
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any idea beyond erasing the data dir (I would try to avoid it; I can reconstruct it, but still)? I will also try in the other environments, including one with an ensemble, but I would like to know beforehand what the issue could be.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not sure if it is relevant, but:
>>>>>>>>>>>>>> Activated Kerberos Authentication and Kerberos SSL for clients and quorum.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Quorum? In standalone mode there is no 'quorum' auth.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Enrico
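
A minimal sketch of the data-dir check suggested at the top of this thread, for anyone who prefers a script over `ls -al`. It assumes the usual on-disk naming (`snapshot.<zxid>` and `log.<zxid>` files under `version-2`, matching the `/zookeeper/version-2/log.b34` path in the log output quoted above). The function names are made up for illustration, and a non-zero file size is only a rough stand-in for a "valid" snapshot; the script does not parse the files.

```python
from pathlib import Path


def summarize_data_dir(version_dir):
    """Return (snapshots, txn_logs) found in a ZooKeeper version-2 directory.

    Each entry is a (file name, size in bytes) tuple, sorted by file name.
    """
    root = Path(version_dir)
    snapshots, txn_logs = [], []
    for entry in sorted(root.iterdir()):
        if not entry.is_file():
            continue
        item = (entry.name, entry.stat().st_size)
        if entry.name.startswith("snapshot."):
            snapshots.append(item)
        elif entry.name.startswith("log."):
            txn_logs.append(item)
    return snapshots, txn_logs


def needs_snapshot_before_upgrade(version_dir):
    """True when txn logs exist but no non-empty snapshot file does.

    Assumption: this mirrors the situation behind the
    "No snapshot found, but there are log entries" error only roughly;
    it checks file names and sizes, not file contents.
    """
    snapshots, txn_logs = summarize_data_dir(version_dir)
    usable = [name for name, size in snapshots if size > 0]
    return bool(txn_logs) and not usable
```

Running `needs_snapshot_before_upgrade("/zookeeper/version-2")` on the standalone node before switching the symbolic link to 3.5.5 should flag a data dir that has txn logs but no snapshot, which is the case where, per Enrico's note, a snapshot has to be created first.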