2018-05-14 13:29:22 UTC - Byron: Good morning folks. This is my first time running Bookkeeper.. and my 3-node test cluster ran out of space on two of the nodes (i believe the ledgers directory). So the two nodes are failing to start up as a result which makes the cluster basically inaccessible. I am curious how one can recover from this situation if I am unable to increase the volume size for those two nodes? ---- 2018-05-14 14:16:30 UTC - Ivan Kelly: they aren't starting in readonly mode? ---- 2018-05-14 14:17:23 UTC - Byron: Does not appear to be.. ---- 2018-05-14 14:17:29 UTC - Byron: @Byron uploaded a file: <https://apache-pulsar.slack.com/files/UACD54WB1/FAQCADZQX/-.txt|Untitled> ---- 2018-05-14 14:18:41 UTC - Ivan Kelly: it's the journal directory that's full ---- 2018-05-14 14:18:53 UTC - Byron: correct ---- 2018-05-14 14:18:56 UTC - Ivan Kelly: could you upload the whole log somewhere? ---- 2018-05-14 14:19:09 UTC - Byron: error log? ---- 2018-05-14 14:19:15 UTC - Ivan Kelly: bookie.log ---- 2018-05-14 14:20:06 UTC - Ivan Kelly: this looks similar to something else we saw recently, and i recall the root cause was logged earlier in the log ---- 2018-05-14 14:20:55 UTC - Byron: hm. ok i am running in kubernetes.. with a readwriteonly volume. i will see if i can remount the volume to get the log ---- 2018-05-14 14:22:13 UTC - Ivan Kelly: how did you get that snippet? I'm not overly familiar with k8s ---- 2018-05-14 14:22:26 UTC - Byron: that is the stderr log ---- 2018-05-14 14:22:51 UTC - Byron: it's possible bookie.log is redirected there? 
---- 2018-05-14 14:23:36 UTC - Byron: @Byron uploaded a file: <https://apache-pulsar.slack.com/files/UACD54WB1/FAPB1AHLJ/-.sh|Untitled> ---- 2018-05-14 14:23:50 UTC - Byron: that is the full start to end output ---- 2018-05-14 14:24:55 UTC - Byron: the bookie is being started from the `apachepulsar/pulsar` docker image in case that is relevant ---- 2018-05-14 14:27:46 UTC - Byron: i see `readOnlyModeEnabled=true` in the default bookkeeper.conf file. maybe something weird is happening with the env variables overriding the config ---- 2018-05-14 14:29:41 UTC - Ivan Kelly: looking ---- 2018-05-14 14:33:02 UTC - Ivan Kelly: what version of pulsar is this? ---- 2018-05-14 14:34:29 UTC - Ivan Kelly: <https://github.com/apache/bookkeeper/issues/1349> <- yup, there's an outstanding issue for this in bookkeeper. i guess you don't have access to the disk in question? ---- 2018-05-14 16:02:56 UTC - Byron: 1.22 ---- 2018-05-14 16:11:12 UTC - Byron: i see bookkeeper on this image is 4.3.1.91 ---- 2018-05-14 16:14:06 UTC - Matteo Merli: @Byron the read-only mode only applies to the “storage” device. When that disk is full (actually, when it reaches 95%) the bookie turns itself into read-only mode.
For the journal device, unfortunately, there’s currently no such check. The main reason is that typically the storage amount on the journal device is fixed (~10GB) and doesn’t grow above that. ---- 2018-05-14 16:14:49 UTC - Matteo Merli: In your case, do you have both directories on the same disk? ---- 2018-05-14 16:15:43 UTC - Byron: I have a separate ledgers and journal volume ---- 2018-05-14 16:16:07 UTC - Matteo Merli: Good, and how big is the journal volume? ---- 2018-05-14 16:16:13 UTC - Matteo Merli: One thing to note is that by default bookkeeper keeps the last 5 journals, even though all the data was already flushed and indexed ---- 2018-05-14 16:16:52 UTC - Byron: only 5 Gi for journal and 10 Gi for ledgers per node (3) ---- 2018-05-14 16:16:58 UTC - Matteo Merli: that can be configured `journalMaxBackups=5` ---- 2018-05-14 16:17:20 UTC - Matteo Merli: ok, if you set `journalMaxBackups=0` that 5GB should not get filled up ---- 2018-05-14 16:17:23 UTC - Byron: again this is a test instance.. but i am more interested in figuring how to deal with these issues now before going to production ---- 2018-05-14 16:17:39 UTC - Byron: ok ---- 2018-05-14 16:17:44 UTC - Matteo Merli: to get out of the woods: you can delete a few of the old journal files ---- 2018-05-14 16:17:55 UTC - Byron: and then add more bookies presumably? ---- 2018-05-14 16:18:05 UTC - Byron: to distribute the data? 
---- 2018-05-14 16:18:34 UTC - Matteo Merli: you don’t necessarily need more bookies ---- 2018-05-14 16:19:03 UTC - Byron: i am just saying if i wanted to support more storage in the future ---- 2018-05-14 16:19:04 UTC - Matteo Merli: if you change the setting to `journalMaxBackups=0` and restart, the bookies should be fine ---- 2018-05-14 16:19:15 UTC - Byron: not to fix the current problem ---- 2018-05-14 16:19:27 UTC - Matteo Merli: oh, then sure ---- 2018-05-14 16:19:44 UTC - Byron: alright going to set that config and restart the pods ---- 2018-05-14 16:22:56 UTC - Byron: hm still failing to start up due to the out of space error ---- 2018-05-14 16:23:16 UTC - Byron: i wonder if it is trying to do writes before checking that option and purging old data ---- 2018-05-14 16:23:17 UTC - Matteo Merli: yes, you need to delete some of the old journals ---- 2018-05-14 16:23:55 UTC - Matteo Merli: with the previous config, it was trying to keep up to 5 journal files (each is 2GB) ---- 2018-05-14 16:24:13 UTC - Byron: ok, so changing the config will not autopurge existing ones ---- 2018-05-14 16:24:14 UTC - Matteo Merli: that data is already flushed, so there’s no risk ---- 2018-05-14 16:24:37 UTC - Matteo Merli: > ok, so changing the file will not autopurge existing ones I think that only works once it’s up :slightly_smiling_face: ---- 2018-05-14 16:24:41 UTC - Byron: right ---- 2018-05-14 16:25:39 UTC - Byron: hm. i guess this is a challenge with persistent volumes.. how to access them outside of main pod. i guess i can attach them to a different temp pod, delete the backups then spin up the existing pods ---- 2018-05-14 16:26:55 UTC - Matteo Merli: ouch, good point. you can try to change the spec to add a sleep 300 before the actual command ---- 2018-05-14 16:27:33 UTC - Byron: is there a bookie shell command to run? 
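The sizing problem described above comes down to simple arithmetic. The following is a minimal sketch, assuming the 2GB per-file figure corresponds to a maximum journal file size of 2048 MB and that retained backups are the only thing being counted. The function names are hypothetical and not part of BookKeeper.

```python
def journal_backup_footprint_mb(max_backups: int, journal_max_size_mb: int = 2048) -> int:
    """Worst-case space taken by retained (already flushed) journal files."""
    return max_backups * journal_max_size_mb


def backups_fit(volume_mb: int, max_backups: int, journal_max_size_mb: int = 2048) -> bool:
    """True if the retained journals alone fit on the journal volume."""
    return journal_backup_footprint_mb(max_backups, journal_max_size_mb) <= volume_mb


# Numbers from the conversation: a 5 Gi journal volume and
# 5 retained journals of 2GB each (ASSUMED to mean 2048 MB).
volume_mb = 5 * 1024
print(journal_backup_footprint_mb(5))  # 10240
print(backups_fit(volume_mb, 5))       # False
print(backups_fit(volume_mb, 0))       # True
```

The estimate ignores the journal currently being written, so a real volume would need some headroom on top of it.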
---- 2018-05-14 16:27:45 UTC - Byron: i can change the pod command to use that instead of starting the bookie server ---- 2018-05-14 16:27:52 UTC - Byron: as a one-off ---- 2018-05-14 16:28:05 UTC - Byron: rather.. container command ---- 2018-05-14 16:30:28 UTC - Byron: or can i just delete the files in the `journal/` directory ---- 2018-05-14 16:31:03 UTC - Matteo Merli: delete the files, you can just delete the oldest `1323213.txn` file ---- 2018-05-14 16:31:23 UTC - Matteo Merli: since that’s just a backup file ---- 2018-05-14 16:42:13 UTC - Byron: hm. now i am getting an exception that the journal file is missing and it can’t recover ---- 2018-05-14 16:42:21 UTC - Byron: @Byron uploaded a file: <https://apache-pulsar.slack.com/files/UACD54WB1/FAQB79DLN/-.txt|Untitled> ---- 2018-05-14 16:42:30 UTC - Byron: but the one-off command worked at least ---- 2018-05-14 16:42:37 UTC - Matteo Merli: Uhm, how many files were there? ---- 2018-05-14 16:43:23 UTC - Byron: there were 4 or 5 `.txn` files ---- 2018-05-14 16:43:40 UTC - Byron: i deleted all but the most recent. probably a bad idea. i assumed they were independent ---- 2018-05-14 16:44:06 UTC - Matteo Merli: probably the last 2 were the ones still used ---- 2018-05-14 16:45:13 UTC - Matteo Merli: it’s based on a marker file `lastMark` in the storage directory ---- 2018-05-14 16:46:10 UTC - Byron: ah and there is a `shell lastmark` command ---- 2018-05-14 16:46:30 UTC - Matteo Merli: yes, I forgot about that one ---- 2018-05-14 16:46:49 UTC - Byron: ok good to know for the future. but the data is clearly borked now. is there a way to reset a bookie? ---- 2018-05-14 16:47:14 UTC - Byron: bookieformat ---- 2018-05-14 16:47:15 UTC - Byron: ? ---- 2018-05-14 16:47:21 UTC - Matteo Merli: `bin/bookkeeper shell bookieformat -deleteCookie` ---- 2018-05-14 16:47:25 UTC - Byron: cool ---- 2018-05-14 16:47:26 UTC - Byron: thanks ---- 2018-05-14 16:49:26 UTC - Matteo Merli: No problem. 
In any case I think we should refuse to start the bookie at the very beginning, if the disk size is < (2GB * 5) ---- 2018-05-14 16:58:57 UTC - Byron: back in business ----
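The failed recovery earlier in the thread came from deleting journals that were still needed, so a dry-run listing of deletion candidates may be useful before touching any `.txn` file. The following is a rough sketch, not BookKeeper's own logic. It assumes that journal file names are hexadecimal ids, that the operator supplies the journal id the `lastMark` marker points at (for example from the `shell lastmark` output), and that only journals strictly older than that id are safe to remove. The helper names are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def journal_id(path: Path) -> Optional[int]:
    """Parse the id from a journal file name (ASSUMPTION: hexadecimal stem)."""
    try:
        return int(path.stem, 16)
    except ValueError:
        return None  # not a journal-style name, skip it


def deletable_journals(journal_dir: str, mark_log_id: int) -> List[Path]:
    """List `.txn` files strictly older than the journal the mark points at.

    ASSUMPTION: journals below the mark are fully flushed; the marked
    journal and anything newer may still be needed for replay.
    Nothing is deleted here, the caller only gets a list of candidates.
    """
    candidates = []
    for path in Path(journal_dir).glob("*.txn"):
        jid = journal_id(path)
        if jid is not None and jid < mark_log_id:
            candidates.append(path)
    return sorted(candidates, key=lambda p: int(p.stem, 16))


if __name__ == "__main__":
    import tempfile

    # Demo on a throwaway directory with made-up journal names.
    with tempfile.TemporaryDirectory() as d:
        for name in ("163e0a1.txn", "163e0b2.txn", "163e0c3.txn"):
            (Path(d) / name).touch()
        # Pretend the mark points at journal 0x163e0b2.
        for p in deletable_journals(d, 0x163E0B2):
            print("candidate for deletion:", p.name)
```

Keeping the function read-only is deliberate: the operator reviews the list and deletes by hand, which would have avoided removing "all but the most recent" file.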
