Thanks for the write up! That's great that you were able to get things back up and running. I was following your conversation in the Slack channel. Hopefully, this will help others if they run into something similar.
Also, just wanted to mention, since you said you were running 1.7.0: 1.7.0 is subject to CVE-2020-17533, as well as lots of other bugs. At the very least, you should be able to upgrade to the latest 1.7 release (1.7.4) as a drop-in replacement, which will fix at least a few critical bugs, including at least one involving potential data loss. Ideally, though, you should try to upgrade to 1.10.2, which is the latest (and only) still-maintained 1.x version.

On Sat, Feb 26, 2022 at 11:25 AM James Srinivasan <james.sriniva...@gmail.com> wrote:
>
> For the benefit of Google and/or future me, and with huge thanks to Ed
> Coleman, here's a quick summary of an issue we hit with Accumulo 1.7.0 and
> the fix. Details are in Slack, but with a few red herrings (thanks to me).
> Some of this is fat-fingered, so apologies for any typos.
>
> We recently needed to bounce our moderately sized (19 node) cluster (log4j
> fixes on other software), but Accumulo failed to restart. Four of the nodes
> had been down for some time (root cause unknown).
>
> Symptoms
>
> 1) The Accumulo monitor showed the list of tables but "-" against every entry
>
> 2) The Accumulo files looked OK in HDFS
>
> 3) scan -t accumulo.root (with debug on) in the Accumulo shell gave "Failed
> to locate tablet for table : +r row :"
>
> 4) There were some ZooKeeper warnings in some logs (I forget precisely
> which), but they weren't hugely informative: ConnectionLoss for
> /accumulo/{uuid}/root_tablet/walogs. This turns out to be critical, but I
> didn't realise it at the time.
> 5) The ZooKeeper nodes showed that a tserver should host the root tablet
> (/accumulo/{id}/root_tablet/location), but that tserver did not hold a lock
> for the root tablet
> (/accumulo/{id}/tservers/mytservername.domain:9997/zlock-00000000)
>
> 6) Using the ZooKeeper CLI, ls /accumulo/{id}/root_tablet/walogs bombed out
> with the familiar-looking ConnectionLoss, although with some more helpful
> info: "Packet len is out of range"
>
> Cause
>
> ZooKeeper clients (the CLI or an Accumulo tserver) fail to list znodes with
> large numbers of children due to insufficient buffer space. See the docs on
> jute.maxbuffer here:
> https://zookeeper.apache.org/doc/r3.7.0/zookeeperAdmin.html#Unsafe+Options
>
> Quite why there were so many children of the walogs node is unknown, but it
> may have been due to the four inactive tservers.
>
> Fix
>
> Setting "-Djute.maxbuffer=big_value" for all Accumulo processes seemed to
> fix things. For me, big_value was around 8000000 (i.e. 8 MB). Accumulo came
> back slowly, found all its data files, and then the number of children of
> the ZK walogs node dropped substantially.
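In case it helps anyone else who lands here, a rough sketch of the jute.maxbuffer workaround described above. The hostname, instance id, and file paths are placeholders, and the exact env var hooks (JVMFLAGS for the ZK CLI, ACCUMULO_GENERAL_OPTS in accumulo-env.sh for 1.x processes) may differ by install, so treat this as a starting point rather than exact config:

```shell
# Sketch only -- zkhost, {id}, and paths are placeholders for your cluster.
# The jute.maxbuffer default is ~1 MB (0xfffff), which a huge walogs
# listing can exceed, producing the ConnectionLoss / "Packet len is out
# of range" symptoms.

# 1) Raise the buffer for the ZooKeeper CLI so the listing succeeds.
#    zkCli.sh picks up JVMFLAGS via zkEnv.sh in the ZooKeeper install.
JVMFLAGS="-Djute.maxbuffer=8000000" zkCli.sh -server zkhost:2181 \
  ls /accumulo/{id}/root_tablet/walogs

# 2) Raise it for every Accumulo process, e.g. in conf/accumulo-env.sh
#    (ACCUMULO_GENERAL_OPTS is appended to the JVM options of 1.x daemons),
#    then restart the Accumulo processes:
export ACCUMULO_GENERAL_OPTS="${ACCUMULO_GENERAL_OPTS} -Djute.maxbuffer=8000000"
```

Remember this needs to be set on the client side as well as any tool that talks to ZooKeeper; once recovery completes and the walogs children are cleaned up, the override should no longer be needed.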