Hi Jayesh,
Can you put some rough numbers on the sizes involved for us? Are you seeing
exceptions in the Accumulo tserver/master logs?
One thought is that when Accumulo creates a new WAL file, it sets the
block size to 1G (as a trick to force HDFS into making some
"non-standard" guarantees for us). As a result, the WAL files can appear
to be very large even though they're essentially empty.
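If it would help to check that, here is a minimal sketch using the Hadoop
FileSystem API that prints the logical length and the requested block size of
each WAL file. It assumes the walogs live under /accumulo/wal; adjust the path
to wherever your instance keeps them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class WalSizes {
  public static void main(String[] args) throws Exception {
    // Assumed location of the walogs; adjust for your instance.
    Path walDir = new Path("/accumulo/wal");
    FileSystem fs = FileSystem.get(new Configuration());
    long totalLen = 0;
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(walDir, true);
    while (it.hasNext()) {
      LocatedFileStatus f = it.next();
      // getLen() is the logical length actually written; getBlockSize() is the
      // large block size Accumulo asked for, not the bytes consumed on disk.
      System.out.printf("%s len=%d blockSize=%d%n",
          f.getPath(), f.getLen(), f.getBlockSize());
      totalLen += f.getLen();
    }
    System.out.println("total logical WAL bytes: " + totalLen);
  }
}

Comparing the total logical bytes against what the DataNodes report as used
should tell you whether the space is genuinely WAL data or just an artifact of
the large block size.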
If your instance is in a situation where Accumulo repeatedly fails to
write to a WAL, it may decide the WAL is bad, abandon it, and try to
create a new one. If that is happening over and over, I could see it
explaining the situation you described. However, you should also see the
TabletServers complaining loudly that they cannot write to the WALs.
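If it is that abandon-and-recreate loop, a rough way to spot it is to count
WAL files per server directory; a server stuck in the loop should show far
more files than its peers. A sketch along those lines (again assuming the
/accumulo/wal location and a one-subdirectory-per-server layout, which may not
match your instance exactly):

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class WalChurn {
  public static void main(String[] args) throws Exception {
    // Assumed location and layout of the walogs; adjust for your instance.
    Path walDir = new Path("/accumulo/wal");
    FileSystem fs = FileSystem.get(new Configuration());
    Map<String, Integer> countsByServer = new HashMap<>();
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(walDir, true);
    while (it.hasNext()) {
      LocatedFileStatus f = it.next();
      // Group by the immediate parent directory, assumed here to be named
      // after the server that created the log.
      String server = f.getPath().getParent().getName();
      countsByServer.merge(server, 1, Integer::sum);
    }
    countsByServer.forEach((server, n) ->
        System.out.println(server + " -> " + n + " WAL files"));
  }
}

Either way, grepping the tserver logs for WAL-related errors around the time
of the power failure is probably the quickest way to confirm or rule this out.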
Jayesh Patel wrote:
We have a 3-node Accumulo 1.7 cluster running as VMware VMs with a minute
amount of data by Accumulo standards.
We have run into a situation multiple times now where all the nodes lose
power, and while they are trying to recover from it simultaneously, the
walog grows exponentially and fills up all the available disk space. We
have confirmed that the walog folder under /accumulo in HDFS is consuming
99% of the disk space.
We have tried freeing enough space to be able to run the Accumulo
processes in the hope that they would burn through the walog, but without
success: the walog just grew to take up the freed space.
Given that we need to better manage the power situation anyway, we’re
trying to understand what could be causing this and whether there’s
anything we can do to avoid it.
In case you’re wondering, we have some heartbeat data being written to a
table at a very small, constant rate, which is not enough to produce such
a large write-ahead log even if HDFS were pulled out from under
Accumulo’s feet, so to speak, during the power failure.
Thank you,
Jayesh