These are the troubleshooting steps I take (writing them down, as they may eventually be more generally useful to people):

If there are continually unassigned tablets, it's _likely_ that tablets need to have log-recovery performed and, for some reason, that isn't happening.

1. Ensure that the Accumulo system tables (!METADATA in 1.5; accumulo.root and accumulo.metadata in >=1.6) are fully available. You should be able to `scan -np -t <table>` each of these tables without issue: the scan should read the entire table and should not hang. If it cannot, this is the worst case, and you may want to consider the steps under Metadata File Corruption in [1].

2. If the system tables are OK, you can assume that the unassigned tablets belong to a user table. `accumulo admin checkTablets` may be of use. You have two options at this point:

2a. Accept data loss. See instructions at [1] on removing log entries for tablets.

2b. Recover the corrupt data from HDFS (not covered here).
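
Steps 1 and 2 can be sketched as a small shell script. This is only a rough sketch, assuming a 1.6+ install with `accumulo` on the PATH and coreutils `timeout` available. The variables `ACCUMULO_BIN`, `ACCUMULO_USER`, `ACCUMULO_PASS`, and `SCAN_TIMEOUT` and the functions `scan_table` and `check_system_tables` are names made up for illustration, not Accumulo settings. Each scan is wrapped in `timeout` so that a hanging scan is reported instead of blocking forever. Whether the shell returns a non-zero exit code on a failed scan may vary by version, so read its error output as well.

```shell
#!/usr/bin/env bash
# Sketch only: these variable names and defaults are assumptions for illustration.
ACCUMULO_BIN="${ACCUMULO_BIN:-accumulo}"
ACCUMULO_USER="${ACCUMULO_USER:-root}"
SCAN_TIMEOUT="${SCAN_TIMEOUT:-600}"   # seconds before a scan is treated as hung

# Scan an entire table and discard the rows.
# Exit code: 0 if the scan finished, 124 if `timeout` killed it,
# otherwise whatever the Accumulo shell returned.
scan_table() {
  timeout "$SCAN_TIMEOUT" "$ACCUMULO_BIN" shell \
    -u "$ACCUMULO_USER" -p "$ACCUMULO_PASS" \
    -e "scan -np -t $1" > /dev/null
}

# Scan every table named on the command line; return 1 if any scan
# failed or hung, 0 if all of them completed.
check_system_tables() {
  failed=0
  for table in "$@"; do
    scan_table "$table"
    rc=$?
    if [ "$rc" -eq 0 ]; then
      echo "OK: $table scanned completely"
    elif [ "$rc" -eq 124 ]; then
      echo "HUNG: scan of $table exceeded ${SCAN_TIMEOUT}s" >&2
      failed=1
    else
      echo "FAILED: scan of $table exited with code $rc" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage (1.6+ table names; on 1.5 pass '!METADATA' instead):
#   check_system_tables accumulo.root accumulo.metadata \
#     && "$ACCUMULO_BIN" admin checkTablets
```

If `check_system_tables` fails, stop and deal with the system tables first; `checkTablets` is only worth running once they scan cleanly.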

I've seen situations where tablets that fail recovery don't send their logs to the Monitor. The master will likely have a record of the reason the recovery failed; the tabletserver will definitely have one. Check the ends of the log files for both processes and you'll likely find an Exception explaining why recovery keeps failing.
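
A quick way to check the ends of those log files is a small helper like the one below. It is a sketch, not a standard tool: `recent_exceptions` is a made-up name, the 500-line default is arbitrary, and the log location and file-name patterns in the usage comment are assumptions that depend on how your install is configured.

```shell
# Print lines mentioning an exception (plus 5 lines of trailing context)
# from the end of a log file.
# Arguments: log file path, optional number of trailing lines (default 500).
# Exit code is grep's: 0 if something matched, 1 if nothing did.
recent_exceptions() {
  log_file="$1"
  lines="${2:-500}"
  tail -n "$lines" "$log_file" | grep -n -A 5 -E 'Exception|Caused by'
}

# Usage, assuming logs live under $ACCUMULO_HOME/logs with these name patterns:
#   for f in "$ACCUMULO_HOME"/logs/master_*.log "$ACCUMULO_HOME"/logs/tserver_*.log; do
#     echo "== $f"
#     recent_exceptions "$f"
#   done
```

Run it on the master host and on the tablet servers that last hosted the unassigned tablets. The line numbers printed by `grep -n` are relative to the tail window, not to the whole file.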

[1] http://accumulo.apache.org/1.6/accumulo_user_manual.html#_hdfs_failure

Bill Slacum wrote:
After a catastrophic failure, the Master Server section of the monitor will report that there are 16 unassigned tablets (out of thousands), but no table shows any offline tablets.

There were corrupt files under the recovery directory. These were removed.

Otherwise, things seem fine with the cluster (we are having ingest processes hang, which may or may not be related).

What should I do, as an operator, when Accumulo is in this state?

I have no logs to provide, unfortunately.
