These are the troubleshooting steps I take (writing them down as they may
eventually be more generally useful to people):
If there are continually unassigned tablets, it's _likely_ that tablets
need to have log-recovery performed and, for some reason, that isn't
happening.
1. Ensure that the Accumulo system tables (!METADATA in 1.5 and
accumulo.root and accumulo.metadata in >=1.6) are fully available. You
should be able to `scan -np -t <table>` these tables without issue --
you should be able to read the entire table and the scan command should
not hang. If you cannot, your problem is worst-case and you may want to
consult Metadata File Corruption under [1].
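As a rough sketch of step 1 (not something I run verbatim), the check can be wrapped in a small function so that a hanging scan is reported instead of blocking forever. `check_table`, `ACCUMULO_BIN`, `ACCUMULO_USER`, `ACCUMULO_PASSWORD` and `SCAN_TIMEOUT` are names made up for this example, and it assumes GNU `timeout` is on the path. Note that passing a password with `-p` makes it visible in the process list.

```shell
# Hypothetical helper: scan one system table end to end and report
# whether the scan finished. All variable names here are assumptions.
ACCUMULO_BIN="${ACCUMULO_BIN:-accumulo}"

check_table() {
  # $1 is the table name, e.g. accumulo.root or accumulo.metadata.
  # timeout kills the scan if it hangs longer than SCAN_TIMEOUT seconds.
  if timeout "${SCAN_TIMEOUT:-300}" "$ACCUMULO_BIN" shell \
       -u "${ACCUMULO_USER:-root}" -p "${ACCUMULO_PASSWORD:-}" \
       -e "scan -np -t $1" > /dev/null; then
    echo "OK: $1"
  else
    echo "PROBLEM: $1 (scan failed or timed out)"
  fi
}

# Example use on >=1.6:
#   check_table accumulo.root
#   check_table accumulo.metadata
```

A "PROBLEM" line only says the scan did not complete; the reason still has to come from the server logs.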
2. If the system tables are OK, you can move on to the assumption that
these tablets belong to a user table. `accumulo admin checkTablets`
may be of use. You have two options at this point:
2a. Accept data loss. See instructions at [1] on removing log entries
for tablets.
2b. Recover the corrupt data from HDFS (not covered here).
I've seen situations where tablets that fail recovery don't send their
logs to the Monitor. The master will likely have a record of the reason
the recovery failed; the tabletserver will definitely have one. Check
the ends of the log files for both processes and you'll likely find an
Exception explaining why recovery keeps failing.
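A minimal sketch of that log check, assuming plain-text log files: `recent_exceptions` is a hypothetical helper name, and the file names in the usage comment (`master_<host>.log`, `tserver_<host>.log` under `$ACCUMULO_LOG_DIR`) are my assumption about a typical install, so adjust them to wherever your logs actually live.

```shell
# Hypothetical helper: print Exception lines (plus a few lines of stack
# trace) found near the end of one log file.
# Usage: recent_exceptions <logfile> [lines-from-end]
recent_exceptions() {
  # Look only at the last N lines (default 500), since a failure that
  # keeps recurring should show up at the end of the log.
  tail -n "${2:-500}" "$1" | grep -A 5 'Exception'
}

# Example use (paths are assumptions):
#   recent_exceptions "$ACCUMULO_LOG_DIR/master_$(hostname).log"
#   recent_exceptions "$ACCUMULO_LOG_DIR/tserver_$(hostname).log" 2000
```

No output just means nothing matched in that window; try a larger line count before concluding the log is clean.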
[1] http://accumulo.apache.org/1.6/accumulo_user_manual.html#_hdfs_failure
Bill Slacum wrote:
After a catastrophic failure, the Master Server section of the monitor
will report that there are 16 unassigned tablets (out of thousands), but
no table shows any offline tablets.
There were corrupt files under the recovery directory. These were
removed.
Otherwise, things seem fine with the cluster (we are having ingest
processes hang, which may or may not be related).
What should I do, as an operator, when Accumulo is in this state?
I have no logs to provide, unfortunately.