Disclaimer: I don't have much experience with Accumulo or Hadoop; I'm standing 
in because our resident expert is away on honeymoon!  We've done a great deal 
of reading and do not know whether our situation is recoverable, so any and all 
advice would be very welcome.

Background:
We are running:
(a) Accumulo version: 1.7.0
(b) Hadoop version: 2.7.1
(c) GeoMesa version: 1.2.1
We have 31 worker nodes, 2 masters and 3 ZooKeeper servers (named accordingly 
in the log excerpts below).  The worker nodes are both HDFS DataNodes and 
Accumulo tablet servers; the masters are also HDFS NameNodes.  Each node has 
16GB RAM, an Intel Core i5 dual-core CPU, and a 500GB or 1TB SSD.
This is a production deployment in which we analyse 16TB (and growing) of 
geospatial data, with the outputs used daily.  We have customers relying on 
our results.

Initial Issue:
HDFS was falsely reporting that non-DFS storage was consuming all of the free 
space we had available, which resulted in HDFS rejecting writes from a variety 
of places across our cluster.  After some research it appeared that this might 
be the result of a bug, and that restarting the HDFS services would resolve 
it.  After restarting the HDFS services the reported non-DFS usage immediately 
returned to expected levels, but Accumulo did not seem to be responding to 
queries, so we ran stop-all.sh and start-all.sh.  When running stop-all.sh it 
timed out trying to contact the master and performed a forced shutdown.

After restarting, Accumulo listed all of the tables as online (except for 
accumulo.replication, which is offline), but none of the tables had their 
tablets assigned except for:
(a) accumulo.metadata
(b) accumulo.root
All GeoMesa tables show as online, though their tablets, table sizes and 
record counts are not shown in the web UI.

The logs (which are very large) show a range of issues; the following seem the 
most important based on our searching.

Log excerpts:
2017-08-30 14:45:06,195 [master.EventCoordinator] INFO : Marked 1 tablets as 
unassigned because they don't have current servers
2017-08-30 14:45:06,195 [master.EventCoordinator] INFO : [Metadata Tablets]: 1 
tablets are ASSIGNED_TO_DEAD_SERVER
2017-08-30 14:45:13,425 [master.Master] INFO : Assigning 1 tablets
2017-08-30 14:45:13,441 [master.EventCoordinator] INFO : [Metadata Tablets]: 1 
tablets are UNASSIGNED
2017-08-30 14:45:13,975 [master.EventCoordinator] INFO : tablet !0<;~ was 
loaded on node03:9997

One of the Accumulo metadata tablets is offline.  In the Accumulo master log 
file we see that there are 1101 WALs associated with one node (node08) that 
are linked to tablet !0<~.  Below are two instances of the message we get in 
the logs; the messages repeat over and over, with 1101 of them per repetition 
(a way of reproducing that count is sketched after the excerpts).  We're not 
sure why there are 1101 WALs for the one node, but we assume that this is the 
main cause of our problem.

2017-08-30 15:20:29,094 [conf.AccumuloConfiguration] INFO : Loaded class : 
org.apache.accumulo.server.master.recovery.HadoopLogCloser
2017-08-30 15:20:29,094 [recovery.RecoveryManager] INFO : Starting recovery of 
hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/fed84709-3d3b-45b0-8b77-020a71762b09
 (in : 300s), tablet !0;~< holds a reference
2017-08-30 15:20:29,142 [conf.AccumuloConfiguration] INFO : Loaded class : 
org.apache.accumulo.server.master.recovery.HadoopLogCloser
2017-08-30 15:20:29,142 [recovery.RecoveryManager] INFO : Starting recovery of 
hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/ffc115dd-f094-443f-a98f-8e670fb2a924
 (in : 300s), tablet !0;~< holds a reference
2017-08-30 15:20:45,457 [replication.WorkMaker] INFO : Replication table is not 
yet online
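
For anyone wanting to reproduce the figure of 1101, counting the distinct 
node08 WAL files referenced in these recovery messages can be done with 
something like the following (the master log file name is just an example 
from our layout, adjust as appropriate):

grep -o 'wal/node08+9997/[0-9a-f-]*' master_master01.debug.log | sort -u | wc -l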

Any query of the metadata table hangs, including those recommended here: 
https://accumulo.apache.org/1.7/accumulo_user_manual.html#_advanced_system_recovery
We are assuming that the inability to recover the WALs described above is 
preventing use of the metadata table, even though it reports as being online.
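
For concreteness, the sort of check we have been attempting looks like the 
following in the Accumulo shell (the table and column family names are the 
standard ones, nothing here is specific to our setup); it simply never returns:

$ accumulo shell -u root
> scan -t accumulo.metadata -c loc,log      (hangs, no entries printed)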

Running:
(a)
./hdfs dfs -du -s -h hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/
returns:
1.1 G  hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997

(b)
./hdfs dfs -count -h hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/
returns:
            1      785.1 K              1.1 G hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997
(i.e. 1 directory, roughly 785,100 files, 1.1 GB in total)

(c)
./hdfs dfs -ls hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/
returns the OutOfMemoryError below (a client-heap workaround sketch follows 
after this list):
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.lang.String.substring(String.java:1969)
        at java.net.URI$Parser.substring(URI.java:2869)
        at java.net.URI$Parser.parse(URI.java:3065)
        at java.net.URI.<init>(URI.java:746)
        at org.apache.hadoop.fs.Path.<init>(Path.java:108)
        at org.apache.hadoop.fs.Path.<init>(Path.java:93)
        at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:230)
        at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.makeQualified(HdfsFileStatus.java:263)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:830)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)
        at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)
        at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:849)
        at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268)
        at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:373)
        at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:90)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
        at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
(d)
We have validated that the file permissions on the Accumulo tables are correct.
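
One workaround we can try for (c), since the OutOfMemoryError is thrown in the 
FsShell client itself rather than in HDFS, is to raise the client-side heap so 
that the listing of the ~785K entries can complete.  A sketch (the 4g figure 
is an arbitrary guess on our part):

export HADOOP_CLIENT_OPTS="-Xmx4g"
./hdfs dfs -ls hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/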

We don't understand why every other node with WALs under 
hdfs://master01:9000/user/accumulo/accumulo/wal/ has only a single WAL file 
within, yet for node08 there are 785,100 files.  Also, for a random sample of 
the 1101 WAL files mentioned in the logs referred to above, none of them seems 
to be in the folder (hdfs dfs -ls reports file not found for all of the files 
we tried).
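
In case it is useful, an alternative way to spot-check individual WAL files 
from the log messages, without listing the whole directory, would be something 
like this (the UUID is one of the two from the excerpt above; an exit status 
of 0 means the file exists):

./hdfs dfs -test -e hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/fed84709-3d3b-45b0-8b77-020a71762b09
echo $?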

Judging from the notes under "Advanced System Recovery" in the manual, we're 
stuck: it suggests editing the metadata table to drop the WAL references and, 
accepting some data loss, getting the system back online.  But as the problems 
appear to be with the metadata table itself, and HDFS is reporting healthy 
with no corruption, we don't see how to proceed.
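
For clarity, our reading of those notes is that the fix would look roughly 
like the sketch below: find the log entries for the stuck tablet and delete 
them, accepting the loss of any un-flushed data.  We have not attempted this, 
partly because we believe the WAL references for a metadata tablet live in 
accumulo.root rather than accumulo.metadata (please correct us if that is 
wrong), and partly because we are not confident the scan would even return.  
The row and column qualifier below are placeholders, not values we have been 
able to read out:

(in the Accumulo shell)
> table accumulo.root
> scan -c log
> delete <row-of-stuck-tablet> log <qualifier-shown-by-the-scan>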

We have many large log files, which I'm happy to email separately if it helps.

Any suggestions as to what we might do to get back online?

Thank you very much,

Nick


