Some immediate thoughts:

1. Regarding node08 having so many files, maybe it was the last DN that had 
free space?
2. Look in the HDFS trash folder for the missing referenced WAL files (see the 
first sketch below).
3. For your OOME using the HDFS CLI, I think you can increase the amount of 
memory that the client will use with: export HADOOP_CLIENT_OPTS="-Xmx1G" (or 
something like that; example below).
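
For 2, a rough sketch of where to look; this assumes HDFS trash is enabled
(fs.trash.interval > 0) and that the files were deleted as the 'accumulo'
user, so the exact path below is only a guess based on the defaults:

    # recently deleted files land under the deleting user's .Trash/Current,
    # mirroring their original absolute path
    hdfs dfs -ls -R /user/accumulo/.Trash/Current/user/accumulo/accumulo/wal/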
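
For 3, the pattern I had in mind (the heap size here is only a rough guess;
size it to whatever the machine you run the CLI on can spare):

    # give the HDFS CLI client a bigger heap, then retry the listing
    export HADOOP_CLIENT_OPTS="-Xmx4g"
    hdfs dfs -ls hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/

    # -count is much cheaper than -ls if you only need totals
    hdfs dfs -count -h hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/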

Still digesting the rest....


> On August 30, 2017 at 2:45 PM Nick Wise <nicholas.w...@sa.catapult.org.uk> 
> wrote:
> 
> 
>      
> 
>     Disclaimer: I don’t have much experience with Accumulo or Hadoop, I’m 
> standing in because our resident expert is away on honeymoon!  We’ve done a 
> great deal of reading and do not know if our situation is recoverable, so any 
> and all advice would be very welcome.
> 
>      
> 
>     Background:
> 
>     We are running:
> 
>     (a) Accumulo version: 1.7.0
> 
>     (b) Hadoop version: 2.7.1
> 
>     (c) Geomesa version: 1.2.1
> 
>     We have 31 nodes, 2 masters and 3 zookeepers (named accordingly in the log 
> excerpts below).  Nodes are both data nodes and tablet servers; masters are 
> also name nodes.  Nodes have 16GB RAM, Intel Core i5 dual-core CPUs, and 
> 500GB or 1TB SSD each.
> 
>     This is a production deployment where we are analysing 16TB (and growing) 
> of geospatial data, with the outcomes being used daily.  We have customers 
> relying on our results.
> 
>      
> 
>     Initial Issue:
> 
>     HDFS was falsely reporting that non-DFS storage was using all of the free 
> space we had available, resulting in HDFS rejecting writes from a variety of 
> places across our cluster.  After research it appeared that this may have 
> been the result of a bug, and that restarting the HDFS services would resolve 
> it.  After restarting the HDFS services the non-DFS storage used immediately 
> returned to expected levels, but Accumulo didn’t seem to be responding to 
> queries, so we ran stop-all.sh and start-all.sh.  stop-all.sh timed out 
> trying to contact the master and did a forced shutdown.
> 
>      
> 
>     After restarting, Accumulo lists all the tables as being online (except 
> for accumulo.replication, which is offline), but none of the tables have 
> their tablets associated except for:
> 
>     (a) accumulo.metadata
> 
>     (b) accumulo.root
> 
>     All Geomesa tables are showing as online, though their tablets, table 
> sizes and record counts are not showing in the web UI.
> 
>      
> 
>     In the logs (which are very large) a range of issues is showing; the 
> following seem important based on our Googling.
> 
>      
> 
>     Log excerpts:
> 
>     2017-08-30 14:45:06,195 [master.EventCoordinator] INFO : Marked 1 tablets 
> as unassigned because they don't have current servers
> 
>     2017-08-30 14:45:06,195 [master.EventCoordinator] INFO : [Metadata 
> Tablets]: 1 tablets are ASSIGNED_TO_DEAD_SERVER
> 
>     2017-08-30 14:45:13,425 [master.Master] INFO : Assigning 1 tablets
> 
>     2017-08-30 14:45:13,441 [master.EventCoordinator] INFO : [Metadata 
> Tablets]: 1 tablets are UNASSIGNED
> 
>     2017-08-30 14:45:13,975 [master.EventCoordinator] INFO : tablet !0<;~ was 
> loaded on node03:9997
> 
>      
> 
>     An Accumulo metadata node is offline.  In the Accumulo master log file 
> we see that there are 1101 WALs associated with a node (node08) that are 
> linked to tablet !0<~.  Below are 2 instances of the message we get in the 
> logs, which repeat over and over, and there are 1101 of them per repeat.  
> We’re not sure why there are 1101 WALs for the one node, but we assume that 
> this is the main cause of our problem.
> 
>      
> 
>     2017-08-30 15:20:29,094 [conf.AccumuloConfiguration] INFO : Loaded class 
> : org.apache.accumulo.server.master.recovery.HadoopLogCloser
> 
>     2017-08-30 15:20:29,094 [recovery.RecoveryManager] INFO : Starting 
> recovery of 
> hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/fed84709-3d3b-45b0-8b77-020a71762b09
>  (in : 300s), tablet !0;~< holds a reference
> 
>     2017-08-30 15:20:29,142 [conf.AccumuloConfiguration] INFO : Loaded class 
> : org.apache.accumulo.server.master.recovery.HadoopLogCloser
> 
>     2017-08-30 15:20:29,142 [recovery.RecoveryManager] INFO : Starting 
> recovery of 
> hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/ffc115dd-f094-443f-a98f-8e670fb2a924
>  (in : 300s), tablet !0;~< holds a reference
> 
>     2017-08-30 15:20:45,457 [replication.WorkMaker] INFO : Replication table 
> is not yet online
> 
>      
> 
>     Any query of the metadata table hangs, such as those recommended here: 
> https://accumulo.apache.org/1.7/accumulo_user_manual.html#_advanced_system_recovery
> 
>     We are assuming that the above inability to recover the WALs is 
> preventing use of the metadata table, even though it reports as being online.
> 
>      
> 
>     Running:
> 
>     (a)
> 
>      ./hdfs dfs -du -s -h 
> hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/ returns:
> 
>     1.1 G  hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997
> 
>      
> 
>     (b)
> 
>     ./hdfs dfs -count -h 
> hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/ returns:
> 
>                 1      785.1 K              1.1 G 
> hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997
> 
>                
> 
>     (c)
> 
>     ./hdfs dfs -ls 
> hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/ returns:
> 
>     Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>             at java.lang.String.substring(String.java:1969)
>             at java.net.URI$Parser.substring(URI.java:2869)
>             at java.net.URI$Parser.parse(URI.java:3065)
>             at java.net.URI.<init>(URI.java:746)
>             at org.apache.hadoop.fs.Path.<init>(Path.java:108)
>             at org.apache.hadoop.fs.Path.<init>(Path.java:93)
>             at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:230)
>             at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.makeQualified(HdfsFileStatus.java:263)
>             at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:830)
>             at org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)
>             at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)
>             at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)
>             at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>             at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:849)
>             at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268)
>             at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:373)
>             at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:90)
>             at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
>             at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
>             at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
>             at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
>             at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
>             at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>             at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>             at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
> 
>     (d)
> 
>     We have validated that the file permissions on the Accumulo tables are 
> correct.
> 
>      
> 
>     We don’t understand why each of the 31 nodes that have WALs under 
> hdfs://master01:9000/user/accumulo/accumulo/wal/ has only a single WAL file 
> within, yet node08 has 785,100 files.  Also, for a random sample of the 1101 
> WAL files mentioned in the logs referred to above, none of them seem to be 
> in the folder (hdfs dfs -ls reports file not found for all of the files we 
> tried).
> 
>      
> 
>     Judging from the notes under “Advanced System Recovery” in the manual, 
> we’re stuck: it suggests editing the metadata table to drop the WALs and, 
> accepting some data loss, bring the system back online, but since the 
> problems appear to be with the metadata table itself, and HDFS is reporting 
> healthy with no corruption, we don’t see how to proceed.
> 
>      
> 
>     We have many large log files, which I’m happy to email separately if it 
> helps.
> 
>      
> 
>     Any suggestions as to what we might do to get back online?
> 
>      
> 
>     Thank you very much,
> 
>      
> 
>     Nick
> 
 
