RE: accumulo.metadata table online but scans hang
Thank you Mike!

It’s alive!! I’ve got 5TB showing free now (which previously was spuriously showing as non-DFS used). I’ll get trash enabled tomorrow now.

The touchz approach seems to have worked, and all tables and tablets are showing online. I am able to access some data, so next steps are to find out where we need to reingest from. Reingestion should be the easier approach for us, and I can live with duplicate data.

I realise this is an RTFM kind of question, but as a novice I’d like to be sure I’m taking the right next steps. What’s the best way to check the integrity of the system, and look for further problems? Are there integrity checking admin tools I should know about?

Thank you again, and also Dave and Ivan for helping us through this!

Nick
Re: accumulo.metadata table online but scans hang
Nick

First step is to make sure you have plenty of HDFS space. Seems like restarting fixed your du bug, so hopefully you are good there. Consider turning on HDFS trash as well; it can be very useful for recovery.

Second step is to get your metadata table happy. Try the touchz for the WAL logs and get to a point where you can at least scan the metadata from the shell.

Next step will be to recover data. I see 2 options. Dave already mentioned one: replay all your ingest from a point in time that was known good. That is probably the best way to reduce data loss. You could also try creating a new table and then bulk importing all the rfiles for a given table into that new table. You would want to import all rfiles referenced by the metadata and all rfiles you find in HDFS for that table that aren't referenced.

Mike
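The second recovery option in this message (bulk importing both the referenced and the unreferenced rfiles) comes down to comparing two lists of paths. The following is only a sketch of that bookkeeping, under the assumption that you have already produced both lists yourself: one from a recursive HDFS listing of the table's directory and one from the file entries in the metadata table. The function name `partition_rfiles` and the example paths are hypothetical, and nothing here talks to HDFS or Accumulo.

```python
from pathlib import PurePosixPath


def partition_rfiles(hdfs_listing, metadata_refs):
    """Compare rfiles found in HDFS with rfiles the metadata table references.

    Both arguments are iterables of path strings in the same form
    (assumption: they were normalised beforehand, e.g. the hdfs://host:port
    prefix was stripped from both or from neither).
    Returns (referenced, unreferenced, missing) as sorted lists.
    """
    # Only *.rf files are candidates for a bulk import.
    found = {p for p in hdfs_listing if PurePosixPath(p).suffix == ".rf"}
    refs = set(metadata_refs)

    referenced = sorted(found & refs)    # present in HDFS and in the metadata
    unreferenced = sorted(found - refs)  # present in HDFS only
    missing = sorted(refs - found)       # in the metadata only: nothing to import
    return referenced, unreferenced, missing


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    listing = [
        "/accumulo/tables/2/t-0001/F0001.rf",
        "/accumulo/tables/2/t-0001/F0002.rf",
    ]
    refs = ["/accumulo/tables/2/t-0001/F0001.rf"]
    for name, paths in zip(("referenced", "unreferenced", "missing"),
                           partition_rfiles(listing, refs)):
        print(name, paths)
```

The third list is worth a look before importing anything: files the metadata refers to but HDFS no longer holds point at data that can only come back through reingestion.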
RE: accumulo.metadata table online but scans hang
Thank you Dave. I will try the touchz approach and see what happens.
RE: accumulo.metadata table online but scans hang
I don't have any other options for you at this point. Seems like you have the necessary information to fix up the missing files and recover the system. You might be able to determine the timestamp of the first missing WAL file and replay data from that point.
RE: accumulo.metadata table online but scans hang
It does run as the accumulo user, but sadly still no trash. I'm told that this is probably because we move a lot of files in and out of HDFS for ingestion, and it's a space saving thing. In hindsight I'd rather have bought more disks than have lost important files!

I can't find any reference to deleting the WAL files in the gc logs, I do see lots of lines like this around the time that things went wrong though:

2017-08-28 19:51:02,210 [gc.GarbageCollectWriteAheadLogs] INFO : Checking replication table for hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/0007bea2-bf57-44ab-b2ca-a8a924c6b3c8

And after logs like the above for all of the referenced WAL files, lines like this:

2017-08-28 19:51:05,775 [gc.GarbageCollectWriteAheadLogs] INFO : 1822 replication entries scanned in 6.83 seconds
2017-08-28 19:51:05,780 [gc.GarbageCollectWriteAheadLogs] INFO : 0 total logs removed from 1 servers in 6.83 seconds
2017-08-28 19:53:05,971 [impl.ThriftTransportPool] WARN : Thread "gc" stuck on IO to master02: (0) for at least 120022 ms

The only gc logging that happens after this point is startup properties print outs. No further operational log messages come out.

In terms of creating an empty table, when trying to do this it just hangs. I assume this is because the metadata table is not working.

root@instance> createtable emptytable
2017-08-31 10:43:57,315 [impl.ThriftTransportPool] WARN : Thread "shell" stuck on IO to master01: (0) for at least 120029 ms

Are tables instance specific, or is it possible to get an empty rfile from any instance? I don't suppose someone has such a file I could have..?!

I believe it was Ivan who recommended using touchz to create an empty file in place of the missing WALs, I'm assuming from the procedure to create an empty table that this isn't the right thing to do, so I will hold off doing that unless someone can confirm it is good enough.

Thank you again for your help!
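For anyone wanting to repeat the cross-referencing described in this message, here is a rough sketch of pulling the WAL paths out of gc log lines shaped like the one quoted above. The regular expression is an assumption based only on that single sample line, and the names `WAL_CHECK` and `referenced_wals` are hypothetical; other Accumulo versions may log this differently.

```python
import re

# Matches lines such as:
#   ... [gc.GarbageCollectWriteAheadLogs] INFO : Checking replication table for hdfs://...
# and captures the path that follows.
WAL_CHECK = re.compile(
    r"\[gc\.GarbageCollectWriteAheadLogs\]\s+INFO\s*:\s*"
    r"Checking replication table for\s+(\S+)"
)


def referenced_wals(log_lines):
    """Return the distinct WAL paths named in the log lines, in first-seen order."""
    paths = (m.group(1) for m in map(WAL_CHECK.search, log_lines) if m)
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(paths))


if __name__ == "__main__":
    sample = [
        "2017-08-28 19:51:02,210 [gc.GarbageCollectWriteAheadLogs] INFO : "
        "Checking replication table for "
        "hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/"
        "0007bea2-bf57-44ab-b2ca-a8a924c6b3c8",
        "2017-08-28 19:51:05,775 [gc.GarbageCollectWriteAheadLogs] INFO : "
        "1822 replication entries scanned in 6.83 seconds",
    ]
    for path in referenced_wals(sample):
        print(path)
```

In practice the input would be the lines of the gc log file; long log lines that were wrapped by a mail client, as in this thread, would need to be rejoined first.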
RE: accumulo.metadata table online but scans hang
Re #2: Do your Accumulo processes run as the hdfs user on the O/S, or as the accumulo user? Make sure you are checking the correct user's trash folder. Also, check the Accumulo garbage collector log to see if the GC process deleted the WAL files. Take a look at [1] to see if you are hitting this case.

You can create empty rfiles and copy them into place. I believe the procedure to do this is to create an empty table and run a compaction on the table. Then you should be able to copy the resulting file into the desired locations (devs - please correct me here if this is not correct).

Finally, I would not do anything destructive yet. Let's see if we can get some other devs to chime in with some ideas.

[1] https://issues.apache.org/jira/browse/ACCUMULO-4157

-----Original Message-----
From: Nick Wise [mailto:nicholas.w...@sa.catapult.org.uk]
Sent: Wednesday, August 30, 2017 5:36 PM
To: user@accumulo.apache.org
Subject: RE: accumulo.metadata table online but scans hang

Thank you very much for the pointers Dave. Looking at those:

1. That seems reasonable, I’m not sure how to check after the fact but makes sense.
2. Ah. Looks like we don’t have trash enabled, there’s no /user/hdfs/.Trash folder that I can see. I’m getting a sinking feeling…
3. I had to allocate 4G but that worked and now I have a folder listing of 758k files. I’ve cross referenced with the 1101 WAL files referred to in our logs and not a single one exists. Sinking some more.

So, it sounds like (speaking from a position of ignorance) we have a system where accumulo.metadata has outstanding WAL files to recover, but the files don't exist, but the only way to restore the system is to convince the metadata table that it doesn't need the WAL files, but to edit the metadata table we have to resolve the outstanding WAL files, etc.

What would happen if we created an empty file in place of the missing WAL files? Would they be considered to be an invalid format and break things more (I'm not sure that's possible), or might they be accepted as needing no further resolution?

Any other thoughts (anyone) on how we might save ourselves, besides starting from scratch? (When we first loaded our 16TB of data it took 6 weeks using the map/reduce method!)

Thank you again!

Nick

From: Dave Marion [mailto:dlmar...@comcast.net]
Sent: 30 August 2017 20:13
To: user@accumulo.apache.org; Nick Wise <nicholas.w...@sa.catapult.org.uk>
Subject: Re: accumulo.metadata table online but scans hang

Some immediate thoughts:

1. Regarding node08 having so many files, maybe it was the last DN that had free space?
2. Look in the trash folder for the missing referenced WAL files
3. For your OOME using the HDFS CLI, I think you can increase the amount of memory that the client will use with: export HADOOP_CLIENT_OPTS="-Xmx1G" (or something like that).

Still digesting the rest
RE: accumulo.metadata table online but scans hang
I think you may be able to simply touchz the WAL file to get it past this point. You will probably be losing data, of course. If these are WAL files for the accumulo.metadata table itself, then we may have more issues to overcome. However, if these are for mutations in other tables that you can replay, then this will probably work.

> On August 30, 2017 at 5:49 PM Nick Wise <nicholas.w...@sa.catapult.org.uk> wrote:
>
> Thank you also Ivan. Would this circumstance also arise if the WAL file were somehow deleted? I wonder if it might be related to this issue (although I can't find the requisite entries in the gc log to prove this): https://issues.apache.org/jira/browse/ACCUMULO-4157
>
> It suggests touching the relevant WAL files to create empty ones, which might let the system move on, albeit with data loss. We can then reingest anything that may have been impacted over the past few days. This would be a much better solution than reingesting all 4 years' worth of data!
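A minimal sketch of the touchz suggestion at the top of this reply, not a tested recovery procedure. It assumes the affected WALs are referenced but missing from HDFS (the ACCUMULO-4157 situation). Since `hdfs dfs -touchz` returns an error for an existing non-empty file, the function only creates a zero-length placeholder where `hdfs dfs -test -e` finds nothing. `touch_missing_wal`, `HDFS_BIN` and `DRY_RUN` are names invented for this illustration, and by default the function prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch only: create empty placeholders for WAL files that are referenced
# by Accumulo but no longer exist in HDFS. Any mutations that were only in
# those WALs are lost and would need to be reingested.
# HDFS_BIN and DRY_RUN are illustrative knobs, not Hadoop or Accumulo settings.
HDFS_BIN="${HDFS_BIN:-hdfs}"
DRY_RUN="${DRY_RUN:-1}"

touch_missing_wal() {
  local wal="$1"
  if "$HDFS_BIN" dfs -test -e "$wal"; then
    # The file is present, so an empty placeholder is not what is needed.
    echo "exists, leaving alone: $wal"
  elif [ "$DRY_RUN" = "1" ]; then
    echo "would run: $HDFS_BIN dfs -touchz $wal"
  else
    "$HDFS_BIN" dfs -touchz "$wal"
  fi
}
```

With the WAL paths from the master log collected one per line in a file (wals.txt is a hypothetical name), something like `while read -r wal; do touch_missing_wal "$wal"; done < wals.txt` would list the intended commands; rerunning with `DRY_RUN=0` would execute them once the list looks right.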
RE: accumulo.metadata table online but scans hang
Thank you also Ivan. Would this circumstance also arise if the WAL file were somehow deleted? I wonder if it might be related to this issue (although I can't find the requisite entries in the gc log to prove this): https://issues.apache.org/jira/browse/ACCUMULO-4157

It suggests touching the relevant WAL files to create empty ones, which might let the system move on, albeit with data loss. We can then reingest anything that may have been impacted over the past few days. This would be a much better solution than reingesting all 4 years' worth of data!

-Original Message-
From: ivan bella [mailto:i...@ivan.bella.name]
Sent: 30 August 2017 21:57
To: user@accumulo.apache.org
Subject: Re: accumulo.metadata table online but scans hang

The problem you may have run into is that the WAL files never got closed, and Accumulo is waiting for them to be closed before processing them: https://issues.apache.org/jira/browse/HDFS-8406. There is a workaround noted there, which is to move the WAL file out of the way and then copy it back to the original location. The copy will be properly closed, and WAL recovery will continue.
Re: accumulo.metadata table online but scans hang
The problem you may have run into is that the WAL files never got closed, and Accumulo is waiting for them to be closed before processing them: https://issues.apache.org/jira/browse/HDFS-8406. There is a workaround noted there, which is to move the WAL file out of the way and then copy it back to the original location. The copy will be properly closed, and WAL recovery will continue.

> On August 30, 2017 at 2:45 PM Nick Wise <nicholas.w...@sa.catapult.org.uk> wrote:
>
> Disclaimer: I don’t have much experience with Accumulo or Hadoop, I’m standing in because our resident expert is away on honeymoon! We’ve done a great deal of reading and do not know if our situation is recoverable, so any and all advice would be very welcome.
>
> Background:
>
> We are running:
> (a) Accumulo version: 1.7.0
> (b) Hadoop version: 2.7.1
> (c) Geomesa version: 1.2.1
>
> We have 31 nodes, 2 masters and 3 zookeepers (obviously named in the log excerpts below). Nodes are both data nodes and tablet servers, masters are also name nodes. Nodes have 16GB RAM, Intel Core i5 dual core CPUs, and 500GB or 1TB SSD each.
>
> This is a production deployment where we are analysing 16TB (and growing) geospatial data, with the outcomes being used daily. We have customers relying on our results.
>
> Initial Issue:
>
> The non-DFS storage used in our HDFS system was falsely reporting that it was using all of the free space we had available, resulting in HDFS rejecting writes from a variety of places across our cluster. After research it appeared that this may be as a result of a bug, and that restarting HDFS services would resolve it. After restarting the HDFS services the non-DFS storage used immediately returned to expected levels, but accumulo didn’t seem to be responding to queries so we ran stop-all.sh and start-all.sh. When running stop-all.sh it timed out trying to contact the master, and did a forced shutdown.
>
> After restarting, Accumulo listed all the tables as being online (except for accumulo.replication which is offline) but none of the tables have their tablets associated except for:
> (a) accumulo.metadata
> (b) accumulo.root
>
> All Geomesa tables are showing as online though the tablets, table sizes and record counts are not showing in the web UI.
>
> In the logs (which are very large) there are a range of issues showing, the following seeming important from our Googling.
>
> Log excerpts:
>
> 2017-08-30 14:45:06,195 [master.EventCoordinator] INFO : Marked 1 tablets as unassigned because they don't have current servers
> 2017-08-30 14:45:06,195 [master.EventCoordinator] INFO : [Metadata Tablets]: 1 tablets are ASSIGNED_TO_DEAD_SERVER
> 2017-08-30 14:45:13,425 [master.Master] INFO : Assigning 1 tablets
> 2017-08-30 14:45:13,441 [master.EventCoordinator] INFO : [Metadata Tablets]: 1 tablets are UNASSIGNED
> 2017-08-30 14:45:13,975 [master.EventCoordinator] INFO : tablet !0<;~ was loaded on node03:9997
>
> An Accumulo meta data node is offline. In the accumulo master log file we see that there are 1101 WALs associated with a node (node08) that are linked to tablet !0<~. Below are 2 instances of the message we get in the logs, which repeat over and over, and there are 1101 of them per repeat. We’re not sure why there are 1101 WALs for the one node, but we assume that this is the main cause of our problem.
>
> 2017-08-30 15:20:29,094 [conf.AccumuloConfiguration] INFO : Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
> 2017-08-30 15:20:29,094 [recovery.RecoveryManager] INFO : Starting recovery of hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/fed84709-3d3b-45b0-8b77-020a71762b09 (in : 300s), tablet !0;~< holds a reference
> 2017-08-30 15:20:29,142 [conf.AccumuloConfiguration] INFO : Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
> 2017-08-30 15:20:29,142 [recovery.RecoveryManager] INFO : Starting recovery of hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/ffc115dd-f094-443f-a98f-8e670fb2a924 (in : 300s), tablet !0;~< holds a reference
> 2017-08-30 15:20:45,457 [replication.WorkMaker] INFO : Replication table is not yet online
>
> Any query of the meta data table hangs, such as those recommended here: https://accumulo.apache.org/1.7/accumulo_user_manual.html#_advanced_system_recovery
>
> We are assuming that the above inability to recover the WALs is preventing use of the metadata table, even though it reports as being online.
>
> Running:
>
> (a) ./hdfs dfs -du -s -h hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/ returns:
>
> 1.1 G hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997
>
> (b) ./hdfs dfs -count -h hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/ returns:
>
> 1 785.1 K
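The move-then-copy workaround described at the top of this reply could be sketched roughly as follows. This is an illustration under assumptions, not the procedure from HDFS-8406 itself: `reclose_wal`, `run_or_print`, `HDFS_BIN`, `DRY_RUN` and the idea of a separate "aside" directory are all made up for the example, and the function only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch only: replace an unclosed WAL file with a freshly written copy.
# The copy is a new, properly closed HDFS file with the same contents.
# HDFS_BIN and DRY_RUN are illustrative knobs, not Hadoop or Accumulo settings.
HDFS_BIN="${HDFS_BIN:-hdfs}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run_or_print() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# reclose_wal WAL_PATH ASIDE_DIR
# Moves the WAL into ASIDE_DIR, then copies it back to its original path.
reclose_wal() {
  local wal="$1" aside_dir="$2"
  local aside="${aside_dir%/}/$(basename "$wal")"
  run_or_print "$HDFS_BIN" dfs -mkdir -p "$aside_dir" &&
    run_or_print "$HDFS_BIN" dfs -mv "$wal" "$aside" &&
    run_or_print "$HDFS_BIN" dfs -cp "$aside" "$wal"
}
```

Keeping the moved original in a separate directory leaves a fallback copy if the rewritten file turns out not to help; the directory name is whatever the operator chooses.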
Re: accumulo.metadata table online but scans hang
Some immediate thoughts:

1. Regarding node08 having so many files, maybe it was the last DN that had free space?
2. Look in the trash folder for the missing referenced WAL files.
3. For your OOME using the HDFS CLI, I think you can increase the amount of memory that the client will use with: export HADOOP_CLIENT_OPTS="-Xmx1G" (or something like that).

Still digesting the rest.
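As a small sketch of the heap suggestion in point 3, the setting can be applied to a single command instead of exported for the whole session. `with_client_heap` is a hypothetical helper name, and the heap size that avoids the OOME on this cluster is not known from the thread, so the value is left as a parameter.

```shell
#!/usr/bin/env bash
# Sketch only: run one command with a larger client JVM heap.
# with_client_heap HEAP COMMAND [ARGS...]
# Appends -Xmx<HEAP> to any existing HADOOP_CLIENT_OPTS for that command only.
with_client_heap() {
  local heap="$1"
  shift
  HADOOP_CLIENT_OPTS="${HADOOP_CLIENT_OPTS:+$HADOOP_CLIENT_OPTS }-Xmx${heap}" "$@"
}
```

For example, `with_client_heap 1G hdfs dfs -count -h hdfs://master01:9000/user/accumulo/accumulo/wal/node08+9997/` would rerun the count from the original message with the larger heap, leaving the shell's own environment unchanged.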