[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter
[ https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444394#comment-16444394 ]

Keith Turner commented on ACCUMULO-4851:

This is not a duplicate of #432

> WAL recovery directory should be deleted before running LogSorter
> -----------------------------------------------------------------
>
> Key: ACCUMULO-4851
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4851
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Critical
>
> Noticed this one on a user's 1.7-ish system.
> A number of tablets (~9) were unassigned and reported on the Monitor as having failed to load. Digging into the exception, we could see the tablet load failed due to a FileNotFoundException:
> {noformat}
> 2018-04-09 19:57:08,475 [tserver.TabletServer] WARN : exception trying to assign tablet xk;... /accumulo/tables/xk/t-00pyzd0
> java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>     at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:640)
>     at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
>     at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2156)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
>     at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>     at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:480)
>     at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
>     at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:590)
>     ... 9 more
> Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>     at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1446)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1438)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1454)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1823)
>     at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
>     at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
>     at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
>     at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
>     at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
>     at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:478)
>     ... 11 more
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : failed to open tablet xk;... reporting failure to master
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds
> {noformat}
> Upon further investigation of the recovery directory in HDFS for this WAL, we find the following:
> {noformat}
> $ hdfs dfs -ls -R /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/
> -rwxr--r-- 3 accumulo hdfs 0 2018-04-06 22:12 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed
> -rwxr--r-- 3 accumulo hdfs 0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/finished
> drwxr-xr-x - accumulo hdfs 0 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-0
> -rw-r--r-- 3 accumulo hdfs 8040761 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-0/data
> -rw-r--r-- 3 accumulo hdfs 642 2018-04-06 22:09
[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter
[ https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432674#comment-16432674 ]

Josh Elser commented on ACCUMULO-4851:

{quote}Do you think this should be a blocker or should I proceed with a release candidate?{quote}

IMO, this does not need to be a blocker.

I meant to add the workaround here but forgot. I resolved the issue for the customer by:
* Identifying the recovery directory in HDFS for the one WAL which has the {{failed}} and {{finished}} markers (e.g. {{/accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849}})
* Stopping the Accumulo Master(s)
* Moving or deleting the recovery directory for this WAL
* Starting the Accumulo Master(s)

Accumulo will automatically initiate recovery for this WAL and _should_ succeed on retry.

> WAL recovery directory should be deleted before running LogSorter
> -----------------------------------------------------------------
>
> Key: ACCUMULO-4851
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4851
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Critical
> Fix For: 1.9.0
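The first step of the workaround above (finding the recovery directory that carries both markers) could be sketched roughly as follows. This is only an illustration, not something Accumulo or Hadoop ships: the function name `find_conflicting_recovery_dirs` is hypothetical, and it assumes its input is text in the same eight-column layout as the `hdfs dfs -ls -R` output quoted in the issue description. It only reports candidates; it does not stop the Master or move or delete anything.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Accumulo or Hadoop).
# Reads `hdfs dfs -ls -R` style output on stdin and prints every directory
# that contains BOTH a `failed` and a `finished` marker file.
find_conflicting_recovery_dirs() {
  awk '
    # Only regular files can be markers; skip directories and short lines.
    $1 !~ /^-/ || NF < 8 { next }
    {
      path = $NF                       # last column is the file path
      n = split(path, parts, "/")
      marker = parts[n]                # final path component
      if (marker != "failed" && marker != "finished") next
      # Parent directory is the path minus "/<marker>".
      dir = substr(path, 1, length(path) - length(marker) - 1)
      seen[dir, marker] = 1
      dirs[dir] = 1
    }
    END {
      for (d in dirs)
        if ((d, "failed") in seen && (d, "finished") in seen) print d
    }
  ' | sort
}
```

Assuming the recovery root used in this issue, it would be fed with something like `hdfs dfs -ls -R /accumulo/recovery | find_conflicting_recovery_dirs`. Any directory it prints is only a candidate for the move-or-delete step and should be checked by hand before the Master is stopped.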
[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter
[ https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432540#comment-16432540 ]

Christopher Tubbs commented on ACCUMULO-4851:

I was interested in making a release candidate for 1.9.0 this week. Do you think this should be a blocker or should I proceed with a release candidate?
[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter
[ https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431479#comment-16431479 ]

Josh Elser commented on ACCUMULO-4851:

No worries. I think I know what the fix is, just thought I'd mention it to you on the off-chance it rang a bell.
[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter
[ https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431475#comment-16431475 ]

Dave Marion commented on ACCUMULO-4851:

I don't remember an issue like this. Sorry I couldn't be of any help here.
[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter
[ https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431437#comment-16431437 ]

Josh Elser commented on ACCUMULO-4851:

[~dlmarion], a phrocker suggested that you might have run into a similar issue at some point :)