[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter

2018-04-19 Thread Keith Turner (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444394#comment-16444394
 ] 

Keith Turner commented on ACCUMULO-4851:


This is not a duplicate of #432

> WAL recovery directory should be deleted before running LogSorter
> -
>
> Key: ACCUMULO-4851
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4851
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Critical
>
> Noticed this one on a user's 1.7-ish system.
> A number of tablets (~9) were unassigned and reported on the Monitor as 
> having failed to load. Digging into the exception, we could see the tablet 
> load failed due to a FileNotFoundException:
> {noformat}
> 2018-04-09 19:57:08,475 [tserver.TabletServer] WARN : exception trying to assign tablet xk;... /accumulo/tables/xk/t-00pyzd0
> java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>     at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:640)
>     at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
>     at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2156)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
>     at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>     at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:480)
>     at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
>     at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:590)
>     ... 9 more
> Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
>     at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1446)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1438)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1454)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1823)
>     at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
>     at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
>     at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
>     at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
>     at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
>     at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:478)
>     ... 11 more
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : failed to open tablet xk;... reporting failure to master
> 2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds
> {noformat}
> Upon further investigation of the recovery directory in HDFS for this WAL, we 
> find the following:
> {noformat}
> $ hdfs dfs -ls -R /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/
> -rwxr--r--   3 accumulo hdfs  0 2018-04-06 22:12 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed
> -rwxr--r--   3 accumulo hdfs  0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/finished
> drwxr-xr-x   - accumulo hdfs  0 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-0
> -rw-r--r--   3 accumulo hdfs    8040761 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-0/data
> -rw-r--r--   3 accumulo hdfs    642 2018-04-06 22:09 
> 

[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter

2018-04-10 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432674#comment-16432674
 ] 

Josh Elser commented on ACCUMULO-4851:
--

{quote}Do you think this should be a blocker or should I proceed with a release 
candidate?
{quote}
IMO, this does not need to be a blocker. I meant to add the workaround here but 
forgot. I resolved the issue for the customer as follows:
 * Identify the recovery directory in HDFS for the one WAL which has both the 
{{failed}} and {{finished}} markers (e.g. 
{{/accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849}})
 * Stop Accumulo Master(s)
 * Move or delete the recovery directory for this WAL
 * Start Accumulo Master(s)

Accumulo will automatically initiate recovery for this WAL and _should_ succeed 
on retry.
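
As a rough illustration of the first step above, the following shell sketch scans {{hdfs dfs -ls -R}} output for recovery directories that hold both a {{failed}} and a {{finished}} marker. The function name {{find_conflicting_recovery_dirs}} and the holding path in the usage comments are hypothetical, and the sketch assumes each listing entry sits on a single line with the path as its last field.

```shell
#!/usr/bin/env bash
# Sketch only: print recovery directories that carry BOTH a `failed` and a
# `finished` marker, given `hdfs dfs -ls -R` output on stdin.
# find_conflicting_recovery_dirs is a hypothetical helper name.
find_conflicting_recovery_dirs() {
  awk '
    # Assumption: the path is the last whitespace-separated field.
    { path = $NF }
    path ~ /\/failed$/   { sub(/\/failed$/, "", path);   failed[path] = 1 }
    path ~ /\/finished$/ { sub(/\/finished$/, "", path); finished[path] = 1 }
    END { for (d in failed) if (d in finished) print d }
  ' | sort
}

# Possible usage (not run here):
#   hdfs dfs -ls -R /accumulo/recovery | find_conflicting_recovery_dirs
# Then, with the Accumulo Master(s) stopped, move each reported directory
# aside (the destination below is a placeholder):
#   hdfs dfs -mv <reported directory> /tmp/recovery-holding/
```

Moving the directory rather than deleting it keeps the sorted output around in case it is needed for later inspection.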

> WAL recovery directory should be deleted before running LogSorter
> -
>
> Key: ACCUMULO-4851
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4851
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Critical
> Fix For: 1.9.0

[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter

2018-04-10 Thread Christopher Tubbs (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432540#comment-16432540
 ] 

Christopher Tubbs commented on ACCUMULO-4851:
-

I was interested in making a release candidate for 1.9.0 this week. Do you 
think this should be a blocker or should I proceed with a release candidate?


[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter

2018-04-09 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431479#comment-16431479
 ] 

Josh Elser commented on ACCUMULO-4851:
--

No worries. I think I know what the fix is, just thought I'd mention it to you 
on the off-chance it rang a bell.


[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter

2018-04-09 Thread Dave Marion (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431475#comment-16431475
 ] 

Dave Marion commented on ACCUMULO-4851:
---

I don't remember an issue like this. Sorry I couldn't be of any help here.


[jira] [Commented] (ACCUMULO-4851) WAL recovery directory should be deleted before running LogSorter

2018-04-09 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431437#comment-16431437
 ] 

Josh Elser commented on ACCUMULO-4851:
--

[~dlmarion], phrocker suggested that you might have run into a similar issue 
at some point :)
