[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
[ https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340216#comment-16340216 ]

Ivan Bella commented on ACCUMULO-4777:
---------------------------------------

I applied the changes to 1.7, merged them into 1.8, and subsequently into master.

> Root tablet got spammed with 1.8 million log entries
> -----------------------------------------------------
>
>                 Key: ACCUMULO-4777
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.8.1
>            Reporter: Ivan Bella
>            Assignee: Ivan Bella
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.7.4, 1.9.0, 2.0.0
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> We had a tserver that was handling accumulo.metadata tablets that somehow got
> into a loop where it created over 22K empty wal logs. There were around 70
> metadata tablets, and this resulted in around 1.8 million log entries added
> to the accumulo.root table. The only reason it stopped creating wal logs is
> that it ran out of open file handles. This took us many hours and cups of
> coffee to clean up.
> The log contained the following messages in a tight loop:
> log.TabletServerLogger INFO : Using next log hdfs://...
> tserver.TabletServer INFO : Writing log marker for hdfs://...
> tserver.TabletServer INFO : Marking hdfs://... closed
> log.DfsLogger INFO : Slow sync cost ...
> ...
> Unfortunately we did not have DEBUG turned on, so we have no debug messages.
> Tracking through the code, there are three places where the
> TabletServerLogger.close method is called:
> 1) via resetLoggers in TabletServerLogger, but nothing calls this method,
> so it is ruled out
> 2) when the log gets too large or too old, but neither of those checks should
> have been hitting here
> 3) in a loop that is executed (while (!success)) in the
> TabletServerLogger.write method. In this case, when we unsuccessfully write
> something to the wal, that one is closed and a new one is created. This loop
> will run forever until we successfully write out the entry. A
> DfsLogger.LogClosedException seems the most logical reason. This is most
> likely because a ClosedChannelException was thrown from the DfsLogger.write
> methods (around line 609 in DfsLogger).
> So the root cause was most likely hadoop related. However, in accumulo we
> probably should not be doing a tight retry loop around a hadoop failure. I
> recommend at a minimum doing some sort of exponential back off, and perhaps
> setting a limit on the number of retries, resulting in a critical tserver
> failure.
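The closing recommendation above (exponential back off plus a cap on retries) can be sketched as follows. This is only an illustration of the suggested shape, not the code from the pull request; the class name, constants, and the WalWrite interface are all made-up.

{code:java}
// Hedged sketch of the recommended remedy: retry the WAL write with capped
// exponential backoff and give up after a bounded number of attempts instead
// of looping tightly forever.
import java.io.IOException;

public class WalWriteRetry {
    private static final int MAX_RETRIES = 10;        // assumed limit
    private static final long BASE_BACKOFF_MS = 100;  // assumed base delay
    private static final long MAX_BACKOFF_MS = 60_000;

    interface WalWrite { void run() throws IOException; }

    static void writeWithBackoff(WalWrite write) throws IOException, InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                write.run();
                return;                               // success: stop retrying
            } catch (IOException e) {
                if (attempt >= MAX_RETRIES) {
                    // Termination criterion: surface a fatal error rather than
                    // spinning and spraying empty WALs into the metadata tables.
                    throw e;
                }
                long backoff = Math.min(MAX_BACKOFF_MS, BASE_BACKOFF_MS << attempt);
                Thread.sleep(backoff);                // exponential backoff, capped
            }
        }
    }
}
{code}

A caller would wrap the existing WAL append, e.g. writeWithBackoff(() -> dfsLogger.write(entry)), so a persistent hadoop-side failure ends in one critical tserver error after MAX_RETRIES attempts instead of an unbounded loop of closed and re-created WALs.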
[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
[ https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324162#comment-16324162 ]

Ivan Bella commented on ACCUMULO-4777:
---------------------------------------

So if we do an overflow check on that sequence, what would we do? Depending on a
continuous sequence anywhere seems like a process destined to eventually fail.
[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
[ https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324157#comment-16324157 ]

Keith Turner commented on ACCUMULO-4777:
-----------------------------------------

bq. I believe the sequence you are referring to is pulled from CommitSession.getWALogSeq()

Ok, I see now, thanks for the pointer. My memories are slowly coming back on this.
I think each mutation batch used to get a separate seq # in the log. The seq #
logic was moved to the tablet (for a reason I cannot remember), and it was only
incremented on minor compactions. During sorting, the seq number is only needed
to determine if a mutation was before or after a compaction. This vestigial code
was left behind when CommitSession was created.

This makes me realize that the seq # in CommitSession has no overflow check. If a
tablet does over 1 billion minor compactions on the same tablet server, it could
have strange recovery problems. I think it increments by 2 because one seq # is
for the minor compaction and the other is for mutations.
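To make the overflow concern concrete, here is a hedged sketch of what such a check could look like; CheckedWalSeq and nextSeq are hypothetical names, not anything in CommitSession.

{code:java}
// Hypothetical sketch (not Accumulo code): a commit-session style sequence
// that steps by 2 and fails fast instead of silently wrapping negative.
import java.util.concurrent.atomic.AtomicInteger;

public class CheckedWalSeq {
    private final AtomicInteger seq = new AtomicInteger(0);

    // One seq # covers the minor compaction, the next covers mutations.
    public int nextSeq() {
        int next = seq.addAndGet(2);
        if (next < 0) {
            // Trips after ~2^30 (about 1.07 billion) calls, matching the
            // "over 1 billion minor compactions" estimate above.
            throw new IllegalStateException("WAL sequence overflowed: " + next);
        }
        return next;
    }
}
{code}

Whether to throw, reset to 0, or widen the counter to a long is exactly the open question in this thread.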
[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
[ https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324086#comment-16324086 ]

Ivan Bella commented on ACCUMULO-4777:
---------------------------------------

[~kturner] I believe the sequence you are referring to is pulled from
CommitSession.getWALogSeq(), which is populated from the nextSeq int in
TabletMemory.
[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
[ https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324069#comment-16324069 ]

Ivan Bella commented on ACCUMULO-4777:
---------------------------------------

BTW, an option here is to implement the backoff mechanism against a separate
ticket so that we can get the unused sequence generation mechanism removed
immediately.
[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
[ https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324050#comment-16324050 ]

Ivan Bella commented on ACCUMULO-4777:
---------------------------------------

I updated the pull request with a backoff mechanism and termination criteria for
when writes to the WALs fail. I used a mechanism parallel to the WAL creation
backoff process.
[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
[ https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323254#comment-16323254 ]

Ivan Bella commented on ACCUMULO-4777:
---------------------------------------

[~kturner] As far as I can tell, this sequence generator value is not actually
being used anywhere. That may have been how it was used in the past, but no
longer. I created a pull request that strips it out.
[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
[ https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322739#comment-16322739 ]

Ivan Bella commented on ACCUMULO-4777:
---------------------------------------

After several days of getting my head around this code, I think I figured it out.
There is an AtomicInteger used as a sequence counter in the TabletServerLogger.
When this sequence counter wraps (goes negative), an exception is thrown. However,
in the write method where it is thrown, the code will subsequently close the
current WAL, open a new one, and recursively call itself via the defineTablet
method. This underlying call will fail for the same reason, then close the WAL
and recursively call itself again... etc, etc, etc.

So basically we have tablet servers that have been up long enough to actually
incur over 2^31 writes into the WALs. Once this happens, the server will go into
this loop. I am guessing that not many systems leave the tablet servers up long
enough for this to happen. Also, this is happening for us on tservers to which
only the accumulo.metadata table is pinned (via the HostRegexBalancer); hence it
is actually more likely to happen first on these tservers.

As far as I can tell, every path to this write method basically ignores the
sequence number returned. So what is the real purpose of this sequence generator?
I think I need the original authors of this code to tell me. My inclination is to
basically reset the sequence generator back to 0 and just continue. Any thoughts
out there on this?
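The wrap itself is easy to demonstrate in isolation. The snippet below is a standalone illustration of the 32-bit overflow, not Accumulo code:

{code:java}
// Incrementing a signed 32-bit counter past Integer.MAX_VALUE wraps it to a
// negative value, which is the condition the TabletServerLogger treats as an error.
import java.util.concurrent.atomic.AtomicInteger;

public class SeqWrapDemo {
    public static void main(String[] args) {
        AtomicInteger seq = new AtomicInteger(Integer.MAX_VALUE - 1);
        System.out.println(seq.incrementAndGet()); // 2147483647 (Integer.MAX_VALUE)
        System.out.println(seq.incrementAndGet()); // -2147483648: wrapped negative
        // After ~2^31 WAL writes on a long-lived tserver, a "seq < 0" check
        // starts failing on every write, which is what kicked off the loop.
    }
}
{code}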
[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
[ https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316356#comment-16316356 ]

Ivan Bella commented on ACCUMULO-4777:
---------------------------------------

The stack overflow is basically as follows (accumulo 1.8.1), all in
TabletServerLogger:

defineTablet line 465
write line 382
write line 356
defineTablet line 465
write line 382
write line 356
...
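For readers without the 1.8.1 source at hand, the cycle in that trace has roughly the following shape. Everything below is a stand-in sketch of the control flow, not the real TabletServerLogger methods.

{code:java}
// Simplified sketch of the mutual recursion in the trace above; the method
// bodies and helper names are stand-ins, not the actual implementation.
public class RecursionShape {
    void write(long seq) {
        if (seq < 0) {                 // wrapped sequence counter
            closeCurrentWal();
            defineTablet(nextSeq());   // re-establish the tablet in the new WAL...
        }
    }

    void defineTablet(long seq) {
        write(seq);                    // ...which writes again; the new seq is
                                       // still negative, so the cycle repeats
                                       // until the stack overflows.
    }

    void closeCurrentWal() { /* close and roll the WAL */ }
    long nextSeq() { return -1; }      // counter stays wrapped
}
{code}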
[jira] [Commented] (ACCUMULO-4777) Root tablet got spammed with 1.8 million log entries
[ https://issues.apache.org/jira/browse/ACCUMULO-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316337#comment-16316337 ]

Ivan Bella commented on ACCUMULO-4777:
---------------------------------------

This happened to us again; however, this time everything appeared to recover.
This time the loop appeared to terminate with a stack overflow error instead of
running out of file descriptors first, which may have allowed the tserver to
remedy the situation earlier. Also, we had debug on, so we are analyzing the
logs to try to determine how it gets into this state in the first place.