[jira] [Comment Edited] (HDFS-14655) SBN : Namenode crashes if one of The JN is down

Erik Krogen (JIRA) Mon, 22 Jul 2019 15:38:11 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-14655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890513#comment-16890513
 ]


Erik Krogen edited comment on HDFS-14655 at 7/22/19 10:37 PM:
--------------------------------------------------------------

Great discussion here. [~ayushtkn], particularly good call that cancelling 
alone is not sufficient to fix this issue.

I agree that calling cancel + limiting the size of the {{parallelExecutor}} 
seems to be a good approach. That executor is scoped to a single JN, so a limit 
will not affect other JNs if one is running slowly. Plus, the 
{{parallelExecutor}} is only used by {{getJournaledEdits}} and 
{{getEditLogManifest}} (others use the {{singleThreadExecutor}}) so no other 
operations besides edit log tailing should be affected. It seems we'll need to 
use {{new ThreadPoolExecutor()}} directly instead of the {{Executors}} 
convenience method.
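
To make the cancel + bounded pool idea concrete, here is a minimal standalone 
sketch. It is not the actual {{IPCLoggerChannel}} code: the class name, the 
helper names, and the thread/queue limits are all hypothetical, and it assumes 
the existing {{parallelExecutor}} is an unbounded pool, which the 
"unable to create new native thread" error suggests. The simulated call ignores 
interrupts to model the case where cancelling alone does not free the thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedExecutorSketch {

  // Hypothetical helper: a pool with a hard cap on both threads and queued
  // tasks. Submissions beyond the cap fail fast with
  // RejectedExecutionException instead of spawning another native thread.
  static ThreadPoolExecutor newBoundedExecutor(int maxThreads, int queueCapacity) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        maxThreads, maxThreads, 60L, TimeUnit.SECONDS,
        new ArrayBlockingQueue<Runnable>(queueCapacity),
        new ThreadPoolExecutor.AbortPolicy());
    pool.allowCoreThreadTimeOut(true); // let idle threads exit after 60s
    return pool;
  }

  // Simulates repeated calls to an unresponsive JN whose tasks ignore
  // interrupts. Returns {largest pool size observed, rejected submissions}.
  static int[] simulateStuckJournalNode(int attempts, int maxThreads, int queueCapacity) {
    ThreadPoolExecutor pool = newBoundedExecutor(maxThreads, queueCapacity);
    CountDownLatch release = new CountDownLatch(1);
    int rejected = 0;
    try {
      for (int i = 0; i < attempts; i++) {
        Future<?> call;
        try {
          call = pool.submit(() -> {
            boolean done = false;
            while (!done) {
              try {
                release.await(); // stands in for a hung RPC
                done = true;
              } catch (InterruptedException ignored) {
                // Deliberately swallow the interrupt: cancel() cannot stop us.
              }
            }
          });
        } catch (RejectedExecutionException e) {
          rejected++; // pool and queue are full, so no new thread is created
          continue;
        }
        try {
          call.get(10, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
          call.cancel(true); // marks the future cancelled and interrupts the worker
        } catch (ExecutionException e) {
          throw new IllegalStateException(e);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException(e);
        }
      }
      return new int[] {pool.getLargestPoolSize(), rejected};
    } finally {
      release.countDown(); // unblock the simulated calls so the workers can exit
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    int[] r = simulateStuckJournalNode(10, 2, 3);
    System.out.println("largest pool size = " + r[0] + ", rejected = " + r[1]);
  }
}
```

With 10 attempts, 2 threads, and a queue of 3, two calls occupy the threads, 
three sit in the queue (a cancelled task is not removed from the queue on its 
own), and the remaining five are rejected. The thread count stays at 2 no 
matter how long the JN is down, at the cost of the caller having to handle the 
rejection.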

You said that many {{InterruptedException}} instances are being logged. Is 
there any way we can suppress them? Where are they logged from?

edit: [~ayushtkn], I am assigning this to you for now since you seem to be 
driving the effort.


> SBN : Namenode crashes if one of The JN is down
> -----------------------------------------------
>
>                 Key: HDFS-14655
>                 URL: https://issues.apache.org/jira/browse/HDFS-14655
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Harshakiran Reddy
>            Assignee: Ayush Saxena
>            Priority: Major
>         Attachments: HDFS-14655.poc.patch
>
>
> {noformat}
> 2019-07-04 17:35:54,064 | INFO  | Logger channel (from parallel executor) to XXXXXXX/XXXXXXX | Retrying connect to server: XXXXXXX/XXXXXXX. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) | Client.java:975
> 2019-07-04 17:35:54,087 | FATAL | Edit log tailer | Unknown error encountered while tailing edits. Shutting down standby NN. | EditLogTailer.java:474
> java.lang.OutOfMemoryError: unable to create new native thread
>       at java.lang.Thread.start0(Native Method)
>       at java.lang.Thread.start(Thread.java:717)
>       at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>       at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>       at com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:440)
>       at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:56)
>       at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.getJournaledEdits(IPCLoggerChannel.java:565)
>       at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getJournaledEdits(AsyncLoggerSet.java:272)
>       at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectRpcInputStreams(QuorumJournalManager.java:533)
>       at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:508)
>       at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:275)
>       at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1681)
>       at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1714)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:307)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:360)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>       at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:483)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
> 2019-07-04 17:35:54,112 | INFO  | Edit log tailer | Exiting with status 1: java.lang.OutOfMemoryError: unable to create new native thread | ExitUtil.java:210
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
