[ https://issues.apache.org/jira/browse/YARN-11826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Syed Shameerur Rahman updated YARN-11826:
-----------------------------------------
    Description: 
 

 

*Hadoop Version 3.3.6*

During a NodeManager shutdown, it was observed that the shutdown operation hung. Multiple thread dumps showed that the NodeManager process was stuck in the LevelDB close operation.

 
{code:java}
    java.lang.Thread.State: RUNNABLE
    at org.fusesource.leveldbjni.internal.NativeDB$DBJNI.delete(Native Method)
    at org.fusesource.leveldbjni.internal.NativeDB.delete(NativeDB.java:175)
    at org.fusesource.leveldbjni.internal.JniDB.close(JniDB.java:55)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.closeStorage(NMLeveldbStateStoreService.java:201)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceStop(NMStateStoreService.java:378)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
    - locked <0x0000000085b13450> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:329)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:530)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
    - locked <0x00000000857da310> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager$1.run(NodeManager.java:543)
{code}
 

On further analysis, it was noted that LevelDB's close waits for any in-progress compaction to finish before it can complete, and the thread dump showed that a compaction was indeed running at the time, driven by the state store's compaction timer (a simplified sketch of that pattern follows the trace below):

 
{code:java}
    java.lang.Thread.State: RUNNABLE
    at org.fusesource.leveldbjni.internal.NativeDB$DBJNI.CompactRange(Native Method)
    at org.fusesource.leveldbjni.internal.NativeDB.compactRange(NativeDB.java:423)
    at org.fusesource.leveldbjni.internal.NativeDB.compactRange(NativeDB.java:418)
    at org.fusesource.leveldbjni.internal.NativeDB.compactRange(NativeDB.java:404)
    at org.fusesource.leveldbjni.internal.JniDB.compactRange(JniDB.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService$CompactionTimerTask.run(NMLeveldbStateStoreService.java:1736)
    at java.util.TimerThread.mainLoop(java.base@17.0.14/Timer.java:566)
    at java.util.TimerThread.run(java.base@17.0.14/Timer.java:516)
{code}
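For context, the compaction in the trace above comes from the state store's periodic compaction timer (the {{CompactionTimerTask}} on a {{java.util.Timer}} thread). A minimal sketch of that pattern is below; class and field names are simplified placeholders, not the actual Hadoop source. The relevant detail is that {{Timer.cancel()}} only prevents future runs, it does not interrupt a {{compactRange()}} call that is already executing in native code, so a subsequent {{db.close()}} can block behind it.
{code:java}
import java.io.IOException;
import java.util.Timer;
import java.util.TimerTask;

import org.iq80.leveldb.DB;

// Simplified sketch of the timer-driven compaction pattern visible in the
// thread dump above. Names are illustrative, not the Hadoop source code.
public class CompactionTimerSketch {
  private final Timer compactionTimer = new Timer("LevelDB compaction", true);
  private final DB db; // backed by leveldbjni's JniDB

  public CompactionTimerSketch(DB db) {
    this.db = db;
  }

  public void startCompactionTimer(long periodMs) {
    compactionTimer.schedule(new TimerTask() {
      @Override
      public void run() {
        // Compact the full key range; this blocks in native code
        // (NativeDB.CompactRange) until the compaction completes.
        db.compactRange(null, null);
      }
    }, periodMs, periodMs);
  }

  public void close() throws IOException {
    // cancel() only stops future executions -- it does NOT interrupt a
    // compaction that is already running on the timer thread.
    compactionTimer.cancel();
    // close() then waits for any in-progress compaction, which is where the
    // NodeManager shutdown appears to be stuck.
    db.close();
  }
}
{code}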
 

 
 # I checked the instance and it has enough disk space and other process are 
able to write to the disk.
 # I see the code pointers are same with latest trunk as well - I guess the 
issue will happen in Hadoop 3.4.1 or latest trunk as well

 

Is this some kind of issue with LevelDB itself? Should the NodeManager use a timed wait for the LevelDB close instead of waiting indefinitely?
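As a rough illustration of the "timed wait" idea (just a sketch under the assumption that abandoning a hung close at shutdown is acceptable, not a proposed patch), the close could be pushed onto a separate thread and bounded with a timeout:
{code:java}
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.iq80.leveldb.DB;

// Illustrative only: bound the LevelDB close with a timeout so a hung native
// compaction cannot block NodeManager shutdown forever.
public final class TimedDbClose {

  public static void closeWithTimeout(DB db, long timeout, TimeUnit unit)
      throws IOException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> closeFuture = executor.submit(() -> {
      db.close(); // may block in native code behind a pending compaction
      return null;
    });
    try {
      closeFuture.get(timeout, unit);
    } catch (TimeoutException e) {
      // Close did not finish in time: stop waiting so shutdown can proceed.
      // The native handle is released when the JVM exits; the store may need
      // recovery on the next NodeManager start.
      closeFuture.cancel(true);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } catch (Exception e) {
      throw new IOException("Error closing LevelDB state store", e);
    } finally {
      executor.shutdownNow();
    }
  }
}
{code}
Whether it is safe to give up on the close (versus risking an unclean state store) is the open question; the sketch only shows the mechanics of a bounded wait.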



> NodeManager Process Stuck In LevelDB Close Operation While Shutting Down
> ------------------------------------------------------------------------
>
>                 Key: YARN-11826
>                 URL: https://issues.apache.org/jira/browse/YARN-11826
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.3.6
>            Reporter: Syed Shameerur Rahman
>            Priority: Major
>


