[
https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512837#comment-16512837
]
Eric Yang edited comment on YARN-8414 at 6/14/18 6:28 PM:
----------------------------------------------------------
[~rohithsharma] We have 9 node managers running 1000 applications; each app has
2 containers. The NM hosting the master container goes down when ATS-HBase is
unavailable. Sometimes an NM also goes down while ATS-HBase is running: many AMs
try to talk to the NM and it runs out of file descriptors.
On a healthy node manager, netstat -tnapl looks like this:
{code}
tcp 0 0 0.0.0.0:7447 0.0.0.0:* LISTEN
3400770/java
tcp 0 0 0.0.0.0:13562 0.0.0.0:* LISTEN
3400770/java
tcp 0 0 0.0.0.0:8040 0.0.0.0:* LISTEN
3400770/java
tcp 0 0 0.0.0.0:46473 0.0.0.0:* LISTEN
3400770/java
tcp 0 0 0.0.0.0:8042 0.0.0.0:* LISTEN
3400770/java
tcp 0 0 0.0.0.0:45454 0.0.0.0:* LISTEN
3400770/java
tcp 0 0 0.0.0.0:8048 0.0.0.0:* LISTEN
3400770/java
tcp 1 0 196.26.32.105:59462 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:50312 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:49858 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.111:41044 FIN_WAIT2
3400770/java
tcp 1 0 196.26.32.105:52339 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.109:59572 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.109:33316 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:37372 196.26.32.111:44675 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:48964 FIN_WAIT2
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:48006 FIN_WAIT2
3400770/java
tcp 1 0 196.26.32.105:43014 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:46714 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:49158 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.105:44576 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:42900 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.112:58558 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:35058 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:39134 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.112:55064 FIN_WAIT2
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.111:41752 FIN_WAIT2
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:34892 FIN_WAIT2
3400770/java
tcp 1 0 196.26.32.105:41856 196.26.32.106:33915 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.109:56932 FIN_WAIT2
3400770/java
tcp 1 0 196.26.32.105:51486 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:35686 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:59954 196.26.32.106:33915 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:37614 196.26.32.104:43939 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:47254 196.26.32.104:43939 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:34356 FIN_WAIT2
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:36030 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:50552 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:50826 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:39836 196.26.32.112:45839 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:47736 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.111:41584 FIN_WAIT2
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.105:51144 FIN_WAIT2
3400770/java
tcp 1 0 196.26.32.105:47411 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:39896 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:36704 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:49854 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:36246 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:36032 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:56782 196.26.32.109:35169 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:41272 196.26.32.112:17020 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:59512 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:52320 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:43803 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.111:41980 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:41118 196.26.32.111:44675 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.109:33690 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:47856 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:39428 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:41128 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:48264 FIN_WAIT2
3400770/java
tcp 1 0 196.26.32.105:33813 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:43250 196.26.32.111:44675 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:50558 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.105:58766 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:38632 196.26.32.111:44675 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:52362 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:48720 FIN_WAIT2
3400770/java
tcp 1 0 196.26.32.105:60629 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:59448 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:35158 FIN_WAIT2
3400770/java
tcp 1 0 196.26.32.105:58251 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:32900 196.26.32.111:44675 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:47098 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.105:42236 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:36702 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:38479 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:34711 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:46894 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:48698 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:37716 196.26.32.104:43939 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.105:51780 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:40948 196.26.32.111:44675 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:40582 196.26.32.111:44675 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:36540 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:32936 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:49620 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:40782 196.26.32.111:44675 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:56127 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:55422 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:54392 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:49724 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:51580 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:36536 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:36254 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:59050 196.26.32.109:35169 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:56668 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.109:59410 FIN_WAIT2
3400770/java
tcp 0 0 196.26.32.105:42604 196.26.32.101:8031 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:43488 196.26.32.111:44675 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:47036 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.105:46949 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:43440 196.26.32.111:44675 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.105:32820 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:55650 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.109:59570 FIN_WAIT2
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.109:33688 ESTABLISHED
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:35682 FIN_WAIT2
3400770/java
tcp 1 0 196.26.32.105:54020 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.112:57912 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:38514 196.26.32.111:44675 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:38022 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:46228 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 1 0 196.26.32.105:45375 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:35334 FIN_WAIT2
3400770/java
tcp 1 0 196.26.32.105:59081 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.108:34680 FIN_WAIT2
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.105:34822 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:43142 196.26.32.105:46473 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:50160 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:51678 196.26.32.104:43939 CLOSE_WAIT
3400770/java
{code}
This full list has 386 entries. On an unhealthy node manager, the number
reaches 20,000 before the NM crashes. We are losing 1 node manager every 12
hours even with ATS-HBase running.
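One way to compare a healthy and an unhealthy node manager is to tally sockets
per TCP state rather than reading the raw listing. The following is a minimal
sketch in Python, assuming the netstat -tnapl output has been saved as text.
The helper name count_tcp_states is hypothetical and not part of any Hadoop
tooling. It tolerates the wrapped PID/program lines seen in the listing above.

```python
from collections import Counter

# TCP states as printed by netstat; other fields on a line are ignored.
TCP_STATES = {
    "LISTEN", "ESTABLISHED", "CLOSE_WAIT", "FIN_WAIT1", "FIN_WAIT2",
    "TIME_WAIT", "SYN_SENT", "SYN_RECV", "LAST_ACK", "CLOSING",
}


def count_tcp_states(netstat_text):
    """Tally sockets per TCP state from saved `netstat -tnapl` text."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Only lines whose first column is a tcp protocol describe a socket;
        # wrapped PID/program lines such as "3400770/java" are skipped.
        if not fields or not fields[0].startswith("tcp"):
            continue
        for field in fields:
            if field in TCP_STATES:
                counts[field] += 1
                break
    return counts


# A few rows copied from the healthy listing, used only as sample input.
sample = """\
tcp 0 0 0.0.0.0:8042 0.0.0.0:* LISTEN
3400770/java
tcp 1 0 196.26.32.105:59462 196.26.32.104:43939 CLOSE_WAIT
3400770/java
tcp 0 0 196.26.32.105:46473 196.26.32.106:50312 ESTABLISHED
3400770/java
tcp 1 0 196.26.32.105:52339 196.26.32.105:46473 CLOSE_WAIT
3400770/java
"""

for state, n in count_tcp_states(sample).most_common():
    print(state, n)
```

A CLOSE_WAIT count that keeps growing would suggest the process is not closing
sockets after the peer has closed its side, which is consistent with running
out of file descriptors.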
> Nodemanager crashes soon if ATSv2 HBase is either down or absent
> ----------------------------------------------------------------
>
> Key: YARN-8414
> URL: https://issues.apache.org/jira/browse/YARN-8414
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Affects Versions: 3.1.0
> Reporter: Eric Yang
> Priority: Critical
>
> Test cluster has 1000 apps running, and a user triggers capacity scheduler
> queue changes. This crashes all node managers. It looks like the node managers
> encounter too many open files while aggregating logs for containers:
> {code}
> 2018-06-07 21:17:59,307 WARN server.AbstractConnector
> (AbstractConnector.java:handleAcceptFailure(544)) -
> java.io.IOException: Too many open files
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at
> org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:371)
> at
> org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:17:59,758 WARN util.SysInfoLinux
> (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo;
> can't determine memory settings
> 2018-06-07 21:17:59,758 WARN util.SysInfoLinux
> (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo;
> can't determine memory settings
> 2018-06-07 21:18:00,842 WARN client.ConnectionUtils
> (ConnectionUtils.java:getStubKey(236)) - Can not resolve host12.example.com,
> please check your network
> java.net.UnknownHostException: host1.example.com: System error
> at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
> at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
> at
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
> at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
> at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> at java.net.InetAddress.getByName(InetAddress.java:1076)
> at
> org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1189)
> at
> org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:111)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
> at
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Timeline service has thousands of exceptions:
> {code}
> 2018-06-07 21:18:34,182 ERROR client.AsyncProcess
> (AsyncProcess.java:submit(291)) - Failed to get region location
> java.io.InterruptedIOException
> at
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:265)
> at
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
> at
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
> at
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
> at
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236)
> at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:307)
> at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:212)
> at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:170)
> at
> org.apache.hadoop.yarn.server.timelineservice.storage.common.TypedBufferedMutator.mutate(TypedBufferedMutator.java:54)
> at
> org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:153)
> at
> org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:107)
> at
> org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.store(HBaseTimelineWriterImpl.java:395)
> at
> org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.write(HBaseTimelineWriterImpl.java:198)
> at
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.writeTimelineEntities(TimelineCollector.java:164)
> at
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntitiesAsync(TimelineCollector.java:196)
> at
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorWebService.putEntities(TimelineCollectorWebService.java:173)
> at sun.reflect.GeneratedMethodAccessor145.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
> at
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
> at
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
> at
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
> at
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
> at
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
> at
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
> at
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
> at
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
> at
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:304)
> at
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:18:36,266 INFO retry.RetryInvocationHandler
> (RetryInvocationHandler.java:log(411)) - java.net.UnknownHostException:
> Invalid host name: local host is: (unknown); destination host is:
> "host1.example.com":8020; java.net.UnknownHostException; For more details
> see: http://wiki.apache.org/hadoop/UnknownHost, while invoking
> ClientNamenodeProtocolTranslatorPB.getServerDefaults over
> host1.example.com:8020 after 10 failover attempts. Trying to failover after
> sleeping for 9634ms.
> 2018-06-07 21:18:36,612 WARN storage.HBaseTimelineWriterImpl
> (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of:
> flowName=null appId=application_1528316765723_0030 userId=csingh
> clusterId=yarn-cluster . Not proceeding with writing to hbase
> 2018-06-07 21:18:38,396 INFO client.RpcRetryingCallerImpl
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=6,
> retries=6, started=4213 ms ago, cancelled=false, msg=Call to
> host1.example.com/142.26.32.112:17020 failed on connection exception:
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> Connection refused: host12.example.com/142.26.32.112:17020, details=row
> 'prod.timelineservice.entity,csingh!yarn-cluster!scale-1-182!^?���(�^@<!^?���)8��^?���!COMPONENT!^@^@^@^@^@^@^@^@!simple,99999999999999'
> on table 'hbase:meta' at region=hbase:meta,,1.1588230740,
> hostname=host12.example.com,17020,1528302866813, seqNum=-1
> 2018-06-07 21:18:38,662 ERROR util.ShutdownHookManager
> (ShutdownHookManager.java:run(82)) - ShutdownHookManger shutdown forcefully
> {code}
> Nodes were temporarily unable to resolve hostname to IP mapping.
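The "Too many open files" errors in the log above indicate that the process
reached its open-file limit. To watch how close a process is to that limit, a
minimal sketch in Python follows. It assumes a Linux host with /proc mounted.
The names parse_nofile_limit and fd_usage are hypothetical helpers and are not
part of YARN.

```python
import os


def parse_nofile_limit(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the row is missing or the limit is not numeric
    (for example 'unlimited').
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: Max open files <soft> <hard> <units>
            soft = line.split()[3]
            return int(soft) if soft.isdigit() else None
    return None


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for a PID on Linux."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as f:
        return open_fds, parse_nofile_limit(f.read())


# Demonstrate on the current process, only where /proc is available.
if os.path.isdir("/proc/self/fd"):
    used, limit = fd_usage(os.getpid())
    print(f"open fds: {used}, soft limit: {limit}")
```

To inspect a node manager, pass its PID to fd_usage; reading another user's
/proc/<pid>/fd typically requires running as that user or as root.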