[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840096#comment-16840096 ]
kyungwan nam commented on YARN-9521:
------------------------------------
I think the cause of this problem is as follows.
1. _fs_ is set by calling FileSystem.get() in
SystemServiceManagerImpl.serviceInit
2. RMAppImpl.appAdminClientCleanUp is called from RMAppImpl.FinalTransition
if an APP_COMPLETED event occurs during RMStateStore recovery
{code}
static void appAdminClientCleanUp(RMAppImpl app) {
  try {
    AppAdminClient client = AppAdminClient.createAppAdminClient(
        app.applicationType, app.conf);
    int result = client.actionCleanUp(app.name, app.user);
    ...
{code}
ApiServiceClient.actionCleanUp
{code}
@Override
public int actionCleanUp(String appName, String userName)
    throws IOException, YarnException {
  ServiceClient sc = new ServiceClient();
  sc.init(getConfig());
  sc.start();
  int result = sc.actionCleanUp(appName, userName);
  sc.close();
  return result;
}
{code}
The ServiceClient instance obtains a FileSystem by calling FileSystem.get() at
initialization time, but this may be a cached instance. That cached FileSystem
is then closed by _sc.close()_.
3. scanForUserServices is called in SystemServiceManagerImpl.serviceStart, but
_fs_ has already been closed (see the sketch below).
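To make the cache interaction concrete: FileSystem.get() returns a JVM-wide
cached instance keyed by the URI scheme/authority and the calling UGI, so
SystemServiceManagerImpl and the ServiceClient created inside actionCleanUp end
up holding the same object. A minimal standalone sketch (not RM code; it
assumes fs.defaultFS points at an HDFS namenode):
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsCacheDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // FileSystem.get() consults the JVM-wide cache, so two callers with the
    // same scheme/authority and UGI get back the SAME object.
    FileSystem fs1 = FileSystem.get(conf);  // e.g. the fs held by SystemServiceManagerImpl
    FileSystem fs2 = FileSystem.get(conf);  // e.g. the fs obtained inside ServiceClient
    System.out.println("same instance: " + (fs1 == fs2));  // prints: same instance: true

    // Closing the second reference closes the shared, cached instance...
    fs2.close();

    // ...so any later use through the first reference fails with
    // "java.io.IOException: Filesystem closed" (thrown by DFSClient.checkOpen).
    fs1.listStatus(new Path("/services"));
  }
}
{code}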
RM log
{code}
// 1. SystemServiceManagerImpl.serviceInit
//
2019-05-15 10:27:59,445 DEBUG service.AbstractService
(AbstractService.java:enterState(443)) - Service:
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl entered state
INITED
2019-05-15 10:27:59,446 INFO client.SystemServiceManagerImpl
(SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory is
configured to /services
2019-05-15 10:27:59,472 DEBUG fs.FileSystem
(FileSystem.java:loadFileSystems(3209)) - Loading filesystems
2019-05-15 10:27:59,483 DEBUG fs.FileSystem
(FileSystem.java:loadFileSystems(3221)) - file:// = class
org.apache.hadoop.fs.LocalFileSystem from
/usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,488 DEBUG fs.FileSystem
(FileSystem.java:loadFileSystems(3221)) - viewfs:// = class
org.apache.hadoop.fs.viewfs.ViewFileSystem from
/usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,491 DEBUG fs.FileSystem
(FileSystem.java:loadFileSystems(3221)) - har:// = class
org.apache.hadoop.fs.HarFileSystem from
/usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,492 DEBUG fs.FileSystem
(FileSystem.java:loadFileSystems(3221)) - http:// = class
org.apache.hadoop.fs.http.HttpFileSystem from
/usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,493 DEBUG fs.FileSystem
(FileSystem.java:loadFileSystems(3221)) - https:// = class
org.apache.hadoop.fs.http.HttpsFileSystem from
/usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,503 DEBUG fs.FileSystem
(FileSystem.java:loadFileSystems(3221)) - hdfs:// = class
org.apache.hadoop.hdfs.DistributedFileSystem from
/usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,511 DEBUG fs.FileSystem
(FileSystem.java:loadFileSystems(3221)) - webhdfs:// = class
org.apache.hadoop.hdfs.web.WebHdfsFileSystem from
/usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,512 DEBUG fs.FileSystem
(FileSystem.java:loadFileSystems(3221)) - swebhdfs:// = class
org.apache.hadoop.hdfs.web.SWebHdfsFileSystem from
/usr/hdp/3.1.0.0-78/hadoop-hdfs/hadoop-hdfs-client-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,514 DEBUG fs.FileSystem
(FileSystem.java:loadFileSystems(3221)) - s3n:// = class
org.apache.hadoop.fs.s3native.NativeS3FileSystem from
/usr/hdp/3.1.0.0-78/hadoop-mapreduce/hadoop-aws-3.1.1.3.1.2.3.1.0.0-78.jar
2019-05-15 10:27:59,514 DEBUG fs.FileSystem
(FileSystem.java:getFileSystemClass(3264)) - Looking for FS supporting hdfs
2019-05-15 10:27:59,514 DEBUG fs.FileSystem
(FileSystem.java:getFileSystemClass(3268)) - looking for configuration option
fs.hdfs.impl
2019-05-15 10:27:59,528 DEBUG fs.FileSystem
(FileSystem.java:getFileSystemClass(3275)) - Looking in service filesystems for
implementation class
2019-05-15 10:27:59,528 DEBUG fs.FileSystem
(FileSystem.java:getFileSystemClass(3284)) - FS for hdfs is class
org.apache.hadoop.hdfs.DistributedFileSystem
// 2. APP_COMPLETED event occurs
//
2019-05-15 10:28:02,931 DEBUG rmapp.RMAppImpl (RMAppImpl.java:handle(895)) -
Processing event for application_1556612756829_0001 of type RECOVER
2019-05-15 10:28:02,931 DEBUG rmapp.RMAppImpl (RMAppImpl.java:recover(933)) -
Recovering app: application_1556612756829_0001 with 2 attempts and final state
= FAILED
2019-05-15 10:28:02,931 DEBUG attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:<init>(544)) - yarn.app.attempt.diagnostics.limit.kc : 64
2019-05-15 10:28:02,932 DEBUG attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:recover(969)) - Recovering attempt:
appattempt_1556612756829_0001_000001 with final state = FAILED
2019-05-15 10:28:02,932 DEBUG attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:<init>(544)) - yarn.app.attempt.diagnostics.limit.kc : 64
2019-05-15 10:28:02,932 DEBUG attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:recover(969)) - Recovering attempt:
appattempt_1556612756829_0001_000002 with final state = FAILED
2019-05-15 10:28:02,932 DEBUG attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:handle(910)) - Processing event for
appattempt_1556612756829_0001_000001 of type RECOVER
2019-05-15 10:28:02,933 DEBUG attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:handle(930)) - appattempt_1556612756829_0001_000001
State change from NEW to FAILED on event = RECOVER
2019-05-15 10:28:02,933 DEBUG attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:handle(910)) - Processing event for
appattempt_1556612756829_0001_000002 of type RECOVER
2019-05-15 10:28:02,933 DEBUG attempt.RMAppAttemptImpl
(RMAppAttemptImpl.java:handle(930)) - appattempt_1556612756829_0001_000002
State change from NEW to FAILED on event = RECOVER
2019-05-15 10:28:02,933 DEBUG event.AsyncDispatcher
(AsyncDispatcher.java:dispatch(188)) - Dispatching the event
org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEvent.EventType:
APP_COMPLETED
2019-05-15 10:28:02,933 DEBUG resourcemanager.RMAppManager
(RMAppManager.java:handle(573)) - RMAppManager processing event for
application_1556612756829_0001 of type APP_COMPLETED
2019-05-15 10:28:02,934 DEBUG service.AbstractService
(AbstractService.java:enterState(443)) - Service:
org.apache.hadoop.yarn.client.api.AppAdminClient entered state INITED
2019-05-15 10:28:02,934 DEBUG service.CompositeService
(CompositeService.java:addService(74)) - Adding service
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl
2019-05-15 10:28:02,934 DEBUG service.CompositeService
(CompositeService.java:serviceInit(105)) -
org.apache.hadoop.yarn.client.api.AppAdminClient: initing services, size=1
2019-05-15 10:28:02,934 DEBUG service.AbstractService
(AbstractService.java:enterState(443)) - Service:
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl entered state INITED
2019-05-15 10:28:02,934 DEBUG service.CompositeService
(CompositeService.java:serviceStart(116)) -
org.apache.hadoop.yarn.client.api.AppAdminClient: starting services, size=1
2019-05-15 10:28:02,935 INFO client.RMProxy
(RMProxy.java:newProxyInstance(133)) - Connecting to ResourceManager at
test.io/10.1.1.1:8050
2019-05-15 10:28:02,935 DEBUG security.UserGroupInformation
(UserGroupInformation.java:logPrivilegedAction(1756)) - PrivilegedAction
as:yarn (auth:SIMPLE)
from:org.apache.hadoop.yarn.client.RMProxy.getProxy(RMProxy.java:147)
2019-05-15 10:28:02,935 DEBUG ipc.YarnRPC (YarnRPC.java:create(59)) - Creating
YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
2019-05-15 10:28:02,935 DEBUG ipc.HadoopYarnProtoRPC
(HadoopYarnProtoRPC.java:getProxy(47)) - Creating a HadoopYarnProtoRpc proxy
for protocol interface org.apache.hadoop.yarn.api.ApplicationClientProtocol
2019-05-15 10:28:02,936 DEBUG ipc.Client (ClientCache.java:getClient(63)) -
getting client out of cache: org.apache.hadoop.ipc.Client@560cbf1a
2019-05-15 10:28:02,936 DEBUG service.AbstractService
(AbstractService.java:start(197)) - Service
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl is started
2019-05-15 10:28:02,936 DEBUG service.AbstractService
(AbstractService.java:start(197)) - Service
org.apache.hadoop.yarn.client.api.AppAdminClient is started
2019-05-15 10:28:02,936 DEBUG service.AbstractService
(AbstractService.java:enterState(443)) - Service:
org.apache.hadoop.yarn.client.api.AppAdminClient entered state INITED
2019-05-15 10:28:02,936 DEBUG fs.FileSystem
(FileSystem.java:getFileSystemClass(3264)) - Looking for FS supporting hdfs
2019-05-15 10:28:02,936 DEBUG fs.FileSystem
(FileSystem.java:getFileSystemClass(3268)) - looking for configuration option
fs.hdfs.impl
2019-05-15 10:28:02,937 DEBUG fs.FileSystem
(FileSystem.java:getFileSystemClass(3275)) - Looking in service filesystems for
implementation class
2019-05-15 10:28:02,937 DEBUG fs.FileSystem
(FileSystem.java:getFileSystemClass(3284)) - FS for hdfs is class
org.apache.hadoop.hdfs.DistributedFileSystem
2019-05-15 10:28:02,937 DEBUG impl.DfsClientConf
(DfsClientConf.java:<init>(673)) - dfs.client.use.legacy.blockreader.local =
false
2019-05-15 10:28:02,937 DEBUG impl.DfsClientConf
(DfsClientConf.java:<init>(675)) - dfs.client.read.shortcircuit = true
2019-05-15 10:28:02,937 DEBUG impl.DfsClientConf
(DfsClientConf.java:<init>(677)) - dfs.client.domain.socket.data.traffic = false
2019-05-15 10:28:02,937 DEBUG impl.DfsClientConf
(DfsClientConf.java:<init>(679)) - dfs.domain.socket.path =
/var/lib/hadoop-hdfs/dn_socket
2019-05-15 10:28:02,938 DEBUG hdfs.DFSClient (DFSClient.java:<init>(316)) -
Sets dfs.client.block.write.replace-datanode-on-failure.min-replication to 0
// 3. SystemServiceManagerImpl.serviceStart
//
2019-05-15 10:28:03,424 INFO client.SystemServiceManagerImpl
(SystemServiceManagerImpl.java:scanForUserServices(281)) - Scan for launch type
on /services
2019-05-15 10:28:03,427 DEBUG service.AbstractService
(AbstractService.java:noteFailure(257)) - noteFailure {}java.io.IOException:
Filesystem closed
2019-05-15 10:28:03,427 INFO service.AbstractService
(AbstractService.java:noteFailure(267)) - Service
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl failed in state
STARTED
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
{code}
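As a side note, one way to avoid the shared-instance problem on the client side
would be to use FileSystem.newInstance(), which bypasses the cache so that
close() only affects the caller's own instance. This is just a sketch of the
idea, not necessarily what the attached patch does:
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrivateFsExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // newInstance() always creates a fresh FileSystem object instead of
    // returning the shared one from the cache, so closing it here cannot
    // invalidate the instance other components obtained via FileSystem.get().
    FileSystem privateFs = FileSystem.newInstance(conf);
    try {
      privateFs.listStatus(new Path("/services"));
    } finally {
      privateFs.close();  // only this private instance is closed
    }
  }
}
{code}
Disabling the HDFS FileSystem cache via fs.hdfs.impl.disable.cache=true (as
mentioned in the issue description below) has a similar effect, at the cost of
creating a new FileSystem object for every FileSystem.get() call in the RM.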
> RM failed to start due to system services
> -----------------------------------------
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.1.2
> Reporter: kyungwan nam
> Priority: Major
> Attachments: YARN-9521.001.patch
>
>
> When starting the RM, listing the system services directory failed as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory
> is configured to /services
> 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO service.AbstractService
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException:
> Filesystem closed
> at
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1217)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1233)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1200)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> at
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 13 more
> {code}
> It looks like this is caused by the use of the FileSystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to
> yarn-site.