[ https://issues.apache.org/jira/browse/YARN-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299101#comment-14299101 ]
Zhijie Shen commented on YARN-2808:
-----------------------------------

I think the patch should work, though it does not guarantee that all the containers will be returned for a running attempt, because of a race condition: a container may have finished and had its info pushed to the timeline server, but the info is not yet persisted. Anyway, it will be a good improvement in terms of user experience.

Some minor comments:

1. Is it possible to improve the performance? An application can be big enough to have hundreds of containers, and it's not efficient to loop through them many times. Maybe run through them once, and put the ids in a hash set for the check?
{code}
for (int i = 0; i < containersFromHistoryServer.size(); i++) {
  if (containersFromHistoryServer.get(i).getContainerId()
      .equals(tmp.getContainerId())) {
    containersFromHistoryServer.remove(i);
    //Remove containers from AHS as container from RM will have latest
    //information
    break;
  }
}
{code}

2. In the test, can we add a case where the running container is in the RM and is also in the timeline server (because part of its information has been written there), and check that the container info cached in the RM is used instead of the partial info in the timeline server?

> yarn client tool can not list app_attempt's container info correctly
> --------------------------------------------------------------------
>
>                 Key: YARN-2808
>                 URL: https://issues.apache.org/jira/browse/YARN-2808
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.6.0
>            Reporter: Gordon Wang
>            Assignee: Naganarasimha G R
>         Attachments: YARN-2808.20150126-1.patch, YARN-2808.20150130-1.patch
>
>
> When the timeline server is enabled, the yarn client can not list the container info for an application attempt correctly.
> Here are the steps to reproduce.
> # enable the yarn timeline server
> # submit a MR job
> # after the job is finished, use the yarn client to list the container info of the app attempt.
> Then, since the RM has cached the application's attempt info, the output shows
> {noformat}
> [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_000001
> 14/11/05 01:19:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/11/05 01:19:15 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
> 14/11/05 01:19:16 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
> 14/11/05 01:19:16 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
> Total number of containers :0
> Container-Id    Start Time    Finish Time    State    Host    LOG-URL
> {noformat}
> But if the RM is restarted, the client can fetch the container info from the timeline server correctly.
> {noformat}
> [hadoop@localhost hadoop-3.0.0-SNAPSHOT]$ ./bin/yarn container -list appattempt_1415168250217_0001_000001
> 14/11/05 01:21:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/11/05 01:21:06 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
> 14/11/05 01:21:06 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
> 14/11/05 01:21:06 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
> Total number of containers :4
> Container-Id    Start Time    Finish Time    State    Host    LOG-URL
> container_1415168250217_0001_01_000001    1415168318376    1415168349896    COMPLETE    localhost.localdomain:47024    http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_000001/container_1415168250217_0001_01_000001/hadoop
> container_1415168250217_0001_01_000002    1415168326399    1415168334858    COMPLETE    localhost.localdomain:47024    http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_000002/container_1415168250217_0001_01_000002/hadoop
> container_1415168250217_0001_01_000003    1415168326400    1415168335277    COMPLETE    localhost.localdomain:47024    http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_000003/container_1415168250217_0001_01_000003/hadoop
> container_1415168250217_0001_01_000004    1415168335825    1415168343873    COMPLETE    localhost.localdomain:47024    http://0.0.0.0:8188/applicationhistory/logs/localhost.localdomain:47024/container_1415168250217_0001_01_000004/container_1415168250217_0001_01_000004/hadoop
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
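
Regarding item 1 of the review comment above (run through the containers once and keep the ids in a hash set), the idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `ContainerMerge`, `Report`, and `merge` are hypothetical names, `Report` is a simplified stand-in for YARN's container report type, and container ids are plain strings here. It is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ContainerMerge {

    // Hypothetical stand-in for a container report; only what the sketch needs.
    static final class Report {
        final String containerId;
        final String source; // "RM" or "AHS", kept only to show which copy wins

        Report(String containerId, String source) {
            this.containerId = containerId;
            this.source = source;
        }
    }

    // Merge both lists, preferring the RM copy when an id appears in both.
    // One pass over each list; id lookups go through a HashSet instead of
    // rescanning the history-server list for every RM container.
    static List<Report> merge(List<Report> fromRM, List<Report> fromHistoryServer) {
        Set<String> rmIds = new HashSet<>();
        for (Report r : fromRM) {
            rmIds.add(r.containerId);
        }
        List<Report> merged = new ArrayList<>(fromRM);
        for (Report r : fromHistoryServer) {
            // Skip history-server entries the RM already reports,
            // since the RM copy is assumed to hold the latest information.
            if (!rmIds.contains(r.containerId)) {
                merged.add(r);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Report> rm = Arrays.asList(new Report("container_01", "RM"));
        List<Report> ahs = Arrays.asList(
            new Report("container_01", "AHS"),
            new Report("container_02", "AHS"));
        for (Report r : merge(rm, ahs)) {
            System.out.println(r.containerId + " from " + r.source);
        }
    }
}
```

Compared with the nested loop quoted in the comment, this does a constant-time lookup per history-server container, at the cost of one extra set of ids. The same structure also covers the test case asked for in item 2: a container present in both sources should come back once, with the RM copy.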