Hi, Yarn Developers, I am writing to report a strange case I meet here. I was using libyarn( https://github.com/apache/incubator-hawq/tree/master/depends/libyarn), a kind of C++ client library to communicate with Hadoop Yarn. It acts as a "Unmanaged Application Master", doing things such as register application, request to allocate/release containers from Yarn, activate containers, etc(sample app is in /libyarn/sample/c_client_main.c).
What is strange to me is, when configure timeline server and yarn.resourcemanager.system-metrics-publisher.enabled is enable in yarn-site.xml. Something strange happens. If the property "yarn.resourcemanager.system-metrics-publisher.enabled" is not enable and no timeline server is configured, things goes well. Once Yarn RM gets a heartbeat from my application, the progress becomes 50%, 50% is hard coded in allocate request. [gpadmin@master ~]$ yarn application -list 15/12/09 08:12:54 INFO impl.TimelineClientImpl: Timeline service address: http://master:8188/ws/v1/timeline/ Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1 Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL application_1449630430655_0007 libyarn YARN gpadmin default RUNNING UNDEFINED 50% url However if timeline server is configured, and "yarn.resourcemanager.system-metrics-publisher.enabled" is enable, Even my application can allocate/activate/release containers, the progress of my application is ALWAYS 0%! I've check the log of Yarn resource manager, I find a NullPointerException in log. I've google this exception with such callstack. there is a question, which is similar to mine. http://zh.hortonworks.com/community/forums/topic/error-in-handling-event-type-registered-for-applicationattempt/ I can see below information in log in log file: 2015-12-10 02:52:19,025 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1449744734026_0001 AttemptId: appattempt_1449744734026_0001_000001 MasterContainer: null ... 2015-12-10 02:52:19,946 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type REGISTERED for applicationAttempt application_1449744734026_0001 java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.appAttemptRegistered(SystemMetricsPublisher.java:145) Below is related Yarn codes: private void storeAttempt() { // store attempt data in a non-blocking manner to prevent dispatcher // thread starvation and wait for state to be saved LOG.info("Storing attempt: AppId: " + getAppAttemptId().getApplicationId() + " AttemptId: " + getAppAttemptId() + " MasterContainer: " + masterContainer); rmContext.getStateStore().storeNewApplicationAttempt(this); } public void appAttemptRegistered(RMAppAttempt appAttempt, long registeredTime) { if (publishSystemMetrics) { dispatcher.getEventHandler().handle( new AppAttemptRegisteredEvent( appAttempt.getAppAttemptId(), appAttempt.getHost(), appAttempt.getRpcPort(), appAttempt.getTrackingUrl(), appAttempt.getOriginalTrackingUrl(), appAttempt.getMasterContainer().getId(), registeredTime)); } } I guess since libyarn is a "unmanaged" application master, there is no container allocated for it, so MasterContainer is null. But when Yarn register this application into SystemMetricsPublisher, it requires a container and its id. That's why this null exception happens. In a word, if a unmanaged AM tries to register in Yarn, when timeline server is configured and "yarn.resourcemanager.system-metrics-publisher.enabled" is enable, a java NullPointerException occurs in Yarn. I am not sure if it's my app's fault or Yarn's bug. Would you please help me take a look at it? Thanks, Wen 2015-12-10 02:52:18,842 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 1 2015-12-10 02:52:18,873 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1 submitted by user gpadmin 2015-12-10 02:52:18,874 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1449744734026_0001 2015-12-10 02:52:18,874 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=gpadmin IP=10.32.34.218 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1449744734026_0001 2015-12-10 02:52:18,883 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1449744734026_0001 State change from NEW to NEW_SAVING 2015-12-10 02:52:18,884 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1449744734026_0001 2015-12-10 02:52:18,908 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1449744734026_0001 State change from NEW_SAVING to SUBMITTED 2015-12-10 02:52:18,910 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application added - appId: application_1449744734026_0001 user: gpadmin leaf-queue of parent: root #applications: 1 2015-12-10 02:52:18,911 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Accepted application application_1449744734026_0001 from user: gpadmin, in queue: default 2015-12-10 02:52:18,969 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1449744734026_0001 State change from SUBMITTED to ACCEPTED 2015-12-10 02:52:19,004 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1449744734026_0001_000001 2015-12-10 02:52:19,006 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1449744734026_0001_000001 State change from NEW to SUBMITTED 2015-12-10 02:52:19,023 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent is insufficient to start a single application in queue, it is likely set too low. skipping enforcement to allow at least one application to start 2015-12-10 02:52:19,024 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent is insufficient to start a single application in queue for user, it is likely set too low. skipping enforcement to allow at least one application to start 2015-12-10 02:52:19,024 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application application_1449744734026_0001 from user: gpadmin activated in queue: default 2015-12-10 02:52:19,024 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application added - appId: application_1449744734026_0001 user: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue$User@6a7edb15, leaf-queue: default #user-pending-applications: 0 #user-active-applications: 1 #queue-pending-applications: 0 #queue-active-applications: 1 2015-12-10 02:52:19,024 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Added Application Attempt appattempt_1449744734026_0001_000001 to scheduler from user gpadmin in queue default 2015-12-10 02:52:19,025 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1449744734026_0001 AttemptId: appattempt_1449744734026_0001_000001 MasterContainer: null 2015-12-10 02:52:19,039 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1449744734026_0001_000001 State change from SUBMITTED to LAUNCHED_UNMANAGED_SAVING 2015-12-10 02:52:19,039 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1449744734026_0001_000001 2015-12-10 02:52:19,042 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1449744734026_0001_000001 2015-12-10 02:52:19,046 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1449744734026_0001_000001 State change from LAUNCHED_UNMANAGED_SAVING to LAUNCHED 2015-12-10 02:52:19,936 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1449744734026_0001_000001 (auth:SIMPLE) 2015-12-10 02:52:19,944 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1449744734026_0001_000001 2015-12-10 02:52:19,945 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=gpadmin IP=10.32.34.218 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1449744734026_0001 APPATTEMPTID=appattempt_1449744734026_0001_000001 2015-12-10 02:52:19,946 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type REGISTERED for applicationAttempt application_1449744734026_0001 java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.appAttemptRegistered(SystemMetricsPublisher.java:145) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMRegisteredTransition.transition(RMAppAttemptImpl.java:1380) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:850) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:831) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-12-10 02:52:19,947 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1449744734026_0001 State change from ACCEPTED to RUNNING 2015-12-10 02:52:19,963 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:850) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:831) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108) at java.lang.Thread.run(Thread.java:745) 2015-12-10 02:52:38,864 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Node not found resyncing test2:9983 2015-12-10 02:52:39,884 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved test2 to /default-rack 2015-12-10 02:52:39,886 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node test2(cmPort: 9983 httpPort: 8042) registered with capability: <memory:12288, vCores:6>, assigned nodeId test2:9983 2015-12-10 02:52:39,886 WARN org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Cannot get RMApp by appId=application_1449655982302_0004, just added it to finishedApplications list for cleanup 2015-12-10 02:52:39,886 WARN org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Cannot get RMApp by appId=application_1449743958580_0001, just added it to finishedApplications list for cleanup 2015-12-10 02:52:39,889 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: test2:9983 Node Transitioned from NEW to RUNNING
