[ 
https://issues.apache.org/jira/browse/YARN-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541722#comment-14541722
 ] 

Junping Du commented on YARN-3634:
----------------------------------

Thanks [~sjlee0] for reporting the issue and delivering a patch to fix it. The 
patch looks mostly good to me, with only one minor issue:
{code}
+    if (nmCollectorService == null) {
+      synchronized (this) {
+        Configuration conf = getConfig();
+        InetSocketAddress nmCollectorServiceAddress = conf.getSocketAddr(
+            YarnConfiguration.NM_BIND_HOST,
+            YarnConfiguration.NM_COLLECTOR_SERVICE_ADDRESS,
+            YarnConfiguration.DEFAULT_NM_COLLECTOR_SERVICE_ADDRESS,
+            YarnConfiguration.DEFAULT_NM_COLLECTOR_SERVICE_PORT);
+        LOG.info("nmCollectorServiceAddress: " + nmCollectorServiceAddress);
+        final YarnRPC rpc = YarnRPC.create(conf);
+
+        // TODO Security settings.
+        nmCollectorService = (CollectorNodemanagerProtocol) rpc.getProxy(
+            CollectorNodemanagerProtocol.class,
+            nmCollectorServiceAddress, conf);
+      }
+    }
{code}
The synchronized block looks unnecessary: this is the only place that updates 
nmCollectorService, and it is called from serviceStart(), which is invoked by a 
single thread only. A race could still occur with other reader threads, but 
given that the writer is always a single thread and this patch already marks 
nmCollectorService as volatile, it should be safe to remove the synchronized 
block.
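
To illustrate the reasoning above, here is a minimal standalone sketch of single-writer publication through a volatile field. It is not the patch code: the class VolatilePublication, the CollectorProxy interface, and the address string are hypothetical stand-ins for the NM collector service proxy. It assumes exactly one thread ever calls start(); with more than one writer, the null check followed by the assignment would no longer be safe without a lock.

```java
public class VolatilePublication {
    // Hypothetical stand-in for the RPC proxy type used in the patch.
    interface CollectorProxy {
        String address();
    }

    // volatile: a reader that observes a non-null reference also observes
    // the fully initialized object behind it.
    private volatile CollectorProxy proxy;

    // Assumed to be called by a single thread only, so no lock is taken.
    void start(String address) {
        if (proxy == null) {
            proxy = () -> address;
        }
    }

    // May be called from any thread; returns null until start() has run.
    CollectorProxy getProxy() {
        return proxy;
    }

    // Starts one reader thread that polls until the writer publishes the proxy.
    static String demo() {
        VolatilePublication holder = new VolatilePublication();
        String[] seen = new String[1];
        Thread reader = new Thread(() -> {
            CollectorProxy p;
            while ((p = holder.getProxy()) == null) {
                Thread.yield();
            }
            seen[0] = p.address();
        });
        reader.start();
        holder.start("localhost:8048"); // single writer
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Readers must still tolerate a null result if they can run before start(); volatile only guarantees visibility of the write, not that the write has already happened.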


> TestMRTimelineEventHandling and TestApplication are broken
> ----------------------------------------------------------
>
>                 Key: YARN-3634
>                 URL: https://issues.apache.org/jira/browse/YARN-3634
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-3634-YARN-2928.001.patch, 
> YARN-3634-YARN-2928.002.patch, YARN-3634-YARN-2928.003.patch
>
>
> TestMRTimelineEventHandling is broken. Relevant error message:
> {noformat}
> 2015-05-12 06:28:56,415 INFO  [AsyncDispatcher event handler] ipc.Client 
> (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
> asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 0 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-05-12 06:28:57,416 INFO  [AsyncDispatcher event handler] ipc.Client 
> (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
> asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 1 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-05-12 06:28:58,416 INFO  [AsyncDispatcher event handler] ipc.Client 
> (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
> asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 2 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-05-12 06:28:59,417 INFO  [AsyncDispatcher event handler] ipc.Client 
> (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
> asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 3 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-05-12 06:29:00,418 INFO  [AsyncDispatcher event handler] ipc.Client 
> (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
> asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 4 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-05-12 06:29:01,419 INFO  [AsyncDispatcher event handler] ipc.Client 
> (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
> asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 5 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-05-12 06:29:02,420 INFO  [AsyncDispatcher event handler] ipc.Client 
> (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
> asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 6 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-05-12 06:29:03,420 INFO  [AsyncDispatcher event handler] ipc.Client 
> (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
> asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 7 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-05-12 06:29:04,421 INFO  [AsyncDispatcher event handler] ipc.Client 
> (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
> asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 8 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-05-12 06:29:05,422 INFO  [AsyncDispatcher event handler] ipc.Client 
> (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
> asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 9 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 2015-05-12 06:29:05,424 ERROR [AsyncDispatcher event handler] 
> collector.NodeTimelineCollectorManager 
> (NodeTimelineCollectorManager.java:postPut(121)) - Failed to communicate with 
> NM Collector Service for application_1431412130291_0001
> 2015-05-12 06:29:05,425 WARN  [AsyncDispatcher event handler] 
> containermanager.AuxServices 
> (AuxServices.java:logWarningWhenAuxServiceThrowExceptions(261)) - The 
> auxService name is timeline_collector and it got an error at event: 
> CONTAINER_INIT
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.ConnectException: Call From asf904.gq1.ygridcore.net/67.195.81.148 
> to asf904.gq1.ygridcore.net:0 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>       at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager.putIfAbsent(TimelineCollectorManager.java:97)
>       at 
> org.apache.hadoop.yarn.server.timelineservice.collector.PerNodeTimelineCollectorsAuxService.addApplication(PerNodeTimelineCollectorsAuxService.java:99)
>       at 
> org.apache.hadoop.yarn.server.timelineservice.collector.PerNodeTimelineCollectorsAuxService.initializeContainer(PerNodeTimelineCollectorsAuxService.java:126)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:226)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:49)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.net.ConnectException: Call From asf904.gq1.ygridcore.net/67.195.81.148 
> to asf904.gq1.ygridcore.net:0 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>       at 
> org.apache.hadoop.yarn.server.timelineservice.collector.NodeTimelineCollectorManager.postPut(NodeTimelineCollectorManager.java:122)
>       at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager.putIfAbsent(TimelineCollectorManager.java:95)
>       ... 7 more
> Caused by: java.net.ConnectException: Call From 
> asf904.gq1.ygridcore.net/67.195.81.148 to asf904.gq1.ygridcore.net:0 failed 
> on connection exception: java.net.ConnectException: Connection refused; For 
> more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>       at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>       at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>       at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>       at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>       at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1496)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1423)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>       at com.sun.proxy.$Proxy108.getTimelineCollectorContext(Unknown Source)
>       at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.CollectorNodemanagerProtocolPBClientImpl.getTimelineCollectorContext(CollectorNodemanagerProtocolPBClientImpl.java:99)
>       at 
> org.apache.hadoop.yarn.server.timelineservice.collector.NodeTimelineCollectorManager.updateTimelineCollectorContext(NodeTimelineCollectorManager.java:188)
>       at 
> org.apache.hadoop.yarn.server.timelineservice.collector.NodeTimelineCollectorManager.postPut(NodeTimelineCollectorManager.java:116)
>       ... 8 more
> Caused by: java.net.ConnectException: Connection refused
>       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>       at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:625)
>       at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:723)
>       at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>       at org.apache.hadoop.ipc.Client.getConnection(Client.java:1545)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1462)
>       ... 14 more
> {noformat}
> This surfaced when we switched to use port ":0" for the mini-YARN cluster for 
> the node collector service.
> Also, TestApplication tests are broken because the mocked context does not 
> have the configuration object which ApplicationImpl depends on.
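
As background on the ":0" symptom in the log above: binding to port 0 asks the operating system to pick a free port, so a client has to connect to the port that was actually bound, not to the configured value of 0. The following is a minimal sketch using plain java.net sockets; the class and method names are hypothetical, and it does not reproduce how the patch resolves the address.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class EphemeralPortDemo {
    // Returns {configuredPort, actualPort} after a successful client connect.
    static int[] bindAndConnect() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket()) {
            // Configured address uses port 0, meaning "any free port".
            InetSocketAddress configured = new InetSocketAddress(loopback, 0);
            server.bind(configured);

            // The OS-assigned port is only known after bind().
            int actualPort = server.getLocalPort();

            // Connecting to the actual port succeeds; the configured port 0
            // is not a usable connect target.
            try (Socket client = new Socket()) {
                client.connect(new InetSocketAddress(loopback, actualPort), 1000);
            }
            return new int[] {configured.getPort(), actualPort};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] ports = bindAndConnect();
        System.out.println("configured=" + ports[0] + " actual=" + ports[1]);
    }
}
```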



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
