[ 
https://issues.apache.org/jira/browse/YARN-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592777#comment-13592777
 ] 

Bikas Saha commented on YARN-417:
---------------------------------

Calling client.registerApp() before client.start() and client.stop() before 
client.unregister() is not in line with the Service interface. Services need to 
be started before used and stopped after using them. Also, adding a number of 
services as part of a composite service is a common pattern. In that, all 
services are added, inited, started, used and then stopped . The composite 
service takes care of ordering between services. In such use cases, it may not 
possible to call interface methods out of order as is being done here. We could 
enhance the heartbeater to not heartbeat until register is called. or we could 
start the heartbeater after registration is complete. The latter approach makes 
more sense to me.

I am surprised that the DistShell code is calling resourceManager.stop() and 
then resourceManager.unregister() because stop() eventually call 
AMRMClientImpl.stop() that shuts down the proxy. After that, unregister() call 
on AMRMClientImpl should fail.

Why are we calling client.start() in the init() method and not at the beginning 
of the start method()? Perhaps related to the above comment.
{code}
+  @Override
+  public void init(Configuration conf) {
+    super.init(conf);
+    client.init(conf);
+    client.start();
+  }
{code}

Why not wait for the handlerThread to join()? The comment does not match the 
code for the heartbeat thread.
{code}
+  /**
+   * Tells the heartbeat thread to stop, but does not wait for it to return.
+   */
+  @Override
+  public void stop() {
+    client.stop();
+    keepRunning = false;
+    try {
+      heartbeatThread.join();
+    } catch (InterruptedException ex) {
+      LOG.error("Error joining with heartbeat thread", ex);
+    }
+    handlerThread.interrupt();
+  }
{code}

In general, it would be good to spend some thought on the thread safety of the 
new class. Both external calls from the app and the internal producer/consumer 
race between the heartbeat and callback threads. During startup, execution and 
shutdown. I havent thought through them but the almost complete absence of any 
synchronization made be wonder if it was by design.
I would prefer queue.put() which blocks on capacity instead of queue.add() to 
mirror queue.take().

Could save some time using wait/notify? Important for end to end tests time.
{done}
+    while (!done) {
+      try {
+        Thread.sleep(1000);
+      } catch (InterruptedException ex) {}
+    }
{done}

Looks like this is only for tests. If yes, how about making it package private 
and annotating with @Private and @VisibleForTesting.
{code}
+  public AMRMClientAsync(AMRMClient client, int intervalMs,
+      CallbackHandler callbackHandler) {
{code}

A committer once told me that the philosophy behind BuilderUtils it to pass all 
members of the object being built and use it as a completely defined 
constructor so that folks dont miss passing any member fields by accident. So I 
guess nodeUpdates and reboot should also be passed in as arguments.
{code}
+  public static AMResponse newAMResponse(
+      List<ContainerStatus> completedContainers,
+      List<Container> allocatedContainers) {
{code}

I would like the test code to not exemplify incorrect use of the class. The 
test is calling allocate without call register and it all works. Maybe if we 
fixed the first comment in this review then it wont allow such incorrect usage. 
Secondly, folks tend to look at test code to see usage of a class and so 
showing incorrect usage is not a good idea IMO.
{code}
+    AMRMClientAsync asyncClient = new AMRMClientAsync(client, 200, 
callbackHandler);
+    asyncClient.init(conf);
+    asyncClient.start();
+    
+    while (callbackHandler.takeAllocatedContainers() == null) {
+
{code}

This code can lead to a flaky test. If I understand the flow correctly the 
following can happen. CallbackHandler populates allocatedContainers and OS 
pauses it. In the meanwhile heartbeater has already given completedContainers. 
The main thread then takesAllocatedContainers and it pauses. The 
CallbackHandler then returns and onCompletedContainers() is called which 
populates completed containers. Then it pauses. The main thread executes 
takeCompletedContainers() which returns non-null and the Assert fails. Is this 
a correct understanding? If yes, we should make sure that the test does not end 
up being flaky. In general sleep() should be avoided because it makes tests 
slow and tend to be flaky. I agree in some case, sleep is hard to avoid when 
the test is running an inline service whose timing we cannot control or when 
the effort to do so is too large. But in this case where all the code is test 
code or mock code, we could avoid sleeping.
{code}
+    while (callbackHandler.takeAllocatedContainers() == null) {
+      // allocated containers should come before completed containers
+      Assert.assertEquals(null, callbackHandler.takeCompletedContainers());
+      Thread.sleep(100);
+    }
{code}

An interesting test would be one that asserts that the heartbeats continue to 
happen when the callback is called. Its an important feature we explicitly 
designed for, right?


                
> Add a poller that allows the AM to receive notifications when it is assigned 
> containers
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-417
>                 URL: https://issues.apache.org/jira/browse/YARN-417
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, applications
>    Affects Versions: 2.0.3-alpha
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>         Attachments: AMRMClientAsync-1.java, AMRMClientAsync.java, 
> YARN-417-1.patch, YARN-417-2.patch, YARN-417-3.patch, YARN-417.patch, 
> YarnAppMaster.java, YarnAppMasterListener.java
>
>
> Writing AMs would be easier for some if they did not have to handle 
> heartbeating to the RM on their own.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to