[
https://issues.apache.org/jira/browse/YARN-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592777#comment-13592777
]
Bikas Saha commented on YARN-417:
---------------------------------
Calling client.registerApp() before client.start() and client.stop() before
client.unregister() is not in line with the Service interface. Services need to
be started before used and stopped after using them. Also, adding a number of
services as part of a composite service is a common pattern. In that, all
services are added, inited, started, used and then stopped . The composite
service takes care of ordering between services. In such use cases, it may not
possible to call interface methods out of order as is being done here. We could
enhance the heartbeater to not heartbeat until register is called. or we could
start the heartbeater after registration is complete. The latter approach makes
more sense to me.
I am surprised that the DistShell code is calling resourceManager.stop() and
then resourceManager.unregister() because stop() eventually call
AMRMClientImpl.stop() that shuts down the proxy. After that, unregister() call
on AMRMClientImpl should fail.
Why are we calling client.start() in the init() method and not at the beginning
of the start method()? Perhaps related to the above comment.
{code}
+ @Override
+ public void init(Configuration conf) {
+ super.init(conf);
+ client.init(conf);
+ client.start();
+ }
{code}
Why not wait for the handlerThread to join()? The comment does not match the
code for the heartbeat thread.
{code}
+ /**
+ * Tells the heartbeat thread to stop, but does not wait for it to return.
+ */
+ @Override
+ public void stop() {
+ client.stop();
+ keepRunning = false;
+ try {
+ heartbeatThread.join();
+ } catch (InterruptedException ex) {
+ LOG.error("Error joining with heartbeat thread", ex);
+ }
+ handlerThread.interrupt();
+ }
{code}
In general, it would be good to spend some thought on the thread safety of the
new class. Both external calls from the app and the internal producer/consumer
race between the heartbeat and callback threads. During startup, execution and
shutdown. I havent thought through them but the almost complete absence of any
synchronization made be wonder if it was by design.
I would prefer queue.put() which blocks on capacity instead of queue.add() to
mirror queue.take().
Could save some time using wait/notify? Important for end to end tests time.
{done}
+ while (!done) {
+ try {
+ Thread.sleep(1000);
+ } catch (InterruptedException ex) {}
+ }
{done}
Looks like this is only for tests. If yes, how about making it package private
and annotating with @Private and @VisibleForTesting.
{code}
+ public AMRMClientAsync(AMRMClient client, int intervalMs,
+ CallbackHandler callbackHandler) {
{code}
A committer once told me that the philosophy behind BuilderUtils it to pass all
members of the object being built and use it as a completely defined
constructor so that folks dont miss passing any member fields by accident. So I
guess nodeUpdates and reboot should also be passed in as arguments.
{code}
+ public static AMResponse newAMResponse(
+ List<ContainerStatus> completedContainers,
+ List<Container> allocatedContainers) {
{code}
I would like the test code to not exemplify incorrect use of the class. The
test is calling allocate without call register and it all works. Maybe if we
fixed the first comment in this review then it wont allow such incorrect usage.
Secondly, folks tend to look at test code to see usage of a class and so
showing incorrect usage is not a good idea IMO.
{code}
+ AMRMClientAsync asyncClient = new AMRMClientAsync(client, 200,
callbackHandler);
+ asyncClient.init(conf);
+ asyncClient.start();
+
+ while (callbackHandler.takeAllocatedContainers() == null) {
+
{code}
This code can lead to a flaky test. If I understand the flow correctly the
following can happen. CallbackHandler populates allocatedContainers and OS
pauses it. In the meanwhile heartbeater has already given completedContainers.
The main thread then takesAllocatedContainers and it pauses. The
CallbackHandler then returns and onCompletedContainers() is called which
populates completed containers. Then it pauses. The main thread executes
takeCompletedContainers() which returns non-null and the Assert fails. Is this
a correct understanding? If yes, we should make sure that the test does not end
up being flaky. In general sleep() should be avoided because it makes tests
slow and tend to be flaky. I agree in some case, sleep is hard to avoid when
the test is running an inline service whose timing we cannot control or when
the effort to do so is too large. But in this case where all the code is test
code or mock code, we could avoid sleeping.
{code}
+ while (callbackHandler.takeAllocatedContainers() == null) {
+ // allocated containers should come before completed containers
+ Assert.assertEquals(null, callbackHandler.takeCompletedContainers());
+ Thread.sleep(100);
+ }
{code}
An interesting test would be one that asserts that the heartbeats continue to
happen when the callback is called. Its an important feature we explicitly
designed for, right?
> Add a poller that allows the AM to receive notifications when it is assigned
> containers
> ---------------------------------------------------------------------------------------
>
> Key: YARN-417
> URL: https://issues.apache.org/jira/browse/YARN-417
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api, applications
> Affects Versions: 2.0.3-alpha
> Reporter: Sandy Ryza
> Assignee: Sandy Ryza
> Attachments: AMRMClientAsync-1.java, AMRMClientAsync.java,
> YARN-417-1.patch, YARN-417-2.patch, YARN-417-3.patch, YARN-417.patch,
> YarnAppMaster.java, YarnAppMasterListener.java
>
>
> Writing AMs would be easier for some if they did not have to handle
> heartbeating to the RM on their own.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira