[
https://issues.apache.org/jira/browse/YARN-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759973#comment-13759973
]
Bikas Saha commented on YARN-1027:
----------------------------------
The patch look clean overall.
I would suggest keeping haEnabled concept within the HAServiceProtocol service
instead of mixing it between the ResourceManager and HAServiceProtocol. Thus
the RM always addService(HAServiceProtocol). HAServiceProtocol is the one that
checks if haEnabled in serviceStart(). If enabled then it transitions to
standby and waits for active signal. If not, then it directly transitions to
active.
Shouldn't we simply call transitionToStandby() here? That would ensure
getServiceStatus() returns non active status for anyone that cares to know.
{code}
+ public void serviceStop() throws Exception {
+ if (rm.haState == HAServiceState.ACTIVE) {
+ rm.stopActiveServices();
{code}
This is fine for now but we might have to invest in better health check in a
different jira. Any ideas?
{code}
public synchronized void monitorHealth() throws HealthCheckFailedException {
+ if (rm.haState == HAServiceState.ACTIVE && !rm.areActiveServicesRunning())
{
{code}
We probably want the log before the if stmt.
Should we change state to standby before we stop services? Assuming that HA
aware services would need to know about this earlier rather than later so that
they can stop signaling Active services and allow them to be drained/stopped.
{code}
+ if (rm.haState == HAServiceState.ACTIVE) {
+ rm.stopActiveServices();
+ }
+
+ LOG.info("Transitioning to standby");
+ rm.haState = HAServiceState.STANDBY;
{code}
Didnt quite get this comment. Is this do with change being requested by
user/admin/ZKFC?
{code}
+ public void transitionToActive(StateChangeRequestInfo reqInfo) {
+ // TODO: When automatic failover is enabled, check if transition should
+ // be allowed for this request
{code}
What are the pros of making haState a member of ResourceManager instead of
HAServiceProtocol? A pro of the latter is that it keeps all HA stuff in one
place.
Why is there a lock used in ResourceManager.startActive() etc. Why are these
methods protected. If testing, then lets add an @visiblefortesting annotation.
Is there a way to confirm that the active service objects are all being GC'd?
testStartAndTransitions() - How about calling getServiceStatus() and
monitorHealth() in addition to checking the internal members, in all places
where internal members are being checked. So we can test and exercise those
methods too. How about completing
Active->Standby->Active->Standby->Active->RM.serviceStop(). This would fully
simulate multiple full cycles of transitions and also verify the shutdown case.
We can also issue some requests like createApplication() to the RM, when in
active state, and verify that the RM is really working.
TestRMHADisabled. It confusing to read that the RM has started but its
haState==INITIALIZING. Also, we can probably move this test in TestRMHA.java to
keep related tests in one place.
Minor nits
LOG instead of print?
{code}
+ } catch (Exception e) {
+ e.printStackTrace();
{code}
RM_HA_PREFIX instead of HA_PREFIX
> Implement RMHAServiceProtocol
> -----------------------------
>
> Key: YARN-1027
> URL: https://issues.apache.org/jira/browse/YARN-1027
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Karthik Kambatla
> Attachments: test-yarn-1027.patch, yarn-1027-1.patch,
> yarn-1027-2.patch, yarn-1027-3.patch, yarn-1027-including-yarn-1098-3.patch,
> yarn-1027-in-rm-poc.patch
>
>
> Implement existing HAServiceProtocol from Hadoop common. This protocol is the
> single point of interaction between the RM and HA clients/services.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira