Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-26 Thread Sandor Magyari


> On July 13, 2016, 7:02 a.m., Sebastian Toader wrote:
> > I think this is a rather generic problem that needs to be handled in 
> > *HdfsResourceJar* and *HdfsResourceWebHDFS (WebHDFSUtil)*.
> > 
> > These are the classes that carry out the HDFS operations. All retryable 
> > operations (e.g. SETPERMISSION) should be guarded with retry logic that 
> > retries the operation until a given timeout before giving up and 
> > bailing out.
> > 
> > Determining which HDFS operations are retryable might be as easy as 
> > looking at the returned status/error code or the type of the exception 
> > (e.g. "RetriableException"), though it needs to be verified that this is 
> > consistent across both the WebHDFS and HdfsResource jar implementations.
> > 
> > The RCO doesn't help here: even though the NNs are started before ATS, 
> > that doesn't mean the NNs are ready to execute HDFS operations (e.g. it 
> > takes some time to elect the active and standby nodes, and exiting safe 
> > mode may take a considerable amount of time if there are many datanodes).
> 
> Victor Galgo wrote:
> Hi Sebastian. Thanks for your input. 
> 
> I don't like this approach very much: sometimes the NN can take a really 
> long time to leave safemode, and we cannot just wait forever.
> 
> Waiting too long would make operations hang while the NN is down, when 
> they should instead fail with a clear message.
> 
> Victor Galgo wrote:
> "The RCO doesn't help here as even though NNs"
> 
> It does help, because NN start waits for the NN to leave safemode before 
> finishing.

I've created https://issues.apache.org/jira/browse/AMBARI-17901 to track the 
generic solution suggested by Sebastian and Jonathan. Apart from that, I think 
we can go ahead with this patch, as it solves this special case for now.


- Sandor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review142023
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-26 Thread Sandor Magyari

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review143523
---


Ship it!




Ship It!

- Sandor Magyari


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 
> If this is done, on "Start all" it will be ensured that datanodes start prior 
> to NN, and that NN are out of safemode before ATS start.
> 
> 
> Diffs
> -
> 
>   
> ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
>  24677e4 
>   ambari-web/app/messages.js 6465812 
> 
> Diff: https://reviews.apache.org/r/48734/diff/
> 
> 
> Testing
> ---
> 
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> 
>   

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-20 Thread Jonathan Hurley


> On June 21, 2016, 4:48 p.m., Jonathan Hurley wrote:
> > Ship It!
> 
> Victor Galgo wrote:
> Jonathan, can you please do the honours of committing this patch?
> 
> Jonathan Hurley wrote:
> Has this been committed yet? If so, please close the review.
> 
> Victor Galgo wrote:
> Hi Jonathan.
> It was not. Can you please do the honours?

It only has a single +1; we need to wait for another. Also, it seems as though 
there is still some discussion ongoing about whether this is the best approach.


- Jonathan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138935
---


On June 17, 2016, 6:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 6:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 
> If this is done, on "Start all" it 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-13 Thread Victor Galgo


> On June 21, 2016, 8:48 p.m., Jonathan Hurley wrote:
> > Ship It!
> 
> Victor Galgo wrote:
> Jonathan, can you please do the honours of committing this patch?
> 
> Jonathan Hurley wrote:
> Has this been committed yet? If so, please close the review.

Hi Jonathan.
It was not. Can you please do the honours?


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138935
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 
> If this is done, on "Start all" it will be ensured that datanodes start prior 
> to NN, and that NN are out of safemode before ATS start.
> 
> 
> Diffs
> -
> 
>   
> 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-13 Thread Victor Galgo


> On July 13, 2016, 7:02 a.m., Sebastian Toader wrote:
> > I think this is a rather generic problem that needs to be handled in 
> > *HdfsResourceJar* and *HdfsResourceWebHDFS (WebHDFSUtil)*.
> > 
> > These are the classes that carry out the HDFS operations. All retryable 
> > operations (e.g. SETPERMISSION) should be guarded with retry logic that 
> > retries the operation until a given timeout before giving up and 
> > bailing out.
> > 
> > Determining which HDFS operations are retryable might be as easy as 
> > looking at the returned status/error code or the type of the exception 
> > (e.g. "RetriableException"), though it needs to be verified that this is 
> > consistent across both the WebHDFS and HdfsResource jar implementations.
> > 
> > The RCO doesn't help here: even though the NNs are started before ATS, 
> > that doesn't mean the NNs are ready to execute HDFS operations (e.g. it 
> > takes some time to elect the active and standby nodes, and exiting safe 
> > mode may take a considerable amount of time if there are many datanodes).
> 
> Victor Galgo wrote:
> Hi Sebastian. Thanks for your input. 
> 
> I don't like this approach very much: sometimes the NN can take a really 
> long time to leave safemode, and we cannot just wait forever.
> 
> Waiting too long would make operations hang while the NN is down, when 
> they should instead fail with a clear message.

"The RCO doesn't help here as even though NNs"

It does help, because NN start waits for the NN to leave safemode before 
finishing.
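
Roughly speaking, that wait amounts to polling the safemode flag until it 
clears. A minimal sketch of the idea, using only the stock hdfs dfsadmin CLI 
(the function name and timeouts are illustrative, not the actual Ambari code):

    import subprocess
    import time

    def wait_for_safemode_off(timeout=600, interval=15):
        # Poll 'hdfs dfsadmin -safemode get' until it reports OFF or we time out.
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(["hdfs", "dfsadmin", "-safemode", "get"])
            if "OFF" in out.decode("utf-8", "replace"):
                return True
            time.sleep(interval)
        return False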


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review142023
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-13 Thread Victor Galgo


> On July 13, 2016, 7:02 a.m., Sebastian Toader wrote:
> > I think this is a rather generic problem that needs to be handled in 
> > *HdfsResourceJar* and *HdfsResourceWebHDFS (WebHDFSUtil)*.
> > 
> > These are the classes that carry out the HDFS operations. All retryable 
> > operations (e.g. SETPERMISSION) should be guarded with retry logic that 
> > retries the operation until a given timeout before giving up and 
> > bailing out.
> > 
> > Determining which HDFS operations are retryable might be as easy as 
> > looking at the returned status/error code or the type of the exception 
> > (e.g. "RetriableException"), though it needs to be verified that this is 
> > consistent across both the WebHDFS and HdfsResource jar implementations.
> > 
> > The RCO doesn't help here: even though the NNs are started before ATS, 
> > that doesn't mean the NNs are ready to execute HDFS operations (e.g. it 
> > takes some time to elect the active and standby nodes, and exiting safe 
> > mode may take a considerable amount of time if there are many datanodes).

Hi Sebastian. Thanks for your input. 

I don't like this approach very much: sometimes the NN can take a really long 
time to leave safemode, and we cannot just wait forever.

Waiting too long would make operations hang while the NN is down, when they 
should instead fail with a clear message.


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review142023
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-13 Thread Sebastian Toader

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review142023
---



I think this is a rather generic problem that needs to be handled in 
*HdfsResourceJar* and *HdfsResourceWebHDFS (WebHDFSUtil)*.

These are the classes that carry out the HDFS operations. All retryable 
operations (e.g. SETPERMISSION) should be guarded with retry logic that 
retries the operation until a given timeout before giving up and bailing out.

Determining which HDFS operations are retryable might be as easy as looking at 
the returned status/error code or the type of the exception (e.g. 
"RetriableException"), though it needs to be verified that this is consistent 
across both the WebHDFS and HdfsResource jar implementations.

The RCO doesn't help here: even though the NNs are started before ATS, that 
doesn't mean the NNs are ready to execute HDFS operations (e.g. it takes some 
time to elect the active and standby nodes, and exiting safe mode may take a 
considerable amount of time if there are many datanodes).
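
As a rough illustration of the check described above (not existing Ambari 
code; the helper name is hypothetical), a retryable WebHDFS response could be 
detected from the 403 body like this:

    import json

    def is_retriable_response(status_code, body):
        # Treat a 403 whose RemoteException is a RetriableException or
        # SafeModeException (e.g. NameNode still in safe mode) as retryable.
        if status_code != 403:
            return False
        try:
            remote = json.loads(body).get("RemoteException", {})
        except ValueError:
            return False
        return remote.get("exception") in ("RetriableException", "SafeModeException")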

- Sebastian Toader


On June 18, 2016, 12:45 a.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 18, 2016, 12:45 a.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-12 Thread Jonathan Hurley


> On June 21, 2016, 4:48 p.m., Jonathan Hurley wrote:
> > Ship It!
> 
> Victor Galgo wrote:
> Jonathan, can you please do the honours of committing this patch?

Has this been committed yet? If so, please close the review.


- Jonathan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138935
---


On June 17, 2016, 6:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 6:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 
> If this is done, on "Start all" it will be ensured that datanodes start prior 
> to NN, and that NN are out of safemode before ATS start.
> 
> 
> Diffs
> -
> 
>   
> ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
>  24677e4 
>   ambari-web/app/messages.js 6465812 
> 
> Diff: 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-21 Thread Victor Galgo


> On June 21, 2016, 8:48 p.m., Jonathan Hurley wrote:
> > Ship It!

Jonathan, can you please do the honours of committing this patch?


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138935
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 
> If this is done, on "Start all" it will be ensured that datanodes start prior 
> to NN, and that NN are out of safemode before ATS start.
> 
> 
> Diffs
> -
> 
>   
> ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
>  24677e4 
>   ambari-web/app/messages.js 6465812 
> 
> Diff: https://reviews.apache.org/r/48734/diff/
> 
> 
> Testing
> ---
> 
> Calling set on destroyed 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-21 Thread Jonathan Hurley

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138935
---


Ship it!




Ship It!

- Jonathan Hurley


On June 17, 2016, 6:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 6:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 
> If this is done, on "Start all" it will be ensured that datanodes start prior 
> to NN, and that NN are out of safemode before ATS start.
> 
> 
> Diffs
> -
> 
>   
> ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
>  24677e4 
>   ambari-web/app/messages.js 6465812 
> 
> Diff: https://reviews.apache.org/r/48734/diff/
> 
> 
> Testing
> ---
> 
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> 
>   

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-21 Thread Alejandro Fernandez


> On June 17, 2016, 5:44 p.m., Alejandro Fernandez wrote:
> > ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js,
> >  line 25
> > 
> >
> > How does this fix the issue? If NN just started, it still needs to get 
> > block reports, so ATS can still fail.
> 
> Victor Galgo wrote:
> Alejandro, thanks for having a look!
> 
> This fixes the issue because when we do "Start All" later on, NN start is 
> triggered before ATS start (role_command_order), and during NN start it 
> waits until safemode is off before proceeding with ATS and the other 
> services.

I still prefer a custom command to wait for NN to leave safemode. A restart 
command has a timeout, so if it doesn't finish in time, a retry will stop NN 
again, which we don't want.
Further, a start command also has a timeout, and another start command may do a 
no-op since it's already started.
I think the logic is far cleaner and more maintainable with a custom command 
just for leaving safemode.
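
A minimal sketch of what such a custom command could look like in the 
NameNode service scripts, relying on resource_management's Execute retries; 
the class name, command name, and retry counts are illustrative only, not an 
actual Ambari command:

    from resource_management.core.resources.system import Execute
    from resource_management.libraries.script.script import Script

    class WaitForSafemodeOff(Script):
        # Hypothetical custom command: block until the NameNode reports safe mode OFF.
        def waitforsafemodeoff(self, env):
            # Retry the safemode query for up to ~10 minutes (40 tries x 15s);
            # grep exits non-zero while safe mode is still ON, triggering a retry.
            Execute("hdfs dfsadmin -safemode get | grep 'Safe mode is OFF'",
                    user="hdfs",
                    tries=40,
                    try_sleep=15)

    if __name__ == "__main__":
        WaitForSafemodeOff().execute()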


- Alejandro


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138263
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-21 Thread Victor Galgo


> On June 17, 2016, 6:44 p.m., Di Li wrote:
> > Hello Victor,
> > 
> > so I ran some tests and observed the following. I have a 3 node cluster, 
> > c1.apache.org, c2.apache.org, and c3.apache.org
> > 
> > 1. Right after finishing the manual steps listed on the "Initialize 
> > Metadata" step, I noticed c1.apache.org has the NameNode process running 
> > but it's the standby. c2.apache.org (the newly added NN) has NN stopped.
> > 
> > 2. The state of the two NNs in #1 seems to have caused the NN's 
> > check_is_active_namenode function call to return False, thus setting 
> > ensure_safemode_off to False as well, skipping the safemode check 
> > altogether.
> > 
> > 3. If I just run the safemode checks from the hadoop command line, here 
> > are the results; notice that safemode is reported as ON on the standby 
> > node and the other one is a connection refused error:
> > 
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn1
> > standby
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn2
> > 16/06/17 11:26:42 INFO ipc.Client: Retrying connect to server: 
> > c1.apache.org:8020. Already tried 0 time(s); retry policy is 
> > RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 
> > MILLISECONDS)
> > Operation failed: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > [hdfs@c1 ~]$ hdfs dfsadmin -safemode get
> > Safe mode is ON in c1.apache.org:8020
> > safemode: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > 
> > So in my opinion, the fix should be at the NameNode Python script level to 
> > always check safemode against the two NNs, and make sure the safemode is 
> > off on the active namenode. As a safeguard against offline active NN, the 
> > check should eventually timeout to unblock the rest of the start sequence.
> 
> Victor Galgo wrote:
> "So in my opinion, the fix should be at the NameNode Python script level 
> to always check safemode against the two NNs". 
> We cannot do that, because at that point all datanodes are stopped, which 
> means the NN will never leave safemode.
> 
> Alejandro Fernandez wrote:
> Please include Jonathan Hurley in the code review since he recently 
> modified the function that waits to leave safemode.
> This is not the first time that we've had the need for a step to "leave 
> safe mode". So either we put it into the python code (and do a lot of testing 
> on it since it also impacts EU and RU), or make a custom command for HDFS 
> that is only available if HA is present, and it waits for NameNode to leave 
> safemode.
> 
> Jonathan Hurley wrote:
> Yes, I recently added something for the case during an EU where we know 
> that the NameNode probably won't leave Safemode. Essentially, don't try to 
> create any directories if the NN didn't wait for safemode to exit. That was 
> only for NN, though.
> 
> But this problem is a more generic case - it affects other services. 
> Since NN wasn't restarted it might be in Safemode. In this case, I think we 
> need to handle the retryable exception and back off and wait. 
> 
> However, you could also argue that since we know we're doing a restart 
> operation, we should be shutting down the NNs completely. If there's no issue 
> with shutting them down during the HA process, then this patch seems fine for 
> now, but we should open another one for catching the RetryableException.

Thanks Jonathan! Absolutely agree with your points. Could you please Ship it?
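
For reference, the back-off handling Jonathan describes above could wrap the 
WebHDFS call roughly like this (a sketch only; request_fn and is_retriable_fn 
are hypothetical callables, not existing Ambari functions):

    import time

    def run_with_retry(request_fn, is_retriable_fn, timeout=300, interval=10):
        # Re-issue a retryable request (e.g. SETPERMISSION while the NN is
        # still in safe mode) until it succeeds or the overall timeout passes.
        deadline = time.time() + timeout
        while True:
            status, body = request_fn()
            if not is_retriable_fn(status, body):
                return status, body
            if time.time() >= deadline:
                raise RuntimeError("giving up after %ss; last response: %s"
                                   % (timeout, body))
            time.sleep(interval)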


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138280
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Alejandro Fernandez


> On June 17, 2016, 6:44 p.m., Di Li wrote:
> > Hello Victor,
> > 
> > so I ran some tests and observed the following. I have a 3 node cluster, 
> > c1.apache.org, c2.apache.org, and c3.apache.org
> > 
> > 1. Right after finishing the manual steps listed on the "Initialize 
> > Metadata" step, I noticed c1.apache.org has the NameNode process running 
> > but it's the standby. c2.apache.org (the newly added NN) has NN stopped.
> > 
> > 2. The state of the two NNs in #1 seems to have caused the NN's 
> > check_is_active_namenode function call to return False, thus setting 
> > ensure_safemode_off to False as well >> skipping the safemode check 
> > altogether.
> > 
> > 3. If I just run the safemode check from the hadoop command line, here are 
> > the results; notice that safemode is reported as ON on the standby node and 
> > the other one is a connection refused error:
> > 
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn1
> > standby
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn2
> > 16/06/17 11:26:42 INFO ipc.Client: Retrying connect to server: 
> > c1.apache.org:8020. Already tried 0 time(s); retry policy is 
> > RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 
> > MILLISECONDS)
> > Operation failed: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > [hdfs@c1 ~]$ hdfs dfsadmin -safemode get
> > Safe mode is ON in c1.apache.org:8020
> > safemode: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > 
> > So in my opinion, the fix should be at the NameNode Python script level to 
> > always check safemode against the two NNs, and make sure the safemode is 
> > off on the active namenode. As a safeguard against offline active NN, the 
> > check should eventually timeout to unblock the rest of the start sequence.
> 
> Victor Galgo wrote:
> "So in my opinion, the fix should be at the NameNode Python script level 
> to always check safemode against the two NNs". 
> We cannot do that, because at that point all DataNodes are stopped, 
> which means the NN will never leave safemode.

Please include Jonathan Hurley in the code review since he recently modified 
the function that waits to leave safemode.
This is not the first time that we've had the need for a step to "leave safe 
mode". So either we put it into the python code (and do a lot of testing on it 
since it also impacts EU and RU), or make a custom command for HDFS that is 
only available if HA is present, and it waits for NameNode to leave safemode.


- Alejandro


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138280
---


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>  

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Victor Galgo


> On June 17, 2016, 6:44 p.m., Di Li wrote:
> > Hello Victor,
> > 
> > so I ran some tests and observed the following. I have a 3 node cluster, 
> > c1.apache.org, c2.apache.org, and c3.apache.org
> > 
> > 1. Right after finishing the manual steps listed on the "Initialize 
> > Metadata" step, I noticed c1.apache.org has the NameNode process running 
> > but it's the standby. c2.apache.org (the newly added NN) has NN stopped.
> > 
> > 2. The state of the two NNs in #1 seems to have caused the NN's 
> > check_is_active_namenode function call to return False, thus setting 
> > ensure_safemode_off to False as well >> skipping the safemode check 
> > altogether.
> > 
> > 3. If I just run the safemode check from the hadoop command line, here are 
> > the results; notice that safemode is reported as ON on the standby node and 
> > the other one is a connection refused error:
> > 
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn1
> > standby
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn2
> > 16/06/17 11:26:42 INFO ipc.Client: Retrying connect to server: 
> > c1.apache.org:8020. Already tried 0 time(s); retry policy is 
> > RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 
> > MILLISECONDS)
> > Operation failed: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > [hdfs@c1 ~]$ hdfs dfsadmin -safemode get
> > Safe mode is ON in c1.apache.org:8020
> > safemode: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > 
> > So in my opinion, the fix should be at the NameNode Python script level to 
> > always check safemode against the two NNs, and make sure the safemode is 
> > off on the active namenode. As a safeguard against offline active NN, the 
> > check should eventually timeout to unblock the rest of the start sequence.

"So in my opinion, the fix should be at the NameNode Python script level to 
always check safemode against the two NNs". 
We cannot do that, because at that point all DataNodes are stopped, which 
means the NN will never leave safemode.


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138280
---


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Di Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138281
---



Hello Jayush and Alejandro,

Could you please take a look at my previous comment to Victor about my 
investigation and why I think the fix should happen at the NameNode Python code 
level? Let me know if it's a reasonable statement...

- Di Li


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 
> If this is done, on "Start all" it will be ensured that datanodes start prior 
> to NN, and that NN are out of safemode before ATS start.
> 
> 
> Diffs
> -
> 
>   
> ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
>  24677e4 
>   ambari-web/app/messages.js 6465812 
> 
> Diff: 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Di Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138280
---



Hello Victor,

so I ran some tests and observed the following. I have a 3 node cluster, 
c1.apache.org, c2.apache.org, and c3.apache.org

1. Right after finishing the manual steps listed on the "Initialize Metadata" 
step, I noticed c1.apache.org has the NameNode process running but it's the 
standby. c2.apache.org (the newly added NN) has NN stopped.

2. The state of the two NNs in #1 seems to have caused the NN's 
check_is_active_namenode function call to return False, thus setting 
ensure_safemode_off to False as well >> skipping the safemode check altogether.

3. If I just run the safemode check from the hadoop command line, here are the 
results; notice that safemode is reported as ON on the standby node and the 
other one is a connection refused error:

[hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns binn 
-getServiceState nn1
standby
[hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns binn 
-getServiceState nn2
16/06/17 11:26:42 INFO ipc.Client: Retrying connect to server: 
c1.apache.org:8020. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
Operation failed: Call From c1.apache.org to c2.apache.org:8020 failed on 
connection exception: java.net.ConnectException: Connection refused; For more 
details see:  http://wiki.apache.org/hadoop/ConnectionRefused
[hdfs@c1 ~]$ hdfs dfsadmin -safemode get
Safe mode is ON in c1.apache.org:8020
safemode: Call From c1.apache.org to c2.apache.org:8020 failed on connection 
exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused

So in my opinion, the fix should be at the NameNode Python script level to 
always check safemode against the two NNs, and make sure the safemode is off on 
the active namenode. As a safeguard against offline active NN, the check should 
eventually timeout to unblock the rest of the start sequence.
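
To illustrate, a rough sketch of such a bounded check (illustrative only, with
hypothetical helper names; it assumes the hdfs client is on the PATH and configured
for the HA nameservice, and is not the existing namenode.py logic):

import subprocess
import time

def run_cmd(args):
    # Run a command and return (exit_code, combined_output_as_text).
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out.decode("utf-8", "replace")

def service_state(nn_id):
    # Returns the reported HA state ("active"/"standby"), or None if unreachable.
    code, out = run_cmd(["hdfs", "haadmin", "-getServiceState", nn_id])
    if code != 0 or not out.strip():
        return None
    return out.strip().splitlines()[-1].strip().lower()

def wait_for_active_nn_out_of_safemode(nn_ids=("nn1", "nn2"), timeout=600, sleep=15):
    # Check safemode against both NNs and only require it to be OFF once one of
    # them reports itself active. Give up after `timeout` seconds so an offline
    # active NN does not block the rest of the start sequence forever.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if any(service_state(nn) == "active" for nn in nn_ids):
            code, out = run_cmd(["hdfs", "dfsadmin", "-safemode", "get"])
            if code == 0 and "Safe mode is OFF" in out:
                return True
        time.sleep(sleep)
    return False  # timed out; the caller decides whether to proceed or fail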

- Di Li


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Victor Galgo


> On June 17, 2016, 3:15 p.m., Jayush Luniya wrote:
> > ambari-web/app/messages.js, line 1325
> > 
> >
> > Not sure if stopping namenodes is the right way to go about with this.

Jayush, it looks right to me because the NNs should be started together with the 
other components during "Start All" to get the correct ordering and the wait for 
safemode to turn off. If you have any suggestions on how it could be fixed another 
way, please feel free to re-open the issue.

Thanks!


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138234
---


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 
> If this is done, on 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Victor Galgo


> On June 17, 2016, 5:44 p.m., Alejandro Fernandez wrote:
> > ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js,
> >  line 25
> > 
> >
> > How does this fix the issue? If NN just started, it still needs to get 
> > block reports, so ATS can still fail.

Alejandro thanks for having a look!

This fixes the issue because when we do "Start All" later on, NN start is 
triggered before ATS start (role_command_order), and during NN start we wait 
until safemode is off before proceeding with ATS and the other components.


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138263
---


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Alejandro Fernandez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138263
---




ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
 (line 25)


How does this fix the issue? If NN just started, it still needs to get 
block reports, so ATS can still fail.


- Alejandro Fernandez


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 
> If this is done, on "Start all" it will be ensured that datanodes start prior 
> to NN, and that NN are out of safemode before ATS start.
> 
> 
> Diffs
> -
> 
>   
> ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
>  24677e4 
>   

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Jayush Luniya

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138234
---




ambari-web/app/messages.js (line 1325)


Not sure if stopping namenodes is the right way to go about with this.


- Jayush Luniya


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 
> If this is done, on "Start all" it will be ensured that datanodes start prior 
> to NN, and that NN are out of safemode before ATS start.
> 
> 
> Diffs
> -
> 
>   
> ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
>  24677e4 
>   ambari-web/app/messages.js 6465812 
> 
> Diff: https://reviews.apache.org/r/48734/diff/
> 
> 
> Testing
> ---
> 
> 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-16 Thread Victor Galgo


> On June 16, 2016, 6:34 p.m., Di Li wrote:
> > ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js,
> >  line 146
> > 
> >
> > I am under the impression that the time it takes for NN to exit the 
> > safemode is largely determined by the amount of data in HDFS, not whether 
> > DNs are started before NN. 
> > 
> > Would it be safer to have some logic to check if NameNode is out of the 
> > safemode? On a cluster with terabytes of data in HDFS, it may take NN quite 
> > some time (a few minutes, depending on the cluster's performance) to exit 
> > the safemode.
> 
> Victor Galgo wrote:
> Hi Di Li! Thanks for taking a look into this.
> 
> The problem here is more complicated than it looks.
> 
> *Here is the basic scenario for handling safemode:*
> During "Start All", on NameNode start we wait until the NN leaves safemode 
> before declaring the start successful.
> 
> *However in the HA wizard:*
> We start the NameNodes at a point when the DataNodes are stopped, which means 
> the NN won't leave safemode at that point; that's why we skip that wait on 
> NN start in the HA wizard.
> After that, when we do "Start All" (the last step in the wizard), the 
> NameNodes are already started, so no wait for them to leave safemode is 
> triggered when the DNs are started.
> 
> My solution stops the NNs before "Start All", which means that when "Start 
> All" runs in the HA wizard, NN start will ensure that the NNs leave safemode 
> (since the DNs are already started at that point).
> 
> Di Li wrote:
> Hello Victor,
> 
> Thanks for the explanation. I may be asking something obvious to 
> experienced eyes so please bear with me.
> Could you please
> 1. point me to the logic that "During 'Start All', on NameNode start we 
> wait until the NN leaves safemode before declaring the start successful."
> 2. point me to the logic that skips #1 when the DNs aren't running.
> 
> I looked at the HDFS namenode Python scripts; the "wait_for_safemode_off" 
> method seems to be called only during upgrades. I could have missed 
> something, so please let me know.

ensure_safemode_off = True

# True if this is the only NameNode (non-HA) or if its the Active one in HA
is_active_namenode = True

if params.dfs_ha_enabled:
  Logger.info("Waiting for the NameNode to broadcast whether it is Active or Standby...")
  if check_is_active_namenode(hdfs_binary):
    Logger.info("Waiting for the NameNode to leave Safemode since High Availability is enabled and it is Active...")
  else:
    # we are the STANDBY NN
    ensure_safemode_off = False


check_is_active_namenode will return False after a lot of retries for both 
NameNodes, since neither of them is even out of safemode yet. That sets 
ensure_safemode_off to False, which makes the script skip the block below:

# wait for Safemode to end
if ensure_safemode_off:
  wait_for_safemode_off(hdfs_binary)
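
For context, the wait on NN start is essentially a bounded poll; a rough sketch of
that kind of loop (illustrative only, not the actual wait_for_safemode_off
implementation) would be:

import subprocess
import time

def wait_for_safemode_off_sketch(timeout=1800, sleep=30):
    # Poll `hdfs dfsadmin -safemode get` until it reports OFF, or give up
    # once the timeout expires so the start step does not hang forever.
    deadline = time.time() + timeout
    while time.time() < deadline:
        proc = subprocess.Popen(["hdfs", "dfsadmin", "-safemode", "get"],
                                stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        out, _ = proc.communicate()
        if proc.returncode == 0 and b"Safe mode is OFF" in out:
            return True
        time.sleep(sleep)
    return False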


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138047
---


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-16 Thread Di Li


> On June 16, 2016, 6:34 p.m., Di Li wrote:
> > ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js,
> >  line 146
> > 
> >
> > I am under the impression that the time it takes for NN to exit the 
> > safemode is largely determined by the amount of data in HDFS, not whether 
> > DNs are started before NN. 
> > 
> > Would it be safer to have some logic to check if NameNode is out of the 
> > safemode? On a cluster with terabytes of data in HDFS, it may take NN quite 
> > some time (a few minutes, depending on the cluster's performance) to exit 
> > the safemode.
> 
> Victor Galgo wrote:
> Hi Di Li! Thanks for taking a look into this.
> 
> The problem here is more complicated than it looks.
> 
> *Here is the basic scenario for handling safemode:*
> During "Start All", on NameNode start we wait until the NN leaves safemode 
> before declaring the start successful.
> 
> *However in the HA wizard:*
> We start the NameNodes at a point when the DataNodes are stopped, which means 
> the NN won't leave safemode at that point; that's why we skip that wait on 
> NN start in the HA wizard.
> After that, when we do "Start All" (the last step in the wizard), the 
> NameNodes are already started, so no wait for them to leave safemode is 
> triggered when the DNs are started.
> 
> My solution stops the NNs before "Start All", which means that when "Start 
> All" runs in the HA wizard, NN start will ensure that the NNs leave safemode 
> (since the DNs are already started at that point).

Hello Victor,

Thanks for the explanation. I may be asking something obvious to experienced 
eyes so please bear with me.
Could you please
1. point me to the logic that "During 'Start All', on NameNode start we wait 
until the NN leaves safemode before declaring the start successful."
2. point me to the logic that skips #1 when the DNs aren't running.

I looked at the HDFS namenode Python scripts; the "wait_for_safemode_off" method 
seems to be called only during upgrades. I could have missed something, 
so please let me know.


- Di


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138047
---


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-16 Thread Victor Galgo


> On June 16, 2016, 6:34 p.m., Di Li wrote:
> > ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js,
> >  line 146
> > 
> >
> > I am under the impression that the time it takes for NN to exit the 
> > safemode is largely determined by the amount of data in HDFS, not whether 
> > DNs are started before NN. 
> > 
> > Would it be safer to have some logic to check if NameNode is out of the 
> > safemode? On a cluster with terabytes of data in HDFS, it may take NN quite 
> > some time (a few minutes, depending on the cluster's performance) to exit 
> > the safemode.

Hi Di Li! Thanks for taking a look into this.

The problem here is more complicated than it looks.

*Here is the basic scenario for handling safemode:*
During "Start All", on NameNode start we wait until the NN leaves safemode 
before declaring the start successful.

*However in the HA wizard:*
We start the NameNodes at a point when the DataNodes are stopped, which means 
the NN won't leave safemode at that point; that's why we skip that wait on NN 
start in the HA wizard.
After that, when we do "Start All" (the last step in the wizard), the NameNodes 
are already started, so no wait for them to leave safemode is triggered when 
the DNs are started.

My solution stops the NNs before "Start All", which means that when "Start All" 
runs in the HA wizard, NN start will ensure that the NNs leave safemode (since 
the DNs are already started at that point).


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138047
---


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-16 Thread Di Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138047
---




ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
 (line 146)


I am under the impression that the time it takes for NN to exit the 
safemode is largely determined by the amount of data in HDFS, not whether DNs 
are started before NN. 

Would it be safer to have some logic to check if NameNode is out of the 
safemode? On a cluster with terabytes of data in HDFS, it may take NN quite 
some time (a few minutes, depending on the cluster's performance) to exit the 
safemode.


- Di Li


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the moment of ats 
> start, because DNs just started.
> 
> To fix this "stop namenodes" has to be triggered before "start all".
> 

Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-15 Thread Victor Galgo

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/
---

Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
Onischuk, Dmitro Lisnichenko, Robert Levas, Sandor Magyari, Sumit Mohanty, 
Sebastian Toader, and Yusaku Sako.


Bugs: AMBARI-17182
https://issues.apache.org/jira/browse/AMBARI-17182


Repository: ambari


Description
---

On the last step of enabling HA, "Start all", the following happens:

Traceback (most recent call last):
File 
"/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
 line 147, in <module>
  ApplicationTimelineServer().execute()
File 
"/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
 line 219, in execute
  method(env)
File 
"/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
 line 43, in start
  self.configure(env) # FOR SECURITY
File 
"/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
 line 54, in configure
  yarn(name='apptimelineserver')
File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
line 89, in thunk
  return fn(*args, **kwargs)
File 
"/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
 line 276, in yarn
  mode=0755
File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
line 154, in __init__
  self.env.run()
File 
"/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
line 160, in run
  self.run_action(resource, action)
File 
"/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
line 124, in run_action
  provider_action()
File 
"/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
 line 463, in action_create_on_execute
  self.action_delayed("create")
File 
"/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
 line 460, in action_delayed
  self.get_hdfs_resource_executor().action_delayed(action_name, self)
File 
"/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
 line 259, in action_delayed
  self._set_mode(self.target_status)
File 
"/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
 line 366, in _set_mode
  self.util.run_command(self.main_resource.resource.target, 
'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
File 
"/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
 line 195, in run_command
  raise Fail(err_msg)
  resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
'%{http_code}' -X PUT 
'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755''
 returned status_code=403. 
  {
"RemoteException": {
  "exception": "RetriableException", 
  "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
  "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
blocks 697.\nThe number of live datanodes 20 has reached the minimum number 0. 
Safe mode will be turned off automatically once the thresholds have been 
reached."
}
  }
  
  
This happens because the NameNode is not yet out of safe mode at the moment the 
ATS starts, since the DataNodes have only just started.

To fix this, "stop namenodes" has to be triggered before "start all".

If this is done, "Start all" ensures that the DataNodes start before the 
NameNodes, and that the NameNodes are out of safe mode before the ATS starts.
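
For illustration only, a minimal sketch (not the actual Ambari code) of how the
failing WebHDFS call could be guarded with a bounded retry while the NameNode
reports the retriable safe-mode error shown above. The host name and user are
placeholders taken from the log; the timeout and pause values are arbitrary.

    import json
    import subprocess
    import time

    NAMENODE_HTTP = "http://testvgalgo.org:50070"   # placeholder from the log above
    HDFS_USER = "hdfs"

    def set_permission(path, permission, timeout_secs=600, pause_secs=10):
        url = ("%s/webhdfs/v1%s?op=SETPERMISSION&user.name=%s&permission=%s"
               % (NAMENODE_HTTP, path, HDFS_USER, permission))
        deadline = time.time() + timeout_secs
        while True:
            # Same curl form as in the traceback: response body followed by the HTTP status code.
            out = subprocess.check_output(
                ["curl", "-sS", "-L", "-w", "%{http_code}", "-X", "PUT", url]).decode("utf-8")
            body, status = out[:-3], out[-3:]
            if status.startswith("2"):
                return
            exception = ""
            try:
                exception = json.loads(body)["RemoteException"]["exception"]
            except (ValueError, KeyError):
                pass
            # Retry only while the NameNode marks the error as retriable (safe mode);
            # fail with the server response once anything else happens or the deadline passes.
            if exception != "RetriableException" or time.time() > deadline:
                raise Exception("SETPERMISSION on %s failed with HTTP %s: %s"
                                % (path, status, body))
            time.sleep(pause_secs)

    # Example: the exact operation that failed above.
    # set_permission("/ats/done", "755")

This is only meant to show the retry idea; the actual change in this patch is the
reordering of the HA wizard steps described above.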


Diffs
-

  
ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
 24677e4 
  ambari-web/app/messages.js 6465812 

Diff: https://reviews.apache.org/r/48734/diff/


Testing
---

Calling set on destroyed view
Calling set on destroyed view
Calling set on destroyed view
Calling set on destroyed view

  28668 tests complete (34 seconds)
  154 tests pending

[INFO] 
[INFO] --- apache-rat-plugin:0.11:check (default) @ ambari-web ---
[INFO] 51 implicit excludes (use -debug for more details).
[INFO] Exclude: .idea/**
[INFO] Exclude: package.json
[INFO] Exclude: public/**
[INFO] Exclude: public-static/**
[INFO] Exclude: app/assets/**
[INFO] Exclude: vendor/**
[INFO] Exclude: node_modules/**
[INFO] Exclude: node/**
[INFO] Exclude: npm-debug.log
[INFO] 1425 resources included (use -debug for more details)
Warning:  org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser: Property 
'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not