Re: Review Request 60910: Fix PERF stack scripts to handle changed config paths

2017-07-17 Thread Victor Galgo via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60910/#review180676
---


Ship it!




Ship It!

- Victor Galgo


On July 17, 2017, 1:34 p.m., Andrew Onischuk wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60910/
> ---
> 
> (Updated July 17, 2017, 1:34 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Myroslav Papirkovskyy, and 
> Sid Wagle.
> 
> 
> Bugs: AMBARI-21493
> https://issues.apache.org/jira/browse/AMBARI-21493
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> 
> Diffs
> -
> 
>   
> ambari-common/src/main/python/resource_management/libraries/functions/get_not_managed_resources.py
>  4af636b 
>   
> ambari-common/src/main/python/resource_management/libraries/functions/stack_features.py
>  2b3df5f 
>   
> ambari-common/src/main/python/resource_management/libraries/providers/xml_config.py
>  28697bf 
>   ambari-common/src/main/python/resource_management/libraries/script/dummy.py 
> ad5f2a6 
>   
> ambari-common/src/main/python/resource_management/libraries/script/script.py 
> a08feab 
>   
> ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/params.py
>  9abd2fe 
>   
> ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/shared_initialization.py
>  36a202f 
>   
> ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks/before-ANY/scripts/params.py
>  4052d1d 
>   
> ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/params.py
>  6193c11 
>   
> ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks/before-INSTALL/scripts/repo_initialization.py
>  a35dce7 
>   
> ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks/before-START/scripts/params.py
>  70ebfeb 
>   
> ambari-server/src/main/resources/stacks/HDP/2.0.6/hooks/before-START/scripts/shared_initialization.py
>  148d235 
>   
> ambari-server/src/main/resources/stacks/HDP/3.0/hooks/after-INSTALL/scripts/params.py
>  5dcd39b 
>   
> ambari-server/src/main/resources/stacks/HDP/3.0/hooks/after-INSTALL/scripts/shared_initialization.py
>  e9f2283 
>   
> ambari-server/src/main/resources/stacks/HDP/3.0/hooks/before-ANY/scripts/params.py
>  9be9101 
>   
> ambari-server/src/main/resources/stacks/HDP/3.0/hooks/before-INSTALL/scripts/params.py
>  6193c11 
>   
> ambari-server/src/main/resources/stacks/HDP/3.0/hooks/before-START/scripts/params.py
>  5a5361c 
>   
> ambari-server/src/main/resources/stacks/PERF/1.0/hooks/before-ANY/scripts/params.py
>  2c2c901 
>   
> ambari-server/src/main/resources/stacks/PERF/1.0/services/FAKEHBASE/metainfo.xml
>  66d5a29 
>   
> ambari-server/src/main/resources/stacks/PERF/1.0/services/FAKEHDFS/metainfo.xml
>  13b10e0 
>   
> ambari-server/src/main/resources/stacks/PERF/1.0/services/FAKEHDFS/package/scripts/params.py
>  8068441 
>   
> ambari-server/src/main/resources/stacks/PERF/1.0/services/KERBEROS/package/scripts/params.py
>  4eb5b02 
> 
> 
> Diff: https://reviews.apache.org/r/60910/diff/1/
> 
> 
> Testing
> ---
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Andrew Onischuk
> 
>



Re: Review Request 60590: Create a topic to send hostLevelParams

2017-07-04 Thread Victor Galgo via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60590/#review179552
---


Ship it!




Ship It!

- Victor Galgo


On July 3, 2017, 9:36 a.m., Andrew Onischuk wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60590/
> ---
> 
> (Updated July 3, 2017, 9:36 a.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Myroslav Papirkovskyy, and 
> Sid Wagle.
> 
> 
> Bugs: AMBARI-21394
> https://issues.apache.org/jira/browse/AMBARI-21394
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> In this topic, any parameters that are computed on a per-host basis are sent.  
> The parameters can be used by execution/status_commands or by the agent itself.
> 
> 
> Diffs
> -
> 
>   ambari-agent/src/main/python/ambari_agent/ClusterHostLevelParamsCache.py 
> PRE-CREATION 
>   ambari-agent/src/main/python/ambari_agent/Constants.py 02945ee 
>   ambari-agent/src/main/python/ambari_agent/CustomServiceOrchestrator.py 
> 6d1a491 
>   ambari-agent/src/main/python/ambari_agent/HeartbeatThread.py dbf4006 
>   ambari-agent/src/main/python/ambari_agent/InitializerModule.py 8de1fa5 
>   ambari-agent/src/main/python/ambari_agent/RecoveryManager.py 68dd0be 
>   
> ambari-agent/src/main/python/ambari_agent/listeners/HostLevelParamsEventListener.py
>  PRE-CREATION 
>   
> ambari-agent/src/main/python/ambari_agent/listeners/MetadataEventListener.py 
> 364d8af 
>   ambari-agent/src/test/python/ambari_agent/TestAgentStompResponses.py 
> c41f87e 
>   
> ambari-agent/src/test/python/ambari_agent/dummy_files/stomp/host_level_params.json
>  PRE-CREATION 
>   
> ambari-agent/src/test/python/ambari_agent/dummy_files/stomp/metadata_after_registration.json
>  6462ccf 
>   
> ambari-agent/src/test/python/ambari_agent/dummy_files/stomp/topology_add_host.json
>  2458f08 
>   
> ambari-agent/src/test/python/ambari_agent/dummy_files/stomp/topology_cache_expected.json
>  53d0e0d 
>   
> ambari-agent/src/test/python/ambari_agent/dummy_files/stomp/topology_create.json
>  dfe17b9 
> 
> 
> Diff: https://reviews.apache.org/r/60590/diff/4/
> 
> 
> Testing
> ---
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Andrew Onischuk
> 
>
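The host-level-params topic described above can be pictured as an agent-side cache fed by server events. The class name and event shape below are hypothetical (the real code is in ClusterHostLevelParamsCache.py and HostLevelParamsEventListener.py); this is only a sketch of the flow, not the actual implementation:

```python
class HostLevelParamsCache(dict):
    """Toy sketch of an agent-side cache fed by a server-side STOMP topic.

    Each event carries per-cluster dicts of host-level parameters; the cache
    merges them so later execution/status commands can look them up without
    the server resending everything on each heartbeat.
    """

    def on_event(self, event):
        # Assumed event shape: {"clusters": {"<cluster_id>": {"param": "value"}}}
        for cluster_id, params in event.get("clusters", {}).items():
            self.setdefault(cluster_id, {}).update(params)
```

A listener subscribed to the topic would simply call `on_event` for every message it receives.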



Re: Review Request 60602: /usr/sbin/ambari-agent missing after Ambari upgrade

2017-07-03 Thread Victor Galgo via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60602/#review179511
---


Ship it!




Ship It!

- Victor Galgo


On July 3, 2017, 2:42 p.m., Andrew Onischuk wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60602/
> ---
> 
> (Updated July 3, 2017, 2:42 p.m.)
> 
> 
> Review request for Ambari and Vitalyi Brodetskyi.
> 
> 
> Bugs: AMBARI-21396
> https://issues.apache.org/jira/browse/AMBARI-21396
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> STR:
> 
>   * Install Ambari 2.4.2
>   * Upgrade to Ambari 2.5.2
> 
> Result: `/usr/sbin/ambari-agent` is missing; the agent can only be started
> via `service ambari-agent start`
> 
> 
> 
> $ yum upgrade -y ambari-agent
> ...
> ---> Package ambari-agent.x86_64 0:2.4.2.0-163 will be updated
> ---> Package ambari-agent.x86_64 0:2.5.2.0-92 will be an update
> ...
> Updated:
>   ambari-agent.x86_64 0:2.5.2.0-92
> 
> Complete!
> $ ambari-agent start
> -bash: /usr/sbin/ambari-agent: No such file or directory
> $ which ambari-agent
> /usr/bin/which: no ambari-agent in 
> (/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin)
> 
> 
> Works OK if installing from scratch:
> 
> 
> 
> $ yum install -y ambari-agent
> ...
> ---> Package ambari-agent.x86_64 0:2.5.2.0-92 will be installed
> ...
> Installed:
>   ambari-agent.x86_64 0:2.5.2.0-92
> 
> Complete!
> $ which ambari-agent
> /usr/sbin/ambari-agent
> 
> 
> Also works OK with previous build:
> 
> 
> 
> $ yum upgrade -y ambari-agent
> ...
> ---> Package ambari-agent.x86_64 0:2.4.2.0-163 will be updated
> ---> Package ambari-agent.x86_64 0:2.5.2.0-91 will be an update
> ...
> Updated:
>   ambari-agent.x86_64 0:2.5.2.0-91
> 
> Complete!
> $ which ambari-agent
> /usr/sbin/ambari-agent
> 
> 
> Diffs
> -
> 
>   ambari-agent/src/main/package/rpm/posttrans_agent.sh c301fc3 
> 
> 
> Diff: https://reviews.apache.org/r/60602/diff/1/
> 
> 
> Testing
> ---
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Andrew Onischuk
> 
>
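The posttrans_agent.sh diff itself is not shown above, but the general technique for this class of RPM upgrade bug can be sketched: during `yum upgrade`, the old package's %postun scriptlet runs last among the package scripts and can delete shared files such as the `/usr/sbin/ambari-agent` symlink, while a %posttrans step runs after the whole transaction and can restore them. The helper below is an illustrative Python sketch of that restore logic; all paths are assumptions, not the scriptlet's actual contents:

```python
import os

def ensure_agent_symlink(target, link):
    """Recreate a symlink that an RPM upgrade may have removed.

    Sketch of the idea behind the posttrans_agent.sh fix: re-check the link
    after the transaction completes and recreate it if the old package's
    uninstall step deleted it, e.g. (paths assumed):

        ensure_agent_symlink("/var/lib/ambari-agent/bin/ambari-agent",
                             "/usr/sbin/ambari-agent")
    """
    # lexists (not exists) so a dangling symlink is not silently kept broken
    if not os.path.lexists(link):
        os.symlink(target, link)
```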



Re: Review Request 53651: Ambari upgrade failed while running 'Alter Table blueprint' - blueprint_name column

2016-11-10 Thread Victor Galgo

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53651/#review155605
---


Ship it!




Ship It!

- Victor Galgo


On Nov. 10, 2016, 4:23 p.m., Vitalyi Brodetskyi wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53651/
> ---
> 
> (Updated Nov. 10, 2016, 4:23 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Dmytro Grinenko, Dmytro Sen, and 
> Sumit Mohanty.
> 
> 
> Bugs: AMBARI-18640
> https://issues.apache.org/jira/browse/AMBARI-18640
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> Observed errors in today's run during Ambari upgrade from 2.2.1.1 to 
> 2.4.2.0-36
> ambari-server --hash
> c6da6776f029f15d3a7d6009697371eee4e5f4c5
> 
> Ambari DB - MySQL; Secure HDP-2.4.0.0 cluster deployed via UI
> 
> *Upgrade Log indicates below:*
> {code}
> 18 Oct 2016 14:17:59,115  INFO [main] DBAccessorImpl:824 - Executing query: 
> ALTER TABLE users  MODIFY user_name VARCHAR(100)
> 18 Oct 2016 14:17:59,154  INFO [main] DBAccessorImpl:824 - Executing query: 
> ALTER TABLE users  MODIFY user_name VARCHAR(100) NOT NULL
> 18 Oct 2016 14:17:59,191  INFO [main] DBAccessorImpl:824 - Executing query: 
> ALTER TABLE host_role_command  MODIFY role VARCHAR(100)
> 18 Oct 2016 14:17:59,428  INFO [main] DBAccessorImpl:824 - Executing query: 
> ALTER TABLE host_role_command  MODIFY status VARCHAR(100)
> 18 Oct 2016 14:17:59,656  INFO [main] DBAccessorImpl:824 - Executing query: 
> ALTER TABLE blueprint  MODIFY blueprint_name VARCHAR(100)
> 18 Oct 2016 14:17:59,678 ERROR [main] DBAccessorImpl:830 - Error executing 
> query: ALTER TABLE blueprint  MODIFY blueprint_name VARCHAR(100)
> java.sql.SQLException: Cannot change column 'blueprint_name': used in a 
> foreign key constraint 'FK_blueprint_setting_name' of table 
> 'ambaricustom.blueprint_setting'
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:996)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3887)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3823)
> at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2435)
> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2582)
> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2526)
> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2484)
> at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:848)
> at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:742)
> at 
> org.apache.ambari.server.orm.DBAccessorImpl.executeQuery(DBAccessorImpl.java:827)
> at 
> org.apache.ambari.server.orm.DBAccessorImpl.executeQuery(DBAccessorImpl.java:819)
> at 
> org.apache.ambari.server.orm.DBAccessorImpl.alterColumn(DBAccessorImpl.java:610)
> at 
> org.apache.ambari.server.upgrade.UpgradeCatalog242.updateTablesForMysql(UpgradeCatalog242.java:120)
> at 
> org.apache.ambari.server.upgrade.UpgradeCatalog242.executeDDLUpdates(UpgradeCatalog242.java:95)
> at 
> org.apache.ambari.server.upgrade.AbstractUpgradeCatalog.upgradeSchema(AbstractUpgradeCatalog.java:889)
> at 
> org.apache.ambari.server.upgrade.SchemaUpgradeHelper.executeUpgrade(SchemaUpgradeHelper.java:206)
> at 
> org.apache.ambari.server.upgrade.SchemaUpgradeHelper.main(SchemaUpgradeHelper.java:349)
> 18 Oct 2016 14:17:59,680 ERROR [main] SchemaUpgradeHelper:208 - Upgrade 
> failed.
> java.sql.SQLException: Cannot change column 'blueprint_name': used in a 
> foreign key constraint 'FK_blueprint_setting_name' of table 
> 'ambaricustom.blueprint_setting'
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:996)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3887)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3823)
> at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2435)
> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2582)
> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2526)
> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2484)
> at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:848)
> 
> {code}
> 
> 
> Diffs
> -
> 
>   
> ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog242.java
>  f5445ea 
>   
> ambari-server/src/test/java/org/apache/ambari/server/upgrade/UpgradeCatalog242Test.java
>  8cfcee5 
> 
> Diff: https://reviews.apache.org/r/53651/diff/
> 
> 
> Testing
> ---
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Vitalyi Brodetskyi
> 
>
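The MySQL error quoted above is the key: a column referenced by a foreign key cannot be modified while the constraint exists. A hedged sketch of the usual statement ordering follows (drop the FK, alter both columns, re-add the FK); the table and constraint names are taken from the log, but the exact statements issued by UpgradeCatalog242 may differ:

```python
# Statement ordering that avoids MySQL's
# "Cannot change column ... used in a foreign key constraint" error.
STATEMENTS = [
    # 1. Drop the constraint that references blueprint.blueprint_name
    "ALTER TABLE blueprint_setting DROP FOREIGN KEY FK_blueprint_setting_name",
    # 2. Now the referenced column can be modified
    "ALTER TABLE blueprint MODIFY blueprint_name VARCHAR(100) NOT NULL",
    # 3. Keep the child column's type in sync, then restore the constraint
    "ALTER TABLE blueprint_setting MODIFY blueprint_name VARCHAR(100) NOT NULL",
    ("ALTER TABLE blueprint_setting ADD CONSTRAINT FK_blueprint_setting_name "
     "FOREIGN KEY (blueprint_name) REFERENCES blueprint (blueprint_name)"),
]

def run_upgrade(cursor):
    """Execute the statements in order on any DB-API style cursor."""
    for sql in STATEMENTS:
        cursor.execute(sql)
```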



Re: Review Request 52607: oozie server start fails post upgrade to Ambari 2.4.1

2016-10-06 Thread Victor Galgo

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52607/#review151669
---


Ship it!




Ship It!

- Victor Galgo


On Oct. 6, 2016, 4:18 p.m., Andrew Onischuk wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52607/
> ---
> 
> (Updated Oct. 6, 2016, 4:18 p.m.)
> 
> 
> Review request for Ambari and Vitalyi Brodetskyi.
> 
> 
> Bugs: AMBARI-18549
> https://issues.apache.org/jira/browse/AMBARI-18549
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
>   
> **Info**  
> This cluster has gone through the upgrade below:  
> Ambari 2.2.2.0 -> Ambari 2.4.1  
> Cluster Details: <https://github.com/hortonworks/HCube#winchester-hdp-24>  
> Please log in with Okta credentials to these cluster machines.  
> 
> 
> On restarting the Oozie server from Ambari, the error below was thrown.
> 
> 
> 
> 
> Traceback (most recent call last):
>   File 
> "/var/lib/ambari-agent/cache/common-services/OOZIE/4.0.0.2.0/package/scripts/oozie_server.py",
>  line 215, in 
> OozieServer().execute()
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 280, in execute
> method(env)
>   File 
> "/var/lib/ambari-agent/cache/common-services/OOZIE/4.0.0.2.0/package/scripts/oozie_server.py",
>  line 100, in stop
> oozie_service(action='stop', upgrade_type=upgrade_type)
>   File 
> "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, 
> in thunk
> return fn(*args, **kwargs)
>   File 
> "/var/lib/ambari-agent/cache/common-services/OOZIE/4.0.0.2.0/package/scripts/oozie_service.py",
>  line 164, in oozie_service
> user = params.oozie_user)
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 
> 155, in __init__
> self.env.run()
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
> self.run_action(resource, action)
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
> provider_action()
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/providers/system.py",
>  line 273, in action_run
> tries=self.resource.tries, try_sleep=self.resource.try_sleep)
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 
> 71, in inner
> result = function(command, **kwargs)
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 
> 93, in checked_call
> tries=tries, try_sleep=try_sleep)
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 
> 141, in _call_wrapper
> result = _call(command, **kwargs_copy)
>   File 
> "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 
> 294, in _call
> raise Fail(err_msg)
> resource_management.core.exceptions.Fail: Execution of 'cd /var/tmp/oozie 
> && /usr/hdp/current/oozie-server/bin/oozied.sh stop 60 -force' returned 1. 
> -bash: line 0: cd: /var/tmp/oozie: No such file or directory
> 
> 
> We had to work around this problem by running the Oozie stop manually, as below:
> 
> 
> 
> 
> /usr/hdp/2.4.3.0-207/oozie/bin/oozied.sh stop
> 
> 
> Diffs
> -
> 
>   
> ambari-server/src/main/resources/common-services/OOZIE/4.0.0.2.0/package/scripts/oozie_service.py
>  eabaea3 
>   ambari-server/src/test/python/stacks/2.0.6/OOZIE/test_oozie_server.py 
> b0cc2e9 
> 
> Diff: https://reviews.apache.org/r/52607/diff/
> 
> 
> Testing
> ---
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Andrew Onischuk
> 
>
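The diff itself is not reproduced above, but the failure mode is clear: the stop command does `cd /var/tmp/oozie` into a directory that may no longer exist after an upgrade. A minimal sketch of the defensive fix follows (hypothetical helper name; the real change lives in oozie_service.py and uses Ambari's resource_management primitives rather than plain `os` calls):

```python
import os

def oozie_stop_command(tmp_dir="/var/tmp/oozie",
                       oozied="/usr/hdp/current/oozie-server/bin/oozied.sh"):
    """Build the Oozie stop command so it cannot fail on a missing tmp dir.

    Illustrative sketch: ensure the working directory exists before the
    shell command cd's into it, instead of letting `cd` fail with
    "No such file or directory".
    """
    os.makedirs(tmp_dir, exist_ok=True)  # tolerate an already-present dir
    return "cd {0} && {1} stop 60 -force".format(tmp_dir, oozied)
```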



Re: Review Request 51082: Atlas Hook is printing exception during topology submission

2016-08-14 Thread Victor Galgo

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51082/#review145722
---


Ship it!




Ship It!

- Victor Galgo


On Aug. 14, 2016, 12:25 p.m., Andrew Onischuk wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51082/
> ---
> 
> (Updated Aug. 14, 2016, 12:25 p.m.)
> 
> 
> Review request for Ambari and Dmitro Lisnichenko.
> 
> 
> Bugs: AMBARI-18145
> https://issues.apache.org/jira/browse/AMBARI-18145
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> Every topology submission produces a stack trace and log messages that give 
> the impression that something is wrong. There are no test failures, but it 
> would still be better not to have them.
> 
> 
> 
> 
> [upload progress-bar output trimmed: 67740266 / 67740266 bytes completed]
> 2016-08-11 
> 12:36:17,552|beaver.machine|INFO|14772|140568081340160|MainThread|File 
> '/tmp/1fb1523e5fc011e6add4fa163e91b191.jar' uploaded to 
> '/hadoop/storm/nimbus/inbox/stormjar-3bfb75ab-61cc-417d-ae0d-172ce58dfead.jar'
>  (67740266 bytes)
> 2016-08-11 
> 12:36:17,552|beaver.machine|INFO|14772|140568081340160|MainThread|2316 [main] 
> INFO  o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned 
> location: 
> /hadoop/storm/nimbus/inbox/stormjar-3bfb75ab-61cc-417d-ae0d-172ce58dfead.jar
> 2016-08-11 
> 12:36:17,552|beaver.machine|INFO|14772|140568081340160|MainThread|2316 [main] 
> INFO  o.a.s.StormSubmitter - Submitting topology 
> StormCLIDeployTwoTimesSameName in distributed mode with conf 
> {"java.security.auth.login.config":"\/etc\/storm\/conf\/client_jaas.conf","storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-6867977836683662341:-7135463821039054711","topology.workers":3,"storm.thrift.transport":"org.apache.storm.security.auth.kerberos.KerberosSaslTransportPlugin","topology.debug":true}
> 2016-08-11 
> 12:36:17,553|beaver.machine|INFO|14772|140568081340160|MainThread|2317 [main] 
> INFO  o.a.s.m.n.Login - successfully logged in.
> 2016-08-11 
> 12:36:17,568|beaver.machine|INFO|14772|140568081340160|MainThread|2332 [main] 
> INFO  o.a.s.m.n.Login - successfully logged in.
> 2016-08-11 
> 12:36:18,292|beaver.machine|INFO|14772|140568081340160|MainThread|3055 [main] 
> INFO  o.a.s.StormSubmitter - Finished submitting topology: 
> StormCLIDeployTwoTimesSameName
> 2016-08-11 
> 12:36:18,295|beaver.machine|INFO|14772|140568081340160|MainThread|3058 [main] 
> INFO  o.a.s.StormSubmitter - Initializing the registered ISubmitterHook 
> [org.apache.atlas.storm.hook.StormAtlasHook]
> 2016-08-11 
> 12:36:18,392|beaver.machine|INFO|14772|140568081340160|MainThread|3156 [main] 
> INFO  o.a.a.ApplicationProperties - Looking for atlas-application.properties 
> in classpath
> 2016-08-11 
> 12:36:18,393|beaver.machine|INFO|14772|140568081340160|MainThread|3157 [main] 
> INFO  o.a.a.ApplicationProperties - Loading atlas-application.properties from 
> file:/etc/storm/2.5.0.0-1182/0/atlas-application.properties
> 2016-08-11 
> 12:36:18,428|beaver.machine|INFO|14772|140568081340160|MainThread|log4j:WARN 
> No appenders could be found for logger 
> (org.apache.atlas.ApplicationProperties).
>   

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-13 Thread Victor Galgo


> On June 21, 2016, 8:48 p.m., Jonathan Hurley wrote:
> > Ship It!
> 
> Victor Galgo wrote:
> Jonathan, can you please do the honours of helping to commit this patch?
> 
> Jonathan Hurley wrote:
> Has this been committed yet? If so, please close the review.

Hi Jonathan.
It was not. Can you please do the honours?


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138935
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step of enabling HA, "Start all", the following happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION=hdfs=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
>

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-13 Thread Victor Galgo


> On July 13, 2016, 7:02 a.m., Sebastian Toader wrote:
> > I think this is rather generic problem that needs to be handled in 
> > *HdfsResourceJar* and *HdfsResourceWebHDFS (WebHDFSUtil)*.
> > 
> > These are the classes that carry out the HDFS operations. All retryable 
> > operations (e.g. SETPERMISSION) should be guarded with retry logic that 
> > would retry the operation until a given timeout before giving up and 
> > bailing out.
> > 
> > Determining which HDFS operations are retriable might be as easy as looking 
> > at the returned status/error code or the type of the exception (e.g. 
> > "RetriableException"), though it needs to be verified whether this is 
> > consistent across both the webhdfs and hdfsresource jar executors.
> > 
> > The RCO doesn't help here as even though NNs are started before ATS it 
> > doesn't mean that NNs are ready to execute HDFS operations (e.g. it takes 
> > some time to elect active and standby nodes; exiting safe mode may take a 
> > considerable amount of time if there are many datanodes)
> 
> Victor Galgo wrote:
> Hi Sebastian. Thanks for your input. 
> 
> I don't like this approach very much: sometimes the NN can take a really 
> long time to leave safe mode, and we cannot just wait forever.
> 
> Waiting too long would make operations hang while the NN is down, when 
> they should instead fail with an informative error.

"The RCO doesn't help here as even though NNs"

It does help, because the NN start step waits for the NN to leave safe mode before finishing.


- Victor
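For reference, the bounded-retry guard discussed in this thread, retrying only responses that look like a RetriableException and failing with information once the retry budget is spent rather than waiting forever, could be sketched like this (all names hypothetical, not the actual HdfsResourceWebHDFS code):

```python
import time

class RetriesExhausted(Exception):
    """Raised when the NN stays in a retriable state past the retry budget."""

def is_retriable(response):
    # WebHDFS reports safe mode as a RemoteException whose "exception"
    # field is "RetriableException"; only that case is worth retrying.
    return response.get("RemoteException", {}).get("exception") == "RetriableException"

def run_with_retries(operation, tries=10, try_sleep=6, sleep=time.sleep):
    """Retry a WebHDFS operation while the NN reports a retriable state.

    Bounded retries (not an indefinite wait): once the budget is spent the
    caller gets an informative failure instead of a hang.
    """
    for attempt in range(tries):
        response = operation()
        if not is_retriable(response):
            return response
        if attempt < tries - 1:
            sleep(try_sleep)
    raise RetriesExhausted("NameNode still in safe mode after %d attempts" % tries)
```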


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review142023
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step of enabling HA, "Start all", the following happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-07-13 Thread Victor Galgo


> On July 13, 2016, 7:02 a.m., Sebastian Toader wrote:
> > I think this is rather generic problem that needs to be handled in 
> > *HdfsResourceJar* and *HdfsResourceWebHDFS (WebHDFSUtil)*.
> > 
> > These are the classes that carry out the HDFS operations. All retryable 
> > operations (e.g. SETPERMISSION) should be guarded with retry logic that 
> > would retry the operation until a given timeout before giving up and 
> > bailing out.
> > 
> > Determining which HDFS operations are retriable might be as easy as looking 
> > at the returned status/error code or the type of the exception (e.g. 
> > "RetriableException"), though it needs to be verified whether this is 
> > consistent across both the webhdfs and hdfsresource jar executors.
> > 
> > The RCO doesn't help here as even though NNs are started before ATS it 
> > doesn't mean that NNs are ready to execute HDFS operations (e.g. it takes 
> > some time to elect active and standby nodes; exiting safe mode may take a 
> > considerable amount of time if there are many datanodes)

Hi Sebastian. Thanks for your input. 

I don't like this approach very much: sometimes the NN can take a really long 
time to leave safe mode, and we cannot just wait forever.

Waiting too long would make operations hang while the NN is down, when they 
should instead fail with an informative error.


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review142023
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step of enabling HA, "Start all", the following happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in 
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run

Re: Review Request 49535: Ambari Agent memory Leak fix.

2016-07-01 Thread Victor Galgo

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/49535/#review140439
---


Ship it!




Ship It!

- Victor Galgo


On July 1, 2016, 9:48 p.m., Andrew Onischuk wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/49535/
> ---
> 
> (Updated July 1, 2016, 9:48 p.m.)
> 
> 
> Review request for Ambari and Dmitro Lisnichenko.
> 
> 
> Bugs: AMBARI-17539
> https://issues.apache.org/jira/browse/AMBARI-17539
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> Ambari Agent memory Leak fix.
> 
> 
> Diffs
> -
> 
>   ambari-agent/src/main/python/ambari_agent/main.py 4db89f8 
> 
> Diff: https://reviews.apache.org/r/49535/diff/
> 
> 
> Testing
> ---
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Andrew Onischuk
> 
>



Re: Review Request 48722: Reduce the idle time before first command from next stage is executed on a host

2016-06-22 Thread Victor Galgo

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48722/#review139053
---


Ship it!




Ship It!

- Victor Galgo


On June 22, 2016, 11:21 a.m., Sebastian Toader wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48722/
> ---
> 
> (Updated June 22, 2016, 11:21 a.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Laszlo Puskas, Robert Levas, 
> Sandor Magyari, and Sumit Mohanty.
> 
> 
> Bugs: AMBARI-17248
> https://issues.apache.org/jira/browse/AMBARI-17248
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> Commands to be executed by ambari-agents are being sent down by the server in 
> the response message to agent heartbeat messages. 
> The server processes the received heartbeat: it checks whether further 
> commands are scheduled to be executed by the ambari-agent and adds those to 
> the heartbeat response.
> The server organises the commands that can be executed in parallel into 
> stages. Ambari server ensures that only the commands of a single stage are 
> scheduled to be executed by the agent, and starts scheduling the commands of 
> the next stage only after all commands of the current stage have finished 
> successfully.
> The processing of command statuses received with a heartbeat happens 
> asynchronously to heartbeat response creation (in HeartBeatProcessor and 
> ActionScheduler), so when the heartbeat response is created the commands for 
> the next stage are not yet scheduled. This means the next commands will be 
> sent to the agent only with the next heartbeat.
> Agents currently send a heartbeat to the server on command completion, or at 
> a timeout = self.netutil.HEARTBEAT_IDDLE_INTERVAL_SEC – 
> self.netutil.MINIMUM_INTERVAL_BETWEEN_HEARTBEATS interval, which is ~10 
> seconds if there are no commands to be executed.
> This means that when the server receives a heartbeat triggered by the 
> completion of the last command of the current stage, it sends the commands 
> for the next stage only ~10 seconds later, when the next heartbeat is 
> received. This leaves agents idle for a considerable amount of time when 
> there are multiple stages to be executed.
> Agents should heartbeat at a higher rate while there are still pending 
> stages to be executed.
> 
> 
> Diffs
> -
> 
>   ambari-agent/conf/unix/ambari-agent.ini 8f2ab1b 
>   ambari-agent/conf/unix/upgrade_agent_configs.py 583b5aa 
>   ambari-agent/conf/windows/ambari-agent.ini df88be6 
>   ambari-agent/src/main/python/ambari_agent/AmbariConfig.py 89a881a 
>   ambari-agent/src/main/python/ambari_agent/Controller.py e981a76 
>   ambari-agent/src/main/python/ambari_agent/Heartbeat.py 91098e0 
>   ambari-agent/src/main/python/ambari_agent/NetUtil.py 80bf3ae 
>   ambari-agent/src/test/python/ambari_agent/TestHeartbeat.py f113083 
>   ambari-agent/src/test/python/ambari_agent/TestNetUtil.py d72e319 
>   ambari-agent/src/test/python/ambari_agent/examples/ControllerTester.py 
> 8103872 
>   
> ambari-server/src/main/java/org/apache/ambari/server/agent/HeartBeatHandler.java
>  35a37e3 
>   
> ambari-server/src/main/java/org/apache/ambari/server/agent/HeartBeatResponse.java
>  1ab7ae9 
>   ambari-server/src/main/java/org/apache/ambari/server/state/Cluster.java 
> ac0ddd2 
>   ambari-server/src/main/java/org/apache/ambari/server/state/Clusters.java 
> bd9de13 
>   
> ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java
>  3d2388e 
>   
> ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClustersImpl.java
>  c26e1e9 
>   
> ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterImplTest.java
>  627ade9 
> 
> Diff: https://reviews.apache.org/r/48722/diff/
> 
> 
> Testing
> ---
> 
> Manual testing.
> 
> Unit tests succeeded.
> 
> 
> Thanks,
> 
> Sebastian Toader
> 
>
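
The faster heartbeat rate proposed in the request above could be sketched as follows; the constant values and the active-rate figure are assumptions for illustration, not ambari-agent's actual configuration.

```python
# Hedged sketch of the proposed behaviour; constant values and the faster
# rate below are illustrative, not the real ambari-agent settings.
HEARTBEAT_IDLE_INTERVAL_SEC = 10           # idle heartbeat period (assumed)
MINIMUM_INTERVAL_BETWEEN_HEARTBEATS = 0.1  # floor between heartbeats (assumed)
ACTIVE_HEARTBEAT_INTERVAL_SEC = 1          # hypothetical faster rate

def next_heartbeat_delay(commands_running, stages_pending):
    """Delay before the next heartbeat: idle agents wait ~10 seconds, while
    agents with running commands or pending stages heartbeat at a higher
    rate so the next stage's commands are picked up sooner."""
    if commands_running or stages_pending:
        return max(ACTIVE_HEARTBEAT_INTERVAL_SEC,
                   MINIMUM_INTERVAL_BETWEEN_HEARTBEATS)
    return HEARTBEAT_IDLE_INTERVAL_SEC - MINIMUM_INTERVAL_BETWEEN_HEARTBEATS
```

With these assumed constants, an agent with a pending stage would heartbeat roughly every second instead of every ~10 seconds.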



Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-21 Thread Victor Galgo


> On June 21, 2016, 8:48 p.m., Jonathan Hurley wrote:
> > Ship It!

Jonathan, could you please do the honours of committing this patch?


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138935
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in <module>
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
> }
>   }
>   
>   
> This happens because NN is not yet out of safemode at the 

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-21 Thread Victor Galgo


> On June 17, 2016, 6:44 p.m., Di Li wrote:
> > Hello Victor,
> > 
> > so I ran some tests and observed the following. I have a 3 node cluster, 
> > c1.apache.org, c2.apache.org, and c3.apache.org
> > 
> > 1. Right after finishing the manual steps listed on the "Initialize 
> > Metadata" step, I noticed c1.apache.org has the NameNode process running, 
> > but it's the standby. c2.apache.org (the newly added NN) has its NN 
> > stopped.
> > 
> > 2. The state of the two NNs in #1 seems to have caused the NN's 
> > check_is_active_namenode function call to return False, thus setting 
> > ensure_safemode_off to False as well, skipping the safemode check 
> > altogether.
> > 
> > 3. If I just run the safemode check via command-line hadoop commands, here 
> > are the results; notice that safemode is reported as ON on the standby 
> > node, and the other one is a connection-refused error:
> > 
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn1
> > standby
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn2
> > 16/06/17 11:26:42 INFO ipc.Client: Retrying connect to server: 
> > c1.apache.org:8020. Already tried 0 time(s); retry policy is 
> > RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 
> > MILLISECONDS)
> > Operation failed: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > [hdfs@c1 ~]$ hdfs dfsadmin -safemode get
> > Safe mode is ON in c1.apache.org:8020
> > safemode: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > 
> > So in my opinion, the fix should be at the NameNode Python script level to 
> > always check safemode against the two NNs, and make sure the safemode is 
> > off on the active namenode. As a safeguard against offline active NN, the 
> > check should eventually timeout to unblock the rest of the start sequence.
> 
> Victor Galgo wrote:
> "So in my opinion, the fix should be at the NameNode Python script level 
> to always check safemode against the two NNs". 
> We cannot do that, because at that point all DataNodes are stopped, 
> which means the NN will never go out of safemode.
> 
> Alejandro Fernandez wrote:
> Please include Jonathan Hurley in the code review since he recently 
> modified the function that waits to leave safemode.
> This is not the first time that we've had the need for a step to "leave 
> safe mode". So either we put it into the python code (and do a lot of testing 
> on it since it also impacts EU and RU), or make a custom command for HDFS 
> that is only available if HA is present, and it waits for NameNode to leave 
> safemode.
> 
> Jonathan Hurley wrote:
> Yes, I recently added something for the case during an EU where we know 
> that the NameNode probably won't leave Safemode. Essentially, don't try to 
> create any directories if the NN didn't wait for safemode to exit. That was 
> only for NN, though.
> 
> But this problem is a more generic case - it affects other services. 
> Since NN wasn't restarted it might be in Safemode. In this case, I think we 
> need to handle the retryable exception and back off and wait. 
> 
> However, you could also argue that since we know we're doing a restart 
> operation, we should be shutting down the NNs completely. If there's no issue 
> with shutting them down during the HA process, then this patch seems fine for 
> now, but we should open another one for catching the RetryableException.

Thanks Jonathan! I absolutely agree with your points. Could you please Ship it?
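
Jonathan's suggestion to catch the RetriableException and back off could look roughly like this; `run_with_backoff`, its `call` argument, and the retry parameters are hypothetical helpers for illustration, not the actual hdfs_resource.py API.

```python
import time

class Fail(Exception):
    """Stand-in for resource_management.core.exceptions.Fail."""

def run_with_backoff(call, tries=6, initial_delay=5):
    """Retry a WebHDFS operation while the NameNode answers with a
    RetriableException (e.g. it is still in safemode), sleeping with
    exponential backoff between attempts. `call` is an injected helper
    returning the parsed JSON response body."""
    delay = initial_delay
    for attempt in range(tries):
        result = call()
        exc = result.get("RemoteException", {}).get("exception")
        if exc != "RetriableException":
            return result  # success or a non-retryable error: hand it back
        if attempt < tries - 1:
            time.sleep(delay)
            delay *= 2  # exponential backoff between retries
    raise Fail("NameNode still in safemode after %d attempts" % tries)
```

This keeps the caller's control flow unchanged: a non-retryable response is returned as before, and only the RetriableException path loops.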


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138280
---


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Le

Re: Review Request 48722: Reduce the idle time before first command from next stage is executed on a host

2016-06-20 Thread Victor Galgo

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48722/#review138580
---


Ship it!




Nice!

- Victor Galgo


On June 16, 2016, 4:43 p.m., Sebastian Toader wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48722/
> ---
> 
> (Updated June 16, 2016, 4:43 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Laszlo Puskas, Robert Levas, 
> Sandor Magyari, and Sumit Mohanty.
> 
> 
> Bugs: AMBARI-17248
> https://issues.apache.org/jira/browse/AMBARI-17248
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> Commands to be executed by ambari-agents are being sent down by the server in 
> the response message to agent heartbeat messages. 
> The server processes the received heartbeat: it checks whether further 
> commands are scheduled to be executed by the ambari-agent and adds those to 
> the heartbeat response.
> The server organises the commands that can be executed in parallel into 
> stages. Ambari server ensures that only the commands of a single stage are 
> scheduled to be executed by the agent, and starts scheduling the commands of 
> the next stage only after all commands of the current stage have finished 
> successfully.
> The processing of command statuses received with a heartbeat happens 
> asynchronously to heartbeat response creation (in HeartBeatProcessor and 
> ActionScheduler), so when the heartbeat response is created the commands for 
> the next stage are not yet scheduled. This means the next commands will be 
> sent to the agent only with the next heartbeat.
> Agents currently send a heartbeat to the server on command completion, or at 
> a timeout = self.netutil.HEARTBEAT_IDDLE_INTERVAL_SEC – 
> self.netutil.MINIMUM_INTERVAL_BETWEEN_HEARTBEATS interval, which is ~10 
> seconds if there are no commands to be executed.
> This means that when the server receives a heartbeat triggered by the 
> completion of the last command of the current stage, it sends the commands 
> for the next stage only ~10 seconds later, when the next heartbeat is 
> received. This leaves agents idle for a considerable amount of time when 
> there are multiple stages to be executed.
> Agents should heartbeat at a higher rate while there are still pending 
> stages to be executed.
> 
> 
> Diffs
> -
> 
>   ambari-agent/conf/unix/ambari-agent.ini 8f2ab1b 
>   ambari-agent/conf/windows/ambari-agent.ini df88be6 
>   ambari-agent/src/main/python/ambari_agent/AmbariConfig.py 89a881a 
>   ambari-agent/src/main/python/ambari_agent/Controller.py e981a76 
>   ambari-agent/src/main/python/ambari_agent/NetUtil.py 80bf3ae 
>   ambari-agent/src/test/python/ambari_agent/TestNetUtil.py d72e319 
>   ambari-agent/src/test/python/ambari_agent/examples/ControllerTester.py 
> 8103872 
>   
> ambari-server/src/main/java/org/apache/ambari/server/agent/HeartBeatHandler.java
>  35a37e3 
>   
> ambari-server/src/main/java/org/apache/ambari/server/agent/HeartBeatResponse.java
>  1ab7ae9 
>   ambari-server/src/main/java/org/apache/ambari/server/state/Cluster.java 
> ac0ddd2 
>   ambari-server/src/main/java/org/apache/ambari/server/state/Clusters.java 
> bd9de13 
>   
> ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java
>  3d2388e 
>   
> ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClustersImpl.java
>  c26e1e9 
>   
> ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterImplTest.java
>  627ade9 
> 
> Diff: https://reviews.apache.org/r/48722/diff/
> 
> 
> Testing
> ---
> 
> Manual testing.
> 
> Unit tests succeeded.
> 
> 
> Thanks,
> 
> Sebastian Toader
> 
>



Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Victor Galgo


> On June 17, 2016, 6:44 p.m., Di Li wrote:
> > Hello Victor,
> > 
> > so I ran some tests and observed the following. I have a 3 node cluster, 
> > c1.apache.org, c2.apache.org, and c3.apache.org
> > 
> > 1. Right after finishing the manual steps listed on the "Initialize 
> > Metadata" step, I noticed c1.apache.org has the NameNode process running, 
> > but it's the standby. c2.apache.org (the newly added NN) has its NN 
> > stopped.
> > 
> > 2. The state of the two NNs in #1 seems to have caused the NN's 
> > check_is_active_namenode function call to return False, thus setting 
> > ensure_safemode_off to False as well, skipping the safemode check 
> > altogether.
> > 
> > 3. If I just run the safemode check via command-line hadoop commands, here 
> > are the results; notice that safemode is reported as ON on the standby 
> > node, and the other one is a connection-refused error:
> > 
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn1
> > standby
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn2
> > 16/06/17 11:26:42 INFO ipc.Client: Retrying connect to server: 
> > c1.apache.org:8020. Already tried 0 time(s); retry policy is 
> > RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 
> > MILLISECONDS)
> > Operation failed: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > [hdfs@c1 ~]$ hdfs dfsadmin -safemode get
> > Safe mode is ON in c1.apache.org:8020
> > safemode: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > 
> > So in my opinion, the fix should be at the NameNode Python script level to 
> > always check safemode against the two NNs, and make sure the safemode is 
> > off on the active namenode. As a safeguard against offline active NN, the 
> > check should eventually timeout to unblock the rest of the start sequence.

"So in my opinion, the fix should be at the NameNode Python script level to 
always check safemode against the two NNs". 
We cannot do that, because at that point all DataNodes are stopped, which 
means the NN will never go out of safemode.
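
As a sanity check on why the NN stays stuck, the figures in the SafeModeException quoted elsewhere in this thread are consistent with that: with every DataNode stopped, the reported-block count is frozen below the threshold.

```python
import math

# Figures taken from the SafeModeException message quoted in this thread.
total_blocks = 697
reported_blocks = 675
threshold = 0.9900

# The NameNode leaves safemode once reported blocks reach
# ceil(threshold * total_blocks); with every DataNode stopped the
# reported count cannot grow, so safemode never ends on its own.
needed = math.ceil(threshold * total_blocks)  # 691
additional = needed - reported_blocks         # 16, matching "needs additional 16 blocks"
```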


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138280
---


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in <module>
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/bas

Re: Review Request 48863: Ambari-server upgrade results in "DB configs consistency check failed. "

2016-06-17 Thread Victor Galgo

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48863/#review138284
---


Ship it!




Ship It!

- Victor Galgo


On June 17, 2016, 6:30 p.m., Vitalyi Brodetskyi wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48863/
> ---
> 
> (Updated June 17, 2016, 6:30 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Dmitro Lisnichenko, Dmytro Sen, 
> and Sumit Mohanty.
> 
> 
> Bugs: AMBARI-17302
> https://issues.apache.org/jira/browse/AMBARI-17302
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> A cluster deployed via BP on ambari 2.2.1 has no slider-client config. 
> After upgrade to ambari 2.4.0 this issue appears:
> 2016-06-16 01:19:10,963 INFO - *** Check database 
> started ***
> 2016-06-16 01:19:14,660 INFO - Checking for configs not mapped to any cluster
> 2016-06-16 01:19:14,681 INFO - Checking for configs selected more than once
> 2016-06-16 01:19:14,683 INFO - Checking for hosts without state
> 2016-06-16 01:19:14,684 INFO - Checking host component states count equals 
> host component desired states count
> 2016-06-16 01:19:14,685 INFO - Checking services and their configs
> 2016-06-16 01:19:16,045 ERROR - Required config(s): slider-client is(are) not 
> available for service SLIDER with service config version 2 in cluster 
> hortonhdp
> 2016-06-16 01:19:16,161 INFO - *** Check database 
> completed ***
> 
> 
> Diffs
> -
> 
>   
> ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java
>  13206c0 
>   
> ambari-server/src/test/java/org/apache/ambari/server/upgrade/UpgradeCatalog240Test.java
>  1288053 
> 
> Diff: https://reviews.apache.org/r/48863/diff/
> 
> 
> Testing
> ---
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Vitalyi Brodetskyi
> 
>
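
The "required config(s) ... not available" condition flagged in the DB consistency check log above can be illustrated with a small sketch; the data shapes, helper name, and required-type mapping are illustrative, not Ambari's actual DB-check implementation.

```python
def find_missing_required_configs(service_config_versions, required_types):
    """Report, for each service config version, any required config types
    that are absent; this is the condition the consistency check logs as an
    ERROR (e.g. SLIDER service config version 2 missing slider-client)."""
    problems = []
    for scv in service_config_versions:
        required = set(required_types.get(scv["service"], ()))
        missing = required - set(scv["configs"])
        if missing:
            problems.append((scv["service"], scv["version"], sorted(missing)))
    return problems
```

Run against data shaped like the log above, a SLIDER version lacking slider-client would be the single reported problem.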



Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Victor Galgo


> On June 17, 2016, 3:15 p.m., Jayush Luniya wrote:
> > ambari-web/app/messages.js, line 1325
> > <https://reviews.apache.org/r/48734/diff/1/?file=1420113#file1420113line1325>
> >
> > Not sure if stopping namenodes is the right way to go about this.

Jayush, it looks right to me because the NNs should be started together with 
the other components during "Start All", to ensure correct ordering and to 
wait for safemode to be turned off. If you've got any suggestions on how it 
can be fixed another way, please feel free to re-open the issue.

Thanks!


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138234
---


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in <module>
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission f

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-17 Thread Victor Galgo


> On June 17, 2016, 5:44 p.m., Alejandro Fernandez wrote:
> > ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js,
> >  line 25
> > <https://reviews.apache.org/r/48734/diff/1/?file=1420112#file1420112line25>
> >
> > How does this fix the issue? If NN just started, it still needs to get 
> > block reports, so ATS can still fail.

Alejandro thanks for having a look!

This fixes the issue because when we do "Start All" later on, NN start is 
triggered before ATS start (role_command_order), and during NN start it waits 
until safemode is off before proceeding with ATS and the others.
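
That wait-until-safemode-is-off step during NameNode start could be sketched like this; the command string, timeouts, and the injected `run_cmd` shell runner are illustrative, not Ambari's actual implementation.

```python
import time

def wait_for_safemode_off(run_cmd, timeout_sec=1800, poll_sec=30):
    """Block until `run_cmd("hdfs dfsadmin -safemode get")` reports that
    safemode is OFF, or until the timeout expires. Components ordered after
    the NameNode (ATS etc.) then see a writable HDFS."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if "Safe mode is OFF" in run_cmd("hdfs dfsadmin -safemode get"):
            return True
        if time.monotonic() >= deadline:
            return False  # give up; caller decides whether to fail the start
        time.sleep(poll_sec)
```

Injecting `run_cmd` keeps the polling loop testable without a live cluster; in production it would shell out as the hdfs user.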


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138263
-------


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step "Start all" on enabling HA below happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in <module>
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>   self.configure(env) # FOR SECURITY
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>   yarn(name='apptimelineserver')
> File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>   return fn(*args, **kwargs)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>   mode=0755
> File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>   self.env.run()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>   self.run_action(resource, action)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>   provider_action()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>   self.action_delayed("create")
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>   self.get_hdfs_resource_executor().action_delayed(action_name, self)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>   self._set_mode(self.target_status)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>   self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>   raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755''
>  returned status_code=403. 
>   {
> "RemoteException": {
>   "exception": "RetriableException", 
>   "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>   "message": "org

Re: Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-16 Thread Victor Galgo


> On June 16, 2016, 6:34 p.m., Di Li wrote:
> > ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js,
> >  line 146
> > <https://reviews.apache.org/r/48734/diff/1/?file=1420112#file1420112line146>
> >
> > I am under the impression that the time it takes for NN to exit the 
> > safemode is largely determined by the amount of data in HDFS, not whether 
> > DNs are started before NN. 
> > 
> > Would it be safer to have some logic to check whether the NameNode is out 
> > of safemode? On a cluster with terabytes of data in HDFS, it may take the 
> > NN quite some time (a few minutes, depending on the cluster's performance) 
> > to exit safemode.
> 
> Victor Galgo wrote:
> Hi Di Li! Thanks for taking a look into this.
> 
> The problem here is more complicated than it looks.
> 
> *Here is the basic scenario for handling safemode:*
> During "Start All", on NameNode start we wait until the NameNode leaves 
> safemode before declaring the start successful.
> 
> *However, in the HA wizard:*
> We start the NameNodes at a point when the DataNodes are stopped. This 
> means the NNs cannot leave safemode at that point, which is why we skip 
> that wait on NN start in the HA wizard.
> After that, when we do "Start All" (the last step in the wizard), the 
> NameNodes are already started, so no wait for them to leave safemode is 
> triggered when the DNs are started.
> 
> My solution stops the NNs before "Start All", which means that when 
> "Start All" in the HA wizard runs, the NN start step will ensure the NNs 
> leave safemode (since the DNs are already started at that point).
> 
> Di Li wrote:
> Hello Victor,
> 
> Thanks for the explanation. I may be asking something obvious to 
> experienced eyes so please bear with me.
> Could you please
> 1. point me to the logic where, during "Start All", on NameNode start we 
> wait until safemode is off before declaring the start successful, and
> 2. point me to the logic that skips #1 when the DNs aren't running?
> 
> I looked at the HDFS NameNode Python scripts; the "wait_for_safemode_off" 
> method seems to be called only during upgrades. I could have missed 
> something, so please let me know.

ensure_safemode_off = True

# True if this is the only NameNode (non-HA) or if it is the Active one in HA
is_active_namenode = True

if params.dfs_ha_enabled:
  Logger.info("Waiting for the NameNode to broadcast whether it is Active or Standby...")
  if check_is_active_namenode(hdfs_binary):
    Logger.info("Waiting for the NameNode to leave Safemode since High Availability is enabled and it is Active...")
  else:
    # we are the STANDBY NN
    ensure_safemode_off = False


check_is_active_namenode will return False after many retries for both 
NameNodes, since neither of them is even out of safemode yet. That sets 
ensure_safemode_off to False, which makes the code below get skipped:

# wait for Safemode to end
if ensure_safemode_off:
  wait_for_safemode_off(hdfs_binary)
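The check Di Li suggests — polling the NameNode until it reports that safemode 
is off — is essentially what wait_for_safemode_off amounts to. Below is a 
minimal, hypothetical sketch of such a polling loop; the function names, 
parameters, and retry counts are illustrative, not Ambari's actual 
implementation:

```python
import subprocess
import time

def is_safemode_off(dfsadmin_output):
    # 'hdfs dfsadmin -safemode get' typically prints "Safe mode is ON"
    # or "Safe mode is OFF"; in HA setups it prints one line per NameNode.
    return "Safe mode is OFF" in dfsadmin_output

def wait_for_safemode_off(hdfs_binary="hdfs", retries=30, sleep_seconds=10):
    """Poll the NameNode until it reports that safemode is off."""
    for _ in range(retries):
        result = subprocess.run(
            [hdfs_binary, "dfsadmin", "-safemode", "get"],
            capture_output=True, text=True,
        )
        if is_safemode_off(result.stdout):
            return
        time.sleep(sleep_seconds)
    raise RuntimeError("NameNode did not leave safemode after %d checks" % retries)
```

On a cluster with terabytes of data, the retry budget would need to be sized 
generously, since safemode exit time grows with the number of blocks the NN 
must account for.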


- Victor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138047
---


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> ---
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
> https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> On the last step, "Start All", when enabling HA, the following happens:
> 
> Traceback (most recent call last):
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in <module>
>   ApplicationTimelineServer().execute()
> File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>   method(env)
> File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/applicat


Review Request 48734: App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-15 Thread Victor Galgo
lude: node/**
[INFO] Exclude: npm-debug.log
[INFO] 1425 resources included (use -debug for more details)
Warning:  org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser: Property 
'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not 
recognized.
Compiler warnings:
  WARNING:  'org.apache.xerces.jaxp.SAXParserImpl: Property 
'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.'
Warning:  org.apache.xerces.parsers.SAXParser: Feature 
'http://javax.xml.XMLConstants/feature/secure-processing' is not recognized.
Warning:  org.apache.xerces.parsers.SAXParser: Property 
'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.
Warning:  org.apache.xerces.parsers.SAXParser: Property 
'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not 
recognized.
[INFO] Rat check: Summary of files. Unapproved: 0 unknown: 0 generated: 0 
approved: 1425 licence.
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 1:31.015s
[INFO] Finished at: Sun Jun 12 14:37:47 EEST 2016
[INFO] Final Memory: 13M/407M
[INFO] 

Also, to test this, I have installed a 3-node cluster and enabled NameNode HA 
on it.


Thanks,

Victor Galgo