[jira] [Created] (MESOS-5204) Accessibility Enhancement For Page "Offers"

2016-04-12 Thread Chen Nan Li (JIRA)
Chen Nan Li created MESOS-5204:
--

 Summary: Accessibility Enhancement For Page "Offers"
 Key: MESOS-5204
 URL: https://issues.apache.org/jira/browse/MESOS-5204
 Project: Mesos
  Issue Type: Task
Reporter: Chen Nan Li
Assignee: Chen Nan Li
Priority: Minor








[jira] [Updated] (MESOS-5203) Accessibility Enhancement For Page "Slaves"

2016-04-12 Thread Chen Nan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Nan Li updated MESOS-5203:
---
Priority: Minor  (was: Major)

> Accessibility Enhancement For Page "Slaves"
> ---
>
> Key: MESOS-5203
> URL: https://issues.apache.org/jira/browse/MESOS-5203
> Project: Mesos
>  Issue Type: Task
>Reporter: Chen Nan Li
>Assignee: Chen Nan Li
>Priority: Minor
>






[jira] [Created] (MESOS-5202) Accessibility Enhancement For Page "Frameworks"

2016-04-12 Thread Chen Nan Li (JIRA)
Chen Nan Li created MESOS-5202:
--

 Summary: Accessibility Enhancement For Page "Frameworks"
 Key: MESOS-5202
 URL: https://issues.apache.org/jira/browse/MESOS-5202
 Project: Mesos
  Issue Type: Task
Reporter: Chen Nan Li
Assignee: Chen Nan Li
Priority: Minor








[jira] [Created] (MESOS-5203) Accessibility Enhancement For Page "Slaves"

2016-04-12 Thread Chen Nan Li (JIRA)
Chen Nan Li created MESOS-5203:
--

 Summary: Accessibility Enhancement For Page "Slaves"
 Key: MESOS-5203
 URL: https://issues.apache.org/jira/browse/MESOS-5203
 Project: Mesos
  Issue Type: Task
Reporter: Chen Nan Li
Assignee: Chen Nan Li








[jira] [Updated] (MESOS-5201) Accessibility Enhancement For Page "Mesos"

2016-04-12 Thread Chen Nan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Nan Li updated MESOS-5201:
---
Summary: Accessibility Enhancement For Page "Mesos"  (was: Accessibility 
Enhancement For Page Mesos)

> Accessibility Enhancement For Page "Mesos"
> --
>
> Key: MESOS-5201
> URL: https://issues.apache.org/jira/browse/MESOS-5201
> Project: Mesos
>  Issue Type: Task
>Reporter: Chen Nan Li
>Assignee: Chen Nan Li
>Priority: Minor
>






[jira] [Created] (MESOS-5201) Accessibility Enhancement For Page Mesos

2016-04-12 Thread Chen Nan Li (JIRA)
Chen Nan Li created MESOS-5201:
--

 Summary: Accessibility Enhancement For Page Mesos
 Key: MESOS-5201
 URL: https://issues.apache.org/jira/browse/MESOS-5201
 Project: Mesos
  Issue Type: Task
Reporter: Chen Nan Li
Assignee: Chen Nan Li
Priority: Minor








[jira] [Updated] (MESOS-5185) Accessibility for Mesos Web UI

2016-04-12 Thread Chen Nan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Nan Li updated MESOS-5185:
---
Description: 
Currently, the Mesos Web UI does not adequately support accessibility features
for disabled people.

For example:

The Web GUI should support screen readers so that blind users can have page
content read to them. We can fix such issues by making the Mesos Web GUI pages
support the [WAI-ARIA standard|https://www.w3.org/WAI/intro/aria].

We could update the webui according to [Accessibility Design Guidelines for the
Web|https://msdn.microsoft.com/en-us/library/aa291312(v=vs.71).aspx] and
https://www.w3.org/standards/webdesign/accessibility.


  was:
Currently, the Mesos Web UI does not fully support accessibility features for
disabled people.

For example:

The Web GUI should support screen readers so that blind users can have page
content read to them. We can fix such issues by making the Mesos Web GUI pages
support the [WAI-ARIA standard|https://www.w3.org/WAI/intro/aria].

We could update the webui according to [Accessibility Design Guidelines for the
Web|https://msdn.microsoft.com/en-us/library/aa291312(v=vs.71).aspx] and
https://www.w3.org/standards/webdesign/accessibility.



> Accessibility for Mesos Web UI
> --
>
> Key: MESOS-5185
> URL: https://issues.apache.org/jira/browse/MESOS-5185
> Project: Mesos
>  Issue Type: Epic
>  Components: webui
>Reporter: haosdent
>Assignee: Chen Nan Li
>Priority: Minor
>
> Currently, the Mesos Web UI does not adequately support accessibility
> features for disabled people.
> For example:
> The Web GUI should support screen readers so that blind users can have page
> content read to them. We can fix such issues by making the Mesos Web GUI
> pages support the [WAI-ARIA standard|https://www.w3.org/WAI/intro/aria].
> We could update the webui according to [Accessibility Design Guidelines for
> the Web|https://msdn.microsoft.com/en-us/library/aa291312(v=vs.71).aspx] and
> https://www.w3.org/standards/webdesign/accessibility.





[jira] [Updated] (MESOS-540) Executor health checking.

2016-04-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-540:
---
Labels: health-check twitter  (was: twitter)

> Executor health checking.
> -
>
> Key: MESOS-540
> URL: https://issues.apache.org/jira/browse/MESOS-540
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: health-check, twitter
>
> We currently do not health check running executors.
> At Twitter, this has led to out-of-band health checking of executors for an 
> internal framework.
> For the Storm framework, this has led to out-of-band health checking via 
> ZooKeeper. Health checking would allow Storm to use finer grained executors 
> for better isolation.
> This would also help the Hadoop and Jenkins frameworks, should health
> checking be desired.
> As for implementation, I would propose adding a call on the Executor 
> interface:
> /**
>  * Invoked by the ExecutorDriver to determine the health of the executor.
>  * When this function returns, the Executor is considered healthy.
>  */
> void heartbeat(ExecutorDriver* driver) = 0;
> The driver can then heartbeat periodically and kill the executor when it is
> not responding to heartbeats. The driver should also detect the executor
> deadlocking on any of the other callbacks.
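To make the proposal concrete, here is a minimal, self-contained C++ sketch of
the driver-side round the description above implies. The Executor and
ExecutorDriver stand-ins and heartbeatRound() are illustrative assumptions,
not existing Mesos API.

{code}
// Hypothetical sketch (not existing Mesos API): one heartbeat round as the
// proposal describes it -- invoke heartbeat() asynchronously and treat a
// timely return as healthy.
#include <chrono>
#include <future>
#include <iostream>

struct ExecutorDriver {};  // Stand-in; the real driver lives in Mesos.

// The call proposed above: returning from heartbeat() marks the executor
// healthy for this round.
struct Executor
{
  virtual ~Executor() {}
  virtual void heartbeat(ExecutorDriver* driver) = 0;
};

bool heartbeatRound(
    Executor* executor,
    ExecutorDriver* driver,
    const std::chrono::seconds& timeout)
{
  std::future<void> done = std::async(std::launch::async, [=] {
    executor->heartbeat(driver);
  });

  // NOTE: a std::async future blocks in its destructor until the task
  // finishes, so a production driver would use a detached watchdog thread;
  // this only illustrates the timeout decision.
  return done.wait_for(timeout) == std::future_status::ready;
}

struct HealthyExecutor : Executor
{
  void heartbeat(ExecutorDriver*) override {}  // Returns promptly => healthy.
};

int main()
{
  ExecutorDriver driver;
  HealthyExecutor executor;

  if (heartbeatRound(&executor, &driver, std::chrono::seconds(5))) {
    std::cout << "executor healthy" << std::endl;
  } else {
    std::cout << "executor unresponsive; the driver would kill it" << std::endl;
  }
}
{code}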





[jira] [Updated] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2016-04-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-1653:

Labels: flaky health-check mesosphere  (was: flaky mesosphere)

> HealthCheckTest.GracePeriod is flaky.
> -
>
> Key: MESOS-1653
> URL: https://issues.apache.org/jira/browse/MESOS-1653
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Assignee: Timothy Chen
>  Labels: flaky, health-check, mesosphere
>
> {noformat}
> [--] 3 tests from HealthCheckTest
> [ RUN  ] HealthCheckTest.GracePeriod
> Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
> I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
> I0729 17:10:10.499487  1176 leveldb.cpp:183] Compacted db in 13.674118ms
> I0729 17:10:10.500200  1176 leveldb.cpp:198] Created db iterator in 7394ns
> I0729 17:10:10.500692  1176 leveldb.cpp:204] Seeked to beginning of db in 
> 2317ns
> I0729 17:10:10.501113  1176 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 1367ns
> I0729 17:10:10.501535  1176 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0729 17:10:10.502233  1212 recover.cpp:425] Starting replica recovery
> I0729 17:10:10.502295  1212 recover.cpp:451] Replica is in EMPTY status
> I0729 17:10:10.502825  1212 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0729 17:10:10.502877  1212 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0729 17:10:10.502980  1212 recover.cpp:542] Updating replica status to 
> STARTING
> I0729 17:10:10.508482  1213 master.cpp:289] Master 
> 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701
> I0729 17:10:10.508607  1213 master.cpp:326] Master only allowing 
> authenticated frameworks to register
> I0729 17:10:10.508632  1213 master.cpp:331] Master only allowing 
> authenticated slaves to register
> I0729 17:10:10.508656  1213 credentials.hpp:36] Loading credentials for 
> authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials'
> I0729 17:10:10.509407  1213 master.cpp:360] Authorization enabled
> I0729 17:10:10.510030  1207 hierarchical_allocator_process.hpp:301] 
> Initializing hierarchical allocator process with master : 
> master@127.0.1.1:54701
> I0729 17:10:10.510113  1207 master.cpp:123] No whitelist given. Advertising 
> offers for all slaves
> I0729 17:10:10.511699  1213 master.cpp:1129] The newly elected leader is 
> master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176
> I0729 17:10:10.512230  1213 master.cpp:1142] Elected as the leading master!
> I0729 17:10:10.512692  1213 master.cpp:960] Recovering from registrar
> I0729 17:10:10.513226  1210 registrar.cpp:313] Recovering registrar
> I0729 17:10:10.516006  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 12.946461ms
> I0729 17:10:10.516047  1212 replica.cpp:320] Persisted replica status to 
> STARTING
> I0729 17:10:10.516129  1212 recover.cpp:451] Replica is in STARTING status
> I0729 17:10:10.516520  1212 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0729 17:10:10.516592  1212 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0729 17:10:10.516767  1212 recover.cpp:542] Updating replica status to VOTING
> I0729 17:10:10.528376  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 11.537102ms
> I0729 17:10:10.528430  1212 replica.cpp:320] Persisted replica status to 
> VOTING
> I0729 17:10:10.528501  1212 recover.cpp:556] Successfully joined the Paxos 
> group
> I0729 17:10:10.528565  1212 recover.cpp:440] Recover process terminated
> I0729 17:10:10.528700  1212 log.cpp:656] Attempting to start the writer
> I0729 17:10:10.528960  1212 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0729 17:10:10.537821  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 8.830863ms
> I0729 17:10:10.537869  1212 replica.cpp:342] Persisted promised to 1
> I0729 17:10:10.540550  1209 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0729 17:10:10.540856  1209 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0729 17:10:10.547430  1209 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 6.548344ms
> I0729 17:10:10.547471  1209 replica.cpp:676] Persisted action at 0
> I0729 17:10:10.547732  1209 replica.cpp:508] Replica received write request 
> for position 0
> I0729 17:10:10.547765  1209 leveldb.cpp:438] Reading position from leveldb 
> took 15676ns
> I0729 17:10:10.557169  1209 leveldb.cpp:343] Persisting action (14 bytes) to 
> leveldb took 9.373798ms
> I0729 

[jira] [Updated] (MESOS-1281) Support fall-back in replaceTask with health-checks

2016-04-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-1281:

Labels: health-check  (was: )

> Support fall-back in replaceTask with health-checks
> ---
>
> Key: MESOS-1281
> URL: https://issues.apache.org/jira/browse/MESOS-1281
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, master, slave
>Reporter: Niklas Quarfot Nielsen
>  Labels: health-check
>
> Coupled with the health check feature, a new task can be _attempted_ to run.
> If the health check fails (as defined by properties covered in the health
> check ticket), the old task can be restarted and report a new TASK_FALLBACK
> status update.





[jira] [Updated] (MESOS-1640) HealthCheckTests are flaky under gprof/gcov

2016-04-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-1640:

Labels: flaky health-check mesosphere  (was: flaky mesosphere)

> HealthCheckTests are flaky under gprof/gcov
> ---
>
> Key: MESOS-1640
> URL: https://issues.apache.org/jira/browse/MESOS-1640
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Dominic Hamon
>Assignee: Timothy Chen
>Priority: Minor
>  Labels: flaky, health-check, mesosphere
>
> When running {{/support/coverage.sh}} the {{HealthCheckTest}} fixture 
> exhibits multiple flakes:
> {noformat}
> [ RUN  ] HealthCheckTest.HealthyTask
> ../../src/tests/health_check_tests.cpp:165: Failure
> Value of: statusRunning.get().state()
>   Actual: TASK_FAILED
> Expected: TASK_RUNNING
> ../../src/tests/health_check_tests.cpp:167: Failure
> Failed to wait 10secs for statusHealth
> ../../src/tests/health_check_tests.cpp:158: Failure
> Actual function call count doesn't match EXPECT_CALL(sched, 
> statusUpdate(, _))...
>  Expected: to be called twice
>Actual: called once - unsatisfied and active
> [  FAILED  ] HealthCheckTest.HealthyTask (11854 ms)
> [ RUN  ] HealthCheckTest.EnvironmentSetup
> ../../src/tests/health_check_tests.cpp:314: Failure
> Value of: statusRunning.get().state()
>   Actual: TASK_FAILED
> Expected: TASK_RUNNING
> ../../src/tests/health_check_tests.cpp:316: Failure
> Failed to wait 10secs for statusHealth
> ../../src/tests/health_check_tests.cpp:307: Failure
> Actual function call count doesn't match EXPECT_CALL(sched, 
> statusUpdate(, _))...
>  Expected: to be called twice
>Actual: called once - unsatisfied and active
> [  FAILED  ] HealthCheckTest.EnvironmentSetup (12020 ms)
> {noformat}





[jira] [Updated] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.

2016-04-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-1802:

Labels: flaky health-check mesosphere  (was: flaky mesosphere)

> HealthCheckTest.HealthStatusChange is flaky on jenkins.
> ---
>
> Key: MESOS-1802
> URL: https://issues.apache.org/jira/browse/MESOS-1802
> Project: Mesos
>  Issue Type: Bug
>  Components: test, tests
>Affects Versions: 0.26.0
>Reporter: Benjamin Mahler
>  Labels: flaky, health-check, mesosphere
> Attachments: health_check_flaky_test_log.txt
>
>
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull
> {noformat}
> [ RUN  ] HealthCheckTest.HealthStatusChange
> Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2'
> I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms
> I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns
> I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns
> I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in 
> 642ns
> I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 343ns
> I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery
> I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status
> I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to 
> STARTING
> I0916 22:56:14.036603 21046 master.cpp:286] Master 
> 20140916-225614-3125920579-47865-21026 (penates.apache.org) started on 
> 67.195.81.186:47865
> I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0916 22:56:14.036648 21046 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials'
> I0916 22:56:14.036686 21045 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 480322ns
> I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to 
> STARTING
> I0916 22:56:14.036769 21046 master.cpp:366] Authorization enabled
> I0916 22:56:14.036826 21045 recover.cpp:451] Replica is in STARTING status
> I0916 22:56:14.036944 21052 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0916 22:56:14.036968 21049 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@67.195.81.186:47865
> I0916 22:56:14.037284 21054 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0916 22:56:14.037312 21046 master.cpp:1212] The newly elected leader is 
> master@67.195.81.186:47865 with id 20140916-225614-3125920579-47865-21026
> I0916 22:56:14.037333 21046 master.cpp:1225] Elected as the leading master!
> I0916 22:56:14.037345 21046 master.cpp:1043] Recovering from registrar
> I0916 22:56:14.037504 21040 registrar.cpp:313] Recovering registrar
> I0916 22:56:14.037505 21053 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0916 22:56:14.037681 21047 recover.cpp:542] Updating replica status to VOTING
> I0916 22:56:14.038072 21052 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 330251ns
> I0916 22:56:14.038087 21052 replica.cpp:320] Persisted replica status to 
> VOTING
> I0916 22:56:14.038127 21053 recover.cpp:556] Successfully joined the Paxos 
> group
> I0916 22:56:14.038202 21053 recover.cpp:440] Recover process terminated
> I0916 22:56:14.038364 21048 log.cpp:656] Attempting to start the writer
> I0916 22:56:14.038812 21053 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0916 22:56:14.038925 21053 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 92623ns
> I0916 22:56:14.038944 21053 replica.cpp:342] Persisted promised to 1
> I0916 22:56:14.039201 21052 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0916 22:56:14.039676 21047 replica.cpp:375] Replica received explicit 
> promise request for position 0 with proposal 2
> I0916 22:56:14.039836 21047 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 144215ns
> I0916 22:56:14.039850 21047 replica.cpp:676] Persisted action at 0
> I0916 22:56:14.040243 21047 replica.cpp:508] Replica received write request 
> for position 0

[jira] [Updated] (MESOS-4604) ROOT_DOCKER_DockerHealthyTask is flaky.

2016-04-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-4604:

Labels: flaky-test health-check mesosphere test  (was: flaky-test 
mesosphere test)

> ROOT_DOCKER_DockerHealthyTask is flaky.
> ---
>
> Key: MESOS-4604
> URL: https://issues.apache.org/jira/browse/MESOS-4604
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
> Environment: CentOS 6/7, Ubuntu 15.04 on AWS.
>Reporter: Jan Schlicht
>Assignee: Joseph Wu
>  Labels: flaky-test, health-check, mesosphere, test
>
> Log from Teamcity that is running {{sudo ./bin/mesos-tests.sh}} on AWS EC2 
> instances:
> {noformat}
> [18:27:14][Step 8/8] [--] 8 tests from HealthCheckTest
> [18:27:14][Step 8/8] [ RUN  ] HealthCheckTest.HealthyTask
> [18:27:17][Step 8/8] [   OK ] HealthCheckTest.HealthyTask ( ms)
> [18:27:17][Step 8/8] [ RUN  ] 
> HealthCheckTest.ROOT_DOCKER_DockerHealthyTask
> [18:27:36][Step 8/8] ../../src/tests/health_check_tests.cpp:388: Failure
> [18:27:36][Step 8/8] Failed to wait 15secs for termination
> [18:27:36][Step 8/8] F0204 18:27:35.981302 23085 logging.cpp:64] RAW: Pure 
> virtual method called
> [18:27:36][Step 8/8] @ 0x7f7077055e1c  google::LogMessage::Fail()
> [18:27:36][Step 8/8] @ 0x7f707705ba6f  google::RawLog__()
> [18:27:36][Step 8/8] @ 0x7f70760f76c9  __cxa_pure_virtual
> [18:27:36][Step 8/8] @   0xa9423c  
> mesos::internal::tests::Cluster::Slaves::shutdown()
> [18:27:36][Step 8/8] @  0x1074e45  
> mesos::internal::tests::MesosTest::ShutdownSlaves()
> [18:27:36][Step 8/8] @  0x1074de4  
> mesos::internal::tests::MesosTest::Shutdown()
> [18:27:36][Step 8/8] @  0x1070ec7  
> mesos::internal::tests::MesosTest::TearDown()
> [18:27:36][Step 8/8] @  0x16eb7b2  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16e61a9  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16c56aa  testing::Test::Run()
> [18:27:36][Step 8/8] @  0x16c5e89  testing::TestInfo::Run()
> [18:27:36][Step 8/8] @  0x16c650a  testing::TestCase::Run()
> [18:27:36][Step 8/8] @  0x16cd1f6  
> testing::internal::UnitTestImpl::RunAllTests()
> [18:27:36][Step 8/8] @  0x16ec513  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16e6df1  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> [18:27:36][Step 8/8] @  0x16cbe26  testing::UnitTest::Run()
> [18:27:36][Step 8/8] @   0xe54c84  RUN_ALL_TESTS()
> [18:27:36][Step 8/8] @   0xe54867  main
> [18:27:36][Step 8/8] @ 0x7f7071560a40  (unknown)
> [18:27:36][Step 8/8] @   0x9b52d9  _start
> [18:27:36][Step 8/8] Aborted (core dumped)
> [18:27:36][Step 8/8] Process exited with code 134
> {noformat}
> Happens with Ubuntu 15.04, CentOS 6, CentOS 7 _quite_ often. 





[jira] [Assigned] (MESOS-4812) Mesos fails to escape command health checks

2016-04-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent reassigned MESOS-4812:
---

Assignee: haosdent

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside the
> double quotes of an sh -c "", doesn't escape the double quotes in the command.
> If I escape the double quotes myself, the command health check succeeds. But
> this would mean that the user needs intimate knowledge of how Mesos executes
> their commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue, so I am opening this
> JIRA. I don't know if this only affects the command health check.
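To illustrate the quoting problem, here is a small self-contained C++ sketch.
wrapNaive() and wrapEscaped() are hypothetical helpers, not Mesos source; the
escaping shown (backslashes and double quotes only) is just the minimal fix
for the failure mode described above.

{code}
// Naively interpolating a user command into sh -c "..." breaks as soon as
// the command itself contains double quotes; escaping backslashes and
// quotes first keeps the command intact.
#include <iostream>
#include <string>

// Roughly what the report describes: no escaping at all.
std::string wrapNaive(const std::string& command)
{
  return "sh -c \"" + command + "\"";
}

// Escape backslashes and double quotes so the embedded command survives
// the outer double-quoted context.
std::string wrapEscaped(const std::string& command)
{
  std::string escaped;
  for (char c : command) {
    if (c == '\\' || c == '"') {
      escaped += '\\';
    }
    escaped += c;
  }
  return "sh -c \"" + escaped + "\"";
}

int main()
{
  const std::string command = "curl -f \"http://localhost:8080/ping\"";

  // The quotes around the URL terminate the outer quoting prematurely.
  std::cout << wrapNaive(command) << std::endl;

  // The escaped form reaches sh as a single, intact -c argument.
  std::cout << wrapEscaped(command) << std::endl;
}
{code}

Note that inside double quotes sh still expands $ and backticks, which users
may rely on (e.g. $HOST), so a complete fix has to decide which characters to
escape rather than escaping everything.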





[jira] [Updated] (MESOS-4869) /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory

2016-04-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-4869:

Labels: health-check  (was: )

> /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory
> ---
>
> Key: MESOS-4869
> URL: https://issues.apache.org/jira/browse/MESOS-4869
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.1
>Reporter: Anthony Scalisi
>Priority: Critical
>  Labels: health-check
>
> We switched our health checks in Marathon from HTTP to COMMAND:
> {noformat}
> "healthChecks": [
> {
>   "protocol": "COMMAND",
>   "path": "/ops/ping",
>   "command": { "value": "curl --silent -f -X GET 
> http://$HOST:$PORT0/ops/ping > /dev/null" },
>   "gracePeriodSeconds": 90,
>   "intervalSeconds": 2,
>   "portIndex": 0,
>   "timeoutSeconds": 5,
>   "maxConsecutiveFailures": 3
> }
>   ]
> {noformat}
> All our applications have the same health check (and /ops/ping endpoint).
> Even though we have the issue on all our Mesos slaves, I'm going to focus on
> a particular one: *mesos-slave-i-e3a9c724*.
> The slave has 16 gigs of memory, with about 12 gigs allocated for 8 tasks:
> !https://i.imgur.com/gbRf804.png!
> Here is a *docker ps* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724 # docker ps
> CONTAINER IDIMAGE   COMMAND  CREATED  
>STATUS  PORTS NAMES
> 4f7c0aa8d03ajava:8  "/bin/sh -c 'JAVA_OPT"   6 hours ago  
>Up 6 hours  0.0.0.0:31926->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
> 66f2fc8f8056java:8  "/bin/sh -c 'JAVA_OPT"   6 hours ago  
>Up 6 hours  0.0.0.0:31939->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
> f7382f241fcejava:8  "/bin/sh -c 'JAVA_OPT"   6 hours ago  
>Up 6 hours  0.0.0.0:31656->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
> 880934c0049ejava:8  "/bin/sh -c 'JAVA_OPT"   24 hours ago 
>Up 24 hours 0.0.0.0:31371->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
> 5eab1f8dac4ajava:8  "/bin/sh -c 'JAVA_OPT"   46 hours ago 
>Up 46 hours 0.0.0.0:31500->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5ac75198-283f-4349-a220-9e9645b313e7
> b63740fe56e7java:8  "/bin/sh -c 'JAVA_OPT"   46 hours ago 
>Up 46 hours 0.0.0.0:31382->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
> 5c7a9ea77b0ejava:8  "/bin/sh -c 'JAVA_OPT"   2 days ago   
>Up 2 days   0.0.0.0:31186->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
> 53065e7a31adjava:8  "/bin/sh -c 'JAVA_OPT"   2 days ago   
>Up 2 days   0.0.0.0:31839->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
> {noformat}
> Here is a *docker stats* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724  # docker stats
> CONTAINER   CPU %   MEM USAGE / LIMIT MEM %   
> NET I/O   BLOCK I/O
> 4f7c0aa8d03a2.93%   797.3 MB / 1.611 GB   49.50%  
> 1.277 GB / 1.189 GB   155.6 kB / 151.6 kB
> 53065e7a31ad8.30%   738.9 MB / 1.611 GB   45.88%  
> 419.6 MB / 554.3 MB   98.3 kB / 61.44 kB
> 5c7a9ea77b0e4.91%   1.081 GB / 1.611 GB   67.10%  
> 423 MB / 526.5 MB 3.219 MB / 61.44 kB
> 5eab1f8dac4a3.13%   1.007 GB / 1.611 GB   62.53%  
> 2.737 GB / 2.564 GB   6.566 MB / 118.8 kB
> 66f2fc8f80563.15%   768.1 MB / 1.611 GB   47.69%  
> 258.5 MB / 252.8 MB   1.86 MB / 151.6 kB
> 880934c0049e10.07%  735.1 MB / 1.611 GB   45.64%  
> 1.451 GB / 1.399 GB   573.4 kB / 94.21 kB
> b63740fe56e712.04%  629 MB / 1.611 GB 39.06%  
> 10.29 GB / 9.344 GB   8.102 MB / 61.44 kB
> f7382f241fce6.21%   505 MB / 1.611 GB 31.36%  
> 153.4 MB / 151.9 MB   5.837 MB / 94.21 kB
> {noformat}
> Not much else is running on the slave, yet the used memory doesn't map to the
> tasks' memory:
> {noformat}
> Mem:16047M used:13340M buffers:1139M cache:776M
> {noformat}
> If I exec into the container (*java:8* image), I can correctly see the shell
> calls to execute the curl specified in the

[jira] [Updated] (MESOS-4812) Mesos fails to escape command health checks

2016-04-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-4812:

Labels: health-check  (was: )

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>  Labels: health-check
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside the
> double quotes of an sh -c "", doesn't escape the double quotes in the command.
> If I escape the double quotes myself, the command health check succeeds. But
> this would mean that the user needs intimate knowledge of how Mesos executes
> their commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue, so I am opening this
> JIRA. I don't know if this only affects the command health check.





[jira] [Updated] (MESOS-2533) Support HTTP checks in Mesos health check program

2016-04-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-2533:

Labels: health-check mesosphere  (was: mesosphere)

> Support HTTP checks in Mesos health check program
> -
>
> Key: MESOS-2533
> URL: https://issues.apache.org/jira/browse/MESOS-2533
> Project: Mesos
>  Issue Type: Bug
>Reporter: Niklas Quarfot Nielsen
>Assignee: haosdent
>  Labels: health-check, mesosphere
>
> Currently, only commands are supported, but our health check protobuf enables
> users to encode HTTP checks as well. We should either wire this up in the
> health check program or remove the http field from the protobuf.
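For illustration, here is a minimal sketch of the kind of HTTP probe the
health check program could run, shelling out to curl for brevity. This is an
assumption, not the actual mesos-health-check implementation; a real version
would more likely use libprocess HTTP facilities.

{code}
// Hypothetical HTTP health probe: GET a local endpoint and treat any 2xx
// response as healthy.
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>

bool httpCheck(int port, const std::string& path)
{
  // --write-out prints the HTTP status code; --max-time bounds the probe.
  const std::string command =
    "curl --silent --output /dev/null --write-out '%{http_code}' "
    "--max-time 5 http://127.0.0.1:" + std::to_string(port) + path;

  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return false;
  }

  char buffer[8] = {0};
  if (fgets(buffer, sizeof(buffer), pipe) == nullptr) {
    pclose(pipe);
    return false;
  }

  const int exitStatus = pclose(pipe);
  const int statusCode = std::atoi(buffer);

  return exitStatus == 0 && statusCode >= 200 && statusCode < 300;
}

int main()
{
  std::cout << (httpCheck(8080, "/ops/ping") ? "healthy" : "unhealthy")
            << std::endl;
}
{code}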





[jira] [Commented] (MESOS-5184) Mesos does not validate role info when framework registered with specified role

2016-04-12 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238524#comment-15238524
 ] 

Vinod Kone commented on MESOS-5184:
---

Sorry, I misunderstood the problem. Looks like we need to call
roles::validate() inside Master::subscribe().
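A minimal sketch of the fix being discussed, using stand-in types: validate
the role at subscription time and refuse the subscription on error. The rules
below mirror the ones mentioned for MESOS-2210, but none of these names are
copied from the Mesos source.

{code}
#include <iostream>
#include <optional>
#include <string>

// Stand-in for roles::validate() (MESOS-2210): reject ".", "..", names
// containing "/", and names starting with "-".
std::optional<std::string> validateRole(const std::string& role)
{
  if (role.empty() || role == "." || role == "..") {
    return "'" + role + "' is not a valid role name";
  }
  if (role.find('/') != std::string::npos) {
    return "role '" + role + "' must not contain '/'";
  }
  if (role[0] == '-') {
    return "role '" + role + "' must not start with '-'";
  }
  return std::nullopt;
}

// Stand-in for the check Master::subscribe() would perform before
// accepting the subscription and handing out offers.
bool subscribe(const std::string& framework, const std::string& role)
{
  if (auto error = validateRole(role)) {
    std::cout << "Refusing subscription of '" << framework << "': "
              << *error << std::endl;
    return false;
  }

  std::cout << "Subscribing framework '" << framework << "'" << std::endl;
  return true;
}

int main()
{
  subscribe("test1", "/test/test1");  // Rejected: contains '/'.
  subscribe("test1", "test1");        // Accepted.
}
{code}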

> Mesos does not validate role info when framework registered with specified 
> role
> ---
>
> Key: MESOS-5184
> URL: https://issues.apache.org/jira/browse/MESOS-5184
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.28.0
>Reporter: Liqiang Lin
> Fix For: 0.29.0
>
>
> When a framework registers with a specified role, Mesos does not validate the
> role info. It accepts the subscription and sends unreserved resources as
> offers to the framework.
> {code}
> # cat register.json
> {
> "framework_id": {"value" : "test1"},
> "type":"SUBSCRIBE",
> "subscribe":{
> "framework_info":{
> "user":"root",
> "name":"test1",
> "failover_timeout":60,
> "role":"/test/test1",
> "id":{"value":"test1"},
> "principal":"test1",
> "capabilities":[{"type":"REVOCABLE_RESOURCES"}]
> },
> "force":true
> }
> }
> # curl -v  http://192.168.56.110:5050/api/v1/scheduler -H "Content-type: 
> application/json" -X POST -d @register.json
> * Hostname was NOT found in DNS cache
> *   Trying 192.168.56.110...
> * Connected to 192.168.56.110 (192.168.56.110) port 5050 (#0)
> > POST /api/v1/scheduler HTTP/1.1
> > User-Agent: curl/7.35.0
> > Host: 192.168.56.110:5050
> > Accept: */*
> > Content-type: application/json
> > Content-Length: 265
> >
> * upload completely sent off: 265 out of 265 bytes
> < HTTP/1.1 200 OK
> < Date: Wed, 06 Apr 2016 21:34:18 GMT
> < Transfer-Encoding: chunked
> < Mesos-Stream-Id: 8b2c6740-b619-49c3-825a-e6ae780f4edc
> < Content-Type: application/json
> <
> 69
> {"subscribed":{"framework_id":{"value":"test1"}},"type":"SUBSCRIBED"}20
> {"type":"HEARTBEAT"}1531
> {"offers":{"offers":[{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S0"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos2"},"type":"TEXT"}],"framework_id":{"value":"test1"},"hostname":"mesos2","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O0"},"resources":[{"name":"disk","role":"*","scalar":{"value":20576.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos2","ip":"192.168.56.110","port":5051},"path":"\/slave(1)","scheme":"http"}},{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S1"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos1"},"type":"TEXT"}],"framework_id":{"v
> alue":"test1"},"hostname":"mesos1","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O1"},"resources":[{"name":"disk","role":"*","scalar":{"value":21468.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos1","ip":"192.168.56.111","port":5051},"path":"\/slave(1)","scheme":"http"}}]},"type":"OFFERS"}20
> {"type":"HEARTBEAT"}20
> {code}
> As you can see, the role under which the framework registers is
> "/test/test1", which is an invalid role according to
> [MESOS-2210|https://issues.apache.org/jira/browse/MESOS-2210].
> And the Mesos master log shows:
> {code}
> I0407 05:34:18.132333 20672 master.cpp:2107] Received subscription request 
> for HTTP framework 'test1'
> I0407 05:34:18.133515 20672 master.cpp:2198] Subscribing framework 'test1' 
> with checkpointing disabled and capabilities [ REVOCABLE_RESOURCES ]
> I0407 05:34:18.135027 20674 hierarchical.cpp:264] Added framework test1
> I0407 05:34:18.138746 20672 master.cpp:5659] Sending 2 offers to framework 
> test1 (test1)
> {code}





[jira] [Commented] (MESOS-5184) Mesos does not validate role info when framework registered with specified role

2016-04-12 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238525#comment-15238525
 ] 

Liqiang Lin commented on MESOS-5184:


Yes. In MESOS-2210, we added role validation to disallow some special
characters in role names, e.g. "/", ".", "..", names starting with "-", etc. I
think we need to validate the role name when a framework registers.

> Mesos does not validate role info when framework registered with specified 
> role
> ---
>
> Key: MESOS-5184
> URL: https://issues.apache.org/jira/browse/MESOS-5184
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.28.0
>Reporter: Liqiang Lin
> Fix For: 0.29.0
>
>
> When a framework registers with a specified role, Mesos does not validate the
> role info. It accepts the subscription and sends unreserved resources as
> offers to the framework.
> {code}
> # cat register.json
> {
> "framework_id": {"value" : "test1"},
> "type":"SUBSCRIBE",
> "subscribe":{
> "framework_info":{
> "user":"root",
> "name":"test1",
> "failover_timeout":60,
> "role":"/test/test1",
> "id":{"value":"test1"},
> "principal":"test1",
> "capabilities":[{"type":"REVOCABLE_RESOURCES"}]
> },
> "force":true
> }
> }
> # curl -v  http://192.168.56.110:5050/api/v1/scheduler -H "Content-type: 
> application/json" -X POST -d @register.json
> * Hostname was NOT found in DNS cache
> *   Trying 192.168.56.110...
> * Connected to 192.168.56.110 (192.168.56.110) port 5050 (#0)
> > POST /api/v1/scheduler HTTP/1.1
> > User-Agent: curl/7.35.0
> > Host: 192.168.56.110:5050
> > Accept: */*
> > Content-type: application/json
> > Content-Length: 265
> >
> * upload completely sent off: 265 out of 265 bytes
> < HTTP/1.1 200 OK
> < Date: Wed, 06 Apr 2016 21:34:18 GMT
> < Transfer-Encoding: chunked
> < Mesos-Stream-Id: 8b2c6740-b619-49c3-825a-e6ae780f4edc
> < Content-Type: application/json
> <
> 69
> {"subscribed":{"framework_id":{"value":"test1"}},"type":"SUBSCRIBED"}20
> {"type":"HEARTBEAT"}1531
> {"offers":{"offers":[{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S0"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos2"},"type":"TEXT"}],"framework_id":{"value":"test1"},"hostname":"mesos2","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O0"},"resources":[{"name":"disk","role":"*","scalar":{"value":20576.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos2","ip":"192.168.56.110","port":5051},"path":"\/slave(1)","scheme":"http"}},{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S1"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos1"},"type":"TEXT"}],"framework_id":{"v
> alue":"test1"},"hostname":"mesos1","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O1"},"resources":[{"name":"disk","role":"*","scalar":{"value":21468.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos1","ip":"192.168.56.111","port":5051},"path":"\/slave(1)","scheme":"http"}}]},"type":"OFFERS"}20
> {"type":"HEARTBEAT"}20
> {code}
> As you can see, the role under which the framework registers is
> "/test/test1", which is an invalid role according to
> [MESOS-2210|https://issues.apache.org/jira/browse/MESOS-2210].
> And the Mesos master log shows:
> {code}
> I0407 05:34:18.132333 20672 master.cpp:2107] Received subscription request 
> for HTTP framework 'test1'
> I0407 05:34:18.133515 20672 master.cpp:2198] Subscribing framework 'test1' 
> with checkpointing disabled and capabilities [ REVOCABLE_RESOURCES ]
> I0407 05:34:18.135027 20674 hierarchical.cpp:264] Added framework test1
> I0407 05:34:18.138746 20672 master.cpp:5659] Sending 2 offers to framework 
> test1 (test1)
> {code}





[jira] [Commented] (MESOS-5146) MasterAllocatorTest/1.RebalancedForUpdatedWeights is flaky

2016-04-12 Thread Yongqiao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238522#comment-15238522
 ] 

Yongqiao Wang commented on MESOS-5146:
--

Adding the review request: https://reviews.apache.org/r/46135/

> MasterAllocatorTest/1.RebalancedForUpdatedWeights is flaky
> --
>
> Key: MESOS-5146
> URL: https://issues.apache.org/jira/browse/MESOS-5146
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, tests
>Affects Versions: 0.28.0
> Environment: Ubuntu 14.04 using clang, without libevent or SSL
>Reporter: Greg Mann
>Assignee: Yongqiao Wang
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Observed on the ASF CI:
> {code}
> [ RUN  ] MasterAllocatorTest/1.RebalancedForUpdatedWeights
> I0407 22:34:10.330394 29278 cluster.cpp:149] Creating default 'local' 
> authorizer
> I0407 22:34:10.466182 29278 leveldb.cpp:174] Opened db in 135.608207ms
> I0407 22:34:10.516398 29278 leveldb.cpp:181] Compacted db in 50.159558ms
> I0407 22:34:10.516464 29278 leveldb.cpp:196] Created db iterator in 34959ns
> I0407 22:34:10.516484 29278 leveldb.cpp:202] Seeked to beginning of db in 
> 10195ns
> I0407 22:34:10.516496 29278 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 7324ns
> I0407 22:34:10.516547 29278 replica.cpp:779] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0407 22:34:10.517277 29298 recover.cpp:447] Starting replica recovery
> I0407 22:34:10.517693 29300 recover.cpp:473] Replica is in EMPTY status
> I0407 22:34:10.520251 29310 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from (4775)@172.17.0.3:35855
> I0407 22:34:10.520611 29311 recover.cpp:193] Received a recover response from 
> a replica in EMPTY status
> I0407 22:34:10.521164 29299 recover.cpp:564] Updating replica status to 
> STARTING
> I0407 22:34:10.523435 29298 master.cpp:382] Master 
> f59f9057-a5c7-43e1-b129-96862e640a12 (129e11060069) started on 
> 172.17.0.3:35855
> I0407 22:34:10.523473 29298 master.cpp:384] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/3rZY8C/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="100secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.29.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/3rZY8C/master" --zk_session_timeout="10secs"
> I0407 22:34:10.523885 29298 master.cpp:433] Master only allowing 
> authenticated frameworks to register
> I0407 22:34:10.523901 29298 master.cpp:438] Master only allowing 
> authenticated agents to register
> I0407 22:34:10.523913 29298 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/3rZY8C/credentials'
> I0407 22:34:10.524298 29298 master.cpp:480] Using default 'crammd5' 
> authenticator
> I0407 22:34:10.524441 29298 master.cpp:551] Using default 'basic' HTTP 
> authenticator
> I0407 22:34:10.524564 29298 master.cpp:589] Authorization enabled
> I0407 22:34:10.525269 29305 hierarchical.cpp:145] Initialized hierarchical 
> allocator process
> I0407 22:34:10.525333 29305 whitelist_watcher.cpp:77] No whitelist given
> I0407 22:34:10.527331 29298 master.cpp:1832] The newly elected leader is 
> master@172.17.0.3:35855 with id f59f9057-a5c7-43e1-b129-96862e640a12
> I0407 22:34:10.527441 29298 master.cpp:1845] Elected as the leading master!
> I0407 22:34:10.527545 29298 master.cpp:1532] Recovering from registrar
> I0407 22:34:10.527889 29298 registrar.cpp:331] Recovering registrar
> I0407 22:34:10.549734 29299 leveldb.cpp:304] Persisting metadata (8 bytes) to 
> leveldb took 28.25177ms
> I0407 22:34:10.549782 29299 replica.cpp:320] Persisted replica status to 
> STARTING
> I0407 22:34:10.550010 29299 recover.cpp:473] Replica is in STARTING status
> I0407 22:34:10.551352 29299 replica.cpp:673] Replica in STARTING status 
> received a broadcasted recover request from (4777)@172.17.0.3:35855
> I0407 22:34:10.551676 29299 recover.cpp:193] Received a recover response from 
> a replica in STARTING status
> I0407 22:34:10.552315 29308 recover.cpp:564] Updating replica status to VOTING

[jira] [Commented] (MESOS-5184) Mesos does not validate role info when framework registered with specified role

2016-04-12 Thread Jian Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238504#comment-15238504
 ] 

Jian Qiu commented on MESOS-5184:
-

[~vi...@twitter.com] I think the issue here is that we disallow some special
characters in roles, such as slash; however, the role is not validated when
registering a framework.

> Mesos does not validate role info when framework registered with specified 
> role
> ---
>
> Key: MESOS-5184
> URL: https://issues.apache.org/jira/browse/MESOS-5184
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.28.0
>Reporter: Liqiang Lin
> Fix For: 0.29.0
>
>
> When a framework registers with a specified role, Mesos does not validate the
> role info. It accepts the subscription and sends unreserved resources as
> offers to the framework.
> {code}
> # cat register.json
> {
> "framework_id": {"value" : "test1"},
> "type":"SUBSCRIBE",
> "subscribe":{
> "framework_info":{
> "user":"root",
> "name":"test1",
> "failover_timeout":60,
> "role":"/test/test1",
> "id":{"value":"test1"},
> "principal":"test1",
> "capabilities":[{"type":"REVOCABLE_RESOURCES"}]
> },
> "force":true
> }
> }
> # curl -v  http://192.168.56.110:5050/api/v1/scheduler -H "Content-type: 
> application/json" -X POST -d @register.json
> * Hostname was NOT found in DNS cache
> *   Trying 192.168.56.110...
> * Connected to 192.168.56.110 (192.168.56.110) port 5050 (#0)
> > POST /api/v1/scheduler HTTP/1.1
> > User-Agent: curl/7.35.0
> > Host: 192.168.56.110:5050
> > Accept: */*
> > Content-type: application/json
> > Content-Length: 265
> >
> * upload completely sent off: 265 out of 265 bytes
> < HTTP/1.1 200 OK
> < Date: Wed, 06 Apr 2016 21:34:18 GMT
> < Transfer-Encoding: chunked
> < Mesos-Stream-Id: 8b2c6740-b619-49c3-825a-e6ae780f4edc
> < Content-Type: application/json
> <
> 69
> {"subscribed":{"framework_id":{"value":"test1"}},"type":"SUBSCRIBED"}20
> {"type":"HEARTBEAT"}1531
> {"offers":{"offers":[{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S0"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos2"},"type":"TEXT"}],"framework_id":{"value":"test1"},"hostname":"mesos2","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O0"},"resources":[{"name":"disk","role":"*","scalar":{"value":20576.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos2","ip":"192.168.56.110","port":5051},"path":"\/slave(1)","scheme":"http"}},{"agent_id":{"value":"2cd5576e-6260-4262-a62c-b0dc45c86c45-S1"},"attributes":[{"name":"mesos_agent_type","text":{"value":"IBM_MESOS_EGO"},"type":"TEXT"},{"name":"hostname","text":{"value":"mesos1"},"type":"TEXT"}],"framework_id":{"v
> alue":"test1"},"hostname":"mesos1","id":{"value":"5b84aad8-dd60-40b3-84c2-93be6b7aa81c-O1"},"resources":[{"name":"disk","role":"*","scalar":{"value":21468.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"role":"*","type":"RANGES"},{"name":"mem","role":"*","scalar":{"value":3952.0},"type":"SCALAR"},{"name":"cpus","role":"*","scalar":{"value":4.0},"type":"SCALAR"}],"url":{"address":{"hostname":"mesos1","ip":"192.168.56.111","port":5051},"path":"\/slave(1)","scheme":"http"}}]},"type":"OFFERS"}20
> {"type":"HEARTBEAT"}20
> {code}
> As you can see, the role under which the framework registers is
> "/test/test1", which is an invalid role according to
> [MESOS-2210|https://issues.apache.org/jira/browse/MESOS-2210].
> And the Mesos master log shows:
> {code}
> I0407 05:34:18.132333 20672 master.cpp:2107] Received subscription request 
> for HTTP framework 'test1'
> I0407 05:34:18.133515 20672 master.cpp:2198] Subscribing framework 'test1' 
> with checkpointing disabled and capabilities [ REVOCABLE_RESOURCES ]
> I0407 05:34:18.135027 20674 hierarchical.cpp:264] Added framework test1
> I0407 05:34:18.138746 20672 master.cpp:5659] Sending 2 offers to framework 
> test1 (test1)
> {code}





[jira] [Commented] (MESOS-3782) Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create symlinks)

2016-04-12 Thread zhou xing (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238498#comment-15238498
 ] 

zhou xing commented on MESOS-3782:
--

Submitted the second patch to make the changes in the libprocess makefiles:
https://reviews.apache.org/r/46134/

> Replace Master/Slave Terminology Phase I - Add duplicate binaries (or create 
> symlinks)
> --
>
> Key: MESOS-3782
> URL: https://issues.apache.org/jira/browse/MESOS-3782
> Project: Mesos
>  Issue Type: Task
>Reporter: Diana Arroyo
>Assignee: zhou xing
>






[jira] [Updated] (MESOS-5200) agent->master messages use temporary TCP connections

2016-04-12 Thread David Robinson (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Robinson updated MESOS-5200:
--
Description: 
Background info: When an agent is started it starts a background task 
(libprocess process?) to detect the leading master. When the leading master is 
detected (or changes) the [SocketManager's link() 
method|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1415]
 [is 
called|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L942] 
and a TCP connection to the master is established. The connection is used by 
the agent to send messages to the master, and the master, upon receiving a 
RegisterSlaveMessage/ReregisterSlaveMessage, establishes another TCP connection 
back to the agent. Each TCP connection is uni-directional, the agent writes 
messages on one connection and reads messages from the other, and the master 
reads/writes from the opposite ends of the connections.

If the initial TCP connection to the master fails to be established then 
temporary connections are used for all agent->master messages; each send() 
causes a new TCP connection to be setup, the message sent, then the connection 
torn down. If link() succeeds a persistent TCP connection is used instead.

If agents do not use ZK to detect the master then the master detector "detects" 
the master immediately and attempts to connect immediately. The master may not 
be listening for connections at the time, or it could be overwhelmed w/ TCP 
connection attempts, therefore the initial TCP connection attempt fails. The 
agent does not attempt to establish a new persistent connection as link() is 
only called when a new master is detected, which only occurs once unless ZK is 
used.

It's possible for agents to overwhelm a master w/ TCP connections such that 
agents cannot establish connections. When this occurs pong messages may not be 
received by the master so the master shuts down agents thus killing any tasks 
they were running. We have witnessed this scenario during scale/load tests at 
Twitter.

The problem is trivial to reproduce: configure an agent to use a certain master 
(\-\-master=10.20.30.40:5050), start the agent, wait several minutes then start 
the master. All the agent->master messages will occur over temporary 
connections.

The problem occurs less frequently in production because ZK is typically used 
for master detection and a master only registers in ZK after it has started 
listening on its socket. However, the scenario described above can also occur 
when ZK is used – a thundering herd of 10,000+ slaves establishing TCP 
connections to the master can result in some connection attempts failing and 
agents using temporary connections.

  was:
Background info: When an agent is started, it starts a background task
(libprocess process?) to detect the leading master. When the leading master is
detected (or changes), the [SocketManager's link()
method|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1415]
[is called|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L942]
and a TCP connection to the master is established. The connection is used by
the agent to send messages to the master, and the master, upon receiving a
RegisterSlaveMessage/ReregisterSlaveMessage, establishes another TCP connection
back to the agent. Each TCP connection is uni-directional: the agent writes
messages on one connection and reads messages from the other, and the master
reads/writes from the opposite ends of the connections.

If the initial TCP connection to the master fails to be established, then
temporary connections are used for all agent->master messages; each send()
causes a new TCP connection to be set up, the message sent, then the connection
torn down. If link() succeeds, a persistent TCP connection is used instead.

If agents do not use ZK to detect the master, then the master detector
"detects" the master immediately and attempts to connect immediately. The
master may not be listening for connections at the time, or it could be
overwhelmed w/ TCP connection attempts, so the initial TCP connection attempt
fails. The agent does not attempt to establish a new persistent connection, as
link() is only called when a new master is detected, which only occurs once
unless ZK is used.

It's possible for agents to overwhelm a master w/ TCP connections such that
agents cannot establish connections. When this occurs, pong messages may not be
received by the master, so the master shuts down agents, killing any tasks they
were running. We have witnessed this scenario during scale/load tests at
Twitter.

The problem is trivial to reproduce: configure an agent to use a certain master
(--master=10.20.30.40:5050), start the agent, wait several minutes, then start
the master. All the agent->master messages will occur over temporary

[jira] [Created] (MESOS-5200) agent->master messages use temporary TCP connections

2016-04-12 Thread David Robinson (JIRA)
David Robinson created MESOS-5200:
-

 Summary: agent->master messages use temporary TCP connections
 Key: MESOS-5200
 URL: https://issues.apache.org/jira/browse/MESOS-5200
 Project: Mesos
  Issue Type: Bug
Reporter: David Robinson


Background info: When an agent is started, it starts a background task
(libprocess process?) to detect the leading master. When the leading master is
detected (or changes), the [SocketManager's link()
method|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1415]
[is called|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L942]
and a TCP connection to the master is established. The connection is used by
the agent to send messages to the master, and the master, upon receiving a
RegisterSlaveMessage/ReregisterSlaveMessage, establishes another TCP connection
back to the agent. Each TCP connection is uni-directional: the agent writes
messages on one connection and reads messages from the other, and the master
reads/writes from the opposite ends of the connections.

If the initial TCP connection to the master fails to be established, then
temporary connections are used for all agent->master messages; each send()
causes a new TCP connection to be set up, the message sent, then the connection
torn down. If link() succeeds, a persistent TCP connection is used instead.

If agents do not use ZK to detect the master, then the master detector
"detects" the master immediately and attempts to connect immediately. The
master may not be listening for connections at the time, or it could be
overwhelmed w/ TCP connection attempts, so the initial TCP connection attempt
fails. The agent does not attempt to establish a new persistent connection, as
link() is only called when a new master is detected, which only occurs once
unless ZK is used.

It's possible for agents to overwhelm a master w/ TCP connections such that
agents cannot establish connections. When this occurs, pong messages may not be
received by the master, so the master shuts down agents, killing any tasks they
were running. We have witnessed this scenario during scale/load tests at
Twitter.

The problem is trivial to reproduce: configure an agent to use a certain master
(--master=10.20.30.40:5050), start the agent, wait several minutes, then start
the master. All the agent->master messages will occur over temporary
connections.

The problem occurs less frequently in production because ZK is typically used 
for master detection and a master only registers in ZK after it has started 
listening on its socket. However, the scenario described above can also occur 
when ZK is used – a thundering herd of 10,000+ slaves establishing TCP 
connections to the master can result in some connection attempts failing and 
agents using temporary connections.
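A sketch of the missing behavior, assuming retrying is acceptable: keep
attempting the persistent link with capped exponential backoff instead of
falling back to temporary connections forever. tryLink() is a stand-in for
SocketManager::link(); none of these names are the real libprocess API.

{code}
#include <algorithm>
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Stand-in for the real connection attempt; here the "master" starts
// accepting connections on the third try.
bool tryLink(const std::string& master)
{
  static int attempts = 0;
  return ++attempts >= 3;
}

// Retry until a persistent connection exists; before that point every
// agent->master send would otherwise pay a full TCP setup/teardown.
void linkWithBackoff(const std::string& master)
{
  std::chrono::seconds backoff(1);
  const std::chrono::seconds maxBackoff(60);

  while (!tryLink(master)) {
    std::cerr << "link to " << master << " failed; retrying in "
              << backoff.count() << "s" << std::endl;
    std::this_thread::sleep_for(backoff);
    backoff = std::min(backoff * 2, maxBackoff);
  }

  std::cout << "persistent connection to " << master << " established"
            << std::endl;
}

int main()
{
  linkWithBackoff("10.20.30.40:5050");
}
{code}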





[jira] [Commented] (MESOS-5188) docker executor thinks task is failed when docker container was stopped

2016-04-12 Thread Liqiang Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238353#comment-15238353
 ] 

Liqiang Lin commented on MESOS-5188:


If that's the case, the executor should remove the stopped docker container 
when shutting down, rather than try to stop it again.
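
A sketch of that idea (hypothetical, not the executor's actual code): decide 
between stopping and removing based on whether the container is still running.

{code}
// If the container was already stopped out-of-band (e.g. via 'docker stop'),
// trying to stop it again is pointless; just remove the stopped container.
// 'containerIsRunning' is a hypothetical flag for illustration.
if (containerIsRunning) {
  stop = docker->stop(containerName, stopTimeout);
} else {
  stop = docker->rm(containerName);
}
{code}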

> docker executor thinks task is failed when docker container was stopped
> ---
>
> Key: MESOS-5188
> URL: https://issues.apache.org/jira/browse/MESOS-5188
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.28.0
>Reporter: Liqiang Lin
> Fix For: 0.29.0
>
>
> Test cases:
> 1. Launch a task with Swarm (on Mesos).
> {code}
> # docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300
> {code}
> 2. Then stop the docker container.
> {code}
> # docker -H 192.168.56.110:54375 ps
> CONTAINER IDIMAGE   COMMAND CREATED   
>   STATUS  PORTS   NAMES
> b4813ba3ed4dubuntu  "sleep 300" 9 seconds ago 
>   Up 8 seconds
> mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958
> # docker -H 192.168.56.110:54375 stop b4813ba3ed4d
> b4813ba3ed4d
> {code}
> 3. Found the task is failed. See Mesos slave log,
> {code}
> I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 
> for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for 
> framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
>  to user 'root'
> I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 
> of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
> I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for 
> executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received 
> within 75secs
> I0407 09:12:18.460212 32307 slave.cpp:4593] Current disk usage 56.53%. Max 
> allowed age: 2.342613645432778days
> I0407 09:12:18.463484 32307 slave.cpp:928] Re-detecting master
> I0407 09:12:18.463969 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.464501 32307 slave.cpp:939] New master detected at 
> master@192.168.56.110:5050
> I0407 09:12:18.464848 32307 slave.cpp:964] No credentials provided. 
> Attempting to register without authentication
> I0407 09:12:18.465237 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.463611 32312 status_update_manager.cpp:174] Pausing sending 
> status updates
> I0407 09:12:18.465744 32312 status_update_manager.cpp:174] Pausing sending 
> status updates
> I0407 09:12:18.472323 32313 docker.cpp:1011] Starting container 
> '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for task '99ee7dc74861' (and executor 
> '99ee7dc74861') of framework '5b84aad8-dd60-40b3-84c2-93be6b7aa81c-'
> I0407 09:12:18.588739 32313 slave.cpp:1218] Re-registered with master 
> master@192.168.56.110:5050
> I0407 09:12:18.588927 32313 slave.cpp:1254] Forwarding total oversubscribed 
> resources
> I0407 09:12:18.589320 32313 slave.cpp:2395] Updating framework 
> 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- pid to 
> scheduler(1)@192.168.56.110:53375
> I0407 09:12:18.592079 32308 status_update_manager.cpp:181] Resuming sending 
> status updates
> I0407 09:12:18.592842 32313 slave.cpp:2534] Updated checkpointed resources 
> from  to
> I0407 09:12:18.592793 32308 status_update_manager.cpp:181] Resuming sending 
> status updates
> I0407 09:12:20.582041 32307 slave.cpp:2836] Got registration for executor 
> '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from 
> executor(1)@192.168.56.110:40725
> I0407 09:12:20.584446 32307 docker.cpp:1308] Ignoring updating container 
> '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' with resources passed to update is 
> identical to existing resources
> I0407 09:12:20.585093 32307 slave.cpp:2010] Sending queued task 
> '99ee7dc74861' to executor '99ee7dc74861' of framework 
> 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- at executor(1)@192.168.56.110:40725
> I0407 09:12:21.307077 32312 slave.cpp:3195] Handling status update 
> TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for 

[jira] [Commented] (MESOS-3923) Implement AuthN handling in Master for the Scheduler endpoint

2016-04-12 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238249#comment-15238249
 ] 

Adam B commented on MESOS-3923:
---

Ok, just curious how framework schedulers are expected to know how to fill out 
the request headers, etc. to accommodate a custom authenticator module, e.g. 
for SPNEGO or token-based authn. We can discuss offline.

> Implement AuthN handling in Master for the Scheduler endpoint
> -
>
> Key: MESOS-3923
> URL: https://issues.apache.org/jira/browse/MESOS-3923
> Project: Mesos
>  Issue Type: Bug
>  Components: framework, HTTP API, master
>Affects Versions: 0.25.0
>Reporter: Ben Whitehead
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> If authentication(AuthN) is enabled on a master, frameworks attempting to use 
> the HTTP Scheduler API can't register.
> {code}
> $ cat /tmp/subscribe-943257503176798091.bin | http --print=HhBb --stream 
> --pretty=colors --auth verification:password1 POST :5050/api/v1/scheduler 
> Accept:application/x-protobuf Content-Type:application/x-protobuf
> POST /api/v1/scheduler HTTP/1.1
> Connection: keep-alive
> Content-Type: application/x-protobuf
> Accept-Encoding: gzip, deflate
> Accept: application/x-protobuf
> Content-Length: 126
> User-Agent: HTTPie/0.9.0
> Host: localhost:5050
> Authorization: Basic dmVyaWZpY2F0aW9uOnBhc3N3b3JkMQ==
> +-+
> | NOTE: binary data not shown in terminal |
> +-+
> HTTP/1.1 401 Unauthorized
> Date: Fri, 13 Nov 2015 20:00:45 GMT
> WWW-authenticate: Basic realm="Mesos master"
> Content-Length: 65
> HTTP schedulers are not supported when authentication is required
> {code}
> Authorization(AuthZ) is already supported for HTTP based frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5199) mesos-execute prints a confusing message when launching a task

2016-04-12 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5199:
--

 Summary: mesos-execute prints a confusing message when launching a task
 Key: MESOS-5199
 URL: https://issues.apache.org/jira/browse/MESOS-5199
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu
Priority: Minor


{code}
root@mesos002:~/src/mesos/m2/mesos/build# src/mesos-execute 
--master=192.168.56.12:5050 --name=test --docker_image=ubuntu:14.04 
--command="ls /root"
I0413 07:28:03.833521  2295 scheduler.cpp:175] Version: 0.29.0
Subscribed with ID '3a1af11e-cf66-4ce2-826d-48b332977999-0001'
Submitted task 'test' to agent '3a1af11e-cf66-4ce2-826d-48b332977999-S0'
Received status update TASK_RUNNING for task 'test'
  source: SOURCE_EXECUTOR
  reason: REASON_COMMAND_EXECUTOR_FAILED <<< 
Received status update TASK_FINISHED for task 'test'
  message: 'Command exited with status 0'
  source: SOURCE_EXECUTOR
  reason: REASON_COMMAND_EXECUTOR_FAILED <<<
root@mesos002:~/src/mesos/m2/mesos/build#
{code}
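
The confusing part is that REASON_COMMAND_EXECUTOR_FAILED is attached even to 
healthy TASK_RUNNING and TASK_FINISHED updates. A sketch of the expected 
behavior (illustrative only, not the actual fix):

{code}
// Only print a failure reason on terminal failure updates, so that
// TASK_RUNNING/TASK_FINISHED output is not decorated with
// REASON_COMMAND_EXECUTOR_FAILED.
void printUpdate(const mesos::TaskStatus& status)
{
  std::cout << "Received status update "
            << mesos::TaskState_Name(status.state())
            << " for task '" << status.task_id().value() << "'" << std::endl;

  if (status.state() == mesos::TASK_FAILED && status.has_reason()) {
    std::cout << "  reason: "
              << mesos::TaskStatus_Reason_Name(status.reason()) << std::endl;
  }
}
{code}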



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3923) Implement AuthN handling in Master for the Scheduler endpoint

2016-04-12 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238190#comment-15238190
 ] 

Anand Mazumdar commented on MESOS-3923:
---

[~adam-mesos] This uses the HTTP authenticator like the other HTTP operator 
endpoints do. The change looked pretty straightforward, so we decided not to 
write an explicit design doc. Do you see any potential issues that we need 
to be aware of with this approach?

> Implement AuthN handling in Master for the Scheduler endpoint
> -
>
> Key: MESOS-3923
> URL: https://issues.apache.org/jira/browse/MESOS-3923
> Project: Mesos
>  Issue Type: Bug
>  Components: framework, HTTP API, master
>Affects Versions: 0.25.0
>Reporter: Ben Whitehead
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> If authentication(AuthN) is enabled on a master, frameworks attempting to use 
> the HTTP Scheduler API can't register.
> {code}
> $ cat /tmp/subscribe-943257503176798091.bin | http --print=HhBb --stream 
> --pretty=colors --auth verification:password1 POST :5050/api/v1/scheduler 
> Accept:application/x-protobuf Content-Type:application/x-protobuf
> POST /api/v1/scheduler HTTP/1.1
> Connection: keep-alive
> Content-Type: application/x-protobuf
> Accept-Encoding: gzip, deflate
> Accept: application/x-protobuf
> Content-Length: 126
> User-Agent: HTTPie/0.9.0
> Host: localhost:5050
> Authorization: Basic dmVyaWZpY2F0aW9uOnBhc3N3b3JkMQ==
> +-+
> | NOTE: binary data not shown in terminal |
> +-+
> HTTP/1.1 401 Unauthorized
> Date: Fri, 13 Nov 2015 20:00:45 GMT
> WWW-authenticate: Basic realm="Mesos master"
> Content-Length: 65
> HTTP schedulers are not supported when authentication is required
> {code}
> Authorization(AuthZ) is already supported for HTTP based frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3923) Implement AuthN handling in Master for the Scheduler endpoint

2016-04-12 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238139#comment-15238139
 ] 

Adam B commented on MESOS-3923:
---

Is there a (brief) design doc for this? Does this use the HTTP authenticator or 
the framework authenticator (module)?

> Implement AuthN handling in Master for the Scheduler endpoint
> -
>
> Key: MESOS-3923
> URL: https://issues.apache.org/jira/browse/MESOS-3923
> Project: Mesos
>  Issue Type: Bug
>  Components: framework, HTTP API, master
>Affects Versions: 0.25.0
>Reporter: Ben Whitehead
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> If authentication(AuthN) is enabled on a master, frameworks attempting to use 
> the HTTP Scheduler API can't register.
> {code}
> $ cat /tmp/subscribe-943257503176798091.bin | http --print=HhBb --stream 
> --pretty=colors --auth verification:password1 POST :5050/api/v1/scheduler 
> Accept:application/x-protobuf Content-Type:application/x-protobuf
> POST /api/v1/scheduler HTTP/1.1
> Connection: keep-alive
> Content-Type: application/x-protobuf
> Accept-Encoding: gzip, deflate
> Accept: application/x-protobuf
> Content-Length: 126
> User-Agent: HTTPie/0.9.0
> Host: localhost:5050
> Authorization: Basic dmVyaWZpY2F0aW9uOnBhc3N3b3JkMQ==
> +-+
> | NOTE: binary data not shown in terminal |
> +-+
> HTTP/1.1 401 Unauthorized
> Date: Fri, 13 Nov 2015 20:00:45 GMT
> WWW-authenticate: Basic realm="Mesos master"
> Content-Length: 65
> HTTP schedulers are not supported when authentication is required
> {code}
> Authorization(AuthZ) is already supported for HTTP based frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3923) Implement AuthN handling in Master for the Scheduler endpoint

2016-04-12 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238118#comment-15238118
 ] 

Anand Mazumdar commented on MESOS-3923:
---

Review chain: https://reviews.apache.org/r/46113/

> Implement AuthN handling in Master for the Scheduler endpoint
> -
>
> Key: MESOS-3923
> URL: https://issues.apache.org/jira/browse/MESOS-3923
> Project: Mesos
>  Issue Type: Bug
>  Components: framework, HTTP API, master
>Affects Versions: 0.25.0
>Reporter: Ben Whitehead
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> If authentication(AuthN) is enabled on a master, frameworks attempting to use 
> the HTTP Scheduler API can't register.
> {code}
> $ cat /tmp/subscribe-943257503176798091.bin | http --print=HhBb --stream 
> --pretty=colors --auth verification:password1 POST :5050/api/v1/scheduler 
> Accept:application/x-protobuf Content-Type:application/x-protobuf
> POST /api/v1/scheduler HTTP/1.1
> Connection: keep-alive
> Content-Type: application/x-protobuf
> Accept-Encoding: gzip, deflate
> Accept: application/x-protobuf
> Content-Length: 126
> User-Agent: HTTPie/0.9.0
> Host: localhost:5050
> Authorization: Basic dmVyaWZpY2F0aW9uOnBhc3N3b3JkMQ==
> +-+
> | NOTE: binary data not shown in terminal |
> +-+
> HTTP/1.1 401 Unauthorized
> Date: Fri, 13 Nov 2015 20:00:45 GMT
> WWW-authenticate: Basic realm="Mesos master"
> Content-Length: 65
> HTTP schedulers are not supported when authentication is required
> {code}
> Authorization(AuthZ) is already supported for HTTP based frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5131) Slave allows the resource estimator to send non-revocable resources.

2016-04-12 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5131:
---
Summary: Slave allows the resource estimator to send non-revocable 
resources.  (was: DRF allocator crashes master with CHECK when resource is 
incorrect)

> Slave allows the resource estimator to send non-revocable resources.
> 
>
> Key: MESOS-5131
> URL: https://issues.apache.org/jira/browse/MESOS-5131
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, oversubscription
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Critical
>
> We were testing a custom resource estimator which broadcasts oversubscribed 
> resources, but they are not marked as "revocable".
> This unfortunately triggered the following check in hierarchical allocator:
> {code}
> void HierarchicalAllocatorProcess::updateSlave( 
>   // Check that all the oversubscribed resources are revocable.
>   CHECK_EQ(oversubscribed, oversubscribed.revocable());
> {code}
> This definitely shouldn't happen in a production cluster. IMO, we should do 
> both of the following:
> 1. Make sure incorrect resources are not sent from the agent (even crashing 
> the agent process would be better);
> 2. Decline agent registration if its resources are incorrect, or even tell it 
> to shut down, and possibly remove this check.
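> A self-contained sketch of suggestion 1 (plain C++ for illustration, not 
> Mesos code): validate the estimator's output on the agent and drop anything 
> non-revocable instead of forwarding it to the master.
> {code}
> #include <iostream>
> #include <string>
> #include <vector>
> 
> struct Resource
> {
>   std::string name;
>   double amount;
>   bool revocable;
> };
> 
> // Every oversubscribed resource must be revocable.
> bool allRevocable(const std::vector<Resource>& oversubscribed)
> {
>   for (const Resource& resource : oversubscribed) {
>     if (!resource.revocable) {
>       return false;
>     }
>   }
>   return true;
> }
> 
> int main()
> {
>   std::vector<Resource> estimate = {{"cpus", 4.0, /*revocable=*/false}};
>   if (!allRevocable(estimate)) {
>     // Better to drop (or crash the agent) than to hit a CHECK on the master.
>     std::cerr << "Dropping non-revocable oversubscribed resources"
>               << std::endl;
>   }
> }
> {code}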



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5131) DRF allocator crashes master with CHECK when resource is incorrect

2016-04-12 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5131:
---
Shepherd: Benjamin Mahler

> DRF allocator crashes master with CHECK when resource is incorrect
> --
>
> Key: MESOS-5131
> URL: https://issues.apache.org/jira/browse/MESOS-5131
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, oversubscription
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Critical
>
> We were testing a custom resource estimator which broadcasts oversubscribed 
> resources, but they are not marked as "revocable".
> This unfortunately triggered the following check in hierarchical allocator:
> {code}
> void HierarchicalAllocatorProcess::updateSlave( 
>   // Check that all the oversubscribed resources are revocable.
>   CHECK_EQ(oversubscribed, oversubscribed.revocable());
> {code}
> This definitely shouldn't happen in a production cluster. IMO, we should do 
> both of the following:
> 1. Make sure incorrect resources are not sent from the agent (even crashing 
> the agent process would be better);
> 2. Decline agent registration if its resources are incorrect, or even tell it 
> to shut down, and possibly remove this check.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5198) state.json incorrectly serves an empty {{executors}} field

2016-04-12 Thread Michael Gummelt (JIRA)
Michael Gummelt created MESOS-5198:
--

 Summary: state.json incorrectly serves an empty {{executors}} field
 Key: MESOS-5198
 URL: https://issues.apache.org/jira/browse/MESOS-5198
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.28.1
Reporter: Michael Gummelt


The {{frameworks.executors}} array in {{state.json}} is empty, despite the 
framework having running tasks.  I believe this is incorrect, since you can't 
have tasks w/o an executor.  Perhaps the intended meaning is "custom 
executors", but I think we should serve info for all executors run by the 
framework, including command executors.  I often need to look up, for example, 
which command is run by the command executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5197) Log executor commands w/o verbose logs enabled

2016-04-12 Thread Michael Gummelt (JIRA)
Michael Gummelt created MESOS-5197:
--

 Summary: Log executor commands w/o verbose logs enabled
 Key: MESOS-5197
 URL: https://issues.apache.org/jira/browse/MESOS-5197
 Project: Mesos
  Issue Type: Task
Reporter: Michael Gummelt


To debug executors, it's often necessary to know the command that ran the 
executor.  For example, when Spark executors fail, I'd like to know the command 
used to invoke the executor (Spark uses the command executor in a docker 
container).  Currently, it's only output if GLOG_v is enabled, but I don't 
think this should be a "verbose" output.  It's a common debugging need.

https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677
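
For illustration, the change could be as small as promoting the log level (a 
sketch that assumes the line in question is a glog VLOG; the actual code may 
differ):

{code}
#include <glog/logging.h>

#include <string>

void logCommand(const std::string& cmd)
{
  VLOG(1) << "Running " << cmd;   // before: visible only with GLOG_v >= 1
  LOG(INFO) << "Running " << cmd; // after: always visible
}

int main(int argc, char** argv)
{
  google::InitGoogleLogging(argv[0]);
  FLAGS_logtostderr = true;
  logCommand("docker run ...");
}
{code}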

cc [~kaysoky]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5197) Log executor commands w/o verbose logs enabled

2016-04-12 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt updated MESOS-5197:
---
Labels: mesosphere  (was: )

> Log executor commands w/o verbose logs enabled
> --
>
> Key: MESOS-5197
> URL: https://issues.apache.org/jira/browse/MESOS-5197
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Gummelt
>  Labels: mesosphere
>
> To debug executors, it's often necessary to know the command that ran the 
> executor.  For example, when Spark executors fail, I'd like to know the 
> command used to invoke the executor (Spark uses the command executor in a 
> docker container).  Currently, it's only output if GLOG_v is enabled, but I 
> don't think this should be a "verbose" output.  It's a common debugging need.
> https://github.com/apache/mesos/blob/2e76199a3dd977152110fbb474928873f31f7213/src/docker/docker.cpp#L677
> cc [~kaysoky]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5196) Sandbox GC shouldn't return early in the face of an error.

2016-04-12 Thread Yan Xu (JIRA)
Yan Xu created MESOS-5196:
-

 Summary: Sandbox GC shouldn't return early in the face of an error.
 Key: MESOS-5196
 URL: https://issues.apache.org/jira/browse/MESOS-5196
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Yan Xu


Since GC's purpose is to clean up stuff that no one cares about anymore, it 
should do its best to recover as much disk space as possible.

In practice it's not easy for GC to anticipate, in a generic manner, what the 
task has done to the sandbox (e.g., immutable file attributes, mount points, 
etc.). The least it can do is log the error and continue with the rest of the 
files in the sandbox.
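
A minimal sketch of that best-effort behavior (modern C++ for illustration; 
the actual GC operates on different types and paths):

{code}
#include <filesystem>
#include <iostream>
#include <vector>

namespace fs = std::filesystem;

// Log each failure and keep going, rather than returning on the first error.
void gcSandbox(const fs::path& sandbox)
{
  std::vector<fs::path> children;
  for (const auto& entry : fs::directory_iterator(sandbox)) {
    children.push_back(entry.path());
  }

  for (const fs::path& child : children) {
    std::error_code error;
    fs::remove_all(child, error); // may fail: immutable files, mounts, ...
    if (error) {
      std::cerr << "GC: failed to remove " << child << ": " << error.message()
                << "; continuing with the remaining files" << std::endl;
    }
  }
}

int main(int argc, char** argv)
{
  if (argc == 2) {
    gcSandbox(argv[1]);
  }
}
{code}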



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4465) Implement pagesize facilities in Windows

2016-04-12 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-4465:

Description: (was: https://reviews.apache.org/r/45943/
https://reviews.apache.org/r/45944/)

> Implement pagesize facilities in Windows
> 
>
> Key: MESOS-4465
> URL: https://issues.apache.org/jira/browse/MESOS-4465
> Project: Mesos
>  Issue Type: Bug
>  Components: cmake
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: mesosphere, stout, windows-mvp
> Fix For: 0.29.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4465) Implement pagesize facilities in Windows

2016-04-12 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237977#comment-15237977
 ] 

Michael Park commented on MESOS-4465:
-

https://reviews.apache.org/r/45943/
https://reviews.apache.org/r/45944/

> Implement pagesize facilities in Windows
> 
>
> Key: MESOS-4465
> URL: https://issues.apache.org/jira/browse/MESOS-4465
> Project: Mesos
>  Issue Type: Bug
>  Components: cmake
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: mesosphere, stout, windows-mvp
> Fix For: 0.29.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5195) Docker executor: task logs lost on shutdown

2016-04-12 Thread Steven Schlansker (JIRA)
Steven Schlansker created MESOS-5195:


 Summary: Docker executor: task logs lost on shutdown
 Key: MESOS-5195
 URL: https://issues.apache.org/jira/browse/MESOS-5195
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 0.27.2
 Environment: Linux 4.4.2 "Ubuntu 14.04.2 LTS"
Reporter: Steven Schlansker


When you try to kill a task running in the Docker executor (in our case via 
Singularity), the task shuts down cleanly but the last logs to standard out / 
standard error are lost in teardown.

For example, we run dumb-init.  With debugging on, you can see it should write:
{noformat}
DEBUG("Forwarded signal %d to children.\n", signum);
{noformat}

If you attach strace to the process, you can see it clearly writes the text to 
stderr.  But that message is lost and is never written to the sandbox 'stderr' 
file.

We believe the issue starts here, in Docker executor.cpp:

{code}
  void shutdown(ExecutorDriver* driver)
  {
    cout << "Shutting down" << endl;

    if (run.isSome() && !killed) {
      // The docker daemon might still be in progress starting the
      // container, therefore we kill both the docker run process
      // and also ask the daemon to stop the container.

      // Making a mutable copy of the future so we can call discard.
      Future<Option<int>>(run.get()).discard();
      stop = docker->stop(containerName, stopTimeout);
      killed = true;
    }
  }
{code}

Notice how the "run" future is discarded *before* the Docker daemon is told to 
stop -- now what will discarding it do?

{code}
void commandDiscarded(const Subprocess& s, const string& cmd)
{
  VLOG(1) << "'" << cmd << "' is being discarded";
  os::killtree(s.pid(), SIGKILL);
}
{code}

Oops, just sent SIGKILL to the entire process tree...

You can see another (harmless?) side effect in the Docker daemon logs: it never 
gets a chance to kill the task:

{noformat}
ERROR Handler for DELETE 
/v1.22/containers/mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
 returned error: No such container: 
mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
{noformat}

I suspect that the fix is to wait for 'docker->stop()' to complete before 
discarding the 'run' future.
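
A sketch of that ordering (hypothetical, not the actual patch; error handling 
and the executor's actor context are elided):

{code}
// Ask the daemon to stop the container first, and only discard the 'run'
// future -- which SIGKILLs the whole 'docker run' process tree, including
// the log forwarding -- once the stop has completed.
stop = docker->stop(containerName, stopTimeout)
  .onAny([=](const Future<Nothing>&) {
    // stdout/stderr have been flushed to the sandbox by now.
    Future<Option<int>>(run.get()).discard();
  });
killed = true;
{code}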

Happy to provide more information if necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5194) Python build failure (non-deterministic)

2016-04-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5194:
---
Attachment: AWSMesosCI_Centos_6_-_SSL_587.log.gz

> Python build failure (non-deterministic)
> 
>
> Key: MESOS-5194
> URL: https://issues.apache.org/jira/browse/MESOS-5194
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>  Labels: mesosphere, python
> Attachments: AWSMesosCI_Centos_6_-_SSL_587.log.gz
>
>
> Observed on Mesosphere internal CI. The complete log file is attached, but 
> the suspicious part looks like:
> {noformat}
> [19:32:08] : [Step 6/10] running build_ext
> [19:32:08] : [Step 6/10] copying 
> src/mesos.native.egg-info/namespace_packages.txt -> 
> build/bdist.linux-x86_64/egg/EGG-INFO
> [19:32:08] : [Step 6/10] copying src/mesos.egg-info/dependency_links.txt 
> -> build/bdist.linux-x86_64/egg/EGG-INFO
> [19:32:08] : [Step 6/10] copying src/mesos.native.egg-info/requires.txt 
> -> build/bdist.linux-x86_64/egg/EGG-INFO
> [19:32:08] : [Step 6/10] copying src/mesos.egg-info/requires.txt -> 
> build/bdist.linux-x86_64/egg/EGG-INFO
> [19:32:08] : [Step 6/10] copying src/mesos.native.egg-info/top_level.txt 
> -> build/bdist.linux-x86_64/egg/EGG-INFO
> [19:32:08] : [Step 6/10] copying src/mesos.egg-info/top_level.txt -> 
> build/bdist.linux-x86_64/egg/EGG-INFO
> [19:32:08] : [Step 6/10] zip_safe flag not set; analyzing archive 
> contents...
> [19:32:08] : [Step 6/10] zip_safe flag not set; analyzing archive 
> contents...
> [19:32:08] : [Step 6/10] mesos.__init__: module references __path__
> [19:32:08] : [Step 6/10] mesos.__init__: module references __path__
> [19:32:08] : [Step 6/10] creating 
> /mnt/teamcity/work/4240ba9ddd0997c3/build/src/python/dist
> [19:32:08] : [Step 6/10] creating 
> /mnt/teamcity/work/4240ba9ddd0997c3/build/src/python/dist
> [19:32:08]W: [Step 6/10] error: could not create 
> '/mnt/teamcity/work/4240ba9ddd0997c3/build/src/python/dist': File exists
> [19:32:08] : [Step 6/10] creating 
> '/mnt/teamcity/work/4240ba9ddd0997c3/build/src/python/dist/mesos-0.29.0-py2.6.egg'
>  and adding 'build/bdist.linux-x86_64/egg' to it
> [19:32:08] : [Step 6/10] building 'mesos.executor._executor' extension
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5194) Python build failure (non-deterministic)

2016-04-12 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5194:
--

 Summary: Python build failure (non-deterministic)
 Key: MESOS-5194
 URL: https://issues.apache.org/jira/browse/MESOS-5194
 Project: Mesos
  Issue Type: Bug
Reporter: Neil Conway


Observed on Mesosphere internal CI. The complete log file is attached, but the 
suspicious part looks like:

{noformat}
[19:32:08] : [Step 6/10] running build_ext
[19:32:08] : [Step 6/10] copying 
src/mesos.native.egg-info/namespace_packages.txt -> 
build/bdist.linux-x86_64/egg/EGG-INFO
[19:32:08] : [Step 6/10] copying src/mesos.egg-info/dependency_links.txt -> 
build/bdist.linux-x86_64/egg/EGG-INFO
[19:32:08] : [Step 6/10] copying src/mesos.native.egg-info/requires.txt -> 
build/bdist.linux-x86_64/egg/EGG-INFO
[19:32:08] : [Step 6/10] copying src/mesos.egg-info/requires.txt -> 
build/bdist.linux-x86_64/egg/EGG-INFO
[19:32:08] : [Step 6/10] copying src/mesos.native.egg-info/top_level.txt -> 
build/bdist.linux-x86_64/egg/EGG-INFO
[19:32:08] : [Step 6/10] copying src/mesos.egg-info/top_level.txt -> 
build/bdist.linux-x86_64/egg/EGG-INFO
[19:32:08] : [Step 6/10] zip_safe flag not set; analyzing archive 
contents...
[19:32:08] : [Step 6/10] zip_safe flag not set; analyzing archive 
contents...
[19:32:08] : [Step 6/10] mesos.__init__: module references __path__
[19:32:08] : [Step 6/10] mesos.__init__: module references __path__
[19:32:08] : [Step 6/10] creating 
/mnt/teamcity/work/4240ba9ddd0997c3/build/src/python/dist
[19:32:08] : [Step 6/10] creating 
/mnt/teamcity/work/4240ba9ddd0997c3/build/src/python/dist
[19:32:08]W: [Step 6/10] error: could not create 
'/mnt/teamcity/work/4240ba9ddd0997c3/build/src/python/dist': File exists
[19:32:08] : [Step 6/10] creating 
'/mnt/teamcity/work/4240ba9ddd0997c3/build/src/python/dist/mesos-0.29.0-py2.6.egg'
 and adding 'build/bdist.linux-x86_64/egg' to it
[19:32:08] : [Step 6/10] building 'mesos.executor._executor' extension
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-12 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237882#comment-15237882
 ] 

Joseph Wu commented on MESOS-5193:
--

Probably not related to this problem you're seeing, but using {{/tmp}} as your 
{{work_dir}} is problematic.  
See this and related JIRAs: https://issues.apache.org/jira/browse/MESOS-5064

> Recovery failed: Failed to recover registrar on reboot of mesos master
> --
>
> Key: MESOS-5193
> URL: https://issues.apache.org/jira/browse/MESOS-5193
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.22.0, 0.27.0
>Reporter: Priyanka Gupta
>  Labels: master, mesosphere
>
> Hi all, 
> We are using a 3-node cluster with a mesos master, mesos slave, and zookeeper 
> on all of them, with chronos running on top. The problem is that when we 
> reboot the mesos master leader, the other nodes try to get elected as leader 
> but fail with a registrar recovery error: 
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 
> 1mins"
> The next node then tries to become the leader but again fails with the same 
> error. I am not sure about the cause. We are currently using mesos 0.22 and 
> also tried upgrading to mesos 0.27, but the problem continues to happen. 
>  /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir 
> --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it's a production system.
> Thanks,
> Priyanka



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-12 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237874#comment-15237874
 ] 

Neil Conway commented on MESOS-5193:


Hi [~prigupta] -- can you post the complete log files for all three nodes? I'd 
like to make sure that the snippets you've posted are not missing some 
important context. Thanks!

> Recovery failed: Failed to recover registrar on reboot of mesos master
> --
>
> Key: MESOS-5193
> URL: https://issues.apache.org/jira/browse/MESOS-5193
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.22.0, 0.27.0
>Reporter: Priyanka Gupta
>  Labels: master, mesosphere
>
> Hi all, 
> We are using a 3-node cluster with a mesos master, mesos slave, and zookeeper 
> on all of them, with chronos running on top. The problem is that when we 
> reboot the mesos master leader, the other nodes try to get elected as leader 
> but fail with a registrar recovery error: 
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 
> 1mins"
> The next node then tries to become the leader but again fails with the same 
> error. I am not sure about the cause. We are currently using mesos 0.22 and 
> also tried upgrading to mesos 0.27, but the problem continues to happen. 
>  /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir 
> --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it's a production system.
> Thanks,
> Priyanka



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5187) filesystem/linux isolator does not set the permissions of the host_path

2016-04-12 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237725#comment-15237725
 ] 

Ian Downes commented on MESOS-5187:
---

The highlighted code was intended for this quite specific use-case: masking a 
system directory and inheriting its mode. I agree that the filesystem/linux 
isolator should support this use-case but suggest that it be made explicit, 
perhaps by extending the Volume message to include setting the directory mode 
(different from the existing Volume::Mode) when creating container-relative 
paths. [~jieyu] thoughts?
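
To make the suggestion concrete, a minimal sketch (plain POSIX calls, not the 
isolator's code) of creating a container-relative host_path that inherits the 
mode of the system directory it masks:

{code}
#include <sys/stat.h>

// Inherit the permission bits of the masked directory (e.g. a 1777 /tmp),
// so non-root task users can still write to the backing host_path.
bool createWithInheritedMode(const char* hostPath, const char* maskedDir)
{
  struct stat status;
  if (::stat(maskedDir, &status) != 0) { // e.g. /tmp on the host
    return false;
  }

  if (::mkdir(hostPath, 0700) != 0) {    // created under the work directory
    return false;
  }

  // Apply the masked directory's mode, including setuid/setgid/sticky bits.
  return ::chmod(hostPath, status.st_mode & 07777) == 0;
}
{code}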

> filesystem/linux isolator does not set the permissions of the host_path
> ---
>
> Key: MESOS-5187
> URL: https://issues.apache.org/jira/browse/MESOS-5187
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 0.26.0
> Environment: Mesos 0.26.0, Apache Aurora 0.12
>Reporter: Stephan Erb
>
> The {{filesystem/linux}} isolator is not a drop in replacement for the 
> {{filesystem/shared}} isolator. This should be considered before the latter 
> is deprecated.
> We are currently using the {{filesystem/shared}} isolator together with the 
> following slave option. This provides us with a private {{/tmp}} and 
> {{/var/tmp}} folder for each task.
> {code}
> --default_container_info='{
> "type": "MESOS",
> "volumes": [
> {"host_path": "system/tmp", "container_path": "/tmp", 
>"mode": "RW"},
> {"host_path": "system/vartmp",  "container_path": "/var/tmp", 
>"mode": "RW"}
> ]
> }'
> {code}
> When browsing the Mesos sandbox, one can see the following permissions:
> {code}
> mode  nlink   uid gid sizemtime   
> drwxrwxrwx3   rootroot4 KBApr 11 18:16 tmp
> drwxrwxrwx2   rootroot4 KBApr 11 18:15 vartmp 
> {code}
> However, when running with the new {{filesystem/linux}} isolator, the 
> permissions are different:
> {code}
> mode  nlink   uid gid sizemtime   
> drwxr-xr-x 2  rootroot4 KBApr 12 10:34 tmp
> drwxr-xr-x 2  rootroot4 KBApr 12 10:34 vartmp
> {code}
> This prevents user code (running as a non-root user) from writing to those 
> folders, i.e. every write attempt fails with permission denied. 
> *Context*:
> * We are using Apache Aurora. Aurora is running its custom executor as root 
> but then switches to a non-privileged user before running the actual user 
> code. 
> * The follow code seems to have enabled our usecase in the existing 
> {{filesystem/shared}} isolator: 
> https://github.com/apache/mesos/blob/4d2b1b793e07a9c90b984ca330a3d7bc9e1404cc/src/slave/containerizer/mesos/isolators/filesystem/shared.cpp#L175-L198
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4922) Setup proper /etc/hostname, /etc/hosts and /etc/resolv.conf for containers in network/cni isolator.

2016-04-12 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237726#comment-15237726
 ] 

Jie Yu commented on MESOS-4922:
---

commit 2d75778557d3d29291c4ec98d00d6784010be769
Author: Avinash sridharan 
Date:   Tue Apr 12 09:02:39 2016 -0700

Added `subcommand` to `network/cni` isolator.

The `subcommand` allow configuring the container hostname and setting up
network files in the container such as /etc/hosts, /etc/hostname,
/etc/resolv.conf. This will allow for correct name to IP resolution
within the container network namespace.

Review: https://reviews.apache.org/r/45954/

> Setup proper /etc/hostname, /etc/hosts and /etc/resolv.conf for containers in 
> network/cni isolator.
> ---
>
> Key: MESOS-4922
> URL: https://issues.apache.org/jira/browse/MESOS-4922
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> The network/cni isolator needs to properly setup /etc/hostname and /etc/hosts 
> for the container with a hostname (e.g., randomly generated) and the assigned 
> IP returned by CNI plugin.
> We should consider the following cases:
> 1) container is using host filesystem
> 2) container is using a different filesystem
> 3) custom executor and command executor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4785) Reorganize ACL subject/object descriptions

2016-04-12 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237597#comment-15237597
 ] 

Adam B commented on MESOS-4785:
---

On it

> Reorganize ACL subject/object descriptions
> --
>
> Key: MESOS-4785
> URL: https://issues.apache.org/jira/browse/MESOS-4785
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Alexander Rojas
>  Labels: documentation, mesosphere, security
> Fix For: 0.29.0
>
>
> The authorization documentation would benefit from a reorganization of the 
> ACL subject/object descriptions. Instead of simple lists of the available 
> subjects and objects, it would be nice to see a table showing which subject 
> and object is used with each action.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4785) Reorganize ACL subject/object descriptions

2016-04-12 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-4785:
--
Shepherd: Adam B

> Reorganize ACL subject/object descriptions
> --
>
> Key: MESOS-4785
> URL: https://issues.apache.org/jira/browse/MESOS-4785
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Alexander Rojas
>  Labels: documentation, mesosphere, security
> Fix For: 0.29.0
>
>
> The authorization documentation would benefit from a reorganization of the 
> ACL subject/object descriptions. Instead of simple lists of the available 
> subjects and objects, it would be nice to see a table showing which subject 
> and object is used with each action.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-12 Thread Priyanka Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237577#comment-15237577
 ] 

Priyanka Gupta commented on MESOS-5193:
---

Error Stack in mesos master log

Node3
I0411 22:47:02.007249  1348 detector.cpp:479] A new leading master 
(UPID=master@10.221.28.61:5050) is detected
I0411 22:47:02.007380  1348 master.cpp:1710] The newly elected leader is 
master@10.221.28.61:5050 with id 725d1232-bea3-4df5-90c5-6479e5652ef4
I0411 22:47:02.007428  1348 master.cpp:1723] Elected as the leading master!
I0411 22:47:02.007457  1348 master.cpp:1468] Recovering from registrar
I0411 22:47:02.007551  1345 registrar.cpp:307] Recovering registrar
I0411 22:47:02.007649  1356 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050 }
I0411 22:47:02.007841  1356 log.cpp:659] Attempting to start the writer
I0411 22:47:02.008477  1348 replica.cpp:493] Replica received implicit promise 
request from (30)@10.221.28.61:5050 with proposal 52
E0411 22:47:02.008903  1358 process.cpp:1966] Failed to shutdown socket with fd 
23: Transport endpoint is not connected
I0411 22:47:02.009968  1348 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 1.44126ms
I0411 22:47:02.010022  1348 replica.cpp:342] Persisted promised to 52
F0411 22:48:02.008332  1357 master.cpp:1457] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7f4bd5bcedfd  (unknown)
@ 0x7f4bd5bd0c3d  (unknown)
@ 0x7f4bd5bce9ec  (unknown)
@ 0x7f4bd5bd1539  (unknown)
@ 0x7f4bd54022dc  (unknown)
@ 0x7f4bd5442ab0  (unknown)
@   0x42807e  (unknown)
@ 0x7f4bd54690a5  (unknown)
@ 0x7f4bd54bb976  (unknown)
@ 0x7f4bd54cc566  (unknown)
@ 0x7f4bd52fc4d6  (unknown)
@ 0x7f4bd54cc553  (unknown)
@ 0x7f4bd54b0614  (unknown)
@ 0x7f4bd5b7c971  (unknown)
@ 0x7f4bd5b7cc77  (unknown)
@   0x3dc38b6470  (unknown)
@   0x3dc18079d1  (unknown)
@   0x3dc14e88fd  (unknown)
@  (nil)  (unknown)
/bin/bash: line 1:  1313 Aborted /usr/sbin/mesos-master 
--work_dir=/tmp/mesos_dir 
--zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos
 --quorum=2



Node 2

I0411 22:48:10.006216  1466 log.cpp:659] Attempting to start the writer
E0411 22:48:10.006958  1478 process.cpp:1966] Failed to shutdown socket with fd 
23: Transport endpoint is not connected
I0411 22:48:10.007202  1467 replica.cpp:493] Replica received implicit promise 
request from (13)@10.221.28.249:5050 with proposal 52
E0411 22:48:10.007491  1478 process.cpp:1966] Failed to shutdown socket with fd 
23: Transport endpoint is not connected
I0411 22:48:10.008458  1467 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 1.227092ms
I0411 22:48:10.008491  1467 replica.cpp:342] Persisted promised to 52
F0411 22:49:10.006739  1476 master.cpp:1457] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7fec686f2dfd  (unknown)
@ 0x7fec686f4c3d  (unknown)
@ 0x7fec686f29ec  (unknown)
@ 0x7fec686f5539  (unknown)
@ 0x7fec67f262dc  (unknown)
@ 0x7fec67f66ab0  (unknown)
@   0x42807e  (unknown)
@ 0x7fec67f8d0a5  (unknown)
@ 0x7fec67fdf976  (unknown)
@ 0x7fec67ff0566  (unknown)
@ 0x7fec67e204d6  (unknown)
@ 0x7fec67ff0553  (unknown)
@ 0x7fec67fd4614  (unknown)
@ 0x7fec686a0971  (unknown)
@ 0x7fec686a0c77  (unknown)
@   0x37f98b6470  (unknown)
@   0x39ed207a51  (unknown)
@   0x39ecae89ad  (unknown)
@  (nil)  (unknown)
/bin/bash: line 1:  1452 Aborted /usr/sbin/mesos-master 
--work_dir=/tmp/mesos_dir 
--zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos
 --quorum=2



Node 1
I0411 22:45:52.017833  8338 detector.cpp:479] A new leading master 
(UPID=master@10.221.29.247:5050) is detected
I0411 22:45:52.017925  8338 master.cpp:1710] The newly elected leader is 
master@10.221.29.247:5050 with id 13df6437-fbe9-4390-9f6c-db9fd1d53a16
I0411 22:45:52.017956  8338 master.cpp:1723] Elected as the leading master!
I0411 22:45:52.017983  8338 master.cpp:1468] Recovering from registrar
I0411 22:45:52.018069  8339 registrar.cpp:307] Recovering registrar
I0411 22:45:52.018337  8333 log.cpp:659] Attempting to start the writer
I0411 22:45:52.018785  8336 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:45:52.019008  8336 replica.cpp:493] Replica received implicit promise 
request from 

[jira] [Created] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-12 Thread Priyanka Gupta (JIRA)
Priyanka Gupta created MESOS-5193:
-

 Summary: Recovery failed: Failed to recover registrar on reboot of 
mesos master
 Key: MESOS-5193
 URL: https://issues.apache.org/jira/browse/MESOS-5193
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.27.0, 0.22.0
Reporter: Priyanka Gupta


Hi all, 

We are using a 3-node cluster with a mesos master, mesos slave, and zookeeper 
on all of them, with chronos running on top. The problem is that when we reboot 
the mesos master leader, the other nodes try to get elected as leader but fail 
with a registrar recovery error: 
"Recovery failed: Failed to recover registrar: Failed to perform fetch within 
1mins"

The next node then tries to become the leader but again fails with the same 
error. I am not sure about the cause. We are currently using mesos 0.22 and 
also tried upgrading to mesos 0.27, but the problem continues to happen. 

 /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir 
--zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2

Can you please help us resolve this issue, as it's a production system.

Thanks,
Priyanka



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags

2016-04-12 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237546#comment-15237546
 ] 

Vinod Kone commented on MESOS-3781:
---

I took a look at the review you posted and discussed better ways to do it with 
[~bmahler].

Here is what I have in mind:

1) First, do a review where you change the Flag *variables* but keep the old 
*names*. For example:

{code}
  add(&Flags::authenticate_agents,
      "authenticate_slaves",
      "..help string.",
      false);
{code}

Note that this needs changes in master.cpp and other files where these flags 
are accessed.

2) Another review where we add support for Flags::add() to take multiple names 
instead of a single name as an argument. This is similar to the multi-name 
support in Python's argparse.

{code}
  template <typename T1, typename T2, typename F>
  void add(
      T1* t1,
      const std::vector<std::string>& names,
      const std::string& help,
      const T2& t2,
      F validate);
{code}

Note this would need changes to FlagsBase and Flag classes.
{code}
 struct Flag
 {
...
-  std::string name;
+  std::vector<std::string> names;
...
}

struct FlagsBase
{
...
-  std::map<std::string, Flag> flags_;
+  std::map<std::string, std::shared_ptr<Flag>> flags_;
...
}
{code}

Note that we need a shared_ptr in FlagsBase because multiple flag names can 
correspond to the same Flag object (see the sketch after step 4 below).

3) Another review to update master binary flags that contain the "slave" 
keyword to have an additional name with the "agent" keyword. For example:

{code}
  add(&Flags::authenticate_agents,
      {"authenticate_slaves", "authenticate_agents"},
      "If `true`, only authenticated slaves are allowed to register.\n"
      "If `false`, unauthenticated slaves are also allowed to register.\n"
      "--authenticate_slaves flag is *DEPRECATED* in favor of\n"
      "--authenticate_agents.",
      false);
{code}

4) Another review to update slave binary flags as above.
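
As an illustration of the aliasing in step 2 (a sketch of the assumed design, 
not stout's actual implementation):

{code}
// Both names point at the same Flag object, so a value loaded under either
// --authenticate_slaves or --authenticate_agents updates the same state.
auto flag = std::make_shared<Flag>();
flags_["authenticate_slaves"] = flag; // deprecated name
flags_["authenticate_agents"] = flag; // new name, same underlying Flag
{code}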

How does that sound?

> Replace Master/Slave Terminology Phase I - Add duplicate agent flags 
> -
>
> Key: MESOS-3781
> URL: https://issues.apache.org/jira/browse/MESOS-3781
> Project: Mesos
>  Issue Type: Task
>Reporter: Diana Arroyo
>Assignee: Jay Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6

2016-04-12 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-4053:
-
Sprint: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 31, 
Mesosphere Sprint 32  (was: Mesosphere Sprint 26, Mesosphere Sprint 27, 
Mesosphere Sprint 31, Mesosphere Sprint 32, Mesosphere Sprint 33)

> MemoryPressureMesosTest tests fail on CentOS 6.6
> 
>
> Key: MESOS-4053
> URL: https://issues.apache.org/jira/browse/MESOS-4053
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS 6.6
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere, test-failure
>
> {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and 
> {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It 
> seems that mounted cgroups are not properly cleaned up after previous tests, 
> so multiple hierarchies are detected and thus an error is produced:
> {code}
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> ../../src/tests/mesos.cpp:849: Failure
> Value of: _baseHierarchy.get()
>   Actual: "/cgroup"
> Expected: baseHierarchy
> Which is: "/tmp/mesos_test_cgroup"
> -
> Multiple cgroups base hierarchies detected:
>   '/tmp/mesos_test_cgroup'
>   '/cgroup'
> Mesos does not support multiple cgroups base hierarchies.
> Please unmount the corresponding (or all) subsystems.
> -
> ../../src/tests/mesos.cpp:932: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup 
> '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms)
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
> ../../src/tests/mesos.cpp:849: Failure
> Value of: _baseHierarchy.get()
>   Actual: "/cgroup"
> Expected: baseHierarchy
> Which is: "/tmp/mesos_test_cgroup"
> -
> Multiple cgroups base hierarchies detected:
>   '/tmp/mesos_test_cgroup'
>   '/cgroup'
> Mesos does not support multiple cgroups base hierarchies.
> Please unmount the corresponding (or all) subsystems.
> -
> ../../src/tests/mesos.cpp:932: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup 
> '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5139) ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar is flaky

2016-04-12 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-5139:

Shepherd: Jie Yu

> ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar is flaky
> --
>
> Key: MESOS-5139
> URL: https://issues.apache.org/jira/browse/MESOS-5139
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0
> Environment: Ubuntu14.04
>Reporter: Vinod Kone
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> Found this on ASF CI while testing 0.28.1-rc2
> {code}
> [ RUN  ] ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar
> E0406 18:29:30.870481   520 shell.hpp:93] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> E0406 18:29:30.870576   520 fetcher.cpp:59] Failed to create URI fetcher 
> plugin 'hadoop': Failed to create HDFS client: Failed to execute 'hadoop 
> version 2>&1'; the command was either not found or exited with a non-zero 
> exit status: 127
> I0406 18:29:30.871052   520 local_puller.cpp:90] Creating local puller with 
> docker registry '/tmp/3l8ZBv/images'
> I0406 18:29:30.873325   539 metadata_manager.cpp:159] Looking for image 'abc'
> I0406 18:29:30.874438   539 local_puller.cpp:142] Untarring image 'abc' from 
> '/tmp/3l8ZBv/images/abc.tar' to '/tmp/3l8ZBv/store/staging/5tw8bD'
> I0406 18:29:30.901916   547 local_puller.cpp:162] The repositories JSON file 
> for image 'abc' is '{"abc":{"latest":"456"}}'
> I0406 18:29:30.902304   547 local_puller.cpp:290] Extracting layer tar ball 
> '/tmp/3l8ZBv/store/staging/5tw8bD/123/layer.tar to rootfs 
> '/tmp/3l8ZBv/store/staging/5tw8bD/123/rootfs'
> I0406 18:29:30.909144   547 local_puller.cpp:290] Extracting layer tar ball 
> '/tmp/3l8ZBv/store/staging/5tw8bD/456/layer.tar to rootfs 
> '/tmp/3l8ZBv/store/staging/5tw8bD/456/rootfs'
> ../../src/tests/containerizer/provisioner_docker_tests.cpp:183: Failure
> (imageInfo).failure(): Collect failed: Subprocess 'tar, tar, -x, -f, 
> /tmp/3l8ZBv/store/staging/5tw8bD/456/layer.tar, -C, 
> /tmp/3l8ZBv/store/staging/5tw8bD/456/rootfs' failed: tar: This does not look 
> like a tar archive
> tar: Exiting with failure status due to previous errors
> [  FAILED  ] ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar (243 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5192) LinuxFilesystemIsolatorTest.ROOT_MultipleContainers is flaky

2016-04-12 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-5192:
---

Assignee: Gilbert Song

> LinuxFilesystemIsolatorTest.ROOT_MultipleContainers is flaky
> 
>
> Key: MESOS-5192
> URL: https://issues.apache.org/jira/browse/MESOS-5192
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS 7 + SSL
>Reporter: Neil Conway
>Assignee: Gilbert Song
>  Labels: flaky, flaky-test, mesosphere
>
> Observed on internal Mesosphere CI:
> {noformat}
> [11:32:03] :   [Step 11/11] [ RUN  ] 
> LinuxFilesystemIsolatorTest.ROOT_MultipleContainers
> [11:32:09]W:   [Step 11/11] I0412 11:32:09.587877 17436 linux.cpp:81] Making 
> '/tmp/NMWl31' a shared mount
> [11:32:09]W:   [Step 11/11] I0412 11:32:09.603153 17436 
> linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> [11:32:09]W:   [Step 11/11] I0412 11:32:09.604372 17456 
> containerizer.cpp:682] Starting container 
> 'f1f5de4c-aaef-45c7-b28c-83014327eb40' for executor 'test_executor1' of 
> framework ''
> [11:32:09]W:   [Step 11/11] I0412 11:32:09.604940 17450 provisioner.cpp:285] 
> Provisioning image rootfs 
> '/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53'
>  for container f1f5de4c-aaef-45c7-b28c-83014327eb40
> [11:32:09]W:   [Step 11/11] I0412 11:32:09.605561 17454 copy.cpp:128] Copying 
> layer path '/tmp/NMWl31/test_image1' to rootfs 
> '/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53'
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.784813 17450 linux.cpp:355] Bind 
> mounting work directory from 
> '/tmp/NMWl31/slaves/test_slave/frameworks/executors/test_executor1/runs/f1f5de4c-aaef-45c7-b28c-83014327eb40'
>  to 
> '/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53/mnt/mesos/sandbox'
>  for container f1f5de4c-aaef-45c7-b28c-83014327eb40
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.785079 17450 linux.cpp:683] 
> Changing the ownership of the persistent volume at 
> '/tmp/NMWl31/volumes/roles/test_role/persistent_volume_id' with uid 0 and gid > 0
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.790043 17450 linux.cpp:723] 
> Mounting '/tmp/NMWl31/volumes/roles/test_role/persistent_volume_id' to 
> '/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53/mnt/mesos/sandbox/volume'
>  for persistent volume disk(test_role)[persistent_volume_id:volume]:32 of 
> container f1f5de4c-aaef-45c7-b28c-83014327eb40
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.791724 17455 
> linux_launcher.cpp:281] Cloning child process with flags = CLONE_NEWNS
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.797466 17453 
> containerizer.cpp:682] Starting container 
> '2c9bd571-7804-481d-9a42-1df65489dda8' for executor 'test_executor2' of 
> framework ''
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.797947 17453 
> containerizer.cpp:1439] Destroying container 
> 'f1f5de4c-aaef-45c7-b28c-83014327eb40'
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.798017 17457 provisioner.cpp:285] 
> Provisioning image rootfs 
> '/tmp/NMWl31/provisioner/containers/2c9bd571-7804-481d-9a42-1df65489dda8/backends/copy/rootfses/3da54ab5-8871-4913-a9d1-ae658286023a'
>  for container 2c9bd571-7804-481d-9a42-1df65489dda8
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.798683 17451 copy.cpp:128] Copying 
> layer path '/tmp/NMWl31/test_image2' to rootfs 
> '/tmp/NMWl31/provisioner/containers/2c9bd571-7804-481d-9a42-1df65489dda8/backends/copy/rootfses/3da54ab5-8871-4913-a9d1-ae658286023a'
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.802534 17456 cgroups.cpp:2676] 
> Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.804713 17451 cgroups.cpp:1409] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40 after 
> 2.123008ms
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.807065 17450 cgroups.cpp:2694] 
> Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.809074 17450 cgroups.cpp:1438] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40 after 
> 1.96992ms
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.984756 17457 
> containerizer.cpp:1674] Executor for container 
> 'f1f5de4c-aaef-45c7-b28c-83014327eb40' has exited
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.987437 17451 linux.cpp:798] 
> Unmounting volume 
> 

[jira] [Updated] (MESOS-5070) Introduce more flexible subprocess interface for child options.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5070:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Introduce more flexible subprocess interface for child options.
> ---
>
> Key: MESOS-5070
> URL: https://issues.apache.org/jira/browse/MESOS-5070
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>
> We introduced a number of parameters to the subprocess interface with 
> MESOS-5049.
> Adding all options explicitly to the subprocess interface makes it 
> inflexible. 
> We should investigate a more flexible option that still prevents arbitrary 
> code from being executed.
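> 
> A minimal sketch of one flexible direction ({{ChildOption}} and the factory 
> names are hypothetical, not the actual Mesos API): express child options as 
> a whitelist of composable operations instead of ever more named parameters, 
> so callers pick from a fixed set rather than injecting arbitrary code:
> {code}
> #include <functional>
> #include <string>
> #include <vector>
> 
> #include <unistd.h>
> 
> // Each option is a closure run in the child between fork() and exec();
> // a non-zero return aborts the launch.
> using ChildOption = std::function<int()>;
> 
> // Factory functions define the whitelist of allowed operations.
> ChildOption chdirTo(const std::string& directory)
> {
>   return [directory]() { return ::chdir(directory.c_str()); };
> }
> 
> ChildOption newSession()
> {
>   return []() { return ::setsid() == -1 ? -1 : 0; };
> }
> 
> // subprocess() would then accept a std::vector<ChildOption> and run each
> // option, in order, in the forked child.
> {code}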



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4982) Update example long running to use v1 API.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4982:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Update example long running to use v1 API.
> --
>
> Key: MESOS-4982
> URL: https://issues.apache.org/jira/browse/MESOS-4982
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> We need to modify the long-running test framework, similar to 
> {{src/examples/long_lived_framework.cpp}}, to use the v1 API.
> This would allow us to vet the v1 API and the scheduler library in test 
> clusters.
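> 
> For orientation, the rough shape of a v1 framework using the scheduler 
> library is callback-driven (a sketch; signatures are from memory and may 
> not match the tree exactly, and the master address is a placeholder):
> {code}
> #include <queue>
> 
> #include <mesos/v1/scheduler.hpp>
> 
> using mesos::v1::scheduler::Event;
> using mesos::v1::scheduler::Mesos;
> 
> void received(std::queue<Event> events)
> {
>   while (!events.empty()) {
>     Event event = events.front();
>     events.pop();
>     // Dispatch on event.type(): SUBSCRIBED, OFFERS, UPDATE, ...
>   }
> }
> 
> int main()
> {
>   Mesos mesos(
>       "127.0.0.1:5050",
>       mesos::ContentType::PROTOBUF,
>       []() { /* connected: send a SUBSCRIBE call. */ },
>       []() { /* disconnected: back off and re-subscribe. */ },
>       received);
> 
>   // The framework is driven entirely from the callbacks above.
> }
> {code}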



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5028) Copy provisioner cannot replace directory with symlink

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5028:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Copy provisioner cannot replace directory with symlink
> --
>
> Key: MESOS-5028
> URL: https://issues.apache.org/jira/browse/MESOS-5028
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Zhitao Li
>Assignee: Gilbert Song
>
> I'm trying to play with the new image provisioner on our custom docker 
> images, but one of the layers failed to get copied, possibly due to a 
> dangling symlink.
> Error log with Glog_v=1:
> Error log with Glog_v=1:
> {quote}
> I0324 05:42:48.926678 15067 copy.cpp:127] Copying layer path 
> '/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs'
>  to rootfs 
> '/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6'
> E0324 05:42:49.028506 15062 slave.cpp:3773] Container 
> '5f05be6c-c970-4539-aa64-fd0eef2ec7ae' for executor 'test' of framework 
> 75932a89-1514-4011-bafe-beb6a208bb2d-0004 failed to start: Collect failed: 
> Collect failed: Failed to copy layer: cp: cannot overwrite directory 
> ‘/var/lib/mesos/provisioner/containers/5f05be6c-c970-4539-aa64-fd0eef2ec7ae/backends/copy/rootfses/507173f3-e316-48a3-a96e-5fdea9ffe9f6/etc/apt’
>  with non-directory
> {quote}
> The content of 
> _/tmp/mesos/store/docker/layers/5df0888641196b88dcc1b97d04c74839f02a73b8a194a79e134426d6a8fcb0f1/rootfs/etc/apt_
>  points to a non-existing absolute path (cannot provide the exact path, but 
> it's a result of us trying to mount apt keys into the docker container at 
> build time).
> I believe what happened is that we executed a script at build time which 
> contains the equivalent of:
> {quote}
> rm -rf /etc/apt/* && ln -sf /build-mount-point/ /etc/apt
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3781) Replace Master/Slave Terminology Phase I - Add duplicate agent flags

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-3781:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Replace Master/Slave Terminology Phase I - Add duplicate agent flags 
> -
>
> Key: MESOS-3781
> URL: https://issues.apache.org/jira/browse/MESOS-3781
> Project: Mesos
>  Issue Type: Task
>Reporter: Diana Arroyo
>Assignee: Jay Guo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4781) Executor env variables should not be leaked to the command task.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4781:
-
Sprint: Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere Sprint 32, 
Mesosphere Sprint 33  (was: Mesosphere Sprint 30, Mesosphere Sprint 31, 
Mesosphere Sprint 32)

> Executor env variables should not be leaked to the command task.
> 
>
> Key: MESOS-4781
> URL: https://issues.apache.org/jira/browse/MESOS-4781
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> Currently, command task inherits the env variables of the command executor. 
> This is less ideal because the command executor environment variables include 
> some Mesos internal env variables like MESOS_XXX and LIBPROCESS_XXX. Also, 
> this behavior does not match what the Docker containerizer does. We should 
> construct the env variables from scratch for the command task, rather than 
> relying on inheriting the env variables from the command executor.
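> 
> A minimal sketch of "from scratch" construction ({{taskEnvironment}} is a 
> hypothetical helper, not the actual executor code): start from an empty map, 
> add only task-facing variables like {{MESOS_SANDBOX}}, then layer on what 
> the framework explicitly asked for:
> {code}
> #include <map>
> #include <string>
> 
> #include <mesos/mesos.hpp>
> 
> std::map<std::string, std::string> taskEnvironment(
>     const mesos::CommandInfo& command,
>     const std::string& sandboxDirectory)
> {
>   std::map<std::string, std::string> environment;
> 
>   // Nothing is inherited, so MESOS_* / LIBPROCESS_* internals cannot leak.
>   environment["MESOS_SANDBOX"] = sandboxDirectory;
> 
>   for (const mesos::Environment::Variable& variable :
>        command.environment().variables()) {
>     environment[variable.name()] = variable.value();
>   }
> 
>   return environment;
> }
> {code}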



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5071) Refactor the clone option to subprocess.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5071:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Refactor the clone option to subprocess.
> 
>
> Key: MESOS-5071
> URL: https://issues.apache.org/jira/browse/MESOS-5071
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>
> The clone option in subprocess is only used (at least in the Mesos codebase) 
> to specify custom namespace flags to clone. It feels like having the clone 
> function in the subprocess interface is too explicit for this functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4316) Support get non-default weights by /weights

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4316:
-
Sprint: Mesosphere Sprint 31, Mesosphere Sprint 32, Mesosphere Sprint 33  
(was: Mesosphere Sprint 31, Mesosphere Sprint 32)

> Support get non-default weights by /weights
> ---
>
> Key: MESOS-4316
> URL: https://issues.apache.org/jira/browse/MESOS-4316
> Project: Mesos
>  Issue Type: Task
>Reporter: Yongqiao Wang
>Assignee: Yongqiao Wang
>Priority: Minor
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Like /quota, we should also add query logic for /weights to keep things 
> consistent. Then /roles no longer needs to show weight information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5005) Make `ReservationInfo.principal` and `Persistence.principal` equivalent

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5005:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Make `ReservationInfo.principal` and `Persistence.principal` equivalent
> ---
>
> Key: MESOS-5005
> URL: https://issues.apache.org/jira/browse/MESOS-5005
> Project: Mesos
>  Issue Type: Bug
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere, persistent-volumes, reservations
>
> Currently, we require that `ReservationInfo.principal` be equal to the 
> principal provided for authentication, which means that when HTTP 
> authentication is disabled this field cannot be set. Based on comments in 
> 'mesos.proto', the original intention was to enforce this same constraint for 
> `Persistence.principal`, but it seems that we don't enforce it. This should 
> be changed to make the two fields equivalent.
> This means that when HTTP authentication is disabled, requests to '/reserve' 
> cannot set {{ReservationInfo.principal}}, while requests to '/create-volumes' 
> can set any principal in {{Persistence.principal}}. One solution would be to 
> add the constraint to {{Persistence.principal}} when HTTP authentication is 
> enabled, and remove the constraint from {{ReservationInfo.principal}} when 
> HTTP authentication is disabled: this would allow us to track a 
> reserver/creator principal when HTTP authentication is disabled.
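> 
> A minimal sketch of the shared check that would make the two fields 
> equivalent ({{validatePrincipal}} is a hypothetical helper, not the actual 
> validation code):
> {code}
> #include <string>
> 
> #include <stout/error.hpp>
> #include <stout/none.hpp>
> #include <stout/option.hpp>
> 
> Option<Error> validatePrincipal(
>     const std::string& fieldPrincipal,        // e.g. ReservationInfo.principal.
>     const Option<std::string>& authenticated) // None() if auth is disabled.
> {
>   if (authenticated.isSome() && fieldPrincipal != authenticated.get()) {
>     return Error(
>         "Principal '" + fieldPrincipal + "' does not match the "
>         "authenticated principal '" + authenticated.get() + "'");
>   }
> 
>   return None();
> }
> {code}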



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4763) Add test mock for CNI plugins.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4763:
-
Sprint: Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere Sprint 32, 
Mesosphere Sprint 33  (was: Mesosphere Sprint 30, Mesosphere Sprint 31, 
Mesosphere Sprint 32)

> Add test mock for CNI plugins.
> --
>
> Key: MESOS-4763
> URL: https://issues.apache.org/jira/browse/MESOS-4763
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Qian Zhang
>  Labels: mesosphere
>
> In order to test the network/cni isolator, we need to mock the behavior of a 
> CNI plugin. One option is to write a mock script which acts as a CNI plugin. 
> The isolator will talk to the mock script the same way it talks to an actual 
> CNI plugin.
> The mock script can just join the host network?
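> 
> A minimal sketch of such a mock, written against the CNI contract (the 
> plugin receives {{CNI_COMMAND}} in its environment and the network config 
> on stdin, and replies with a JSON result on stdout; the address below is a 
> canned placeholder for "just join the host network"):
> {code}
> #include <cstdlib>
> #include <iostream>
> #include <string>
> 
> int main()
> {
>   const char* command = ::getenv("CNI_COMMAND");
>   if (command == nullptr) {
>     std::cerr << "CNI_COMMAND is not set" << std::endl;
>     return 1;
>   }
> 
>   if (std::string(command) == "ADD") {
>     // A canned result; a real plugin would allocate an address.
>     std::cout << "{\"ip4\": {\"ip\": \"127.0.0.1/8\"}}" << std::endl;
>   }
> 
>   // For DEL there is nothing to clean up in the mock.
>   return 0;
> }
> {code}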



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4233:
-
Sprint: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, 
Mesosphere Sprint 29, Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere 
Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 26, Mesosphere Sprint 
27, Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30, 
Mesosphere Sprint 31, Mesosphere Sprint 32)

> Logging is too verbose for sysadmins / syslog
> -
>
> Key: MESOS-4233
> URL: https://issues.apache.org/jira/browse/MESOS-4233
> Project: Mesos
>  Issue Type: Epic
>Reporter: Cody Maloney
>Assignee: Kapil Arya
>  Labels: mesosphere
> Attachments: giant_port_range_logging
>
>
> Currently mesos logs a lot. When launching a thousand tasks in the space of 
> 10 seconds it will print tens of thousands of log lines, overwhelming syslog 
> (there is a max rate at which a process can send stuff over a unix socket) 
> and not giving useful information to a sysadmin who cares about just the 
> high-level activity and when something goes wrong.
> Note mesos also blocks writing to its log locations, so when writing a lot of 
> log messages, it can fill up the write buffer in the kernel, and be suspended 
> until the syslog agent catches up reading from the socket (GLOG does a 
> blocking fwrite to stderr). GLOG also has a big mutex around logging so only 
> one thing logs at a time.
> While for "internal debugging" it is useful to see things like "message went 
> from internal compoent x to internal component y", from a sysadmin 
> perspective I only care about the high level actions taken (launched task for 
> framework x), sent offer to framework y, got task failed from host z. Note 
> those are what I'd expect at the "INFO" level. At the "WARNING" level I'd 
> expect very little to be logged / almost nothing in normal operation. Just 
> things like "WARN: Repliacted log write took longer than expected". WARN 
> would also get things like backtraces on crashes and abnormal exits / abort.
> When trying to launch 3k+ tasks inside a second, mesos logging currently 
> overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
> Sysadmins expect to be able to use syslog to monitor basic events in their 
> system. This is too much.
> We can keep logging the messages to files, but the logging to stderr needs to 
> be reduced significantly (stderr gets picked up and forwarded to syslog / 
> central aggregation).
> What I would like is to be able to set the stderr logging level to be 
> different / independent from the file logging level (syslog giving the 
> "sysadmin" aggregated overview, files useful for debugging in depth what 
> happened in a cluster); a sketch of this split is below the list. A lot of 
> what mesos currently logs at info is really debugging info / should show up 
> at the debug log level.
> Some samples of mesos logging a lot more than a sysadmin would want / expect 
> are attached, and some are below:
>  - Every task gets printed multiple times for a basic launch:
> {noformat}
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382644  1315 master.cpp:3248] Launching task 
> envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 
> 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon)
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382925  1315 master.hpp:176] Adding task 
> envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; 
> mem(*):16; ports(*):[14047-14047]
> {noformat}
>  - Every task status update prints many log lines; successful ones are part 
> of normal operation and maybe should be logged at info / debug levels, but 
> not to a sysadmin (just show when things fail, and maybe aggregate counters 
> to convey the volume of work)
>  - No log messages should be really big / more than 1k characters (would 
> prevent the giant port list attached, and make that easily discoverable / 
> bug-filable / fixable)
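> 
> For reference, plain glog can already express this split: keep everything in 
> the log files but only copy WARNING and above to stderr (which is what the 
> syslog aggregation picks up). A minimal sketch, assuming stock glog flags; 
> whether this fits Mesos' logging setup is a separate question:
> {code}
> #include <glog/logging.h>
> 
> int main(int argc, char** argv)
> {
>   FLAGS_minloglevel = google::INFO;        // Files keep INFO and above.
>   FLAGS_stderrthreshold = google::WARNING; // stderr (syslog) gets WARN+.
> 
>   google::InitGoogleLogging(argv[0]);
> 
>   LOG(INFO) << "written to log files only";
>   LOG(WARNING) << "written to log files and stderr";
> 
>   return 0;
> }
> {code}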



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5071) Refactor the clone option to subprocess.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5071:
-
Sprint: Mesosphere Sprint 32  (was: Mesosphere Sprint 32, Mesosphere Sprint 
33)

> Refactor the clone option to subprocess.
> 
>
> Key: MESOS-5071
> URL: https://issues.apache.org/jira/browse/MESOS-5071
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>
> The clone option in subprocess is only used (at least in the Mesos codebase) 
> to specify custom namespace flags to clone. It feels like having the clone 
> function in the subprocess interface is too explicit for this functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5070) Introduce more flexible subprocess interface for child options.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5070:
-
Sprint: Mesosphere Sprint 32  (was: Mesosphere Sprint 32, Mesosphere Sprint 
33)

> Introduce more flexible subprocess interface for child options.
> ---
>
> Key: MESOS-5070
> URL: https://issues.apache.org/jira/browse/MESOS-5070
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>
> We introduced a number of parameters to the subprocess interface with 
> MESOS-5049.
> Adding all options explicitly to the subprocess interface makes it 
> inflexible. 
> We should investigate a more flexible option that still prevents arbitrary 
> code from being executed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5139) ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar is flaky

2016-04-12 Thread Artem Harutyunyan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237475#comment-15237475
 ] 

Artem Harutyunyan commented on MESOS-5139:
--

[~gilbert] please make sure this ticket has a shepherd.

> ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar is flaky
> --
>
> Key: MESOS-5139
> URL: https://issues.apache.org/jira/browse/MESOS-5139
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0
> Environment: Ubuntu14.04
>Reporter: Vinod Kone
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> Found this on ASF CI while testing 0.28.1-rc2
> {code}
> [ RUN  ] ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar
> E0406 18:29:30.870481   520 shell.hpp:93] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> E0406 18:29:30.870576   520 fetcher.cpp:59] Failed to create URI fetcher 
> plugin 'hadoop': Failed to create HDFS client: Failed to execute 'hadoop 
> version 2>&1'; the command was either not found or exited with a non-zero 
> exit status: 127
> I0406 18:29:30.871052   520 local_puller.cpp:90] Creating local puller with 
> docker registry '/tmp/3l8ZBv/images'
> I0406 18:29:30.873325   539 metadata_manager.cpp:159] Looking for image 'abc'
> I0406 18:29:30.874438   539 local_puller.cpp:142] Untarring image 'abc' from 
> '/tmp/3l8ZBv/images/abc.tar' to '/tmp/3l8ZBv/store/staging/5tw8bD'
> I0406 18:29:30.901916   547 local_puller.cpp:162] The repositories JSON file 
> for image 'abc' is '{"abc":{"latest":"456"}}'
> I0406 18:29:30.902304   547 local_puller.cpp:290] Extracting layer tar ball 
> '/tmp/3l8ZBv/store/staging/5tw8bD/123/layer.tar to rootfs 
> '/tmp/3l8ZBv/store/staging/5tw8bD/123/rootfs'
> I0406 18:29:30.909144   547 local_puller.cpp:290] Extracting layer tar ball 
> '/tmp/3l8ZBv/store/staging/5tw8bD/456/layer.tar to rootfs 
> '/tmp/3l8ZBv/store/staging/5tw8bD/456/rootfs'
> ../../src/tests/containerizer/provisioner_docker_tests.cpp:183: Failure
> (imageInfo).failure(): Collect failed: Subprocess 'tar, tar, -x, -f, 
> /tmp/3l8ZBv/store/staging/5tw8bD/456/layer.tar, -C, 
> /tmp/3l8ZBv/store/staging/5tw8bD/456/rootfs' failed: tar: This does not look 
> like a tar archive
> tar: Exiting with failure status due to previous errors
> [  FAILED  ] ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar (243 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5173) Allow master/agent to take multiple --modules flags

2016-04-12 Thread Artem Harutyunyan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237476#comment-15237476
 ] 

Artem Harutyunyan commented on MESOS-5173:
--

[~karya] please make sure this ticket has a shepherd.

> Allow master/agent to take multiple --modules flags
> ---
>
> Key: MESOS-5173
> URL: https://issues.apache.org/jira/browse/MESOS-5173
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> When loading multiple modules into the master/agent, one has to merge all 
> module metadata (library name, module name, parameters, etc.) into a single 
> JSON file which is then passed to the --modules flag. This quickly becomes 
> cumbersome, especially if the modules come from different vendors/developers.
> An alternative would be to allow multiple invocations of the --modules flag 
> that can then be passed on to the module manager. That way, each flag 
> corresponds to just one module library and the modules from that library.
> Another approach is to create a new flag (e.g., --modules-dir) that contains 
> a path to a directory holding multiple JSON files; one can think of it as 
> analogous to systemd units. The operator drops a new file into this 
> directory and it is automatically picked up by the master/agent module 
> manager. Further, a naming scheme can be adopted that prefixes the filename 
> with "NN_" to signify load order (see the illustration below).
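> 
> To illustrate the directory approach, a hypothetical drop-in file (say 
> /etc/mesos/modules.d/10_isolator.json; the library and module names are 
> made up) would carry just one library's metadata in the existing --modules 
> JSON format:
> {code}
> {
>   "libraries": [
>     {
>       "file": "/usr/lib/libexample_isolator.so",
>       "modules": [
>         {
>           "name": "org_example_isolator",
>           "parameters": [
>             {"key": "timeout", "value": "10secs"}
>           ]
>         }
>       ]
>     }
>   ]
> }
> {code}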



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4785) Reorganize ACL subject/object descriptions

2016-04-12 Thread Artem Harutyunyan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237474#comment-15237474
 ] 

Artem Harutyunyan commented on MESOS-4785:
--

[~arojas] please make sure this ticket has a shepherd.

> Reorganize ACL subject/object descriptions
> --
>
> Key: MESOS-4785
> URL: https://issues.apache.org/jira/browse/MESOS-4785
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Alexander Rojas
>  Labels: documentation, mesosphere, security
> Fix For: 0.29.0
>
>
> The authorization documentation would benefit from a reorganization of the 
> ACL subject/object descriptions. Instead of simple lists of the available 
> subjects and objects, it would be nice to see a table showing which subject 
> and object are used with each action.
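> 
> For example, such a table could start like this (rows taken from the 
> existing ACL actions; not exhaustive):
> ||Action||Subject||Object||
> |register_frameworks|principals|roles|
> |run_tasks|principals|users|
> |reserve_resources|principals|roles|
> |create_volumes|principals|roles|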



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4938) Support docker registry authentication

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4938:
-
Sprint: Mesosphere Sprint 31, Mesosphere Sprint 32, Mesosphere Sprint 33  
(was: Mesosphere Sprint 31, Mesosphere Sprint 32)

> Support docker registry authentication
> --
>
> Key: MESOS-4938
> URL: https://issues.apache.org/jira/browse/MESOS-4938
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Gilbert Song
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5130) Enable `network/cni` isolator in `MesosContainerizer` as the default `network` isolator.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5130:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Enable `network/cni` isolator in `MesosContainerizer` as the default 
> `network` isolator.
> 
>
> Key: MESOS-5130
> URL: https://issues.apache.org/jira/browse/MESOS-5130
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> Currently there are no default `network` isolators for `MesosContainerizer`. 
> With the development of the `network/cni` isolator we have an interface to 
> run Mesos on a multitude of IP networks. Given that it's based on an open 
> standard (the CNI spec) which is gathering a lot of traction from vendors 
> (Calico, Weave, CoreOS) and already works on some default networks (bridge, 
> ipvlan, macvlan), it makes sense to make it the default network isolator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4922) Setup proper /etc/hostname, /etc/hosts and /etc/resolv.conf for containers in network/cni isolator.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4922:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Setup proper /etc/hostname, /etc/hosts and /etc/resolv.conf for containers in 
> network/cni isolator.
> ---
>
> Key: MESOS-4922
> URL: https://issues.apache.org/jira/browse/MESOS-4922
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Qian Zhang
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> The network/cni isolator needs to properly set up /etc/hostname and 
> /etc/hosts for the container with a hostname (e.g., randomly generated) and 
> the assigned IP returned by the CNI plugin.
> We should consider the following cases (an illustration follows the list):
> 1) container is using host filesystem
> 2) container is using a different filesystem
> 3) custom executor and command executor
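> 
> As an illustration (the generated hostname and assigned IP below are 
> placeholders), the files written into the container would take roughly this 
> shape:
> {noformat}
> # /etc/hostname
> 3b7f8d29c1a4
> 
> # /etc/hosts
> 127.0.0.1    localhost
> 192.0.2.10   3b7f8d29c1a4
> {noformat}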



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4544) Propose design doc for agent partitioning behavior

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4544:
-
Sprint: Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30, 
Mesosphere Sprint 31, Mesosphere Sprint 32, Mesosphere Sprint 33  (was: 
Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30, Mesosphere 
Sprint 31, Mesosphere Sprint 32)

> Propose design doc for agent partitioning behavior
> --
>
> Key: MESOS-4544
> URL: https://issues.apache.org/jira/browse/MESOS-4544
> Project: Mesos
>  Issue Type: Task
>  Components: general
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5051) Create helpers for manipulating Linux capabilities.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5051:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Create helpers for manipulating Linux capabilities.
> ---
>
> Key: MESOS-5051
> URL: https://issues.apache.org/jira/browse/MESOS-5051
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jojy Varghese
>  Labels: mesosphere
>
> These helpers can either be based on an existing library (e.g. libcap) or 
> use system calls directly.
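> 
> A minimal sketch of the libcap route: drop a capability from the process' 
> effective set (real libcap calls; error handling mostly elided):
> {code}
> #include <sys/capability.h>
> 
> int dropNetAdmin()
> {
>   cap_t caps = cap_get_proc(); // Snapshot the current capability sets.
>   if (caps == nullptr) {
>     return -1;
>   }
> 
>   cap_value_t value = CAP_NET_ADMIN;
>   cap_set_flag(caps, CAP_EFFECTIVE, 1, &value, CAP_CLEAR);
> 
>   int result = cap_set_proc(caps); // Apply the modified sets.
>   cap_free(caps);
>   return result;
> }
> {code}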



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4053:
-
Sprint: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 31, 
Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 26, 
Mesosphere Sprint 27, Mesosphere Sprint 31, Mesosphere Sprint 32)

> MemoryPressureMesosTest tests fail on CentOS 6.6
> 
>
> Key: MESOS-4053
> URL: https://issues.apache.org/jira/browse/MESOS-4053
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS 6.6
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere, test-failure
>
> {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and 
> {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It 
> seems that mounted cgroups are not properly cleaned up after previous tests, 
> so multiple hierarchies are detected and thus an error is produced:
> {code}
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> ../../src/tests/mesos.cpp:849: Failure
> Value of: _baseHierarchy.get()
>   Actual: "/cgroup"
> Expected: baseHierarchy
> Which is: "/tmp/mesos_test_cgroup"
> -
> Multiple cgroups base hierarchies detected:
>   '/tmp/mesos_test_cgroup'
>   '/cgroup'
> Mesos does not support multiple cgroups base hierarchies.
> Please unmount the corresponding (or all) subsystems.
> -
> ../../src/tests/mesos.cpp:932: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup 
> '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms)
> [ RUN  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
> ../../src/tests/mesos.cpp:849: Failure
> Value of: _baseHierarchy.get()
>   Actual: "/cgroup"
> Expected: baseHierarchy
> Which is: "/tmp/mesos_test_cgroup"
> -
> Multiple cgroups base hierarchies detected:
>   '/tmp/mesos_test_cgroup'
>   '/cgroup'
> Mesos does not support multiple cgroups base hierarchies.
> Please unmount the corresponding (or all) subsystems.
> -
> ../../src/tests/mesos.cpp:932: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup 
> '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy
> [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5064) Remove default value for the agent `work_dir`

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5064:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Remove default value for the agent `work_dir`
> -
>
> Key: MESOS-5064
> URL: https://issues.apache.org/jira/browse/MESOS-5064
> Project: Mesos
>  Issue Type: Bug
>Reporter: Artem Harutyunyan
>Assignee: Greg Mann
>
> Following a crash report from a user, we need to be more explicit about the 
> dangers of using {{/tmp}} as the agent {{work_dir}}. In addition, we can remove 
> the default value for the {{\-\-work_dir}} flag, forcing users to explicitly 
> set the work directory for the agent.
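> 
> A minimal sketch with stout's flags of what removing the default looks 
> like: declare {{work_dir}} as an {{Option}} so startup can fail loudly when 
> it is unset, instead of silently landing in {{/tmp}}:
> {code}
> #include <string>
> 
> #include <stout/flags.hpp>
> #include <stout/option.hpp>
> 
> struct Flags : flags::FlagsBase
> {
>   Flags()
>   {
>     // No default value: the operator must pass --work_dir explicitly.
>     add(&Flags::work_dir,
>         "work_dir",
>         "Directory path to place framework work directories");
>   }
> 
>   Option<std::string> work_dir;
> };
> 
> // At startup, agent code would then reject a missing value, e.g.:
> //   if (flags.work_dir.isNone()) { /* exit with a clear error */ }
> {code}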



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4932) Propose Design for Authorization based filtering for endpoints.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4932:
-
Sprint: Mesosphere Sprint 31, Mesosphere Sprint 32, Mesosphere Sprint 33  
(was: Mesosphere Sprint 31, Mesosphere Sprint 32)

> Propose Design for Authorization based filtering for endpoints.
> ---
>
> Key: MESOS-4932
> URL: https://issues.apache.org/jira/browse/MESOS-4932
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: authorization, mesosphere, security
> Fix For: 0.29.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4766) Improve allocator performance.

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4766:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Improve allocator performance.
> --
>
> Key: MESOS-4766
> URL: https://issues.apache.org/jira/browse/MESOS-4766
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>Priority: Critical
>
> This is an epic to track the various tickets around improving the performance 
> of the allocator, including the following:
> * Preventing unnecessary backup of the allocator.
> * Reducing the cost of allocations and allocator state updates.
> * Improving performance of the DRF sorter.
> * More benchmarking to simulate scenarios with performance issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5142) Add agent flags for HTTP authorization

2016-04-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5142:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33  (was: Mesosphere Sprint 
32)

> Add agent flags for HTTP authorization
> --
>
> Key: MESOS-5142
> URL: https://issues.apache.org/jira/browse/MESOS-5142
> Project: Mesos
>  Issue Type: Bug
>  Components: security, slave
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> Flags should be added to the agent to (proposed usage sketched after this list):
> 1. Enable authorization ({{--authorizers}})
> 2. Provide ACLs ({{--acls}})
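> 
> The proposed usage would mirror the master's existing flags, along these 
> lines (illustrative only; the agent flags do not exist yet, and the paths 
> are made up):
> {noformat}
> mesos-agent --master=zk://host:2181/mesos \
>   --authorizers=local \
>   --acls=file:///etc/mesos/acls.json
> {noformat}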



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4175) ContentType/SchedulerTest.Decline is slow.

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4175:
---
  Sprint: Mesosphere Sprint 33
Story Points: 1

> ContentType/SchedulerTest.Decline is slow.
> --
>
> Key: MESOS-4175
> URL: https://issues.apache.org/jira/browse/MESOS-4175
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: Shuai Lin
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
>
> The {{ContentType/SchedulerTest.Decline}} test takes more than {{1s}} to 
> finish on my Mac OS 10.10.4:
> {code}
> ContentType/SchedulerTest.Decline/0 (1022 ms)
> {code}
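> 
> The usual cure for tests that sit out a wall-clock delay is libprocess' 
> test clock. A sketch of the pattern (not the concrete fix for this test):
> {code}
> #include <process/clock.hpp>
> 
> #include <stout/duration.hpp>
> 
> // Advance past a timeout without actually sleeping through it.
> void advancePastTimeout(const Duration& timeout)
> {
>   process::Clock::pause();
>   process::Clock::advance(timeout); // Fires pending timers immediately.
>   process::Clock::settle();         // Lets the resulting events drain.
>   process::Clock::resume();
> }
> {code}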



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4174) HookTest.VerifySlaveLaunchExecutorHook is slow.

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4174:
---
Shepherd: Alexander Rukletsov  (was: Timothy Chen)
  Sprint: Mesosphere Sprint 33
Story Points: 1
 Summary: HookTest.VerifySlaveLaunchExecutorHook is slow.  (was: 
HookTest.VerifySlaveLaunchExecutorHook is slow)

> HookTest.VerifySlaveLaunchExecutorHook is slow.
> ---
>
> Key: MESOS-4174
> URL: https://issues.apache.org/jira/browse/MESOS-4174
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: Jian Qiu
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
> Fix For: 0.29.0
>
>
> The {{HookTest.VerifySlaveLaunchExecutorHook}} test takes more than {{5s}} to 
> finish on my Mac OS 10.10.4:
> {code}
> HookTest.VerifySlaveLaunchExecutorHook (5061 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4171) OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover is slow.

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4171:
---
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 33
Story Points: 1
 Summary: OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover is 
slow.  (was: OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover is slow)

> OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover is slow.
> ---
>
> Key: MESOS-4171
> URL: https://issues.apache.org/jira/browse/MESOS-4171
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
> Fix For: 0.29.0
>
>
> The {{OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover}} test takes 
> more than {{1s}} to finish on my Mac OS 10.10.4:
> {code}
> OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover (1018 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4643) PortMappingIsolatorTest fail when no namespaces are set.

2016-04-12 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237379#comment-15237379
 ] 

Jan Schlicht commented on MESOS-4643:
-

Since {{eaecce08a7ce0d72537512cbfccb77e545bb10b8}} started using 
{{os::realpath}}, the tests will crash if {{/var/run/netns}} does not exist.

> PortMappingIsolatorTest fail when no namespaces are set.
> 
>
> Key: MESOS-4643
> URL: https://issues.apache.org/jira/browse/MESOS-4643
> Project: Mesos
>  Issue Type: Bug
> Environment: Linux Kernel 3.19.0-49-generic,
> libnl-3.2.27
>Reporter: Till Toenshoff
>Priority: Minor
>
> Currently our network isolator tests fail with the following output on an 
> Ubuntu 14.04 VM.
> {noformat}
> [02:10:15][Step 8/8] [ RUN  ] 
> PortMappingIsolatorTest.ROOT_NC_ContainerToContainerTCP
> [02:10:15][Step 8/8] 
> ../../src/tests/containerizer/port_mapping_tests.cpp:164: Failure
> [02:10:15][Step 8/8] entries: Failed to opendir '/var/run/netns': No such 
> file or directory
> [02:10:15][Step 8/8] 
> ../../src/tests/containerizer/port_mapping_tests.cpp:164: Failure
> [02:10:15][Step 8/8] entries: Failed to opendir '/var/run/netns': No such 
> file or directory
> [02:10:15][Step 8/8] [  FAILED  ] 
> PortMappingIsolatorTest.ROOT_NC_ContainerToContainerTCP (4 ms)
> {noformat}
> The machine has no network namespaces set up, hence {{/var/run/netns}} does 
> not exist. 
> We should help users understand this prerequisite, or maybe even encode the 
> check in a fixture (see the sketch below).
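> 
> A minimal sketch of such a fixture guard (hypothetical, using stout's 
> {{os::exists}}): check the prerequisite up front and fail with a message 
> that explains it:
> {code}
> #include <gtest/gtest.h>
> 
> #include <stout/os/exists.hpp>
> 
> class PortMappingTestFixture : public ::testing::Test
> {
> protected:
>   void SetUp() override
>   {
>     if (!os::exists("/var/run/netns")) {
>       FAIL() << "/var/run/netns does not exist; create a network "
>              << "namespace (e.g. 'ip netns add test') before running "
>              << "these tests";
>     }
>   }
> };
> {code}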



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4170) OversubscriptionTest.UpdateAllocatorOnSchedulerFailover is slow.

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4170:
---
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 33
Story Points: 1
 Summary: OversubscriptionTest.UpdateAllocatorOnSchedulerFailover is 
slow.  (was: OversubscriptionTest.UpdateAllocatorOnSchedulerFailover is slow)

> OversubscriptionTest.UpdateAllocatorOnSchedulerFailover is slow.
> 
>
> Key: MESOS-4170
> URL: https://issues.apache.org/jira/browse/MESOS-4170
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
>
> The {{OversubscriptionTest.UpdateAllocatorOnSchedulerFailover}} test takes 
> more than {{1s}} to finish on my Mac OS 10.10.4:
> {code}
> OversubscriptionTest.UpdateAllocatorOnSchedulerFailover (1018 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4167) MasterTest.OfferTimeout is slow.

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4167:
---
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 33
Story Points: 1
 Summary: MasterTest.OfferTimeout is slow.  (was: 
MasterTest.OfferTimeout is slow)

> MasterTest.OfferTimeout is slow.
> 
>
> Key: MESOS-4167
> URL: https://issues.apache.org/jira/browse/MESOS-4167
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
> Fix For: 0.29.0
>
>
> The {{MasterTest.OfferTimeout}} test takes more than {{1s}} to finish on my 
> Mac OS 10.10.4:
> {code}
> MasterTest.OfferTimeout (1053 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4166) MasterTest.LaunchCombinedOfferTest is slow

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4166:
---
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 33
Story Points: 1

> MasterTest.LaunchCombinedOfferTest is slow
> --
>
> Key: MESOS-4166
> URL: https://issues.apache.org/jira/browse/MESOS-4166
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
> Fix For: 0.29.0
>
>
> The {{MasterTest.LaunchCombinedOfferTest}} test takes more than {{2s}} to 
> finish on my Mac OS 10.10.4:
> {code}
> MasterTest.LaunchCombinedOfferTest (2023 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4166) MasterTest.LaunchCombinedOfferTest is slow.

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4166:
---
Summary: MasterTest.LaunchCombinedOfferTest is slow.  (was: 
MasterTest.LaunchCombinedOfferTest is slow)

> MasterTest.LaunchCombinedOfferTest is slow.
> ---
>
> Key: MESOS-4166
> URL: https://issues.apache.org/jira/browse/MESOS-4166
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
> Fix For: 0.29.0
>
>
> The {{MasterTest.LaunchCombinedOfferTest}} test takes more than {{2s}} to 
> finish on my Mac OS 10.10.4:
> {code}
> MasterTest.LaunchCombinedOfferTest (2023 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4165) MasterTest.MasterInfoOnReElection is slow.

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4165:
---
Story Points: 1

> MasterTest.MasterInfoOnReElection is slow.
> --
>
> Key: MESOS-4165
> URL: https://issues.apache.org/jira/browse/MESOS-4165
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
>
> The {{MasterTest.MasterInfoOnReElection}} test takes more than {{1s}} to 
> finish on my Mac OS 10.10.4:
> {code}
> MasterTest.MasterInfoOnReElection (1024 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4164) MasterTest.RecoverResources is slow.

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4164:
---
  Sprint: Mesosphere Sprint 33
Story Points: 1

> MasterTest.RecoverResources is slow.
> 
>
> Key: MESOS-4164
> URL: https://issues.apache.org/jira/browse/MESOS-4164
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
> Fix For: 0.29.0
>
>
> The {{MasterTest.RecoverResources}} test takes more than {{1s}} to finish on 
> my Mac OS 10.10.4:
> {code}
> MasterTest.RecoverResources (1018 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4160) Log recover tests are slow.

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4160:
---
  Sprint: Mesosphere Sprint 33
Story Points: 1
 Summary: Log recover tests are slow.  (was: Log recover tests are slow)

> Log recover tests are slow.
> ---
>
> Key: MESOS-4160
> URL: https://issues.apache.org/jira/browse/MESOS-4160
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: Shuai Lin
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
>
> On Mac OS 10.10.4, some tests take longer than {{1s}} to finish:
> {code}
> RecoverTest.AutoInitialization (1003 ms)
> RecoverTest.AutoInitializationRetry (1000 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3775) MasterAllocatorTest.SlaveLost is slow.

2016-04-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3775:
---
  Sprint: Mesosphere Sprint 33
Story Points: 1

> MasterAllocatorTest.SlaveLost is slow.
> --
>
> Key: MESOS-3775
> URL: https://issues.apache.org/jira/browse/MESOS-3775
> Project: Mesos
>  Issue Type: Improvement
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Assignee: Shuai Lin
>Priority: Minor
>  Labels: mesosphere, newbie++, tech-debt
>
> The {{MasterAllocatorTest.SlaveLost}} test takes more than {{5s}} to complete. A 
> brief look into the code hints that the stopped agent does not quit 
> immediately (and hence its resources are not released by the allocator) 
> because [it waits for the executor to 
> terminate|https://github.com/apache/mesos/blob/master/src/tests/master_allocator_tests.cpp#L717].
>  {{5s}} timeout comes from {{EXECUTOR_SHUTDOWN_GRACE_PERIOD}} agent constant.
> Possible solutions:
> * Do not wait until the stopped agent quits (can be flaky, needs deeper 
> analysis).
> * Decrease the agent's {{executor_shutdown_grace_period}} flag (see the 
> sketch below).
> * Terminate the executor faster (this may require some refactoring since the 
> executor driver is created in the {{TestContainerizer}} and we do not have 
> direct access to it).
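> 
> A minimal sketch of the second option, as a fragment from the test body 
> (assuming the usual {{CreateSlaveFlags}}/{{StartSlave}} test helpers):
> {code}
> slave::Flags flags = CreateSlaveFlags();
> 
> // Shrink the 5s default so the stopped agent's resources are
> // released almost immediately.
> flags.executor_shutdown_grace_period = Milliseconds(50);
> 
> // `detector` comes from the master started earlier in the test.
> Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), flags);
> ASSERT_SOME(slave);
> {code}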



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5167) Add tests for `network/cni` isolator

2016-04-12 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237322#comment-15237322
 ] 

Qian Zhang commented on MESOS-5167:
---

Review chain:
https://reviews.apache.org/r/46096/
https://reviews.apache.org/r/46097/

> Add tests for `network/cni` isolator
> 
>
> Key: MESOS-5167
> URL: https://issues.apache.org/jira/browse/MESOS-5167
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> We need to add tests to verify the functionality of the `network/cni` isolator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1781) CpuIsolatorTest/0.UserCpuUsage is flaky

2016-04-12 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237286#comment-15237286
 ] 

Benjamin Bannier commented on MESOS-1781:
-

Still an issue as of {{371072e}}.

> CpuIsolatorTest/0.UserCpuUsage is flaky
> ---
>
> Key: MESOS-1781
> URL: https://issues.apache.org/jira/browse/MESOS-1781
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.20.0
> Environment: fedora-19-clang jenkins VM
>Reporter: Yan Xu
>  Labels: flaky, mesosphere
>
> {noformat:title=}
> [ RUN  ] CpuIsolatorTest/0.UserCpuUsage
> Using temporary directory '/tmp/CpuIsolatorTest_0_UserCpuUsage_AGvLD7'
> I0909 07:43:18.813490 23253 launcher.cpp:137] Forked child with pid '25840' 
> for container 'user_cpu_usage'
> tests/isolator_tests.cpp:217: Failure
> Expected: (0.125) <= (statistics.cpus_user_time_secs()), actual: 0.125 vs 0
> 2014-09-09 
> 07:43:21,335:23253(0x7f6b7503f700):ZOO_WARN@zookeeper_interest@1557: Exceeded 
> deadline by 1205ms
> 2014-09-09 
> 07:43:21,340:23253(0x7f6b7503f700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:36224] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> [  FAILED  ] CpuIsolatorTest/0.UserCpuUsage, where TypeParam = 
> mesos::internal::slave::PosixCpuIsolatorProcess (3848 ms)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4785) Reorganize ACL subject/object descriptions

2016-04-12 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237287#comment-15237287
 ] 

Alexander Rojas commented on MESOS-4785:


I've been looking into this, and my feeling is that we might need a whole 
rewrite of the document. The main problem is that the document treats the 
authorization interface and the local authorizer as one, but since they are 
now separate entities, they probably need a new document.

I will be working on that.

> Reorganize ACL subject/object descriptions
> --
>
> Key: MESOS-4785
> URL: https://issues.apache.org/jira/browse/MESOS-4785
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Alexander Rojas
>  Labels: documentation, mesosphere, security
> Fix For: 0.29.0
>
>
> The authorization documentation would benefit from a reorganization of the 
> ACL subject/object descriptions. Instead of simple lists of the available 
> subjects and objects, it would be nice to see a table showing which subject 
> and object are used with each action.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5192) LinuxFilesystemIsolatorTest.ROOT_MultipleContainers is flaky

2016-04-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5192:
---
Environment: CentOS 7 + SSL

> LinuxFilesystemIsolatorTest.ROOT_MultipleContainers is flaky
> 
>
> Key: MESOS-5192
> URL: https://issues.apache.org/jira/browse/MESOS-5192
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS 7 + SSL
>Reporter: Neil Conway
>  Labels: flaky, flaky-test, mesosphere
>
> Observed on internal Mesosphere CI:
> {noformat}
> [11:32:03] :   [Step 11/11] [ RUN  ] 
> LinuxFilesystemIsolatorTest.ROOT_MultipleContainers
> [11:32:09]W:   [Step 11/11] I0412 11:32:09.587877 17436 linux.cpp:81] Making 
> '/tmp/NMWl31' a shared mount
> [11:32:09]W:   [Step 11/11] I0412 11:32:09.603153 17436 
> linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> [11:32:09]W:   [Step 11/11] I0412 11:32:09.604372 17456 
> containerizer.cpp:682] Starting container 
> 'f1f5de4c-aaef-45c7-b28c-83014327eb40' for executor 'test_executor1' of 
> framework ''
> [11:32:09]W:   [Step 11/11] I0412 11:32:09.604940 17450 provisioner.cpp:285] 
> Provisioning image rootfs 
> '/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53'
>  for container f1f5de4c-aaef-45c7-b28c-83014327eb40
> [11:32:09]W:   [Step 11/11] I0412 11:32:09.605561 17454 copy.cpp:128] Copying 
> layer path '/tmp/NMWl31/test_image1' to rootfs 
> '/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53'
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.784813 17450 linux.cpp:355] Bind 
> mounting work directory from 
> '/tmp/NMWl31/slaves/test_slave/frameworks/executors/test_executor1/runs/f1f5de4c-aaef-45c7-b28c-83014327eb40'
>  to 
> '/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53/mnt/mesos/sandbox'
>  for container f1f5de4c-aaef-45c7-b28c-83014327eb40
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.785079 17450 linux.cpp:683] 
> Changing the ownership of the persistent volume at 
> '/tmp/NMWl31/volumes/roles/test_role/persistent_volume_id' with uid 0 and gid 0
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.790043 17450 linux.cpp:723] 
> Mounting '/tmp/NMWl31/volumes/roles/test_role/persistent_volume_id' to 
> '/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53/mnt/mesos/sandbox/volume'
>  for persistent volume disk(test_role)[persistent_volume_id:volume]:32 of 
> container f1f5de4c-aaef-45c7-b28c-83014327eb40
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.791724 17455 
> linux_launcher.cpp:281] Cloning child process with flags = CLONE_NEWNS
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.797466 17453 
> containerizer.cpp:682] Starting container 
> '2c9bd571-7804-481d-9a42-1df65489dda8' for executor 'test_executor2' of 
> framework ''
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.797947 17453 
> containerizer.cpp:1439] Destroying container 
> 'f1f5de4c-aaef-45c7-b28c-83014327eb40'
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.798017 17457 provisioner.cpp:285] 
> Provisioning image rootfs 
> '/tmp/NMWl31/provisioner/containers/2c9bd571-7804-481d-9a42-1df65489dda8/backends/copy/rootfses/3da54ab5-8871-4913-a9d1-ae658286023a'
>  for container 2c9bd571-7804-481d-9a42-1df65489dda8
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.798683 17451 copy.cpp:128] Copying 
> layer path '/tmp/NMWl31/test_image2' to rootfs 
> '/tmp/NMWl31/provisioner/containers/2c9bd571-7804-481d-9a42-1df65489dda8/backends/copy/rootfses/3da54ab5-8871-4913-a9d1-ae658286023a'
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.802534 17456 cgroups.cpp:2676] 
> Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.804713 17451 cgroups.cpp:1409] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40 after 
> 2.123008ms
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.807065 17450 cgroups.cpp:2694] 
> Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.809074 17450 cgroups.cpp:1438] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40 after 
> 1.96992ms
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.984756 17457 
> containerizer.cpp:1674] Executor for container 
> 'f1f5de4c-aaef-45c7-b28c-83014327eb40' has exited
> [11:32:13]W:   [Step 11/11] I0412 11:32:13.987437 17451 linux.cpp:798] 
> Unmounting volume 
> 

[jira] [Created] (MESOS-5192) LinuxFilesystemIsolatorTest.ROOT_MultipleContainers is flaky

2016-04-12 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5192:
--

 Summary: LinuxFilesystemIsolatorTest.ROOT_MultipleContainers is 
flaky
 Key: MESOS-5192
 URL: https://issues.apache.org/jira/browse/MESOS-5192
 Project: Mesos
  Issue Type: Bug
Reporter: Neil Conway


Observed on internal Mesosphere CI:

{noformat}
[11:32:03] : [Step 11/11] [ RUN  ] 
LinuxFilesystemIsolatorTest.ROOT_MultipleContainers
[11:32:09]W: [Step 11/11] I0412 11:32:09.587877 17436 linux.cpp:81] Making 
'/tmp/NMWl31' a shared mount
[11:32:09]W: [Step 11/11] I0412 11:32:09.603153 17436 
linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
for the Linux launcher
[11:32:09]W: [Step 11/11] I0412 11:32:09.604372 17456 
containerizer.cpp:682] Starting container 
'f1f5de4c-aaef-45c7-b28c-83014327eb40' for executor 'test_executor1' of 
framework ''
[11:32:09]W: [Step 11/11] I0412 11:32:09.604940 17450 provisioner.cpp:285] 
Provisioning image rootfs 
'/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53'
 for container f1f5de4c-aaef-45c7-b28c-83014327eb40
[11:32:09]W: [Step 11/11] I0412 11:32:09.605561 17454 copy.cpp:128] Copying 
layer path '/tmp/NMWl31/test_image1' to rootfs 
'/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53'
[11:32:13]W: [Step 11/11] I0412 11:32:13.784813 17450 linux.cpp:355] Bind 
mounting work directory from 
'/tmp/NMWl31/slaves/test_slave/frameworks/executors/test_executor1/runs/f1f5de4c-aaef-45c7-b28c-83014327eb40'
 to 
'/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53/mnt/mesos/sandbox'
 for container f1f5de4c-aaef-45c7-b28c-83014327eb40
[11:32:13]W: [Step 11/11] I0412 11:32:13.785079 17450 linux.cpp:683] 
Changing the ownership of the persistent volume at 
'/tmp/NMWl31/volumes/roles/test_role/persistent_volume_id' with uid 0 and gid 0
[11:32:13]W: [Step 11/11] I0412 11:32:13.790043 17450 linux.cpp:723] 
Mounting '/tmp/NMWl31/volumes/roles/test_role/persistent_volume_id' to 
'/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53/mnt/mesos/sandbox/volume'
 for persistent volume disk(test_role)[persistent_volume_id:volume]:32 of 
container f1f5de4c-aaef-45c7-b28c-83014327eb40
[11:32:13]W: [Step 11/11] I0412 11:32:13.791724 17455 
linux_launcher.cpp:281] Cloning child process with flags = CLONE_NEWNS
[11:32:13]W: [Step 11/11] I0412 11:32:13.797466 17453 
containerizer.cpp:682] Starting container 
'2c9bd571-7804-481d-9a42-1df65489dda8' for executor 'test_executor2' of 
framework ''
[11:32:13]W: [Step 11/11] I0412 11:32:13.797947 17453 
containerizer.cpp:1439] Destroying container 
'f1f5de4c-aaef-45c7-b28c-83014327eb40'
[11:32:13]W: [Step 11/11] I0412 11:32:13.798017 17457 provisioner.cpp:285] 
Provisioning image rootfs 
'/tmp/NMWl31/provisioner/containers/2c9bd571-7804-481d-9a42-1df65489dda8/backends/copy/rootfses/3da54ab5-8871-4913-a9d1-ae658286023a'
 for container 2c9bd571-7804-481d-9a42-1df65489dda8
[11:32:13]W: [Step 11/11] I0412 11:32:13.798683 17451 copy.cpp:128] Copying 
layer path '/tmp/NMWl31/test_image2' to rootfs 
'/tmp/NMWl31/provisioner/containers/2c9bd571-7804-481d-9a42-1df65489dda8/backends/copy/rootfses/3da54ab5-8871-4913-a9d1-ae658286023a'
[11:32:13]W: [Step 11/11] I0412 11:32:13.802534 17456 cgroups.cpp:2676] 
Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40
[11:32:13]W: [Step 11/11] I0412 11:32:13.804713 17451 cgroups.cpp:1409] 
Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40 after 
2.123008ms
[11:32:13]W: [Step 11/11] I0412 11:32:13.807065 17450 cgroups.cpp:2694] 
Thawing cgroup /sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40
[11:32:13]W: [Step 11/11] I0412 11:32:13.809074 17450 cgroups.cpp:1438] 
Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/f1f5de4c-aaef-45c7-b28c-83014327eb40 after 
1.96992ms
[11:32:13]W: [Step 11/11] I0412 11:32:13.984756 17457 
containerizer.cpp:1674] Executor for container 
'f1f5de4c-aaef-45c7-b28c-83014327eb40' has exited
[11:32:13]W: [Step 11/11] I0412 11:32:13.987437 17451 linux.cpp:798] 
Unmounting volume 
'/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53/mnt/mesos/sandbox/volume'
 for container f1f5de4c-aaef-45c7-b28c-83014327eb40
[11:32:13]W: [Step 11/11] I0412 11:32:13.987499 17451 linux.cpp:817] 
Unmounting sandbox/work directory 
'/tmp/NMWl31/provisioner/containers/f1f5de4c-aaef-45c7-b28c-83014327eb40/backends/copy/rootfses/1dcaba0b-23ab-462f-be91-cd49090c4f53/mnt/mesos/sandbox'
{noformat}

[jira] [Updated] (MESOS-1781) CpuIsolatorTest/0.UserCpuUsage is flaky

2016-04-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-1781:
---
Labels: flaky mesosphere  (was: flaky)

> CpuIsolatorTest/0.UserCpuUsage is flaky
> ---
>
> Key: MESOS-1781
> URL: https://issues.apache.org/jira/browse/MESOS-1781
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.20.0
> Environment: fedora-19-clang jenkins VM
>Reporter: Yan Xu
>  Labels: flaky, mesosphere
>
> {noformat:title=}
> [ RUN  ] CpuIsolatorTest/0.UserCpuUsage
> Using temporary directory '/tmp/CpuIsolatorTest_0_UserCpuUsage_AGvLD7'
> I0909 07:43:18.813490 23253 launcher.cpp:137] Forked child with pid '25840' 
> for container 'user_cpu_usage'
> tests/isolator_tests.cpp:217: Failure
> Expected: (0.125) <= (statistics.cpus_user_time_secs()), actual: 0.125 vs 0
> 2014-09-09 
> 07:43:21,335:23253(0x7f6b7503f700):ZOO_WARN@zookeeper_interest@1557: Exceeded 
> deadline by 1205ms
> 2014-09-09 
> 07:43:21,340:23253(0x7f6b7503f700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:36224] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> [  FAILED  ] CpuIsolatorTest/0.UserCpuUsage, where TypeParam = 
> mesos::internal::slave::PosixCpuIsolatorProcess (3848 ms)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5188) docker executor thinks task is failed when docker container was stopped

2016-04-12 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237126#comment-15237126
 ] 

Jan Schlicht commented on MESOS-5188:
-

The executor uses the return code of a finished task to determine whether it 
was successful. Running {{docker stop}} sends a {{SIGTERM}} to the sleep task, 
so the task terminates with a return code != 0, which the Mesos executor 
interprets as failed, because only a return code == 0 is treated as 
successful. Tasks other than {{sleep}} might work, because they could catch 
the {{SIGTERM}} and exit gracefully. Hence whether this results in 
TASK_FAILED depends on the task you're running; it is not something Mesos can 
control.
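
For illustration: a task that catches {{SIGTERM}} and exits with status 0 
would be reported as TASK_FINISHED rather than TASK_FAILED when its container 
is stopped. A minimal sketch in Python (hypothetical, not part of the 
reported setup; it would replace the {{sleep 300}} command):
{code}
# graceful_sleep.py -- hypothetical stand-in for "sleep 300" that exits
# cleanly on SIGTERM, so the executor sees return code 0.
import signal
import sys
import time

def handle_term(signum, frame):
    sys.exit(0)  # exit code 0 => executor reports the task as successful

signal.signal(signal.SIGTERM, handle_term)
time.sleep(300)  # interrupted by SIGTERM; the handler then exits cleanly
{code}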

> docker executor thinks task is failed when docker container was stopped
> ---
>
> Key: MESOS-5188
> URL: https://issues.apache.org/jira/browse/MESOS-5188
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.28.0
>Reporter: Liqiang Lin
> Fix For: 0.29.0
>
>
> Test cases:
> 1. Launch a task with Swarm (on Mesos).
> {code}
> # docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300
> {code}
> 2. Then stop the docker container.
> {code}
> # docker -H 192.168.56.110:54375 ps
> CONTAINER IDIMAGE   COMMAND CREATED   
>   STATUS  PORTS   NAMES
> b4813ba3ed4dubuntu  "sleep 300" 9 seconds ago 
>   Up 8 seconds
> mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958
> # docker -H 192.168.56.110:54375 stop b4813ba3ed4d
> b4813ba3ed4d
> {code}
> 3. Found the task is failed. See Mesos slave log,
> {code}
> I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 
> for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for 
> framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
>  to user 'root'
> I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 
> of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
> I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for 
> executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received 
> within 75secs
> I0407 09:12:18.460212 32307 slave.cpp:4593] Current disk usage 56.53%. Max 
> allowed age: 2.342613645432778days
> I0407 09:12:18.463484 32307 slave.cpp:928] Re-detecting master
> I0407 09:12:18.463969 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.464501 32307 slave.cpp:939] New master detected at 
> master@192.168.56.110:5050
> I0407 09:12:18.464848 32307 slave.cpp:964] No credentials provided. 
> Attempting to register without authentication
> I0407 09:12:18.465237 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.463611 32312 status_update_manager.cpp:174] Pausing sending 
> status updates
> I0407 09:12:18.465744 32312 status_update_manager.cpp:174] Pausing sending 
> status updates
> I0407 09:12:18.472323 32313 docker.cpp:1011] Starting container 
> '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for task '99ee7dc74861' (and executor 
> '99ee7dc74861') of framework '5b84aad8-dd60-40b3-84c2-93be6b7aa81c-'
> I0407 09:12:18.588739 32313 slave.cpp:1218] Re-registered with master 
> master@192.168.56.110:5050
> I0407 09:12:18.588927 32313 slave.cpp:1254] Forwarding total oversubscribed 
> resources
> I0407 09:12:18.589320 32313 slave.cpp:2395] Updating framework 
> 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- pid to 
> scheduler(1)@192.168.56.110:53375
> I0407 09:12:18.592079 32308 status_update_manager.cpp:181] Resuming sending 
> status updates
> I0407 09:12:18.592842 32313 slave.cpp:2534] Updated checkpointed resources 
> from  to
> I0407 09:12:18.592793 32308 status_update_manager.cpp:181] Resuming sending 
> status updates
> I0407 09:12:20.582041 32307 slave.cpp:2836] Got registration for executor 
> '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from 
> executor(1)@192.168.56.110:40725
> I0407 09:12:20.584446 32307 docker.cpp:1308] Ignoring updating container 
> 
> {code}

[jira] [Created] (MESOS-5191) Broken credentials file accepted without error

2016-04-12 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-5191:
---

 Summary: Broken credentials file accepted without error
 Key: MESOS-5191
 URL: https://issues.apache.org/jira/browse/MESOS-5191
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Bannier


Starting a Mesos agent with the following broken JSON credentials file 
currently emits no error:
{code}
{
  "principal": "username",
  "secret": "secret"
}
{code}

The correct JSON format would have been:
{code}
{
  "credentials": [
{
  "principal": "username",
  "secret": "secret"
}
  ]
}
{code}

No diagnostic is emitted in the agent log and (as expected) {{username:secret}}
cannot be used to authenticate.

From adding some logging to {{mesos::internal::credentials::read}}, it seems 
that while the above broken format is successfully rejected by the JSON 
parser, the current fall-through logic next tries the parser for the legacy 
credentials format, which finds no credentials at all. This is confusing, 
since we did specify some (albeit broken) information.
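
Until the agent reports such problems itself, a pre-flight check could catch 
this class of mistake before the agent is started. Below is a minimal 
operator-side sketch in Python; it is hypothetical tooling, not part of 
Mesos, and only encodes the expected schema shown above.
{code}
# validate_credentials.py -- hypothetical pre-flight check, not part of Mesos.
import json
import sys

def validate_credentials_file(path):
    with open(path) as f:
        doc = json.load(f)  # raises an error on malformed JSON
    creds = doc.get("credentials") if isinstance(doc, dict) else None
    if not isinstance(creds, list) or not creds:
        raise ValueError("expected a non-empty top-level 'credentials' array")
    for c in creds:
        if "principal" not in c or "secret" not in c:
            raise ValueError("each credential needs 'principal' and 'secret'")
    return len(creds)

if __name__ == "__main__":
    print("%d credential(s) OK" % validate_credentials_file(sys.argv[1]))
{code}
Run against the broken file above, this fails with "expected a non-empty 
top-level 'credentials' array" instead of silently finding nothing.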



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5190) CpuIsolatorTest/0.UserCpuUsage is flaky

2016-04-12 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-5190:

Description: 
Observed in internal CI under Centos6 with default configure flags.

{code}
[ RUN  ] CpuIsolatorTest/0.UserCpuUsage
I0412 11:28:14.031723 24202 resources.cpp:572] Parsing resources as JSON 
failed: cpus:1.0
Trying semicolon-delimited string format instead
I0412 11:28:14.034379 24202 launcher.cpp:123] Forked child with pid '5783' for 
container 'a6117255-c666-4e32-9025-9f770c7008ed'
../../src/tests/containerizer/isolator_tests.cpp:308: Failure
Expected: (0.125) <= (statistics.cpus_user_time_secs()), actual: 0.125 vs 0
[  FAILED  ] CpuIsolatorTest/0.UserCpuUsage, where TypeParam = 
mesos::internal::slave::PosixCpuIsolatorProcess (1100 ms)
{code}


  was:
Observed in internal CI under Centos6 

{code}
[ RUN  ] CpuIsolatorTest/0.UserCpuUsage
I0412 11:28:14.031723 24202 resources.cpp:572] Parsing resources as JSON 
failed: cpus:1.0
Trying semicolon-delimited string format instead
I0412 11:28:14.034379 24202 launcher.cpp:123] Forked child with pid '5783' for 
container 'a6117255-c666-4e32-9025-9f770c7008ed'
../../src/tests/containerizer/isolator_tests.cpp:308: Failure
Expected: (0.125) <= (statistics.cpus_user_time_secs()), actual: 0.125 vs 0
[  FAILED  ] CpuIsolatorTest/0.UserCpuUsage, where TypeParam = 
mesos::internal::slave::PosixCpuIsolatorProcess (1100 ms)
{code}



> CpuIsolatorTest/0.UserCpuUsage is flaky
> ---
>
> Key: MESOS-5190
> URL: https://issues.apache.org/jira/browse/MESOS-5190
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
>Reporter: Benjamin Bannier
>  Labels: flaky, flaky-test, mesosphere
>
> Observed in internal CI under Centos6 with default configure flags.
> {code}
> [ RUN  ] CpuIsolatorTest/0.UserCpuUsage
> I0412 11:28:14.031723 24202 resources.cpp:572] Parsing resources as JSON 
> failed: cpus:1.0
> Trying semicolon-delimited string format instead
> I0412 11:28:14.034379 24202 launcher.cpp:123] Forked child with pid '5783' 
> for container 'a6117255-c666-4e32-9025-9f770c7008ed'
> ../../src/tests/containerizer/isolator_tests.cpp:308: Failure
> Expected: (0.125) <= (statistics.cpus_user_time_secs()), actual: 0.125 vs 0
> [  FAILED  ] CpuIsolatorTest/0.UserCpuUsage, where TypeParam = 
> mesos::internal::slave::PosixCpuIsolatorProcess (1100 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5190) CpuIsolatorTest/0.UserCpuUsage is flaky

2016-04-12 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-5190:

Description: 
Observed in internal CI under Centos6 

{code}
[ RUN  ] CpuIsolatorTest/0.UserCpuUsage
I0412 11:28:14.031723 24202 resources.cpp:572] Parsing resources as JSON 
failed: cpus:1.0
Trying semicolon-delimited string format instead
I0412 11:28:14.034379 24202 launcher.cpp:123] Forked child with pid '5783' for 
container 'a6117255-c666-4e32-9025-9f770c7008ed'
../../src/tests/containerizer/isolator_tests.cpp:308: Failure
Expected: (0.125) <= (statistics.cpus_user_time_secs()), actual: 0.125 vs 0
[  FAILED  ] CpuIsolatorTest/0.UserCpuUsage, where TypeParam = 
mesos::internal::slave::PosixCpuIsolatorProcess (1100 ms)
{code}


  was:
Observed in internal CI.

{code}
[ RUN  ] CpuIsolatorTest/0.UserCpuUsage
I0412 11:28:14.031723 24202 resources.cpp:572] Parsing resources as JSON 
failed: cpus:1.0
Trying semicolon-delimited string format instead
I0412 11:28:14.034379 24202 launcher.cpp:123] Forked child with pid '5783' for 
container 'a6117255-c666-4e32-9025-9f770c7008ed'
../../src/tests/containerizer/isolator_tests.cpp:308: Failure
Expected: (0.125) <= (statistics.cpus_user_time_secs()), actual: 0.125 vs 0
[  FAILED  ] CpuIsolatorTest/0.UserCpuUsage, where TypeParam = 
mesos::internal::slave::PosixCpuIsolatorProcess (1100 ms)
{code}



> CpuIsolatorTest/0.UserCpuUsage is flaky
> ---
>
> Key: MESOS-5190
> URL: https://issues.apache.org/jira/browse/MESOS-5190
> Project: Mesos
>  Issue Type: Bug
>  Components: flaky, test
>Reporter: Benjamin Bannier
>  Labels: flaky, flaky-test, mesosphere
>
> Observed in internal CI under Centos6 
> {code}
> [ RUN  ] CpuIsolatorTest/0.UserCpuUsage
> I0412 11:28:14.031723 24202 resources.cpp:572] Parsing resources as JSON 
> failed: cpus:1.0
> Trying semicolon-delimited string format instead
> I0412 11:28:14.034379 24202 launcher.cpp:123] Forked child with pid '5783' 
> for container 'a6117255-c666-4e32-9025-9f770c7008ed'
> ../../src/tests/containerizer/isolator_tests.cpp:308: Failure
> Expected: (0.125) <= (statistics.cpus_user_time_secs()), actual: 0.125 vs 0
> [  FAILED  ] CpuIsolatorTest/0.UserCpuUsage, where TypeParam = 
> mesos::internal::slave::PosixCpuIsolatorProcess (1100 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

