[jira] [Assigned] (MESOS-8623) Crashed framework brings down the whole Mesos cluster

2018-11-26 Thread Chun-Hung Hsiao (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-8623:
--

Assignee: Chun-Hung Hsiao

> Crashed framework brings down the whole Mesos cluster
> -
>
> Key: MESOS-8623
> URL: https://issues.apache.org/jira/browse/MESOS-8623
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1, 1.5.1, 1.6.1, 1.7.0
> Environment: Debian 8
> Mesos 1.4.1
>Reporter: Tomas Barton
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>
> It might be hard to replicate, but when you do, your Mesos cluster is gone. 
> The issue was caused by an unresponsive Docker engine on a single agent node. 
> Unfortunately, even after fixing the Docker issues, all Mesos masters repeatedly 
> failed to start. In despair I deleted all {{replicated_log}} data from the 
> masters and ZooKeeper. Even after that, messages from the agents' 
> {{replicated_log}} got replayed and the master crashed again. The average 
> lifetime of a Mesos master was less than one minute.
> {code}
> mesos-master[3814]: I0228 00:25:55.269835  3828 network.hpp:436] ZooKeeper 
> group memberships changed
> mesos-master[3814]: I0228 00:25:55.269979  3832 group.cpp:700] Trying to get 
> '/mesos/log_replicas/002519' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.271117  3832 group.cpp:700] Trying to get 
> '/mesos/log_replicas/002520' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.277971  3832 group.cpp:700] Trying to get 
> '/mesos/log_replicas/002521' in ZooKeeper
> mesos-master[3814]: I0228 00:25:55.279296  3827 network.hpp:484] ZooKeeper 
> group PIDs: { log-replica(1)
> mesos-master[3814]: W0228 00:26:15.261255  3831 master.hpp:2372] Master 
> attempted to send message to disconnected framework 
> 911c4b47-2ba7-4959-b59e-c48d896fe210-0005 (kafka)
> mesos-master[3814]: F0228 00:26:15.261318  3831 master.hpp:2382] 
> CHECK_SOME(pid): is NONE
> mesos-master[3814]: *** Check failure stack trace: ***
> mesos-master[3814]: @ 0x7f7187ca073d  google::LogMessage::Fail()
> mesos-master[3814]: @ 0x7f7187ca23bd  google::LogMessage::SendToLog()
> mesos-master[3814]: @ 0x7f7187ca0302  google::LogMessage::Flush()
> mesos-master[3814]: @ 0x7f7187ca2da9  
> google::LogMessageFatal::~LogMessageFatal()
> mesos-master[3814]: @ 0x7f7186d6d769  _CheckFatal::~_CheckFatal()
> mesos-master[3814]: @ 0x7f71870465d5  
> mesos::internal::master::Framework::send<>()
> mesos-master[3814]: @ 0x7f7186fcfe8a  
> mesos::internal::master::Master::executorMessage()
> mesos-master[3814]: @ 0x7f718706b1a1  ProtobufProcess<>::handler4<>()
> mesos-master[3814]: @ 0x7f7187008e36  
> std::_Function_handler<>::_M_invoke()
> mesos-master[3814]: @ 0x7f71870293d1  ProtobufProcess<>::visit()
> mesos-master[3814]: @ 0x7f7186fb7ee4  
> mesos::internal::master::Master::_visit()
> mesos-master[3814]: @ 0x7f7186fd0d5d  
> mesos::internal::master::Master::visit()
> mesos-master[3814]: @ 0x7f7187c02e22  process::ProcessManager::resume()
> mesos-master[3814]: @ 0x7f7187c08d46  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vE
> mesos-master[3814]: @ 0x7f7185babca0  (unknown)
> mesos-master[3814]: @ 0x7f71853c6064  start_thread
> mesos-master[3814]: @ 0x7f71850fb62d  (unknown)
> systemd[1]: mesos-master.service: main process exited, code=killed, 
> status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopping Mesos Master...
> systemd[1]: Starting Mesos Master...
> systemd[1]: Started Mesos Master.
> mesos-master[27840]: WARNING: Logging before InitGoogleLogging() is written 
> to STDERR
> mesos-master[27840]: I0228 01:32:38.294122 27829 main.cpp:232] Build: 
> 2017-11-18 02:15:41 by admin
> mesos-master[27840]: I0228 01:32:38.294168 27829 main.cpp:233] Version: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294178 27829 main.cpp:236] Git tag: 1.4.1
> mesos-master[27840]: I0228 01:32:38.294186 27829 main.cpp:240] Git SHA: 
> c844db9ac7c0cef59be87438c6781bfb71adcc42
> mesos-master[27840]: I0228 01:32:38.296067 27829 main.cpp:340] Using 
> 'HierarchicalDRF' allocator
> mesos-master[27840]: I0228 01:32:38.411576 27829 replica.cpp:779] Replica 
> recovered with log positions 13 -> 14 with 0 holes and 0 unlearned
> mesos-master[27840]: 2018-02-28 
> 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@726: Client 
> environment:zookeeper.version=zookeeper C client 3.4.8
> mesos-master[27840]: 2018-02-28 
> 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@730: Client 
> environment:host.name=svc01
> mesos-master[27840]: 2018-02-28 
> 01:32:38,412:27829(0x7fab44755700):ZOO_INFO@log_env@737: Client 
> 

[jira] [Commented] (MESOS-9319) Move root filesystem creation to the `filesystem/linux` isolator.

2018-11-26 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699772#comment-16699772
 ] 

James Peach commented on MESOS-9319:


Updated patch series:

| [r/69211|https://reviews.apache.org/r/69211] | Improved the code comments for 
`getContainerDevicesPath`. |
| [r/69210|https://reviews.apache.org/r/69210] | Used the MS_SILENT mount flag 
to elide unwanted logging. |
| [r/69086|https://reviews.apache.org/r/69086] | Moved the container root 
construction to the isolators. |
| [r/69450|https://reviews.apache.org/r/69450] | Applied the 
`ContainerMountInfo` protobuf helper. |

> Move root filesystem creation to the `filesystem/linux` isolator.
> -
>
> Key: MESOS-9319
> URL: https://issues.apache.org/jira/browse/MESOS-9319
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> When using a custom user namespace isolator, the task fails at launch because 
> opening devices fails with an EPERM error. This problem is described in [this 
> systemd issue|https://github.com/systemd/systemd/pull/9483] and [this lxd 
> issue|https://github.com/lxc/lxd/issues/4950].
> The problem arises in the Mesos containerizer due to the order of operations:
> # Clone the containerizer with {{CLONE_NEWNS}}
> # Mount a tmpfs for the devices
> # mknod for the various device nodes
> Referring back to the lxd issue, because we do (1) before (2), the tmpfs on 
> {{/dev}} is marked {{SB_I_NODEV}}. Due to the new 4.18 behavior, the mknod in 
> (3) now succeeds (see commit 
> [55956b59df33|https://github.com/torvalds/linux/commit/55956b59df336f6738da916dbb520b6e37df9fbd]).
>  Previously it would fail and we would fall back to bind mounting the device. 
> However, even though we created the device, we can't actually open it due to 
> the {{SB_I_NODEV}} flag on the tmpfs mount. It appears that the purpose of 
> allowing mknod is so that containers can create overlayfs whiteouts.
> One approach to deal with this in the Mesos containerizer is to complete the 
> device node cleanup that was begun with the {{linux/devices}} isolator. This 
> approach involves moving all the responsibility for creating devices back to 
> the isolators. Then, at containerization time, we simply bind-mount the whole 
> of {{/dev}} from the per-container staging area. Since the isolators create the 
> devices in the host namespace and on the Mesos work directory, none of the 
> conditions that trigger the failure would be invoked.
> The failure we observed with our tasks was a failure to open {{/dev/null}}, 
> when redirecting it as standard input to a child process.
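The interaction described above can be captured in a toy model (illustrative C++ only, not kernel or Mesos code; the functions `mknodSucceeds` and `openSucceeds` are invented for this sketch):

```cpp
#include <cassert>

// Toy model of the failure mode: a device node's usability depends on
// the mount it lives on. On a nodev mount, mknod() may now succeed on
// 4.18+ kernels (to support overlayfs whiteouts), but open() on the
// resulting node still fails.
struct Mount
{
  bool nodev;  // Mounted with MS_NODEV / flagged SB_I_NODEV?
};

// Whether mknod() of a device node succeeds on this mount.
bool mknodSucceeds(const Mount& m, bool kernel418OrLater)
{
  return !m.nodev || kernel418OrLater;  // 4.18+ allows it even on nodev.
}

// Whether open() of that device node succeeds on this mount.
bool openSucceeds(const Mount& m)
{
  return !m.nodev;  // nodev forbids using device nodes, always.
}
```

In this model, the container's tmpfs /dev is nodev (the clone happened before the mount), so mknod succeeds but open fails; creating the nodes on the host work directory, which is not nodev, and bind-mounting them in avoids the failing case entirely.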



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8623) Crashed framework brings down the whole Mesos cluster

2018-11-26 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699724#comment-16699724
 ] 

Benjamin Mahler commented on MESOS-8623:


Looks like we really dropped the ball on this one, linking in MESOS-9419 and 
upgrading to blocker.

> Crashed framework brings down the whole Mesos cluster
> -
>
> Key: MESOS-8623
> URL: https://issues.apache.org/jira/browse/MESOS-8623
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1
> Environment: Debian 8
> Mesos 1.4.1
>Reporter: Tomas Barton
>Priority: Critical

[jira] [Created] (MESOS-9419) Executor to framework message crashes master if framework has not re-registered.

2018-11-26 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-9419:
--

 Summary: Executor to framework message crashes master if framework 
has not re-registered.
 Key: MESOS-9419
 URL: https://issues.apache.org/jira/browse/MESOS-9419
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.7.0, 1.6.1, 1.5.1
Reporter: Benjamin Mahler
Assignee: Chun-Hung Hsiao


If the executor sends a framework message after a master failover, and the 
framework has not yet re-registered with the master, this will crash the master:

{code}
W20181105 22:02:48.782819 172709 master.hpp:2304] Master attempted to send 
message to disconnected framework 03dc2603-acd6-491e-8717-3f03e5ee37f4- 
(Cook-1.24.0-9299b474217db499c9d28738050b359ac8dd55bb)
F20181105 22:02:48.782830 172709 master.hpp:2314] CHECK_SOME(pid): is NONE
*** Check failure stack trace: ***
*** @ 0x7f09e016b6cd google::LogMessage::Fail()
*** @ 0x7f09e016d38d google::LogMessage::SendToLog()
*** @ 0x7f09e016b2b3 google::LogMessage::Flush()
*** @ 0x7f09e016de09 google::LogMessageFatal::~LogMessageFatal()
*** @ 0x7f09df086228 _CheckFatal::~_CheckFatal()
*** @ 0x7f09df3a403d mesos::internal::master::Framework::send<>()
*** @ 0x7f09df2f4886 mesos::internal::master::Master::executorMessage()
*** @ 0x7f09df3b06a4 
_ZN15ProtobufProcessIN5mesos8internal6master6MasterEE8handlerNINS1_26ExecutorToFrameworkMessageEJRKNS0_7SlaveIDERKNS0_11FrameworkIDERKNS0_10ExecutorIDERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcJS9_SC_SF_SN_EEEvPS3_MS3_FvRKN7process4UPIDEDpT1_ESS_SN_DpMT_KFT0_vE
*** @ 0x7f09df345b43 std::_Function_handler<>::_M_invoke()
*** @ 0x7f09df36930f ProtobufProcess<>::consume()
*** @ 0x7f09df2e0ff5 mesos::internal::master::Master::_consume()
*** @ 0x7f09df2f5542 mesos::internal::master::Master::consume()
*** @ 0x7f09e00d9c7a process::ProcessManager::resume()
*** @ 0x7f09e00dd836 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
*** @ 0x7f09dd467ac8 execute_native_thread_routine
*** @ 0x7f09dd6f6b50 start_thread
*** @ 0x7f09dcc7030d (unknown)
{code}

This is because {{Framework::send}} proceeds even if the framework is 
disconnected. In the case of a recovered framework, it will not yet have a pid 
or an HTTP connection:

https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.hpp#L2590-L2610

{code}
// Sends a message to the connected framework.
template <typename Message>
void Framework::send(const Message& message)
{
  if (!connected()) {
    LOG(WARNING) << "Master attempted to send message to disconnected"
                 << " framework " << *this;
    // XXX proceeds!
  }

  metrics.incrementEvent(message);

  if (http.isSome()) {
    if (!http->send(message)) {
      LOG(WARNING) << "Unable to send event to framework " << *this << ":"
                   << " connection closed";
    }
  } else {
    CHECK_SOME(pid); // XXX Will crash.
    master->send(pid.get(), message);
  }
}
{code}
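A minimal sketch of the early-return guard that would avoid the crash, using a simplified stand-in for the framework object (the struct and field names below are invented for illustration, not the actual Mesos types):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Toy model of the master-side send path. A disconnected, recovered
// framework has neither an HTTP connection nor a pid; sending must
// bail out instead of CHECK-failing on the missing pid.
struct Framework
{
  bool connected = false;
  std::optional<std::string> http;  // HTTP connection, if any.
  std::optional<std::string> pid;   // libprocess pid, if any.
  std::vector<std::string> sent;    // Messages actually delivered.

  // Returns false when the message had to be dropped.
  bool send(const std::string& message)
  {
    if (!connected) {
      // Dropping here is the guard: the original code only warned and
      // then fell through to `CHECK_SOME(pid)`, aborting the master.
      return false;
    }

    if (http.has_value()) {
      sent.push_back(message);  // Send over the HTTP connection.
    } else {
      assert(pid.has_value());  // Safe: connected implies http or pid.
      sent.push_back(message);  // Send to the pid.
    }
    return true;
  }
};
```

With the guard in place, an executor-to-framework message arriving before the framework re-registers is simply dropped rather than crashing the master.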

The executor to framework path does not guard against the framework being 
disconnected, unlike the status update path:

https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L6472-L6495

vs.

https://github.com/apache/mesos/blob/9b889a10927b13510a1d02e7328925dba3438a0b/src/master/master.cpp#L8371-L8373

It was reported that this crash didn't occur for the user on 1.2.0; however, the 
issue appears to be present there as well, so we will try to backport a test to 
see whether it indeed does not occur in 1.2.0.





[jira] [Commented] (MESOS-8930) THREADSAFE_SnapshotTimeout is flaky.

2018-11-26 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699577#comment-16699577
 ] 

Vinod Kone commented on MESOS-8930:
---

Still seeing this in CI.

 

[~bmahler] Do we have any abstractions/techniques in place that allow us to 
ensure the http request is enqueued in a more robust manner? Sounds like the 
10ms is sometimes not enough in ASF CI.

 

A kinda unrelated bug here is that the code does a {{response->body}} on a 
(possibly pending) future, causing it to hang forever. This will block the 
whole test suite!

{code}
  AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response);

  // Parse the response.
  Try<JSON::Object> responseJSON = JSON::parse<JSON::Object>(response->body);
  ASSERT_SOME(responseJSON);
{code}

 

I think we should at least change the `AWAIT_EXPECT_*` above to `AWAIT_ASSERT` 
so that the rest of the test code is skipped. cc [~greggomann] [~bmahler]
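One generic alternative to a fixed 10ms delay is to poll for the condition with a deadline. This is only a sketch of the idea (a real libprocess test would more likely pause and advance the `Clock` or use a `FUTURE_*` expectation); the helper name is invented:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Generic "wait for a condition with a deadline" helper: instead of
// sleeping a fixed 10ms and hoping the request has been enqueued, poll
// the condition until it holds or the deadline passes.
bool waitUntil(
    const std::function<bool()>& condition,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(1))
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!condition()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // Timed out; the caller can fail the test loudly.
    }
    std::this_thread::sleep_for(interval);
  }
  return true;
}
```

The timeout then bounds the slow-CI case instead of being a best guess at the common case, which is exactly where a fixed 10ms falls over on a loaded ASF CI machine.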

 

> THREADSAFE_SnapshotTimeout is flaky.
> 
>
> Key: MESOS-8930
> URL: https://issues.apache.org/jira/browse/MESOS-8930
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: flaky-test, mesosphere
>
> Observed on ASF CI, might be related to a recent test change 
> https://reviews.apache.org/r/66831/
> {noformat}
> 18:23:31 2: [ RUN  ] MetricsTest.THREADSAFE_SnapshotTimeout
> 18:23:31 2: I0516 18:23:31.747611 16246 process.cpp:3583] Handling HTTP event 
> for process 'metrics' with path: '/metrics/snapshot'
> 18:23:31 2: I0516 18:23:31.796871 16251 process.cpp:3583] Handling HTTP event 
> for process 'metrics' with path: '/metrics/snapshot'
> 18:23:46 2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: 
> Failure
> 18:23:46 2: Failed to wait 15secs for response
> 22:57:13 Build timed out (after 300 minutes). Marking the build as failed.
> {noformat}





[jira] [Assigned] (MESOS-9418) CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 kernels

2018-11-26 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-9418:
--

Assignee: James Peach

> CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 kernels
> -
>
> Key: MESOS-9418
> URL: https://issues.apache.org/jira/browse/MESOS-9418
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, test
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> The {{CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage}} test fails on Linux 4.19 
> kernels.
> {noformat}
> [jpeach@jpeach mesos]$ uname -r
> 4.19.3-300.fc29.x86_64
> [jpeach@jpeach build]$ sudo env GLOG_v=1 ./src/mesos-tests --verbose 
> --gtest_filter=CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage
> ...
> W1126 10:45:44.941278 30021 cgroups.cpp:895] Skipping resource statistic for 
> container 8f67e5f9-ebf0-436c-a1d2-f30c69883a27 because: Failed to parse blkio 
> value '8:0 Discard 0' from 'blkio.io_service_bytes': Invalid major:minor 
> device number: 'Discard'
> ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1890: Failure
> Value of: usage->has_blkio_statistics()
>   Actual: false
> Expected: true
> ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1891: Failure
> Expected: (2) <= (usage->blkio_statistics().throttling_size()), actual: 2 vs 0
> ../../../src/tests/containerizer/cgroups_isolator_tests.cpp:1902: Failure
> totalThrottling is NONE
> mesos-tests: ../../../3rdparty/stout/include/stout/option.hpp:119: T 
> &Option<T>::get() & [T = 
> mesos::CgroupInfo_Blkio_Throttling_Statistics]: Assertion `isSome()' failed.
> ...
> {noformat}
> The actual cgroup format is:
> {noformat}
> [jpeach@jpeach blkio]$ pwd
> /sys/fs/cgroup/blkio
> [jpeach@jpeach blkio]$ cat 
> mesos_test_e9c8e0aa-3172-4d8d-b216-c8f5286a7efc/blkio.io_service_bytes
> 8:0 Read 0
> 8:0 Write 0
> 8:0 Sync 0
> 8:0 Async 0
> 8:0 Discard 0
> 8:0 Total 0
> Total 0
> {noformat}
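A tolerant parser would treat the operation name as opaque instead of rejecting entries outside the fixed pre-4.19 set. The sketch below is hypothetical (not the actual Mesos cgroups code) and parses the `blkio.io_service_bytes` format shown above, keeping the new "Discard" rows rather than failing on them:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Each line is either "<major>:<minor> <Operation> <value>" or the
// trailing "Total <value>". Unknown operations (such as the "Discard"
// row added by newer kernels) are kept rather than treated as errors.
struct BlkioValue
{
  std::string device;     // "8:0", or empty for the grand total.
  std::string operation;  // "Read", "Write", ..., "Discard", "Total".
  uint64_t value;
};

std::optional<std::vector<BlkioValue>> parseIoServiceBytes(
    const std::string& contents)
{
  std::vector<BlkioValue> result;
  std::istringstream lines(contents);
  std::string line;

  while (std::getline(lines, line)) {
    if (line.empty()) continue;

    std::istringstream fields(line);
    std::string first, second, third;
    fields >> first >> second >> third;

    if (third.empty()) {
      // The trailing "Total <value>" line has only two fields.
      if (first != "Total") return std::nullopt;
      result.push_back({"", "Total", std::stoull(second)});
    } else {
      // Only the device field is validated; the operation is opaque.
      if (first.find(':') == std::string::npos) return std::nullopt;
      result.push_back({first, second, std::stoull(third)});
    }
  }

  return result;
}
```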





[jira] [Assigned] (MESOS-9278) Add an operation status update manager to the agent

2018-11-26 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9278:


Assignee: Greg Mann  (was: Gastón Kleiman)

> Add an operation status update manager to the agent
> ---
>
> Key: MESOS-9278
> URL: https://issues.apache.org/jira/browse/MESOS-9278
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Affects Versions: 1.8.0
>Reporter: Gastón Kleiman
>Assignee: Greg Mann
>Priority: Major
>  Labels: mesosphere
>






[jira] [Created] (MESOS-9418) CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 kernels

2018-11-26 Thread James Peach (JIRA)
James Peach created MESOS-9418:
--

 Summary: CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage fails on 4.19 
kernels
 Key: MESOS-9418
 URL: https://issues.apache.org/jira/browse/MESOS-9418
 Project: Mesos
  Issue Type: Bug
  Components: containerization, test
Reporter: James Peach


The {{CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage}} test fails on Linux 4.19 
kernels.







[jira] [Assigned] (MESOS-9417) User mesosphere made lots of incorrect ticket updates

2018-11-26 Thread Marco Monaco (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Monaco reassigned MESOS-9417:
---

Assignee: Vinod Kone  (was: Marco Monaco)

> User mesosphere made lots of incorrect ticket updates
> -
>
> Key: MESOS-9417
> URL: https://issues.apache.org/jira/browse/MESOS-9417
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>Assignee: Vinod Kone
>Priority: Blocker
>  Labels: mesosphere
> Attachments: Screen Shot 2018-11-26 at 10.51.36 AM.png
>
>
> Around 4 days ago JIRA user [~mesosphere] made a lot of incorrect status 
> changes to tickets, e.g., reopening resolved issues. These tickets now have 
> an incorrect status, making it hard to see what work needs to be done and 
> what work has already happened.
> We should roll back all (incorrect) changes from that user.





[jira] [Commented] (MESOS-9399) Update 'mesos task list' to only list running tasks

2018-11-26 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698954#comment-16698954
 ] 

Kevin Klues commented on MESOS-9399:


{noformat}
commit 18e51b86fac848330dea640d7b7b7bf2d6584fe5
Author: Armand Grillet 
Date:   Mon Nov 26 08:47:25 2018 -0500

Replaced CLI test helper function 'running_tasks' by 'wait_for_task'.

Replaces 'running_tasks(master)', a function that was not generic nor
explicit, by 'wait_for_task(master, name, state, delay)'. This helper
function waits a 'delay' for a task with a given 'name' to be in a
certain 'state'.

All uses of 'running_tasks' have been replaced by the new function.

Review: https://reviews.apache.org/r/69426/
{noformat}

> Update 'mesos task list' to only list running tasks
> ---
>
> Key: MESOS-9399
> URL: https://issues.apache.org/jira/browse/MESOS-9399
> Project: Mesos
>  Issue Type: Task
>  Components: cli
>Reporter: Armand Grillet
>Assignee: Armand Grillet
>Priority: Major
>
> Doing a {{mesos task list}} currently returns all tasks that have ever been 
> run (not just running tasks). The default behavior should be to return only 
> the running tasks and offer an option to return all of them. To tell them 
> apart, there should be a state field in the table returned by this command.
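The proposed default can be sketched as a simple filter with an opt-in flag for listing all tasks; the types and names below are invented for illustration, not the actual CLI code:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each row of the table carries a state column so running and
// terminated tasks can be told apart.
struct Task
{
  std::string name;
  std::string state;  // e.g. "TASK_RUNNING", "TASK_FINISHED".
};

// Default: only running tasks; `all` opts in to every task ever run.
std::vector<Task> listTasks(const std::vector<Task>& tasks, bool all = false)
{
  std::vector<Task> selected;
  for (const Task& task : tasks) {
    if (all || task.state == "TASK_RUNNING") {
      selected.push_back(task);
    }
  }
  return selected;
}
```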





[jira] [Commented] (MESOS-9399) Update 'mesos task list' to only list running tasks

2018-11-26 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698924#comment-16698924
 ] 

Kevin Klues commented on MESOS-9399:


{noformat}
commit 48cdd101c7a9730029471b8f881df46e136bfae4
Author: Armand Grillet 
Date:   Mon Nov 26 08:22:58 2018 -0500

Fixed name of task created when running mesos-cli-tests.

Review: https://reviews.apache.org/r/69425/
{noformat}






[jira] [Commented] (MESOS-9399) Update 'mesos task list' to only list running tasks

2018-11-26 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698909#comment-16698909
 ] 

Kevin Klues commented on MESOS-9399:


{noformat}
commit 2b03f942b5cf9375a75f08b36091c3b3e7f096ff
Author: Armand Grillet 
Date:   Mon Nov 26 08:14:42 2018 -0500

Updated 'mesos task list' to only display running tasks.

Review: https://reviews.apache.org/r/69394/
{noformat}






[jira] [Assigned] (MESOS-9417) User mesosphere made lots of incorrect ticket updates

2018-11-26 Thread Marco Monaco (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Monaco reassigned MESOS-9417:
---

Assignee: Marco Monaco

> User mesosphere made lots of incorrect ticket updates
> -
>
> Key: MESOS-9417
> URL: https://issues.apache.org/jira/browse/MESOS-9417
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>Assignee: Marco Monaco
>Priority: Blocker
>  Labels: mesosphere
> Attachments: Screen Shot 2018-11-26 at 10.51.36 AM.png
>
>





[jira] [Created] (MESOS-9417) User mesosphere made lots of incorrect ticket updates

2018-11-26 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-9417:
---

 Summary: User mesosphere made lots of incorrect ticket updates
 Key: MESOS-9417
 URL: https://issues.apache.org/jira/browse/MESOS-9417
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Bannier
 Attachments: Screen Shot 2018-11-26 at 10.51.36 AM.png

Around 4 days ago JIRA user [~mesosphere] made a lot of incorrect status 
changes to tickets, e.g., reopening resolved issues. These tickets now have an 
incorrect status, making it hard to see what work needs to be done and what 
work has already happened.

We should roll back all (incorrect) changes from that user.





[jira] [Assigned] (MESOS-4664) Add allocator metrics.

2018-11-26 Thread Benjamin Bannier (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-4664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-4664:
---

Assignee: (was: Benjamin Bannier)

> Add allocator metrics.
> --
>
> Key: MESOS-4664
> URL: https://issues.apache.org/jira/browse/MESOS-4664
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Priority: Critical
>
> There are currently no metrics that provide visibility into the allocator, 
> except for the event queue size. This makes monitoring and debugging 
> allocation behavior in a multi-framework setup difficult.
> Some thoughts for initial metrics to add:
> * How many allocation runs have completed? (counter): MESOS-4718
> * How many offers has each role / framework received? (counter): MESOS-4719
> * Current allocation breakdown: allocated / available / total (gauges): 
> MESOS-4720
> * Current maximum shares (gauges): MESOS-4724
> * How many active filters are there for the role / framework? (gauges): 
> MESOS-4722
> * How many frameworks are suppressing offers? (gauges)
> * How long does an allocation run take? (timers): MESOS-4721
> * Maintenance related metrics:
> ** How many maintenance events are active? (gauges)
> ** How many maintenance events are scheduled but not active (gauges)
> * Quota related metrics:
> ** How much quota is set for each role? (gauges)
> ** How much quota is satisfied? How much unsatisfied? (gauges): MESOS-4723
>  
> Some of these are already exposed from the master's metrics, but we should 
> not assume this within the allocator.
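As a rough illustration of the counter/gauge split the list above relies on (not the actual Mesos metrics library): counters only ever increase (e.g. completed allocation runs), while gauges snapshot a current value (e.g. allocated CPUs) and can move in both directions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Minimal sketch of a metrics registry with the two kinds of metrics
// proposed in the ticket. All names below are invented for this example.
class Metrics
{
public:
  // Counters are monotonically increasing.
  void increment(const std::string& counter, uint64_t amount = 1)
  {
    counters_[counter] += amount;
  }

  // Gauges are set to the current value, up or down.
  void set(const std::string& gauge, double value)
  {
    gauges_[gauge] = value;
  }

  uint64_t counter(const std::string& name) const
  {
    auto it = counters_.find(name);
    return it == counters_.end() ? 0 : it->second;
  }

  double gauge(const std::string& name) const
  {
    auto it = gauges_.find(name);
    return it == gauges_.end() ? 0.0 : it->second;
  }

private:
  std::map<std::string, uint64_t> counters_;
  std::map<std::string, double> gauges_;
};
```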


