[jira] [Commented] (MESOS-9502) IOswitchboard cleanup could get stuck.

2018-12-27 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730063#comment-16730063
 ] 

Meng Zhu commented on MESOS-9502:
-

[~jieyu] mentioned that one possibility is pid reuse. When the I/O switchboard 
terminates and the agent restarts, the old switchboard pid is reaped by init. 
After failover the agent reaps on the old pid; in the common case this returns 
None() immediately. However, in the corner case where the pid has been reused 
by another process, the agent can get stuck.
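
To illustrate the hazard, here is a minimal simulation, not Mesos code. It 
assumes a process is tracked by its pid alone and models the process table as 
a plain dict; every name in it is hypothetical. It also shows one possible 
mitigation that is not proposed in this thread: recording the process start 
time next to the pid, so that a reused pid is recognized as a different 
process.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ProcessIdentity:
    """A pid plus the start time recorded when the process was first seen."""
    pid: int
    start_time: int


def naive_has_exited(table: Dict[int, int], pid: int) -> bool:
    """Pid-only check: any live process with this pid counts as still running."""
    return pid not in table


def identity_has_exited(table: Dict[int, int], ident: ProcessIdentity) -> bool:
    """Identity check: a live pid with a different start time is another process."""
    return table.get(ident.pid) != ident.start_time


# Hypothetical process table: pid -> start time (arbitrary ticks).
table = {62854: 1000}
switchboard = ProcessIdentity(pid=62854, start_time=1000)

# The switchboard exits while the agent is down and is reaped by init.
del table[62854]
# Later an unrelated process is assigned the same pid.
table[62854] = 5000

print(naive_has_exited(table, 62854))           # False: waiter stays blocked
print(identity_has_exited(table, switchboard))  # True: pid reuse is detected
```

With the pid-only check the waiter keeps seeing a live process and never 
completes, which would match the multi-day gap in the logs if pid reuse is 
indeed the cause.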

> IOswitchboard cleanup could get stuck.
> --
>
> Key: MESOS-9502
> URL: https://issues.apache.org/jira/browse/MESOS-9502
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.7.0
>Reporter: Meng Zhu
>Priority: Critical
>
> Our check container got stuck during destroy, which in turn blocked the 
> parent container. The destroy is blocked by the I/O switchboard cleanup:
> 1223 18:04:41.00 16269 switchboard.cpp:814] Sending SIGTERM to I/O 
> switchboard server (pid: 62854) since container 
> 4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e
>  is being destroyed
> 
> 1227 04:45:38.00  5189 switchboard.cpp:916] I/O switchboard server 
> process for container 
> 4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e
>  has terminated (status=N/A)
> Note the timestamps: the termination was logged more than three days after 
> SIGTERM was sent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9502) IOswitchboard cleanup could get stuck.

2018-12-27 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9502:
---

 Summary: IOswitchboard cleanup could get stuck.
 Key: MESOS-9502
 URL: https://issues.apache.org/jira/browse/MESOS-9502
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 1.7.0
Reporter: Meng Zhu


Our check container got stuck during destroy, which in turn blocked the parent 
container. The destroy is blocked by the I/O switchboard cleanup:

1223 18:04:41.00 16269 switchboard.cpp:814] Sending SIGTERM to I/O 
switchboard server (pid: 62854) since container 
4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e
 is being destroyed

1227 04:45:38.00  5189 switchboard.cpp:916] I/O switchboard server process 
for container 
4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e
 has terminated (status=N/A)

Note the timestamps: the termination was logged more than three days after 
SIGTERM was sent.





[jira] [Assigned] (MESOS-7023) IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky

2018-12-27 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7023:
--

Assignee: (was: Kevin Klues)

> IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed is flaky
> ---
>
> Key: MESOS-7023
> URL: https://issues.apache.org/jira/browse/MESOS-7023
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, test
>Affects Versions: 1.2.2
> Environment: ASF CI, cmake, gcc, Ubuntu 14.04, without libevent/SSL
>Reporter: Greg Mann
>Priority: Major
>  Labels: debugging, disabled-test, flaky
> Attachments: IOSwitchboardTest.RecoverThenKillSwitchboardContainerDestroyed.txt
>
>
> This was observed on ASF CI:
> {code}
> /mesos/src/tests/containerizer/io_switchboard_tests.cpp:1052: Failure
> Value of: statusFailed->reason()
>   Actual: 1
> Expected: TaskStatus::REASON_IO_SWITCHBOARD_EXITED
> Which is: 27
> {code}
> Find full log attached.





[jira] [Assigned] (MESOS-8252) MasterAuthorizationTest.SlaveRemovedLost is flaky.

2018-12-27 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-8252:
--

Assignee: (was: Alexander Rojas)

> MasterAuthorizationTest.SlaveRemovedLost is flaky.
> --
>
> Key: MESOS-8252
> URL: https://issues.apache.org/jira/browse/MESOS-8252
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test
> Attachments: SlaveRemovedLost-badrun.txt
>
>
> Observed it in the internal CI today. Most likely related to the recent 
> introduction of {{Abandoned}} future state. Full log attached.





[jira] [Commented] (MESOS-9491) There exists no way to statically configure a weight for a Mesos role

2018-12-27 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729570#comment-16729570
 ] 

Alexander Rukletsov commented on MESOS-9491:


[~bbannier] Why do you think static configuration would be useful? We wanted 
to move away from the concept of statically defining roles in a cluster.

> There exists no way to statically configure a weight for a Mesos role
> -
>
> Key: MESOS-9491
> URL: https://issues.apache.org/jira/browse/MESOS-9491
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Bannier
>Priority: Major
>
> While it is possible to change the weight of any role at runtime over the 
> operator API, it seems we currently have no supported way to configure this 
> statically with configuration flags. Both the {{--weights}} and {{--roles}} 
> flags would in principle allow this, but both are deprecated.


