[jira] [Created] (MESOS-9560) ContentType/AgentAPITest.MarkResourceProviderGone/1 is flaky

2019-02-06 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-9560:
---

 Summary: ContentType/AgentAPITest.MarkResourceProviderGone/1 is 
flaky
 Key: MESOS-9560
 URL: https://issues.apache.org/jira/browse/MESOS-9560
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Bannier


We observed a segfault in 
{{ContentType/AgentAPITest.MarkResourceProviderGone/1}} on test teardown.
{noformat}
I0131 23:55:59.378453  6798 slave.cpp:923] Agent terminating
I0131 23:55:59.378813 31143 master.cpp:1269] Agent 
a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 
(ip-172-16-10-236.ec2.internal) disconnected
I0131 23:55:59.378831 31143 master.cpp:3272] Disconnecting agent 
a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 
(ip-172-16-10-236.ec2.internal)
I0131 23:55:59.378846 31143 master.cpp:3291] Deactivating agent 
a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 
(ip-172-16-10-236.ec2.internal)
I0131 23:55:59.378891 31143 hierarchical.cpp:793] Agent 
a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 deactivated
F0131 23:55:59.378891 31149 logging.cpp:67] RAW: Pure virtual method called
@ 0x7f633aaaebdd  google::LogMessage::Fail()
@ 0x7f633aab6281  google::RawLog__()
@ 0x7f6339821262  __cxa_pure_virtual
@ 0x55671cacc113  
testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith()
@ 0x55671b532e78  
mesos::internal::tests::resource_provider::MockResourceProvider<>::disconnected()
@ 0x7f633978f6b0  process::AsyncExecutorProcess::execute<>()
@ 0x7f633979f218  
_ZN5cpp176invokeIZN7process8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEES9_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSE_FSB_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteISP_EEOS7_PNS1_11ProcessBaseEE_JSS_S7_SV_EEEDTclcl7forwardISB_Efp_Espcl7forwardIT0_Efp0_EEEOSB_DpOSX_
@ 0x7f633a9f5d01  process::ProcessBase::consume()
@ 0x7f633aa1a08a  process::ProcessManager::resume()
@ 0x7f633aa1db06  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
@ 0x7f633acc9f80  execute_native_thread_routine
@ 0x7f6337142e25  start_thread
@ 0x7f6336241bad  __clone
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)

2019-02-06 Thread Jeff Pollard (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762251#comment-16762251
 ] 

Jeff Pollard commented on MESOS-9555:
-

It sounds like another engineer is going to start our upgrade to 1.5.2 in the 
next day or two. I'll report back on whether the upgrade stops the crashes.

> Check failed: reservationScalarQuantities.contains(role)
> 
>
> Key: MESOS-9555
> URL: https://issues.apache.org/jira/browse/MESOS-9555
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Affects Versions: 1.5.0
> Environment: * Mesos 1.5
>  * {{DISTRIB_ID=Ubuntu}}
>  * {{DISTRIB_RELEASE=16.04}}
>  * {{DISTRIB_CODENAME=xenial}}
>  * {{DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"}}
>Reporter: Jeff Pollard
>Priority: Critical
>
> We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since 
> then have been getting periodic master crashes due to this error:
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 
> hierarchical.cpp:2630] Check failed: 
> reservationScalarQuantities.contains(role){code}
> Full stack trace is at the end of this issue description. When the master 
> fails, we automatically restart it and it rejoins the cluster just fine. I 
> did some initial searching and was unable to find any existing bug reports or 
> other people experiencing this issue. We run a cluster of 3 masters, and see 
> crashes on all 3 instances.
> Right before the crash, we saw a {{Removed agent:...}} log line noting that 
> it was agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 that was removed.
> {code:java}
> 294929:Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205 
> 15:53:57.384759 8432 master.cpp:9893] Removed agent 
> 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051 
> (10.0.18.78): the agent unregistered{code}
> I saved the full log from the master, so happy to provide more info from it, 
> or anything else about our current environment.
> Full stack trace is below.
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d 
> google::LogMessage::Fail()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 
> google::LogMessage::SendToLog()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 
> google::LogMessage::Flush()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 
> google::LogMessageFatal::~LogMessageFatal()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd 
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd 
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 
> process::ProcessBase::consume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a 
> process::ProcessManager::resume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba 
> start_thread
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d 
> (unknown){code}





[jira] [Commented] (MESOS-9180) tasks get stuck in TASK_KILLING on the default executor

2019-02-06 Thread Gilbert Song (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762217#comment-16762217
 ] 

Gilbert Song commented on MESOS-9180:
-

[~Kirill P], could you add the agent logs for triaging? This may also be 
related to the recently fixed stuck-task issue caused by an FD leak (MESOS-9151 
and MESOS-9501); could you please upgrade and verify whether you still see this 
issue?

> tasks get stuck in TASK_KILLING on the default executor
> ---
>
> Key: MESOS-9180
> URL: https://issues.apache.org/jira/browse/MESOS-9180
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.6.1
> Environment: Ubuntu 18.04, Ubuntu 16.04
>Reporter: Kirill Plyashkevich
>Priority: Critical
>  Labels: containerization
>
> During our load tests, tasks get stuck in the TASK_KILLING state.
> {quote}{noformat}
> I0823 16:30:20.367563 21608 executor.cpp:192] Version: 1.6.1
> I0823 16:30:20.439478 21684 default_executor.cpp:202] Received SUBSCRIBED 
> event
> I0823 16:30:20.441012 21684 default_executor.cpp:206] Subscribed executor on 
> XX.XXX.XX.XXX
> I0823 16:30:20.916216 21665 default_executor.cpp:202] Received LAUNCH_GROUP 
> event
> I0823 16:30:20.917373 21645 default_executor.cpp:426] Setting 
> 'MESOS_CONTAINER_IP' to: 172.26.10.222
> I0823 16:30:22.573794 21658 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:22.575518 21637 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:22.577137 21665 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:33.091509 21642 default_executor.cpp:661] Finished launching 
> tasks [ 
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery
>  ] in child containers [ 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8, 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7, 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b ]
> I0823 16:30:33.091567 21642 default_executor.cpp:685] Waiting on child 
> containers of tasks [ 
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery
>  ]
> I0823 16:30:33.096014 21647 default_executor.cpp:746] Waiting for child 
> container 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8 of 
> task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
> I0823 16:30:33.096310 21647 default_executor.cpp:746] Waiting for child 
> container 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7 of 
> task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
> I0823 16:30:33.096470 21647 default_executor.cpp:746] Waiting for child 
> container 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b of 
> task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery'
> I0823 16:30:33.521510 21648 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:33.522073 21652 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:33.523569 21679 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:38.593736 21668 checker_process.cpp:814] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
>  (stdout):
> 0
> PONG
> I0823 16:30:38.593777 21668 checker_process.cpp:817] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
>  (stderr):
> I0823 16:30:38.610167 21650 checker_process.cpp:814] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
>  (stdout):
> I0823 16:30:38.610194 21650 checker_process.cpp:817] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
>  (stderr):
> I0823 16:30:38.700561 21681 checker_process.cpp:814] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery'
>  (stdout):
> I0823 16:30:38.700598 21681
> {noformat}{quote}

[jira] [Comment Edited] (MESOS-9473) Add end to end tests for operations on agent default resources.

2019-02-06 Thread JIRA


[ 
https://issues.apache.org/jira/browse/MESOS-9473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762141#comment-16762141
 ] 

Gastón Kleiman edited comment on MESOS-9473 at 2/6/19 9:58 PM:
---

https://reviews.apache.org/r/69910/
https://reviews.apache.org/r/69911/
https://reviews.apache.org/r/69913/


was (Author: gkleiman):
https://reviews.apache.org/r/69910/
https://reviews.apache.org/r/69911/

> Add end to end tests for operations on agent default resources.
> ---
>
> Key: MESOS-9473
> URL: https://issues.apache.org/jira/browse/MESOS-9473
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: foundations, mesosphere, operation-feedback
>
> Making note of particular cases we need to test:
> * Verify that frameworks will receive OPERATION_GONE_BY_OPERATOR for 
> operations on agent default resources when an agent is marked gone
> * Verify that frameworks will receive OPERATION_GONE_BY_OPERATOR when they 
> reconcile operations on agents which have been marked gone





[jira] [Created] (MESOS-9559) OPERATION_UNREACHABLE and OPERATION_GONE_BY_OPERATOR updates don't include the agent/RP IDs

2019-02-06 Thread JIRA
Gastón Kleiman created MESOS-9559:
-

 Summary: OPERATION_UNREACHABLE and OPERATION_GONE_BY_OPERATOR 
updates don't include the agent/RP IDs
 Key: MESOS-9559
 URL: https://issues.apache.org/jira/browse/MESOS-9559
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Gastón Kleiman
Assignee: Gastón Kleiman








[jira] [Created] (MESOS-9558) Track removed agents' operations in the Framework struct

2019-02-06 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9558:


 Summary: Track removed agents' operations in the Framework struct
 Key: MESOS-9558
 URL: https://issues.apache.org/jira/browse/MESOS-9558
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Greg Mann


The master currently relies on the agent ID in operation reconciliation 
requests in order to send {{OPERATION_UNREACHABLE}} and 
{{OPERATION_GONE_BY_OPERATOR}} updates. This also means that for an operation 
that was terminal but unacknowledged at the time of agent removal (i.e., the 
operation was FINISHED but the update was not yet acknowledged), the framework 
will receive OPERATION_UNREACHABLE/OPERATION_GONE_BY_OPERATOR in response to 
reconciliation requests, even though the master could report the actual 
terminal state if it were retained.

We should consider updating the master to track unreachable/gone operations 
explicitly in order to provide better reconciliation information in these cases.





[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)

2019-02-06 Thread Jeff Pollard (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762066#comment-16762066
 ] 

Jeff Pollard commented on MESOS-9555:
-

 

Full log for the master available here: 
[https://www.dropbox.com/s/vzoaf8pva95qs8c/mesos.log?dl=0]

Thanks for the links to the fixed bugs. I wasn't sure whether you had a 
specific issue in mind that ours might match. I'll spend some time reading the 
fixed bugs that look relevant.

I'll need to check internally to see what our schedule is with upgrading Mesos 
and will report back.






[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)

2019-02-06 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762060#comment-16762060
 ] 

Benjamin Mahler commented on MESOS-9555:


[~fluxx] the latest 1.5.x release is 1.5.2. The bug fixes are listed in the 
CHANGELOG here:

1.5.2:
https://github.com/apache/mesos/blob/bb545b338f94c1f1e4e00dc6dbdb9a0484d4f163/CHANGELOG#L890-L948

1.5.1:
https://github.com/apache/mesos/blob/bb545b338f94c1f1e4e00dc6dbdb9a0484d4f163/CHANGELOG#L960-L995

Attaching the logs will still be useful if you have them, as this may not be a 
fixed issue. In particular, it would be great to know what the resources on the 
removed agent(s) that trigger the crash look like.






[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)

2019-02-06 Thread Jeff Pollard (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762048#comment-16762048
 ] 

Jeff Pollard commented on MESOS-9555:
-

[~bmahler], that is correct, we are running 1.5.0, specifically the 
{{mesos_1.5.0-2.0.1.ubuntu1604_amd64.deb}} package.

We do plan to upgrade our cluster in the near term, so we can give 1.5.x a 
shot. Could you link to the previous issue that was fixed? I'd be curious to 
read it to see whether anything in it seems related to our environment.

I'll hold off attaching the log for right now until then.






[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)

2019-02-06 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762009#comment-16762009
 ] 

Benjamin Mahler commented on MESOS-9555:


[~fluxx] can you confirm that you're not using the latest 1.5.x release, i.e. 
that you're using 1.5.0? If so, can you try 1.5.2 to save us potentially 
re-investigating a known issue that's already been fixed?






[jira] [Created] (MESOS-9557) Operations are leaked in Framework struct when agents are removed

2019-02-06 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9557:


 Summary: Operations are leaked in Framework struct when agents are 
removed
 Key: MESOS-9557
 URL: https://issues.apache.org/jira/browse/MESOS-9557
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Greg Mann


Currently, when agents are removed from the master, their operations are not 
removed from the {{Framework}} structs. We should ensure that this occurs in 
all cases.





[jira] [Assigned] (MESOS-9143) MasterQuotaTest.RemoveSingleQuota is flaky.

2019-02-06 Thread Alexander Rukletsov (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-9143:
--

Assignee: (was: Greg Mann)

> MasterQuotaTest.RemoveSingleQuota is flaky.
> ---
>
> Key: MESOS-9143
> URL: https://issues.apache.org/jira/browse/MESOS-9143
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: RemoveSingleQuota-badrun.txt
>
>
> {noformat}
> ../../src/tests/master_quota_tests.cpp:493
> Value of: metrics.at(metricKey).isNone()
>   Actual: false
> Expected: true
> {noformat}





[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)

2019-02-06 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761935#comment-16761935
 ] 

Benjamin Mahler commented on MESOS-9555:


Can you attach the logs?

> Check failed: reservationScalarQuantities.contains(role)
> 
>
> Key: MESOS-9555
> URL: https://issues.apache.org/jira/browse/MESOS-9555
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.0
> Environment: * Mesos 1.5
>  * {{DISTRIB_ID=Ubuntu}}
>  * {{DISTRIB_RELEASE=16.04}}
>  * {{DISTRIB_CODENAME=xenial}}
>  * {{DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"}}
>Reporter: Jeff Pollard
>Priority: Critical
>
> We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since 
> then have been getting periodic master crashes due to this error:
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 
> hierarchical.cpp:2630] Check failed: 
> reservationScalarQuantities.contains(role){code}
> Full stack trace is at the end of this issue description. When the master 
> fails, we automatically restart it and it rejoins the cluster just fine. I 
> did some initial searching and was unable to find any existing bug reports or 
> other people experiencing this issue. We run a cluster of 3 masters, and see 
> crashes on all 3 instances.
> Right before the crash, we saw a {{Removed agent:...}} log line noting that 
> it was agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 that was removed.
> {code:java}
> 294929:Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205 
> 15:53:57.384759 8432 master.cpp:9893] Removed agent 
> 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051 
> (10.0.18.78): the agent unregistered{code}
> I saved the full log from the master, so I'm happy to provide more info from 
> it, or anything else about our current environment.
> Full stack trace is below.
> {code:java}
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d 
> google::LogMessage::Fail()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 
> google::LogMessage::SendToLog()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 
> google::LogMessage::Flush()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 
> google::LogMessageFatal::~LogMessageFatal()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd 
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd 
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 
> process::ProcessBase::consume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a 
> process::ProcessManager::resume()
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown)
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba 
> start_thread
> Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d 
> (unknown){code}
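From the stack trace, the crash happens when {{HierarchicalAllocatorProcess::removeSlave()}} calls {{untrackReservations()}} and the CHECK finds no entry for the role. As a rough illustration of the invariant involved (all names below are hypothetical simplifications, not the actual allocator code), a minimal sketch:

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Simplified model of the allocator's per-role reservation bookkeeping.
// Names are hypothetical; the real logic lives in
// src/master/allocator/mesos/hierarchical.cpp.
class ReservationTracker {
public:
  void track(const std::string& role, double cpus) {
    quantities[role] += cpus;
  }

  // Mirrors the failing invariant: untracking a role that was never
  // tracked (or was already fully untracked) trips the CHECK.
  void untrack(const std::string& role, double cpus) {
    if (quantities.count(role) == 0) {
      // In the master this is effectively
      // CHECK(reservationScalarQuantities.contains(role)),
      // which aborts the process.
      throw std::logic_error(
          "Check failed: reservationScalarQuantities.contains(role)");
    }

    quantities[role] -= cpus;
    if (quantities[role] <= 0.0) {
      quantities.erase(role);  // drop empty entries, as the allocator does
    }
  }

private:
  std::map<std::string, double> quantities;
};
```

A sketch like this only shows how the invariant can fire (an untrack with no matching track, e.g. a double-untrack on agent removal); diagnosing the actual path in 1.5 needs the master logs requested above.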





[jira] [Comment Edited] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)

2019-02-06 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761935#comment-16761935
 ] 

Benjamin Mahler edited comment on MESOS-9555 at 2/6/19 5:07 PM:


[~fluxx] Can you attach the logs?


was (Author: bmahler):
Can you attach the logs?






[jira] [Created] (MESOS-9556) Define agent states for cases handled by 'Master::_removeSlave()'

2019-02-06 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9556:


 Summary: Define agent states for cases handled by 
'Master::_removeSlave()'
 Key: MESOS-9556
 URL: https://issues.apache.org/jira/browse/MESOS-9556
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Greg Mann


The {{Master::_removeSlave()}} function currently handles three cases of agent 
removal:
* starting maintenance on an agent, via the {{startMaintenance()}} handler
* an agent submitting a new registration from a previously known IP:port, via 
the {{_registerSlave()}} method
* an agent shutting itself down via an {{UnregisterSlaveMessage}}

In these cases the agent is not transitioned to a new state in the master; it 
is simply removed. We should define agent states for these cases and ensure 
that the master stores these agent IDs and/or agent infos.
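One direction (purely a sketch; these enumerators and function names are hypothetical and do not exist in Mesos today) is to model the three removal paths as an explicit retained state rather than erasing the agent:

```cpp
#include <cassert>

// Hypothetical sketch only. The three removal paths handled by
// Master::_removeSlave() currently just erase the agent; this models
// keeping it around in an explicit state instead.
enum class AgentState {
  CONNECTED,
  DISCONNECTED,
  UNREACHABLE,
  DRAINED,  // proposed: removed, but may come back later
};

enum class RemovalReason {
  MAINTENANCE_STARTED,     // startMaintenance() handler
  REREGISTERED_SAME_ADDR,  // new registration from a known IP:port
  UNREGISTERED,            // UnregisterSlaveMessage
};

// Every current removal case maps to the same retained state, while the
// reason (and the agent ID/info) would be stored for later queries.
AgentState onRemoveSlave(RemovalReason reason) {
  (void)reason;  // would be recorded alongside the agent info
  return AgentState::DRAINED;
}
```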






[jira] [Commented] (MESOS-9556) Define agent states for cases handled by 'Master::_removeSlave()'

2019-02-06 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761918#comment-16761918
 ] 

Greg Mann commented on MESOS-9556:
--

Perhaps a new state like {{SLAVE_DOWN}} or {{SLAVE_DRAINED}} might make sense? 
This would represent the case where an agent has been removed but may come back 
after a time.

> Define agent states for cases handled by 'Master::_removeSlave()'
> -
>
> Key: MESOS-9556
> URL: https://issues.apache.org/jira/browse/MESOS-9556
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Greg Mann
>Priority: Major
>  Labels: agent-lifecycle, foundations, mesosphere
>
> The {{Master::_removeSlave()}} function currently handles three cases of 
> agent removal:
> * Starting maintenance on an agent via the 'startMaintenance()' handler
> * When an agent submits a new registration from a previously-known IP:port, 
> via the _registerSlave() method
> * When an agent shuts itself down via an UnregisterSlaveMessage
> In these cases the agent is not transitioned to a new state in the master; 
> it is simply removed. We should define agent states for these cases and 
> ensure that the master stores these agent IDs and/or agent infos.





[jira] [Assigned] (MESOS-8887) Improve the master registry GC on task state transitioning.

2019-02-06 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8887:
-

Assignee: Vinod Kone

> Improve the master registry GC on task state transitioning.
> ---
>
> Key: MESOS-8887
> URL: https://issues.apache.org/jira/browse/MESOS-8887
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Gilbert Song
>Assignee: Vinod Kone
>Priority: Major
>  Labels: mesosphere, partition, registry
>
> Unreachable agents are garbage-collected from the master registry after the 
> `--registry_max_agent_age` duration, or once `--registry_max_agent_count` is 
> exceeded. When the GC happens, the agent is removed from the master's 
> unreachable agent list, but its corresponding tasks remain in UNREACHABLE 
> state in the framework struct (though they are removed from 
> `slaves.unreachableTasks`). We should instead remove those tasks everywhere, 
> or transition them to a terminal state, either TASK_LOST or TASK_GONE 
> (further discussion is needed to define the semantics).
> This improvement relates to how we want to couple task state updates with 
> the GC of agents; right now they are somewhat decoupled.
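The transition being discussed could look roughly like the following (a sketch under assumed names; the real task bookkeeping lives in the master's framework and agent structs, and the choice between TASK_LOST and TASK_GONE is still open):

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of the fix this ticket discusses: when the registry
// GCs an unreachable agent, also move that agent's tasks out of
// TASK_UNREACHABLE instead of leaving them behind in the framework struct.
enum class TaskState { TASK_RUNNING, TASK_UNREACHABLE, TASK_LOST };

// `tasks` maps taskId -> state for tasks that were on the GC'ed agent.
void onAgentGarbageCollected(std::map<std::string, TaskState>& tasks) {
  for (auto& entry : tasks) {
    if (entry.second == TaskState::TASK_UNREACHABLE) {
      // TASK_GONE is the other candidate terminal state; the ticket
      // leaves the exact semantics for further discussion.
      entry.second = TaskState::TASK_LOST;
    }
  }
}
```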





[jira] [Commented] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.

2019-02-06 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761903#comment-16761903
 ] 

Vinod Kone commented on MESOS-8096:
---

Observed this with 
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TaskGroupsSharingViaSandboxVolumes/2

{code}
...
...
I0206 05:23:37.884572 19578 task_status_update_manager.cpp:383] Forwarding task 
status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) 
for task producer o
f framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to the agent
I0206 05:23:37.884624 19578 slave.cpp:5808] Forwarding the update TASK_FINISHED 
(Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of 
framework ffd3400c-13b0-4d
40-b63a-f4d3efc720de- to master@172.16.10.36:45979
I0206 05:23:37.884678 19578 slave.cpp:5701] Task status update manager 
successfully handled status update TASK_FINISHED (Status UUID: 
2612f9b7-a190-4924-b40a-8193bced2dd8) for tas
k producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.884764 19578 master.cpp:8516] Status update TASK_FINISHED 
(Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of 
framework ffd3400c-13b0-4d40-b63a
-f4d3efc720de- from agent ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 at 
slave(1170)@172.16.10.36:45979 (ip-172-16-10-36.ec2.internal)
I0206 05:23:37.884784 19578 master.cpp:8573] Forwarding status update 
TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task 
producer of framework ffd3400c-13b
0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.884881 19578 master.cpp:11210] Updating the state of task 
producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- (latest state: 
TASK_FINISHED, status updat
e state: TASK_FINISHED)
I0206 05:23:37.885048 19577 hierarchical.cpp:1230] Recovered cpus(allocated: 
default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):0.1; 
mem(allocated: default-role)
(reservations: [(DYNAMIC,default-role,test-principal)]):32; disk(allocated: 
default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):32 (total: 
cpus:1.7; mem:928; disk
:928; ports:[31000-32000]; cpus(reservations: 
[(DYNAMIC,default-role,test-principal)]):0.3; mem(reservations: 
[(DYNAMIC,default-role,test-principal)]):96; disk(reservations: [(DYN
AMIC,default-role,test-principal)]):95; disk(reservations: 
[(DYNAMIC,default-role,test-principal)])[executor:executor_volume_path]:1, 
allocated: disk(allocated: default-role)(rese
rvations: 
[(DYNAMIC,default-role,test-principal)])[executor:executor_volume_path]:1; 
disk(allocated: default-role)(reservations: 
[(DYNAMIC,default-role,test-principal)]):63; mem(a
llocated: default-role)(reservations: 
[(DYNAMIC,default-role,test-principal)]):64; cpus(allocated: 
default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):0.2) on age
nt ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 from framework 
ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885195 19572 scheduler.cpp:845] Enqueuing event UPDATE received 
from http://172.16.10.36:45979/master/api/v1/scheduler
I0206 05:23:37.885380 19571 scheduler.cpp:248] Sending ACKNOWLEDGE call to 
http://172.16.10.36:45979/master/api/v1/scheduler
I0206 05:23:37.885645 19572 task_status_update_manager.cpp:328] Received task 
status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) 
for task consumer of 
framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885682 19572 task_status_update_manager.cpp:383] Forwarding task 
status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) 
for task consumer o
f framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to the agent
I0206 05:23:37.885735 19572 slave.cpp:5808] Forwarding the update TASK_FINISHED 
(Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of 
framework ffd3400c-13b0-4d
40-b63a-f4d3efc720de- to master@172.16.10.36:45979
I0206 05:23:37.885792 19572 slave.cpp:5701] Task status update manager 
successfully handled status update TASK_FINISHED (Status UUID: 
2dd9e000-d74f-4d94-ad72-0b313492) for tas
k consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885802 19578 process.cpp:3588] Handling HTTP event for process 
'master' with path: '/master/api/v1/scheduler'
I0206 05:23:37.885885 19578 master.cpp:8516] Status update TASK_FINISHED 
(Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of 
framework ffd3400c-13b0-4d40-b63a
-f4d3efc720de- from agent ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 at 
slave(1170)@172.16.10.36:45979 (ip-172-16-10-36.ec2.internal)
I0206 05:23:37.885905 19578 master.cpp:8573] Forwarding status update 
TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task 
consumer of framework ffd3400c-13b
0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885991 19578 master.cpp:11210] Updating the state of task 

[jira] [Commented] (MESOS-8796) Some GroupTest.* are flaky on Mac.

2019-02-06 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761769#comment-16761769
 ] 

Vinod Kone commented on MESOS-8796:
---

Saw this again on internal CI (on Mac).
{code}
[ RUN  ] GroupTest.GroupPathWithRestrictivePerms
I0205 21:14:33.530055 296834496 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 50946
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/
RELEASE_X86_64
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/bui
ld
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:50946 sessionTimeout=1 
watcher=0x1145565d0 sessionId=0 s
essionPasswd= context=0x7fb3e0c9bc90 flags=0
2019-02-05 21:14:33,530:8369(0x73fcf000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:50946]
2019-02-05 21:14:33,532:8369(0x73fcf000):ZOO_INFO@check_events@1811: 
session establishment complete on server [127.0.0.1:50946], 
sessionId=0x168c13aa8b9, negotiated timeou
t=1
2019-02-05 
21:14:36,875:8369(0x73fcf000):ZOO_INFO@auth_completion_func@1327: 
Authentication scheme digest succeeded
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/
RELEASE_X86_64
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@774: Client 
environment:user.name=jenkins
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@782: Client 
environment:user.home=/Users/jenkins
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@794: Client 
environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/bui
ld
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@zookeeper_init@827: 
Initiating client connection, host=127.0.0.1:50946 sessionTimeout=1 
watcher=0x1145565d0 sessionId=0 s
essionPasswd= context=0x7fb3e0a4db10 flags=0
2019-02-05 21:14:36,879:8369(0x74767000):ZOO_INFO@check_events@1764: 
initiated connection to server [127.0.0.1:50946]
2019-02-05 21:14:36,880:8369(0x74767000):ZOO_INFO@check_events@1811: 
session establishment complete on server [127.0.0.1:50946], 
sessionId=0x168c13aa8b90001, negotiated timeou
t=1
I0205 21:14:36.880167 55189504 group.cpp:341] Group process 
(zookeeper-group(48)@10.0.49.4:65013) connected to ZooKeeper
I0205 21:14:36.880213 55189504 group.cpp:831] Syncing group operations: queue 
size (joins, cancels, datas) = (1, 0, 0)
I0205 21:14:36.880225 55189504 group.cpp:395] Authenticating with ZooKeeper 
using digest
2019-02-05 
21:14:40,222:8369(0x74767000):ZOO_INFO@auth_completion_func@1327: 
Authentication scheme digest succeeded
I0205 21:14:40.24 55189504 group.cpp:419] Trying to create path 
'/read-only' in ZooKeeper
2019-02-05 21:14:40,223:8369(0x736ae000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@765: Client 
environment:os.arch=18.2.0
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@766: Client 
environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 
2018; root:xnu-4903.231.4~2/
RELEASE_X86_64
2019-02-05 

[jira] [Commented] (MESOS-8266) MasterMaintenanceTest.AcceptInvalidInverseOffer is flaky.

2019-02-06 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761770#comment-16761770
 ] 

Vinod Kone commented on MESOS-8266:
---

Observed this on internal CI.

{code}
[ RUN  ] MasterMaintenanceTest.AcceptInvalidInverseOffer
I0206 05:13:46.592031 27319 cluster.cpp:174] Creating default 'local' authorizer
I0206 05:13:46.593217 27341 master.cpp:414] Master 
9ee5ab9a-1898-4ba6-a7f3-0093d03b19f8 (ip-172-16-10-145.ec2.internal) started on 
172.16.10.145:36957
I0206 05:13:46.593240 27341 master.cpp:417] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator
="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwri
te="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/cBTYhp/credentials" 
--filter_gpu_resources="true" --framework_s
orter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize=
"true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_operator_event_stream
_subscribers="1000" --max_unreachable_tasks_per_framework="1000" 
--memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" 
--port="5050" --publish_per_framework_me
trics="true" --quiet="false" --recovery_agent_removal_limit="100%" 
--registry="in_memory" --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" --registry_max_agent_age
="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="t
rue" --version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/cBTYhp/master" --zk_session_timeout="10secs"
I0206 05:13:46.593377 27341 master.cpp:466] Master only allowing authenticated 
frameworks to register
I0206 05:13:46.593385 27341 master.cpp:472] Master only allowing authenticated 
agents to register
I0206 05:13:46.593391 27341 master.cpp:478] Master only allowing authenticated 
HTTP frameworks to register
I0206 05:13:46.593397 27341 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/cBTYhp/credentials'
I0206 05:13:46.593485 27341 master.cpp:522] Using default 'crammd5' 
authenticator
I0206 05:13:46.593521 27341 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0206 05:13:46.593560 27341 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0206 05:13:46.593582 27341 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0206 05:13:46.593605 27341 master.cpp:603] Authorization enabled
I0206 05:13:46.594100 27340 hierarchical.cpp:176] Initialized hierarchical 
allocator process
I0206 05:13:46.594298 27341 whitelist_watcher.cpp:77] No whitelist given
I0206 05:13:46.594842 27344 master.cpp:2103] Elected as the leading master!
I0206 05:13:46.594856 27344 master.cpp:1638] Recovering from registrar
I0206 05:13:46.594935 27344 registrar.cpp:339] Recovering registrar
I0206 05:13:46.595073 27344 registrar.cpp:383] Successfully fetched the 
registry (0B) in 115968ns
I0206 05:13:46.595101 27344 registrar.cpp:487] Applied 1 operations in 6424ns; 
attempting to update the registry
I0206 05:13:46.595223 27344 registrar.cpp:544] Successfully updated the 
registry in 105984ns
I0206 05:13:46.595314 27344 registrar.cpp:416] Successfully recovered registrar
I0206 05:13:46.595392 27344 master.cpp:1752] Recovered 0 agents from the 
registry (176B); allowing 10mins for agents to reregister
I0206 05:13:46.595446 27344 hierarchical.cpp:216] Skipping recovery of 
hierarchical allocator: nothing to recover
W0206 05:13:46.595887 27319 process.cpp:2829] Attempted to spawn already 
running process version@172.16.10.145:36957
I0206 05:13:46.597141 27319 sched.cpp:232] Version: 1.8.0
I0206 05:13:46.597421 27345 sched.cpp:336] New master detected at 
master@172.16.10.145:36957
I0206 05:13:46.597458 27345 sched.cpp:401] Authenticating with master 
master@172.16.10.145:36957
I0206 05:13:46.597509 27345 sched.cpp:408] Using default CRAM-MD5 authenticatee
I0206 05:13:46.597611 27345 authenticatee.cpp:121] Creating new client SASL 
connection
I0206 05:13:46.597707 27345 master.cpp:9902] Authenticating 
scheduler-6e5ae29d-e284-4d9b-bbc2-2df8747428fd@172.16.10.145:36957
I0206 05:13:46.597754 27345 authenticator.cpp:414] Starting authentication 
session for crammd5-authenticatee(459)@172.16.10.145:36957
I0206 05:13:46.597805 27345 authenticator.cpp:98] Creating new server SASL 
connection