[jira] [Created] (MESOS-9560) ContentType/AgentAPITest.MarkResourceProviderGone/1 is flaky
Benjamin Bannier created MESOS-9560: --- Summary: ContentType/AgentAPITest.MarkResourceProviderGone/1 is flaky Key: MESOS-9560 URL: https://issues.apache.org/jira/browse/MESOS-9560 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Bannier We observed a segfault in {{ContentType/AgentAPITest.MarkResourceProviderGone/1}} on test teardown. {noformat} I0131 23:55:59.378453 6798 slave.cpp:923] Agent terminating I0131 23:55:59.378813 31143 master.cpp:1269] Agent a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 (ip-172-16-10-236.ec2.internal) disconnected I0131 23:55:59.378831 31143 master.cpp:3272] Disconnecting agent a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 (ip-172-16-10-236.ec2.internal) I0131 23:55:59.378846 31143 master.cpp:3291] Deactivating agent a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 (ip-172-16-10-236.ec2.internal) I0131 23:55:59.378891 31143 hierarchical.cpp:793] Agent a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 deactivated F0131 23:55:59.378891 31149 logging.cpp:67] RAW: Pure virtual method called @ 0x7f633aaaebdd google::LogMessage::Fail() @ 0x7f633aab6281 google::RawLog__() @ 0x7f6339821262 __cxa_pure_virtual @ 0x55671cacc113 testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith() @ 0x55671b532e78 mesos::internal::tests::resource_provider::MockResourceProvider<>::disconnected() @ 0x7f633978f6b0 process::AsyncExecutorProcess::execute<>() @ 0x7f633979f218 _ZN5cpp176invokeIZN7process8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEES9_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSE_FSB_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteISP_EEOS7_PNS1_11ProcessBaseEE_JSS_S7_SV_EEEDTclcl7forwardISB_Efp_Espcl7forwardIT0_Efp0_EEEOSB_DpOSX_ @ 0x7f633a9f5d01 process::ProcessBase::consume() @ 0x7f633aa1a08a process::ProcessManager::resume() @ 0x7f633aa1db06 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv @ 0x7f633acc9f80 execute_native_thread_routine @ 0x7f6337142e25 start_thread @ 0x7f6336241bad __clone {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)
[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762251#comment-16762251 ] Jeff Pollard commented on MESOS-9555: - So sounds like another engineer is going to start our upgrade process to 1.5.2 in the next day or two. I'll report back if the upgrade stops the crashes. > Check failed: reservationScalarQuantities.contains(role) > > > Key: MESOS-9555 > URL: https://issues.apache.org/jira/browse/MESOS-9555 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Affects Versions: 1.5.0 > Environment: * Mesos 1.5 > * {{DISTRIB_ID=Ubuntu}} > * {{DISTRIB_RELEASE=16.04}} > * {{DISTRIB_CODENAME=xenial}} > * {{DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"}} >Reporter: Jeff Pollard >Priority: Critical > > We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since > then have been getting periodic master crashes due to this error: > {code:java} > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118 8434 > hierarchical.cpp:2630] Check failed: > reservationScalarQuantities.contains(role){code} > Full stack trace is at the end of this issue description. When the master > fails, we automatically restart it and it rejoins the cluster just fine. I > did some initial searching and was unable to find any existing bug reports or > other people experiencing this issue. We run a cluster of 3 masters, and see > crashes on all 3 instances. > Right before the crash, we saw a {{Removed agent:...}} log line noting that > it was agent 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 that was removed. > {code:java} > 294929:Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: I0205 > 15:53:57.384759 8432 master.cpp:9893] Removed agent > 9b912afa-1ced-49db-9c85-7bc5a22ef072-S6 at slave(1)@10.0.18.78:5051 > (10.0.18.78): the agent unregistered{code} > I saved the full log from the master, so happy to provide more info from it, > or anything else about our current environment. 
> Full stack trace is below. > {code:java} > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170a7d > google::LogMessage::Fail() > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9172830 > google::LogMessage::SendToLog() > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9170663 > google::LogMessage::Flush() > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e9173259 > google::LogMessageFatal::~LogMessageFatal() > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8443cbd > mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations() > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e8448fcd > mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave() > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90c4f11 > process::ProcessBase::consume() > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90dea4a > process::ProcessManager::resume() > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e90e25d6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e6700c80 (unknown) > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5f136ba > start_thread > Feb 5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @ 0x7f87e5c4941d > (unknown){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9180) tasks get stuck in TASK_KILLING on the default executor
[ https://issues.apache.org/jira/browse/MESOS-9180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762217#comment-16762217 ] Gilbert Song commented on MESOS-9180: - [~Kirill P], could you add the agent logs for triaging? This may also be related to the recent stuck-task fixes for an FD leak, MESOS-9151 and MESOS-9501. Could you please upgrade and verify whether you still see this issue? > tasks get stuck in TASK_KILLING on the default executor > --- > > Key: MESOS-9180 > URL: https://issues.apache.org/jira/browse/MESOS-9180 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.6.1 > Environment: Ubuntu 18.04, Ubuntu 16.04 >Reporter: Kirill Plyashkevich >Priority: Critical > Labels: containerization > > during our load tests tasks get stuck in TASK_KILLING state > {quote}{noformat} > I0823 16:30:20.367563 21608 executor.cpp:192] Version: 1.6.1 > I0823 16:30:20.439478 21684 default_executor.cpp:202] Received SUBSCRIBED > event > I0823 16:30:20.441012 21684 default_executor.cpp:206] Subscribed executor on > XX.XXX.XX.XXX > I0823 16:30:20.916216 21665 default_executor.cpp:202] Received LAUNCH_GROUP > event > I0823 16:30:20.917373 21645 default_executor.cpp:426] Setting > 'MESOS_CONTAINER_IP' to: 172.26.10.222 > I0823 16:30:22.573794 21658 default_executor.cpp:202] Received ACKNOWLEDGED > event > I0823 16:30:22.575518 21637 default_executor.cpp:202] Received ACKNOWLEDGED > event > I0823 16:30:22.577137 21665 default_executor.cpp:202] Received ACKNOWLEDGED > event > I0823 16:30:33.091509 21642 default_executor.cpp:661] Finished launching > tasks [ > test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka, > > test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis, > > test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery > ] in child containers [ > 
3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8, > 3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7, > 3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b ] > I0823 16:30:33.091567 21642 default_executor.cpp:685] Waiting on child > containers of tasks [ > test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka, > > test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis, > > test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery > ] > I0823 16:30:33.096014 21647 default_executor.cpp:746] Waiting for child > container > 3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8 of > task > 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka' > I0823 16:30:33.096310 21647 default_executor.cpp:746] Waiting for child > container > 3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7 of > task > 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis' > I0823 16:30:33.096470 21647 default_executor.cpp:746] Waiting for child > container > 3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b of > task > 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery' > I0823 16:30:33.521510 21648 default_executor.cpp:202] Received ACKNOWLEDGED > event > I0823 16:30:33.522073 21652 default_executor.cpp:202] Received ACKNOWLEDGED > event > I0823 16:30:33.523569 21679 default_executor.cpp:202] Received ACKNOWLEDGED > event > I0823 16:30:38.593736 21668 checker_process.cpp:814] Output of the COMMAND > health check for task > 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis' > (stdout): > 0 > PONG > I0823 16:30:38.593777 21668 checker_process.cpp:817] Output of the COMMAND > 
health check for task > 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis' > (stderr): > I0823 16:30:38.610167 21650 checker_process.cpp:814] Output of the COMMAND > health check for task > 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka' > (stdout): > I0823 16:30:38.610194 21650 checker_process.cpp:817] Output of the COMMAND > health check for task > 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka' > (stderr): > I0823 16:30:38.700561 21681 checker_process.cpp:814] Output of the COMMAND > health check for task > 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery' > (stdout): > I0823 16:30:38.700598 21681
[jira] [Comment Edited] (MESOS-9473) Add end to end tests for operations on agent default resources.
[ https://issues.apache.org/jira/browse/MESOS-9473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762141#comment-16762141 ] Gastón Kleiman edited comment on MESOS-9473 at 2/6/19 9:58 PM: --- https://reviews.apache.org/r/69910/ https://reviews.apache.org/r/69911/ https://reviews.apache.org/r/69913/ was (Author: gkleiman): https://reviews.apache.org/r/69910/ https://reviews.apache.org/r/69911/ > Add end to end tests for operations on agent default resources. > --- > > Key: MESOS-9473 > URL: https://issues.apache.org/jira/browse/MESOS-9473 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman >Priority: Major > Labels: foundations, mesosphere, operation-feedback > > Making note of particular cases we need to test: > * Verify that frameworks will receive OPERATION_GONE_BY_OPERATOR for > operations on agent default resources when an agent is marked gone > * Verify that frameworks will receive OPERATION_GONE_BY_OPERATOR when they > reconcile operations on agents which have been marked gone -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9559) OPERATION_UNREACHABLE and OPERATION_GONE_BY_OPERATOR updates don't include the agent/RP IDs
Gastón Kleiman created MESOS-9559: - Summary: OPERATION_UNREACHABLE and OPERATION_GONE_BY_OPERATOR updates don't include the agent/RP IDs Key: MESOS-9559 URL: https://issues.apache.org/jira/browse/MESOS-9559 Project: Mesos Issue Type: Bug Components: master Reporter: Gastón Kleiman Assignee: Gastón Kleiman -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9558) Track removed agents' operations in the Framework struct
Greg Mann created MESOS-9558: Summary: Track removed agents' operations in the Framework struct Key: MESOS-9558 URL: https://issues.apache.org/jira/browse/MESOS-9558 Project: Mesos Issue Type: Improvement Components: master Reporter: Greg Mann The master currently relies on the agent ID in operation reconciliation requests in order to send {{OPERATION_UNREACHABLE}} and {{OPERATION_GONE_BY_OPERATOR}} updates. This also means that for an operation that was terminal but unacknowledged at the time of agent removal (i.e. the operation was FINISHED but the update had not yet been acknowledged), the framework will receive OPERATION_UNREACHABLE/OPERATION_GONE_BY_OPERATOR in response to reconciliation requests, even though the master could report the actual terminal state if that state were retained. We should consider updating the master to track unreachable/gone operations explicitly in order to provide better reconciliation information in these cases.
[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)
[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762066#comment-16762066 ] Jeff Pollard commented on MESOS-9555: - The full log for the master is available here: [https://www.dropbox.com/s/vzoaf8pva95qs8c/mesos.log?dl=0] Thanks for the links to the fixed bugs. I wasn't sure whether you had a specific issue in mind that ours might be a duplicate of; I'll spend some time reading the fixed bugs that look relevant. I'll need to check internally on our schedule for upgrading Mesos and will report back.
[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)
[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762060#comment-16762060 ] Benjamin Mahler commented on MESOS-9555: [~fluxx] the latest 1.5.x release is 1.5.2. The bug fixes are listed in the CHANGELOG here: 1.5.2: https://github.com/apache/mesos/blob/bb545b338f94c1f1e4e00dc6dbdb9a0484d4f163/CHANGELOG#L890-L948 1.5.1: https://github.com/apache/mesos/blob/bb545b338f94c1f1e4e00dc6dbdb9a0484d4f163/CHANGELOG#L960-L995 Attaching the logs will still be useful if you have them, as this may not be a fixed issue. In particular it would be great to know what the resources on the removed agent(s) that trigger the crash look like.
[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)
[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762048#comment-16762048 ] Jeff Pollard commented on MESOS-9555: - [~bmahler], that is correct, we are running 1.5.0 - specifically the {{mesos_1.5.0-2.0.1.ubuntu1604_amd64.deb}} package. We do plan to upgrade our cluster in the near term, so we can give 1.5.x a shot. Could you link to the previous issue that was fixed? I'd be curious to read it and see whether anything in it seems related to our environment. I'll hold off on attaching the log until then.
[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)
[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762009#comment-16762009 ] Benjamin Mahler commented on MESOS-9555: [~fluxx] can you confirm that you're not using the latest 1.5.x release, i.e. that you're using 1.5.0? If so, can you try with 1.5.2 to save us potentially re-investigating a known issue that's been fixed?
[jira] [Created] (MESOS-9557) Operations are leaked in Framework struct when agents are removed
Greg Mann created MESOS-9557: Summary: Operations are leaked in Framework struct when agents are removed Key: MESOS-9557 URL: https://issues.apache.org/jira/browse/MESOS-9557 Project: Mesos Issue Type: Bug Components: master Reporter: Greg Mann Currently, when agents are removed from the master, their operations are not removed from the {{Framework}} structs. We should ensure that this occurs in all cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9143) MasterQuotaTest.RemoveSingleQuota is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-9143: -- Assignee: (was: Greg Mann) > MasterQuotaTest.RemoveSingleQuota is flaky. > --- > > Key: MESOS-9143 > URL: https://issues.apache.org/jira/browse/MESOS-9143 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Alexander Rukletsov >Priority: Major > Labels: flaky, flaky-test, mesosphere > Attachments: RemoveSingleQuota-badrun.txt > > > {noformat} > ../../src/tests/master_quota_tests.cpp:493 > Value of: metrics.at(metricKey).isNone() > Actual: false > Expected: true > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)
[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761935#comment-16761935 ] Benjamin Mahler commented on MESOS-9555: Can you attach the logs?
[jira] [Comment Edited] (MESOS-9555) Check failed: reservationScalarQuantities.contains(role)
[ https://issues.apache.org/jira/browse/MESOS-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761935#comment-16761935 ]

Benjamin Mahler edited comment on MESOS-9555 at 2/6/19 5:07 PM:

[~fluxx] Can you attach the logs?

was (Author: bmahler): Can you attach the logs?
[jira] [Created] (MESOS-9556) Define agent states for cases handled by 'Master::_removeSlave()'
Greg Mann created MESOS-9556:

Summary: Define agent states for cases handled by 'Master::_removeSlave()'
Key: MESOS-9556
URL: https://issues.apache.org/jira/browse/MESOS-9556
Project: Mesos
Issue Type: Improvement
Components: master
Reporter: Greg Mann

The {{Master::_removeSlave()}} function currently handles three cases of agent removal:
* Starting maintenance on an agent via the 'startMaintenance()' handler
* When an agent submits a new registration from a previously-known IP:port, via the _registerSlave() method
* When an agent shuts itself down via an UnregisterSlaveMessage

In these cases, the agent is not transitioned to a new state in the master; it is simply removed. We should define agent states for these cases and ensure that the master stores these agent IDs and/or agent infos.
[jira] [Commented] (MESOS-9556) Define agent states for cases handled by 'Master::_removeSlave()'
[ https://issues.apache.org/jira/browse/MESOS-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761918#comment-16761918 ]

Greg Mann commented on MESOS-9556:
--

Perhaps a new state like {{SLAVE_DOWN}} or {{SLAVE_DRAINED}} might make sense? This would represent the case where an agent has been removed but may come back after a time.

> Define agent states for cases handled by 'Master::_removeSlave()'
> -
>
>                 Key: MESOS-9556
>                 URL: https://issues.apache.org/jira/browse/MESOS-9556
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Greg Mann
>            Priority: Major
>              Labels: agent-lifecycle, foundations, mesosphere
>
> The {{Master::_removeSlave()}} function currently handles three cases of agent removal:
> * Starting maintenance on an agent via the 'startMaintenance()' handler
> * When an agent submits a new registration from a previously-known IP:port, via the _registerSlave() method
> * When an agent shuts itself down via an UnregisterSlaveMessage
> In these cases, the agent is not transitioned to a new state in the master; it is simply removed. We should define agent states for these cases and ensure that the master stores these agent IDs and/or agent infos.
[jira] [Assigned] (MESOS-8887) Improve the master registry GC on task state transitioning.
[ https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone reassigned MESOS-8887:
-

Assignee: Vinod Kone

> Improve the master registry GC on task state transitioning.
> ---
>
>                 Key: MESOS-8887
>                 URL: https://issues.apache.org/jira/browse/MESOS-8887
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Gilbert Song
>            Assignee: Vinod Kone
>            Priority: Major
>              Labels: mesosphere, partition, registry
>
> Unreachable agents will be gc-ed by the master registry after the `--registry_max_agent_age` duration, or once `--registry_max_agent_count` is exceeded. When the GC happens, the agent will be removed from the master's unreachable agent list, but its corresponding tasks remain in UNREACHABLE state in the framework struct (though they are removed from `slaves.unreachableTasks`). We should instead remove those tasks from everywhere, or transition them to a terminal state, either TASK_LOST or TASK_GONE (further discussion is needed to define the semantics).
> This improvement relates to how we want to couple task updates with the GC of agents. Right now they are somewhat decoupled.
[jira] [Commented] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.
[ https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761903#comment-16761903 ]

Vinod Kone commented on MESOS-8096:
---

Observed this with LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TaskGroupsSharingViaSandboxVolumes/2

{code}
...
...
I0206 05:23:37.884572 19578 task_status_update_manager.cpp:383] Forwarding task status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to the agent
I0206 05:23:37.884624 19578 slave.cpp:5808] Forwarding the update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to master@172.16.10.36:45979
I0206 05:23:37.884678 19578 slave.cpp:5701] Task status update manager successfully handled status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.884764 19578 master.cpp:8516] Status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- from agent ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 at slave(1170)@172.16.10.36:45979 (ip-172-16-10-36.ec2.internal)
I0206 05:23:37.884784 19578 master.cpp:8573] Forwarding status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.884881 19578 master.cpp:11210] Updating the state of task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- (latest state: TASK_FINISHED, status update state: TASK_FINISHED)
I0206 05:23:37.885048 19577 hierarchical.cpp:1230] Recovered cpus(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):0.1; mem(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):32; disk(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):32 (total: cpus:1.7; mem:928; disk:928; ports:[31000-32000]; cpus(reservations: [(DYNAMIC,default-role,test-principal)]):0.3; mem(reservations: [(DYNAMIC,default-role,test-principal)]):96; disk(reservations: [(DYNAMIC,default-role,test-principal)]):95; disk(reservations: [(DYNAMIC,default-role,test-principal)])[executor:executor_volume_path]:1, allocated: disk(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)])[executor:executor_volume_path]:1; disk(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):63; mem(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):64; cpus(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):0.2) on agent ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 from framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885195 19572 scheduler.cpp:845] Enqueuing event UPDATE received from http://172.16.10.36:45979/master/api/v1/scheduler
I0206 05:23:37.885380 19571 scheduler.cpp:248] Sending ACKNOWLEDGE call to http://172.16.10.36:45979/master/api/v1/scheduler
I0206 05:23:37.885645 19572 task_status_update_manager.cpp:328] Received task status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885682 19572 task_status_update_manager.cpp:383] Forwarding task status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to the agent
I0206 05:23:37.885735 19572 slave.cpp:5808] Forwarding the update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to master@172.16.10.36:45979
I0206 05:23:37.885792 19572 slave.cpp:5701] Task status update manager successfully handled status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885802 19578 process.cpp:3588] Handling HTTP event for process 'master' with path: '/master/api/v1/scheduler'
I0206 05:23:37.885885 19578 master.cpp:8516] Status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- from agent ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 at slave(1170)@172.16.10.36:45979 (ip-172-16-10-36.ec2.internal)
I0206 05:23:37.885905 19578 master.cpp:8573] Forwarding status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885991 19578 master.cpp:11210] Updating the state of task
[jira] [Commented] (MESOS-8796) Some GroupTest.* are flaky on Mac.
[ https://issues.apache.org/jira/browse/MESOS-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761769#comment-16761769 ]

Vinod Kone commented on MESOS-8796:
---

Saw this again on internal CI (on Mac).

{code}
[ RUN      ] GroupTest.GroupPathWithRestrictivePerms
I0205 21:14:33.530055 296834496 zookeeper_test_server.cpp:156] Started ZooKeeperTestServer on port 50946
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:50946 sessionTimeout=1 watcher=0x1145565d0 sessionId=0 sessionPasswd= context=0x7fb3e0c9bc90 flags=0
2019-02-05 21:14:33,530:8369(0x73fcf000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:50946]
2019-02-05 21:14:33,532:8369(0x73fcf000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:50946], sessionId=0x168c13aa8b9, negotiated timeout=1
2019-02-05 21:14:36,875:8369(0x73fcf000):ZOO_INFO@auth_completion_func@1327: Authentication scheme digest succeeded
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:50946 sessionTimeout=1 watcher=0x1145565d0 sessionId=0 sessionPasswd= context=0x7fb3e0a4db10 flags=0
2019-02-05 21:14:36,879:8369(0x74767000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:50946]
2019-02-05 21:14:36,880:8369(0x74767000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:50946], sessionId=0x168c13aa8b90001, negotiated timeout=1
I0205 21:14:36.880167 55189504 group.cpp:341] Group process (zookeeper-group(48)@10.0.49.4:65013) connected to ZooKeeper
I0205 21:14:36.880213 55189504 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0205 21:14:36.880225 55189504 group.cpp:395] Authenticating with ZooKeeper using digest
2019-02-05 21:14:40,222:8369(0x74767000):ZOO_INFO@auth_completion_func@1327: Authentication scheme digest succeeded
I0205 21:14:40.24 55189504 group.cpp:419] Trying to create path '/read-only' in ZooKeeper
2019-02-05 21:14:40,223:8369(0x736ae000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2019-02-05
[jira] [Commented] (MESOS-8266) MasterMaintenanceTest.AcceptInvalidInverseOffer is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761770#comment-16761770 ]

Vinod Kone commented on MESOS-8266:
---

Observed this on internal CI.

{code}
[ RUN      ] MasterMaintenanceTest.AcceptInvalidInverseOffer
I0206 05:13:46.592031 27319 cluster.cpp:174] Creating default 'local' authorizer
I0206 05:13:46.593217 27341 master.cpp:414] Master 9ee5ab9a-1898-4ba6-a7f3-0093d03b19f8 (ip-172-16-10-145.ec2.internal) started on 172.16.10.145:36957
I0206 05:13:46.593240 27341 master.cpp:417] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/cBTYhp/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/cBTYhp/master" --zk_session_timeout="10secs"
I0206 05:13:46.593377 27341 master.cpp:466] Master only allowing authenticated frameworks to register
I0206 05:13:46.593385 27341 master.cpp:472] Master only allowing authenticated agents to register
I0206 05:13:46.593391 27341 master.cpp:478] Master only allowing authenticated HTTP frameworks to register
I0206 05:13:46.593397 27341 credentials.hpp:37] Loading credentials for authentication from '/tmp/cBTYhp/credentials'
I0206 05:13:46.593485 27341 master.cpp:522] Using default 'crammd5' authenticator
I0206 05:13:46.593521 27341 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
I0206 05:13:46.593560 27341 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
I0206 05:13:46.593582 27341 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
I0206 05:13:46.593605 27341 master.cpp:603] Authorization enabled
I0206 05:13:46.594100 27340 hierarchical.cpp:176] Initialized hierarchical allocator process
I0206 05:13:46.594298 27341 whitelist_watcher.cpp:77] No whitelist given
I0206 05:13:46.594842 27344 master.cpp:2103] Elected as the leading master!
I0206 05:13:46.594856 27344 master.cpp:1638] Recovering from registrar
I0206 05:13:46.594935 27344 registrar.cpp:339] Recovering registrar
I0206 05:13:46.595073 27344 registrar.cpp:383] Successfully fetched the registry (0B) in 115968ns
I0206 05:13:46.595101 27344 registrar.cpp:487] Applied 1 operations in 6424ns; attempting to update the registry
I0206 05:13:46.595223 27344 registrar.cpp:544] Successfully updated the registry in 105984ns
I0206 05:13:46.595314 27344 registrar.cpp:416] Successfully recovered registrar
I0206 05:13:46.595392 27344 master.cpp:1752] Recovered 0 agents from the registry (176B); allowing 10mins for agents to reregister
I0206 05:13:46.595446 27344 hierarchical.cpp:216] Skipping recovery of hierarchical allocator: nothing to recover
W0206 05:13:46.595887 27319 process.cpp:2829] Attempted to spawn already running process version@172.16.10.145:36957
I0206 05:13:46.597141 27319 sched.cpp:232] Version: 1.8.0
I0206 05:13:46.597421 27345 sched.cpp:336] New master detected at master@172.16.10.145:36957
I0206 05:13:46.597458 27345 sched.cpp:401] Authenticating with master master@172.16.10.145:36957
I0206 05:13:46.597509 27345 sched.cpp:408] Using default CRAM-MD5 authenticatee
I0206 05:13:46.597611 27345 authenticatee.cpp:121] Creating new client SASL connection
I0206 05:13:46.597707 27345 master.cpp:9902] Authenticating scheduler-6e5ae29d-e284-4d9b-bbc2-2df8747428fd@172.16.10.145:36957
I0206 05:13:46.597754 27345 authenticator.cpp:414] Starting authentication session for crammd5-authenticatee(459)@172.16.10.145:36957
I0206 05:13:46.597805 27345 authenticator.cpp:98] Creating new server SASL connection