[jira] [Commented] (MESOS-9580) Master sends inconsistent `UpdateFrameworkMessage` to agents.
[ https://issues.apache.org/jira/browse/MESOS-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770020#comment-16770020 ]

Chun-Hung Hsiao commented on MESOS-9580:

Some more comments about this issue. Although the current behavior creates an inconsistency between the master and its agents, it can handle the following scenario:
1. A framework reregistered with a new user.
2. The master broadcast the new framework info and all agents received it (the inconsistency occurred).
3. The master failed over.
4. Regardless of whether the framework or any agent registered first, the master would always get the same framework info.

> Master sends inconsistent `UpdateFrameworkMessage` to agents.
> -
>
> Key: MESOS-9580
> URL: https://issues.apache.org/jira/browse/MESOS-9580
> Project: Mesos
> Issue Type: Bug
> Components: master
>Reporter: Chun-Hung Hsiao
>Priority: Major
> Labels: foundations
>
> If a framework reregisters with a new user, the master would ignore the user
> update because of MESOS-703:
>
> [https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/framework.cpp#L526-L529]
> However, it would send the framework info *coming from the framework* (i.e.,
> with the new user) to all agents:
>
> [https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L2748-L2757]
>
> [https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L3156-L3162]
> But, when an agent reregistered, the master would send the framework info
> from its in-memory state:
>
> [https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L7827-L7842]
> This would make the framework info inconsistent between the master and some
> of its agents. Although it won't affect executor and task launch (as the
> framework info would be injected into {{RunTask(Group)*Message}}), if there
> is a master failover, a race between framework and agent reregistrations
> would make the new master learn different framework info.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9581) Mesos package naming appears to be undeterministic.
Till Toenshoff created MESOS-9581:
-

Summary: Mesos package naming appears to be undeterministic.
Key: MESOS-9581
URL: https://issues.apache.org/jira/browse/MESOS-9581
Project: Mesos
Issue Type: Bug
Components: build
Affects Versions: 1.7.1
Reporter: Till Toenshoff

Transcribed from Slack: https://mesos.slack.com/archives/C7N086PK2/p1550158266006900

It appears there are a number of RPM packages called “mesos-1.7.1-2.0.1.el7.x86_64.rpm” in the wild. I’ve caught specimens with build dates February 1st, 7th and 13th. While it’s somewhat troubling in itself, none of these packages is the one referred to in Yum repository metadata (repos.mesosphere.com), which is a package built today on the 14th, so I can’t install Mesos right now. Could it be that your pipeline is creating a new package with the same version and release in every nightly build?

Repository metadata
{noformat}
sqlite3 *primary.sqlite "select name, version, release, strftime('%d-%m-%Y %H:%M', datetime(time_build, 'unixepoch')) build_as_string, rpm_buildhost from packages where name = 'mesos' and version = '1.7.1';"
mesos|1.7.1|2.0.1|14-02-2019 12:30|ip-172-16-10-254.ec2.internal

Packages downloaded while investigating over the past few days

Name        : mesos
Version     : 1.7.1
Release     : 2.0.1
Architecture: x86_64
Install Date: (not installed)
Group       : misc
Size        : 298787793
License     : Apache-2.0
Signature   : RSA/SHA256, Fri 01 Feb 2019 11:38:47 PM UTC, Key ID df7d54cbe56151bf
Source RPM  : mesos-1.7.1-2.0.1.src.rpm
Build Date  : Fri 01 Feb 2019 11:15:17 PM UTC
Build Host  : ip-172-16-10-11.ec2.internal
Relocations : /
Packager    : d...@mesos.apache.org
URL         : https://mesos.apache.org/
Summary     : Cluster resource manager with efficient resource isolation
Description : [snip]

Name        : mesos
Version     : 1.7.1
Release     : 2.0.1
Architecture: x86_64
Install Date: (not installed)
Group       : misc
Size        : 298791347
License     : Apache-2.0
Signature   : RSA/SHA256, Thu 07 Feb 2019 10:33:06 PM UTC, Key ID df7d54cbe56151bf
Source RPM  : mesos-1.7.1-2.0.1.src.rpm
Build Date  : Thu 07 Feb 2019 10:31:02 PM UTC
Build Host  : ip-172-16-10-4.ec2.internal
Relocations : /
Packager    : d...@mesos.apache.org
URL         : https://mesos.apache.org/
Summary     : Cluster resource manager with efficient resource isolation
Description : [snip]

Name        : mesos
Version     : 1.7.1
Release     : 2.0.1
Architecture: x86_64
Install Date: (not installed)
Group       : misc
Size        : 298789309
License     : Apache-2.0
Signature   : RSA/SHA256, Wed Feb 13 04:35:02 2019, Key ID df7d54cbe56151bf
Source RPM  : mesos-1.7.1-2.0.1.src.rpm
Build Date  : Wed Feb 13 04:32:41 2019
Build Host  : ip-172-16-10-83.ec2.internal
Relocations : /
Packager    : d...@mesos.apache.org
URL         : https://mesos.apache.org/
Summary     : Cluster resource manager with efficient resource isolation
Description :
{noformat}
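One way to spot the situation described in this report is to group package records by name, version and release and flag any group that has more than one distinct build. The following is only a minimal sketch in Python: the `Package` record and the `find_rebuilt` helper are hypothetical names introduced for illustration, and the sample records are copied from the repository metadata and package listings above, with build dates shortened to a common format and time zones omitted.

```python
from collections import defaultdict
from typing import NamedTuple


class Package(NamedTuple):
    name: str
    version: str
    release: str
    build_date: str  # kept as a plain string, not parsed
    build_host: str


def find_rebuilt(packages):
    """Return {(name, version, release): builds} for every NVR built more than once."""
    builds = defaultdict(set)
    for p in packages:
        builds[(p.name, p.version, p.release)].add((p.build_date, p.build_host))
    return {nvr: sorted(b) for nvr, b in builds.items() if len(b) > 1}


# Records taken from the report above (three downloads plus the repo metadata).
observed = [
    Package("mesos", "1.7.1", "2.0.1", "2019-02-01 23:15", "ip-172-16-10-11.ec2.internal"),
    Package("mesos", "1.7.1", "2.0.1", "2019-02-07 22:31", "ip-172-16-10-4.ec2.internal"),
    Package("mesos", "1.7.1", "2.0.1", "2019-02-13 04:32", "ip-172-16-10-83.ec2.internal"),
    Package("mesos", "1.7.1", "2.0.1", "2019-02-14 12:30", "ip-172-16-10-254.ec2.internal"),
]

rebuilt = find_rebuilt(observed)
for nvr, nvr_builds in rebuilt.items():
    print("-".join(nvr), "was built", len(nvr_builds), "times")
```

A check like this could run before publishing; whether the actual pipeline should bump the release number or refuse to rebuild is a separate decision that the report leaves open.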
[jira] [Created] (MESOS-9580) Master sends inconsistent `UpdateFrameworkMessage` to agents.
Chun-Hung Hsiao created MESOS-9580:
--

Summary: Master sends inconsistent `UpdateFrameworkMessage` to agents.
Key: MESOS-9580
URL: https://issues.apache.org/jira/browse/MESOS-9580
Project: Mesos
Issue Type: Bug
Components: master
Reporter: Chun-Hung Hsiao

If a framework reregisters with a new user, the master would ignore the user update because of MESOS-703:

[https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/framework.cpp#L526-L529]

However, it would send the *original* framework info provided by the framework (i.e., with the new user) to all agents:

[https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L2748-L2757]

[https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L3156-L3162]

But, when an agent reregistered, the master would send the framework info from its in-memory state:

[https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L7827-L7842]

This would make the framework info inconsistent between the master and some of its agents. Although it won't affect executor and task launch (as the framework info would be injected into {{RunTask(Group)*Message}}), if there is a master failover, a race between framework and agent reregistrations would make the new master learn different framework info.
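The inconsistency described in this issue can be reproduced with a toy model. The sketch below is not Mesos code: `FrameworkInfo`, `Master` and the dict-based agents are hypothetical stand-ins that only mirror the three behaviors named above (user update ignored in memory, framework-provided info broadcast, in-memory info sent on agent reregistration).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FrameworkInfo:
    name: str
    user: str


class Master:
    def __init__(self, info):
        self.info = info  # the master's in-memory framework info

    def reregister_framework(self, incoming, agents):
        # The user update is ignored in the in-memory state ...
        self.info = replace(incoming, user=self.info.user)
        # ... but the info as provided by the framework is broadcast.
        for agent in agents:
            agent["framework"] = incoming

    def reregister_agent(self, agent):
        # A reregistering agent receives the in-memory state instead.
        agent["framework"] = self.info


master = Master(FrameworkInfo("fw", user="old"))
agent_a, agent_b = {}, {}

master.reregister_framework(FrameworkInfo("fw", user="new"), [agent_a])
master.reregister_agent(agent_b)

print(agent_a["framework"].user, agent_b["framework"].user, master.info.user)
```

In this model `agent_a` ends up with the new user while `agent_b` and the master keep the old one, which is the divergence a later master failover could expose.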
[jira] [Commented] (MESOS-9191) Docker command executor may stuck at infinite unkillable loop.
[ https://issues.apache.org/jira/browse/MESOS-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769892#comment-16769892 ]

Greg Mann commented on MESOS-9191:
--

Retargeted to 1.6.3.

> Docker command executor may stuck at infinite unkillable loop.
> --
>
> Key: MESOS-9191
> URL: https://issues.apache.org/jira/browse/MESOS-9191
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
>Reporter: Gilbert Song
>Assignee: Andrei Budnik
>Priority: Major
> Labels: containerizer
>
> The change from https://issues.apache.org/jira/browse/MESOS-8574 altered how
> the docker command executor discards the future of docker stop. If a new
> killTask() is invoked while an existing docker stop is still pending, the
> pending one is discarded and the new one is executed. This is OK in most
> cases.
> However, docker stop can take a long time (depending on the grace period and
> whether the application handles SIGTERM). If the framework retries killTask
> more frequently than the grace period (which depends on the kill policy API,
> an env var, or agent flags), the executor may be stuck forever with
> unkillable tasks, because every time, before the docker stop finishes, its
> future is discarded by the new incoming killTask.
> We should consider reusing the grace period before calling discard() on a
> pending docker stop future.
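The timing problem in this issue can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the executor's code: kills arrive at fixed intervals, a stop completes exactly when its grace period elapses, and the function name `stop_completes` and its parameters are hypothetical.

```python
def stop_completes(grace_period, retry_interval, kills, discard_pending):
    """Return True if some stop attempt runs for a full grace period.

    Kills arrive at times 0, retry_interval, 2 * retry_interval, ...
    and the observation ends one interval after the last kill.
    """
    if kills < 1:
        raise ValueError("need at least one kill")
    pending_since = None  # start time of the currently pending stop
    for i in range(kills):
        now = i * retry_interval
        if pending_since is not None and now - pending_since >= grace_period:
            return True  # the pending stop finished before this retry arrived
        if pending_since is None or discard_pending:
            pending_since = now  # a discarded stop restarts the grace period
    end = kills * retry_interval
    return end - pending_since >= grace_period


# Retries every 10s against a 30s grace period, observed over 100 kills.
print(stop_completes(30, 10, 100, discard_pending=True))   # stop never finishes
print(stop_completes(30, 10, 100, discard_pending=False))  # pending stop is reused
```

With discarding enabled and retries shorter than the grace period, no attempt ever runs to completion in this model; keeping the pending stop, which is roughly what the suggestion in the issue amounts to, lets the first attempt finish.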
[jira] [Created] (MESOS-9579) ExecutorHttpApiTest.HeartbeatCalls is flaky.
Till Toenshoff created MESOS-9579: - Summary: ExecutorHttpApiTest.HeartbeatCalls is flaky. Key: MESOS-9579 URL: https://issues.apache.org/jira/browse/MESOS-9579 Project: Mesos Issue Type: Bug Components: executor Affects Versions: 1.8.0 Environment: Centos 6 Reporter: Till Toenshoff I just saw this failing on our internal CI: {noformat} 21:42:35 [ RUN ] ExecutorHttpApiTest.HeartbeatCalls 21:42:35 I0215 21:42:35.917752 17173 executor.cpp:206] Version: 1.8.0 21:42:35 W0215 21:42:35.917771 17173 process.cpp:2829] Attempted to spawn already running process version@172.16.10.166:35439 21:42:35 I0215 21:42:35.918581 17174 executor.cpp:432] Connected with the agent 21:42:35 F0215 21:42:35.918857 17174 owned.hpp:112] Check failed: 'get()' Must be non NULL 21:42:35 *** Check failure stack trace: *** 21:42:35 @ 0x7fb93ce1d1dd google::LogMessage::Fail() 21:42:35 @ 0x7fb93ce1ee7d google::LogMessage::SendToLog() 21:42:35 @ 0x7fb93ce1cdb3 google::LogMessage::Flush() 21:42:35 @ 0x7fb93ce1f879 google::LogMessageFatal::~LogMessageFatal() 21:42:35 @ 0x55e80a099f76 google::CheckNotNull<>() 21:42:35 @ 0x55e80a07dde4 _ZNSt17_Function_handlerIFvvEZN5mesos8internal5tests39ExecutorHttpApiTest_HeartbeatCalls_Test8TestBodyEvEUlvE_E9_M_invokeERKSt9_Any_data 21:42:35 @ 0x7fb93baea260 process::AsyncExecutorProcess::execute<>() 21:42:35 @ 0x7fb93baf62cb _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEESG_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSL_FSI_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteISW_EEOSE_S3_E_JSZ_SE_St12_PlaceholderILi1EEclEOS3_ 21:42:36 @ 0x7fb93cd646b1 process::ProcessBase::consume() 21:42:36 @ 0x7fb93cd794ba process::ProcessManager::resume() 21:42:36 @ 0x7fb93cd7d486 _ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv 21:42:36 @ 0x7fb93d02a1af execute_native_thread_routine 21:42:36 @ 0x7fb939794aa1 
start_thread 21:42:36 @ 0x7fb938b39c4d clone 21:42:36 The test binary has crashed OR the timeout has been exceeded! 21:42:36 ~/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-centos-6 21:42:36 mkswap: /tmp/swapfile: warning: don't erase bootbits sectors 21:42:36 on whole disk. Use -f to force. 21:42:36 Setting up swapspace version 1, size = 8388604 KiB 21:42:36 no label, UUID=dda5aa26-dba6-4ac8-bc6c-41264f510694 21:42:36 gcc (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3) 21:42:36 Copyright (C) 2016 Free Software Foundation, Inc. 21:42:36 This is free software; see the source for copying conditions. There is NO 21:42:36 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 21:42:36 Docker version 1.7.1, build 786b29d 21:42:36 curl 7.61.1 (x86_64-redhat-linux-gnu) libcurl/7.61.1 OpenSSL/1.0.1e zlib/1.2.3 c-ares/1.14.0 libssh2/1.8.0 nghttp2/1.6.0 21:42:36 Release-Date: 2018-09-05{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8887) Unreachable tasks are not GC'ed when unreachable agent is GC'ed.
[ https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769813#comment-16769813 ]

Vinod Kone commented on MESOS-8887:
---

Landed on master:

commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone
Date: Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone
Date: Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed from the registry. This patch fixes it so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone
Date: Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base. No functional changes.

Review: https://reviews.apache.org/r/69907

Backported to 1.7.x

commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9
Author: Vinod Kone
Date: Fri Feb 15 14:33:00 2019 -0600

Added MESOS-8887 to the 1.7.2 CHANGELOG.

commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone
Date: Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone
Date: Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed from the registry. This patch fixes it so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone
Date: Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base. No functional changes.

Review: https://reviews.apache.org/r/69907

> Unreachable tasks are not GC'ed when unreachable agent is GC'ed.
>
>
> Key: MESOS-8887
> URL: https://issues.apache.org/jira/browse/MESOS-8887
> Project: Mesos
> Issue Type: Bug
> Components: master
>Affects Versions: 1.4.3, 1.5.2, 1.6.1, 1.7.1
>Reporter: Gilbert Song
>Assignee: Vinod Kone
>Priority: Major
> Labels: foundations, mesosphere, partition, registry
>
> Unreachable agents are GCed from the master registry after the
> `--registry_max_agent_age` duration or once `--registry_max_agent_count` is
> exceeded. When the GC happens, the agent is removed from the master's
> unreachable agent list, but its corresponding tasks are still in UNREACHABLE
> state in the framework struct (though removed from
> `slaves.unreachableTasks`). We should instead remove those tasks from
> everywhere or transition them to a terminal state, either TASK_LOST or
> TASK_GONE (further discussion is needed to define the semantics).
> This improvement relates to how we want to couple task updates with the GC of
> an agent. Right now they are somewhat decoupled.
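The cleanup gap described in this issue and its fix can be sketched with two plain dictionaries. This is an illustrative Python model, not the master's C++ code: `ToyMaster`, its attribute names, and the `clean_frameworks` switch are hypothetical and exist only to contrast the behavior before and after the patch.

```python
class ToyMaster:
    def __init__(self):
        self.unreachable_agents = set()
        self.slaves_unreachable_tasks = {}     # agent id -> set of task ids
        self.framework_unreachable_tasks = {}  # framework id -> {task id: agent id}

    def mark_unreachable(self, agent_id, tasks):
        """Record (framework_id, task_id) pairs of an agent that went unreachable."""
        self.unreachable_agents.add(agent_id)
        for framework_id, task_id in tasks:
            self.slaves_unreachable_tasks.setdefault(agent_id, set()).add(task_id)
            self.framework_unreachable_tasks.setdefault(framework_id, {})[task_id] = agent_id

    def gc_agent(self, agent_id, clean_frameworks=True):
        """GC an unreachable agent; clean_frameworks=False mimics the old behavior."""
        self.unreachable_agents.discard(agent_id)
        self.slaves_unreachable_tasks.pop(agent_id, None)
        if clean_frameworks:
            for tasks in self.framework_unreachable_tasks.values():
                for task_id in [t for t, a in tasks.items() if a == agent_id]:
                    del tasks[task_id]


old = ToyMaster()
old.mark_unreachable("S1", [("F1", "t1")])
old.gc_agent("S1", clean_frameworks=False)

new = ToyMaster()
new.mark_unreachable("S1", [("F1", "t1")])
new.gc_agent("S1")

print(old.framework_unreachable_tasks, new.framework_unreachable_tasks)
```

In the `old` instance the task lingers in the per-framework map after the agent is gone, whereas the `new` instance removes it from both places.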
[jira] [Commented] (MESOS-8892) MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky
[ https://issues.apache.org/jira/browse/MESOS-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769822#comment-16769822 ] Vinod Kone commented on MESOS-8892: --- Observed this on 1.6.x branch {code} [ RUN ] MasterSlaveReconciliationTest.ReconcileDroppedOperation I0215 21:36:18.921594 4052 cluster.cpp:172] Creating default 'local' authorizer I0215 21:36:18.922894 4057 master.cpp:465] Master 21d3c979-83c3-4141-9a3a-635fd550d45a (ip-172-16-10-236.ec2.internal) started on 172.16.10.236:36326 I0215 21:36:18.922915 4057 master.cpp:468] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator ="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwri te="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/exYTvt/credentials" --filter_gpu_resources="true" --framework_s orter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize= "true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per _framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memo ry" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --reg istry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/exY Tvt/master" 
--zk_session_timeout="10secs" I0215 21:36:18.923121 4057 master.cpp:517] Master only allowing authenticated frameworks to register I0215 21:36:18.923393 4057 master.cpp:523] Master only allowing authenticated agents to register I0215 21:36:18.923408 4057 master.cpp:529] Master only allowing authenticated HTTP frameworks to register I0215 21:36:18.923414 4057 credentials.hpp:37] Loading credentials for authentication from '/tmp/exYTvt/credentials' I0215 21:36:18.923651 4057 master.cpp:573] Using default 'crammd5' authenticator I0215 21:36:18.923777 4057 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0215 21:36:18.923904 4057 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0215 21:36:18.924266 4057 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0215 21:36:18.924465 4057 master.cpp:654] Authorization enabled I0215 21:36:18.924823 4056 hierarchical.cpp:179] Initialized hierarchical allocator process I0215 21:36:18.927826 4058 whitelist_watcher.cpp:77] No whitelist given I0215 21:36:18.928741 4054 master.cpp:2176] Elected as the leading master! 
I0215 21:36:18.928759 4054 master.cpp:1711] Recovering from registrar I0215 21:36:18.928800 4054 registrar.cpp:339] Recovering registrar I0215 21:36:18.929002 4054 registrar.cpp:383] Successfully fetched the registry (0B) in 132096ns I0215 21:36:18.929033 4054 registrar.cpp:487] Applied 1 operations in 7184ns; attempting to update the registry I0215 21:36:18.929154 4058 registrar.cpp:544] Successfully updated the registry in 108032ns I0215 21:36:18.929232 4058 registrar.cpp:416] Successfully recovered registrar I0215 21:36:18.929361 4055 master.cpp:1825] Recovered 0 agents from the registry (176B); allowing 10mins for agents to reregister I0215 21:36:18.929415 4055 hierarchical.cpp:217] Skipping recovery of hierarchical allocator: nothing to recover W0215 21:36:18.931118 4052 process.cpp:2829] Attempted to spawn already running process files@172.16.10.236:36326 I0215 21:36:18.931596 4052 containerizer.cpp:300] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } I0215 21:36:18.934453 4052 linux_launcher.cpp:147] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher I0215 21:36:18.934859 4052 provisioner.cpp:299] Using default backend 'aufs' I0215 21:36:18.935410 4052 cluster.cpp:460] Creating default 'local' authorizer I0215 21:36:18.936164 4060 slave.cpp:259] Mesos agent started on (230)@172.16.10.236:36326 W0215 21:36:18.936399 4052 process.cpp:2829] Attempted to spawn already running process version@172.16.10.236:36326 I0215 21:36:18.936187 4060 slave.cpp:260] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://"; --appc_store_dir="/tmp/exYTvt/GHfic5/store/appc" --authenticate _http_exe
[jira] [Commented] (MESOS-8892) MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky
[ https://issues.apache.org/jira/browse/MESOS-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769824#comment-16769824 ] Vinod Kone commented on MESOS-8892: --- [~bbannier] Can we backport this test fix to 1.6.x branch? > MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky > > > Key: MESOS-8892 > URL: https://issues.apache.org/jira/browse/MESOS-8892 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.6.0 >Reporter: Greg Mann >Assignee: Benjamin Bannier >Priority: Major > Labels: mesosphere > Fix For: 1.7.0 > > Attachments: > MasterSlaveReconciliationTest.ReconcileDroppedOperation.txt > > > This was observed on a Debian 9 SSL/GRPC-enabled build. It appears that a > poorly-timed {{UpdateSlaveMessage}} leads to the operation reconciliation > occurring before the expectation for the {{ReconcileOperationsMessage}} is > registered: > {code} > I0508 00:11:09.700815 22498 master.cpp:4362] Processing ACCEPT call for > offers: [ f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-O0 ] on agent > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 > (localhost) for framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (default) > at scheduler-b0f55e01-2f6f-42c8-8614-901036acfc31@127.0.0.1:36309 > I0508 00:11:09.700870 22498 master.cpp:3602] Authorizing principal > 'test-principal' to reserve resources 'cpus(allocated: > default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):2; > mem(allocated: default-role)(reservations: > [(DYNAMIC,default-role,test-principal)]):1024; disk(allocated: > default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; > ports(allocated: default-role)(reservations: > [(DYNAMIC,default-role,test-principal)]):[31000-32000]' > I0508 00:11:09.701228 22493 master.cpp:4725] Applying RESERVE operation for > resources > 
[{"allocation_info":{"role":"default-role"},"name":"cpus","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":2.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"mem","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"disk","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"type":"RANGES"}] > from framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (default) at > scheduler-b0f55e01-2f6f-42c8-8614-901036acfc31@127.0.0.1:36309 to agent > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 > (localhost) > I0508 00:11:09.701498 22493 master.cpp:11265] Sending operation '' (uuid: > 81dffb62-6e75-4c6c-a97b-41c92c58d6a7) to agent > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 > (localhost) > I0508 00:11:09.701627 22494 slave.cpp:1564] Forwarding agent update > {"operations":{},"resource_version_uuid":{"value":"0HeA06ftS6m76SNoNZNPag=="},"slave_id":{"value":"f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0"},"update_oversubscribed_resources":true} > I0508 00:11:09.701848 22494 master.cpp:7800] Received update of agent > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 > (localhost) with total oversubscribed resources {} > W0508 00:11:09.701905 22494 master.cpp:7974] Performing explicit > reconciliation with agent for known operation > 81dffb62-6e75-4c6c-a97b-41c92c58d6a7 since it was not present in original > reconciliation message from agent > I0508 00:11:09.702085 22494 master.cpp:11015] Updating the state of operation > '' (uuid: 
81dffb62-6e75-4c6c-a97b-41c92c58d6a7) for framework > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (latest state: OPERATION_PENDING, > status update state: OPERATION_DROPPED) > I0508 00:11:09.702239 22491 hierarchical.cpp:925] Updated allocation of > framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- on agent > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 from cpus(allocated: default-role):2; > mem(allocated: default-role):1024; disk(allocated: default-role):1024; > ports(allocated: default-role):[31000-32000] to disk(allocated: > default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; > cpus(allocated: default-role)(reservations: > [(DYNAMIC,default-role,test-principal)]):2; mem(allocated: > default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; > ports(allocated: default-role)(reservations: > [(DYNAMIC,default-role,
[jira] [Comment Edited] (MESOS-8887) Unreachable tasks are not GC'ed when unreachable agent is GC'ed.
[ https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769813#comment-16769813 ] Vinod Kone edited comment on MESOS-8887 at 2/15/19 10:38 PM: - Landed on master: --- commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba Author: Vinod Kone Date: Tue Feb 5 16:55:19 2019 -0600 Tested unreachable task behavior on agent GC. Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch. Review: https://reviews.apache.org/r/69909 commit c72a4f909054e5efa75d9e5d8dde71b0083402c1 Author: Vinod Kone Date: Sat Feb 2 10:01:56 2019 -0600 Removed unreachable tasks from `Master::Framework` on agent GC. Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed from the registry. This patch fixes it so that the latter is also cleaned up. Review: https://reviews.apache.org/r/69908 commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373 Author: Vinod Kone Date: Sat Feb 2 09:51:09 2019 -0600 Fixed variable names in `Master::_doRegistryGC()`. Substituted `slave` with `slaveId` to be consistent with the code base. No functional changes. Review: https://reviews.apache.org/r/69907 --- Backported to 1.7.x --- commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9 Author: Vinod Kone Date: Fri Feb 15 14:33:00 2019 -0600 Added MESOS-8887 to the 1.7.2 CHANGELOG. commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba Author: Vinod Kone Date: Tue Feb 5 16:55:19 2019 -0600 Tested unreachable task behavior on agent GC. Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch. Review: https://reviews.apache.org/r/69909 commit c72a4f909054e5efa75d9e5d8dde71b0083402c1 Author: Vinod Kone Date: Sat Feb 2 10:01:56 2019 -0600 Removed unreachable tasks from `Master::Framework` on agent GC. 
Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed from the registry. This patch fixes it so that the latter is also cleaned up. Review: https://reviews.apache.org/r/69908 commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373 Author: Vinod Kone Date: Sat Feb 2 09:51:09 2019 -0600 Fixed variable names in `Master::_doRegistryGC()`. Substituted `slave` with `slaveId` to be consistent with the code base. No functional changes. Review: https://reviews.apache.org/r/69907 was (Author: vinodkone): Landed on master: commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba Author: Vinod Kone Date: Tue Feb 5 16:55:19 2019 -0600 Tested unreachable task behavior on agent GC. Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch. Review: https://reviews.apache.org/r/69909 commit c72a4f909054e5efa75d9e5d8dde71b0083402c1 Author: Vinod Kone Date: Sat Feb 2 10:01:56 2019 -0600 Removed unreachable tasks from `Master::Framework` on agent GC. Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed from the registry. This patch fixes it so that the latter is also cleaned up. Review: https://reviews.apache.org/r/69908 commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373 Author: Vinod Kone Date: Sat Feb 2 09:51:09 2019 -0600 Fixed variable names in `Master::_doRegistryGC()`. Substituted `slave` with `slaveId` to be consistent with the code base. No functional changes. Review: https://reviews.apache.org/r/69907 Backported to 1.7.x commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9 Author: Vinod Kone Date: Fri Feb 15 14:33:00 2019 -0600 Added MESOS-8887 to the 1.7.2 CHANGELOG. commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba Author: Vinod Kone Date: Tue Feb 5 16:55:19 2019 -0600 Tested unreachable task behavior on agent GC. 
Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch. Review: https://reviews.apache.org/r/69909 commit c72a4f909054e5efa75d9e5d8dde71b0083402c1 Author: Vinod Kone Date: Sat Feb 2 10:01:56 2019 -0600 Removed unreachable tasks from `Master::Framework` on agent GC. Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable
[jira] [Comment Edited] (MESOS-9143) MasterQuotaTest.RemoveSingleQuota is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769794#comment-16769794 ] Meng Zhu edited comment on MESOS-9143 at 2/15/19 10:15 PM: --- {noformat} commit 4380e5ba999b31782ef2fb32f51a1f225d28f5c5 Date: Wed Feb 13 16:06:26 2019 -0800 Fixed a flaky test `MasterQuotaTest.RemoveSingleQuota`. The test is flaky due to a race between metrics update and metrics query. This patch adds clock settle to ensure quota update and removal are fully processed (including metrics updates) before continuing with the metrics query. Review: https://reviews.apache.org/r/69981 {noformat} was (Author: mzhu): commit 4380e5ba999b31782ef2fb32f51a1f225d28f5c5 Date: Wed Feb 13 16:06:26 2019 -0800 Fixed a flaky test `MasterQuotaTest.RemoveSingleQuota`. The test is flaky due to a race between metrics update and metrics query. This patch adds clock settle to ensure quota update and removal are fully processed (including metrics updates) before continuing with the metrics query. Review: https://reviews.apache.org/r/69981 > MasterQuotaTest.RemoveSingleQuota is flaky. > --- > > Key: MESOS-9143 > URL: https://issues.apache.org/jira/browse/MESOS-9143 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Alexander Rukletsov >Assignee: Meng Zhu >Priority: Major > Labels: flaky, flaky-test, mesosphere, resource-management > Fix For: 1.8.0 > > Attachments: RemoveSingleQuota-badrun.txt > > > {noformat} > ../../src/tests/master_quota_tests.cpp:493 > Value of: metrics.at(metricKey).isNone() > Actual: false > Expected: true > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
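The race named in the commit message, a metrics query overtaking an asynchronous metrics update, can be shown with an analogy in plain Python. This sketch does not use libprocess; a worker thread and `queue.Queue.join()` stand in for the asynchronous processing and for settling, and the metric key `quota/role` is a made-up example.

```python
import queue
import threading


def metric_present_after_removal(settle):
    """Request removal of a metric, then query it; return True if still present."""
    metrics = {"quota/role": 1}
    events = queue.Queue()

    def worker():
        while True:
            key = events.get()
            if key is not None:
                metrics.pop(key, None)  # the asynchronous metrics update
            events.task_done()
            if key is None:
                return

    thread = threading.Thread(target=worker)
    thread.start()
    events.put("quota/role")  # request the removal
    if settle:
        events.join()  # wait until every queued update has been processed
    present = "quota/role" in metrics  # the metrics query
    events.put(None)  # stop the worker
    thread.join()
    return present


print(metric_present_after_removal(settle=True))
```

With `settle=True` the query always observes the removal. Without it the result depends on thread scheduling, which is the kind of nondeterminism that makes a test flaky.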
[jira] [Commented] (MESOS-9490) Support accepting gzipped responses in libprocess
[ https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769771#comment-16769771 ]

Benjamin Mahler commented on MESOS-9490:

[~bennoe] Can you include the stack trace of the CHECK failure?

> Support accepting gzipped responses in libprocess
> -
>
> Key: MESOS-9490
> URL: https://issues.apache.org/jira/browse/MESOS-9490
> Project: Mesos
> Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
> Labels: libprocess
>
> Currently all libprocess endpoints support serving gzipped responses when
> the client requests this with an `Accept-Encoding: gzip` header.
> However, libprocess does not support receiving gzipped responses and fails
> with a decode error in this case.
> For symmetry, we should try to support compression in this case as well.
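What "accepting gzipped responses" means on the receiving side can be sketched in a few lines. This is a generic Python illustration using the standard `gzip` module, not libprocess code; `decode_body` is a hypothetical helper and it assumes the header dictionary uses the canonical `Content-Encoding` spelling.

```python
import gzip


def decode_body(headers, body):
    """Decompress a response body if the server marked it as gzip-encoded."""
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    return body  # identity or missing encoding: pass the bytes through


# The client advertises gzip support in its request ...
request_headers = {"Accept-Encoding": "gzip"}

# ... and must then be ready for either kind of response.
payload = b'{"ok": true}'
gzipped = decode_body({"Content-Encoding": "gzip"}, gzip.compress(payload))
plain = decode_body({}, payload)
print(gzipped == plain == payload)
```

A client that sends `Accept-Encoding: gzip` without a step like this would hand compressed bytes to its decoder, which matches the decode error described in the issue.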
[jira] [Commented] (MESOS-8750) Check failed: !slaves.registered.contains(task->slave_id)
[ https://issues.apache.org/jira/browse/MESOS-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769727#comment-16769727 ]

Vinod Kone commented on MESOS-8750:

[~megha.sharma] [~xujyan] Why was this not backported to older versions?

> Check failed: !slaves.registered.contains(task->slave_id)
>
> Key: MESOS-8750
> URL: https://issues.apache.org/jira/browse/MESOS-8750
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.6.0
> Reporter: Megha Sharma
> Assignee: Megha Sharma
> Priority: Critical
> Fix For: 1.6.0
>
> It appears that in certain circumstances an unreachable task doesn't get
> cleaned up from {{framework.unreachableTasks}} when the respective agent
> re-registers, leading to this check failure later when the framework is being
> removed. When an agent goes unreachable, the master adds the tasks from this
> agent to {{framework.unreachableTasks}}. When such an agent re-registers, the
> master removes the tasks that the agent specifies during re-registration from
> this data structure. However, there could be tasks that the agent doesn't
> know about, e.g. if the runTask message for them got dropped, so such tasks
> will not get removed from unreachableTasks.
> {noformat}
> F0310 13:30:58.856665 62740 master.cpp:9671] Check failed: !slaves.registered.contains(task->slave_id()) Unreachable task of framework 4f57975b-05dd-4118-8674-5b29a86c6a6c-0850 was found on registered agent 683c4a92-b5a0-490c-998a-6113fc86d37a-S1428
> {noformat}
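The bookkeeping gap described in MESOS-8750 can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not Mesos code: `MiniMaster`, `Task`, `markUnreachable`, `reregisterReportedOnly`, `reregisterByAgent` and `invariantHolds` are hypothetical names, and the second re-registration variant is just one possible cleanup strategy, not necessarily the fix that was committed.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified stand-ins for the master's state.
struct Task {
  std::string id;
  std::string agentId;
};

struct MiniMaster {
  std::unordered_set<std::string> registeredAgents;
  // Models framework.unreachableTasks: task id -> task.
  std::unordered_map<std::string, Task> unreachableTasks;

  // Agent became unreachable: deregister it and remember its tasks.
  void markUnreachable(const std::string& agentId,
                       const std::vector<Task>& tasks) {
    registeredAgents.erase(agentId);
    for (const Task& task : tasks) {
      unreachableTasks[task.id] = task;
    }
  }

  // Behavior described in the report: only tasks the agent lists are removed.
  void reregisterReportedOnly(const std::string& agentId,
                              const std::vector<std::string>& reported) {
    registeredAgents.insert(agentId);
    for (const std::string& taskId : reported) {
      unreachableTasks.erase(taskId);
    }
  }

  // One possible cleanup: drop every unreachable task recorded for the agent,
  // whether or not the agent itself knows about it.
  void reregisterByAgent(const std::string& agentId) {
    registeredAgents.insert(agentId);
    for (auto it = unreachableTasks.begin(); it != unreachableTasks.end();) {
      if (it->second.agentId == agentId) {
        it = unreachableTasks.erase(it);
      } else {
        ++it;
      }
    }
  }

  // Mirrors the failed CHECK: no unreachable task may sit on a registered agent.
  bool invariantHolds() const {
    for (const auto& entry : unreachableTasks) {
      if (registeredAgents.count(entry.second.agentId) > 0) {
        return false;
      }
    }
    return true;
  }
};
```

In this model, a task whose launch message was dropped stays in `unreachableTasks` after `reregisterReportedOnly`, so `invariantHolds()` returns false, which corresponds to the check failure in the log above.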
[jira] [Commented] (MESOS-9578) Document per framework minimal allocatable resources in framework development guides
[ https://issues.apache.org/jira/browse/MESOS-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769467#comment-16769467 ]

Benjamin Mahler commented on MESOS-9578:

It would also be nice to document this in the multi-scheduler scalability guidelines.

> Document per framework minimal allocatable resources in framework development guides
>
> Key: MESOS-9578
> URL: https://issues.apache.org/jira/browse/MESOS-9578
> Project: Mesos
> Issue Type: Task
> Components: documentation
> Reporter: Benjamin Bannier
> Priority: Blocker
>
> With MESOS-9523 we introduced fields into {{FrameworkInfo}} to give
> frameworks a way to express their resource requirements. We should document
> this feature in the framework development guide(s).
[jira] [Created] (MESOS-9578) Document per framework minimal allocatable resources in framework development guides
Benjamin Bannier created MESOS-9578:

Summary: Document per framework minimal allocatable resources in framework development guides
Key: MESOS-9578
URL: https://issues.apache.org/jira/browse/MESOS-9578
Project: Mesos
Issue Type: Task
Components: documentation
Reporter: Benjamin Bannier

With MESOS-9523 we introduced fields into {{FrameworkInfo}} to give frameworks a way to express their resource requirements. We should document this feature in the framework development guide(s).
[jira] [Comment Edited] (MESOS-9490) Support accepting gzipped responses in libprocess
[ https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769234#comment-16769234 ]

Benno Evers edited comment on MESOS-9490 at 2/15/19 11:57 AM:

[~bmahler], the full code which originally hit this issue is pasted in the linked issue; a more minimal version looks like this:

{noformat}
TEST_F(MasterLoadTest, DISABLED_AcceptEncoding)
{
  Try<Owned<cluster::Master>> master = StartMaster();

  Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL);
  Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}};

  auto response = process::http::get(
      master.get()->pid,
      "/state",
      None(),
      authHeaders + acceptGzipHeaders);

  AWAIT_READY(response);
}
{noformat}

If I remember correctly, running this test leads to a segfault due to some internal CHECK failure.

> Support accepting gzipped responses in libprocess
>
> Key: MESOS-9490
> URL: https://issues.apache.org/jira/browse/MESOS-9490
> Project: Mesos
> Issue Type: Improvement
> Reporter: Benno Evers
> Priority: Major
> Labels: libprocess
>
> Currently all libprocess endpoints support the serving of gzipped responses
> when the client is requesting this with an `Accept-Encoding: gzip` header.
> However, libprocess does not support receiving gzipped responses, failing
> with a decode error in this case.
> For symmetry, we should try to support compression in this case as well.
[jira] [Commented] (MESOS-9490) Support accepting gzipped responses in libprocess
[ https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769234#comment-16769234 ]

Benno Evers commented on MESOS-9490:

[~bmahler], the full code which originally hit this issue is pasted in the linked issue; a more minimal version looks like this:

{noformat}
TEST_F(MasterLoadTest, DISABLED_AcceptEncoding)
{
  Try<Owned<cluster::Master>> master = StartMaster();

  Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL);
  Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}};

  auto response = process::http::get(
      master.get()->pid,
      "/state",
      None(),
      authHeaders + acceptGzipHeaders);

  AWAIT_READY(response);
}
{noformat}

> Support accepting gzipped responses in libprocess
>
> Key: MESOS-9490
> URL: https://issues.apache.org/jira/browse/MESOS-9490
> Project: Mesos
> Issue Type: Improvement
> Reporter: Benno Evers
> Priority: Major
> Labels: libprocess
>
> Currently all libprocess endpoints support the serving of gzipped responses
> when the client is requesting this with an `Accept-Encoding: gzip` header.
> However, libprocess does not support receiving gzipped responses, failing
> with a decode error in this case.
> For symmetry, we should try to support compression in this case as well.
[jira] [Commented] (MESOS-9575) Mesos Web UI can't display relative timestamps in the future
[ https://issues.apache.org/jira/browse/MESOS-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769093#comment-16769093 ]

Armand Grillet commented on MESOS-9575:

relative-date has not been updated since 2011: https://github.com/azer/relative-date

We could use https://github.com/moment/moment/ which can [parse UNIX timestamps|https://momentjs.com/docs/#/parsing/unix-timestamp/] and return relative dates in the past and the future:

{noformat}
moment("20111031", "YYYYMMDD").fromNow(); // 7 years ago
moment().startOf('day').fromNow();        // 10 hours ago
moment().endOf('day').fromNow();          // in 14 hours
{noformat}

> Mesos Web UI can't display relative timestamps in the future
>
> Key: MESOS-9575
> URL: https://issues.apache.org/jira/browse/MESOS-9575
> Project: Mesos
> Issue Type: Bug
> Reporter: Benno Evers
> Priority: Major
>
> The `relativeDate()` function used by the Mesos WebUI
> (https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=src/webui/assets/libs/relative-date.js;hb=HEAD)
> is only able to handle dates in the past. All dates in the future are
> rendered as "just now".
> This can be especially confusing when posting maintenance windows, where
> usually both dates are in the future.