[jira] [Assigned] (MESOS-10199) Mesos doesn't set correct client request headers for HTTP requests
[ https://issues.apache.org/jira/browse/MESOS-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-10199: -- Assignee: Abdul Qadeer > Mesos doesn't set correct client request headers for HTTP requests > -- > > Key: MESOS-10199 > URL: https://issues.apache.org/jira/browse/MESOS-10199 > Project: Mesos > Issue Type: Bug > Components: agent, libprocess, master >Reporter: Abdul Qadeer >Assignee: Abdul Qadeer >Priority: Major > > The agents are not able to contact/register with the master, as the requests > don't set the 'Host' header and nginx is required to return 400 for such > requests per the [RFC|https://tools.ietf.org/html/rfc7230#section-5.4] spec: > {noformat} > *7 client sent invalid host header while reading client request headers, > client: x.x.x.x, server: , request: "POST > /master/mesos.internal.ReregisterSlaveMessage HTTP/1.1", host: ""{noformat} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
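The failing requirement can be shown in isolation: RFC 7230 §5.4 says the Host header must carry the authority component (host and optional port) of the target URI. A minimal Python sketch of deriving that value; the master address below is hypothetical, standing in for the real agent-to-master URL:

```python
from urllib.parse import urlsplit

def host_header(url: str) -> str:
    """Derive the Host header value required by RFC 7230 section 5.4:
    the authority component (host[:port]) of the target URI."""
    parts = urlsplit(url)
    if not parts.netloc:
        raise ValueError("target URI has no authority component")
    # netloc keeps the port only when it is explicit in the URI.
    return parts.netloc

# Hypothetical master endpoint for illustration.
url = "http://10.0.0.1:5050/master/mesos.internal.ReregisterSlaveMessage"
print(host_header(url))  # -> 10.0.0.1:5050
```

A client that sends an empty or missing Host header here is exactly what nginx logs as `host: ""` above.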
[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.
[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097036#comment-17097036 ] Vinod Kone commented on MESOS-8038: --- Thanks [~cf.natali] for the repro and analysis. The log lines you pasted in the comment don't capture everything that transpired; you would need to do a grep like this to get the whole picture. {quote} grep -E "task-650af3bd-3f5b-4e17-9d34-4642480b4da0|:36541|6f446173-2bba-4cc4-bc15-c956bc159d4e" mesos_agent.log {quote} But, anyway, I think your observations are largely correct. When a container is in the process of being destroyed, the agent does short-circuit and send the terminal update to the master, causing the resources to be released, re-offered, and used by some other task. I remember discussions around this behavior in the past, but I'm not sure where we landed in terms of the long-term solution. Right now, we err on the side of releasing the resources in case the cgroup gets stuck in destruction, instead of hoarding them. If we do decide to change this code to always wait for the cgroup destruction to be finished (or the update to be finished), there's a possibility that resources are locked forever in case of bugs (either in Mesos or the kernel) in the destruction path. I can't remember if we have seen this behavior in production clusters before. [~abudnik] [~greggomann] thoughts on fixing this? > Launching GPU task sporadically fails. > -- > > Key: MESOS-8038 > URL: https://issues.apache.org/jira/browse/MESOS-8038 > Project: Mesos > Issue Type: Bug > Components: containerization, gpu >Affects Versions: 1.4.0 >Reporter: Sai Teja Ranuva >Assignee: Zhitao Li >Priority: Critical > Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, > mesos-slave.INFO.log, mesos_agent.log, start_short_tasks_gpu.py > > > I was running a job which uses GPUs. It runs fine most of the time. > But occasionally I see the following message in the mesos log. 
> "Collect failed: Requested 1 but only 0 available" > Followed by the executor getting killed and the tasks getting lost. This happens > even before the job starts. A little search in the code base points me to > something related to the GPU resource being the probable cause. > There is no deterministic way that this can be reproduced. It happens > occasionally. > I have attached the slave log for the issue. > Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave. -- This message was sent by Atlassian Jira (v8.3.4#803005)
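The race described in the comment above can be sketched as a toy model (not Mesos code): the master's allocator view and the agent-side GPU device pool disagree once the terminal update is short-circuited, which is what produces the "Requested 1 but only 0 available" failure.

```python
class Pool:
    """Toy resource pool; stands in for both the master's allocator view
    and the agent-side GPU device pool."""
    def __init__(self, total):
        self.available = total

    def allocate(self, n):
        if self.available < n:
            raise RuntimeError(
                f"Requested {n} but only {self.available} available")
        self.available -= n

    def release(self, n):
        self.available += n

master = Pool(1)    # master/allocator view of the GPU
isolator = Pool(1)  # agent-side GPU device pool

# Task A launches: both views agree.
master.allocate(1)
isolator.allocate(1)

# Container destruction begins; the agent short-circuits the terminal update,
# so the master releases the resources before the isolator gives the device back.
master.release(1)

# Task B is launched with the re-offered GPU: the master-side check passes...
master.allocate(1)

# ...but the isolator's device pool is still empty.
try:
    isolator.allocate(1)
except RuntimeError as e:
    print(e)  # Requested 1 but only 0 available
```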
[jira] [Assigned] (MESOS-6084) Deprecate and remove the included MPI framework
[ https://issues.apache.org/jira/browse/MESOS-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-6084: - Assignee: Vinod Kone > Deprecate and remove the included MPI framework > --- > > Key: MESOS-6084 > URL: https://issues.apache.org/jira/browse/MESOS-6084 > Project: Mesos > Issue Type: Task >Affects Versions: 1.0.0 >Reporter: Joseph Wu >Assignee: Vinod Kone >Priority: Minor > Labels: mpi > > The Mesos codebase still includes code for an > [MPI|http://www.mcs.anl.gov/research/projects/mpi/] framework. This code has > been untouched and probably not used since around Mesos 0.9.0. Since we > don't support this code anymore, we should deprecate and remove it. > The code is located here: > https://github.com/apache/mesos/tree/db4c8a0e9eaf27f3e2d42a620a5e612863cbf9ea/mpi -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10092) Cannot pull image from docker registry which does not reply with 'scope'/'service' in WWW-Authenticate header
[ https://issues.apache.org/jira/browse/MESOS-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044770#comment-17044770 ] Vinod Kone commented on MESOS-10092: Up to 1.7 should be fine, I think. > Cannot pull image from docker registry which does not reply with > 'scope'/'service' in WWW-Authenticate header > - > > Key: MESOS-10092 > URL: https://issues.apache.org/jira/browse/MESOS-10092 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Sekretenko >Assignee: Andrei Sekretenko >Priority: Critical > Fix For: 1.8.2, 1.9.1, 1.10.0 > > > This problem was encountered when trying to specify container image > nvcr.io/nvidia/tensorflow:19.12-tf1-py3 > When initiating Docker Registry authentication > (https://docs.docker.com/registry/spec/auth/token/) with nvcr.io, Mesos URI > fetcher receives 'WWW-Authenticate' header without 'service' and 'scope' > params, and fails here: > https://github.com/apache/mesos/blob/1e9b121273a6d9248a78ab44798bd4c1138c31ee/src/uri/fetchers/docker.cpp#L1083 > This is an example of an unsuccessful request made by Mesos: > {code} > curl -s -S -L -i --raw --http1.1 -H "Accept: > application/vnd.docker.distribution.manifest.v2+json,application/vnd.docker.distribution.manifest.v1+json,application/vnd.docker.distribution.manifest.v1+prettyjws" > -y 60 https://nvcr.io/v2/nvidia/tensorflow/manifests/19.08-py3 > HTTP/1.1 401 Unauthorized > Content-Type: text/html > Date: Wed, 22 Jan 2020 19:01:57 GMT > Server: nginx/1.14.2 > Www-Authenticate: Bearer > realm="https://nvcr.io/proxy_auth?scope=repository:nvidia/tensorflow:pull,push" > Content-Length: 195 > Connection: keep-alive > > 401 Authorization Required > > 401 Authorization Required > nginx/1.14.2 > > > {code} > At the same time, docker is perfectly capable of pulling this image. 
> Note that the document "Token Authentication Specification" > (https://docs.docker.com/registry/spec/auth/token/), on which the Mesos > implementation is based, is vague on the issue of registries that do not > provide 'scope'/'service' in WWW-Authenticate header. > What Docker does differently (at the very least, in the case of nvcr.io): > It sends the initial request not to the manifest/blob URI, but to the > repository root URI (https://nvcr.io/v2 in this case): > {code} > GET /v2/ HTTP/1.1 > Host: nvcr.io > User-Agent: docker/18.03.1-ce go/go1.9.5 git-commit/9ee9f402cd > kernel/4.15.0-60-generic os/linux arch/amd64 > UpstreamClient(Docker-Client/18.09.7 \(linux\)) > {code} > To this, it receives a response with a "realm" that contains no query arguments: > {code} > HTTP/1.1 401 Unauthorized > Connection: close > Content-Length: 195 > Content-Type: text/html > Date: Wed, 29 Jan 2020 12:22:43 GMT > Server: nginx/1.14.2 > Www-Authenticate: Bearer realm="https://nvcr.io/proxy_auth" > {code} > Then, it composes the scope using the image ref and a hardcoded "pull" > action: > https://github.com/docker/distribution/blob/a8371794149d1d95f1e846744b05c87f2f825e5a/registry/client/auth/session.go#L174 > (in full accordance with this spec: > https://docs.docker.com/registry/spec/auth/scope/) > and sends the following request to https://nvcr.io/proxy_auth : > {code} > GET /proxy_auth?scope=repository%3Anvidia%2Ftensorflow%3Apull HTTP/1.1 > Host: nvcr.io > User-Agent: Go-http-client/1.1 > {code} > (Note that 'push' is absent from the scope) -- This message was sent by Atlassian Jira (v8.3.4#803005)
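The fallback Docker implements, composing the scope from the image reference with a hardcoded "pull" action and URL-encoding it into the realm's query string, can be sketched in Python (a simplified illustration of the linked Go code, not the Mesos fetcher itself):

```python
from urllib.parse import quote

def pull_scope(image_ref: str) -> str:
    # "repository:<name>:pull", per the Docker token-auth scope grammar.
    return f"repository:{image_ref}:pull"

scope = pull_scope("nvidia/tensorflow")
print(scope)                  # repository:nvidia/tensorflow:pull
print(quote(scope, safe=""))  # repository%3Anvidia%2Ftensorflow%3Apull
```

The encoded form matches the `scope=` query argument in the request Docker sends to https://nvcr.io/proxy_auth above.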
[jira] [Commented] (MESOS-10092) Cannot pull image from docker registry which does not reply with 'scope'/'service' in WWW-Authenticate header
[ https://issues.apache.org/jira/browse/MESOS-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044728#comment-17044728 ] Vinod Kone commented on MESOS-10092: [~asekretenko] Should this be resolved? Also, is this being backported? > Cannot pull image from docker registry which does not reply with > 'scope'/'service' in WWW-Authenticate header > - > > Key: MESOS-10092 > URL: https://issues.apache.org/jira/browse/MESOS-10092 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Sekretenko >Assignee: Andrei Sekretenko >Priority: Critical > > This problem was encountered when trying to specify container image > nvcr.io/nvidia/tensorflow:19.12-tf1-py3 > When initiating Docker Registry authentication > (https://docs.docker.com/registry/spec/auth/token/) with nvcr.io, Mesos URI > fetcher receives 'WWW-Authenticate' header without 'service' and 'scope' > params, and fails here: > https://github.com/apache/mesos/blob/1e9b121273a6d9248a78ab44798bd4c1138c31ee/src/uri/fetchers/docker.cpp#L1083 > This is an example of an unsuccessful request made by Mesos: > {code} > curl -s -S -L -i --raw --http1.1 -H "Accept: > application/vnd.docker.distribution.manifest.v2+json,application/vnd.docker.distribution.manifest.v1+json,application/vnd.docker.distribution.manifest.v1+prettyjws" > -y 60 https://nvcr.io/v2/nvidia/tensorflow/manifests/19.08-py3 > HTTP/1.1 401 Unauthorized > Content-Type: text/html > Date: Wed, 22 Jan 2020 19:01:57 GMT > Server: nginx/1.14.2 > Www-Authenticate: Bearer > realm="https://nvcr.io/proxy_auth?scope=repository:nvidia/tensorflow:pull,push" > Content-Length: 195 > Connection: keep-alive > > 401 Authorization Required > > 401 Authorization Required > nginx/1.14.2 > > > {code} > At the same time, docker is perfectly capable of pulling this image. 
> Note that the document "Token Authentication Specification" > (https://docs.docker.com/registry/spec/auth/token/), on which the Mesos > implementation is based, is vague on the issue of registries that do not > provide 'scope'/'service' in WWW-Authenticate header. > What Docker does differently (at the very least, in the case of nvcr.io): > It sends the initial request not to the manifest/blob URI, but to the > repository root URI (https://nvcr.io/v2 in this case): > {code} > GET /v2/ HTTP/1.1 > Host: nvcr.io > User-Agent: docker/18.03.1-ce go/go1.9.5 git-commit/9ee9f402cd > kernel/4.15.0-60-generic os/linux arch/amd64 > UpstreamClient(Docker-Client/18.09.7 \(linux\)) > {code} > To this, it receives a response with a "realm" that contains no query arguments: > {code} > HTTP/1.1 401 Unauthorized > Connection: close > Content-Length: 195 > Content-Type: text/html > Date: Wed, 29 Jan 2020 12:22:43 GMT > Server: nginx/1.14.2 > Www-Authenticate: Bearer realm="https://nvcr.io/proxy_auth" > {code} > Then, it composes the scope using the image ref and a hardcoded "pull" > action: > https://github.com/docker/distribution/blob/a8371794149d1d95f1e846744b05c87f2f825e5a/registry/client/auth/session.go#L174 > (in full accordance with this spec: > https://docs.docker.com/registry/spec/auth/scope/) > and sends the following request to https://nvcr.io/proxy_auth : > {code} > GET /proxy_auth?scope=repository%3Anvidia%2Ftensorflow%3Apull HTTP/1.1 > Host: nvcr.io > User-Agent: Go-http-client/1.1 > {code} > (Note that 'push' is absent from the scope) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-4659) Avoid leaving orphan task after framework failure + master failover
[ https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038595#comment-17038595 ] Vinod Kone commented on MESOS-4659: --- I don't have the bandwidth right now, but I'm happy to review the code if you work on a patch. Please see the instructions here: https://mesos.readthedocs.io/en/latest/submitting-a-patch/ > Avoid leaving orphan task after framework failure + master failover > --- > > Key: MESOS-4659 > URL: https://issues.apache.org/jira/browse/MESOS-4659 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Neil Conway >Priority: Major > Labels: failover, mesosphere > > If a framework becomes disconnected from the master, its tasks are killed > after waiting for {{failover_timeout}}. > However, if a master failover occurs but a framework never reconnects to the > new master, we never kill any of the tasks associated with that framework. > These tasks remain orphaned and presumably would need to be manually removed > by the operator. Similarly, if a framework gets torn down or disconnects > while it has running tasks on a partitioned agent, those tasks are not > shut down when the agent reregisters. > We should consider whether to kill such orphaned tasks automatically, likely > after waiting for some (framework-configurable?) timeout. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (MESOS-6352) Expose information about unreachable agents via operator API
[ https://issues.apache.org/jira/browse/MESOS-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-6352: - Assignee: (was: Abhishek Dasgupta) > Expose information about unreachable agents via operator API > > > Key: MESOS-6352 > URL: https://issues.apache.org/jira/browse/MESOS-6352 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Neil Conway >Priority: Major > Labels: mesosphere > > Operators would probably find information about the set of unreachable agents > useful. Two main use cases I can see: (a) identifying which agents are > currently unreachable and when they were marked unreachable, (b) > understanding the size/content of the registry as a way to debug registry > perf issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-9923) AgentAPITest.GetStateWithNonTerminalCompletedTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918740#comment-16918740 ] Vinod Kone commented on MESOS-9923: --- Observed this on ASF CI when testing 1.9.0-rc2 {code} 3: [ RUN ] ContentType/AgentAPITest.GetStateWithNonTerminalCompletedTask/0 3: I0828 21:20:00.647260 17669 cluster.cpp:177] Creating default 'local' authorizer 3: I0828 21:20:00.655491 17681 master.cpp:440] Master cff62302-83f2-4586-b6a6-ec603af07f35 (5ca4a76bb68c) started on 172.17.0.3:46115 3: I0828 21:20:00.655534 17681 master.cpp:443] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1000secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/49Bxak/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/49Bxak/master" --zk_session_timeout="10secs" 3: I0828 21:20:00.656090 17681 master.cpp:492] Master only allowing authenticated frameworks to register 3: I0828 21:20:00.656103 17681 master.cpp:498] Master only allowing authenticated agents to register 3: I0828 21:20:00.656111 17681 master.cpp:504] Master only allowing authenticated HTTP frameworks to register 3: I0828 21:20:00.656119 17681 credentials.hpp:37] Loading credentials for authentication from '/tmp/49Bxak/credentials' 3: I0828 21:20:00.656491 17681 master.cpp:548] Using default 'crammd5' authenticator 3: I0828 21:20:00.656787 17681 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' 3: I0828 21:20:00.657025 17681 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' 3: I0828 21:20:00.657196 17681 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' 3: I0828 21:20:00.657344 17681 master.cpp:629] Authorization enabled 3: I0828 21:20:00.657789 17676 hierarchical.cpp:474] Initialized hierarchical allocator process 3: I0828 21:20:00.658103 17677 whitelist_watcher.cpp:77] No whitelist given 3: I0828 21:20:00.664515 17681 master.cpp:2170] Elected as the leading master! 
3: I0828 21:20:00.664557 17681 master.cpp:1666] Recovering from registrar 3: I0828 21:20:00.665055 17681 registrar.cpp:339] Recovering registrar 3: I0828 21:20:00.666002 17676 registrar.cpp:383] Successfully fetched the registry (0B) in 896us 3: I0828 21:20:00.666203 17676 registrar.cpp:487] Applied 1 operations in 62949ns; attempting to update the registry 3: I0828 21:20:00.667132 17676 registrar.cpp:544] Successfully updated the registry in 852224ns 3: I0828 21:20:00.667313 17676 registrar.cpp:416] Successfully recovered registrar 3: I0828 21:20:00.667974 17676 master.cpp:1819] Recovered 0 agents from the registry (143B); allowing 10mins for agents to reregister 3: I0828 21:20:00.668090 17685 hierarchical.cpp:513] Skipping recovery of hierarchical allocator: nothing to recover 3: W0828 21:20:00.687932 17669 process.cpp:2877] Attempted to spawn already running process files@172.17.0.3:46115 3: I0828 21:20:00.689092 17669 cluster.cpp:518] Creating default 'local' authorizer 3: W0828 21:20:00.692358 17669 process.cpp:2877] Attempted to spawn already running process version@172.17.0.3:46115 3: I0828 21:20:00.692720 17684 slave.cpp:267] Mesos agent started on (901)@172.17.0.3:46115 3: I0828 21:20:00.692745 17684 slave.cpp:268] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://; --appc_store_dir="/tmp/49Bxak/YlM9y5/store/appc" --authenticate_http_readonly="true" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authentication_timeout_max="1mins" --authentication_timeout_min="5secs"
[jira] [Commented] (MESOS-8983) SlaveRecoveryTest/0.PingTimeoutDuringRecovery is flaky
[ https://issues.apache.org/jira/browse/MESOS-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918041#comment-16918041 ] Vinod Kone commented on MESOS-8983: --- Seen this again when testing 1.9.0-RC2. {code} 13:32:33 3: [ RUN ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery 13:32:33 3: I0828 18:32:33.580678 20801 cluster.cpp:177] Creating default 'local' authorizer 13:32:33 3: I0828 18:32:33.587858 20824 master.cpp:440] Master 3de64da7-619c-4652-9d33-3fe2ca2a3d5f (b766865f9da3) started on 172.17.0.2:42011 13:32:33 3: I0828 18:32:33.587904 20824 master.cpp:443] Flags at startup: --acls="" --agent_ping_timeout="1secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/sIRhDp/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="2" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/sIRhDp/master" --zk_session_timeout="10secs" 13:32:33 3: I0828 18:32:33.588558 20824 master.cpp:492] Master only allowing authenticated frameworks to register 13:32:33 3: I0828 18:32:33.588574 20824 master.cpp:498] Master only allowing authenticated agents to register 13:32:33 3: I0828 18:32:33.588587 20824 master.cpp:504] Master only allowing authenticated HTTP frameworks to register 13:32:33 3: I0828 18:32:33.588599 20824 credentials.hpp:37] Loading credentials for authentication from '/tmp/sIRhDp/credentials' 13:32:33 3: I0828 18:32:33.588999 20824 master.cpp:548] Using default 'crammd5' authenticator 13:32:33 3: I0828 18:32:33.589262 20824 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' 13:32:33 3: I0828 18:32:33.589529 20824 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' 13:32:33 3: I0828 18:32:33.589697 20824 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' 13:32:33 3: I0828 18:32:33.589866 20824 master.cpp:629] Authorization enabled 13:32:33 3: I0828 18:32:33.590817 20823 whitelist_watcher.cpp:77] No whitelist given 13:32:33 3: I0828 18:32:33.594827 20816 master.cpp:2170] Elected as the leading master! 
13:32:33 3: I0828 18:32:33.594887 20816 master.cpp:1666] Recovering from registrar 13:32:33 3: I0828 18:32:33.595124 20808 hierarchical.cpp:474] Initialized hierarchical allocator process 13:32:33 3: I0828 18:32:33.595382 20808 registrar.cpp:339] Recovering registrar 13:32:33 3: I0828 18:32:33.596575 20808 registrar.cpp:383] Successfully fetched the registry (0B) in 1.14688ms 13:32:33 3: I0828 18:32:33.596779 20808 registrar.cpp:487] Applied 1 operations in 63194ns; attempting to update the registry 13:32:33 3: I0828 18:32:33.597638 20819 registrar.cpp:544] Successfully updated the registry in 788224ns 13:32:33 3: I0828 18:32:33.597805 20819 registrar.cpp:416] Successfully recovered registrar 13:32:33 3: I0828 18:32:33.598423 20819 master.cpp:1819] Recovered 0 agents from the registry (144B); allowing 10mins for agents to reregister 13:32:33 3: I0828 18:32:33.598599 20813 hierarchical.cpp:513] Skipping recovery of hierarchical allocator: nothing to recover 13:32:33 3: I0828 18:32:33.614511 20801 containerizer.cpp:318] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } 13:32:33 3: W0828 18:32:33.615756 20801 backend.cpp:76] Failed to create 'overlay' backend: OverlayBackend requires root privileges 13:32:33 3: W0828 18:32:33.615855 20801 backend.cpp:76] Failed to create 'aufs' backend: AufsBackend requires root privileges 13:32:33 3: W0828 18:32:33.615934 20801 backend.cpp:76] Failed to create 'bind' backend: BindBackend requires root privileges 13:32:33 3: I0828 18:32:33.616178 20801 provisioner.cpp:300] Using
[jira] [Created] (MESOS-9955) Automate publishing SNAPSHOT JAR
Vinod Kone created MESOS-9955: - Summary: Automate publishing SNAPSHOT JAR Key: MESOS-9955 URL: https://issues.apache.org/jira/browse/MESOS-9955 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Assignee: Vinod Kone Currently snapshot jars are manually published by a committer by running support/snapshot.sh. Instead, we should have Jenkins periodically build and publish the snapshot jar. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state
[ https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906473#comment-16906473 ] Vinod Kone commented on MESOS-9545: --- [~greggomann] Let's backport this to older releases. > Marking an unreachable agent as gone should transition the tasks to terminal > state > -- > > Key: MESOS-9545 > URL: https://issues.apache.org/jira/browse/MESOS-9545 > Project: Mesos > Issue Type: Improvement >Reporter: Vinod Kone >Assignee: Greg Mann >Priority: Major > Labels: foundations > Fix For: 1.9.0 > > > If an unreachable agent is marked as gone, currently the master just marks that > agent in the registry but doesn't do anything about its tasks. So the tasks > are in UNREACHABLE state in the master forever, until the master fails over. > This is not great UX. We should transition these to terminal state instead. > This fix should also include a test to verify. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9936) Slave recovery is very slow with large persistent local volumes (Marathon app)
[ https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906276#comment-16906276 ] Vinod Kone commented on MESOS-9936: --- [~Fcomte] That's pretty weird and unexpected. Can you share a gdb stack trace during one of these long recovery periods? > Slave recovery is very slow with large persistent local volumes (Marathon app) > -- > > Key: MESOS-9936 > URL: https://issues.apache.org/jira/browse/MESOS-9936 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.8.1 >Reporter: Frédéric Comte >Priority: Major > > I run some applications with local persistent volumes. > After an unplanned shutdown of nodes running this kind of application, I > see that the Mesos recovery process takes a lot of time (more than 8 hours). > This time depends on the amount of data in those volumes. > What does Mesos do in this process? > {code:java} > Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 > docker.cpp:890] Recovering Docker containers Jul 08 07:40:44 boss1 > mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] > Recovering Mesos containers > Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 > linux_launcher.cpp:286] Recovering Linux launcher > Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 > containerizer.cpp:1127] Recovering isolators > Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 > containerizer.cpp:1166] Recovering provisioner > Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 > composing.cpp:339] Finished recovering all containerizers > Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 > status_update_manager_process.hpp:314] Recovering operation status update > manager > Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 > slave.cpp:7729] Recovering executors > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9921) Mesos UI should display TaskStatus Reason in Tasks table
Vinod Kone created MESOS-9921: - Summary: Mesos UI should display TaskStatus Reason in Tasks table Key: MESOS-9921 URL: https://issues.apache.org/jira/browse/MESOS-9921 Project: Mesos Issue Type: Improvement Components: webui Reporter: Vinod Kone The Tasks table shows "State", but it would be useful for at-a-glance debugging to also show the "Reason", either in the same or a different column. This is especially important for the completed tasks table. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-6566) The Docker executor should not leak task env variables in the Docker command cmd line.
[ https://issues.apache.org/jira/browse/MESOS-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883994#comment-16883994 ] Vinod Kone commented on MESOS-6566: --- See the description in MESOS-6951 for a potential solution using the `--env` argument. > The Docker executor should not leak task env variables in the Docker command > cmd line. > -- > > Key: MESOS-6566 > URL: https://issues.apache.org/jira/browse/MESOS-6566 > Project: Mesos > Issue Type: Bug > Components: docker, security >Reporter: Gastón Kleiman >Assignee: Till Toenshoff >Priority: Major > > Task environment variables are sensitive, as they might contain secrets. > The Docker executor starts tasks by executing a {{docker run}} command, and > it includes the env variables in the cmd line of the docker command, exposing > them to all the users on the machine: > {code} > $ ./src/mesos-execute --command="sleep 200" --containerizer=docker > --docker_image=alpine --env='{"foo": "bar"}' --master=10.0.2.15:5050 > --name=test > $ ps aux | grep bar > [...] docker -H unix:///var/run/docker.sock run [...] -e foo=bar [...] alpine > -c sleep 200 > $ > {code} > The Docker executor could pass Docker the {{--env-file}} flag, pointing it to > a file with the environment variables. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
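A sketch of the {{--env-file}} approach (a hypothetical helper for illustration, not the actual executor code): write the task environment to a file only the creating user can read, and reference the file on the docker command line instead of passing `-e KEY=VALUE` pairs:

```python
import os
import tempfile

def docker_run_argv(image, command, env):
    """Build a docker run command line that avoids leaking env values.

    mkstemp creates the file with mode 0600, so other users cannot read
    the variables, and `ps` only ever shows the --env-file path.
    """
    fd, path = tempfile.mkstemp(prefix="docker-env-")
    with os.fdopen(fd, "w") as f:
        for key, value in env.items():
            f.write(f"{key}={value}\n")
    argv = ["docker", "run", "--env-file", path, image, "sh", "-c", command]
    return argv, path

argv, path = docker_run_argv("alpine", "sleep 200", {"foo": "bar"})
print(" ".join(argv))  # no "foo=bar" anywhere in the command line
```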
[jira] [Assigned] (MESOS-7473) Use "-dev" prerelease label for version during development
[ https://issues.apache.org/jira/browse/MESOS-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-7473: - Assignee: (was: Neil Conway) > Use "-dev" prerelease label for version during development > -- > > Key: MESOS-7473 > URL: https://issues.apache.org/jira/browse/MESOS-7473 > Project: Mesos > Issue Type: Task >Reporter: Neil Conway >Priority: Major > Labels: mesosphere > > Prior discussion: > https://lists.apache.org/thread.html/6e291c504fd44b79e452744b80073cb33adc1be85c17e22bbca35a6c@%3Cdev.mesos.apache.org%3E > https://lists.apache.org/thread.html/eb526c9295b3cf8e4efc7e0a7d2dacabb61ab5ed867a05e7d913d3fb@%3Cdev.mesos.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9868) NetworkInfo from the agent /state endpoint is not correct.
[ https://issues.apache.org/jira/browse/MESOS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9868: - Assignee: Qian Zhang > NetworkInfo from the agent /state endpoint is not correct. > -- > > Key: MESOS-9868 > URL: https://issues.apache.org/jira/browse/MESOS-9868 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.8.0 >Reporter: Gilbert Song >Assignee: Qian Zhang >Priority: Blocker > Labels: containerization > > NetworkInfo from the agent /state endpoint is not correct, which is also > different from the networkInfo of the /containers endpoint. Some frameworks rely > on the /state endpoint to get the IP addresses of running containers. > agent's /state endpoint > {noformat} > { > "state": "TASK_RUNNING", > "timestamp": 1561574343.1521769, > "container_status": { > "container_id": { > "value": "9a2633be-d2e5-4636-9ad4-7b2fc669da99", > "parent": { > "value": "45ebab16-9b4b-416e-a7f2-4833fd4ed8ff" > } > }, > "network_infos": [ > { > "ip_addresses": [ > { > "protocol": "IPv4", > "ip_address": "172.31.10.35" > } > ] > } > ] > }, > "healthy": true > } > {noformat} > agent's /containers endpoint > {noformat} > "status": { > "container_id": { > "value": "5ffc9df2-3be6-4879-8b2d-2fde3f0477e0" > }, > "executor_pid": 16063, > "network_infos": [ > { > "ip_addresses": [ > { > "ip_address": "9.0.35.71", > "protocol": "IPv4" > } > ], > "name": "dcos" > } > ] > } > {noformat} > The IP addresses above are different. > The container is in RUNNING state and is running correctly. Just the /state > endpoint is not correct. One thing to notice is that the /state endpoint used > to show the correct IP. After an agent restart and a master leader > re-election, the IP address in the /state endpoint changed. 
> Here is the checkpoint CNI network information > {noformat} > OK-23:37:48-root@int-mountvolumeagent2-soak113s:/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S4/frameworks/26ffb84c-81ba-4b3b-989b-9c6560e51fa1-0171/executors/k8s-clusters.kc02__etcd__b50dc403-30d1-4b54-a367-332fb3621030/runs/latest/tasks/k8s-clusters.kc02__etcd-2-peer__5b6aa5fc-e113-4021-9db8-b63e0c8d1f6c > # cat > /var/run/mesos/isolators/network/cni/45ebab16-9b4b-416e-a7f2-4833fd4ed8ff/dcos/network.conf > > {"args":{"org.apache.mesos":{"network_info":{"name":"dcos"}}},"chain":"M-DCOS","delegate":{"bridge":"m-dcos","hairpinMode":true,"ipMasq":false,"ipam":{"dataDir":"/var/run/dcos/cni/networks","routes":[{"dst":"0.0.0.0/0"}],"subnet":"9.0.73.0/25","type":"host-local"},"isGateway":true,"mtu":1420,"type":"bridge"},"excludeDevices":["m-dcos"],"name":"dcos","type":"mesos-cni-port-mapper"} > {noformat} > {noformat} > OK-01:30:05-root@int-mountvolumeagent2-soak113s:/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S4/frameworks/26ffb84c-81ba-4b3b-989b-9c6560e51fa1-0171/executors/k8s-clusters.kc02__etcd__b50dc403-30d1-4b54-a367-332fb3621030/runs/latest/tasks/k8s-clusters.kc02__etcd-2-peer__5b6aa5fc-e113-4021-9db8-b63e0c8d1f6c > # cat > /var/run/mesos/isolators/network/cni/45eb16-9b4b-416e-a7f2-4833fd4ed8ff/dcos/eth0/network.info > {"dns":{},"ip4":{"gateway":"9.0.73.1","ip":"9.0.73.65/25","routes":[{"dst":"0.0.0.0/0","gw":"9.0.73.1"}]}} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8500) Enhanced support for multi-role scalability
[ https://issues.apache.org/jira/browse/MESOS-8500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-8500: - Assignee: Andrei Sekretenko (was: Kapil Arya) > Enhanced support for multi-role scalability > --- > > Key: MESOS-8500 > URL: https://issues.apache.org/jira/browse/MESOS-8500 > Project: Mesos > Issue Type: Epic >Reporter: Kapil Arya >Assignee: Andrei Sekretenko >Priority: Major > Labels: mesosphere, resource-management > > CC: [~bmahler] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9784) Server side SSL Certificate Validation
Vinod Kone created MESOS-9784: - Summary: Server side SSL Certificate Validation Key: MESOS-9784 URL: https://issues.apache.org/jira/browse/MESOS-9784 Project: Mesos Issue Type: Epic Reporter: Vinod Kone Assignee: Benno Evers -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9761) Mesos UI does not properly account for resources set via `--default-role`
[ https://issues.apache.org/jira/browse/MESOS-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831691#comment-16831691 ] Vinod Kone commented on MESOS-9761: --- The "Guarantee" and "Limit" columns there currently reflect quota settings set via the quota endpoints. While a reservation for a role technically guarantees that amount to the role (which is, I assume, where your confusion stems from), that's currently not the intention of that column. There are plans to improve the quota page in the near future. cc [~mzhu] [~bmahler] > Mesos UI does not properly account for resources set via `--default-role` > - > > Key: MESOS-9761 > URL: https://issues.apache.org/jira/browse/MESOS-9761 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: resource-management, ui > Attachments: default_role_ui.png > > > In our cluster, we have two agents configured with > "--default_role=slave_public" and 64 cpus each, for a total of 128 cpus > allocated to this role. The right side of the screenshot shows one of them. > However, looking at the "Roles" tab in the Mesos UI, neither "Guarantee" nor > "Limit" shows any resources for this role. > See attached screenshot for details. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9739) When recovered agent marked gone, retain agent ID
[ https://issues.apache.org/jira/browse/MESOS-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826320#comment-16826320 ] Vinod Kone commented on MESOS-9739: --- The marked agent is already retained in the registry, right? Right now, if a gone agent attempts to reregister, the master refuses it and shuts it down. Any reconciliation requests should've been answered with TASK_GONE_BY_OPERATOR already. So I'm not sure there is more to do? > When recovered agent marked gone, retain agent ID > - > > Key: MESOS-9739 > URL: https://issues.apache.org/jira/browse/MESOS-9739 > Project: Mesos > Issue Type: Improvement >Reporter: Greg Mann >Priority: Major > Labels: foundations, mesosphere > > When a recovered agent is marked gone, we could retain its agent ID so that > if it attempts to reregister, we could send task status updates for its tasks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9123) Add metric role consumed quota.
[ https://issues.apache.org/jira/browse/MESOS-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9123: - Assignee: Meng Zhu (was: Till Toenshoff) > Add metric role consumed quota. > --- > > Key: MESOS-9123 > URL: https://issues.apache.org/jira/browse/MESOS-9123 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Meng Zhu >Priority: Critical > Labels: allocator, mesosphere, metrics, resource-management > > Currently, quota-related metrics expose the quota guarantee and allocated quota. > We should expose "consumed", which is allocated quota plus unallocated > reservations. We already have this info in the allocator as > `consumedQuotaScalarQuantities`; we just need to expose it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
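The consumed-quota computation described in the ticket can be sketched as follows. This is a simplified standalone model; the function name and signature are illustrative, not the actual Mesos allocator code.

```cpp
#include <cassert>

// Simplified model of the proposed metric: "consumed" quota for a role is
// its allocated quota plus its unallocated reservations (which, per the
// ticket, the allocator already tracks as `consumedQuotaScalarQuantities`).
double consumedQuota(double allocated, double unallocatedReservations) {
  return allocated + unallocatedReservations;
}
```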
[jira] [Commented] (MESOS-9687) Add the glog patch to pass microseconds via the LogSink interface.
[ https://issues.apache.org/jira/browse/MESOS-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812705#comment-16812705 ] Vinod Kone commented on MESOS-9687: --- Any plans to backport this? > Add the glog patch to pass microseconds via the LogSink interface. > -- > > Key: MESOS-9687 > URL: https://issues.apache.org/jira/browse/MESOS-9687 > Project: Mesos > Issue Type: Task >Reporter: Andrei Sekretenko >Assignee: Andrei Sekretenko >Priority: Major > Fix For: 1.8.0 > > > Currently, custom LogSink implementations in the modules (for example, this > one: > [https://github.com/dcos/dcos-mesos-modules/blob/master/logsink/logsink.hpp] > ) > are logging `00` instead of microseconds in the timestamp - simply > because the LogSink interface in glog has no place for microseconds. > The proposed glog fix is here: [https://github.com/google/glog/pull/441] > Getting this into a glog release might take a long time (they released 0.4.0 > recently, but the previous release 0.3.5 was two years ago), therefore it > makes sense to add this patch into the Mesos build. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
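To illustrate the gap: glog's stock LogSink::send() receives the wall-clock time as a `struct tm`, which has no sub-second field, so a custom sink can only emit a zero placeholder for the microsecond part of the timestamp. A minimal standalone sketch of the formatting difference (illustrative code, not the glog API itself):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Format the microseconds field of a log timestamp. With the unpatched
// LogSink interface a sink has no microsecond value available (modeled here
// as a negative argument), so it can only emit zeros; the patch referenced
// above threads the real value through as an extra parameter.
std::string formatMicros(int usecs) {
  if (usecs < 0) {
    return "000000";  // placeholder: the interface carries no sub-second data
  }
  char buf[16];
  std::snprintf(buf, sizeof(buf), "%06d", usecs);  // zero-padded, 6 digits
  return std::string(buf);
}
```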
[jira] [Comment Edited] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors
[ https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812550#comment-16812550 ] Vinod Kone edited comment on MESOS-6285 at 4/8/19 5:27 PM: --- We already limit the number of completed tasks per executor (200, not configurable), completed executors per framework (150, configurable) and max frameworks (50, not configurable) in memory. I don't think there's much value in storing metadata information about more than these tasks/executors/frameworks on the disk? If yes, we need to figure out how to GC a task/executor/framework once it goes out of the in-memory circular buffers / bounded hashmaps holding these. was (Author: vinodkone): We already limit the number of completed tasks per executor (200, not configurable) and completed executors per framework (150, configurable) in memory. I don't think there's much value in storing metadata information about more than these tasks/executors on the disk? If yes, we need to figure out how to GC a task/executor once it goes out of the in-memory circular buffers / bounded hashmaps holding these. > Agents may OOM during recovery if there are too many tasks or executors > --- > > Key: MESOS-6285 > URL: https://issues.apache.org/jira/browse/MESOS-6285 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Joseph Wu >Priority: Critical > Labels: mesosphere > > On a test cluster, we encountered a degenerate case where running the > example {{long-lived-framework}} for over a week would render the agent > unrecoverable. > The {{long-lived-framework}} creates one custom {{long-lived-executor}} and > launches a single task on that executor every time it receives an offer from > that agent. Over a week's worth of time, the framework manages to launch > some 400k tasks (short sleeps) on one executor. During runtime, this is not > problematic, as each completed task is quickly rotated out of the agent's > memory (and checkpointed to disk). 
> During recovery, however, the agent reads every single task into memory, > which leads to slow recovery; and often results in the agent being OOM-killed > before it finishes recovering. > To repro this condition quickly: > 1) Apply this patch to the {{long-lived-framework}}: > {code} > diff --git a/src/examples/long_lived_framework.cpp > b/src/examples/long_lived_framework.cpp > index 7c57eb5..1263d82 100644 > --- a/src/examples/long_lived_framework.cpp > +++ b/src/examples/long_lived_framework.cpp > @@ -358,16 +358,6 @@ private: >// Helper to launch a task using an offer. >void launch(const Offer& offer) >{ > -int taskId = tasksLaunched++; > -++metrics.tasks_launched; > - > -TaskInfo task; > -task.set_name("Task " + stringify(taskId)); > -task.mutable_task_id()->set_value(stringify(taskId)); > -task.mutable_agent_id()->MergeFrom(offer.agent_id()); > -task.mutable_resources()->CopyFrom(taskResources); > -task.mutable_executor()->CopyFrom(executor); > - > Call call; > call.set_type(Call::ACCEPT); > > @@ -380,7 +370,23 @@ private: > Offer::Operation* operation = accept->add_operations(); > operation->set_type(Offer::Operation::LAUNCH); > > -operation->mutable_launch()->add_task_infos()->CopyFrom(task); > +// Launch as many tasks as possible in the given offer. > +Resources remaining = Resources(offer.resources()).flatten(); > +while (remaining.contains(taskResources)) { > + int taskId = tasksLaunched++; > + ++metrics.tasks_launched; > + > + TaskInfo task; > + task.set_name("Task " + stringify(taskId)); > + task.mutable_task_id()->set_value(stringify(taskId)); > + task.mutable_agent_id()->MergeFrom(offer.agent_id()); > + task.mutable_resources()->CopyFrom(taskResources); > + task.mutable_executor()->CopyFrom(executor); > + > + operation->mutable_launch()->add_task_infos()->CopyFrom(task); > + > + remaining -= taskResources; > +} > > mesos->send(call); >} > {code} > 2) Run a master, agent, and {{long-lived-framework}}. 
On a 1 CPU, 1 GB agent > + this patch, it should take about 10 minutes to build up sufficient task > launches. > 3) Restart the agent and watch it flail during recovery. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors
[ https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812635#comment-16812635 ] Vinod Kone commented on MESOS-6285: --- Note that we currently read the executor state from disk for *all* completed executors in `state.cpp`. We can improve this to only read completed executor information until we reach the completed executors per framework limit. Same with completed tasks and completed frameworks. > Agents may OOM during recovery if there are too many tasks or executors > --- > > Key: MESOS-6285 > URL: https://issues.apache.org/jira/browse/MESOS-6285 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Joseph Wu >Priority: Critical > Labels: mesosphere > > (Issue description and repro patch identical to the first MESOS-6285 message above.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
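The improvement suggested in the comment above could look roughly like this. This is a standalone sketch with illustrative names, not the real `state.cpp` API: stop reading checkpointed completed-executor entries once the per-framework limit is reached, rather than loading everything into memory and trimming afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Recover at most `limit` completed-executor entries from the checkpointed
// paths; entries beyond the limit are never parsed into memory. The same
// capping would apply to completed tasks and completed frameworks.
std::vector<std::string> recoverCompletedExecutors(
    const std::vector<std::string>& checkpointedPaths, size_t limit) {
  std::vector<std::string> recovered;
  for (const std::string& path : checkpointedPaths) {
    if (recovered.size() >= limit) {
      break;  // skip the rest instead of reading 400k entries into memory
    }
    recovered.push_back(path);
  }
  return recovered;
}
```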
[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors
[ https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812550#comment-16812550 ] Vinod Kone commented on MESOS-6285: --- We already limit the number of completed tasks per executor (200, not configurable) and completed executors per framework (150, configurable) in memory. I don't think there's much value in storing metadata information about more than these tasks/executors on the disk? If yes, we need to figure out how to GC a task/executor once it goes out of the in-memory circular buffers / bounded hashmaps holding these. > Agents may OOM during recovery if there are too many tasks or executors > --- > > Key: MESOS-6285 > URL: https://issues.apache.org/jira/browse/MESOS-6285 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Joseph Wu >Priority: Critical > Labels: mesosphere > > (Issue description and repro patch identical to the first MESOS-6285 message above.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors
[ https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812533#comment-16812533 ] Vinod Kone commented on MESOS-6285: --- Raising the priority to Critical because we have seen this happen in a production cluster. > Agents may OOM during recovery if there are too many tasks or executors > --- > > Key: MESOS-6285 > URL: https://issues.apache.org/jira/browse/MESOS-6285 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Joseph Wu >Priority: Critical > Labels: mesosphere > > (Issue description and repro patch identical to the first MESOS-6285 message above.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9693) Add master validation for SeccompInfo.
[ https://issues.apache.org/jira/browse/MESOS-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810961#comment-16810961 ] Vinod Kone commented on MESOS-9693: --- In addition to the points raised above, there is also an upgrade compatibility issue with implementing this. If a framework's task doesn't work when seccomp is enabled (e.g., a kubelet task that needs to run as unconfined so that it can launch k8s pods that are seccomp-confined by a docker seccomp profile), then the framework needs to be upgraded first to use the seccomp unconfined option. Now, if this framework was already running on a non-seccomp-enabled cluster, the upgraded framework needs to keep running even with seccomp disabled. After the framework upgrade, the mesos agent can be upgraded to enable seccomp and this won't affect the framework. So Mesos cannot reject such a task, but should just ignore it. [~gilbert] [~abudnik] Should we close this as "Won't do"? > Add master validation for SeccompInfo. > -- > > Key: MESOS-9693 > URL: https://issues.apache.org/jira/browse/MESOS-9693 > Project: Mesos > Issue Type: Task >Reporter: Gilbert Song >Assignee: Andrei Budnik >Priority: Major > > 1. If seccomp is not enabled, we should return a failure if any framework specifies > SeccompInfo, and return an appropriate status update. > 2. At most one of the fields profile_name and unconfined should be set; better to > validate this in the master. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
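Point 2 of the description (at most one of `profile_name` and `unconfined` set) is a simple mutual-exclusion check. A hypothetical sketch of that rule, not the actual Mesos validator:

```cpp
#include <cassert>

// A SeccompInfo is invalid if both `profile_name` and `unconfined` are set;
// either one alone, or neither, is acceptable. (Field names follow the
// ticket; this models only the mutual-exclusion rule.)
bool validSeccompInfo(bool hasProfileName, bool hasUnconfined) {
  return !(hasProfileName && hasUnconfined);
}
```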
[jira] [Commented] (MESOS-6934) Support pulling Docker images with V2 Schema 2 image manifest
[ https://issues.apache.org/jira/browse/MESOS-6934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805022#comment-16805022 ] Vinod Kone commented on MESOS-6934: --- [~gilbert] Can you post the review chain here? > Support pulling Docker images with V2 Schema 2 image manifest > - > > Key: MESOS-6934 > URL: https://issues.apache.org/jira/browse/MESOS-6934 > Project: Mesos > Issue Type: Improvement > Components: containerization > Environment: https://reviews.apache.org/r/70288/ > https://reviews.apache.org/r/70289/ > https://reviews.apache.org/r/70290/ > https://reviews.apache.org/r/70291/ >Reporter: Ilya Pronin >Assignee: Gilbert Song >Priority: Major > Labels: containerization > > MESOS-3505 added support for pulling Docker images by their digest to the > Mesos Containerizer provisioner. However currently it only works with images > that were pushed with Docker 1.9 and older or with Registry 2.2.1 and older. > Newer versions use Schema 2 manifests by default. Because of CAS constraints > the registry does not convert those manifests on-the-fly to Schema 1 when > they are being pulled by digest. > Compatibility details are documented here: > https://docs.docker.com/registry/compatibility/ > Image Manifest V2, Schema 2 is documented here: > https://docs.docker.com/registry/spec/manifest-v2-2/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-2842) Master crashes when framework changes principal on re-registration
[ https://issues.apache.org/jira/browse/MESOS-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804274#comment-16804274 ] Vinod Kone commented on MESOS-2842: --- Yes, this ticket should just focus on disallowing or ignoring principal changes on re-registration. > Master crashes when framework changes principal on re-registration > -- > > Key: MESOS-2842 > URL: https://issues.apache.org/jira/browse/MESOS-2842 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Assignee: Andrei Sekretenko >Priority: Critical > Labels: foundations, security > > The master should be updated to avoid crashing when a framework re-registers > with a different principal. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9672) Docker containerizer should ignore pids of executors that do not pass the connection check.
[ https://issues.apache.org/jira/browse/MESOS-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801146#comment-16801146 ] Vinod Kone commented on MESOS-9672: --- I guess we would still need this in case the pid reuse happens even without an agent reboot (highly unlikely but technically possible). > Docker containerizer should ignore pids of executors that do not pass the > connection check. > --- > > Key: MESOS-9672 > URL: https://issues.apache.org/jira/browse/MESOS-9672 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Meng Zhu >Priority: Major > Labels: containerization > > When recovering executors with a tracked pid, we first try to establish a > connection to its libprocess address to avoid reaping an irrelevant process: > https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1019-L1054 > If the connection cannot be established, we should not track its pid: > https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1071 > One trouble this might cause is that if the pid is being used by another > executor, this could lead to a duplicate-pid error and send the agent into a > crash loop: > https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1066-L1068 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
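The recovery decision described above can be modeled as follows. This is a simplified standalone sketch with illustrative names, not the real Docker containerizer code: a checkpointed pid is tracked only when the libprocess connection check passes, so a reused pid is ignored instead of colliding with another executor's tracked pid and crashing the agent.

```cpp
#include <cassert>
#include <set>

// Returns true if the pid was newly tracked. A failed connection check means
// the pid likely belongs to an unrelated (reused) process, so it is ignored;
// a duplicate pid is also refused here rather than treated as a fatal error.
bool trackRecoveredPid(
    int pid, bool connectionCheckPassed, std::set<int>& trackedPids) {
  if (!connectionCheckPassed) {
    return false;  // do not track, do not reap
  }
  // std::set::insert reports a duplicate via the bool in its return value.
  return trackedPids.insert(pid).second;
}
```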
[jira] [Commented] (MESOS-9672) Docker containerizer should ignore pids of executors that do not pass the connection check.
[ https://issues.apache.org/jira/browse/MESOS-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799331#comment-16799331 ] Vinod Kone commented on MESOS-9672: --- Not sure if this is still needed after https://issues.apache.org/jira/browse/MESOS-9501 > Docker containerizer should ignore pids of executors that do not pass the > connection check. > --- > > Key: MESOS-9672 > URL: https://issues.apache.org/jira/browse/MESOS-9672 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Meng Zhu >Priority: Major > Labels: containerization > > (Issue description identical to the previous MESOS-9672 message above.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-4719) Add allocator metric for number of offers each role / framework received.
[ https://issues.apache.org/jira/browse/MESOS-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796509#comment-16796509 ] Vinod Kone commented on MESOS-4719: --- [~greggomann] Don't we have this now? cc [~gkleiman] > Add allocator metric for number of offers each role / framework received. > - > > Key: MESOS-4719 > URL: https://issues.apache.org/jira/browse/MESOS-4719 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Bannier >Priority: Major > Labels: mesosphere > > A counter for the number of allocations to a framework can be used to monitor > allocation progress, e.g., when agents are added to a cluster, and as other > frameworks are added or removed. > Currently, an offer by the hierarchical allocator to a framework consists of > a list of resources on possibly many agents. Resources might be offered in > order to satisfy outstanding quota or for fairness. To capture allocations at > a fine granularity, we should not count the number of offers, but instead the > pieces making up each offer, as such a metric would better resolve the effect > of changes (e.g., adding/removing a framework). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9500) spark submit with docker image on mesos cluster fails.
[ https://issues.apache.org/jira/browse/MESOS-9500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796507#comment-16796507 ] Vinod Kone commented on MESOS-9500: --- [~atheethkaup] Can you paste all the relevant agent log lines related to this task? It's hard to tell from just the line you posted above. We need log lines from the task launch all the way to the task failing. > spark submit with docker image on mesos cluster fails. > -- > > Key: MESOS-9500 > URL: https://issues.apache.org/jira/browse/MESOS-9500 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 1.7.0 >Reporter: atheeth kaup >Priority: Critical > > We have a 3-node cluster with Mesos (v1.7), Spark 2.4 and Docker 18.06.1 > installed. One is the master and the other two are agents. A spark-submit > job fails. The UI shows only one task (Driver) launching on one of the slaves, in > the failed state. > > Command: > spark-submit \ > --master mesos://.32:7077 \ > --deploy-mode cluster \ > --class com.learning.spark.WordCount \ > --conf > spark.mesos.executor.docker.image=mesosphere/spark:2.4.0-2.2.1-3-hadoop-2.7 \ > --conf spark.master.rest.enabled=true \ > /home/mapr/mesos/wordcount.jar hdfs://***.36:8020/user/mapr/sparkL/input.txt > hdfs://***.36:8020/user/output > > Error in one of the logs: > > Running on machine: **-i0058 > Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg > W1221 16:51:23.857431 17978 state.cpp:478] Failed to find executor forked pid > file > '/home/**/mesos/mesos-1.7.0/build/workDir/meta/slaves/822a5d52-b8ba-459f-ade2-7f3a2ebd240f-S0/frameworks/77c39bdf-09e3-4cb9-9026-21e900d08318-0007/executors/driver-20181221112019-0006/runs/7c1399ca-4e0a-4bd9-b02e-9c5ca3854c77/pids/forked.pid' > > Below is the only property that we have set on all the nodes, and we have started > the dispatcher: > *export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/libmesos.so* > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9651) Design for docker registry v2 schema2 basic support.
[ https://issues.apache.org/jira/browse/MESOS-9651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9651: - Assignee: Qian Zhang > Design for docker registry v2 schema2 basic support. > > > Key: MESOS-9651 > URL: https://issues.apache.org/jira/browse/MESOS-9651 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Gilbert Song >Assignee: Qian Zhang >Priority: Major > Labels: containerization > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9448) Semantics of RECONCILE_OPERATIONS framework API call are incorrect
[ https://issues.apache.org/jira/browse/MESOS-9448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792789#comment-16792789 ] Vinod Kone commented on MESOS-9448: --- [~greggomann] Can we close this as a dup of MESOS-9318? > Semantics of RECONCILE_OPERATIONS framework API call are incorrect > -- > > Key: MESOS-9448 > URL: https://issues.apache.org/jira/browse/MESOS-9448 > Project: Mesos > Issue Type: Bug > Components: framework, HTTP API, master >Reporter: Benjamin Bannier >Priority: Major > > The typical pattern in the framework HTTP API is that frameworks send calls > to which the master responds with {{Accepted}} responses and which trigger > events. The only designed exception to this are {{SUBSCRIBE}} calls to which > the master responds with an {{Ok}} response containing the assigned framework > ID. This is even codified in {{src/scheduler.cpp:646ff}}, > {code} > if (response->code == process::http::Status::OK) { > // Only SUBSCRIBE call should get a "200 OK" response. > CHECK_EQ(Call::SUBSCRIBE, call.type()); > {code} > Currently, the handling of {{RECONCILE_OPERATIONS}} calls does not follow > this pattern. Instead of sending events, the master immediately responds with > a {{Ok}} and a list of operations. This e.g., leads to assertion failures in > above hard check whenever one uses the {{Scheduler::send}} instead of > {{Scheduler::call}}. One can reproduce this by modifying the existing tests > in {{src/operation_reconciliation_tests.cpp}}, > {code} > mesos.send({createCallReconcileOperations(frameworkId, {operation})}); // ADD > THIS. > const Future result = > mesos.call({createCallReconcileOperations(frameworkId, {operation})}); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
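The pattern described in MESOS-9448 is easy to see in a small model. The following is a hedged Python sketch (not Mesos code; `handle_response` and the string call types are invented for illustration) of why a master that answers a non-SUBSCRIBE call with `200 OK` plus a body trips the scheduler library's hard check:

```python
# Toy model of the scheduler library's response handling. Only the
# SUBSCRIBE call is expected to receive "200 OK" (carrying the assigned
# framework ID); every other call should get "202 Accepted" and deliver
# its result later as an event on the event stream.
OK, ACCEPTED = 200, 202

def handle_response(call_type, status):
    if status == OK:
        # Mirrors the CHECK_EQ(Call::SUBSCRIBE, call.type()) hard check
        # in src/scheduler.cpp: a 200 for any other call type aborts.
        assert call_type == "SUBSCRIBE", f"unexpected 200 OK for {call_type}"
        return "subscribed"
    if status == ACCEPTED:
        return "accepted"  # result arrives asynchronously as an event
    return "error"
```

Under this model, a master that replies to RECONCILE_OPERATIONS with `200 OK` and a list of operations (rather than `202` plus an event) hits the assertion whenever the response is routed through `Scheduler::send`.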
[jira] [Commented] (MESOS-8257) Unified Containerizer "leaks" a target container mount path to the host FS when the target resolves to an absolute path
[ https://issues.apache.org/jira/browse/MESOS-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792153#comment-16792153 ] Vinod Kone commented on MESOS-8257: --- [~jasonlai] [~jieyu] Is there more to be done here? > Unified Containerizer "leaks" a target container mount path to the host FS > when the target resolves to an absolute path > --- > > Key: MESOS-8257 > URL: https://issues.apache.org/jira/browse/MESOS-8257 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.3.1, 1.4.1, 1.5.0 >Reporter: Jason Lai >Assignee: Jason Lai >Priority: Critical > Labels: bug, containerizer, mountpath > > If a target path under the root FS provisioned from an image resolves to an > absolute path, it will not appear in the container root FS after > {{pivot_root(2)}} is called. > A typical example is that when the target path is under {{/var/run}} (e.g. > {{/var/run/some-dir}}), which is usually a symlink to an absolute path of > {{/run}} in Debian images, the target path will get resolved as and created > at {{/run/some-dir}} in the host root FS, after the container root FS gets > provisioned. The target path will get unmounted after {{pivot_root(2)}} as it > is part of the old root (host FS). > A workaround is to use {{/run}} instead of {{/var/run}}, but absolute > symlinks need to be resolved within the scope of the container root FS path. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
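The workaround mentioned at the end of MESOS-8257 — resolving absolute symlinks within the scope of the container root FS — can be sketched as follows. This is a hedged, simplified illustration (`resolve_in_scope` is an invented helper, not the Mesos implementation; it ignores `..` components and symlink loops):

```python
import os

def resolve_in_scope(rootfs, path):
    """Resolve an in-container absolute `path` to a host path, following
    symlinks but re-anchoring absolute link targets at `rootfs` rather
    than at the host root (which is where a naive resolution escapes)."""
    resolved = rootfs
    parts = [p for p in path.split("/") if p]
    while parts:
        candidate = os.path.join(resolved, parts.pop(0))
        if os.path.islink(candidate):
            target = os.readlink(candidate)
            if os.path.isabs(target):
                # "/run" means "<rootfs>/run" inside the container, not
                # the host's "/run" — restart resolution at the rootfs.
                resolved = rootfs
            parts = [p for p in target.split("/") if p] + parts
        else:
            resolved = candidate
    return resolved
```

With a Debian-style rootfs where `var/run` is a symlink to the absolute path `/run`, resolving `/var/run/some-dir` this way yields `<rootfs>/run/some-dir` instead of leaking the mount target to the host's `/run/some-dir`.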
[jira] [Assigned] (MESOS-4599) ReviewBot should re-verify a review chain if any of the reviews is updated
[ https://issues.apache.org/jira/browse/MESOS-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-4599: - Assignee: Vinod Kone > ReviewBot should re-verify a review chain if any of the reviews is updated > -- > > Key: MESOS-4599 > URL: https://issues.apache.org/jira/browse/MESOS-4599 > Project: Mesos > Issue Type: Improvement > Components: reviewbot >Reporter: Vinod Kone >Assignee: Vinod Kone >Priority: Major > Labels: integration, newbie++ > > Currently reviewbot only re-verifies a review chain if the last review in the > chain is updated (new diff or new depends on field). It should also re-verify > if one of the dependent reviews in the chain is updated! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9592) Mesos Websitebot is flaky
Vinod Kone created MESOS-9592: - Summary: Mesos Websitebot is flaky Key: MESOS-9592 URL: https://issues.apache.org/jira/browse/MESOS-9592 Project: Mesos Issue Type: Bug Components: project website Reporter: Vinod Kone Mesos Websitebot Jenkins job is sometimes failing during the endpoint documentation generation phase. It looks like it is timing out on getting a response from the /health endpoint of the master. Example failing build: https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Websitebot/1899/ {code} 01:20:30 make[2]: Leaving directory '/mesos/build/src' 01:20:30 make[1]: Leaving directory '/mesos/build/src' 01:20:30 /mesos 01:20:41 Timeout attempting to hit url: http://127.0.0.1:5050/health {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8930) THREADSAFE_SnapshotTimeout is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773095#comment-16773095 ] Vinod Kone commented on MESOS-8930: --- Saw this when testing 1.7.2 rc. {code} 2: [ RUN ] MetricsTest.THREADSAFE_SnapshotTimeout 2: I0219 23:34:37.010373 23554 process.cpp:3588] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot' 2: I0219 23:34:37.062614 23555 process.cpp:3588] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot' 2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: Failure {code} > THREADSAFE_SnapshotTimeout is flaky. > > > Key: MESOS-8930 > URL: https://issues.apache.org/jira/browse/MESOS-8930 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.7.2 > Environment: Ubuntu 16.04 >Reporter: Alexander Rukletsov >Assignee: Benjamin Mahler >Priority: Major > Labels: flaky-test, foundations, mesosphere > > Observed on ASF CI, might be related to a recent test change > https://reviews.apache.org/r/66831/ > {noformat} > 18:23:31 2: [ RUN ] MetricsTest.THREADSAFE_SnapshotTimeout > 18:23:31 2: I0516 18:23:31.747611 16246 process.cpp:3583] Handling HTTP event > for process 'metrics' with path: '/metrics/snapshot' > 18:23:31 2: I0516 18:23:31.796871 16251 process.cpp:3583] Handling HTTP event > for process 'metrics' with path: '/metrics/snapshot' > 18:23:46 2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: > Failure > 18:23:46 2: Failed to wait 15secs for response > 22:57:13 Build timed out (after 300 minutes). Marking the build as failed. > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8887) Unreachable tasks are not GC'ed when unreachable agent is GC'ed.
[ https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769813#comment-16769813 ] Vinod Kone commented on MESOS-8887: --- Landed on master: commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba Author: Vinod Kone Date: Tue Feb 5 16:55:19 2019 -0600 Tested unreachable task behavior on agent GC. Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch. Review: https://reviews.apache.org/r/69909 commit c72a4f909054e5efa75d9e5d8dde71b0083402c1 Author: Vinod Kone Date: Sat Feb 2 10:01:56 2019 -0600 Removed unreachable tasks from `Master::Framework` on agent GC. Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed from the registry. This patch fixes it so that the latter is also cleaned up. Review: https://reviews.apache.org/r/69908 commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373 Author: Vinod Kone Date: Sat Feb 2 09:51:09 2019 -0600 Fixed variable names in `Master::_doRegistryGC()`. Substituted `slave` with `slaveId` to be consistent with the code base. No functional changes. Review: https://reviews.apache.org/r/69907 Backported to 1.7.x commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9 Author: Vinod Kone Date: Fri Feb 15 14:33:00 2019 -0600 Added MESOS-8887 to the 1.7.2 CHANGELOG. commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba Author: Vinod Kone Date: Tue Feb 5 16:55:19 2019 -0600 Tested unreachable task behavior on agent GC. Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch. Review: https://reviews.apache.org/r/69909 commit c72a4f909054e5efa75d9e5d8dde71b0083402c1 Author: Vinod Kone Date: Sat Feb 2 10:01:56 2019 -0600 Removed unreachable tasks from `Master::Framework` on agent GC. 
Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed from the registry. This patch fixes it so that the latter is also cleaned up. Review: https://reviews.apache.org/r/69908 commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373 Author: Vinod Kone Date: Sat Feb 2 09:51:09 2019 -0600 Fixed variable names in `Master::_doRegistryGC()`. Substituted `slave` with `slaveId` to be consistent with the code base. No functional changes. Review: https://reviews.apache.org/r/69907 > Unreachable tasks are not GC'ed when unreachable agent is GC'ed. > > > Key: MESOS-8887 > URL: https://issues.apache.org/jira/browse/MESOS-8887 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.4.3, 1.5.2, 1.6.1, 1.7.1 >Reporter: Gilbert Song >Assignee: Vinod Kone >Priority: Major > Labels: foundations, mesosphere, partition, registry > > Unreachable agents will be gc-ed by the master registry after > `--registry_max_agent_age` duration or `--registry_max_agent_count`. When the > GC happens, the agent will be removed from the master's unreachable agent > list, but its corresponding tasks are still in UNREACHABLE state in the > framework struct (though removed from `slaves.unreachableTasks`). We should > instead remove those tasks from everywhere or transition those tasks to a > terminal state, either TASK_LOST or TASK_GONE (further discussion is needed > to define the semantic). > This improvement relates to how do we want to couple the update of task with > the GC of agent. Right now they are somewhat decoupled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8892) MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky
[ https://issues.apache.org/jira/browse/MESOS-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769822#comment-16769822 ] Vinod Kone commented on MESOS-8892: --- Observed this on 1.6.x branch {code} [ RUN ] MasterSlaveReconciliationTest.ReconcileDroppedOperation I0215 21:36:18.921594 4052 cluster.cpp:172] Creating default 'local' authorizer I0215 21:36:18.922894 4057 master.cpp:465] Master 21d3c979-83c3-4141-9a3a-635fd550d45a (ip-172-16-10-236.ec2.internal) started on 172.16.10.236:36326 I0215 21:36:18.922915 4057 master.cpp:468] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/exYTvt/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/exYTvt/master" 
--zk_session_timeout="10secs" I0215 21:36:18.923121 4057 master.cpp:517] Master only allowing authenticated frameworks to register I0215 21:36:18.923393 4057 master.cpp:523] Master only allowing authenticated agents to register I0215 21:36:18.923408 4057 master.cpp:529] Master only allowing authenticated HTTP frameworks to register I0215 21:36:18.923414 4057 credentials.hpp:37] Loading credentials for authentication from '/tmp/exYTvt/credentials' I0215 21:36:18.923651 4057 master.cpp:573] Using default 'crammd5' authenticator I0215 21:36:18.923777 4057 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0215 21:36:18.923904 4057 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0215 21:36:18.924266 4057 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0215 21:36:18.924465 4057 master.cpp:654] Authorization enabled I0215 21:36:18.924823 4056 hierarchical.cpp:179] Initialized hierarchical allocator process I0215 21:36:18.927826 4058 whitelist_watcher.cpp:77] No whitelist given I0215 21:36:18.928741 4054 master.cpp:2176] Elected as the leading master! 
I0215 21:36:18.928759 4054 master.cpp:1711] Recovering from registrar I0215 21:36:18.928800 4054 registrar.cpp:339] Recovering registrar I0215 21:36:18.929002 4054 registrar.cpp:383] Successfully fetched the registry (0B) in 132096ns I0215 21:36:18.929033 4054 registrar.cpp:487] Applied 1 operations in 7184ns; attempting to update the registry I0215 21:36:18.929154 4058 registrar.cpp:544] Successfully updated the registry in 108032ns I0215 21:36:18.929232 4058 registrar.cpp:416] Successfully recovered registrar I0215 21:36:18.929361 4055 master.cpp:1825] Recovered 0 agents from the registry (176B); allowing 10mins for agents to reregister I0215 21:36:18.929415 4055 hierarchical.cpp:217] Skipping recovery of hierarchical allocator: nothing to recover W0215 21:36:18.931118 4052 process.cpp:2829] Attempted to spawn already running process files@172.16.10.236:36326 I0215 21:36:18.931596 4052 containerizer.cpp:300] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } I0215 21:36:18.934453 4052 linux_launcher.cpp:147] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher I0215 21:36:18.934859 4052 provisioner.cpp:299] Using default backend 'aufs' I0215 21:36:18.935410 4052 cluster.cpp:460] Creating default 'local' authorizer I0215 21:36:18.936164 4060 slave.cpp:259] Mesos agent started on (230)@172.16.10.236:36326 W0215 21:36:18.936399 4052 process.cpp:2829] Attempted to spawn already running process version@172.16.10.236:36326 I0215 21:36:18.936187 4060 slave.cpp:260] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/exYTvt/GHfic5/store/appc" --authenticate_http_executors="true"
[jira] [Commented] (MESOS-8892) MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky
[ https://issues.apache.org/jira/browse/MESOS-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769824#comment-16769824 ] Vinod Kone commented on MESOS-8892: --- [~bbannier] Can we backport this test fix to 1.6.x branch? > MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky > > > Key: MESOS-8892 > URL: https://issues.apache.org/jira/browse/MESOS-8892 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.6.0 >Reporter: Greg Mann >Assignee: Benjamin Bannier >Priority: Major > Labels: mesosphere > Fix For: 1.7.0 > > Attachments: > MasterSlaveReconciliationTest.ReconcileDroppedOperation.txt > > > This was observed on a Debian 9 SSL/GRPC-enabled build. It appears that a > poorly-timed {{UpdateSlaveMessage}} leads to the operation reconciliation > occurring before the expectation for the {{ReconcileOperationsMessage}} is > registered: > {code} > I0508 00:11:09.700815 22498 master.cpp:4362] Processing ACCEPT call for > offers: [ f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-O0 ] on agent > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 > (localhost) for framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (default) > at scheduler-b0f55e01-2f6f-42c8-8614-901036acfc31@127.0.0.1:36309 > I0508 00:11:09.700870 22498 master.cpp:3602] Authorizing principal > 'test-principal' to reserve resources 'cpus(allocated: > default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):2; > mem(allocated: default-role)(reservations: > [(DYNAMIC,default-role,test-principal)]):1024; disk(allocated: > default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; > ports(allocated: default-role)(reservations: > [(DYNAMIC,default-role,test-principal)]):[31000-32000]' > I0508 00:11:09.701228 22493 master.cpp:4725] Applying RESERVE operation for > resources > 
[{"allocation_info":{"role":"default-role"},"name":"cpus","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":2.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"mem","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"disk","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"type":"RANGES"}] > from framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (default) at > scheduler-b0f55e01-2f6f-42c8-8614-901036acfc31@127.0.0.1:36309 to agent > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 > (localhost) > I0508 00:11:09.701498 22493 master.cpp:11265] Sending operation '' (uuid: > 81dffb62-6e75-4c6c-a97b-41c92c58d6a7) to agent > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 > (localhost) > I0508 00:11:09.701627 22494 slave.cpp:1564] Forwarding agent update > {"operations":{},"resource_version_uuid":{"value":"0HeA06ftS6m76SNoNZNPag=="},"slave_id":{"value":"f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0"},"update_oversubscribed_resources":true} > I0508 00:11:09.701848 22494 master.cpp:7800] Received update of agent > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 > (localhost) with total oversubscribed resources {} > W0508 00:11:09.701905 22494 master.cpp:7974] Performing explicit > reconciliation with agent for known operation > 81dffb62-6e75-4c6c-a97b-41c92c58d6a7 since it was not present in original > reconciliation message from agent > I0508 00:11:09.702085 22494 master.cpp:11015] Updating the state of operation > '' (uuid: 
81dffb62-6e75-4c6c-a97b-41c92c58d6a7) for framework > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (latest state: OPERATION_PENDING, > status update state: OPERATION_DROPPED) > I0508 00:11:09.702239 22491 hierarchical.cpp:925] Updated allocation of > framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- on agent > f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 from cpus(allocated: default-role):2; > mem(allocated: default-role):1024; disk(allocated: default-role):1024; > ports(allocated: default-role):[31000-32000] to disk(allocated: > default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; > cpus(allocated: default-role)(reservations: > [(DYNAMIC,default-role,test-principal)]):2; mem(allocated: > default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; > ports(allocated: default-role)(reservations: >
[jira] [Comment Edited] (MESOS-8887) Unreachable tasks are not GC'ed when unreachable agent is GC'ed.
[ https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769813#comment-16769813 ] Vinod Kone edited comment on MESOS-8887 at 2/15/19 10:38 PM: - Landed on master: --- commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba Author: Vinod Kone Date: Tue Feb 5 16:55:19 2019 -0600 Tested unreachable task behavior on agent GC. Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch. Review: https://reviews.apache.org/r/69909 commit c72a4f909054e5efa75d9e5d8dde71b0083402c1 Author: Vinod Kone Date: Sat Feb 2 10:01:56 2019 -0600 Removed unreachable tasks from `Master::Framework` on agent GC. Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed from the registry. This patch fixes it so that the latter is also cleaned up. Review: https://reviews.apache.org/r/69908 commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373 Author: Vinod Kone Date: Sat Feb 2 09:51:09 2019 -0600 Fixed variable names in `Master::_doRegistryGC()`. Substituted `slave` with `slaveId` to be consistent with the code base. No functional changes. Review: https://reviews.apache.org/r/69907 --- Backported to 1.7.x --- commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9 Author: Vinod Kone Date: Fri Feb 15 14:33:00 2019 -0600 Added MESOS-8887 to the 1.7.2 CHANGELOG. commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba Author: Vinod Kone Date: Tue Feb 5 16:55:19 2019 -0600 Tested unreachable task behavior on agent GC. Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch. Review: https://reviews.apache.org/r/69909 commit c72a4f909054e5efa75d9e5d8dde71b0083402c1 Author: Vinod Kone Date: Sat Feb 2 10:01:56 2019 -0600 Removed unreachable tasks from `Master::Framework` on agent GC. 
Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed from the registry. This patch fixes it so that the latter is also cleaned up. Review: https://reviews.apache.org/r/69908 commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373 Author: Vinod Kone Date: Sat Feb 2 09:51:09 2019 -0600 Fixed variable names in `Master::_doRegistryGC()`. Substituted `slave` with `slaveId` to be consistent with the code base. No functional changes. Review: https://reviews.apache.org/r/69907 was (Author: vinodkone): Landed on master: commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba Author: Vinod Kone Date: Tue Feb 5 16:55:19 2019 -0600 Tested unreachable task behavior on agent GC. Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch. Review: https://reviews.apache.org/r/69909 commit c72a4f909054e5efa75d9e5d8dde71b0083402c1 Author: Vinod Kone Date: Sat Feb 2 10:01:56 2019 -0600 Removed unreachable tasks from `Master::Framework` on agent GC. Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed from the registry. This patch fixes it so that the latter is also cleaned up. Review: https://reviews.apache.org/r/69908 commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373 Author: Vinod Kone Date: Sat Feb 2 09:51:09 2019 -0600 Fixed variable names in `Master::_doRegistryGC()`. Substituted `slave` with `slaveId` to be consistent with the code base. No functional changes. Review: https://reviews.apache.org/r/69907 Backported to 1.7.x commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9 Author: Vinod Kone Date: Fri Feb 15 14:33:00 2019 -0600 Added MESOS-8887 to the 1.7.2 CHANGELOG. commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba Author: Vinod Kone Date: Tue Feb 5 16:55:19 2019 -0600 Tested unreachable task behavior on agent GC. 
Updated `PartitionTest, RegistryGcByCount` test. This test fails without the previous patch. Review: https://reviews.apache.org/r/69909 commit c72a4f909054e5efa75d9e5d8dde71b0083402c1 Author: Vinod Kone Date: Sat Feb 2 10:01:56 2019 -0600 Removed unreachable tasks from `Master::Framework` on agent GC. Unreachable tasks are stored in `Slaves` and `Framework` structs of the master, but they were only being removed from the former when an unreachable agent is GCed
[jira] [Commented] (MESOS-8750) Check failed: !slaves.registered.contains(task->slave_id)
[ https://issues.apache.org/jira/browse/MESOS-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769727#comment-16769727 ] Vinod Kone commented on MESOS-8750: --- [~megha.sharma] [~xujyan] Why was this not backported to older versions? > Check failed: !slaves.registered.contains(task->slave_id) > - > > Key: MESOS-8750 > URL: https://issues.apache.org/jira/browse/MESOS-8750 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.6.0 >Reporter: Megha Sharma >Assignee: Megha Sharma >Priority: Critical > Fix For: 1.6.0 > > > It appears that in certain circumstances an unreachable task doesn't get > cleaned up from {{framework.unreachableTasks}} when the respective agent > re-registers, leading to this check failure later when the framework is being > removed. When an agent goes unreachable, the master adds the tasks from that agent > to {{framework.unreachableTasks}}, and when such an agent re-registers the > master removes the tasks that the agent specifies during re-registration. However, there could be tasks that the agent doesn't know about, e.g. > if the runTask message for them got dropped; such tasks will never get > removed from unreachableTasks. > {noformat} > F0310 13:30:58.856665 62740 master.cpp:9671] Check failed: > !slaves.registered.contains(task->slave_id()) Unreachable task of > framework 4f57975b-05dd-4118-8674-5b29a86c6a6c-0850 was found on registered > agent 683c4a92-b5a0-490c-998a-6113fc86d37a-S1428 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
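The sequence described in MESOS-8750 can be modeled in a few lines. A hedged Python sketch (invented variable names, not master code) of the bookkeeping gap:

```python
# Master's view of one framework's unreachable tasks.
unreachable = {"task-A", "task-B"}

# The agent never saw task-B's runTask message (it was dropped), so
# when the agent re-registers it reports only task-A.
reported_by_agent = {"task-A"}

# The master prunes unreachableTasks using only what the agent reports:
unreachable -= reported_by_agent

# task-B leaks: it stays "unreachable" even though its agent is now
# registered, which later trips
#   CHECK(!slaves.registered.contains(task->slave_id()))
# when the framework is removed.
assert unreachable == {"task-B"}
```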
[jira] [Created] (MESOS-9576) Provide a configuration option to disallow logrotate stdout/stderr options in task env
Vinod Kone created MESOS-9576: - Summary: Provide a configuration option to disallow logrotate stdout/stderr options in task env Key: MESOS-9576 URL: https://issues.apache.org/jira/browse/MESOS-9576 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Joseph Wu See MESOS-9564 for context. The configuration option could be module flag for the logrotate module. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9143) MasterQuotaTest.RemoveSingleQuota is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9143: - Assignee: Meng Zhu Story Points: 3 > MasterQuotaTest.RemoveSingleQuota is flaky. > --- > > Key: MESOS-9143 > URL: https://issues.apache.org/jira/browse/MESOS-9143 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Alexander Rukletsov >Assignee: Meng Zhu >Priority: Major > Labels: flaky, flaky-test, mesosphere, resource-management > Attachments: RemoveSingleQuota-badrun.txt > > > {noformat} > ../../src/tests/master_quota_tests.cpp:493 > Value of: metrics.at(metricKey).isNone() > Actual: false > Expected: true > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-8950) Framework operations can make resources unallocatable
[ https://issues.apache.org/jira/browse/MESOS-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767498#comment-16767498 ] Vinod Kone edited comment on MESOS-8950 at 2/13/19 7:30 PM: Resolving as Won't Fix for now. was (Author: vinodkone): Resolving is Won't Fix for now. > Framework operations can make resources unallocatable > - > > Key: MESOS-8950 > URL: https://issues.apache.org/jira/browse/MESOS-8950 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Bannier >Priority: Minor > > The allocator does not offer {{cpus}} or {{mem}} resources smaller than > certain fixed sizes. For framework operations, we do not enforce the same > minimum-size constraints, which can lead to the resources becoming unavailable > for any future allocations. This behavior is most pronounced when a > framework can register in many roles. > Example: > * A single multirole framework which can register in any role, e.g., in a > certain role subhierarchy. > * Single agent with {{cpus:1.5*MIN_CPUS}} and {{mem:1.5*MIN_MEM}}. > * Framework is offered all resources and performs a {{RESERVE}} on > {{cpus:0.5*MIN_CPUS}}. It then changes its role. > * Same framework behavior in the next two offer cycles. All {{cpus}} are then > reserved for different roles in unallocatable amounts. > * The last offer will be just for {{mem:1.5*MIN_MEM}}; the framework reserves 0.6 of > these to another role. This fragments the {{mem}} resources as well. > * No allocatable resources left in the cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
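The fragmentation arithmetic in the MESOS-8950 example can be checked with a short script. This is a hedged sketch in integer milli-cpus (to avoid floating-point noise); `MIN` corresponds to the allocator's `cpus:0.01` minimum from `--min_allocatable_resources`, and it simplifies the ticket's scenario to reservations out of the unreserved pool only — the end state (only unallocatable fragments remain) is the same:

```python
MIN = 10                 # 0.01 cpus, expressed in milli-cpus
unreserved = 15          # agent starts with 1.5 * MIN_CPUS, all unreserved
reserved = {}            # role -> milli-cpus dynamically reserved to it

cycle = 0
while unreserved >= MIN:                    # allocator still offers the pool
    reserved[f"role-{cycle}"] = MIN // 2    # framework reserves 0.5 * MIN_CPUS,
    unreserved -= MIN // 2                  # ...then re-registers in a new role
    cycle += 1

# A couple of offer cycles later, no chunk reaches the allocatable
# minimum, even though 1.5 * MIN_CPUS still exist in total.
assert unreserved < MIN
assert all(amount < MIN for amount in reserved.values())
assert unreserved + sum(reserved.values()) == 15
```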
[jira] [Assigned] (MESOS-8887) Improve the master registry GC on task state transitioning.
[ https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-8887: - Assignee: Vinod Kone > Improve the master registry GC on task state transitioning. > --- > > Key: MESOS-8887 > URL: https://issues.apache.org/jira/browse/MESOS-8887 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Gilbert Song >Assignee: Vinod Kone >Priority: Major > Labels: mesosphere, partition, registry > > Unreachable agents will be gc-ed by the master registry after > `--registry_max_agent_age` duration or `--registry_max_agent_count`. When the > GC happens, the agent will be removed from the master's unreachable agent > list, but its corresponding tasks are still in UNREACHABLE state in the > framework struct (though removed from `slaves.unreachableTasks`). We should > instead remove those tasks from everywhere or transition those tasks to a > terminal state, either TASK_LOST or TASK_GONE (further discussion is needed > to define the semantic). > This improvement relates to how do we want to couple the update of task with > the GC of agent. Right now they are somewhat decoupled. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.
[ https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761903#comment-16761903 ] Vinod Kone commented on MESOS-8096: --- Observed this with LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TaskGroupsSharingViaSandboxVolumes/2 {code} ... ... I0206 05:23:37.884572 19578 task_status_update_manager.cpp:383] Forwarding task status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to the agent I0206 05:23:37.884624 19578 slave.cpp:5808] Forwarding the update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to master@172.16.10.36:45979 I0206 05:23:37.884678 19578 slave.cpp:5701] Task status update manager successfully handled status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- I0206 05:23:37.884764 19578 master.cpp:8516] Status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- from agent ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 at slave(1170)@172.16.10.36:45979 (ip-172-16-10-36.ec2.internal) I0206 05:23:37.884784 19578 master.cpp:8573] Forwarding status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- I0206 05:23:37.884881 19578 master.cpp:11210] Updating the state of task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- (latest state: TASK_FINISHED, status update state: TASK_FINISHED) I0206 05:23:37.885048 19577 hierarchical.cpp:1230] Recovered cpus(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):0.1; mem(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):32; 
disk(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):32 (total: cpus:1.7; mem:928; disk :928; ports:[31000-32000]; cpus(reservations: [(DYNAMIC,default-role,test-principal)]):0.3; mem(reservations: [(DYNAMIC,default-role,test-principal)]):96; disk(reservations: [(DYN AMIC,default-role,test-principal)]):95; disk(reservations: [(DYNAMIC,default-role,test-principal)])[executor:executor_volume_path]:1, allocated: disk(allocated: default-role)(rese rvations: [(DYNAMIC,default-role,test-principal)])[executor:executor_volume_path]:1; disk(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):63; mem(a llocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):64; cpus(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):0.2) on age nt ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 from framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- I0206 05:23:37.885195 19572 scheduler.cpp:845] Enqueuing event UPDATE received from http://172.16.10.36:45979/master/api/v1/scheduler I0206 05:23:37.885380 19571 scheduler.cpp:248] Sending ACKNOWLEDGE call to http://172.16.10.36:45979/master/api/v1/scheduler I0206 05:23:37.885645 19572 task_status_update_manager.cpp:328] Received task status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- I0206 05:23:37.885682 19572 task_status_update_manager.cpp:383] Forwarding task status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer o f framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to the agent I0206 05:23:37.885735 19572 slave.cpp:5808] Forwarding the update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d 40-b63a-f4d3efc720de- to master@172.16.10.36:45979 I0206 05:23:37.885792 19572 slave.cpp:5701] Task status update manager successfully handled status update 
TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for tas k consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- I0206 05:23:37.885802 19578 process.cpp:3588] Handling HTTP event for process 'master' with path: '/master/api/v1/scheduler' I0206 05:23:37.885885 19578 master.cpp:8516] Status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a -f4d3efc720de- from agent ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 at slave(1170)@172.16.10.36:45979 (ip-172-16-10-36.ec2.internal) I0206 05:23:37.885905 19578 master.cpp:8573] Forwarding status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b 0-4d40-b63a-f4d3efc720de- I0206 05:23:37.885991 19578 master.cpp:11210] Updating the state of task
[jira] [Commented] (MESOS-8796) Some GroupTest.* are flaky on Mac.
[ https://issues.apache.org/jira/browse/MESOS-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761769#comment-16761769 ] Vinod Kone commented on MESOS-8796: --- Saw this again on internal CI (on Mac). {code}
[ RUN ] GroupTest.GroupPathWithRestrictivePerms
I0205 21:14:33.530055 296834496 zookeeper_test_server.cpp:156] Started ZooKeeperTestServer on port 50946
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:50946 sessionTimeout=1 watcher=0x1145565d0 sessionId=0 sessionPasswd= context=0x7fb3e0c9bc90 flags=0
2019-02-05 21:14:33,530:8369(0x73fcf000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:50946]
2019-02-05 21:14:33,532:8369(0x73fcf000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:50946], sessionId=0x168c13aa8b9, negotiated timeout=1
2019-02-05 21:14:36,875:8369(0x73fcf000):ZOO_INFO@auth_completion_func@1327: Authentication scheme digest succeeded
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:50946 sessionTimeout=1 watcher=0x1145565d0 sessionId=0 sessionPasswd= context=0x7fb3e0a4db10 flags=0
2019-02-05 21:14:36,879:8369(0x74767000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:50946]
2019-02-05 21:14:36,880:8369(0x74767000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:50946], sessionId=0x168c13aa8b90001, negotiated timeout=1
I0205 21:14:36.880167 55189504 group.cpp:341] Group process (zookeeper-group(48)@10.0.49.4:65013) connected to ZooKeeper
I0205 21:14:36.880213 55189504 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0205 21:14:36.880225 55189504 group.cpp:395] Authenticating with ZooKeeper using digest
2019-02-05 21:14:40,222:8369(0x74767000):ZOO_INFO@auth_completion_func@1327: Authentication scheme digest succeeded
I0205 21:14:40.24 55189504 group.cpp:419] Trying to create path '/read-only' in ZooKeeper
2019-02-05 21:14:40,223:8369(0x736ae000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2019-02-05
[jira] [Commented] (MESOS-8266) MasterMaintenanceTest.AcceptInvalidInverseOffer is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761770#comment-16761770 ] Vinod Kone commented on MESOS-8266: --- Observed this on internal CI. {code}
[ RUN ] MasterMaintenanceTest.AcceptInvalidInverseOffer
I0206 05:13:46.592031 27319 cluster.cpp:174] Creating default 'local' authorizer
I0206 05:13:46.593217 27341 master.cpp:414] Master 9ee5ab9a-1898-4ba6-a7f3-0093d03b19f8 (ip-172-16-10-145.ec2.internal) started on 172.16.10.145:36957
I0206 05:13:46.593240 27341 master.cpp:417] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/cBTYhp/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/cBTYhp/master" --zk_session_timeout="10secs"
I0206 05:13:46.593377 27341 master.cpp:466] Master only allowing authenticated frameworks to register
I0206 05:13:46.593385 27341 master.cpp:472] Master only allowing authenticated agents to register
I0206 05:13:46.593391 27341 master.cpp:478] Master only allowing authenticated HTTP frameworks to register
I0206 05:13:46.593397 27341 credentials.hpp:37] Loading credentials for authentication from '/tmp/cBTYhp/credentials'
I0206 05:13:46.593485 27341 master.cpp:522] Using default 'crammd5' authenticator
I0206 05:13:46.593521 27341 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
I0206 05:13:46.593560 27341 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
I0206 05:13:46.593582 27341 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
I0206 05:13:46.593605 27341 master.cpp:603] Authorization enabled
I0206 05:13:46.594100 27340 hierarchical.cpp:176] Initialized hierarchical allocator process
I0206 05:13:46.594298 27341 whitelist_watcher.cpp:77] No whitelist given
I0206 05:13:46.594842 27344 master.cpp:2103] Elected as the leading master!
I0206 05:13:46.594856 27344 master.cpp:1638] Recovering from registrar
I0206 05:13:46.594935 27344 registrar.cpp:339] Recovering registrar
I0206 05:13:46.595073 27344 registrar.cpp:383] Successfully fetched the registry (0B) in 115968ns
I0206 05:13:46.595101 27344 registrar.cpp:487] Applied 1 operations in 6424ns; attempting to update the registry
I0206 05:13:46.595223 27344 registrar.cpp:544] Successfully updated the registry in 105984ns
I0206 05:13:46.595314 27344 registrar.cpp:416] Successfully recovered registrar
I0206 05:13:46.595392 27344 master.cpp:1752] Recovered 0 agents from the registry (176B); allowing 10mins for agents to reregister
I0206 05:13:46.595446 27344 hierarchical.cpp:216] Skipping recovery of hierarchical allocator: nothing to recover
W0206 05:13:46.595887 27319 process.cpp:2829] Attempted to spawn already running process version@172.16.10.145:36957
I0206 05:13:46.597141 27319 sched.cpp:232] Version: 1.8.0
I0206 05:13:46.597421 27345 sched.cpp:336] New master detected at master@172.16.10.145:36957
I0206 05:13:46.597458 27345 sched.cpp:401] Authenticating with master master@172.16.10.145:36957
I0206 05:13:46.597509 27345 sched.cpp:408] Using default CRAM-MD5 authenticatee
I0206 05:13:46.597611 27345 authenticatee.cpp:121] Creating new client SASL connection
I0206 05:13:46.597707 27345 master.cpp:9902] Authenticating scheduler-6e5ae29d-e284-4d9b-bbc2-2df8747428fd@172.16.10.145:36957
I0206 05:13:46.597754 27345 authenticator.cpp:414] Starting authentication session for crammd5-authenticatee(459)@172.16.10.145:36957
I0206 05:13:46.597805 27345 authenticator.cpp:98] Creating new server SASL connection
[jira] [Created] (MESOS-9552) Tasks in unreachable state are not answered during implicit reconciliation
Vinod Kone created MESOS-9552: - Summary: Tasks in unreachable state are not answered during implicit reconciliation Key: MESOS-9552 URL: https://issues.apache.org/jira/browse/MESOS-9552 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Implicit reconciliation only answers for tasks in `pendingTasks` and `tasks` in the `Framework` struct, but it ignores tasks in the `unreachableTasks` map. Even during explicit reconciliation the master doesn't look at the `unreachableTasks` map; it still answers correctly when the agent id is set, because the corresponding agent is in the unreachable list. If the master instead looked into the `unreachableTasks` map, it could answer irrespective of whether the agent id is set. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
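The gap described above can be sketched with a small model (an illustrative Python sketch only; the real logic lives in the C++ master, and the map names here merely mirror the `Framework` struct fields mentioned in the report):

```python
# Hypothetical model of implicit reconciliation that consults only
# `pendingTasks` and `tasks`, ignoring `unreachableTasks` -- the bug
# being reported. Not the actual Mesos master code.

def implicit_reconcile(pending_tasks, tasks, unreachable_tasks):
    """Return the latest known state for every task the master answers for."""
    answers = {}
    for task_id in pending_tasks:
        answers[task_id] = "TASK_STAGING"
    for task_id, state in tasks.items():
        answers[task_id] = state
    # `unreachable_tasks` is never consulted, so those tasks get no answer.
    return answers

answers = implicit_reconcile(
    pending_tasks={"t1"},
    tasks={"t2": "TASK_RUNNING"},
    unreachable_tasks={"t3": "TASK_UNREACHABLE"})
assert "t3" not in answers  # the unreachable task is silently dropped
```

Consulting the unreachable map in the same loop would let the master answer for "t3" without needing the agent id on the reconciliation request.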
[jira] [Created] (MESOS-9547) Removing non-checkpointing framework on the master does not properly clean up all data structures
Vinod Kone created MESOS-9547: - Summary: Removing non-checkpointing framework on the master does not properly clean up all data structures Key: MESOS-9547 URL: https://issues.apache.org/jira/browse/MESOS-9547 Project: Mesos Issue Type: Bug Reporter: Vinod Kone When an agent is disconnected, non-checkpointing frameworks on it are removed via `removeFramework(Slave*, Framework*)`. But it looks like this function only cleans up active tasks and executors in the slave struct. It doesn't clean up `pendingTasks` or `killedTasks`, for example. It also doesn't clean up `operations`, though it's not clear whether that is intentional. There are also a bunch of `*Resources` variables in the struct that probably should be updated. It's also worthwhile auditing `removeFramework(Framework*)` to see if it's leaking any resources as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9546) Operation status is not updated in master when agent is marked as unreachable or gone
Vinod Kone created MESOS-9546: - Summary: Operation status is not updated in master when agent is marked as unreachable or gone Key: MESOS-9546 URL: https://issues.apache.org/jira/browse/MESOS-9546 Project: Mesos Issue Type: Bug Environment: In `Master::markGone` and `Master::_markUnreachable` we call `sendBulkOperationFeedback`, which sends `OPERATION_GONE_BY_OPERATOR` and `OPERATION_UNREACHABLE` to the corresponding frameworks, but the operation states are not changed in the `Master::Framework` struct. See also the related issue MESOS-9545, which applies to unreachable operations. Reporter: Vinod Kone -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state
Vinod Kone created MESOS-9545: - Summary: Marking an unreachable agent as gone should transition the tasks to terminal state Key: MESOS-9545 URL: https://issues.apache.org/jira/browse/MESOS-9545 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone If an unreachable agent is marked as gone, currently the master just marks that agent in the registry but doesn't do anything about its tasks. So the tasks remain in UNREACHABLE state in the master forever, until the master fails over. This is not great UX. We should transition them to a terminal state instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-5916) Improve health checking.
[ https://issues.apache.org/jira/browse/MESOS-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-5916: - Resolution: Fixed Assignee: Alexander Rukletsov Fix Version/s: 1.2.0 Moved unresolved issues to MESOS-7353. > Improve health checking. > > > Key: MESOS-5916 > URL: https://issues.apache.org/jira/browse/MESOS-5916 > Project: Mesos > Issue Type: Epic >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov >Priority: Major > Labels: health-check, mesosphere > Fix For: 1.2.0 > > > This epic aims to provide comprehensive health check support in Mesos > (command, HTTP, TCP) and a unified API. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9509) Benchmark command health checks in default executor
Vinod Kone created MESOS-9509: - Summary: Benchmark command health checks in default executor Key: MESOS-9509 URL: https://issues.apache.org/jira/browse/MESOS-9509 Project: Mesos Issue Type: Task Components: executor Reporter: Vinod Kone TCP/HTTP health checks were extensively scale tested as part of https://mesosphere.com/blog/introducing-mesos-native-health-checks-apache-mesos-part-2/. We should do the same for command checks in the default executor, because it uses a very different mechanism (the agent fork/execs the check command as a nested container) and will have very different scalability characteristics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.
[ https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732473#comment-16732473 ] Vinod Kone commented on MESOS-7622: --- [~kaysoky] Did you fix this recently? > Agent can crash if a HTTP executor tries to retry subscription in running > state. > > > Key: MESOS-7622 > URL: https://issues.apache.org/jira/browse/MESOS-7622 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Affects Versions: 1.2.2 >Reporter: Aaron Wood >Priority: Critical > > It is possible that a running executor might retry its subscribe request. > This can lead to a crash if it previously had any launched tasks. Note that > the executor would still be able to subscribe again when the agent process > restarts and is recovering. > {code} > sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave > --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime > --image_providers=docker --image_provisioner_backend=overlay > --containerizers=mesos --launcher_dir=$(pwd) > --executor_environment_variables='{"LD_LIBRARY_PATH": > "/home/aaron/Code/src/mesos/build/src/.libs"}' > WARNING: Logging before InitGoogleLogging() is written to STDERR > I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by > aaron > I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0 > I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected > I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state > I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice > `mesos_executors.slice` > I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver > I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: > cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret > I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using > /sys/fs/cgroup/freezer as the freezer hierarchy for the 
Linux launcher > E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' > failed; this is the output: > sh: 1: hadoop: not found > I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin > 'hadoop' as it could not be created: Failed to create HDFS client: Failed to > execute 'hadoop version 2>&1'; the command was either not found or exited > with a non-zero exit status: 127 > I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend > 'overlay' > I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on > (1)@127.0.1.1:5051 > I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: > --appc_simple_discovery_uri_prefix="http://" > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" > --docker="docker" --docker_kill_orphans="true" > --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" > --docker_store_dir="/tmp/mesos/store/docker" > --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}" > --executor_registration_timeout="1mins" > --executor_reregistration_timeout="2secs" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" --hostname_lookup="true" > 
--http_command_executor="false" --http_heartbeat_interval="30secs" > --image_providers="docker" --image_provisioner_backend="overlay" > --initialize_driver_logging="true" > --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime" > --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" > --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" > --max_completed_executors_per_framework="150" > --oversubscribed_resources_interval="15secs" --perf_duration="10secs" > --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" > --quiet="false" --recover="reconnect" --recovery_timeout="15mins" > --registration_backoff_factor="1secs"
[jira] [Commented] (MESOS-9495) Test `MasterTest.CreateVolumesV1AuthorizationFailure` is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732456#comment-16732456 ] Vinod Kone commented on MESOS-9495: --- I'm seeing this quite frequently in ASF CI. Looks like this test was written as part of reservation refinement. [~bmahler] can you get this into resource mgmt backlog? > Test `MasterTest.CreateVolumesV1AuthorizationFailure` is flaky. > --- > > Key: MESOS-9495 > URL: https://issues.apache.org/jira/browse/MESOS-9495 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.7.0 >Reporter: Chun-Hung Hsiao >Priority: Major > Labels: allocator, flaky-test > Attachments: > mesos-ec2-centos-7-CMake.Mesos.MasterTest.CreateVolumesV1AuthorizationFailure-badrun.txt > > > {noformat} > I1219 22:45:59.578233 26107 slave.cpp:1884] Will retry registration in > 2.10132ms if necessary > I1219 22:45:59.578615 26107 master.cpp:6125] Received register agent message > from slave(463)@172.16.10.13:35739 (ip-172-16-10-13.ec2.internal) > I1219 22:45:59.578830 26107 master.cpp:3871] Authorizing agent with principal > 'test-principal' > I1219 22:45:59.578975 26107 master.cpp:6183] Authorized registration of agent > at slave(463)@172.16.10.13:35739 (ip-172-16-10-13.ec2.internal) > I1219 22:45:59.579039 26107 master.cpp:6294] Registering agent at > slave(463)@172.16.10.13:35739 (ip-172-16-10-13.ec2.internal) with id > 85292fcc-b698-4377-9faa-f76b0ccd4ee5-S0 > I1219 22:45:59.579540 26107 registrar.cpp:495] Applied 1 operations in > 143852ns; attempting to update the registry > I1219 22:45:59.580102 26109 registrar.cpp:552] Successfully updated the > registry in 510208ns > I1219 22:45:59.580312 26109 master.cpp:6342] Admitted agent > 85292fcc-b698-4377-9faa-f76b0ccd4ee5-S0 at slave(463)@172.16.10.13:35739 > (ip-172-16-10-13.ec2.internal) > I1219 22:45:59.580968 26111 slave.cpp:1884] Will retry registration in > 23.973874ms if necessary > I1219 22:45:59.581447 26111 
slave.cpp:1486] Registered with master > master@172.16.10.13:35739; given agent ID > 85292fcc-b698-4377-9faa-f76b0ccd4ee5-S0 > ... > I1219 22:45:59.580950 26109 master.cpp:6391] Registered agent > 85292fcc-b698-4377-9faa-f76b0ccd4ee5-S0 at slave(463)@172.16.10.13:35739 > (ip-172-16-10-13.ec2.internal) with disk(reservations: > [(STATIC,role1)]):1024; cpus:2; mem:6796; ports:[31000-32000] > I1219 22:45:59.583326 26109 master.cpp:6125] Received register agent message > from slave(463)@172.16.10.13:35739 (ip-172-16-10-13.ec2.internal) > I1219 22:45:59.583524 26109 master.cpp:3871] Authorizing agent with principal > 'test-principal' > ... > W1219 22:45:59.584242 26109 master.cpp:6175] Refusing registration of agent > at slave(463)@172.16.10.13:35739 (ip-172-16-10-13.ec2.internal): > Authorization failure: Authorizer failure > ... > I1219 22:45:59.586944 26113 http.cpp:1185] HTTP POST for /master/api/v1 from > 172.16.10.13:47412 > I1219 22:45:59.587129 26113 http.cpp:682] Processing call CREATE_VOLUMES > /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/master_tests.cpp:9386: > Failure > Mock function called more times than expected - returning default value. > Function call: authorized(@0x7f5066524720 48-byte object 50-7F 00-00 00-00 00-00 00-00 00-00 07-00 00-00 00-00 00-00 10-4E 02-48 50-7F > 00-00 E0-4C 02-48 50-7F 00-00 06-00 00-00 50-7F 00-00>) > Returns: Abandoned > Expected: to be called once >Actual: called twice - over-saturated and active > I1219 22:45:59.587761 26113 master.cpp:3811] Authorizing principal > 'test-principal' to create volumes > '[{"disk":{"persistence":{"id":"id1","principal":"test-principal"},"volume":{"container_path":"path1","mode":"RW"}},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":64.0},"type":"SCALAR"}]' > ... 
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/master_tests.cpp:9398: > Failure > Failed to wait 15secs for response{noformat} > This is because we authorize the retried registration before dropping it. > Full log: > [^mesos-ec2-centos-7-CMake.Mesos.MasterTest.CreateVolumesV1AuthorizationFailure-badrun.txt] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9157) cannot pull docker image from dockerhub
[ https://issues.apache.org/jira/browse/MESOS-9157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732282#comment-16732282 ] Vinod Kone commented on MESOS-9157: --- Is this still an issue? Can we close this as can't repro? > cannot pull docker image from dockerhub > --- > > Key: MESOS-9157 > URL: https://issues.apache.org/jira/browse/MESOS-9157 > Project: Mesos > Issue Type: Bug > Components: fetcher >Affects Versions: 1.6.1 >Reporter: Michael Bowie >Priority: Blocker > Labels: containerization > > I am not able to pull docker images from docker hub through marathon/mesos. > I get one of two errors: > * `Aug 15 10:11:02 michael-b-dcos-agent-1 dockerd[5974]: > time="2018-08-15T10:11:02.770309104-04:00" level=error msg="Not continuing > with pull after error: context canceled"` > * `Failed to run docker -H ... Error: No such object: > mesos-d2f333a8-fef2-48fb-8b99-28c52c327790` > However, I can manually ssh into one of the agents and successfully pull the > image from the command line. > Any pointers in the right direction? > Thank you! > Similar Issues: > https://github.com/mesosphere/marathon/issues/3869 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8470) CHECK failure in DRFSorter due to invalid framework id.
[ https://issues.apache.org/jira/browse/MESOS-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732156#comment-16732156 ] Vinod Kone commented on MESOS-8470: --- [~bbannier] Sounds good. Can you please set the fix version above and also paste the commit messages as a comment? > CHECK failure in DRFSorter due to invalid framework id. > --- > > Key: MESOS-8470 > URL: https://issues.apache.org/jira/browse/MESOS-8470 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Chun-Hung Hsiao >Assignee: Benjamin Bannier >Priority: Major > Labels: allocator, mesosphere, techdebt > > A framework registering with a custom {{FrameworkID}} containing slashes such > as {{/foo/bar}} will trigger a CHECK failure at > https://github.com/apache/mesos/blob/177a2221496a2caa5ad25e71c9982ca3eed02fd4/src/master/allocator/sorter/drf/sorter.cpp#L167: > {noformat} > master.cpp:6618] Updating info for framework /foo/bar > sorter.cpp:167] Check failed: clientPath == current->clientPath() (/foo/bar > vs. foo/bar) > {noformat} > The sorter should be defensive with any {{FrameworkID}} containing slashes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
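The failed CHECK quoted in the issue ("/foo/bar vs. foo/bar") can be illustrated with a toy model: the sorter treats a client path as a '/'-separated hierarchy, so a framework id that itself contains slashes does not round-trip through decomposition and reassembly. This is a hedged sketch of that mismatch, not the actual DRFSorter code:

```python
# Illustrative model only: decompose a client path into tree-node
# segments (empty segments vanish), then rebuild the path from the
# nodes, as a hierarchical sorter would.

def roundtrip_client_path(client_path):
    segments = [s for s in client_path.split("/") if s]
    return "/".join(segments)

# A framework id without slashes round-trips fine:
assert roundtrip_client_path("foo") == "foo"

# A framework id like "/foo/bar" loses its leading slash, which is
# exactly the shape of the failed CHECK:
#   clientPath == current->clientPath() (/foo/bar vs. foo/bar)
assert roundtrip_client_path("/foo/bar") == "foo/bar"
```

This is why the report asks the sorter (or master validation) to be defensive about any `FrameworkID` containing slashes.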
[jira] [Commented] (MESOS-8470) CHECK failure in DRFSorter due to invalid framework id.
[ https://issues.apache.org/jira/browse/MESOS-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732151#comment-16732151 ] Vinod Kone commented on MESOS-8470: --- [~bbannier] ^^. Also, any plans to backport this? > CHECK failure in DRFSorter due to invalid framework id. > --- > > Key: MESOS-8470 > URL: https://issues.apache.org/jira/browse/MESOS-8470 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Chun-Hung Hsiao >Assignee: Benjamin Bannier >Priority: Major > Labels: allocator, mesosphere, techdebt > > A framework registering with a custom {{FrameworkID}} containing slashes such > as {{/foo/bar}} will trigger a CHECK failure at > https://github.com/apache/mesos/blob/177a2221496a2caa5ad25e71c9982ca3eed02fd4/src/master/allocator/sorter/drf/sorter.cpp#L167: > {noformat} > master.cpp:6618] Updating info for framework /foo/bar > sorter.cpp:167] Check failed: clientPath == current->clientPath() (/foo/bar > vs. foo/bar) > {noformat} > The sorter should be defensive with any {{FrameworkID}} containing slashes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9459) Reviewbot is not verifying reviews that need verification
[ https://issues.apache.org/jira/browse/MESOS-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731363#comment-16731363 ] Vinod Kone commented on MESOS-9459: --- I see this error in CI. {noformat}
=
Error response from daemon: conflict: unable to delete e895c0531b9a (cannot be forced) - image is being used by running container cf8595802408
git rev-parse HEAD
git clean -fd
git reset --hard 1e8ebcb8cf1710052c1ae14e342c1277616fa13d
Traceback (most recent call last):
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 341, in <module>
    main()
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 329, in main
    review_requests = api(review_requests_url)
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 119, in api
    return json.loads(urllib.request.urlopen(url, data=data).read())
  File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
Build step 'Execute shell' marked build as failure
Sending e-mails to: bui...@mesos.apache.org
{noformat} > Reviewbot is not verifying reviews that need verification > - > > Key: MESOS-9459 > URL: https://issues.apache.org/jira/browse/MESOS-9459 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.8.0 >Reporter: Vinod Kone >Assignee: Armand Grillet >Priority: Major > Labels: ci, integration > Fix For: 1.8.0 > > > For example this run of ReviewBot > https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23594/console > says that there are no reviews to be verified, which is false because if we > look at ReviewBoard there are a bunch of reviews that have not been commented > on by ReviewBot since a new diff has been posted. 
> {noformat} > 12-05-18_23:41:54 - Running > /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py > 0 review requests need verification > {noformat} > I see that the logic of the verify-reviews.py script was changed as part of > the python3 transition here: https://reviews.apache.org/r/68619/diff/1#27 > which likely caused the bug. > As an aside, it's unfortunate that the python3 update was bundled with logic > changes in this review. cc [~andschwa] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
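For reference, the TypeError in the CI log is the Python 3.5 behavior of json.loads() accepting only str, while urllib.request.urlopen(...).read() returns bytes (json.loads() only started accepting bytes in Python 3.6). A minimal sketch of the fix, not the actual verify-reviews.py patch:

```python
import json

# What urllib.request.urlopen(url).read() returns is bytes:
raw = b'{"review_requests": []}'

# On Python 3.5, json.loads(raw) raises:
#   TypeError: the JSON object must be str, not 'bytes'
# Decoding first works on every Python 3 version:
parsed = json.loads(raw.decode("utf-8"))
assert parsed == {"review_requests": []}
```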
[jira] [Created] (MESOS-9459) Reviewbot is not verifying reviews that need verification
Vinod Kone created MESOS-9459: - Summary: Reviewbot is not verifying reviews that need verification Key: MESOS-9459 URL: https://issues.apache.org/jira/browse/MESOS-9459 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Assignee: Armand Grillet For example this run of ReviewBot https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23594/console says that there are no reviews to be verified, which is false because if we look at ReviewBoard there are a bunch of reviews that have not been commented on by ReviewBot since a new diff has been posted. {noformat} 12-05-18_23:41:54 - Running /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py 0 review requests need verification {noformat} I see that the logic of the verify-reviews.py script was changed as part of the python3 transition here: https://reviews.apache.org/r/68619/diff/1#27 which likely caused the bug. As an aside, it's unfortunate that the python3 update was bundled with logic changes in this review. cc [~andschwa] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9083) Test ReservationEndpointsTest.ReserveAndUnreserveNoAuthentication is flaky.
[ https://issues.apache.org/jira/browse/MESOS-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710575#comment-16710575 ] Vinod Kone commented on MESOS-9083: --- Still happening on ASF CI. {code} [ RUN ] ReservationEndpointsTest.ReserveAndUnreserveNoAuthentication I1205 16:30:33.806411 22505 cluster.cpp:173] Creating default 'local' authorizer I1205 16:30:33.809387 22511 master.cpp:413] Master 80f814ea-0afc-4cec-8891-dfe913ca3075 (9b6ccb5930cd) started on 172.17.0.3:36088 I1205 16:30:33.809422 22511 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1000secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="false" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="false" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/7ITn89/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --roles="role" --root_submissions="true" --version="false" 
--webui_dir="/tmp/SRC/build/mesos-1.8.0/_inst/share/mesos/webui" --work_dir="/tmp/7ITn89/master" --zk_session_timeout="10secs" I1205 16:30:33.809890 22511 master.cpp:467] Master allowing unauthenticated frameworks to register I1205 16:30:33.809912 22511 master.cpp:471] Master only allowing authenticated agents to register I1205 16:30:33.809926 22511 master.cpp:477] Master only allowing authenticated HTTP frameworks to register I1205 16:30:33.809937 22511 credentials.hpp:37] Loading credentials for authentication from '/tmp/7ITn89/credentials' I1205 16:30:33.810329 22511 master.cpp:521] Using default 'crammd5' authenticator I1205 16:30:33.810554 22511 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I1205 16:30:33.810809 22511 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I1205 16:30:33.810992 22511 master.cpp:602] Authorization enabled W1205 16:30:33.811025 22511 master.cpp:665] The '--roles' flag is deprecated. This flag will be removed in the future. See the Mesos 0.27 upgrade notes for more information I1205 16:30:33.811547 22510 whitelist_watcher.cpp:77] No whitelist given I1205 16:30:33.811564 22508 hierarchical.cpp:175] Initialized hierarchical allocator process I1205 16:30:33.814721 22509 master.cpp:2105] Elected as the leading master! 
I1205 16:30:33.814755 22509 master.cpp:1660] Recovering from registrar I1205 16:30:33.814954 22514 registrar.cpp:339] Recovering registrar I1205 16:30:33.815670 22514 registrar.cpp:383] Successfully fetched the registry (0B) in 669952ns I1205 16:30:33.815798 22514 registrar.cpp:487] Applied 1 operations in 39331ns; attempting to update the registry I1205 16:30:33.816577 22508 registrar.cpp:544] Successfully updated the registry in 710912ns I1205 16:30:33.816747 22508 registrar.cpp:416] Successfully recovered registrar I1205 16:30:33.817325 22521 master.cpp:1774] Recovered 0 agents from the registry (135B); allowing 10mins for agents to reregister I1205 16:30:33.817361 22517 hierarchical.cpp:215] Skipping recovery of hierarchical allocator: nothing to recover W1205 16:30:33.823312 22505 process.cpp:2829] Attempted to spawn already running process files@172.17.0.3:36088 I1205 16:30:33.824642 22505 containerizer.cpp:305] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } W1205 16:30:33.825306 22505 backend.cpp:76] Failed to create 'aufs' backend: AufsBackend requires root privileges W1205 16:30:33.825335 22505 backend.cpp:76] Failed to create 'bind' backend: BindBackend requires root privileges I1205 16:30:33.825368 22505 provisioner.cpp:298] Using default backend 'copy' I1205 16:30:33.827760 22505 cluster.cpp:485] Creating default 'local' authorizer I1205 16:30:33.829742 22510 slave.cpp:267] Mesos agent started on (444)@172.17.0.3:36088 I1205 16:30:33.829778 22510 slave.cpp:268] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://;
[jira] [Created] (MESOS-9458) PersistentVolumeEndpointsTest.StaticReservation is flaky
Vinod Kone created MESOS-9458: - Summary: PersistentVolumeEndpointsTest.StaticReservation is flaky Key: MESOS-9458 URL: https://issues.apache.org/jira/browse/MESOS-9458 Project: Mesos Issue Type: Bug Components: allocation Reporter: Vinod Kone Observed this in ASF CI https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Buildbot-Test/310/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1%20MESOS_TEST_AWAIT_TIMEOUT=60secs,OS=ubuntu:16.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)&&(!ubuntu-4)&&(!H21)&&(!H23)&&(!H26)&&(!H27)/consoleText {noformat} [ RUN ] PersistentVolumeEndpointsTest.StaticReservation I1205 11:34:05.896515 22538 cluster.cpp:173] Creating default 'local' authorizer I1205 11:34:05.898870 22542 master.cpp:413] Master 3f2d828b-bff8-461a-98cf-de9163b36657 (488de0351206) started on 172.17.0.2:40803 I1205 11:34:05.898895 22542 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1000secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/qOMyLF/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --roles="role1" --root_submissions="true" --version="false" --webui_dir="/tmp/SRC/build/mesos-1.8.0/_inst/share/mesos/webui" --work_dir="/tmp/qOMyLF/master" --zk_session_timeout="10secs" I1205 11:34:05.899194 22542 master.cpp:465] Master only allowing authenticated frameworks to register I1205 11:34:05.899205 22542 master.cpp:471] Master only allowing authenticated agents to register I1205 11:34:05.899212 22542 master.cpp:477] Master only allowing authenticated HTTP frameworks to register I1205 11:34:05.899219 22542 credentials.hpp:37] Loading credentials for authentication from '/tmp/qOMyLF/credentials' I1205 11:34:05.899503 22542 master.cpp:521] Using default 'crammd5' authenticator I1205 11:34:05.899674 22542 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I1205 11:34:05.899879 22542 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I1205 11:34:05.900029 22542 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I1205 11:34:05.900211 22542 master.cpp:602] Authorization enabled W1205 11:34:05.900238 22542 master.cpp:665] The '--roles' flag is deprecated. This flag will be removed in the future. See the Mesos 0.27 upgrade notes for more information I1205 11:34:05.900684 22539 hierarchical.cpp:175] Initialized hierarchical allocator process I1205 11:34:05.900707 22545 whitelist_watcher.cpp:77] No whitelist given I1205 11:34:05.903553 22540 master.cpp:2105] Elected as the leading master! 
I1205 11:34:05.903587 22540 master.cpp:1660] Recovering from registrar I1205 11:34:05.903753 22551 registrar.cpp:339] Recovering registrar I1205 11:34:05.904373 22551 registrar.cpp:383] Successfully fetched the registry (0B) in 574976ns I1205 11:34:05.904498 22551 registrar.cpp:487] Applied 1 operations in 34823ns; attempting to update the registry I1205 11:34:05.905134 22551 registrar.cpp:544] Successfully updated the registry in 566016ns I1205 11:34:05.905258 22551 registrar.cpp:416] Successfully recovered registrar I1205 11:34:05.905829 22539 master.cpp:1774] Recovered 0 agents from the registry (135B); allowing 10mins for agents to reregister I1205 11:34:05.905889 22540 hierarchical.cpp:215] Skipping recovery of hierarchical allocator: nothing to recover W1205 11:34:05.918561 22538 process.cpp:2829] Attempted to spawn already running process files@172.17.0.2:40803 I1205 11:34:05.919775 22538 containerizer.cpp:305] Using isolation { environment_secret,
[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707765#comment-16707765 ] Vinod Kone edited comment on MESOS-7971 at 12/3/18 8:50 PM: Saw this again. {noformat} 06:14:51 [ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove 06:14:51 I1203 06:14:50.630549 19784 cluster.cpp:173] Creating default 'local' authorizer 06:14:51 I1203 06:14:50.633529 19796 master.cpp:413] Master f1ffe054-ad44-45d4-9f39-84b048e1a359 (c16130e94783) started on 172.17.0.3:44340 06:14:51 I1203 06:14:50.633581 19796 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1000secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/4vMyjy/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --roles="role1" --root_submissions="true" --version="false" 
--webui_dir="/tmp/SRC/build/mesos-1.8.0/_inst/share/mesos/webui" --work_dir="/tmp/4vMyjy/master" --zk_session_timeout="10secs" 06:14:51 I1203 06:14:50.634217 19796 master.cpp:465] Master only allowing authenticated frameworks to register 06:14:51 I1203 06:14:50.634236 19796 master.cpp:471] Master only allowing authenticated agents to register 06:14:51 I1203 06:14:50.634253 19796 master.cpp:477] Master only allowing authenticated HTTP frameworks to register 06:14:51 I1203 06:14:50.634270 19796 credentials.hpp:37] Loading credentials for authentication from '/tmp/4vMyjy/credentials' 06:14:51 I1203 06:14:50.634608 19796 master.cpp:521] Using default 'crammd5' authenticator 06:14:51 I1203 06:14:50.634840 19796 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' 06:14:51 I1203 06:14:50.635052 19796 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' 06:14:51 I1203 06:14:50.635200 19796 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' 06:14:51 I1203 06:14:50.635373 19796 master.cpp:602] Authorization enabled 06:14:51 W1203 06:14:50.635457 19796 master.cpp:665] The '--roles' flag is deprecated. This flag will be removed in the future. See the Mesos 0.27 upgrade notes for more information 06:14:51 I1203 06:14:50.635991 19800 whitelist_watcher.cpp:77] No whitelist given 06:14:51 I1203 06:14:50.636032 19793 hierarchical.cpp:175] Initialized hierarchical allocator process 06:14:51 I1203 06:14:50.638939 19796 master.cpp:2105] Elected as the leading master! 
06:14:51 I1203 06:14:50.638975 19796 master.cpp:1660] Recovering from registrar 06:14:51 I1203 06:14:50.639200 19792 registrar.cpp:339] Recovering registrar 06:14:51 I1203 06:14:50.639927 19792 registrar.cpp:383] Successfully fetched the registry (0B) in 672768ns 06:14:51 I1203 06:14:50.640069 19792 registrar.cpp:487] Applied 1 operations in 48006ns; attempting to update the registry 06:14:51 I1203 06:14:50.640718 19792 registrar.cpp:544] Successfully updated the registry in 582912ns 06:14:51 I1203 06:14:50.640852 19792 registrar.cpp:416] Successfully recovered registrar 06:14:51 I1203 06:14:50.641299 19800 master.cpp:1774] Recovered 0 agents from the registry (135B); allowing 10mins for agents to reregister 06:14:51 I1203 06:14:50.641340 19799 hierarchical.cpp:215] Skipping recovery of hierarchical allocator: nothing to recover 06:14:51 W1203 06:14:50.647153 19784 process.cpp:2829] Attempted to spawn already running process files@172.17.0.3:44340 06:14:51 I1203 06:14:50.648453 19784 containerizer.cpp:305] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } 06:14:51 W1203 06:14:50.649060 19784 backend.cpp:76] Failed to create 'aufs' backend: AufsBackend requires root privileges 06:14:51 W1203 06:14:50.649088 19784 backend.cpp:76] Failed
[jira] [Commented] (MESOS-8983) SlaveRecoveryTest/0.PingTimeoutDuringRecovery flaky
[ https://issues.apache.org/jira/browse/MESOS-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707749#comment-16707749 ] Vinod Kone commented on MESOS-8983: --- This is happening on ASF CI. {code} 15:49:24 3: [ RUN ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery 15:49:24 3: I1203 15:49:24.425719 24686 cluster.cpp:173] Creating default 'local' authorizer 15:49:24 3: I1203 15:49:24.430784 24687 master.cpp:413] Master 620b2018-c90f-4b11-bbe3-8fa1c90f204d (5a45e7f918b2) started on 172.17.0.3:42912 15:49:24 3: I1203 15:49:24.430824 24687 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="1secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/PNxXC7/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="2" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/PNxXC7/master" --zk_session_timeout="10secs" 15:49:24 3: I1203 15:49:24.431120 24687 master.cpp:465] Master only allowing authenticated frameworks to register 15:49:24 3: I1203 15:49:24.431131 24687 master.cpp:471] Master only allowing authenticated agents to register 15:49:24 3: I1203 15:49:24.431139 24687 master.cpp:477] Master only allowing authenticated HTTP frameworks to register 15:49:24 3: I1203 15:49:24.431149 24687 credentials.hpp:37] Loading credentials for authentication from '/tmp/PNxXC7/credentials' 15:49:24 3: I1203 15:49:24.431355 24687 master.cpp:521] Using default 'crammd5' authenticator 15:49:24 3: I1203 15:49:24.431514 24687 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' 15:49:24 3: I1203 15:49:24.431659 24687 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' 15:49:24 3: I1203 15:49:24.431778 24687 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' 15:49:24 3: I1203 15:49:24.431896 24687 master.cpp:602] Authorization enabled 15:49:24 3: I1203 15:49:24.432276 24688 hierarchical.cpp:175] Initialized hierarchical allocator process 15:49:24 3: I1203 15:49:24.432498 24688 whitelist_watcher.cpp:77] No whitelist given 15:49:24 3: I1203 15:49:24.444337 24690 master.cpp:2105] Elected as the leading master! 15:49:24 3: I1203 15:49:24.444366 24690 master.cpp:1660] Recovering from registrar 15:49:24 3: I1203 15:49:24.445142 24687 registrar.cpp:339] Recovering registrar 15:49:24 3: I1203 15:49:24.445669 24687 registrar.cpp:383] Successfully fetched the registry (0B) in 472064ns 15:49:24 3: I1203 15:49:24.445785 24687 registrar.cpp:487] Applied 1 operations in 40517ns; attempting to update the registry 15:49:24 3: I1203 15:49:24.446497 24687 registrar.cpp:544] Successfully updated the registry in 660992ns 15:49:24 3: I1203 15:49:24.453212 24687 registrar.cpp:416] Successfully recovered registrar 15:49:24 3: I1203 15:49:24.453722 24692 master.cpp:1774] Recovered 0 agents from the registry (135B); allowing 10mins for agents to reregister 15:49:24 3: I1203 15:49:24.453984 24692 hierarchical.cpp:215] Skipping recovery of hierarchical allocator: nothing to recover 15:49:24 3: I1203 15:49:24.468710 24686 containerizer.cpp:305] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } 15:49:24 3: W1203 15:49:24.481513 24686 backend.cpp:76] Failed to create 'aufs' backend: AufsBackend requires root privileges 15:49:24 3: W1203 15:49:24.481549 24686 backend.cpp:76] Failed to create 'bind' backend: BindBackend requires root privileges 15:49:24 3: I1203 15:49:24.481591 24686 provisioner.cpp:298] Using default backend 'copy' 15:49:24 3: W1203 15:49:24.498661 24686 process.cpp:2829] Attempted to spawn already running process
[jira] [Assigned] (MESOS-9022) Race condition in task updates could cause missing event in streaming
[ https://issues.apache.org/jira/browse/MESOS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9022: - Assignee: Benno Evers Labels: events foundations mesos mesosphere race-condition streaming (was: events mesos mesosphere race-condition streaming) Component/s: HTTP API Oh great. [~bennoe] can you confirm and resolve? > Race condition in task updates could cause missing event in streaming > - > > Key: MESOS-9022 > URL: https://issues.apache.org/jira/browse/MESOS-9022 > Project: Mesos > Issue Type: Bug > Components: HTTP API, master >Affects Versions: 1.6.0 >Reporter: Evelyn Liu >Assignee: Benno Evers >Priority: Blocker > Labels: events, foundations, mesos, mesosphere, race-condition, > streaming > > Master sends update event of {{TASK_STARTING}} when task's latest state is > already {{TASK_FAILED}}. Then when it handles the update of {{TASK_FAILED}}, > {{sendSubscribersUpdate}} is set to {{false}} because of > [this|https://github.com/apache/mesos/blob/1.6.x/src/master/master.cpp#L10805]. > The subscriber would not receive update event of {{TASK_FAILED}}. > This happened when a task failed very fast. Is there a race condition while > handling task updates? 
> {{*master log:*}} > {code:java} > I0622 13:08:29.189771 84079 master.cpp:8345] Status update TASK_STARTING > (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.189801 84079 master.cpp:8402] Forwarding status update > TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- > I0622 13:08:29.190004 84079 master.cpp:10843] Updating the state of task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_STARTING, > status update state: TASK_STARTING) > I0622 13:08:29.603857 84079 master.cpp:6195] Processing ACKNOWLEDGE call for > status eb091093-d303-4e82-b69f-e2ba1011ba76 for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.615643 84079 master.cpp:8345] Status update TASK_STARTING > (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.615669 84079 master.cpp:8402] Forwarding status update > TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- > I0622 13:08:29.615783 84079 master.cpp:10843] Updating the state of task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status > update state: TASK_STARTING) > I0622 13:08:29.620837 84079 master.cpp:8345] Status update TASK_FAILED > (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task > 
f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.620853 84079 master.cpp:8402] Forwarding status update > TASK_FAILED (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- > I0622 13:08:29.620923 84079 master.cpp:10843] Updating the state of task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status > update state: TASK_FAILED) > I0622 13:08:29.630455 84079 master.cpp:6195] Processing ACKNOWLEDGE call for > status eb091093-d303-4e82-b69f-e2ba1011ba76 for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.673051 84095 master.cpp:6195] Processing ACKNOWLEDGE call for > status ac34f1e9-eaa4-4765-82ac-7398c2e6c835 for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
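The ordering in the master log above can be modeled with a short sketch. This is a hypothetical model, NOT the actual master.cpp logic: assume the master notifies subscribers only when the recorded "latest state" changes, while the event payload carries the status update's own state. A task that fails fast then gets its terminal event silently dropped:

```python
# Hypothetical model of the suppression described in this ticket.
# The agent piggybacks the task's latest known state on every status
# update; subscribers are notified only when that recorded latest
# state changes, but the event payload is the status update's state.
def process_update(recorded, latest_state, update_state, events):
    if latest_state != recorded["latest"]:
        recorded["latest"] = latest_state
        events.append(update_state)  # payload is the update's state

recorded = {"latest": None}
events = []
process_update(recorded, "TASK_STARTING", "TASK_STARTING", events)  # first update
process_update(recorded, "TASK_FAILED", "TASK_STARTING", events)    # retried STARTING, task already failed
process_update(recorded, "TASK_FAILED", "TASK_FAILED", events)      # latest unchanged: suppressed

assert events == ["TASK_STARTING", "TASK_STARTING"]
assert "TASK_FAILED" not in events  # subscribers never see the terminal state
```

Under this model the retried TASK_STARTING "consumes" the transition to TASK_FAILED, which matches the reporter's observation that the subscriber never receives the TASK_FAILED event.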
[jira] [Commented] (MESOS-2554) Slave flaps when using --slave_subsystems that are not used for isolation.
[ https://issues.apache.org/jira/browse/MESOS-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705406#comment-16705406 ] Vinod Kone commented on MESOS-2554: --- [~jieyu] Is this still an issue? cc [~gilbert] > Slave flaps when using --slave_subsystems that are not used for isolation. > -- > > Key: MESOS-2554 > URL: https://issues.apache.org/jira/browse/MESOS-2554 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.21.0, 0.21.1, 0.22.0 >Reporter: Jie Yu >Priority: Critical > > Say one uses --slave_subsystems=cpuacct. > However, if he/she does not use the cpuacct cgroup for isolation, all processes > forked by the slave (e.g., tasks) will be part of the slave cgroup. This is > not expected. Also, more importantly, this will cause the slave to flap on > restart because there are task processes in the slave's cgroup. > We should add a check during slave startup at least! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
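The startup check suggested in the ticket could look roughly like the sketch below. This is a hedged illustration only; the cgroup.procs path, layout, and pid handling are assumptions, not the actual agent code:

```python
import os
import tempfile

# Hedged sketch: before starting, the agent could read its own
# cgroup's cgroup.procs file and refuse to start (or warn) if any
# pids other than its own are members -- i.e. leftover task processes.
def stray_pids(cgroup_procs_path, own_pid):
    with open(cgroup_procs_path) as f:
        pids = {int(line) for line in f if line.strip()}
    return pids - {own_pid}

# Simulate a cgroup.procs file listing the slave (pid 100) and a
# leftover task (pid 200):
with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("100\n200\n")
    path = f.name
assert stray_pids(path, 100) == {200}  # a stray pid should fail the startup check
os.unlink(path)
```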
[jira] [Commented] (MESOS-5989) Libevent SSL Socket downgrade code accesses uninitialized memory / assumes single peek is sufficient.
[ https://issues.apache.org/jira/browse/MESOS-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705403#comment-16705403 ] Vinod Kone commented on MESOS-5989: --- [~bmahler] Is this still an issue? > Libevent SSL Socket downgrade code accesses uninitialized memory / assumes > single peek is sufficient. > - > > Key: MESOS-5989 > URL: https://issues.apache.org/jira/browse/MESOS-5989 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Mahler >Priority: Critical > > See the XXX comment below. > https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L912-L920 > {code} > void LibeventSSLSocketImpl::peek_callback( > evutil_socket_t fd, > short what, > void* arg) > { > CHECK(__in_event_loop__); > CHECK(what & EV_READ); > char data[6]; > // Try to peek the first 6 bytes of the message. > ssize_t size = ::recv(fd, data, 6, MSG_PEEK); > // Based on the function 'ssl23_get_client_hello' in openssl, we > // test whether to dispatch to the SSL or non-SSL based accept based > // on the following rules: > // 1. If there are fewer than 3 bytes: non-SSL. > // 2. If the 1st bit of the 1st byte is set AND the 3rd byte is > // equal to SSL2_MT_CLIENT_HELLO: SSL. > // 3. If the 1st byte is equal to SSL3_RT_HANDSHAKE AND the 2nd > // byte is equal to SSL3_VERSION_MAJOR and the 6th byte is > // equal to SSL3_MT_CLIENT_HELLO: SSL. > // 4. Otherwise: non-SSL. > // For an ascii based protocol to falsely get dispatched to SSL it > // needs to: > // 1. Start with an invalid ascii character (0x80). > // 2. OR have the first 2 characters be a SYN followed by ETX, and > // then the 6th character be SOH. > // These conditions clearly do not constitute valid HTTP requests, > // and are unlikely to collide with other existing protocols. > bool ssl = false; // Default to rule 4. > // XXX: data[0] data[1] are guaranteed to be set, but not data[>=2] > if (size < 2) { // Rule 1. 
> ssl = false; > } else if ((data[0] & 0x80) && data[2] == SSL2_MT_CLIENT_HELLO) { // Rule 2. > ssl = true; > } else if (data[0] == SSL3_RT_HANDSHAKE && > data[1] == SSL3_VERSION_MAJOR && > data[5] == SSL3_MT_CLIENT_HELLO) { // Rule 3. > ssl = true; > } > AcceptRequest* request = reinterpret_cast<AcceptRequest*>(arg); > // We call 'event_free()' here because it ensures the event is made > // non-pending and inactive before it gets deallocated. > event_free(request->peek_event); > request->peek_event = nullptr; > if (ssl) { > accept_SSL_callback(request); > } else { > // Downgrade to a non-SSL socket. > Try<Socket> create = Socket::create(Socket::POLL, fd); > if (create.isError()) { > request->promise.fail(create.error()); > } else { > request->promise.set(create.get()); > } > delete request; > } > } > {code} > This code accesses potentially uninitialized memory. Secondly, the code > assumes that a single peek is sufficient for determining whether the incoming > data is an SSL connection. There seems to be an assumption that in the SSL > path, we are guaranteed to peek a sufficient number of bytes when the socket > is ready to read. It's not clear what is providing this guarantee, or if this > is incorrect. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
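The four detection rules quoted in MESOS-5989 above can be restated with explicit bounds checks. This is a hedged Python sketch, not the libprocess implementation: it applies the documented "fewer than 3 bytes" rule and guards every index, whereas the quoted C++ reads `data[2]` and `data[5]` after only checking `size < 2`. The constant values mirror OpenSSL's headers.

```python
# Illustrative re-statement of the SSL-vs-plaintext peek heuristic.
# Constant values taken from OpenSSL's ssl2.h/ssl3.h definitions.
SSL2_MT_CLIENT_HELLO = 0x01
SSL3_RT_HANDSHAKE = 0x16
SSL3_VERSION_MAJOR = 0x03
SSL3_MT_CLIENT_HELLO = 0x01

def looks_like_ssl(data: bytes) -> bool:
    """Classify the first peeked bytes of a connection. Every index is
    guarded by a length check so a short peek never reads past the
    bytes actually received."""
    if len(data) < 3:                                          # Rule 1.
        return False
    if (data[0] & 0x80) and data[2] == SSL2_MT_CLIENT_HELLO:   # Rule 2.
        return True
    if (len(data) >= 6
            and data[0] == SSL3_RT_HANDSHAKE
            and data[1] == SSL3_VERSION_MAJOR
            and data[5] == SSL3_MT_CLIENT_HELLO):              # Rule 3.
        return True
    return False                                               # Rule 4.
```

Note this still shares the second problem raised in the ticket: a single short peek (e.g. only 4 bytes of an SSLv3 ClientHello) is classified as non-SSL rather than retried.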
[jira] [Commented] (MESOS-6632) ContainerLogger might leak FD if container launch fails.
[ https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705400#comment-16705400 ] Vinod Kone commented on MESOS-6632: --- [~kaysoky], [~gilbert]: Is this still an issue? > ContainerLogger might leak FD if container launch fails. > > > Key: MESOS-6632 > URL: https://issues.apache.org/jira/browse/MESOS-6632 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.28.2, 1.0.1, 1.1.0 >Reporter: Jie Yu >Priority: Critical > > In MesosContainerizer, if logger->prepare() succeeds but its continuation > fails, the pipe fd allocated in the logger will get leaked. We cannot add a > destructor in ContainerLogger::SubprocessInfo to close the fd because > subprocess might close the OWNED fd. > A FD abstraction might help here. In other words, subprocess will no longer > be responsible for closing external FDs, instead, the FD destructor will be > doing so. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-5396) After failover, master does not remove agents with same UPID.
[ https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-5396: - Assignee: (was: Neil Conway) > After failover, master does not remove agents with same UPID. > - > > Key: MESOS-5396 > URL: https://issues.apache.org/jira/browse/MESOS-5396 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Neil Conway >Priority: Critical > Labels: mesosphere > > Scenario: > * master fails over > * an agent host is restarted; the agent attempts to *register* (not > reregister) with Mesos using the same UPID as the previous agent instance; > this means it will get a new agent ID > * framework isn't notified about the status of the tasks on the *old* agentID > until the {{agent_reregister_timeout}} expires (10 mins) > This isn't necessarily wrong but it is suboptimal: when the agent attempts to > register with the same UPID that was used by the previous agent instance, we > know that a *reregistration* attempt for the old pair will > never be seen. Hence we can declare the old agentID to be gone-forever and > notify frameworks appropriately, without waiting for the full > {{agent_reregister_timeout}} to expire. > Note that we already implement the proposed behavior for the case when the > master does *not* failover > (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9022) Race condition in task updates could cause missing event in streaming
[ https://issues.apache.org/jira/browse/MESOS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705368#comment-16705368 ] Vinod Kone commented on MESOS-9022: --- cc [~greggomann] > Race condition in task updates could cause missing event in streaming > - > > Key: MESOS-9022 > URL: https://issues.apache.org/jira/browse/MESOS-9022 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.6.0 >Reporter: Evelyn Liu >Priority: Blocker > Labels: events, mesos, mesosphere, race-condition, streaming > > Master sends an update event of {{TASK_STARTING}} when the task's latest state is > already {{TASK_FAILED}}. Then when it handles the update of {{TASK_FAILED}}, > {{sendSubscribersUpdate}} is set to {{false}} because of > [this|https://github.com/apache/mesos/blob/1.6.x/src/master/master.cpp#L10805]. > The subscriber would not receive the update event of {{TASK_FAILED}}. > This happened when a task failed very fast. Is there a race condition while > handling task updates? 
> {{*master log:*}} > {code:java} > I0622 13:08:29.189771 84079 master.cpp:8345] Status update TASK_STARTING > (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.189801 84079 master.cpp:8402] Forwarding status update > TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- > I0622 13:08:29.190004 84079 master.cpp:10843] Updating the state of task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_STARTING, > status update state: TASK_STARTING) > I0622 13:08:29.603857 84079 master.cpp:6195] Processing ACKNOWLEDGE call for > status eb091093-d303-4e82-b69f-e2ba1011ba76 for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.615643 84079 master.cpp:8345] Status update TASK_STARTING > (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.615669 84079 master.cpp:8402] Forwarding status update > TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- > I0622 13:08:29.615783 84079 master.cpp:10843] Updating the state of task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status > update state: TASK_STARTING) > I0622 13:08:29.620837 84079 master.cpp:8345] Status update TASK_FAILED > (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task > 
f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.620853 84079 master.cpp:8402] Forwarding status update > TASK_FAILED (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- > I0622 13:08:29.620923 84079 master.cpp:10843] Updating the state of task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status > update state: TASK_FAILED) > I0622 13:08:29.630455 84079 master.cpp:6195] Processing ACKNOWLEDGE call for > status eb091093-d303-4e82-b69f-e2ba1011ba76 for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587 > I0622 13:08:29.673051 84095 master.cpp:6195] Processing ACKNOWLEDGE call for > status ac34f1e9-eaa4-4765-82ac-7398c2e6c835 for task > f839055c-7a40-4e6c-9f53-22030f388c8c of framework > 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent > d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
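The suppression visible in the MESOS-9022 log above follows from the master tracking two states per task: a "latest state" advanced by every agent message, and a "status update state" advanced by forwarded updates. A deliberately simplified Python model (hypothetical names, not the actual `Master::updateTask` code) shows how the retransmitted TASK_STARTING, which carries the agent's newer latest state, leaves nothing for the real TASK_FAILED update to change:

```python
class Task:
    """Minimal stand-in for the master's per-task bookkeeping."""
    def __init__(self):
        self.latest_state = None           # advanced by every agent message
        self.status_update_state = None    # advanced by forwarded updates

def update_task(task, latest_state, status_update_state):
    """Simplified model: subscribers are only notified when the tracked
    latest state actually changes, so an update whose state was already
    recorded produces no streaming event."""
    send_subscribers_update = task.latest_state != latest_state
    task.latest_state = latest_state
    task.status_update_state = status_update_state
    return send_subscribers_update

t = Task()
first = update_task(t, "TASK_STARTING", "TASK_STARTING")   # streamed
# Retransmitted TASK_STARTING arrives carrying the agent's newer latest
# state (the log shows "latest state: TASK_FAILED, status update state:
# TASK_STARTING"):
second = update_task(t, "TASK_FAILED", "TASK_STARTING")    # streamed
# The real TASK_FAILED update now finds the latest state unchanged:
third = update_task(t, "TASK_FAILED", "TASK_FAILED")       # suppressed
```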
[jira] [Commented] (MESOS-9157) cannot pull docker image from dockerhub
[ https://issues.apache.org/jira/browse/MESOS-9157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705365#comment-16705365 ] Vinod Kone commented on MESOS-9157: --- cc [~gilbert] [~abudnik] [~qianzhang] > cannot pull docker image from dockerhub > --- > > Key: MESOS-9157 > URL: https://issues.apache.org/jira/browse/MESOS-9157 > Project: Mesos > Issue Type: Bug > Components: fetcher >Affects Versions: 1.6.1 >Reporter: Michael Bowie >Priority: Blocker > Labels: containerization > > I am not able to pull docker images from docker hub through marathon/mesos. > I get one of two errors: > * `Aug 15 10:11:02 michael-b-dcos-agent-1 dockerd[5974]: > time="2018-08-15T10:11:02.770309104-04:00" level=error msg="Not continuing > with pull after error: context canceled"` > * `Failed to run docker -H ... Error: No such object: > mesos-d2f333a8-fef2-48fb-8b99-28c52c327790` > However, I can manually ssh into one of the agents and successfully pull the > image from the command line. > Any pointers in the right direction? > Thank you! > Similar Issues: > https://github.com/mesosphere/marathon/issues/3869 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9247) MasterAPITest.EventAuthorizationFiltering is flaky
[ https://issues.apache.org/jira/browse/MESOS-9247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9247: - Assignee: Till Toenshoff > MasterAPITest.EventAuthorizationFiltering is flaky > -- > > Key: MESOS-9247 > URL: https://issues.apache.org/jira/browse/MESOS-9247 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.7.0 >Reporter: Greg Mann >Assignee: Till Toenshoff >Priority: Major > Labels: flaky, flaky-test, integration, mesosphere > Attachments: MasterAPITest.EventAuthorizationFiltering.txt > > > Saw this failure on a CentOS 6 SSL build in our internal CI. Build log > attached. For some reason, it seems that the initial {{TASK_ADDED}} event is > missed: > {code} > ../../src/tests/api_tests.cpp:2922 > Expected: v1::master::Event::TASK_ADDED > Which is: TASK_ADDED > To be equal to: event->get().type() > Which is: TASK_UPDATED > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
[ https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700904#comment-16700904 ] Vinod Kone commented on MESOS-7564: --- {quote} 1) The SUBSCRIBE Call is one persistent connection where the executor sends one Call, and receives a stream of Events. There is currently no Executor->Agent traffic except the first request. This connection could probably use heartbeating in both directions. Agent->Executor heartbeats may come in the form of Events. Executor->Agent heartbeats will need to be something else (like the heartbeating suggested here: [https://reviews.apache.org/r/69183/] ). {quote} Do we really need heartbeats in both directions given it is a single connection? I would imagine agent -> executor heartbeat events should be enough like we did with v1 scheduler API? > Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication. > - > > Key: MESOS-7564 > URL: https://issues.apache.org/jira/browse/MESOS-7564 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Reporter: Anand Mazumdar >Assignee: Joseph Wu >Priority: Critical > Labels: api, mesosphere, v1_api > > Currently, we do not have heartbeats for executor <-> agent communication. > This is especially problematic in scenarios when IPFilters are enabled since > the default conntrack keep alive timeout is 5 days. When that timeout > elapses, the executor doesn't get notified via a socket disconnection when > the agent process restarts. The executor would then get killed if it doesn't > re-register when the agent recovery process is completed. > Enabling application level heartbeats or TCP KeepAlive's can be a possible > way for fixing this issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
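The agent -> executor heartbeats discussed in MESOS-7564 above exist to defeat the 5-day conntrack timeout: if the executor sees no traffic on the subscription stream within the advertised interval, it can assume the connection is dead even though the socket never reported a disconnect. A hedged executor-side sketch (names are illustrative, not the Mesos v1 executor API):

```python
import time

class HeartbeatMonitor:
    """Tracks the last event seen on the subscription stream. The
    connection is presumed dead once nothing (heartbeat or any other
    event) arrives within interval * grace_factor seconds."""

    def __init__(self, interval_secs, grace_factor=2.0):
        self.deadline = interval_secs * grace_factor
        self.last_event = time.monotonic()

    def on_event(self):
        # Any event on the stream counts as evidence of liveness.
        self.last_event = time.monotonic()

    def is_alive(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_event) <= self.deadline
```

This also illustrates Vinod's point: a single agent -> executor heartbeat direction suffices on one persistent connection, because the receiver only needs *some* periodic traffic to reset its deadline.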
[jira] [Commented] (MESOS-8930) THREADSAFE_SnapshotTimeout is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699577#comment-16699577 ] Vinod Kone commented on MESOS-8930: --- Still seeing this in CI. [~bmahler] Do we have any abstractions/techniques in place that allow us to ensure the http request is enqueued in a more robust manner? Sounds like the 10ms is sometimes not enough in ASF CI. Kinda unrelated bug here is that the code does a "response->body" on a (possibly pending) future causing it to hang forever. This will block the whole test suite! {code} AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response); // Parse the response. Try<JSON::Object> responseJSON = JSON::parse<JSON::Object>(response->body); ASSERT_SOME(responseJSON); {code} I think we should at least change the `AWAIT_EXPECT_*` above to `AWAIT_ASSERT` so that the rest of the test code is skipped. cc [~greggomann] [~bmahler] > THREADSAFE_SnapshotTimeout is flaky. > > > Key: MESOS-8930 > URL: https://issues.apache.org/jira/browse/MESOS-8930 > Project: Mesos > Issue Type: Bug > Components: test > Environment: Ubuntu 16.04 >Reporter: Alexander Rukletsov >Assignee: Benjamin Mahler >Priority: Major > Labels: flaky-test, mesosphere > > Observed on ASF CI, might be related to a recent test change > https://reviews.apache.org/r/66831/ > {noformat} > 18:23:31 2: [ RUN ] MetricsTest.THREADSAFE_SnapshotTimeout > 18:23:31 2: I0516 18:23:31.747611 16246 process.cpp:3583] Handling HTTP event > for process 'metrics' with path: '/metrics/snapshot' > 18:23:31 2: I0516 18:23:31.796871 16251 process.cpp:3583] Handling HTTP event > for process 'metrics' with path: '/metrics/snapshot' > 18:23:46 2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: > Failure > 18:23:46 2: Failed to wait 15secs for response > 22:57:13 Build timed out (after 300 minutes). Marking the build as failed. > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
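The hang described in the MESOS-8930 comment above has a generic shape: any blocking read of a still-pending future stalls the whole suite. A sketch of the failure-fast alternative using Python's `concurrent.futures` (illustrative only; libprocess's `AWAIT_*` macros are C++):

```python
import concurrent.futures
import threading

def await_result(future, timeout_secs):
    """Resolve the future within the timeout or fail fast, instead of
    blocking forever the way reading `response->body` off a pending
    future does in the quoted test."""
    try:
        return True, future.result(timeout=timeout_secs)
    except concurrent.futures.TimeoutError:
        return False, None

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

# A future that resolves promptly is returned within the timeout.
ready_ok, ready_value = await_result(pool.submit(lambda: "snapshot"), 1.0)

# A future that never resolves is reported as a failure, not a hang.
gate = threading.Event()                       # keeps the future pending
pending_ok, _ = await_result(pool.submit(gate.wait), 0.1)
gate.set()                                     # unblock so shutdown can join
pool.shutdown()
```

The same reasoning motivates Vinod's suggestion to use an aborting assertion: once the await fails, skipping the rest of the test body is strictly better than dereferencing an unresolved result.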
[jira] [Commented] (MESOS-9287) DockerFetcherPluginTest.INTERNET_CURL_FetchImage is flaky
[ https://issues.apache.org/jira/browse/MESOS-9287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689639#comment-16689639 ] Vinod Kone commented on MESOS-9287: --- Observed this via Windows Reviewbot. {noformat} [ RUN ] DockerFetcherPluginTest.INTERNET_CURL_FetchImage 'hadoop' is not recognized as an internal or external command, operable program or batch file. d:\dcos\mesos\mesos\src\tests\uri_fetcher_tests.cpp(358): error: (fetcher.get()->fetch(uri, dir)).failure(): Collect failed: Unexpected 'curl' output: d:\dcos\mesos\mesos\3rdparty\stout\include\stout\tests\utils.hpp(46): error: TearDownMixin(): Failed to rmdir 'C:\Users\jenkins\AppData\Local\Temp\XpsPZ0': The process cannot access the file because it is being used by another process. [ FAILED ] DockerFetcherPluginTest.INTERNET_CURL_FetchImage (7460 ms) {noformat} > DockerFetcherPluginTest.INTERNET_CURL_FetchImage is flaky > - > > Key: MESOS-9287 > URL: https://issues.apache.org/jira/browse/MESOS-9287 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.7.0 > Environment: Windows tests on Azure >Reporter: Andrew Schwartzmeyer >Priority: Minor > Labels: ci, flaky, integration > > The test DockerFetcherPluginTest.INTERNET_CURL_FetchImage is flaky on the CI, > probably due to the 60 second timeout. A 10 minute timeout would probably be > sufficient for slow Azure networks and big Docker images. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9332) Debug container should run as the same user of its parent container by default
[ https://issues.apache.org/jira/browse/MESOS-9332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9332: - Assignee: Qian Zhang > Debug container should run as the same user of its parent container by default > -- > > Key: MESOS-9332 > URL: https://issues.apache.org/jira/browse/MESOS-9332 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Labels: containerizer, mesosphere > > Currently when launching a debug container, by default the Mesos agent will use > the executor's user as the debug container's user if the `user` field is not > specified in the debug container's `commandInfo` (see [this > code|https://github.com/apache/mesos/blob/1.7.0/src/slave/http.cpp#L2559] for > details). This is OK for a command task since the command executor's user > is the same as the command task's user (see [this > code|https://github.com/apache/mesos/blob/1.7.0/src/slave/slave.cpp#L6068:L6070] > for details), so the debug container will be launched as the same user as > the task. But for a task in a task group, the default executor's user is > the same as the framework user (see [this > code|https://github.com/apache/mesos/blob/1.7.0/src/slave/slave.cpp#L8959] > for details), so in this case the debug container will be launched as the > same user as the framework rather than the task. So in a scenario where the > framework user is a normal user but the task user is root, the debug > container will be launched as the normal user, which is not desired; the > expectation is that the debug container should run as the same user as the > container it debugs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
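The default-user behavior MESOS-9332 asks for can be summarized as a resolution order. This is a hypothetical helper, not the agent's actual code: an explicit user on the debug container wins; otherwise the parent container's user is inherited (the proposed fix); the executor's user is only a last resort (the current, undesired default):

```python
def resolve_debug_user(command_user, parent_container_user, executor_user):
    """Hypothetical resolution order proposed in MESOS-9332: prefer the
    explicit `commandInfo.user`, then the user of the parent container
    being debugged, and only then the executor's (framework's) user."""
    if command_user is not None:
        return command_user
    if parent_container_user is not None:
        return parent_container_user
    return executor_user
```

Under this order, the ticket's scenario (framework user is a normal user, task runs as root) yields a root debug container, matching the container it debugs.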
[jira] [Assigned] (MESOS-9356) Make agent atomically checkpoint operations and resources
[ https://issues.apache.org/jira/browse/MESOS-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9356: - Assignee: Gastón Kleiman > Make agent atomically checkpoint operations and resources > - > > Key: MESOS-9356 > URL: https://issues.apache.org/jira/browse/MESOS-9356 > Project: Mesos > Issue Type: Task >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman >Priority: Major > Labels: agent, mesosphere, operation-feedback > > See > https://docs.google.com/document/d/1HxMBCfzU9OZ-5CxmPG3TG9FJjZ_-xDUteLz64GhnBl0/edit > for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns
[ https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9334: - Shepherd: Gilbert Song Assignee: Qian Zhang Sprint: Mesosphere RI-6 Sprint 2018-31 Story Points: 5 > Container stuck at ISOLATING state due to libevent poll never returns > - > > Key: MESOS-9334 > URL: https://issues.apache.org/jira/browse/MESOS-9334 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Critical > > We found a UCR container may be stuck in the `ISOLATING` state: > {code:java} > 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] > Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 > from PREPARING to ISOLATING > 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted > '/proc/5244/ns/net' to > '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' > for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 > 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459] > Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state > {code} > In the above logs, the state of container > `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` was transitioned to `ISOLATING` at > 09:13:23, but did not transition to any other state until it was destroyed > due to the executor registration timeout (10 mins). And the destroy can never > complete since it needs to wait for the container to finish isolating. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9253) Reviewbot is failing when posting a review
[ https://issues.apache.org/jira/browse/MESOS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641991#comment-16641991 ] Vinod Kone commented on MESOS-9253: --- [~ArmandGrillet] Can you please send a review with the above fix? > Reviewbot is failing when posting a review > -- > > Key: MESOS-9253 > URL: https://issues.apache.org/jira/browse/MESOS-9253 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Priority: Critical > > Observed this in CI. > [https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23425/console] > > {code} > 09-23-18_02:12:05 - Running > /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py > Checking if review 68640 needs verification > Skipping blocking review 68640 > Checking if review 68641 needs verification > Patch never verified, needs verification > Dependent review: [https://reviews.apache.org/api/review-requests/68640/] > Verifying review 68641 > Dependent review: [https://reviews.apache.org/api/review-requests/68640/] > Applying review 68640 > python support/apply-reviews.py -n -r 68640 > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 156, in verify_review > apply_reviews(review_request, reviews, handler) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 120, in apply_reviews > reviews, handler) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 124, in apply_reviews > apply_review(review_request["id"]) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 97, in apply_review > shell("python support/apply-reviews.py -n -r %s" % review_id) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 91, in shell > command, stderr=subprocess.STDOUT, shell=True) > File "/usr/lib/python3.5/subprocess.py", 
line 626, in check_output > **kwargs).stdout > File "/usr/lib/python3.5/subprocess.py", line 708, in run > output=stdout, stderr=stderr) > subprocess.CalledProcessError: Command 'python support/apply-reviews.py -n -r > 68640' returned non-zero exit status 1 > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 292, in > main() > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 286, in main > verify_review(review_request, handler) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 217, in verify_review > output += "\nFull log: " > TypeError: can't concat bytes to str > Build step 'Execute shell' marked build as failure > Sending e-mails to: bui...@mesos.apache.org > Finished: FAILURE > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
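The `TypeError: can't concat bytes to str` in the MESOS-9253 traceback above comes from `subprocess.check_output` returning `bytes` under Python 3 while the script builds a `str` log message. A hedged sketch of the fix (the real patch lives in `support/verify-reviews.py`; this only reproduces the failure mode and one way out, decoding once at the subprocess boundary):

```python
import subprocess

def shell(command):
    """Run a shell command and return its combined output as str.
    check_output returns bytes on Python 3; decoding here lets callers
    keep concatenating with str, which is exactly the operation that
    raised the TypeError in verify-reviews.py."""
    out = subprocess.check_output(
        command, stderr=subprocess.STDOUT, shell=True)
    return out.decode("utf-8", errors="replace")

output = shell("echo applying review 68640")
output += "\nFull log: (link elided)"   # str += str: no TypeError
```

Decoding at the boundary (rather than sprinkling `.decode()` at every use site) keeps the rest of the script type-consistent; passing `universal_newlines=True` (or `text=True` on newer Pythons) to `check_output` is an equivalent alternative.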
[jira] [Commented] (MESOS-9253) Reviewbot is failing when posting a review
[ https://issues.apache.org/jira/browse/MESOS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641939#comment-16641939 ] Vinod Kone commented on MESOS-9253: --- Reviewbot is still failing {noformat} 10-08-18_14:40:29 - Running /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py Checking if review 68929 needs verification Skipping blocking review 68929 Checking if review 68941 needs verification Patch never verified, needs verification Dependent review: [https://reviews.apache.org/api/review-requests/68929/] Verifying review 68941 Dependent review: [https://reviews.apache.org/api/review-requests/68929/] Applying review 68929 python support/apply-reviews.py -n -r 68929 Posting review: Bad patch! Reviews applied: [68941, 68929] Failed command: python support/apply-reviews.py -n -r 68929 Error: Traceback (most recent call last): File "support/apply-reviews.py", line 35, in import urllib.request ImportError: No module named request Full log: [https://builds.apache.org/job/Mesos-Reviewbot/23458/console] 1 review requests need verification {noformat} Maybe it's a python 2 vs 3 issue. [~ArmandGrillet] [~andschwa] Can you take a look? > Reviewbot is failing when posting a review > -- > > Key: MESOS-9253 > URL: https://issues.apache.org/jira/browse/MESOS-9253 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Priority: Critical > > Observed this in CI. 
> [https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23425/console] > > {code} > 09-23-18_02:12:05 - Running > /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py > Checking if review 68640 needs verification > Skipping blocking review 68640 > Checking if review 68641 needs verification > Patch never verified, needs verification > Dependent review: [https://reviews.apache.org/api/review-requests/68640/] > Verifying review 68641 > Dependent review: [https://reviews.apache.org/api/review-requests/68640/] > Applying review 68640 > python support/apply-reviews.py -n -r 68640 > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 156, in verify_review > apply_reviews(review_request, reviews, handler) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 120, in apply_reviews > reviews, handler) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 124, in apply_reviews > apply_review(review_request["id"]) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 97, in apply_review > shell("python support/apply-reviews.py -n -r %s" % review_id) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 91, in shell > command, stderr=subprocess.STDOUT, shell=True) > File "/usr/lib/python3.5/subprocess.py", line 626, in check_output > **kwargs).stdout > File "/usr/lib/python3.5/subprocess.py", line 708, in run > output=stdout, stderr=stderr) > subprocess.CalledProcessError: Command 'python support/apply-reviews.py -n -r > 68640' returned non-zero exit status 1 > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 292, in > main() > 
File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 286, in main > verify_review(review_request, handler) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 217, in verify_review > output += "\nFull log: " > TypeError: can't concat bytes to str > Build step 'Execute shell' marked build as failure > Sending e-mails to: bui...@mesos.apache.org > Finished: FAILURE > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
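The second MESOS-9253 failure quoted above (`ImportError: No module named request`) is the mirror image of the bytes/str bug: `apply-reviews.py` was ported to the Python 3 `urllib.request` layout but executed by a Python 2 interpreter. While both interpreters are in play, a guarded import is a common workaround (illustrative; the durable fix is to run the support scripts with a Python 3 interpreter consistently):

```python
# Resolve the same callable name under either interpreter.
try:
    from urllib.request import urlopen   # Python 3 module layout
except ImportError:
    from urllib2 import urlopen          # Python 2 fallback
```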
[jira] [Assigned] (MESOS-9295) Nested container launch could fail if the agent is upgraded with new cgroup subsystems.
[ https://issues.apache.org/jira/browse/MESOS-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9295: - Assignee: Gilbert Song > Nested container launch could fail if the agent is upgraded with new cgroup > subsystems. > --- > > Key: MESOS-9295 > URL: https://issues.apache.org/jira/browse/MESOS-9295 > Project: Mesos > Issue Type: Bug >Reporter: Gilbert Song >Assignee: Gilbert Song >Priority: Major > > Nested container launch could fail if the agent is upgraded with new cgroup > subsystems, because the new cgroup subsystems do not exist in the parent > container's cgroup hierarchy. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9293) OperationStatus messages sent to framework should include both agent ID and resource provider ID
[ https://issues.apache.org/jira/browse/MESOS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9293: - Assignee: Gastón Kleiman > OperationStatus messages sent to framework should include both agent ID and > resource provider ID > > > Key: MESOS-9293 > URL: https://issues.apache.org/jira/browse/MESOS-9293 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.7.0 >Reporter: James DeFelice >Assignee: Gastón Kleiman >Priority: Major > Labels: mesosphere, operation-feedback > > Normally, frameworks are expected to checkpoint agent ID and resource > provider ID before accepting an offer with an OfferOperation. From this > expectation comes the requirement in the v1 scheduler API that a framework > must provide the agent ID and resource provider ID when acknowledging an > offer operation status update. However, this expectation breaks down: > 1. the framework might lose its checkpointed data; it no longer remembers the > agent ID or the resource provider ID > 2. even if the framework checkpoints data, it could be sent a stale update: > maybe the original ACK it sent to Mesos was lost, and it needs to re-ACK. If > a framework deleted its checkpointed data after sending the ACK (that's > dropped) then upon replay of the status update it no longer has the agent ID > or resource provider ID for the operation. > An easy remedy would be to add the agent ID and resource provider ID to the > OperationStatus message received by the scheduler so that a framework can > build a proper ACK for the update, even if it doesn't have access to its > previously checkpointed information. > I'm filing this as a BUG because there's no way to reliably use the offer > operation status API until this has been fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
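The remedy MESOS-9293 proposes, carrying agent ID and resource provider ID inside the OperationStatus itself, lets a framework that lost its checkpointed data still acknowledge a (possibly stale) update. A hypothetical sketch of an ACK built purely from the update's own fields (field and call names are illustrative simplifications of the v1 scheduler API):

```python
def build_ack(status):
    """Construct an operation-status acknowledgement from the fields
    carried in the update alone. If the update omits the IDs (the
    pre-MESOS-9293 situation), a stateless framework cannot ACK at all."""
    required = ("agent_id", "resource_provider_id", "operation_id", "uuid")
    missing = [f for f in required if f not in status]
    if missing:
        raise ValueError("cannot ACK, missing: " + ", ".join(missing))
    return {
        "type": "ACKNOWLEDGE_OPERATION_STATUS",
        "agent_id": status["agent_id"],
        "resource_provider_id": status["resource_provider_id"],
        "operation_id": status["operation_id"],
        "uuid": status["uuid"],
    }
```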
[jira] [Commented] (MESOS-9253) Reviewbot is failing when posting a review
[ https://issues.apache.org/jira/browse/MESOS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626200#comment-16626200 ] Vinod Kone commented on MESOS-9253: --- cc [~andschwa] [~bbannier] > Reviewbot is failing when posting a review > -- > > Key: MESOS-9253 > URL: https://issues.apache.org/jira/browse/MESOS-9253 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Priority: Critical > > Observed this in CI. > [https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23425/console] > > {code} > 09-23-18_02:12:05 - Running > /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py > Checking if review 68640 needs verification > Skipping blocking review 68640 > Checking if review 68641 needs verification > Patch never verified, needs verification > Dependent review: [https://reviews.apache.org/api/review-requests/68640/] > Verifying review 68641 > Dependent review: [https://reviews.apache.org/api/review-requests/68640/] > Applying review 68640 > python support/apply-reviews.py -n -r 68640 > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 156, in verify_review > apply_reviews(review_request, reviews, handler) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 120, in apply_reviews > reviews, handler) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 124, in apply_reviews > apply_review(review_request["id"]) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 97, in apply_review > shell("python support/apply-reviews.py -n -r %s" % review_id) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 91, in shell > command, stderr=subprocess.STDOUT, shell=True) > File "/usr/lib/python3.5/subprocess.py", line 626, in check_output > 
**kwargs).stdout > File "/usr/lib/python3.5/subprocess.py", line 708, in run > output=stdout, stderr=stderr) > subprocess.CalledProcessError: Command 'python support/apply-reviews.py -n -r > 68640' returned non-zero exit status 1 > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 292, in > main() > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 286, in main > verify_review(review_request, handler) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 217, in verify_review > output += "\nFull log: " > TypeError: can't concat bytes to str > Build step 'Execute shell' marked build as failure > Sending e-mails to: bui...@mesos.apache.org > Finished: FAILURE > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9253) Reviewbot is failing when posting a review
Vinod Kone created MESOS-9253: - Summary: Reviewbot is failing when posting a review Key: MESOS-9253 URL: https://issues.apache.org/jira/browse/MESOS-9253 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Observed this in CI. [https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23425/console] {code} 09-23-18_02:12:05 - Running /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py Checking if review 68640 needs verification Skipping blocking review 68640 Checking if review 68641 needs verification Patch never verified, needs verification Dependent review: [https://reviews.apache.org/api/review-requests/68640/] Verifying review 68641 Dependent review: [https://reviews.apache.org/api/review-requests/68640/] Applying review 68640 python support/apply-reviews.py -n -r 68640 Traceback (most recent call last): File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 156, in verify_review apply_reviews(review_request, reviews, handler) File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 120, in apply_reviews reviews, handler) File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 124, in apply_reviews apply_review(review_request["id"]) File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 97, in apply_review shell("python support/apply-reviews.py -n -r %s" % review_id) File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 91, in shell command, stderr=subprocess.STDOUT, shell=True) File "/usr/lib/python3.5/subprocess.py", line 626, in check_output **kwargs).stdout File "/usr/lib/python3.5/subprocess.py", line 708, in run output=stdout, stderr=stderr) subprocess.CalledProcessError: Command 'python support/apply-reviews.py -n -r 68640' returned non-zero exit status 1 During handling of the above exception, another exception occurred: Traceback 
(most recent call last): File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 292, in main() File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 286, in main verify_review(review_request, handler) File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 217, in verify_review output += "\nFull log: " TypeError: can't concat bytes to str Build step 'Execute shell' marked build as failure Sending e-mails to: bui...@mesos.apache.org Finished: FAILURE {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9232) verify-reviews.py broken after enabling python3 support scripts
[ https://issues.apache.org/jira/browse/MESOS-9232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9232: - Assignee: Andrew Schwartzmeyer > verify-reviews.py broken after enabling python3 support scripts > --- > > Key: MESOS-9232 > URL: https://issues.apache.org/jira/browse/MESOS-9232 > Project: Mesos > Issue Type: Bug > Components: reviewbot, test >Affects Versions: 1.8.0 >Reporter: Benjamin Bannier >Assignee: Andrew Schwartzmeyer >Priority: Blocker > > Reviewbot is failing since {{support/verify-reviews.py}} was upgraded to use > the python3 instead of the python2 implementation. I see this was completely > refactored in {{590a75d0c9d61b0b07f8a3807225c40eb8189a0b}} and replaced the > existing impl with {{9c7eb909aad99e6ea6de0b1fd2a55a798764b00b}}. > We already fixed how the script gets invoked by Jenkins (it uses a completely > different way to pass arguments), but now see failures like > {noformat} > 09-14-18_08:43:03 - Running > /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py > 0 review requests need verification > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 250, in verification_needed_write > with open(parameters.out_file, 'w') as f: > AttributeError: 'Namespace' object has no attribute 'out_file' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 301, in > main() > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 297, in main > verification_needed_write(review_ids, parameters) > File > "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", > line 253, in verification_needed_write > print("Failed opening file '%s' for writing" % parameters.out_file) > AttributeError: 
'Namespace' object has no attribute 'out_file' > Build step 'Execute shell' marked build as failure > Sending e-mails to: bui...@mesos.apache.org > Finished: FAILURE > {noformat} > It looks like the script would need some additional modifications and > possible tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
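The `AttributeError: 'Namespace' object has no attribute 'out_file'` above is what argparse raises when code reads an attribute for which no argument was ever registered. A minimal sketch of the failure and fix (the `--out-file` flag name is an assumption for illustration, not necessarily the script's actual option):

```python
import argparse

parser = argparse.ArgumentParser()
args = parser.parse_args([])

# Bug pattern: no add_argument() registered this name, so the Namespace
# simply does not have the attribute and reading it raises.
try:
    args.out_file
except AttributeError:
    pass  # 'Namespace' object has no attribute 'out_file'

# Fix: register the option. argparse derives the attribute name 'out_file'
# from '--out-file' automatically (dashes become underscores).
parser.add_argument("--out-file", default=None)
args = parser.parse_args([])

# Also works when the flag is actually supplied:
args_with_flag = parser.parse_args(["--out-file", "reviews.txt"])
```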
[jira] [Created] (MESOS-9210) Mesos v1 scheduler library does not properly handle SUBSCRIBE retries
Vinod Kone created MESOS-9210: - Summary: Mesos v1 scheduler library does not properly handle SUBSCRIBE retries Key: MESOS-9210 URL: https://issues.apache.org/jira/browse/MESOS-9210 Project: Mesos Issue Type: Bug Affects Versions: 1.6.1, 1.5.1, 1.7.0 Reporter: Vinod Kone Assignee: Till Toenshoff After the authentication-related refactor done as part of [https://reviews.apache.org/r/62594/,] the state of the scheduler is checked in `send` ([https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L234)] but it is changed in `_send` ([https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L234).] As a result, we can have 2 SUBSCRIBE calls in flight at the same time on the same connection! This is not good and not spec compliant for an HTTP client that is expecting a streaming response. We need to fix the library to either drop the retried SUBSCRIBE call if one is in progress (as it was before the refactor) or close the old connection and start a new connection to send the retried SUBSCRIBE call. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
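The bug described above is a check-then-act race: the state guard lives in `send` while the state transition happens later in `_send`, so a retried SUBSCRIBE can pass the guard while an earlier one is still in flight. A minimal illustrative sketch of the first proposed fix — dropping a retry while one call is pending (the `Scheduler` class and method names are hypothetical, not the actual Mesos scheduler library API):

```python
class Scheduler:
    """Sketch: the in-flight marker must be set in the same step as the
    guard check, so a retry cannot slip through between check and change."""

    def __init__(self):
        self.subscribe_in_flight = False
        self.sent = []  # calls that actually went on the wire

    def subscribe(self):
        # Guard and state change together: if a SUBSCRIBE is already
        # pending on this connection, drop the retry instead of sending
        # a second streaming request on the same connection.
        if self.subscribe_in_flight:
            return False
        self.subscribe_in_flight = True
        self.sent.append("SUBSCRIBE")
        return True

sched = Scheduler()
first = sched.subscribe()    # first attempt goes out
retry = sched.subscribe()    # retry arrives before a response: dropped
```

The alternative fix mentioned in the ticket (tear down the old connection and resend on a fresh one) would instead reset `subscribe_in_flight` when the connection is closed.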
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598016#comment-16598016 ] Vinod Kone commented on MESOS-8568: --- [~qianzhang] Can you please set the affects and target versions? > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Improvement >Reporter: Andrei Budnik >Assignee: Qian Zhang >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. 
Despite the failure, the container might be > launched and the following attempt to remove the container without calling > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
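The ordering the ticket asks for can be expressed as a small sketch: always wait for the nested container to reach a terminal state before removing it, including on launch failure. The `agent` object and its methods below are hypothetical stand-ins for the `WAIT_NESTED_CONTAINER` and `REMOVE_NESTED_CONTAINER` agent API calls:

```python
def cleanup_check_container(agent, container_id):
    """Sketch of the required ordering: WAIT must precede REMOVE, even when
    the launch failed, otherwise REMOVE can race a still-terminating
    container and get '500 Nested container has not terminated yet'."""
    agent.wait_nested_container(container_id)    # blocks until terminal
    agent.remove_nested_container(container_id)  # now safe to clean up

class FakeAgent:
    """Toy agent that, like the real one, rejects REMOVE before WAIT."""
    def __init__(self):
        self.calls = []

    def wait_nested_container(self, cid):
        self.calls.append(("WAIT", cid))

    def remove_nested_container(self, cid):
        assert ("WAIT", cid) in self.calls, "REMOVE before WAIT"
        self.calls.append(("REMOVE", cid))

agent = FakeAgent()
cleanup_check_container(agent, "check-1")
```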
[jira] [Assigned] (MESOS-9191) Docker command executor may get stuck in an infinite unkillable loop.
[ https://issues.apache.org/jira/browse/MESOS-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9191: - Shepherd: Qian Zhang Assignee: Andrei Budnik Sprint: Mesosphere Sprint 2018-28 [~abudnik] Would you have cycles in the next sprint to work on this? > Docker command executor may get stuck in an infinite unkillable loop. > -- > > Key: MESOS-9191 > URL: https://issues.apache.org/jira/browse/MESOS-9191 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Reporter: Gilbert Song >Assignee: Andrei Budnik >Priority: Blocker > Labels: containerizer > > Due to the change from https://issues.apache.org/jira/browse/MESOS-8574, the > behavior of the docker command executor discarding the future of docker stop was > changed. If a new killTask() is invoked and there is an existing docker > stop in a pending state, the old one would be discarded and then the > new one executed. This is OK for most cases. > However, docker stop could take long (depending on the grace period and whether the > application can handle SIGTERM). If the framework retries killTask more > frequently than the grace period (which depends on the kill policy API, env var, or agent > flags), then the executor may be stuck forever with unkillable tasks, because > every time, before the docker stop finishes, the future of docker stop is > discarded by the new incoming killTask. > We should consider re-using the grace period before calling discard() on a pending > docker stop future. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
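The fix suggested at the end of the ticket — reuse the grace period before discarding a pending `docker stop` — can be sketched as: on a retried kill, only discard the in-flight stop once the grace period has elapsed; otherwise let it keep running. The class and method names below are illustrative, not the actual executor code:

```python
import time

class KillPolicy:
    """Sketch: repeated killTask() must not restart `docker stop` more often
    than the grace period, or the stop never finishes and the task becomes
    unkillable (the MESOS-9191 loop)."""

    def __init__(self, grace_period_secs, clock=time.monotonic):
        self.grace_period_secs = grace_period_secs
        self.clock = clock               # injectable for testing
        self.stop_started_at = None      # None means no stop in flight

    def on_kill_task(self):
        """Return True if a (new) `docker stop` should be issued now."""
        now = self.clock()
        if self.stop_started_at is None:
            self.stop_started_at = now
            return True                  # first kill: issue docker stop
        if now - self.stop_started_at >= self.grace_period_secs:
            self.stop_started_at = now
            return True                  # grace period expired: discard & retry
        return False                     # stop still within grace period

# Simulated clock: kill retries arriving faster than the grace period.
t = [0.0]
policy = KillPolicy(grace_period_secs=30, clock=lambda: t[0])
first = policy.on_kill_task()    # issues docker stop
t[0] = 10.0
early = policy.on_kill_task()    # retry within grace period: ignored
t[0] = 35.0
late = policy.on_kill_task()     # grace period elapsed: stop retried
```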
[jira] [Commented] (MESOS-9185) An attempt to remove or destroy container in composing containerizer leads to segfault
[ https://issues.apache.org/jira/browse/MESOS-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595174#comment-16595174 ] Vinod Kone commented on MESOS-9185: --- What versions are affected by this bug? Should this be backported? > An attempt to remove or destroy container in composing containerizer leads to > segfault > -- > > Key: MESOS-9185 > URL: https://issues.apache.org/jira/browse/MESOS-9185 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > Labels: mesosphere > Fix For: 1.8.0 > > > `LAUNCH_NESTED_CONTAINER` and `LAUNCH_NESTED_CONTAINER_SESSION` lead to a > segfault in the agent when the parent container is unknown to the composing > containerizer. If the parent container cannot be found during an attempt to > launch a nested container via `ComposingContainerizerProcess::launch()`, the > composing containerizer returns an error without cleaning up the container. On > `launch()` failures, the agent calls `destroy()` which accesses an uninitialized > `containerizer` field. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9181) Fix the comment in JNI libraries regarding weak reference and GC
Vinod Kone created MESOS-9181: - Summary: Fix the comment in JNI libraries regarding weak reference and GC Key: MESOS-9181 URL: https://issues.apache.org/jira/browse/MESOS-9181 Project: Mesos Issue Type: Documentation Reporter: Vinod Kone Our JNI libraries for MesosSchedulerDriver, v0Mesos and v1Mesos all use weak global references to the underlying Java objects, but they incorrectly state that this will prevent the JVM from GC'ing it. We need to fix these comments. e.g., [https://github.com/apache/mesos/blob/master/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L213] See the JNI spec for details: [https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#weak] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
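The point of the comment fix — a weak reference does not keep its referent alive — can be demonstrated in miniature with Python's `weakref` module (the JNI code itself is C++; this just shows the same semantics):

```python
import weakref

class Target:
    pass

obj = Target()
wr = weakref.ref(obj)

# While a strong reference exists, the weak ref resolves to the object.
assert wr() is obj

# Dropping the last strong reference lets the object be collected; in
# CPython, refcounting reclaims it immediately and the weak ref returns
# None. This is exactly why a JNI *weak* global reference cannot
# "prevent the JVM from GC'ing" the underlying Java object.
del obj
collected = wr() is None
```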
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590513#comment-16590513 ] Vinod Kone commented on MESOS-8568: --- Great repro! One orthogonal question though, it seems unfortunate that IOSwitchboard takes 5s to complete its cleanup for a container that has failed to launch. IIRC there was a 5s timeout in IOSwitchboard for some unexpected corner cases which is what we seem to be hitting here, but this is an *expected* case in some sense. Is there any way we can speed that up? > Command checks should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER` > -- > > Key: MESOS-8568 > URL: https://issues.apache.org/jira/browse/MESOS-8568 > Project: Mesos > Issue Type: Improvement >Reporter: Andrei Budnik >Assignee: Qian Zhang >Priority: Blocker > Labels: default-executor, health-check, mesosphere > > After successful launch of a nested container via > `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls > [waitNestedContainer > |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657] > for the container. Checker library > [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487] > `REMOVE_NESTED_CONTAINER` to remove a previous nested container before > launching a nested container for a subsequent check. Hence, > `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that > the nested container has been terminated and can be removed/cleaned up. > In case of failure, the library [doesn't > call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636] > `WAIT_NESTED_CONTAINER`. 
Despite the failure, the container might be > launched and the following attempt to remove the container without calling > `WAIT_NESTED_CONTAINER` leads to errors like: > {code:java} > W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal > Server Error' (Nested container has not terminated yet) while removing the > nested container > '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125' > used for the COMMAND check for task > 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91 > {code} > The checker library should always call `WAIT_NESTED_CONTAINER` before > `REMOVE_NESTED_CONTAINER`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9177) Mesos master segfaults when responding to /state requests.
[ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-9177: - Shepherd: Alexander Rukletsov Assignee: Benno Evers Sprint: Mesosphere Sprint 2018-27 Story Points: 3 Target Version/s: 1.7.0 > Mesos master segfaults when responding to /state requests. > -- > > Key: MESOS-9177 > URL: https://issues.apache.org/jira/browse/MESOS-9177 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.7.0 >Reporter: Alexander Rukletsov >Assignee: Benno Evers >Priority: Blocker > Labels: mesosphere > > {noformat} > *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; > stack trace: *** > @ 0x7f367e7226d0 (unknown) > @ 0x7f3681266913 > _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_ > @ 0x7f3681266af0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f36812882d0 > mesos::internal::master::FullFrameworkWriter::operator()() > @ 0x7f36812889d0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f368121aef0 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > 
@ 0x7f3681241be3 > _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_ > @ 0x7f3681242760 > _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_ > @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv > @ 0x7f368215f60e process::http::OK::OK() > @ 0x7f3681219061 > _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_ > @ 0x7f36812212c0 > _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_ > @ 0x7f36812215ac > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_ > @ 0x7f36821f3541 process::ProcessBase::consume() > @ 0x7f3682209fbc process::ProcessManager::resume() > @ 0x7f368220fa76 > 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > @ 0x7f367eefc2b0 (unknown) > @ 0x7f367e71ae25 start_thread > @ 0x7f367e444bad __clone > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process
[ https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588096#comment-16588096 ] Vinod Kone commented on MESOS-4065: --- Looks like we need to do a ZK upgrade to at least 3.5.4 to get this. > slave FD for ZK tcp connection leaked to executor process > - > > Key: MESOS-4065 > URL: https://issues.apache.org/jira/browse/MESOS-4065 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.24.1, 0.25.0, 1.2.2 >Reporter: James DeFelice >Priority: Major > Labels: mesosphere, security > > {code} > core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd > root 1432 99.3 0.0 202420 12928 ?Rsl 21:32 13:51 > ./etcd-mesos-executor -log_dir=./ > root 1450 0.4 0.1 38332 28752 ?Sl 21:32 0:03 ./etcd > --data-dir=etcd_data --name=etcd-1449178273 > --listen-peer-urls=http://10.0.0.45:1025 > --initial-advertise-peer-urls=http://10.0.0.45:1025 > --listen-client-urls=http://10.0.0.45:1026 > --advertise-client-urls=http://10.0.0.45:1026 > --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025 > --initial-cluster-state=existing > core 1651 0.0 0.0 6740 928 pts/0S+ 21:46 0:00 grep > --colour=auto -e etcd > core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181 > etcd-meso 1432 root 10u IPv4 21973 0t0TCP > ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181 > (ESTABLISHED) > core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave > root 1124 0.2 0.1 900496 25736 ?Ssl 21:11 0:04 > /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave > core 1658 0.0 0.0 6740 832 pts/0S+ 21:46 0:00 grep > --colour=auto -e slave > core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181 > mesos-sla 1124 root 10u IPv4 21973 0t0TCP > ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181 > (ESTABLISHED) > {code} > I only tested against mesos 0.24.1 and 0.25.0. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
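The standard guard against this kind of FD leak is marking the socket close-on-exec, so forked-and-exec'd children (here, executors) do not inherit the descriptor. A minimal sketch using `fcntl` (illustrative; per the comment above, the actual fix lived in upgrading the bundled ZooKeeper client to 3.5.4, where the client sets this itself):

```python
import fcntl
import socket

# Open a socket the way a long-lived daemon might (e.g. the agent's ZK
# connection), then mark it close-on-exec so any exec() in a child
# process closes the fd instead of leaking it to the executor.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

# Verify the flag stuck. (Note: Python 3.4+ already creates sockets
# with CLOEXEC set by default, per PEP 446; C daemons like the agent's
# ZK client must set it explicitly as above.)
cloexec_set = bool(fcntl.fcntl(sock.fileno(), fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
sock.close()
```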
[jira] [Assigned] (MESOS-8729) Libprocess: deadlock in process::finalize
[ https://issues.apache.org/jira/browse/MESOS-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-8729: - Shepherd: Benjamin Mahler Assignee: Andrei Budnik Sprint: Mesosphere Sprint 2018-28 Story Points: 3 > Libprocess: deadlock in process::finalize > - > > Key: MESOS-8729 > URL: https://issues.apache.org/jira/browse/MESOS-8729 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.6.0 > Environment: The issue has been reproduced on Ubuntu 16.04, master > branch, commit `42848653b2`. >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > Labels: deadlock, libprocess > Attachments: deadlock.txt > > > Since we are calling > [`libprocess::finalize()`|https://github.com/apache/mesos/blob/02ebf9986ab5ce883a71df72e9e3392a3e37e40e/src/slave/containerizer/mesos/io/switchboard_main.cpp#L157] > before returning from the IOSwitchboard's main function, we expect that all > http responses are going to be sent back to clients before IOSwitchboard > terminates. However, after [adding|https://reviews.apache.org/r/66147/] > `libprocess::finalize()` we have seen that IOSwitchboard might get stuck in > `libprocess::finalize()`. See attached stacktrace. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7076) libprocess tests fail when using libevent 2.1.8
[ https://issues.apache.org/jira/browse/MESOS-7076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-7076: - Assignee: Till Toenshoff > libprocess tests fail when using libevent 2.1.8 > --- > > Key: MESOS-7076 > URL: https://issues.apache.org/jira/browse/MESOS-7076 > Project: Mesos > Issue Type: Bug > Components: build, libprocess, test > Environment: macOS 10.12.3, libevent 2.1.8 (installed via Homebrew) >Reporter: Jan Schlicht >Assignee: Till Toenshoff >Priority: Critical > Labels: ci > > Running {{libprocess-tests}} on Mesos compiled with {{--enable-libevent > --enable-ssl}} on an operating system using libevent 2.1.8, SSL related tests > fail like > {noformat} > [ RUN ] SSLTest.SSLSocket > I0207 15:20:46.017881 2528580544 openssl.cpp:419] CA file path is > unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE= > I0207 15:20:46.017904 2528580544 openssl.cpp:424] CA directory path > unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR= > I0207 15:20:46.017918 2528580544 openssl.cpp:429] Will not verify peer > certificate! > NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification > I0207 15:20:46.017923 2528580544 openssl.cpp:435] Will only verify peer > certificate if presented! > NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate > verification > WARNING: Logging before InitGoogleLogging() is written to STDERR > I0207 15:20:46.033001 2528580544 openssl.cpp:419] CA file path is > unspecified! NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE= > I0207 15:20:46.033179 2528580544 openssl.cpp:424] CA directory path > unspecified! NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR= > I0207 15:20:46.033196 2528580544 openssl.cpp:429] Will not verify peer > certificate! > NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification > I0207 15:20:46.033201 2528580544 openssl.cpp:435] Will only verify peer > certificate if presented! 
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate > verification > ../../../3rdparty/libprocess/src/tests/ssl_tests.cpp:257: Failure > Failed to wait 15secs for Socket(socket.get()).recv() > [ FAILED ] SSLTest.SSLSocket (15196 ms) > {noformat} > Tests failing are > {noformat} > SSLTest.SSLSocket > SSLTest.NoVerifyBadCA > SSLTest.VerifyCertificate > SSLTest.ProtocolMismatch > SSLTest.ECDHESupport > SSLTest.PeerAddress > SSLTest.HTTPSGet > SSLTest.HTTPSPost > SSLTest.SilentSocket > SSLTest.ShutdownThenSend > SSLVerifyIPAdd/SSLTest.BasicSameProcess/0, where GetParam() = "false" > SSLVerifyIPAdd/SSLTest.BasicSameProcess/1, where GetParam() = "true" > SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/0, where GetParam() = "false" > SSLVerifyIPAdd/SSLTest.BasicSameProcessUnix/1, where GetParam() = "true" > SSLVerifyIPAdd/SSLTest.RequireCertificate/0, where GetParam() = "false" > SSLVerifyIPAdd/SSLTest.RequireCertificate/1, where GetParam() = "true" > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)