[jira] [Assigned] (MESOS-10199) Mesos doesn't set correct client request headers for HTTP requests

2020-11-18 Thread Vinod Kone (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-10199:
--

Assignee: Abdul Qadeer

> Mesos doesn't set correct client request headers for HTTP requests
> --
>
> Key: MESOS-10199
> URL: https://issues.apache.org/jira/browse/MESOS-10199
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, libprocess, master
>Reporter: Abdul Qadeer
>Assignee: Abdul Qadeer
>Priority: Major
>
> The agents are not able to contact or register with the master because the 
> requests don't set the 'Host' header field, and nginx is required to return 
> 400 for such requests per 
> [RFC 7230|https://tools.ietf.org/html/rfc7230#section-5.4]:
> {noformat}
> *7 client sent invalid host header while reading client request headers, 
> client: x.x.x.x, server: , request: "POST 
> /master/mesos.internal.ReregisterSlaveMessage HTTP/1.1", host: ""{noformat}
>  
>  
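A minimal sketch of the requirement cited in the description, not the libprocess fix itself: RFC 7230 section 5.4 requires a client to send a Host header field in every HTTP/1.1 request. The helper name build_request and the address in the usage note are hypothetical and chosen only for illustration.

```python
def build_request(method: str, path: str, host: str, port: int = 80) -> bytes:
    """Serialize a minimal HTTP/1.1 request head that always carries a Host header.

    Hypothetical helper for illustration; it is not part of Mesos or libprocess.
    """
    if not host:
        raise ValueError("HTTP/1.1 requests must name a host (RFC 7230, section 5.4)")
    # The port is left out when it is the default for plain http (80).
    host_value = host if port == 80 else f"{host}:{port}"
    lines = [
        f"{method} {path} HTTP/1.1",
        f"Host: {host_value}",
        "Content-Length: 0",
        "",  # blank line that terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

For example, build_request("POST", "/master/mesos.internal.ReregisterSlaveMessage", "10.0.0.1", 5050) produces a request whose second line is "Host: 10.0.0.1:5050", which is the field nginx reports as missing in the log above.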



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2020-04-30 Thread Vinod Kone (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097036#comment-17097036
 ] 

Vinod Kone commented on MESOS-8038:
---

Thanks [~cf.natali] for the repro and analysis.

The log lines you pasted in the comment above don't capture everything that 
transpired; you would need a grep like this to get the whole picture. 

{quote}
grep -E "task-650af3bd-3f5b-4e17-9d34-4642480b4da0|:36541|6f446173-2bba-4cc4-bc15-c956bc159d4e" mesos_agent.log
{quote}

But, anyway, I think your observations are largely correct. When a container is 
in the process of being destroyed, the agent does short-circuit and send the 
terminal update to the master, causing the resources to be released, offered, 
and used by some other task. 

I remember discussions around this behavior in the past, but I'm not sure where 
we landed in terms of a long-term solution. Right now, we err on the side of 
releasing the resources in case the cgroup gets stuck in destruction, instead 
of hoarding them. If we do decide to change this code to always wait for the 
cgroup destruction (or the update) to finish, there's a possibility that 
resources are locked forever in case of bugs (either in Mesos or the kernel) in 
the destruction path. I can't remember if we have seen this behavior in 
production clusters before. 

[~abudnik] [~greggomann] thoughts on fixing this?




> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, mesos_agent.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by the executor getting killed and the tasks getting lost. This 
> happens even before the job starts. A quick search of the code base points to 
> something related to GPU resources as the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.





[jira] [Assigned] (MESOS-6084) Deprecate and remove the included MPI framework

2020-03-02 Thread Vinod Kone (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-6084:
-

Assignee: Vinod Kone

> Deprecate and remove the included MPI framework
> ---
>
> Key: MESOS-6084
> URL: https://issues.apache.org/jira/browse/MESOS-6084
> Project: Mesos
>  Issue Type: Task
>Affects Versions: 1.0.0
>Reporter: Joseph Wu
>Assignee: Vinod Kone
>Priority: Minor
>  Labels: mpi
>
> The Mesos codebase still includes code for an 
> [MPI|http://www.mcs.anl.gov/research/projects/mpi/] framework.  This code has 
> been untouched and probably not used since around Mesos 0.9.0.  Since we 
> don't support this code anymore, we should deprecate and remove it.
> The code is located here:
> https://github.com/apache/mesos/tree/db4c8a0e9eaf27f3e2d42a620a5e612863cbf9ea/mpi





[jira] [Commented] (MESOS-10092) Cannot pull image from docker registry which does not reply with 'scope'/'service' in WWW-Authenticate header

2020-02-25 Thread Vinod Kone (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044770#comment-17044770
 ] 

Vinod Kone commented on MESOS-10092:


Up to 1.7 should be fine, I think.

> Cannot pull image from docker registry which does not reply with 
> 'scope'/'service' in WWW-Authenticate header
> -
>
> Key: MESOS-10092
> URL: https://issues.apache.org/jira/browse/MESOS-10092
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Critical
> Fix For: 1.8.2, 1.9.1, 1.10.0
>
>
> This problem was encountered when trying to specify container image 
> nvcr.io/nvidia/tensorflow:19.12-tf1-py3
> When initiating Docker Registry authentication 
> (https://docs.docker.com/registry/spec/auth/token/) with nvcr.io, Mesos URI 
> fetcher receives 'WWW-Authenticate' header without 'service' and 'scope' 
> params, and fails here:
> https://github.com/apache/mesos/blob/1e9b121273a6d9248a78ab44798bd4c1138c31ee/src/uri/fetchers/docker.cpp#L1083
> This is an example of an unsuccessful request made by Mesos:
> {code}
> curl -s -S -L -i --raw --http1.1 \
>   -H "Accept: application/vnd.docker.distribution.manifest.v2+json,application/vnd.docker.distribution.manifest.v1+json,application/vnd.docker.distribution.manifest.v1+prettyjws" \
>   -y 60 https://nvcr.io/v2/nvidia/tensorflow/manifests/19.08-py3
> HTTP/1.1 401 Unauthorized
> Content-Type: text/html
> Date: Wed, 22 Jan 2020 19:01:57 GMT
> Server: nginx/1.14.2
> Www-Authenticate: Bearer realm="https://nvcr.io/proxy_auth?scope=repository:nvidia/tensorflow:pull,push"
> Content-Length: 195
> Connection: keep-alive
> 
> 401 Authorization Required
> 
> 401 Authorization Required
> nginx/1.14.2
> 
> 
> {code}
> At the same time, docker is perfectly capable of pulling this image.
> Note that the document "Token Authentication Specification" 
> (https://docs.docker.com/registry/spec/auth/token/), on which the Mesos 
> implementation is based, is vague on the issue of registries that do not 
> provide 'scope'/'service' in the WWW-Authenticate header.
> What Docker does differently (at the very least, in the case of nvcr.io):
> It sends the initial request not to the manifest/blob URI, but to the 
> repository root URI (http://nvcr.io/v2 in this case):
> {code}
> GET /v2/ HTTP/1.1
> Host: nvcr.io
> User-Agent: docker/18.03.1-ce go/go1.9.5 git-commit/9ee9f402cd 
> kernel/4.15.0-60-generic os/linux arch/amd64 
> UpstreamClient(Docker-Client/18.09.7 \(linux\))
> {code}
> To this, it receives a response with a "realm" that contains no query arguments:
> {code}
> HTTP/1.1 401 Unauthorized
> Connection: close
> Content-Length: 195
> Content-Type: text/html
> Date: Wed, 29 Jan 2020 12:22:43 GMT
> Server: nginx/1.14.2
> Www-Authenticate: Bearer realm="https://nvcr.io/proxy_auth"
> {code}
> Then, it composes the scope using the image ref and a hardcoded "pull" 
> action: 
> https://github.com/docker/distribution/blob/a8371794149d1d95f1e846744b05c87f2f825e5a/registry/client/auth/session.go#L174
> (in full accordance with this spec: 
> https://docs.docker.com/registry/spec/auth/scope/)
> and sends the following request to https://nvcr.io/proxy_auth:
> {code}
> GET /proxy_auth?scope=repository%3Anvidia%2Ftensorflow%3Apull HTTP/1.1
> Host: nvcr.io
> User-Agent: Go-http-client/1.1
> {code}
> (Note that 'push' is absent from the scope)
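The Docker-side behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the Mesos or Docker implementation: the function names parse_bearer_challenge and token_url are hypothetical, the realm is assumed to carry no query string (as in the /v2/ response quoted above), and the fallback scope follows the "repository:<name>:pull" form from the description.

```python
import re
from urllib.parse import urlencode


def parse_bearer_challenge(header: str) -> dict:
    """Parse a 'Bearer key="value",...' challenge into a dict of its parameters."""
    scheme, _, params = header.strip().partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError("not a Bearer challenge")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def token_url(challenge: dict, repository: str, action: str = "pull") -> str:
    """Build the token request URL, composing the scope when the registry omits it.

    Assumes challenge["realm"] has no query string of its own.
    """
    query = {}
    if "service" in challenge:
        query["service"] = challenge["service"]
    # Fall back to a scope derived from the image reference and a fixed action.
    query["scope"] = challenge.get("scope", f"repository:{repository}:{action}")
    return f'{challenge["realm"]}?{urlencode(query)}'
```

With the realm from the /v2/ response and the repository nvidia/tensorflow, token_url yields the same path and query as the GET /proxy_auth request shown above.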





[jira] [Commented] (MESOS-10092) Cannot pull image from docker registry which does not reply with 'scope'/'service' in WWW-Authenticate header

2020-02-25 Thread Vinod Kone (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044728#comment-17044728
 ] 

Vinod Kone commented on MESOS-10092:


[~asekretenko] Should this be resolved? Also, is this being backported?

> Cannot pull image from docker registry which does not reply with 
> 'scope'/'service' in WWW-Authenticate header
> -
>
> Key: MESOS-10092
> URL: https://issues.apache.org/jira/browse/MESOS-10092
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Critical
>
> This problem was encountered when trying to specify container image 
> nvcr.io/nvidia/tensorflow:19.12-tf1-py3
> When initiating Docker Registry authentication 
> (https://docs.docker.com/registry/spec/auth/token/) with nvcr.io, Mesos URI 
> fetcher receives 'WWW-Authenticate' header without 'service' and 'scope' 
> params, and fails here:
> https://github.com/apache/mesos/blob/1e9b121273a6d9248a78ab44798bd4c1138c31ee/src/uri/fetchers/docker.cpp#L1083
> This is an example of an unsuccessful request made by Mesos:
> {code}
> curl -s -S -L -i --raw --http1.1 \
>   -H "Accept: application/vnd.docker.distribution.manifest.v2+json,application/vnd.docker.distribution.manifest.v1+json,application/vnd.docker.distribution.manifest.v1+prettyjws" \
>   -y 60 https://nvcr.io/v2/nvidia/tensorflow/manifests/19.08-py3
> HTTP/1.1 401 Unauthorized
> Content-Type: text/html
> Date: Wed, 22 Jan 2020 19:01:57 GMT
> Server: nginx/1.14.2
> Www-Authenticate: Bearer realm="https://nvcr.io/proxy_auth?scope=repository:nvidia/tensorflow:pull,push"
> Content-Length: 195
> Connection: keep-alive
> 
> 401 Authorization Required
> 
> 401 Authorization Required
> nginx/1.14.2
> 
> 
> {code}
> At the same time, docker is perfectly capable of pulling this image.
> Note that the document "Token Authentication Specification" 
> (https://docs.docker.com/registry/spec/auth/token/), on which the Mesos 
> implementation is based, is vague on the issue of registries that do not 
> provide 'scope'/'service' in the WWW-Authenticate header.
> What Docker does differently (at the very least, in the case of nvcr.io):
> It sends the initial request not to the manifest/blob URI, but to the 
> repository root URI (http://nvcr.io/v2 in this case):
> {code}
> GET /v2/ HTTP/1.1
> Host: nvcr.io
> User-Agent: docker/18.03.1-ce go/go1.9.5 git-commit/9ee9f402cd 
> kernel/4.15.0-60-generic os/linux arch/amd64 
> UpstreamClient(Docker-Client/18.09.7 \(linux\))
> {code}
> To this, it receives a response with a "realm" that contains no query arguments:
> {code}
> HTTP/1.1 401 Unauthorized
> Connection: close
> Content-Length: 195
> Content-Type: text/html
> Date: Wed, 29 Jan 2020 12:22:43 GMT
> Server: nginx/1.14.2
> Www-Authenticate: Bearer realm="https://nvcr.io/proxy_auth"
> {code}
> Then, it composes the scope using the image ref and a hardcoded "pull" 
> action: 
> https://github.com/docker/distribution/blob/a8371794149d1d95f1e846744b05c87f2f825e5a/registry/client/auth/session.go#L174
> (in full accordance with this spec: 
> https://docs.docker.com/registry/spec/auth/scope/)
> and sends the following request to https://nvcr.io/proxy_auth:
> {code}
> GET /proxy_auth?scope=repository%3Anvidia%2Ftensorflow%3Apull HTTP/1.1
> Host: nvcr.io
> User-Agent: Go-http-client/1.1
> {code}
> (Note that 'push' is absent from the scope)





[jira] [Commented] (MESOS-4659) Avoid leaving orphan task after framework failure + master failover

2020-02-17 Thread Vinod Kone (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038595#comment-17038595
 ] 

Vinod Kone commented on MESOS-4659:
---

I don't have the bandwidth right now, but I'm happy to review the code if you 
work on a patch. Please see instructions here: 
https://mesos.readthedocs.io/en/latest/submitting-a-patch/

> Avoid leaving orphan task after framework failure + master failover
> ---
>
> Key: MESOS-4659
> URL: https://issues.apache.org/jira/browse/MESOS-4659
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Priority: Major
>  Labels: failover, mesosphere
>
> If a framework becomes disconnected from the master, its tasks are killed 
> after waiting for {{failover_timeout}}.
> However, if a master failover occurs but a framework never reconnects to the 
> new master, we never kill any of the tasks associated with that framework. 
> These tasks remain orphaned and presumably would need to be manually removed 
> by the operator. Similarly, if a framework gets torn down or disconnects 
> while it has running tasks on a partitioned agent, those tasks are not 
> shut down when the agent reregisters.
> We should consider whether to kill such orphaned tasks automatically, likely 
> after waiting for some (framework-configurable?) timeout.
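One way to picture the proposal in the description is the sketch below. It is purely hypothetical: Mesos has no Framework class or orphaned_frameworks function like this, and the choice to reuse each framework's failover_timeout (measured from the master failover) is an assumption made for illustration, since the issue leaves the timeout open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Framework:
    """Hypothetical record the new master keeps per framework after failover."""
    framework_id: str
    failover_timeout: float  # seconds the framework is allowed to stay away
    reconnected_at: Optional[float] = None  # None until it re-registers


def orphaned_frameworks(frameworks: List[Framework],
                        failover_time: float,
                        now: float) -> List[str]:
    """Return IDs of frameworks that never reconnected within their timeout.

    Tasks of these frameworks would be candidates for automatic removal.
    """
    return [
        f.framework_id
        for f in frameworks
        if f.reconnected_at is None and now - failover_time >= f.failover_timeout
    ]
```

Frameworks that reconnect at any point drop out of the result, so only tasks whose framework stayed away for the whole timeout would be killed.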





[jira] [Assigned] (MESOS-6352) Expose information about unreachable agents via operator API

2020-01-23 Thread Vinod Kone (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-6352:
-

Assignee: (was: Abhishek Dasgupta)

> Expose information about unreachable agents via operator API
> 
>
> Key: MESOS-6352
> URL: https://issues.apache.org/jira/browse/MESOS-6352
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Neil Conway
>Priority: Major
>  Labels: mesosphere
>
> Operators would probably find information about the set of unreachable agents 
> useful. Two main use cases I can see: (a) identifying which agents are 
> currently unreachable and when they were marked unreachable, (b) 
> understanding the size/content of the registry as a way to debug registry 
> perf issues.





[jira] [Commented] (MESOS-9923) AgentAPITest.GetStateWithNonTerminalCompletedTask is flaky

2019-08-29 Thread Vinod Kone (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918740#comment-16918740
 ] 

Vinod Kone commented on MESOS-9923:
---

Observed this on ASF CI when testing 1.9.0-rc2

{code}
3: [ RUN  ] ContentType/AgentAPITest.GetStateWithNonTerminalCompletedTask/0
3: I0828 21:20:00.647260 17669 cluster.cpp:177] Creating default 'local' 
authorizer
3: I0828 21:20:00.655491 17681 master.cpp:440] Master 
cff62302-83f2-4586-b6a6-ec603af07f35 (5ca4a76bb68c) started on 172.17.0.3:46115
3: I0828 21:20:00.655534 17681 master.cpp:443] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1000secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/49Bxak/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_operator_event_stream_subscribers="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/49Bxak/master" --zk_session_timeout="10secs"
3: I0828 21:20:00.656090 17681 master.cpp:492] Master only allowing 
authenticated frameworks to register
3: I0828 21:20:00.656103 17681 master.cpp:498] Master only allowing 
authenticated agents to register
3: I0828 21:20:00.656111 17681 master.cpp:504] Master only allowing 
authenticated HTTP frameworks to register
3: I0828 21:20:00.656119 17681 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/49Bxak/credentials'
3: I0828 21:20:00.656491 17681 master.cpp:548] Using default 'crammd5' 
authenticator
3: I0828 21:20:00.656787 17681 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
3: I0828 21:20:00.657025 17681 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
3: I0828 21:20:00.657196 17681 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
3: I0828 21:20:00.657344 17681 master.cpp:629] Authorization enabled
3: I0828 21:20:00.657789 17676 hierarchical.cpp:474] Initialized hierarchical 
allocator process
3: I0828 21:20:00.658103 17677 whitelist_watcher.cpp:77] No whitelist given
3: I0828 21:20:00.664515 17681 master.cpp:2170] Elected as the leading master!
3: I0828 21:20:00.664557 17681 master.cpp:1666] Recovering from registrar
3: I0828 21:20:00.665055 17681 registrar.cpp:339] Recovering registrar
3: I0828 21:20:00.666002 17676 registrar.cpp:383] Successfully fetched the 
registry (0B) in 896us
3: I0828 21:20:00.666203 17676 registrar.cpp:487] Applied 1 operations in 
62949ns; attempting to update the registry
3: I0828 21:20:00.667132 17676 registrar.cpp:544] Successfully updated the 
registry in 852224ns
3: I0828 21:20:00.667313 17676 registrar.cpp:416] Successfully recovered 
registrar
3: I0828 21:20:00.667974 17676 master.cpp:1819] Recovered 0 agents from the 
registry (143B); allowing 10mins for agents to reregister
3: I0828 21:20:00.668090 17685 hierarchical.cpp:513] Skipping recovery of 
hierarchical allocator: nothing to recover
3: W0828 21:20:00.687932 17669 process.cpp:2877] Attempted to spawn already 
running process files@172.17.0.3:46115
3: I0828 21:20:00.689092 17669 cluster.cpp:518] Creating default 'local' 
authorizer
3: W0828 21:20:00.692358 17669 process.cpp:2877] Attempted to spawn already 
running process version@172.17.0.3:46115
3: I0828 21:20:00.692720 17684 slave.cpp:267] Mesos agent started on 
(901)@172.17.0.3:46115
3: I0828 21:20:00.692745 17684 slave.cpp:268] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/49Bxak/YlM9y5/store/appc" 
--authenticate_http_readonly="true" --authenticate_http_readwrite="false" 
--authenticatee="crammd5" --authentication_backoff_factor="1secs" 
--authentication_timeout_max="1mins" --authentication_timeout_min="5secs" 

[jira] [Commented] (MESOS-8983) SlaveRecoveryTest/0.PingTimeoutDuringRecovery is flaky

2019-08-28 Thread Vinod Kone (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918041#comment-16918041
 ] 

Vinod Kone commented on MESOS-8983:
---

Seen this again when testing 1.9.0-RC2.

{code}
13:32:33 3: [ RUN  ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery
13:32:33 3: I0828 18:32:33.580678 20801 cluster.cpp:177] Creating default 
'local' authorizer
13:32:33 3: I0828 18:32:33.587858 20824 master.cpp:440] Master 
3de64da7-619c-4652-9d33-3fe2ca2a3d5f (b766865f9da3) started on 172.17.0.2:42011
13:32:33 3: I0828 18:32:33.587904 20824 master.cpp:443] Flags at startup: 
--acls="" --agent_ping_timeout="1secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/sIRhDp/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="2" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_operator_event_stream_subscribers="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/sIRhDp/master" --zk_session_timeout="10secs"
13:32:33 3: I0828 18:32:33.588558 20824 master.cpp:492] Master only allowing 
authenticated frameworks to register
13:32:33 3: I0828 18:32:33.588574 20824 master.cpp:498] Master only allowing 
authenticated agents to register
13:32:33 3: I0828 18:32:33.588587 20824 master.cpp:504] Master only allowing 
authenticated HTTP frameworks to register
13:32:33 3: I0828 18:32:33.588599 20824 credentials.hpp:37] Loading credentials 
for authentication from '/tmp/sIRhDp/credentials'
13:32:33 3: I0828 18:32:33.588999 20824 master.cpp:548] Using default 'crammd5' 
authenticator
13:32:33 3: I0828 18:32:33.589262 20824 http.cpp:975] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readonly'
13:32:33 3: I0828 18:32:33.589529 20824 http.cpp:975] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readwrite'
13:32:33 3: I0828 18:32:33.589697 20824 http.cpp:975] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-scheduler'
13:32:33 3: I0828 18:32:33.589866 20824 master.cpp:629] Authorization enabled
13:32:33 3: I0828 18:32:33.590817 20823 whitelist_watcher.cpp:77] No whitelist 
given
13:32:33 3: I0828 18:32:33.594827 20816 master.cpp:2170] Elected as the leading 
master!
13:32:33 3: I0828 18:32:33.594887 20816 master.cpp:1666] Recovering from 
registrar
13:32:33 3: I0828 18:32:33.595124 20808 hierarchical.cpp:474] Initialized 
hierarchical allocator process
13:32:33 3: I0828 18:32:33.595382 20808 registrar.cpp:339] Recovering registrar
13:32:33 3: I0828 18:32:33.596575 20808 registrar.cpp:383] Successfully fetched 
the registry (0B) in 1.14688ms
13:32:33 3: I0828 18:32:33.596779 20808 registrar.cpp:487] Applied 1 operations 
in 63194ns; attempting to update the registry
13:32:33 3: I0828 18:32:33.597638 20819 registrar.cpp:544] Successfully updated 
the registry in 788224ns
13:32:33 3: I0828 18:32:33.597805 20819 registrar.cpp:416] Successfully 
recovered registrar
13:32:33 3: I0828 18:32:33.598423 20819 master.cpp:1819] Recovered 0 agents 
from the registry (144B); allowing 10mins for agents to reregister
13:32:33 3: I0828 18:32:33.598599 20813 hierarchical.cpp:513] Skipping recovery 
of hierarchical allocator: nothing to recover
13:32:33 3: I0828 18:32:33.614511 20801 containerizer.cpp:318] Using isolation 
{ environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
13:32:33 3: W0828 18:32:33.615756 20801 backend.cpp:76] Failed to create 
'overlay' backend: OverlayBackend requires root privileges
13:32:33 3: W0828 18:32:33.615855 20801 backend.cpp:76] Failed to create 'aufs' 
backend: AufsBackend requires root privileges
13:32:33 3: W0828 18:32:33.615934 20801 backend.cpp:76] Failed to create 'bind' 
backend: BindBackend requires root privileges
13:32:33 3: I0828 18:32:33.616178 20801 provisioner.cpp:300] Using 

[jira] [Created] (MESOS-9955) Automate publishing SNAPSHOT JAR

2019-08-28 Thread Vinod Kone (Jira)
Vinod Kone created MESOS-9955:
-

 Summary: Automate publishing SNAPSHOT JAR
 Key: MESOS-9955
 URL: https://issues.apache.org/jira/browse/MESOS-9955
 Project: Mesos
  Issue Type: Improvement
Reporter: Vinod Kone
Assignee: Vinod Kone


Currently snapshot jars are manually published by a committer by running 
support/snapshot.sh. Instead, we should have Jenkins periodically build and 
publish the snapshot jar.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-13 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906473#comment-16906473
 ] 

Vinod Kone commented on MESOS-9545:
---

[~greggomann] Let's backport this to older releases.

> Marking an unreachable agent as gone should transition the tasks to terminal 
> state
> --
>
> Key: MESOS-9545
> URL: https://issues.apache.org/jira/browse/MESOS-9545
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
> Fix For: 1.9.0
>
>
> If an unreachable agent is marked as gone, the master currently just marks 
> that agent in the registry but doesn't do anything about its tasks. So the 
> tasks stay in UNREACHABLE state in the master forever, until the master fails 
> over. This is not great UX. We should transition these to a terminal state 
> instead.
> This fix should also include a test to verify the transition.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9936) Slave recovery is very slow with high local volume persistant ( marathon app )

2019-08-13 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906276#comment-16906276
 ] 

Vinod Kone commented on MESOS-9936:
---

[~Fcomte] That's pretty weird and unexpected. Can you share a gdb stack trace 
taken during one of these long recovery periods?

> Slave recovery is very slow with high local volume persistant ( marathon app )
> --
>
> Key: MESOS-9936
> URL: https://issues.apache.org/jira/browse/MESOS-9936
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.8.1
>Reporter: Frédéric Comte
>Priority: Major
>
> I run some applications with local persistent volumes.
> After an unplanned shutdown of nodes running this kind of application, I 
> see that the Mesos recovery process takes a lot of time (more than 8 
> hours).
> This time depends on the amount of data in those volumes.
> What does Mesos do during this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 
> docker.cpp:890] Recovering Docker containers 
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 
> containerizer.cpp:801] Recovering Mesos containers 
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 
> linux_launcher.cpp:286] Recovering Linux launcher 
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 
> containerizer.cpp:1127] Recovering isolators 
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 
> containerizer.cpp:1166] Recovering provisioner 
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 
> composing.cpp:339] Finished recovering all containerizers 
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 
> status_update_manager_process.hpp:314] Recovering operation status update 
> manager 
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 
> slave.cpp:7729] Recovering executors
> {code}
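To locate where the time goes in a log like the one above, the longest pause between consecutive entries can be computed from the glog timestamps. The sketch below is an illustration only; the function name largest_gap is hypothetical, each entry is assumed to sit on a single unwrapped line, and because glog omits the year all entries are assumed to fall within the same year.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# glog prefix: severity letter, MMDD, then HH:MM:SS.microseconds (no year).
GLOG_TS = re.compile(r"\b[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\b")


def largest_gap(lines: Iterable[str]) -> Tuple[timedelta, str, str]:
    """Return (gap, line_before, line_after) for the longest pause between entries."""
    stamped = []
    for line in lines:
        match = GLOG_TS.search(line)
        if match:
            when = datetime.strptime(" ".join(match.groups()), "%m%d %H:%M:%S.%f")
            stamped.append((when, line))
    best = (timedelta(0), "", "")
    # Compare each timestamped entry with the one that follows it.
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        if t1 - t0 > best[0]:
            best = (t1 - t0, l0, l1)
    return best
```

Applied to the entries quoted above, the longest pause falls between "Recovering provisioner" and "Finished recovering all containerizers", which narrows the slow step down to the period after provisioner recovery starts.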





[jira] [Created] (MESOS-9921) Mesos UI should display TaskStatus Reason in Tasks table

2019-08-01 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9921:
-

 Summary: Mesos UI should display TaskStatus Reason in Tasks table
 Key: MESOS-9921
 URL: https://issues.apache.org/jira/browse/MESOS-9921
 Project: Mesos
  Issue Type: Improvement
  Components: webui
Reporter: Vinod Kone


The Tasks table shows "State", but it would be useful for at-a-glance debugging 
to also show the "Reason" in either the same or a different column. This is 
especially important for the completed tasks table.





[jira] [Commented] (MESOS-6566) The Docker executor should not leak task env variables in the Docker command cmd line.

2019-07-12 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883994#comment-16883994
 ] 

Vinod Kone commented on MESOS-6566:
---

See the description in MESOS-6951 for a potential solution using the `--env` 
argument.

> The Docker executor should not leak task env variables in the Docker command 
> cmd line.
> --
>
> Key: MESOS-6566
> URL: https://issues.apache.org/jira/browse/MESOS-6566
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, security
>Reporter: Gastón Kleiman
>Assignee: Till Toenshoff
>Priority: Major
>
> Task environment variables are sensitive, as they might contain secrets.
> The Docker executor starts tasks by executing a {{docker run}} command, and 
> it includes the env variables on the command line of that docker command, 
> exposing them to all users on the machine:
> {code}
> $ ./src/mesos-execute --command="sleep 200" --containerizer=docker \
>     --docker_image=alpine --env='{"foo": "bar"}' --master=10.0.2.15:5050 \
>     --name=test
> $ ps aux | grep bar
> [...] docker -H unix:///var/run/docker.sock run [...] -e foo=bar [...] alpine -c sleep 200
> $
> {code}
> The Docker executor could pass Docker the {{--env-file}} flag, pointing it to 
> a file with the environment variables.
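The {{--env-file}} idea from the description can be sketched as follows. This is an illustration, not the Docker executor's code: the function name docker_run_argv and the "task-env-" prefix are hypothetical, and the sketch assumes values without newlines, since an env file holds one NAME=value pair per line.

```python
import os
import tempfile
from typing import Dict, List, Tuple


def docker_run_argv(image: str, command: List[str],
                    env: Dict[str, str]) -> Tuple[List[str], str]:
    """Return (argv, env_file_path); variables go in the file, not on the command line."""
    for name, value in env.items():
        if "=" in name or "\n" in name or "\n" in value:
            raise ValueError(f"cannot encode variable {name!r} in an env file")
    # mkstemp creates the file readable and writable only by its owner.
    fd, path = tempfile.mkstemp(prefix="task-env-")
    with os.fdopen(fd, "w") as env_file:
        for name, value in env.items():
            env_file.write(f"{name}={value}\n")
    argv = ["docker", "run", "--env-file", path, image, *command]
    return argv, path
```

The caller would remove the file once the container has started. Because only the file path appears in argv, a "ps aux" listing no longer reveals the variable values.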





[jira] [Assigned] (MESOS-7473) Use "-dev" prerelease label for version during development

2019-07-09 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-7473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-7473:
-

Assignee: (was: Neil Conway)

> Use "-dev" prerelease label for version during development
> --
>
> Key: MESOS-7473
> URL: https://issues.apache.org/jira/browse/MESOS-7473
> Project: Mesos
>  Issue Type: Task
>Reporter: Neil Conway
>Priority: Major
>  Labels: mesosphere
>
> Prior discussion:
> https://lists.apache.org/thread.html/6e291c504fd44b79e452744b80073cb33adc1be85c17e22bbca35a6c@%3Cdev.mesos.apache.org%3E
> https://lists.apache.org/thread.html/eb526c9295b3cf8e4efc7e0a7d2dacabb61ab5ed867a05e7d913d3fb@%3Cdev.mesos.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9868) NetworkInfo from the agent /state endpoint is not correct.

2019-07-08 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9868:
-

Assignee: Qian Zhang

> NetworkInfo from the agent /state endpoint is not correct.
> --
>
> Key: MESOS-9868
> URL: https://issues.apache.org/jira/browse/MESOS-9868
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Gilbert Song
>Assignee: Qian Zhang
>Priority: Blocker
>  Labels: containerization
>
> NetworkInfo from the agent /state endpoint is not correct, which is also 
> different from the networkInfo of /containers endpoint. Some frameworks rely 
> on the state endpoint to get the ip address for other containers to run.
> agent's state endpoint
> {noformat}
> {
> "state": "TASK_RUNNING",
> "timestamp": 1561574343.1521769,
> "container_status": {
> "container_id": {
> "value": "9a2633be-d2e5-4636-9ad4-7b2fc669da99",
> "parent": {
> "value": "45ebab16-9b4b-416e-a7f2-4833fd4ed8ff"
> }
> },
> "network_infos": [
> {
> "ip_addresses": [
> {
> "protocol": "IPv4",
> "ip_address": "172.31.10.35"
> }
> ]
> }
> ]
> },
> "healthy": true
> }
> {noformat}
> agent's /containers endpoint
> {noformat}
> "status": {
> "container_id": {
> "value": "5ffc9df2-3be6-4879-8b2d-2fde3f0477e0"
> },
> "executor_pid": 16063,
> "network_infos": [
> {
> "ip_addresses": [
> {
> "ip_address": "9.0.35.71",
> "protocol": "IPv4"
> }
> ],
> "name": "dcos"
> }
> ]
> }
> {noformat}
> The IP addresses above are different.
> The container is in RUNNING state and is running correctly; only the state 
> endpoint is wrong. Note that the state endpoint used 
> to show the correct IP. After an agent restart and a master leader 
> re-election, the IP address in the state endpoint changed.
> Here is the checkpointed CNI network information:
> {noformat}
> OK-23:37:48-root@int-mountvolumeagent2-soak113s:/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S4/frameworks/26ffb84c-81ba-4b3b-989b-9c6560e51fa1-0171/executors/k8s-clusters.kc02__etcd__b50dc403-30d1-4b54-a367-332fb3621030/runs/latest/tasks/k8s-clusters.kc02__etcd-2-peer__5b6aa5fc-e113-4021-9db8-b63e0c8d1f6c
>  # cat 
> /var/run/mesos/isolators/network/cni/45ebab16-9b4b-416e-a7f2-4833fd4ed8ff/dcos/network.conf
>  
> {"args":{"org.apache.mesos":{"network_info":{"name":"dcos"}}},"chain":"M-DCOS","delegate":{"bridge":"m-dcos","hairpinMode":true,"ipMasq":false,"ipam":{"dataDir":"/var/run/dcos/cni/networks","routes":[{"dst":"0.0.0.0/0"}],"subnet":"9.0.73.0/25","type":"host-local"},"isGateway":true,"mtu":1420,"type":"bridge"},"excludeDevices":["m-dcos"],"name":"dcos","type":"mesos-cni-port-mapper"}
> {noformat}
> {noformat}
> OK-01:30:05-root@int-mountvolumeagent2-soak113s:/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S4/frameworks/26ffb84c-81ba-4b3b-989b-9c6560e51fa1-0171/executors/k8s-clusters.kc02__etcd__b50dc403-30d1-4b54-a367-332fb3621030/runs/latest/tasks/k8s-clusters.kc02__etcd-2-peer__5b6aa5fc-e113-4021-9db8-b63e0c8d1f6c
>  # cat 
> /var/run/mesos/isolators/network/cni/45eb16-9b4b-416e-a7f2-4833fd4ed8ff/dcos/eth0/network.info
> {"dns":{},"ip4":{"gateway":"9.0.73.1","ip":"9.0.73.65/25","routes":[{"dst":"0.0.0.0/0","gw":"9.0.73.1"}]}}
> {noformat}
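
The comparison described above can be sketched as a small check over the two JSON fragments. This assumes both fragments have already been fetched and parsed, and that they describe the same container; the function names are hypothetical, while the field names are the ones shown in the ticket.

```python
def ip_addresses(container_status):
    """Collect every ip_address listed under network_infos."""
    return sorted(
        address["ip_address"]
        for info in container_status.get("network_infos", [])
        for address in info.get("ip_addresses", [])
    )


def network_info_mismatch(task_status, container):
    """Return (state_ips, containers_ips) if they differ, else None.

    task_status: one status entry from the agent /state endpoint.
    container:   one entry from the agent /containers endpoint.
    """
    state_ips = ip_addresses(task_status["container_status"])
    containers_ips = ip_addresses(container["status"])
    if state_ips == containers_ips:
        return None
    return state_ips, containers_ips
```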





[jira] [Assigned] (MESOS-8500) Enhanced support for multi-role scalability

2019-06-14 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8500:
-

Assignee: Andrei Sekretenko  (was: Kapil Arya)

> Enhanced support for multi-role scalability
> ---
>
> Key: MESOS-8500
> URL: https://issues.apache.org/jira/browse/MESOS-8500
> Project: Mesos
>  Issue Type: Epic
>Reporter: Kapil Arya
>Assignee: Andrei Sekretenko
>Priority: Major
>  Labels: mesosphere, resource-management
>
> CC: [~bmahler]





[jira] [Created] (MESOS-9784) Server side SSL Certificate Validation

2019-05-14 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9784:
-

 Summary: Server side SSL Certificate Validation
 Key: MESOS-9784
 URL: https://issues.apache.org/jira/browse/MESOS-9784
 Project: Mesos
  Issue Type: Epic
Reporter: Vinod Kone
Assignee: Benno Evers








[jira] [Commented] (MESOS-9761) Mesos UI does not properly account for resources set via `--default_role`

2019-05-02 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831691#comment-16831691
 ] 

Vinod Kone commented on MESOS-9761:
---

The "Guarantee" and "Limit" columns there currently reflect only the quota 
set via the quota endpoints. A reservation for a role does technically 
guarantee that amount to the role (which is where I assume your confusion 
stems from), but that is not what those columns are meant to show today. There 
are plans to improve the quota page in the near future. cc [~mzhu] [~bmahler]

> Mesos UI does not properly account for resources set via `--default_role`
> -
>
> Key: MESOS-9761
> URL: https://issues.apache.org/jira/browse/MESOS-9761
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: resource-management, ui
> Attachments: default_role_ui.png
>
>
> In our cluster, we have two agents configured with  
> "--default_role=slave_public" and 64 cpus each, for a total of 128 cpus 
> allocated to this role. The right side of the screenshot shows one of them.
> However, in the "Roles" tab of the Mesos UI, neither "Guarantee" nor 
> "Limit" shows any resources for this role.
> See attached screenshot for details.





[jira] [Commented] (MESOS-9739) When recovered agent marked gone, retain agent ID

2019-04-25 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826320#comment-16826320
 ] 

Vinod Kone commented on MESOS-9739:
---

The marked agent is already retained in the registry, right? Right now, if a 
gone agent attempts to reregister, the master refuses it and shuts it down. Any 
reconciliation requests should already have been answered with 
TASK_GONE_BY_OPERATOR. So I'm not sure there is more to do?

> When recovered agent marked gone, retain agent ID
> -
>
> Key: MESOS-9739
> URL: https://issues.apache.org/jira/browse/MESOS-9739
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
>
> When a recovered agent is marked gone, we could retain its agent ID so that 
> if it attempts to reregister, we could send task status updates for its tasks.
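
The behavior described in the comment above can be sketched as a toy model. This is only an illustration of the stated rules, not the master's code; the class, method names and return strings are hypothetical, and only the TASK_GONE_BY_OPERATOR state name comes from the thread.

```python
TASK_GONE_BY_OPERATOR = "TASK_GONE_BY_OPERATOR"


class ToyMaster:
    """Toy model: gone agent IDs are retained and consulted on later requests."""

    def __init__(self):
        self.gone_agents = set()  # stands in for the registry's gone list

    def mark_gone(self, agent_id):
        self.gone_agents.add(agent_id)

    def reregister(self, agent_id):
        # A gone agent is refused and told to shut down.
        return "SHUTDOWN" if agent_id in self.gone_agents else "REREGISTERED"

    def reconcile(self, agent_id):
        # Tasks on a gone agent are reported with a terminal state.
        if agent_id in self.gone_agents:
            return TASK_GONE_BY_OPERATOR
        return None  # other cases are outside this sketch
```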





[jira] [Assigned] (MESOS-9123) Add metric role consumed quota.

2019-04-11 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9123:
-

Assignee: Meng Zhu  (was: Till Toenshoff)

> Add metric role consumed quota.
> ---
>
> Key: MESOS-9123
> URL: https://issues.apache.org/jira/browse/MESOS-9123
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Meng Zhu
>Priority: Critical
>  Labels: allocator, mesosphere, metrics, resource-management
>
> Currently, quota-related metrics expose the quota guarantee and the 
> allocated quota. We should also expose "consumed" quota, which is allocated 
> quota plus unallocated reservations. We already have this info in the 
> allocator as `consumedQuotaScalarQuantities`; it just needs to be exposed.
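
As a worked illustration of the definition above (consumed = allocated quota + unallocated reservations), here is a small sketch. Plain dictionaries keyed by resource name stand in for the allocator's scalar quantities; the function name is hypothetical.

```python
def consumed_quota(allocated, unallocated_reservations):
    """Sum allocated quota and unallocated reservations per resource name."""
    names = set(allocated) | set(unallocated_reservations)
    return {
        name: allocated.get(name, 0) + unallocated_reservations.get(name, 0)
        for name in names
    }
```

For example, a role with 6 cpus allocated and 2 more cpus reserved but not allocated would report 8 cpus consumed.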





[jira] [Commented] (MESOS-9687) Add the glog patch to pass microseconds via the LogSink interface.

2019-04-08 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812705#comment-16812705
 ] 

Vinod Kone commented on MESOS-9687:
---

Any plans to backport this?

> Add the glog patch to pass microseconds via the LogSink interface.
> --
>
> Key: MESOS-9687
> URL: https://issues.apache.org/jira/browse/MESOS-9687
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Sekretenko
>Assignee: Andrei Sekretenko
>Priority: Major
> Fix For: 1.8.0
>
>
> Currently, custom LogSink implementations in the modules (for example, this 
> one:
>  [https://github.com/dcos/dcos-mesos-modules/blob/master/logsink/logsink.hpp] 
> )
>  are logging `00` instead of microseconds in the timestamp - simply 
> because the LogSink interface in glog has no place for microseconds.
> The proposed glog fix is here: [https://github.com/google/glog/pull/441]
> Getting this into a glog release might take a long time (they released 0.4.0 
> recently, but the previous release, 0.3.5, was two years ago), so it 
> makes sense to add this patch to the Mesos build.
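
To illustrate the symptom, the sketch below formats a timestamp in the `mmdd hh:mm:ss.uuuuuu` layout that glog uses in its default log prefix. The function and its `have_microseconds` switch are hypothetical; they only model a sink that receives second-resolution time versus one that also receives microseconds.

```python
from datetime import datetime


def glog_timestamp(when, have_microseconds=True):
    """Format `when` as mmdd hh:mm:ss.uuuuuu.

    With have_microseconds=False the fractional part is printed as zeros,
    which is all a sink can do if it is only handed whole seconds.
    """
    usecs = when.microsecond if have_microseconds else 0
    return "%02d%02d %02d:%02d:%02d.%06d" % (
        when.month, when.day, when.hour, when.minute, when.second, usecs)
```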





[jira] [Comment Edited] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

2019-04-08 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812550#comment-16812550
 ] 

Vinod Kone edited comment on MESOS-6285 at 4/8/19 5:27 PM:
---

We already limit the number of completed tasks per executor (200, not 
configurable), completed executors per framework (150, configurable) and max 
frameworks (50, not configurable) kept in memory. I don't think there's much 
value in storing metadata about more than these tasks/executors/frameworks on 
disk. If we agree, we need to figure out how to GC a task/executor/framework 
once it rotates out of the in-memory circular buffers / bounded hashmaps 
holding these.


was (Author: vinodkone):
We already limit the number of completed tasks per executor (200, not 
configurable) and completed executors per framework (150, configurable) in 
memory. I don't think there's much value in storing metadata information about 
more than these tasks/executors on the disk? If yes, we need to figure out how 
to GC a task/executor once it goes out of the in-memory circular buffers / 
bounded hashmaps holding these.

> Agents may OOM during recovery if there are too many tasks or executors
> ---
>
> Key: MESOS-6285
> URL: https://issues.apache.org/jira/browse/MESOS-6285
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> On a test cluster, we encountered a degenerate case where running the 
> example {{long-lived-framework}} for over a week would render the agent 
> un-recoverable.  
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and 
> launches a single task on that executor every time it receives an offer from 
> that agent.  Over a week's worth of time, the framework manages to launch 
> some 400k tasks (short sleeps) on one executor.  During runtime, this is not 
> problematic, as each completed task is quickly rotated out of the agent's 
> memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory, 
> which leads to slow recovery; and often results in the agent being OOM-killed 
> before it finishes recovering.
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>    // Helper to launch a task using an offer.
>    void launch(const Offer& offer)
>    {
> -    int taskId = tasksLaunched++;
> -    ++metrics.tasks_launched;
> -
> -    TaskInfo task;
> -    task.set_name("Task " + stringify(taskId));
> -    task.mutable_task_id()->set_value(stringify(taskId));
> -    task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -    task.mutable_resources()->CopyFrom(taskResources);
> -    task.mutable_executor()->CopyFrom(executor);
> -
>      Call call;
>      call.set_type(Call::ACCEPT);
>  
> @@ -380,7 +370,23 @@ private:
>      Offer::Operation* operation = accept->add_operations();
>      operation->set_type(Offer::Operation::LAUNCH);
>  
> -    operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +    // Launch as many tasks as possible in the given offer.
> +    Resources remaining = Resources(offer.resources()).flatten();
> +    while (remaining.contains(taskResources)) {
> +      int taskId = tasksLaunched++;
> +      ++metrics.tasks_launched;
> +
> +      TaskInfo task;
> +      task.set_name("Task " + stringify(taskId));
> +      task.mutable_task_id()->set_value(stringify(taskId));
> +      task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +      task.mutable_resources()->CopyFrom(taskResources);
> +      task.mutable_executor()->CopyFrom(executor);
> +
> +      operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +      remaining -= taskResources;
> +    }
>  
>      mesos->send(call);
>    }
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}.  On a 1 CPU, 1 GB agent 
> + this patch, it should take about 10 minutes to build up sufficient task 
> launches.
> 3) Restart the agent and watch it flail during recovery.
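
The core of the repro patch is the loop that keeps carving tasks out of one offer until the remaining resources no longer fit a task. A language-neutral sketch of that loop follows, with dictionaries standing in for the `Resources` type; the function name and the dictionary layout are hypothetical.

```python
def tasks_for_offer(offer, task_resources, next_task_id=0):
    """Pack as many identical tasks as fit into one offer."""
    if not any(amount > 0 for amount in task_resources.values()):
        raise ValueError("task_resources must request something")

    remaining = dict(offer)
    tasks = []
    # Mirrors `while (remaining.contains(taskResources))` in the patch.
    while all(remaining.get(name, 0) >= amount
              for name, amount in task_resources.items()):
        tasks.append({"task_id": str(next_task_id),
                      "resources": dict(task_resources)})
        next_task_id += 1
        # Mirrors `remaining -= taskResources`.
        for name, amount in task_resources.items():
            remaining[name] -= amount
    return tasks
```

With one task per offer replaced by many tasks per offer, the number of completed tasks checkpointed per executor grows much faster, which is what shortens the repro time.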





[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

2019-04-08 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812635#comment-16812635
 ] 

Vinod Kone commented on MESOS-6285:
---

Note that we currently read the executor state from disk for *all* completed 
executors in `state.cpp`. We can improve this to only read completed executor 
information until we reach the completed executors per framework limit. Same 
with completed tasks and completed frameworks.
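
A minimal sketch of the proposed improvement, assuming the checkpointed entries can be enumerated lazily in the order they should be kept (that ordering is an assumption, not something the thread states). The function name is hypothetical; the point is only that reading stops once the limit is reached.

```python
from itertools import islice


def recover_completed(entries, limit):
    """Read at most `limit` checkpointed entries and ignore the rest.

    `entries` should be a lazy iterator (for example, a generator that
    parses one checkpoint file per step), so unread entries never reach
    memory.
    """
    return list(islice(entries, limit))
```

With a limit of 200 completed tasks per executor, an executor that checkpointed 400k tasks would cost 200 reads during recovery instead of 400k.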



[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

2019-04-08 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812550#comment-16812550
 ] 

Vinod Kone commented on MESOS-6285:
---

We already limit the number of completed tasks per executor (200, not 
configurable) and completed executors per framework (150, configurable) in 
memory. I don't think there's much value in storing metadata information about 
more than these tasks/executors on the disk? If yes, we need to figure out how 
to GC a task/executor once it goes out of the in-memory circular buffers / 
bounded hashmaps holding these.



[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

2019-04-08 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812533#comment-16812533
 ] 

Vinod Kone commented on MESOS-6285:
---

Raising the priority to Critical because we have seen this happen in a 
production cluster.



[jira] [Commented] (MESOS-9693) Add master validation for SeccompInfo.

2019-04-05 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810961#comment-16810961
 ] 

Vinod Kone commented on MESOS-9693:
---

In addition to the points raised above, there is also an upgrade compatibility 
issue with implementing this.

If a framework's task doesn't work when seccomp is enabled (e.g., a kubelet 
task that needs to run unconfined so that it can launch k8s pods that are 
seccomp-confined by the Docker seccomp profile), then the framework first 
needs to be upgraded to use the seccomp unconfined option. If this framework 
is already running on a cluster without seccomp enabled, the upgraded 
framework must keep running while seccomp is still disabled. After the 
framework upgrade, the Mesos agent can be upgraded to enable seccomp without 
affecting the framework. So Mesos cannot reject such a task when seccomp is 
disabled; it can only ignore the seccomp setting.

[~gilbert] [~abudnik] Should we close this as "Won't do"?

> Add master validation for SeccompInfo.
> --
>
> Key: MESOS-9693
> URL: https://issues.apache.org/jira/browse/MESOS-9693
> Project: Mesos
>  Issue Type: Task
>Reporter: Gilbert Song
>Assignee: Andrei Budnik
>Priority: Major
>
> 1. If seccomp is not enabled, we should return a failure if any framework 
> specifies seccompInfo, and send an appropriate status update.
> 2. At most one of the fields profile_name and unconfined should be set; it 
> is better to validate this in the master.
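
A rough sketch of the two rules under discussion, with a dictionary standing in for the SeccompInfo message (a key being present means the field is set). The function names are hypothetical. `effective_seccomp` follows the comment above in ignoring the setting when seccomp is disabled, rather than rejecting the task as item 1 originally proposed.

```python
def validate_seccomp_info(seccomp_info):
    """Return an error message, or None if the message is acceptable."""
    if "profile_name" in seccomp_info and "unconfined" in seccomp_info:
        return "At most one of 'profile_name' and 'unconfined' may be set"
    return None


def effective_seccomp(seccomp_enabled, seccomp_info):
    """Return the setting to apply, or None if it should be ignored."""
    if not seccomp_enabled:
        # Ignore instead of rejecting, so an upgraded framework keeps
        # working on agents that have not enabled seccomp yet.
        return None
    return seccomp_info
```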





[jira] [Commented] (MESOS-6934) Support pulling Docker images with V2 Schema 2 image manifest

2019-03-29 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805022#comment-16805022
 ] 

Vinod Kone commented on MESOS-6934:
---

[~gilbert] Can you post the review chain here?

> Support pulling Docker images with V2 Schema 2 image manifest
> -
>
> Key: MESOS-6934
> URL: https://issues.apache.org/jira/browse/MESOS-6934
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
> Environment: https://reviews.apache.org/r/70288/
> https://reviews.apache.org/r/70289/
> https://reviews.apache.org/r/70290/
> https://reviews.apache.org/r/70291/
>Reporter: Ilya Pronin
>Assignee: Gilbert Song
>Priority: Major
>  Labels: containerization
>
> MESOS-3505 added support for pulling Docker images by their digest to the 
> Mesos Containerizer provisioner. However currently it only works with images 
> that were pushed with Docker 1.9 and older or with Registry 2.2.1 and older. 
> Newer versions use Schema 2 manifests by default. Because of CAS constraints 
> the registry does not convert those manifests on-the-fly to Schema 1 when 
> they are being pulled by digest.
> Compatibility details are documented here: 
> https://docs.docker.com/registry/compatibility/
> Image Manifest V2, Schema 2 is documented here: 
> https://docs.docker.com/registry/spec/manifest-v2-2/
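
The content-addressing constraint mentioned above can be illustrated with a short sketch: a digest names the exact manifest bytes, so a manifest converted to another schema necessarily has a different digest than the one the client asked for. The function names are hypothetical; the media type string is the one the Schema 2 specification defines.

```python
import hashlib
import json

SCHEMA2_MEDIA_TYPE = "application/vnd.docker.distribution.manifest.v2+json"


def manifest_digest(raw_manifest):
    """Digest of the manifest exactly as served, byte for byte."""
    return "sha256:" + hashlib.sha256(raw_manifest).hexdigest()


def is_schema2(raw_manifest):
    """True if the manifest declares itself as Image Manifest V2, Schema 2."""
    manifest = json.loads(raw_manifest)
    return (manifest.get("schemaVersion") == 2
            and manifest.get("mediaType") == SCHEMA2_MEDIA_TYPE)
```

A puller that supports Schema 2 would check `is_schema2` on the response body and verify that `manifest_digest` of the same bytes equals the digest it requested.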





[jira] [Assigned] (MESOS-6934) Support pulling Docker images with V2 Schema 2 image manifest

2019-03-29 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-6934:
-

Assignee: Gilbert Song  (was: Ilya Pronin)



[jira] [Commented] (MESOS-2842) Master crashes when framework changes principal on re-registration

2019-03-28 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804274#comment-16804274
 ] 

Vinod Kone commented on MESOS-2842:
---

Yes, this ticket should just focus on disallowing or ignoring principal changes 
on re-registration.

> Master crashes when framework changes principal on re-registration
> --
>
> Key: MESOS-2842
> URL: https://issues.apache.org/jira/browse/MESOS-2842
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Sekretenko
>Priority: Critical
>  Labels: foundations, security
>
> The master should be updated to avoid crashing when a framework re-registers 
> with a different principal.
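
The "disallow" option from the comment above could look roughly like the following check, run before any state is updated so that a mismatch produces an error instead of a crash. This is a hypothetical sketch; the function name and error text are not taken from the Mesos code.

```python
def validate_reregistration(registered_principal, new_principal):
    """Return an error message if the principal changed, else None."""
    if new_principal != registered_principal:
        return ("Changing the framework principal from %r to %r is not "
                "allowed on re-registration"
                % (registered_principal, new_principal))
    return None
```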





[jira] [Commented] (MESOS-9672) Docker containerizer should ignore pids of executors that do not pass the connection check.

2019-03-25 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801146#comment-16801146
 ] 

Vinod Kone commented on MESOS-9672:
---

I guess we would still need this in case pid reuse happens even without an 
agent reboot (highly unlikely but technically possible).

> Docker containerizer should ignore pids of executors that do not pass the 
> connection check.
> ---
>
> Key: MESOS-9672
> URL: https://issues.apache.org/jira/browse/MESOS-9672
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Meng Zhu
>Priority: Major
>  Labels: containerization
>
> When recovering executors with a tracked pid we first try to establish a 
> connection to its libprocess address to avoid reaping an irrelevant process:
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1019-L1054
> If the connection fails to establish, we should not track its pid: 
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1071
> One problem this might cause is that, if the pid is in use by another 
> executor, it could trigger a duplicate pid error and send the agent into a 
> crash loop:
> https://github.com/apache/mesos/blob/4580834471fb3bc0b95e2b96e04a63d34faef724/src/slave/containerizer/docker.cpp#L1066-L1068
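
The recovery rule proposed in the description can be sketched as follows. This is not the containerizer's code: the function name, the tuple layout and the injected `can_connect` callback are hypothetical, and the connection check is passed in so the logic can be exercised without real sockets.

```python
def recover_pids(executors, can_connect):
    """Track only pids whose executor passes the connection check.

    executors:   iterable of (container_id, pid, libprocess_address).
    can_connect: callable taking an address and returning True or False.
    """
    tracked = {}
    for container_id, pid, address in executors:
        if not can_connect(address):
            # The pid may have been reused by an unrelated process, so it
            # is neither reaped nor tracked.
            continue
        if pid in tracked:
            raise RuntimeError("Duplicate pid %d for containers %s and %s"
                               % (pid, tracked[pid], container_id))
        tracked[pid] = container_id
    return tracked
```

Skipping the stale entry before the duplicate check is what avoids the crash loop: a dead executor's recycled pid no longer collides with the live executor that now owns it.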





[jira] [Commented] (MESOS-9672) Docker containerizer should ignore pids of executors that do not pass the connection check.

2019-03-22 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799331#comment-16799331
 ] 

Vinod Kone commented on MESOS-9672:
---

Not sure if this is still needed after 
https://issues.apache.org/jira/browse/MESOS-9501






[jira] [Commented] (MESOS-4719) Add allocator metric for number of offers each role / framework received.

2019-03-19 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796509#comment-16796509
 ] 

Vinod Kone commented on MESOS-4719:
---

[~greggomann] Don't we have this now? cc [~gkleiman]

> Add allocator metric for number of offers each role / framework received.
> -
>
> Key: MESOS-4719
> URL: https://issues.apache.org/jira/browse/MESOS-4719
> Project: Mesos
> Issue Type: Improvement
> Components: allocation
> Reporter: Benjamin Bannier
> Priority: Major
> Labels: mesosphere
>
> A counter for the number of allocations to a framework can be used to monitor 
> allocation progress, e.g., when agents are added to a cluster, and as other 
> frameworks are added or removed.
> Currently, an offer by the hierarchical allocator to a framework consists of 
> a list of resources on possibly many agents. Resources might be offered in 
> order to satisfy outstanding quota or for fairness. To capture allocations at 
> a fine granularity, we should count not the number of offers but the 
> pieces making up each offer, as such a metric would better resolve the effect 
> of changes (e.g., adding/removing a framework).
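
As a rough sketch of the counting scheme described above: the example below assumes that a "piece" is the set of resources a single agent contributes to an offer, which is one possible reading of the description. The names `Offer` and `AllocationCounter` are hypothetical and do not correspond to the allocator's real metrics code.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified offer: resources grouped by agent ID.
using Offer = std::map<std::string, std::vector<std::string>>;

// Per-framework counter of allocation "pieces".
class AllocationCounter {
public:
  // Counts one piece per agent that contributes resources to the offer,
  // rather than counting the offer itself once.
  void record(const std::string& frameworkId, const Offer& offer)
  {
    for (const auto& entry : offer) {
      if (!entry.second.empty()) {
        ++counts_[frameworkId];
      }
    }
  }

  std::size_t get(const std::string& frameworkId) const
  {
    auto it = counts_.find(frameworkId);
    return it == counts_.end() ? 0 : it->second;
  }

private:
  std::map<std::string, std::size_t> counts_;
};
```

An offer spanning two agents then advances the counter by two, so adding agents to the cluster shows up in the metric even when the number of offers stays the same.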





[jira] [Commented] (MESOS-9500) spark submit with docker image on mesos cluster fails.

2019-03-19 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796507#comment-16796507
 ] 

Vinod Kone commented on MESOS-9500:
---

[~atheethkaup] Can you paste all the relevant agent log lines related to this 
task? It's hard to tell from just the line you posted above. We need the log 
lines from the task launch all the way to the task failure.

> spark submit with docker image on mesos cluster fails.
> --
>
> Key: MESOS-9500
> URL: https://issues.apache.org/jira/browse/MESOS-9500
> Project: Mesos
> Issue Type: Bug
> Components: docker
> Affects Versions: 1.7.0
> Reporter: atheeth kaup
> Priority: Critical
>
> We have a 3-node cluster with Mesos 1.7, Spark 2.4, and Docker 18.06.1 
> installed. One node is the master and the other two are agents. The job 
> fails on spark-submit. The UI shows only one task (the driver), launched on 
> one of the slaves, in the failed state.
>  
> command:
> spark-submit \
>  --master mesos://.32:7077 \
>  --deploy-mode cluster \
>  --class com.learning.spark.WordCount \
>  --conf spark.mesos.executor.docker.image=mesosphere/spark:2.4.0-2.2.1-3-hadoop-2.7 \
>  --conf spark.master.rest.enabled=true \
>  /home/mapr/mesos/wordcount.jar \
>  hdfs://***.36:8020/user/mapr/sparkL/input.txt \
>  hdfs://***.36:8020/user/output
>  
> Error in one of the Logs:
>  
> Running on machine: **-i0058
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> W1221 16:51:23.857431 17978 state.cpp:478] Failed to find executor forked pid 
> file 
> '/home/**/mesos/mesos-1.7.0/build/workDir/meta/slaves/822a5d52-b8ba-459f-ade2-7f3a2ebd240f-S0/frameworks/77c39bdf-09e3-4cb9-9026-21e900d08318-0007/executors/driver-20181221112019-0006/runs/7c1399ca-4e0a-4bd9-b02e-9c5ca3854c77/pids/forked.pid'
>  
> Below is the only property that we have set on all the nodes and have started 
> the dispatcher:
> *export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/libmesos.so*
>  





[jira] [Assigned] (MESOS-9651) Design for docker registry v2 schema2 basic support.

2019-03-19 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9651:
-

Assignee: Qian Zhang

> Design for docker registry v2 schema2 basic support.
> 
>
> Key: MESOS-9651
> URL: https://issues.apache.org/jira/browse/MESOS-9651
> Project: Mesos
> Issue Type: Task
> Components: containerization
> Reporter: Gilbert Song
> Assignee: Qian Zhang
> Priority: Major
> Labels: containerization
>






[jira] [Commented] (MESOS-9448) Semantics of RECONCILE_OPERATIONS framework API call are incorrect

2019-03-14 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792789#comment-16792789
 ] 

Vinod Kone commented on MESOS-9448:
---

[~greggomann] Can we close this as a dup of MESOS-9318?

> Semantics of RECONCILE_OPERATIONS framework API call are incorrect
> --
>
> Key: MESOS-9448
> URL: https://issues.apache.org/jira/browse/MESOS-9448
> Project: Mesos
> Issue Type: Bug
> Components: framework, HTTP API, master
> Reporter: Benjamin Bannier
> Priority: Major
>
> The typical pattern in the framework HTTP API is that frameworks send calls 
> to which the master responds with {{Accepted}} responses and which trigger 
> events. The only designed exception to this are {{SUBSCRIBE}} calls to which 
> the master responds with an {{Ok}} response containing the assigned framework 
> ID. This is even codified in {{src/scheduler.cpp:646ff}},
> {code}
> if (response->code == process::http::Status::OK) {
>   // Only SUBSCRIBE call should get a "200 OK" response.
>   CHECK_EQ(Call::SUBSCRIBE, call.type());
> {code}
> Currently, the handling of {{RECONCILE_OPERATIONS}} calls does not follow 
> this pattern. Instead of sending events, the master immediately responds with 
> an {{Ok}} and a list of operations. This leads, e.g., to assertion failures in 
> the hard check above whenever one uses {{Scheduler::send}} instead of 
> {{Scheduler::call}}. One can reproduce this by modifying the existing tests 
> in {{src/operation_reconciliation_tests.cpp}},
> {code}
> mesos.send({createCallReconcileOperations(frameworkId, {operation})}); // ADD THIS.
> const Future result =
>   mesos.call({createCallReconcileOperations(frameworkId, {operation})});
> {code}





[jira] [Commented] (MESOS-8257) Unified Containerizer "leaks" a target container mount path to the host FS when the target resolves to an absolute path

2019-03-13 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792153#comment-16792153
 ] 

Vinod Kone commented on MESOS-8257:
---

[~jasonlai] [~jieyu] Is there more to be done here?

> Unified Containerizer "leaks" a target container mount path to the host FS 
> when the target resolves to an absolute path
> ---
>
> Key: MESOS-8257
> URL: https://issues.apache.org/jira/browse/MESOS-8257
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 1.3.1, 1.4.1, 1.5.0
> Reporter: Jason Lai
> Assignee: Jason Lai
> Priority: Critical
> Labels: bug, containerizer, mountpath
>
> If a target path under the root FS provisioned from an image resolves to an 
> absolute path, it will not appear in the container root FS after 
> {{pivot_root(2)}} is called.
> A typical example: when the target path is under {{/var/run}} (e.g. 
> {{/var/run/some-dir}}), which in Debian images is usually a symlink to the 
> absolute path {{/run}}, the target path gets resolved to and created 
> at {{/run/some-dir}} in the host root FS after the container root FS is 
> provisioned. The target path is then unmounted after {{pivot_root(2)}} because 
> it is part of the old root (host FS).
> A workaround is to use {{/run}} instead of {{/var/run}}, but absolute 
> symlinks need to be resolved within the scope of the container root FS path.
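
To illustrate what resolving within the scope of the container root FS means, here is a minimal sketch that works on an in-memory symlink table instead of the real file system. The names `Symlinks`, `split`, and `resolveInRootfs` are hypothetical, and relative link targets, "..", and link loops are deliberately left out.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical symlink table for an image: maps a path inside the
// container root FS to its link target, e.g. "/var/run" -> "/run".
using Symlinks = std::map<std::string, std::string>;

// Splits "/a/b" into {"a", "b"}, dropping empty components.
std::vector<std::string> split(const std::string& path)
{
  std::vector<std::string> parts;
  std::stringstream stream(path);
  std::string part;
  while (std::getline(stream, part, '/')) {
    if (!part.empty()) {
      parts.push_back(part);
    }
  }
  return parts;
}

// Resolves `target` (a path as seen from inside the container) to a host
// path, treating absolute symlink targets as relative to `rootfs`
// instead of the host root.
std::string resolveInRootfs(
    const std::string& rootfs,
    const std::string& target,
    const Symlinks& symlinks)
{
  std::string resolved;  // Container-scoped path built so far.
  for (const std::string& part : split(target)) {
    resolved += "/" + part;
    auto link = symlinks.find(resolved);
    if (link != symlinks.end()) {
      resolved = link->second;  // Stays inside the container scope.
    }
  }
  return rootfs + resolved;
}
```

Under these assumptions, {{/var/run/some-dir}} maps to {{<rootfs>/run/some-dir}} rather than to {{/run/some-dir}} on the host, so the mount point survives {{pivot_root(2)}}.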





[jira] [Assigned] (MESOS-4599) ReviewBot should re-verify a review chain if any of the reviews is updated

2019-02-26 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-4599:
-

Assignee: Vinod Kone

> ReviewBot should re-verify a review chain if any of the reviews is updated
> --
>
> Key: MESOS-4599
> URL: https://issues.apache.org/jira/browse/MESOS-4599
> Project: Mesos
> Issue Type: Improvement
> Components: reviewbot
> Reporter: Vinod Kone
> Assignee: Vinod Kone
> Priority: Major
> Labels: integration, newbie++
>
> Currently reviewbot only re-verifies a review chain if the last review in the 
> chain is updated (new diff or new "depends on" field). It should also 
> re-verify if one of the dependent reviews in the chain is updated!
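
The requested check can be sketched as below. This is only an illustration of the idea: the `Review` record, the use of plain timestamps, and the function `needsVerification` are assumptions, not part of the actual ReviewBot script.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified review record. Timestamps are seconds since
// the epoch.
struct Review {
  std::string id;
  long lastUpdated;
};

// `chain` lists a review together with everything it depends on. Returns
// true if any review in the chain changed after the last verification,
// not just the final one.
bool needsVerification(const std::vector<Review>& chain, long lastVerified)
{
  for (const Review& review : chain) {
    if (review.lastUpdated > lastVerified) {
      return true;
    }
  }
  return false;
}
```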





[jira] [Created] (MESOS-9592) Mesos Websitebot is flaky

2019-02-21 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9592:
-

 Summary: Mesos Websitebot is flaky
 Key: MESOS-9592
 URL: https://issues.apache.org/jira/browse/MESOS-9592
 Project: Mesos
  Issue Type: Bug
  Components: project website
Reporter: Vinod Kone


The Mesos Websitebot Jenkins job sometimes fails during the endpoint 
documentation generation phase. It appears to time out waiting for a 
response from the /health endpoint of the master.

Example failing build: 
https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Websitebot/1899/

{code}
01:20:30 make[2]: Leaving directory '/mesos/build/src'
01:20:30 make[1]: Leaving directory '/mesos/build/src'
01:20:30 /mesos
01:20:41 Timeout attempting to hit url: http://127.0.0.1:5050/health

{code}
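
The cause of the timeout is not established in this report. Purely as an illustration of how a startup health check can be made less timing-sensitive, the sketch below polls a probe until a deadline. The function `waitUntilHealthy` is hypothetical, and the probe stands in for an HTTP request to /health; none of this is taken from the Websitebot scripts.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Calls `probe` until it returns true or `timeout` elapses, sleeping
// `interval` between attempts. Returns whether the probe ever succeeded.
bool waitUntilHealthy(
    const std::function<bool()>& probe,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval)
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;

  while (true) {
    if (probe()) {
      return true;
    }

    // Give up if another sleep would take us past the deadline.
    if (std::chrono::steady_clock::now() + interval > deadline) {
      return false;
    }

    std::this_thread::sleep_for(interval);
  }
}
```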





[jira] [Commented] (MESOS-8930) THREADSAFE_SnapshotTimeout is flaky.

2019-02-20 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773095#comment-16773095
 ] 

Vinod Kone commented on MESOS-8930:
---

Saw this when testing 1.7.2 rc.

{code}
2: [ RUN  ] MetricsTest.THREADSAFE_SnapshotTimeout
2: I0219 23:34:37.010373 23554 process.cpp:3588] Handling HTTP event for 
process 'metrics' with path: '/metrics/snapshot'
2: I0219 23:34:37.062614 23555 process.cpp:3588] Handling HTTP event for 
process 'metrics' with path: '/metrics/snapshot'
2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: Failure
{code}

> THREADSAFE_SnapshotTimeout is flaky.
> 
>
> Key: MESOS-8930
> URL: https://issues.apache.org/jira/browse/MESOS-8930
> Project: Mesos
> Issue Type: Bug
> Components: test
> Affects Versions: 1.7.2
> Environment: Ubuntu 16.04
> Reporter: Alexander Rukletsov
> Assignee: Benjamin Mahler
> Priority: Major
> Labels: flaky-test, foundations, mesosphere
>
> Observed on ASF CI, might be related to a recent test change 
> https://reviews.apache.org/r/66831/
> {noformat}
> 18:23:31 2: [ RUN  ] MetricsTest.THREADSAFE_SnapshotTimeout
> 18:23:31 2: I0516 18:23:31.747611 16246 process.cpp:3583] Handling HTTP event 
> for process 'metrics' with path: '/metrics/snapshot'
> 18:23:31 2: I0516 18:23:31.796871 16251 process.cpp:3583] Handling HTTP event 
> for process 'metrics' with path: '/metrics/snapshot'
> 18:23:46 2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: 
> Failure
> 18:23:46 2: Failed to wait 15secs for response
> 22:57:13 Build timed out (after 300 minutes). Marking the build as failed.
> {noformat}





[jira] [Commented] (MESOS-8887) Unreachable tasks are not GC'ed when unreachable agent is GC'ed.

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769813#comment-16769813
 ] 

Vinod Kone commented on MESOS-8887:
---

Landed on master:

commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone 
Date:   Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone 
Date:   Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of
the master, but they were only being removed from the former when
an unreachable agent is GCed from the registry. This patch fixes it
so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone 
Date:   Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base.
No functional changes.

Review: https://reviews.apache.org/r/69907


Backported to 1.7.x

commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9
Author: Vinod Kone 
Date:   Fri Feb 15 14:33:00 2019 -0600

Added MESOS-8887 to the 1.7.2 CHANGELOG.

commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone 
Date:   Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone 
Date:   Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of
the master, but they were only being removed from the former when
an unreachable agent is GCed from the registry. This patch fixes it
so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone 
Date:   Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base.
No functional changes.

Review: https://reviews.apache.org/r/69907


> Unreachable tasks are not GC'ed when unreachable agent is GC'ed.
> 
>
> Key: MESOS-8887
> URL: https://issues.apache.org/jira/browse/MESOS-8887
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.4.3, 1.5.2, 1.6.1, 1.7.1
> Reporter: Gilbert Song
> Assignee: Vinod Kone
> Priority: Major
> Labels: foundations, mesosphere, partition, registry
>
> Unreachable agents are GC'ed by the master registry once they exceed the 
> `--registry_max_agent_age` duration or the `--registry_max_agent_count` 
> limit. When the GC happens, the agent is removed from the master's 
> unreachable agent list, but its corresponding tasks are still in UNREACHABLE 
> state in the framework struct (though removed from 
> `slaves.unreachableTasks`). We should instead remove those tasks from 
> everywhere or transition them to a terminal state, either TASK_LOST or 
> TASK_GONE (further discussion is needed to define the semantics).
> This improvement relates to how we want to couple task updates with agent 
> GC. Right now they are somewhat decoupled.
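
The "remove those tasks from everywhere" option can be sketched as follows. The structs below are a heavily simplified, hypothetical stand-in for the master's bookkeeping; the member names and the function `gcUnreachableAgent` are assumptions for illustration and are not the code from the patches listed above.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical, heavily simplified master state.
struct Framework {
  // Task IDs of this framework that are currently unreachable.
  std::set<std::string> unreachableTasks;
};

struct Master {
  // Agent ID -> framework ID -> unreachable task IDs.
  std::map<std::string, std::map<std::string, std::set<std::string>>>
    unreachableTasksByAgent;

  std::map<std::string, Framework> frameworks;

  // Garbage-collects an unreachable agent and drops its tasks from both
  // the per-agent and the per-framework bookkeeping.
  void gcUnreachableAgent(const std::string& agentId)
  {
    auto agent = unreachableTasksByAgent.find(agentId);
    if (agent == unreachableTasksByAgent.end()) {
      return;
    }

    for (const auto& entry : agent->second) {
      auto framework = frameworks.find(entry.first);
      if (framework == frameworks.end()) {
        continue;
      }
      for (const std::string& taskId : entry.second) {
        framework->second.unreachableTasks.erase(taskId);
      }
    }

    unreachableTasksByAgent.erase(agent);
  }
};
```

Tasks of the same framework on other unreachable agents are left untouched, since only the entries recorded under the GC'ed agent are erased.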





[jira] [Commented] (MESOS-8892) MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769822#comment-16769822
 ] 

Vinod Kone commented on MESOS-8892:
---

Observed this on 1.6.x branch

{code}
[ RUN  ] MasterSlaveReconciliationTest.ReconcileDroppedOperation
I0215 21:36:18.921594  4052 cluster.cpp:172] Creating default 'local' authorizer
I0215 21:36:18.922894  4057 master.cpp:465] Master 
21d3c979-83c3-4141-9a3a-635fd550d45a (ip-172-16-10-236.ec2.internal) started on 
172.16.10.236:36326
I0215 21:36:18.922915  4057 master.cpp:468] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/exYTvt/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/exYTvt/master" --zk_session_timeout="10secs"
I0215 21:36:18.923121  4057 master.cpp:517] Master only allowing authenticated 
frameworks to register
I0215 21:36:18.923393  4057 master.cpp:523] Master only allowing authenticated 
agents to register
I0215 21:36:18.923408  4057 master.cpp:529] Master only allowing authenticated 
HTTP frameworks to register
I0215 21:36:18.923414  4057 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/exYTvt/credentials'
I0215 21:36:18.923651  4057 master.cpp:573] Using default 'crammd5' 
authenticator
I0215 21:36:18.923777  4057 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0215 21:36:18.923904  4057 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0215 21:36:18.924266  4057 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0215 21:36:18.924465  4057 master.cpp:654] Authorization enabled
I0215 21:36:18.924823  4056 hierarchical.cpp:179] Initialized hierarchical 
allocator process
I0215 21:36:18.927826  4058 whitelist_watcher.cpp:77] No whitelist given
I0215 21:36:18.928741  4054 master.cpp:2176] Elected as the leading master!
I0215 21:36:18.928759  4054 master.cpp:1711] Recovering from registrar
I0215 21:36:18.928800  4054 registrar.cpp:339] Recovering registrar
I0215 21:36:18.929002  4054 registrar.cpp:383] Successfully fetched the 
registry (0B) in 132096ns
I0215 21:36:18.929033  4054 registrar.cpp:487] Applied 1 operations in 7184ns; 
attempting to update the registry
I0215 21:36:18.929154  4058 registrar.cpp:544] Successfully updated the 
registry in 108032ns
I0215 21:36:18.929232  4058 registrar.cpp:416] Successfully recovered registrar
I0215 21:36:18.929361  4055 master.cpp:1825] Recovered 0 agents from the 
registry (176B); allowing 10mins for agents to reregister
I0215 21:36:18.929415  4055 hierarchical.cpp:217] Skipping recovery of 
hierarchical allocator: nothing to recover
W0215 21:36:18.931118  4052 process.cpp:2829] Attempted to spawn already 
running process files@172.16.10.236:36326
I0215 21:36:18.931596  4052 containerizer.cpp:300] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I0215 21:36:18.934453  4052 linux_launcher.cpp:147] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0215 21:36:18.934859  4052 provisioner.cpp:299] Using default backend 'aufs'
I0215 21:36:18.935410  4052 cluster.cpp:460] Creating default 'local' authorizer
I0215 21:36:18.936164  4060 slave.cpp:259] Mesos agent started on 
(230)@172.16.10.236:36326
W0215 21:36:18.936399  4052 process.cpp:2829] Attempted to spawn already 
running process version@172.16.10.236:36326
I0215 21:36:18.936187  4060 slave.cpp:260] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/exYTvt/GHfic5/store/appc" 
--authenticate_http_executors="true" 

[jira] [Commented] (MESOS-8892) MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769824#comment-16769824
 ] 

Vinod Kone commented on MESOS-8892:
---

[~bbannier] Can we backport this test fix to 1.6.x branch?

> MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky
> 
>
> Key: MESOS-8892
> URL: https://issues.apache.org/jira/browse/MESOS-8892
> Project: Mesos
> Issue Type: Bug
> Components: test
> Affects Versions: 1.6.0
> Reporter: Greg Mann
> Assignee: Benjamin Bannier
> Priority: Major
> Labels: mesosphere
> Fix For: 1.7.0
>
> Attachments: MasterSlaveReconciliationTest.ReconcileDroppedOperation.txt
>
>
> This was observed on a Debian 9 SSL/GRPC-enabled build. It appears that a 
> poorly-timed {{UpdateSlaveMessage}} leads to the operation reconciliation 
> occurring before the expectation for the {{ReconcileOperationsMessage}} is 
> registered:
> {code}
> I0508 00:11:09.700815 22498 master.cpp:4362] Processing ACCEPT call for 
> offers: [ f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-O0 ] on agent 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 
> (localhost) for framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (default) 
> at scheduler-b0f55e01-2f6f-42c8-8614-901036acfc31@127.0.0.1:36309
> I0508 00:11:09.700870 22498 master.cpp:3602] Authorizing principal 
> 'test-principal' to reserve resources 'cpus(allocated: 
> default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):2; 
> mem(allocated: default-role)(reservations: 
> [(DYNAMIC,default-role,test-principal)]):1024; disk(allocated: 
> default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; 
> ports(allocated: default-role)(reservations: 
> [(DYNAMIC,default-role,test-principal)]):[31000-32000]'
> I0508 00:11:09.701228 22493 master.cpp:4725] Applying RESERVE operation for 
> resources 
> [{"allocation_info":{"role":"default-role"},"name":"cpus","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":2.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"mem","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"disk","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"type":"RANGES"}]
>  from framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (default) at 
> scheduler-b0f55e01-2f6f-42c8-8614-901036acfc31@127.0.0.1:36309 to agent 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 
> (localhost)
> I0508 00:11:09.701498 22493 master.cpp:11265] Sending operation '' (uuid: 
> 81dffb62-6e75-4c6c-a97b-41c92c58d6a7) to agent 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 
> (localhost)
> I0508 00:11:09.701627 22494 slave.cpp:1564] Forwarding agent update 
> {"operations":{},"resource_version_uuid":{"value":"0HeA06ftS6m76SNoNZNPag=="},"slave_id":{"value":"f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0"},"update_oversubscribed_resources":true}
> I0508 00:11:09.701848 22494 master.cpp:7800] Received update of agent 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 
> (localhost) with total oversubscribed resources {}
> W0508 00:11:09.701905 22494 master.cpp:7974] Performing explicit 
> reconciliation with agent for known operation 
> 81dffb62-6e75-4c6c-a97b-41c92c58d6a7 since it was not present in original 
> reconciliation message from agent
> I0508 00:11:09.702085 22494 master.cpp:11015] Updating the state of operation 
> '' (uuid: 81dffb62-6e75-4c6c-a97b-41c92c58d6a7) for framework 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (latest state: OPERATION_PENDING, 
> status update state: OPERATION_DROPPED)
> I0508 00:11:09.702239 22491 hierarchical.cpp:925] Updated allocation of 
> framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- on agent 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 from cpus(allocated: default-role):2; 
> mem(allocated: default-role):1024; disk(allocated: default-role):1024; 
> ports(allocated: default-role):[31000-32000] to disk(allocated: 
> default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; 
> cpus(allocated: default-role)(reservations: 
> [(DYNAMIC,default-role,test-principal)]):2; mem(allocated: 
> default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; 
> ports(allocated: default-role)(reservations: 
> 

[jira] [Comment Edited] (MESOS-8887) Unreachable tasks are not GC'ed when unreachable agent is GC'ed.

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769813#comment-16769813
 ] 

Vinod Kone edited comment on MESOS-8887 at 2/15/19 10:38 PM:
-

Landed on master:
---
commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone 
Date:   Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone 
Date:   Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of
the master, but they were only being removed from the former when
an unreachable agent is GCed from the registry. This patch fixes it
so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone 
Date:   Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base.
No functional changes.

Review: https://reviews.apache.org/r/69907



---
Backported to 1.7.x
---
commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9
Author: Vinod Kone 
Date:   Fri Feb 15 14:33:00 2019 -0600

Added MESOS-8887 to the 1.7.2 CHANGELOG.

commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone 
Date:   Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone 
Date:   Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of
the master, but they were only being removed from the former when
an unreachable agent is GCed from the registry. This patch fixes it
so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone 
Date:   Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base.
No functional changes.

Review: https://reviews.apache.org/r/69907




[jira] [Commented] (MESOS-8750) Check failed: !slaves.registered.contains(task->slave_id)

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769727#comment-16769727
 ] 

Vinod Kone commented on MESOS-8750:
---

[~megha.sharma] [~xujyan] Why was this not backported to older versions?

> Check failed: !slaves.registered.contains(task->slave_id)
> -
>
> Key: MESOS-8750
> URL: https://issues.apache.org/jira/browse/MESOS-8750
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.6.0
> Reporter: Megha Sharma
> Assignee: Megha Sharma
> Priority: Critical
> Fix For: 1.6.0
>
>
> It appears that in certain circumstances an unreachable task doesn't get 
> cleaned up from {{framework.unreachableTasks}} when the respective agent 
> re-registers, leading to this check failure later when the framework is being 
> removed. When an agent goes unreachable, the master adds the tasks from this 
> agent to {{framework.unreachableTasks}}. When such an agent re-registers, the 
> master removes from this data structure the tasks that the agent specifies 
> during re-registration. However, there could be tasks that the agent doesn't 
> know about, e.g. if the runTask message for them got dropped, and such tasks 
> will not get removed from {{unreachableTasks}}.
> {noformat}
> F0310 13:30:58.856665 62740 master.cpp:9671] Check failed: 
> !slaves.registered.contains(task->slave_id()) Unreachable task  of 
> framework 4f57975b-05dd-4118-8674-5b29a86c6a6c-0850 was found on registered 
> agent 683c4a92-b5a0-490c-998a-6113fc86d37a-S1428
> {noformat}
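
The failure mode described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the Mesos implementation: `Framework`, `mark_unreachable`, `reregister`, and `reregister_fixed` are hypothetical names, and the "fixed" variant is only one conceivable cleanup, not necessarily what the actual patch does.

```python
# Toy model of the master's bookkeeping; all names are hypothetical.
class Framework:
    def __init__(self):
        # task ID -> agent ID for tasks whose agent went unreachable
        self.unreachable_tasks = {}


def mark_unreachable(framework, agent_id, master_known_tasks):
    """Record every task the master believes is on the agent."""
    for task_id in master_known_tasks:
        framework.unreachable_tasks[task_id] = agent_id


def reregister(framework, agent_reported_tasks):
    """Problematic cleanup: only drops tasks the agent itself reports."""
    for task_id in agent_reported_tasks:
        framework.unreachable_tasks.pop(task_id, None)


def reregister_fixed(framework, agent_id):
    """One conceivable fix: drop every unreachable task tied to the agent."""
    stale = [t for t, a in framework.unreachable_tasks.items() if a == agent_id]
    for task_id in stale:
        del framework.unreachable_tasks[task_id]


framework = Framework()
# The master launched t1 and t2, but the runTask message for t2 was dropped,
# so the agent only ever learned about t1.
mark_unreachable(framework, "S1", ["t1", "t2"])
reregister(framework, ["t1"])
leftover = dict(framework.unreachable_tasks)  # t2 is still recorded against S1
```

In this model `t2` stays in `unreachable_tasks` even though its agent is registered again, which is the state the failing check complains about.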



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9576) Provide a configuration option to disallow logrotate stdout/stderr options in task env

2019-02-14 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9576:
-

 Summary: Provide a configuration option to disallow logrotate 
stdout/stderr options in task env
 Key: MESOS-9576
 URL: https://issues.apache.org/jira/browse/MESOS-9576
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone
Assignee: Joseph Wu


See MESOS-9564 for context.

The configuration option could be a module flag for the logrotate module.





[jira] [Assigned] (MESOS-9143) MasterQuotaTest.RemoveSingleQuota is flaky.

2019-02-13 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9143:
-

Assignee: Meng Zhu
Story Points: 3

> MasterQuotaTest.RemoveSingleQuota is flaky.
> ---
>
> Key: MESOS-9143
> URL: https://issues.apache.org/jira/browse/MESOS-9143
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Assignee: Meng Zhu
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere, resource-management
> Attachments: RemoveSingleQuota-badrun.txt
>
>
> {noformat}
> ../../src/tests/master_quota_tests.cpp:493
> Value of: metrics.at(metricKey).isNone()
>   Actual: false
> Expected: true
> {noformat}





[jira] [Comment Edited] (MESOS-8950) Framework operations can make resources unallocatable

2019-02-13 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767498#comment-16767498
 ] 

Vinod Kone edited comment on MESOS-8950 at 2/13/19 7:30 PM:


Resolving as Won't Fix for now.


was (Author: vinodkone):
Resolving is Won't Fix for now.

> Framework operations can make resources unallocatable
> -
>
> Key: MESOS-8950
> URL: https://issues.apache.org/jira/browse/MESOS-8950
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Benjamin Bannier
>Priority: Minor
>
> The allocator does not offer {{cpus}} or {{mem}} resources smaller than 
> certain, fixed sizes. For framework operations, we do not enforce the same 
> minimum size constraints, which can lead to the resources becoming unavailable 
> for any future allocations. This behavior seems most pronounced when a 
> framework can register in many roles.
> Example: 
> * A single multirole framework which can register in any role, e.g., in a 
> certain role subhierarchy. 
> * Single agent with {{cpus:1.5*MIN_CPUS}} and {{mem:1.5*MIN_MEM}}.
> * Framework is offered all resources and performs a {{RESERVE}} on 
> {{cpus:0.5*MIN_CPUS}}. It then changes its role.
> * Same framework behavior in next two offer cycles. All {{cpus}} are then 
> reserved for different roles in unallocatable amounts.
> * Last offer will be just for {{mem:1.5*MIN_MEM}}, framework reserves 0.6 of 
> these to another role. This fragments the {{mem}} resources as well.
> * No allocatable resources left in cluster.
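
The arithmetic in the example can be checked with a short Python sketch. Several things here are assumptions rather than facts from the issue: the rule that a chunk is offerable when either `cpus` or `mem` meets its minimum, the role names, and the reading of "reserves 0.6 of these" as a fraction of the `1.5*MIN_MEM` offer. Amounts are expressed as multiples of `MIN_CPUS` / `MIN_MEM`, so the threshold is 1.

```python
from fractions import Fraction

# Amounts are multiples of MIN_CPUS / MIN_MEM, so 1 is the threshold.
MIN = Fraction(1)


def allocatable(cpus, mem):
    """Assumed rule: a chunk can be offered if either resource meets its minimum."""
    return cpus >= MIN or mem >= MIN


half = Fraction(1, 2)
mem_total = Fraction(3, 2)

# Per-role holdings after the sequence described above (role names are made up).
chunks = {
    "role-a": (half, Fraction(0)),
    "role-b": (half, Fraction(0)),
    "role-c": (half, Fraction(0)),
    # 0.6 of the mem offer is reserved; the remaining 0.4 stays unreserved.
    "role-d": (Fraction(0), mem_total * Fraction(6, 10)),
    "unreserved": (Fraction(0), mem_total * Fraction(4, 10)),
}

total_cpus = sum(c for c, _ in chunks.values())
total_mem = sum(m for _, m in chunks.values())
offerable = [role for role, (c, m) in chunks.items() if allocatable(c, m)]
```

The totals still add up to `1.5*MIN_CPUS` and `1.5*MIN_MEM`, but no single chunk reaches a minimum, so under the assumed rule nothing can be offered.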





[jira] [Assigned] (MESOS-8887) Improve the master registry GC on task state transitioning.

2019-02-06 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8887:
-

Assignee: Vinod Kone

> Improve the master registry GC on task state transitioning.
> ---
>
> Key: MESOS-8887
> URL: https://issues.apache.org/jira/browse/MESOS-8887
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Gilbert Song
>Assignee: Vinod Kone
>Priority: Major
>  Labels: mesosphere, partition, registry
>
> Unreachable agents will be gc-ed by the master registry after 
> `--registry_max_agent_age` duration or `--registry_max_agent_count`. When the 
> GC happens, the agent will be removed from the master's unreachable agent 
> list, but its corresponding tasks are still in UNREACHABLE state in the 
> framework struct (though removed from `slaves.unreachableTasks`). We should 
> instead remove those tasks from everywhere or transition those tasks to a 
> terminal state, either TASK_LOST or TASK_GONE (further discussion is needed 
> to define the semantics).
> This improvement relates to how we want to couple the update of tasks with 
> the GC of agents. Right now they are somewhat decoupled.
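
A coupled cleanup along the lines suggested above might look like the following Python sketch. It is a toy model: `gc_unreachable` and `apply_gc` are hypothetical names, the age/count selection is a simplified reading of the two flags, and removing the tasks outright is only one of the two options the description mentions (the other being a transition to a terminal state).

```python
# Toy model; the function and field names are hypothetical, not Mesos APIs.
def gc_unreachable(unreachable_agents, now, max_age, max_count):
    """Return the agent IDs to GC, given {agent_id: unreachable_since}.

    Agents older than max_age are dropped, then the oldest remaining ones
    are dropped until at most max_count are left.
    """
    by_age = sorted(unreachable_agents, key=unreachable_agents.get)  # oldest first
    expired = [a for a in by_age if now - unreachable_agents[a] > max_age]
    remaining = [a for a in by_age if a not in expired]
    excess = max(0, len(remaining) - max_count)
    return expired + remaining[:excess]


def apply_gc(unreachable_agents, framework_unreachable_tasks, gced):
    """Coupled cleanup: forget the agents and the tasks that ran on them."""
    for agent_id in gced:
        del unreachable_agents[agent_id]
    stale = [t for t, a in framework_unreachable_tasks.items() if a in gced]
    for task_id in stale:
        del framework_unreachable_tasks[task_id]


agents = {"S1": 0, "S2": 50, "S3": 90}        # agent -> unreachable since (seconds)
tasks = {"t1": "S1", "t2": "S2", "t3": "S3"}  # framework's unreachable tasks
gced = gc_unreachable(agents, now=100, max_age=60, max_count=1)
apply_gc(agents, tasks, gced)
```

Here `S1` is collected by age and `S2` by count, and their tasks disappear from the framework's view in the same step instead of lingering in UNREACHABLE state.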





[jira] [Commented] (MESOS-8096) Enqueueing events in MockHTTPScheduler can lead to segfaults.

2019-02-06 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761903#comment-16761903
 ] 

Vinod Kone commented on MESOS-8096:
---

Observed this with 
LauncherAndIsolationParam/PersistentVolumeDefaultExecutor.ROOT_TaskGroupsSharingViaSandboxVolumes/2

{code}
...
...
I0206 05:23:37.884572 19578 task_status_update_manager.cpp:383] Forwarding task status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to the agent
I0206 05:23:37.884624 19578 slave.cpp:5808] Forwarding the update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to master@172.16.10.36:45979
I0206 05:23:37.884678 19578 slave.cpp:5701] Task status update manager successfully handled status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.884764 19578 master.cpp:8516] Status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- from agent ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 at slave(1170)@172.16.10.36:45979 (ip-172-16-10-36.ec2.internal)
I0206 05:23:37.884784 19578 master.cpp:8573] Forwarding status update TASK_FINISHED (Status UUID: 2612f9b7-a190-4924-b40a-8193bced2dd8) for task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.884881 19578 master.cpp:11210] Updating the state of task producer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- (latest state: TASK_FINISHED, status update state: TASK_FINISHED)
I0206 05:23:37.885048 19577 hierarchical.cpp:1230] Recovered cpus(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):0.1; mem(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):32; disk(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):32 (total: cpus:1.7; mem:928; disk:928; ports:[31000-32000]; cpus(reservations: [(DYNAMIC,default-role,test-principal)]):0.3; mem(reservations: [(DYNAMIC,default-role,test-principal)]):96; disk(reservations: [(DYNAMIC,default-role,test-principal)]):95; disk(reservations: [(DYNAMIC,default-role,test-principal)])[executor:executor_volume_path]:1, allocated: disk(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)])[executor:executor_volume_path]:1; disk(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):63; mem(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):64; cpus(allocated: default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):0.2) on agent ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 from framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885195 19572 scheduler.cpp:845] Enqueuing event UPDATE received from http://172.16.10.36:45979/master/api/v1/scheduler
I0206 05:23:37.885380 19571 scheduler.cpp:248] Sending ACKNOWLEDGE call to http://172.16.10.36:45979/master/api/v1/scheduler
I0206 05:23:37.885645 19572 task_status_update_manager.cpp:328] Received task status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885682 19572 task_status_update_manager.cpp:383] Forwarding task status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to the agent
I0206 05:23:37.885735 19572 slave.cpp:5808] Forwarding the update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- to master@172.16.10.36:45979
I0206 05:23:37.885792 19572 slave.cpp:5701] Task status update manager successfully handled status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885802 19578 process.cpp:3588] Handling HTTP event for process 'master' with path: '/master/api/v1/scheduler'
I0206 05:23:37.885885 19578 master.cpp:8516] Status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de- from agent ffd3400c-13b0-4d40-b63a-f4d3efc720de-S0 at slave(1170)@172.16.10.36:45979 (ip-172-16-10-36.ec2.internal)
I0206 05:23:37.885905 19578 master.cpp:8573] Forwarding status update TASK_FINISHED (Status UUID: 2dd9e000-d74f-4d94-ad72-0b313492) for task consumer of framework ffd3400c-13b0-4d40-b63a-f4d3efc720de-
I0206 05:23:37.885991 19578 master.cpp:11210] Updating the state of task 

[jira] [Commented] (MESOS-8796) Some GroupTest.* are flaky on Mac.

2019-02-06 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761769#comment-16761769
 ] 

Vinod Kone commented on MESOS-8796:
---

Saw this again on internal CI (on Mac).
{code}
[ RUN  ] GroupTest.GroupPathWithRestrictivePerms
I0205 21:14:33.530055 296834496 zookeeper_test_server.cpp:156] Started ZooKeeperTestServer on port 50946
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2019-02-05 21:14:33,530:8369(0x736ae000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:50946 sessionTimeout=1 watcher=0x1145565d0 sessionId=0 sessionPasswd= context=0x7fb3e0c9bc90 flags=0
2019-02-05 21:14:33,530:8369(0x73fcf000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:50946]
2019-02-05 21:14:33,532:8369(0x73fcf000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:50946], sessionId=0x168c13aa8b9, negotiated timeout=1
2019-02-05 21:14:36,875:8369(0x73fcf000):ZOO_INFO@auth_completion_func@1327: Authentication scheme digest succeeded
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@774: Client environment:user.name=jenkins
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@782: Client environment:user.home=/Users/jenkins
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@log_env@794: Client environment:user.dir=/Users/jenkins/workspace/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mac/mesos/build
2019-02-05 21:14:36,878:8369(0x7341f000):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:50946 sessionTimeout=1 watcher=0x1145565d0 sessionId=0 sessionPasswd= context=0x7fb3e0a4db10 flags=0
2019-02-05 21:14:36,879:8369(0x74767000):ZOO_INFO@check_events@1764: initiated connection to server [127.0.0.1:50946]
2019-02-05 21:14:36,880:8369(0x74767000):ZOO_INFO@check_events@1811: session establishment complete on server [127.0.0.1:50946], sessionId=0x168c13aa8b90001, negotiated timeout=1
I0205 21:14:36.880167 55189504 group.cpp:341] Group process (zookeeper-group(48)@10.0.49.4:65013) connected to ZooKeeper
I0205 21:14:36.880213 55189504 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
I0205 21:14:36.880225 55189504 group.cpp:395] Authenticating with ZooKeeper using digest
2019-02-05 21:14:40,222:8369(0x74767000):ZOO_INFO@auth_completion_func@1327: Authentication scheme digest succeeded
I0205 21:14:40.24 55189504 group.cpp:419] Trying to create path '/read-only' in ZooKeeper
2019-02-05 21:14:40,223:8369(0x736ae000):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.8
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@757: Client environment:host.name=Jenkinss-Mac-mini.local
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@764: Client environment:os.name=Darwin
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@765: Client environment:os.arch=18.2.0
2019-02-05 21:14:40,224:8369(0x736ae000):ZOO_INFO@log_env@766: Client environment:os.version=Darwin Kernel Version 18.2.0: Mon Nov 12 20:24:46 PST 2018; root:xnu-4903.231.4~2/RELEASE_X86_64
2019-02-05 

[jira] [Commented] (MESOS-8266) MasterMaintenanceTest.AcceptInvalidInverseOffer is flaky.

2019-02-06 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761770#comment-16761770
 ] 

Vinod Kone commented on MESOS-8266:
---

Observed this on internal CI.

{code}
[ RUN  ] MasterMaintenanceTest.AcceptInvalidInverseOffer
I0206 05:13:46.592031 27319 cluster.cpp:174] Creating default 'local' authorizer
I0206 05:13:46.593217 27341 master.cpp:414] Master 9ee5ab9a-1898-4ba6-a7f3-0093d03b19f8 (ip-172-16-10-145.ec2.internal) started on 172.16.10.145:36957
I0206 05:13:46.593240 27341 master.cpp:417] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/cBTYhp/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/cBTYhp/master" --zk_session_timeout="10secs"
I0206 05:13:46.593377 27341 master.cpp:466] Master only allowing authenticated frameworks to register
I0206 05:13:46.593385 27341 master.cpp:472] Master only allowing authenticated agents to register
I0206 05:13:46.593391 27341 master.cpp:478] Master only allowing authenticated HTTP frameworks to register
I0206 05:13:46.593397 27341 credentials.hpp:37] Loading credentials for authentication from '/tmp/cBTYhp/credentials'
I0206 05:13:46.593485 27341 master.cpp:522] Using default 'crammd5' authenticator
I0206 05:13:46.593521 27341 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
I0206 05:13:46.593560 27341 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
I0206 05:13:46.593582 27341 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
I0206 05:13:46.593605 27341 master.cpp:603] Authorization enabled
I0206 05:13:46.594100 27340 hierarchical.cpp:176] Initialized hierarchical allocator process
I0206 05:13:46.594298 27341 whitelist_watcher.cpp:77] No whitelist given
I0206 05:13:46.594842 27344 master.cpp:2103] Elected as the leading master!
I0206 05:13:46.594856 27344 master.cpp:1638] Recovering from registrar
I0206 05:13:46.594935 27344 registrar.cpp:339] Recovering registrar
I0206 05:13:46.595073 27344 registrar.cpp:383] Successfully fetched the registry (0B) in 115968ns
I0206 05:13:46.595101 27344 registrar.cpp:487] Applied 1 operations in 6424ns; attempting to update the registry
I0206 05:13:46.595223 27344 registrar.cpp:544] Successfully updated the registry in 105984ns
I0206 05:13:46.595314 27344 registrar.cpp:416] Successfully recovered registrar
I0206 05:13:46.595392 27344 master.cpp:1752] Recovered 0 agents from the registry (176B); allowing 10mins for agents to reregister
I0206 05:13:46.595446 27344 hierarchical.cpp:216] Skipping recovery of hierarchical allocator: nothing to recover
W0206 05:13:46.595887 27319 process.cpp:2829] Attempted to spawn already running process version@172.16.10.145:36957
I0206 05:13:46.597141 27319 sched.cpp:232] Version: 1.8.0
I0206 05:13:46.597421 27345 sched.cpp:336] New master detected at master@172.16.10.145:36957
I0206 05:13:46.597458 27345 sched.cpp:401] Authenticating with master master@172.16.10.145:36957
I0206 05:13:46.597509 27345 sched.cpp:408] Using default CRAM-MD5 authenticatee
I0206 05:13:46.597611 27345 authenticatee.cpp:121] Creating new client SASL connection
I0206 05:13:46.597707 27345 master.cpp:9902] Authenticating scheduler-6e5ae29d-e284-4d9b-bbc2-2df8747428fd@172.16.10.145:36957
I0206 05:13:46.597754 27345 authenticator.cpp:414] Starting authentication session for crammd5-authenticatee(459)@172.16.10.145:36957
I0206 05:13:46.597805 27345 authenticator.cpp:98] Creating new server SASL connection

[jira] [Created] (MESOS-9552) Tasks in unreachable state are not answered during implicit reconciliation

2019-02-02 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9552:
-

 Summary: Tasks in unreachable state are not answered during 
implicit reconciliation
 Key: MESOS-9552
 URL: https://issues.apache.org/jira/browse/MESOS-9552
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone


Implicit reconciliation only answers about tasks in `pendingTasks` and `tasks` 
in the `Framework` struct. But it ignores tasks in `unreachableTasks` list.

Even during explicit reconciliation the master doesn't look at the 
`unreachableTasks` map, but it answers correctly when the agent ID is 
set, because the corresponding agent is in the unreachable list. If the master 
instead looked into the `unreachableTasks` map, it could answer irrespective of 
whether the agent ID is set.
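
The gap can be sketched in Python as follows. This is a toy model, not the master's code: the dictionary layout mirrors the field names mentioned above, while `implicit_reconcile`, the `include_unreachable` switch, and the states reported for pending and unreachable tasks are assumptions made for illustration.

```python
# Toy model of implicit reconciliation; names and states are illustrative.
def implicit_reconcile(framework, include_unreachable=False):
    """Return {task_id: state} for every task the master answers about."""
    answers = {}
    for task_id in framework["pendingTasks"]:
        answers[task_id] = "TASK_STAGING"  # assumed state for pending tasks
    for task_id, state in framework["tasks"].items():
        answers[task_id] = state
    if include_unreachable:
        # The behavior this issue asks for: also consult unreachableTasks.
        for task_id in framework["unreachableTasks"]:
            answers[task_id] = "TASK_UNREACHABLE"
    return answers


framework = {
    "pendingTasks": ["t1"],
    "tasks": {"t2": "TASK_RUNNING"},
    "unreachableTasks": ["t3"],
}
current = implicit_reconcile(framework)
proposed = implicit_reconcile(framework, include_unreachable=True)
```

With the switch off, `t3` never appears in the answer, so a framework relying on implicit reconciliation hears nothing about it.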





[jira] [Created] (MESOS-9547) Removing non-checkpointing framework on the master does not properly clean up all data structures

2019-01-31 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9547:
-

 Summary: Removing non-checkpointing framework on the master does 
not properly clean up all data structures
 Key: MESOS-9547
 URL: https://issues.apache.org/jira/browse/MESOS-9547
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone


When an agent is disconnected, non-checkpointing frameworks on it are removed 
via `removeFramework(Slave*, Framework*)`.

But it looks like this function only cleans up active tasks and executors in the 
slave struct. It doesn't clean up `pendingTasks` or `killedTasks`, for example. 

It also doesn't clean up `operations`, but I'm not sure if that's intentional. 

There are a bunch of `*Resources` variables in the struct, that probably should 
be updated?

It's also worthwhile auditing `removeFramework(Framework*)` to see if it's 
leaking any resources as well.






[jira] [Created] (MESOS-9546) Operation status is not updated in master when agent is marked as unreachable or gone

2019-01-31 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9546:
-

 Summary: Operation status is not updated in master when agent is 
marked as unreachable or gone
 Key: MESOS-9546
 URL: https://issues.apache.org/jira/browse/MESOS-9546
 Project: Mesos
  Issue Type: Bug
 Environment: In `Master::markGone` and `Master::_markUnreachable` we 
call `sendBulkOperationFeedback` which sends `OPERATION_GONE_BY_OPERATOR` and 
`OPERATION_UNREACHABLE` to the corresponding frameworks, but the operation 
states are not changed in the `Master::Framework` struct.

See also the related issue MESOS-9545 which applies to unreachable operations.
Reporter: Vinod Kone








[jira] [Created] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-01-31 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9545:
-

 Summary: Marking an unreachable agent as gone should transition 
the tasks to terminal state
 Key: MESOS-9545
 URL: https://issues.apache.org/jira/browse/MESOS-9545
 Project: Mesos
  Issue Type: Improvement
Reporter: Vinod Kone


If an unreachable agent is marked as gone, currently master just marks that 
agent in the registry but doesn't do anything about its tasks. So the tasks are 
in UNREACHABLE state in the master forever, until the master fails over. This 
is not great UX. We should transition these to terminal state instead.
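
A minimal Python sketch of the proposed behavior follows. It is a toy model under stated assumptions: `mark_gone` and the dictionary layout are hypothetical, and `TASK_GONE_BY_OPERATOR` is used only as a plausible terminal state, since the issue does not name one.

```python
# Toy model; data layout and function name are hypothetical.
TERMINAL_STATE = "TASK_GONE_BY_OPERATOR"  # assumed choice of terminal state


def mark_gone(master, agent_id):
    """Mark an unreachable agent gone and terminate its tasks.

    Returns the (task_id, state) updates that would be sent to frameworks.
    """
    master["unreachable_agents"].discard(agent_id)
    master["gone_agents"].add(agent_id)
    updates = []
    for task_id, task in master["tasks"].items():
        if task["agent"] == agent_id and task["state"] == "TASK_UNREACHABLE":
            task["state"] = TERMINAL_STATE
            updates.append((task_id, TERMINAL_STATE))
    return updates


master = {
    "unreachable_agents": {"S1"},
    "gone_agents": set(),
    "tasks": {
        "t1": {"agent": "S1", "state": "TASK_UNREACHABLE"},
        "t2": {"agent": "S2", "state": "TASK_RUNNING"},
    },
}
updates = mark_gone(master, "S1")
```

Tasks on other agents are left alone; only the gone agent's unreachable tasks change state and generate updates.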





[jira] [Assigned] (MESOS-5916) Improve health checking.

2019-01-03 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-5916:
-

   Resolution: Fixed
 Assignee: Alexander Rukletsov
Fix Version/s: 1.2.0

Moved unresolved issues to MESOS-7353.

> Improve health checking.
> 
>
> Key: MESOS-5916
> URL: https://issues.apache.org/jira/browse/MESOS-5916
> Project: Mesos
>  Issue Type: Epic
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: health-check, mesosphere
> Fix For: 1.2.0
>
>
> This epic aims to provide comprehensive health check support in Mesos 
> (command, HTTP, TCP) and a unified API.





[jira] [Created] (MESOS-9509) Benchmark command health checks in default executor

2019-01-03 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9509:
-

 Summary: Benchmark command health checks in default executor
 Key: MESOS-9509
 URL: https://issues.apache.org/jira/browse/MESOS-9509
 Project: Mesos
  Issue Type: Task
  Components: executor
Reporter: Vinod Kone


TCP/HTTP health checks were extensively scale tested as part of 
https://mesosphere.com/blog/introducing-mesos-native-health-checks-apache-mesos-part-2/.
 

We should do the same for command checks by the default executor, because it 
uses a very different mechanism (the agent fork/execs the check command as a 
nested container) and will have very different scalability characteristics.





[jira] [Commented] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.

2019-01-02 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732473#comment-16732473
 ] 

Vinod Kone commented on MESOS-7622:
---

[~kaysoky] Did you fix this recently?

> Agent can crash if a HTTP executor tries to retry subscription in running 
> state.
> 
>
> Key: MESOS-7622
> URL: https://issues.apache.org/jira/browse/MESOS-7622
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Affects Versions: 1.2.2
>Reporter: Aaron Wood
>Priority: Critical
>
> It is possible that a running executor might retry its subscribe request. 
> This can lead to a crash if it previously had any launched tasks. Note that 
> the executor would still be able to subscribe again when the agent process 
> restarts and is recovering.
> {code}
> sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
> --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
>  --image_providers=docker --image_provisioner_backend=overlay 
> --containerizers=mesos --launcher_dir=$(pwd) 
> --executor_environment_variables='{"LD_LIBRARY_PATH": 
> "/home/aaron/Code/src/mesos/build/src/.libs"}'
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
> aaron
> I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
> I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
> I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
> I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
> `mesos_executors.slice`
> I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
> I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
> cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
> I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
> 'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
> execute 'hadoop version 2>&1'; the command was either not found or exited 
> with a non-zero exit status: 127
> I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 
> 'overlay'
> I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
> (1)@127.0.1.1:5051
> I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://" 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
>  --executor_registration_timeout="1mins" 
> --executor_reregistration_timeout="2secs" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --hostname_lookup="true" 
> --http_command_executor="false" --http_heartbeat_interval="30secs" 
> --image_providers="docker" --image_provisioner_backend="overlay" 
> --initialize_driver_logging="true" 
> --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
>  --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" 
> --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" 
> --max_completed_executors_per_framework="150" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="1secs" 

[jira] [Commented] (MESOS-9495) Test `MasterTest.CreateVolumesV1AuthorizationFailure` is flaky.

2019-01-02 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732456#comment-16732456
 ] 

Vinod Kone commented on MESOS-9495:
---

I'm seeing this quite frequently in ASF CI. Looks like this test was written as 
part of reservation refinement. [~bmahler] can you get this into resource mgmt 
backlog?

> Test `MasterTest.CreateVolumesV1AuthorizationFailure` is flaky.
> ---
>
> Key: MESOS-9495
> URL: https://issues.apache.org/jira/browse/MESOS-9495
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.7.0
>Reporter: Chun-Hung Hsiao
>Priority: Major
>  Labels: allocator, flaky-test
> Attachments: 
> mesos-ec2-centos-7-CMake.Mesos.MasterTest.CreateVolumesV1AuthorizationFailure-badrun.txt
>
>
> {noformat}
> I1219 22:45:59.578233 26107 slave.cpp:1884] Will retry registration in 
> 2.10132ms if necessary
> I1219 22:45:59.578615 26107 master.cpp:6125] Received register agent message 
> from slave(463)@172.16.10.13:35739 (ip-172-16-10-13.ec2.internal)
> I1219 22:45:59.578830 26107 master.cpp:3871] Authorizing agent with principal 
> 'test-principal'
> I1219 22:45:59.578975 26107 master.cpp:6183] Authorized registration of agent 
> at slave(463)@172.16.10.13:35739 (ip-172-16-10-13.ec2.internal)
> I1219 22:45:59.579039 26107 master.cpp:6294] Registering agent at 
> slave(463)@172.16.10.13:35739 (ip-172-16-10-13.ec2.internal) with id 
> 85292fcc-b698-4377-9faa-f76b0ccd4ee5-S0
> I1219 22:45:59.579540 26107 registrar.cpp:495] Applied 1 operations in 
> 143852ns; attempting to update the registry
> I1219 22:45:59.580102 26109 registrar.cpp:552] Successfully updated the 
> registry in 510208ns
> I1219 22:45:59.580312 26109 master.cpp:6342] Admitted agent 
> 85292fcc-b698-4377-9faa-f76b0ccd4ee5-S0 at slave(463)@172.16.10.13:35739 
> (ip-172-16-10-13.ec2.internal)
> I1219 22:45:59.580968 26111 slave.cpp:1884] Will retry registration in 
> 23.973874ms if necessary
> I1219 22:45:59.581447 26111 slave.cpp:1486] Registered with master 
> master@172.16.10.13:35739; given agent ID 
> 85292fcc-b698-4377-9faa-f76b0ccd4ee5-S0
> ...
> I1219 22:45:59.580950 26109 master.cpp:6391] Registered agent 
> 85292fcc-b698-4377-9faa-f76b0ccd4ee5-S0 at slave(463)@172.16.10.13:35739 
> (ip-172-16-10-13.ec2.internal) with disk(reservations: 
> [(STATIC,role1)]):1024; cpus:2; mem:6796; ports:[31000-32000]
> I1219 22:45:59.583326 26109 master.cpp:6125] Received register agent message 
> from slave(463)@172.16.10.13:35739 (ip-172-16-10-13.ec2.internal)
> I1219 22:45:59.583524 26109 master.cpp:3871] Authorizing agent with principal 
> 'test-principal'
> ...
> W1219 22:45:59.584242 26109 master.cpp:6175] Refusing registration of agent 
> at slave(463)@172.16.10.13:35739 (ip-172-16-10-13.ec2.internal): 
> Authorization failure: Authorizer failure
> ...
> I1219 22:45:59.586944 26113 http.cpp:1185] HTTP POST for /master/api/v1 from 
> 172.16.10.13:47412
> I1219 22:45:59.587129 26113 http.cpp:682] Processing call CREATE_VOLUMES
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/master_tests.cpp:9386:
>  Failure
> Mock function called more times than expected - returning default value.
> Function call: authorized(@0x7f5066524720 48-byte object  50-7F 00-00 00-00 00-00 00-00 00-00 07-00 00-00 00-00 00-00 10-4E 02-48 50-7F 
> 00-00 E0-4C 02-48 50-7F 00-00 06-00 00-00 50-7F 00-00>)
>   Returns: Abandoned
>  Expected: to be called once
>Actual: called twice - over-saturated and active
> I1219 22:45:59.587761 26113 master.cpp:3811] Authorizing principal 
> 'test-principal' to create volumes 
> '[{"disk":{"persistence":{"id":"id1","principal":"test-principal"},"volume":{"container_path":"path1","mode":"RW"}},"name":"disk","reservations":[{"role":"role1","type":"STATIC"}],"scalar":{"value":64.0},"type":"SCALAR"}]'
> ...
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/master_tests.cpp:9398:
>  Failure
> Failed to wait 15secs for response{noformat}
> This is because we authorize the retried registration before dropping it.
> Full log: 
> [^mesos-ec2-centos-7-CMake.Mesos.MasterTest.CreateVolumesV1AuthorizationFailure-badrun.txt]
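
The ordering problem described above (the retried registration is authorized 
before it is dropped) can be sketched with a toy model. This is not Mesos 
code: `ToyMaster`, `register` and the `drop_duplicates_first` switch are 
hypothetical names used only for illustration, and whether the eventual fix 
reorders the checks this way is an assumption.

```python
class ToyMaster:
    """Toy model (not Mesos code): counts authorizer calls per registration."""

    def __init__(self, drop_duplicates_first):
        # Hypothetical switch: check for an already registered agent
        # before or after calling the authorizer.
        self.drop_duplicates_first = drop_duplicates_first
        self.registered = set()
        self.authorizer_calls = 0

    def _authorize(self, agent):
        # Stands in for the mocked authorizer in the test; each call
        # consumes one expectation on the mock.
        self.authorizer_calls += 1
        return True

    def register(self, agent):
        if self.drop_duplicates_first and agent in self.registered:
            return "dropped"
        if not self._authorize(agent):
            return "refused"
        if agent in self.registered:
            # The retry is dropped only after it was already authorized.
            return "dropped"
        self.registered.add(agent)
        return "registered"
```

With `drop_duplicates_first=False`, a retried registration costs a second 
authorizer call, which is the kind of extra call that can over-saturate a mock 
expectation set up for a single call.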



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9157) cannot pull docker image from dockerhub

2019-01-02 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732282#comment-16732282
 ] 

Vinod Kone commented on MESOS-9157:
---

Is this still an issue? Can we close this as can't repro?

> cannot pull docker image from dockerhub
> ---
>
> Key: MESOS-9157
> URL: https://issues.apache.org/jira/browse/MESOS-9157
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.6.1
>Reporter: Michael Bowie
>Priority: Blocker
>  Labels: containerization
>
> I am not able to pull docker images from docker hub through marathon/mesos. 
> I get one of two errors:
>  * `Aug 15 10:11:02 michael-b-dcos-agent-1 dockerd[5974]: 
> time="2018-08-15T10:11:02.770309104-04:00" level=error msg="Not continuing 
> with pull after error: context canceled"`
>  * `Failed to run docker -H ... Error: No such object: 
> mesos-d2f333a8-fef2-48fb-8b99-28c52c327790`
> However, I can manually ssh into one of the agents and successfully pull the 
> image from the command line. 
> Any pointers in the right direction?
> Thank you!
> Similar Issues:
> https://github.com/mesosphere/marathon/issues/3869





[jira] [Commented] (MESOS-8470) CHECK failure in DRFSorter due to invalid framework id.

2019-01-02 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732156#comment-16732156
 ] 

Vinod Kone commented on MESOS-8470:
---

[~bbannier] Sounds good.

Can you please set the fix version above and also paste the commit messages as 
a comment?

> CHECK failure in DRFSorter due to invalid framework id.
> ---
>
> Key: MESOS-8470
> URL: https://issues.apache.org/jira/browse/MESOS-8470
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Chun-Hung Hsiao
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: allocator, mesosphere, techdebt
>
> A framework registering with a custom {{FrameworkID}} containing slashes such 
> as {{/foo/bar}} will trigger a CHECK failure at 
> https://github.com/apache/mesos/blob/177a2221496a2caa5ad25e71c9982ca3eed02fd4/src/master/allocator/sorter/drf/sorter.cpp#L167:
> {noformat}
> master.cpp:6618] Updating info for framework /foo/bar 
> sorter.cpp:167] Check failed: clientPath == current->clientPath() (/foo/bar 
> vs. foo/bar)
> {noformat}
> The sorter should be defensive with any {{FrameworkID}} containing slashes.
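
One way to be defensive is to reject such IDs before they reach the sorter. 
The sketch below is written in Python purely for illustration (Mesos itself is 
C++); `validate_framework_id` is a hypothetical helper, not an existing Mesos 
function, and rejecting the ID outright is an assumed policy rather than the 
fix that was committed.

```python
def validate_framework_id(framework_id):
    """Return an error string for an unusable ID, or None if it is acceptable."""
    if not framework_id:
        return "FrameworkID must not be empty"
    # A '/' would be read as a path separator, as in the
    # "/foo/bar vs. foo/bar" mismatch shown in the CHECK failure.
    if "/" in framework_id:
        return "FrameworkID must not contain '/': %r" % framework_id
    return None
```

A caller would refuse the registration when the helper returns an error, 
instead of letting the ID reach a CHECK.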





[jira] [Commented] (MESOS-8470) CHECK failure in DRFSorter due to invalid framework id.

2019-01-02 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732151#comment-16732151
 ] 

Vinod Kone commented on MESOS-8470:
---

[~bbannier] ^^.

Also, any plans to backport this?

> CHECK failure in DRFSorter due to invalid framework id.
> ---
>
> Key: MESOS-8470
> URL: https://issues.apache.org/jira/browse/MESOS-8470
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Chun-Hung Hsiao
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: allocator, mesosphere, techdebt
>
> A framework registering with a custom {{FrameworkID}} containing slashes such 
> as {{/foo/bar}} will trigger a CHECK failure at 
> https://github.com/apache/mesos/blob/177a2221496a2caa5ad25e71c9982ca3eed02fd4/src/master/allocator/sorter/drf/sorter.cpp#L167:
> {noformat}
> master.cpp:6618] Updating info for framework /foo/bar 
> sorter.cpp:167] Check failed: clientPath == current->clientPath() (/foo/bar 
> vs. foo/bar)
> {noformat}
> The sorter should be defensive with any {{FrameworkID}} containing slashes.





[jira] [Commented] (MESOS-9459) Reviewbot is not verifying reviews that need verification

2018-12-31 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731363#comment-16731363
 ] 

Vinod Kone commented on MESOS-9459:
---

I see this error in CI.

{noformat}
=
Error response from daemon: conflict: unable to delete e895c0531b9a (cannot be 
forced) - image is being used by running container cf8595802408
git rev-parse HEAD
git clean -fd
git reset --hard 1e8ebcb8cf1710052c1ae14e342c1277616fa13d

Traceback (most recent call last):
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 341, in <module>
    main()
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 329, in main
    review_requests = api(review_requests_url)
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 119, in api
    return json.loads(urllib.request.urlopen(url, data=data).read())
  File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
Build step 'Execute shell' marked build as failure
Sending e-mails to: bui...@mesos.apache.org

{noformat}
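
The TypeError above is what `json.loads` raises on Python 3.5 when it is 
handed the bytes returned by `urlopen(...).read()`; `json.loads` only accepts 
bytes from Python 3.6 onwards. A minimal sketch of the usual workaround is to 
decode before parsing. `parse_json_response` is a hypothetical helper, not the 
change that was made to verify-reviews.py, and UTF-8 is an assumed encoding.

```python
import json


def parse_json_response(raw):
    """Parse a JSON HTTP response body that may arrive as bytes."""
    # urlopen(...).read() returns bytes; json.loads on Python 3.5 needs str.
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")  # assumed encoding
    return json.loads(raw)
```

Decoding explicitly keeps the behaviour the same on Python 3.5 and on later 
versions.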

> Reviewbot is not verifying reviews that need verification
> -
>
> Key: MESOS-9459
> URL: https://issues.apache.org/jira/browse/MESOS-9459
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.8.0
>Reporter: Vinod Kone
>Assignee: Armand Grillet
>Priority: Major
>  Labels: ci, integration
> Fix For: 1.8.0
>
>
> For example this run of ReviewBot 
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23594/console
>  says that there are no reviews to be verified, which is false because if we 
> look at ReviewBoard there are a bunch of reviews that have not been commented 
> on by ReviewBot since a new diff has been posted.
> {noformat}
> 12-05-18_23:41:54 - Running 
> /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py
> 0 review requests need verification
> {noformat}
> I see the logic of the verify-reviews.py script was changed as part of 
> the python3 transition here: https://reviews.apache.org/r/68619/diff/1#27 
> which likely caused the bug. 
> As an aside, it's unfortunate that the python3 update was bundled with logic 
> changes in this review. cc [~andschwa]





[jira] [Created] (MESOS-9459) Reviewbot is not verifying reviews that need verification

2018-12-06 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9459:
-

 Summary: Reviewbot is not verifying reviews that need verification
 Key: MESOS-9459
 URL: https://issues.apache.org/jira/browse/MESOS-9459
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
Assignee: Armand Grillet


For example this run of ReviewBot 
https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23594/console 
says that there are no reviews to be verified, which is false because if we 
look at ReviewBoard there are a bunch of reviews that have not been commented 
on by ReviewBot since a new diff has been posted.

{noformat}
12-05-18_23:41:54 - Running 
/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py
0 review requests need verification
{noformat}

I see the logic of the verify-reviews.py script was changed as part of the 
python3 transition here: https://reviews.apache.org/r/68619/diff/1#27 which 
likely caused the bug. 

As an aside, it's unfortunate that the python3 update was bundled with logic 
changes in this review. cc [~andschwa]






[jira] [Commented] (MESOS-9083) Test ReservationEndpointsTest.ReserveAndUnreserveNoAuthentication is flaky.

2018-12-05 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710575#comment-16710575
 ] 

Vinod Kone commented on MESOS-9083:
---

Still happening on ASF CI.

{code}
[ RUN  ] ReservationEndpointsTest.ReserveAndUnreserveNoAuthentication
I1205 16:30:33.806411 22505 cluster.cpp:173] Creating default 'local' authorizer
I1205 16:30:33.809387 22511 master.cpp:413] Master 
80f814ea-0afc-4cec-8891-dfe913ca3075 (9b6ccb5930cd) started on 172.17.0.3:36088
I1205 16:30:33.809422 22511 master.cpp:416] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1000secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="false" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="false" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/7ITn89/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --roles="role" 
--root_submissions="true" --version="false" 
--webui_dir="/tmp/SRC/build/mesos-1.8.0/_inst/share/mesos/webui" 
--work_dir="/tmp/7ITn89/master" --zk_session_timeout="10secs"
I1205 16:30:33.809890 22511 master.cpp:467] Master allowing unauthenticated 
frameworks to register
I1205 16:30:33.809912 22511 master.cpp:471] Master only allowing authenticated 
agents to register
I1205 16:30:33.809926 22511 master.cpp:477] Master only allowing authenticated 
HTTP frameworks to register
I1205 16:30:33.809937 22511 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/7ITn89/credentials'
I1205 16:30:33.810329 22511 master.cpp:521] Using default 'crammd5' 
authenticator
I1205 16:30:33.810554 22511 http.cpp:1042] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1205 16:30:33.810809 22511 http.cpp:1042] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1205 16:30:33.810992 22511 master.cpp:602] Authorization enabled
W1205 16:30:33.811025 22511 master.cpp:665] The '--roles' flag is deprecated. 
This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
more information
I1205 16:30:33.811547 22510 whitelist_watcher.cpp:77] No whitelist given
I1205 16:30:33.811564 22508 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I1205 16:30:33.814721 22509 master.cpp:2105] Elected as the leading master!
I1205 16:30:33.814755 22509 master.cpp:1660] Recovering from registrar
I1205 16:30:33.814954 22514 registrar.cpp:339] Recovering registrar
I1205 16:30:33.815670 22514 registrar.cpp:383] Successfully fetched the 
registry (0B) in 669952ns
I1205 16:30:33.815798 22514 registrar.cpp:487] Applied 1 operations in 39331ns; 
attempting to update the registry
I1205 16:30:33.816577 22508 registrar.cpp:544] Successfully updated the 
registry in 710912ns
I1205 16:30:33.816747 22508 registrar.cpp:416] Successfully recovered registrar
I1205 16:30:33.817325 22521 master.cpp:1774] Recovered 0 agents from the 
registry (135B); allowing 10mins for agents to reregister
I1205 16:30:33.817361 22517 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
W1205 16:30:33.823312 22505 process.cpp:2829] Attempted to spawn already 
running process files@172.17.0.3:36088
I1205 16:30:33.824642 22505 containerizer.cpp:305] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
W1205 16:30:33.825306 22505 backend.cpp:76] Failed to create 'aufs' backend: 
AufsBackend requires root privileges
W1205 16:30:33.825335 22505 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root privileges
I1205 16:30:33.825368 22505 provisioner.cpp:298] Using default backend 'copy'
I1205 16:30:33.827760 22505 cluster.cpp:485] Creating default 'local' authorizer
I1205 16:30:33.829742 22510 slave.cpp:267] Mesos agent started on 
(444)@172.17.0.3:36088
I1205 16:30:33.829778 22510 slave.cpp:268] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 

[jira] [Created] (MESOS-9458) PersistentVolumeEndpointsTest.StaticReservation is flaky

2018-12-05 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9458:
-

 Summary: PersistentVolumeEndpointsTest.StaticReservation is flaky
 Key: MESOS-9458
 URL: https://issues.apache.org/jira/browse/MESOS-9458
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Vinod Kone


Observed this in ASF CI 
https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Buildbot-Test/310/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1%20MESOS_TEST_AWAIT_TIMEOUT=60secs,OS=ubuntu:16.04,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)&&(!ubuntu-4)&&(!H21)&&(!H23)&&(!H26)&&(!H27)/consoleText

{noformat}
[ RUN  ] PersistentVolumeEndpointsTest.StaticReservation
I1205 11:34:05.896515 22538 cluster.cpp:173] Creating default 'local' authorizer
I1205 11:34:05.898870 22542 master.cpp:413] Master 
3f2d828b-bff8-461a-98cf-de9163b36657 (488de0351206) started on 172.17.0.2:40803
I1205 11:34:05.898895 22542 master.cpp:416] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1000secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/qOMyLF/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --roles="role1" 
--root_submissions="true" --version="false" 
--webui_dir="/tmp/SRC/build/mesos-1.8.0/_inst/share/mesos/webui" 
--work_dir="/tmp/qOMyLF/master" --zk_session_timeout="10secs"
I1205 11:34:05.899194 22542 master.cpp:465] Master only allowing authenticated 
frameworks to register
I1205 11:34:05.899205 22542 master.cpp:471] Master only allowing authenticated 
agents to register
I1205 11:34:05.899212 22542 master.cpp:477] Master only allowing authenticated 
HTTP frameworks to register
I1205 11:34:05.899219 22542 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/qOMyLF/credentials'
I1205 11:34:05.899503 22542 master.cpp:521] Using default 'crammd5' 
authenticator
I1205 11:34:05.899674 22542 http.cpp:1042] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1205 11:34:05.899879 22542 http.cpp:1042] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1205 11:34:05.900029 22542 http.cpp:1042] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1205 11:34:05.900211 22542 master.cpp:602] Authorization enabled
W1205 11:34:05.900238 22542 master.cpp:665] The '--roles' flag is deprecated. 
This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
more information
I1205 11:34:05.900684 22539 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I1205 11:34:05.900707 22545 whitelist_watcher.cpp:77] No whitelist given
I1205 11:34:05.903553 22540 master.cpp:2105] Elected as the leading master!
I1205 11:34:05.903587 22540 master.cpp:1660] Recovering from registrar
I1205 11:34:05.903753 22551 registrar.cpp:339] Recovering registrar
I1205 11:34:05.904373 22551 registrar.cpp:383] Successfully fetched the 
registry (0B) in 574976ns
I1205 11:34:05.904498 22551 registrar.cpp:487] Applied 1 operations in 34823ns; 
attempting to update the registry
I1205 11:34:05.905134 22551 registrar.cpp:544] Successfully updated the 
registry in 566016ns
I1205 11:34:05.905258 22551 registrar.cpp:416] Successfully recovered registrar
I1205 11:34:05.905829 22539 master.cpp:1774] Recovered 0 agents from the 
registry (135B); allowing 10mins for agents to reregister
I1205 11:34:05.905889 22540 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
W1205 11:34:05.918561 22538 process.cpp:2829] Attempted to spawn already 
running process files@172.17.0.2:40803
I1205 11:34:05.919775 22538 containerizer.cpp:305] Using isolation { 
environment_secret, 

[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-03 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707765#comment-16707765
 ] 

Vinod Kone edited comment on MESOS-7971 at 12/3/18 8:50 PM:


Saw this again.

{noformat}
06:14:51 [ RUN  ] 
PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
06:14:51 I1203 06:14:50.630549 19784 cluster.cpp:173] Creating default 'local' 
authorizer
06:14:51 I1203 06:14:50.633529 19796 master.cpp:413] Master 
f1ffe054-ad44-45d4-9f39-84b048e1a359 (c16130e94783) started on 172.17.0.3:44340
06:14:51 I1203 06:14:50.633581 19796 master.cpp:416] Flags at startup: 
--acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1000secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/4vMyjy/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --roles="role1" 
--root_submissions="true" --version="false" 
--webui_dir="/tmp/SRC/build/mesos-1.8.0/_inst/share/mesos/webui" 
--work_dir="/tmp/4vMyjy/master" --zk_session_timeout="10secs"
06:14:51 I1203 06:14:50.634217 19796 master.cpp:465] Master only allowing 
authenticated frameworks to register
06:14:51 I1203 06:14:50.634236 19796 master.cpp:471] Master only allowing 
authenticated agents to register
06:14:51 I1203 06:14:50.634253 19796 master.cpp:477] Master only allowing 
authenticated HTTP frameworks to register
06:14:51 I1203 06:14:50.634270 19796 credentials.hpp:37] Loading credentials 
for authentication from '/tmp/4vMyjy/credentials'
06:14:51 I1203 06:14:50.634608 19796 master.cpp:521] Using default 'crammd5' 
authenticator
06:14:51 I1203 06:14:50.634840 19796 http.cpp:1042] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readonly'
06:14:51 I1203 06:14:50.635052 19796 http.cpp:1042] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readwrite'
06:14:51 I1203 06:14:50.635200 19796 http.cpp:1042] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-scheduler'
06:14:51 I1203 06:14:50.635373 19796 master.cpp:602] Authorization enabled
06:14:51 W1203 06:14:50.635457 19796 master.cpp:665] The '--roles' flag is 
deprecated. This flag will be removed in the future. See the Mesos 0.27 upgrade 
notes for more information
06:14:51 I1203 06:14:50.635991 19800 whitelist_watcher.cpp:77] No whitelist 
given
06:14:51 I1203 06:14:50.636032 19793 hierarchical.cpp:175] Initialized 
hierarchical allocator process
06:14:51 I1203 06:14:50.638939 19796 master.cpp:2105] Elected as the leading 
master!
06:14:51 I1203 06:14:50.638975 19796 master.cpp:1660] Recovering from registrar
06:14:51 I1203 06:14:50.639200 19792 registrar.cpp:339] Recovering registrar
06:14:51 I1203 06:14:50.639927 19792 registrar.cpp:383] Successfully fetched 
the registry (0B) in 672768ns
06:14:51 I1203 06:14:50.640069 19792 registrar.cpp:487] Applied 1 operations in 
48006ns; attempting to update the registry
06:14:51 I1203 06:14:50.640718 19792 registrar.cpp:544] Successfully updated 
the registry in 582912ns
06:14:51 I1203 06:14:50.640852 19792 registrar.cpp:416] Successfully recovered 
registrar
06:14:51 I1203 06:14:50.641299 19800 master.cpp:1774] Recovered 0 agents from 
the registry (135B); allowing 10mins for agents to reregister
06:14:51 I1203 06:14:50.641340 19799 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
06:14:51 W1203 06:14:50.647153 19784 process.cpp:2829] Attempted to spawn 
already running process files@172.17.0.3:44340
06:14:51 I1203 06:14:50.648453 19784 containerizer.cpp:305] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
06:14:51 W1203 06:14:50.649060 19784 backend.cpp:76] Failed to create 'aufs' 
backend: AufsBackend requires root privileges
06:14:51 W1203 06:14:50.649088 19784 backend.cpp:76] Failed 

[jira] [Commented] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-03 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707765#comment-16707765
 ] 

Vinod Kone commented on MESOS-7971:
---

Saw this again.

{code}
*06:14:51* [ RUN  ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
*06:14:51* I1203 06:14:50.630549 19784 cluster.cpp:173] Creating default 
'local' authorizer
*06:14:51* I1203 06:14:50.633529 19796 master.cpp:413] Master 
f1ffe054-ad44-45d4-9f39-84b048e1a359 (c16130e94783) started on 172.17.0.3:44340
*06:14:51* I1203 06:14:50.633581 19796 master.cpp:416] Flags at startup: 
--acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1000secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/4vMyjy/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --roles="role1" 
--root_submissions="true" --version="false" 
--webui_dir="/tmp/SRC/build/mesos-1.8.0/_inst/share/mesos/webui" 
--work_dir="/tmp/4vMyjy/master" --zk_session_timeout="10secs"
*06:14:51* I1203 06:14:50.634217 19796 master.cpp:465] Master only allowing 
authenticated frameworks to register
*06:14:51* I1203 06:14:50.634236 19796 master.cpp:471] Master only allowing 
authenticated agents to register
*06:14:51* I1203 06:14:50.634253 19796 master.cpp:477] Master only allowing 
authenticated HTTP frameworks to register
*06:14:51* I1203 06:14:50.634270 19796 credentials.hpp:37] Loading credentials 
for authentication from '/tmp/4vMyjy/credentials'
*06:14:51* I1203 06:14:50.634608 19796 master.cpp:521] Using default 'crammd5' 
authenticator
*06:14:51* I1203 06:14:50.634840 19796 http.cpp:1042] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readonly'
*06:14:51* I1203 06:14:50.635052 19796 http.cpp:1042] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readwrite'
*06:14:51* I1203 06:14:50.635200 19796 http.cpp:1042] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-scheduler'
*06:14:51* I1203 06:14:50.635373 19796 master.cpp:602] Authorization enabled
*06:14:51* W1203 06:14:50.635457 19796 master.cpp:665] The '--roles' flag is 
deprecated. This flag will be removed in the future. See the Mesos 0.27 upgrade 
notes for more information
*06:14:51* I1203 06:14:50.635991 19800 whitelist_watcher.cpp:77] No whitelist 
given
*06:14:51* I1203 06:14:50.636032 19793 hierarchical.cpp:175] Initialized 
hierarchical allocator process
*06:14:51* I1203 06:14:50.638939 19796 master.cpp:2105] Elected as the leading 
master!
*06:14:51* I1203 06:14:50.638975 19796 master.cpp:1660] Recovering from 
registrar
*06:14:51* I1203 06:14:50.639200 19792 registrar.cpp:339] Recovering registrar
*06:14:51* I1203 06:14:50.639927 19792 registrar.cpp:383] Successfully fetched 
the registry (0B) in 672768ns
*06:14:51* I1203 06:14:50.640069 19792 registrar.cpp:487] Applied 1 operations 
in 48006ns; attempting to update the registry
*06:14:51* I1203 06:14:50.640718 19792 registrar.cpp:544] Successfully updated 
the registry in 582912ns
*06:14:51* I1203 06:14:50.640852 19792 registrar.cpp:416] Successfully 
recovered registrar
*06:14:51* I1203 06:14:50.641299 19800 master.cpp:1774] Recovered 0 agents from 
the registry (135B); allowing 10mins for agents to reregister
*06:14:51* I1203 06:14:50.641340 19799 hierarchical.cpp:215] Skipping recovery 
of hierarchical allocator: nothing to recover
*06:14:51* W1203 06:14:50.647153 19784 process.cpp:2829] Attempted to spawn 
already running process files@172.17.0.3:44340
*06:14:51* I1203 06:14:50.648453 19784 containerizer.cpp:305] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
*06:14:51* W1203 06:14:50.649060 19784 backend.cpp:76] Failed to create 'aufs' 
backend: AufsBackend requires root privileges
*06:14:51* W1203 06:14:50.649088 19784 backend.cpp:76] Failed to 

[jira] [Commented] (MESOS-8983) SlaveRecoveryTest/0.PingTimeoutDuringRecovery flaky

2018-12-03 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707749#comment-16707749
 ] 

Vinod Kone commented on MESOS-8983:
---

This is happening on ASF CI.

{code}
*15:49:24* 3: [ RUN  ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery
*15:49:24* 3: I1203 15:49:24.425719 24686 cluster.cpp:173] Creating default 'local' authorizer
*15:49:24* 3: I1203 15:49:24.430784 24687 master.cpp:413] Master 620b2018-c90f-4b11-bbe3-8fa1c90f204d (5a45e7f918b2) started on 172.17.0.3:42912
*15:49:24* 3: I1203 15:49:24.430824 24687 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="1secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/PNxXC7/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="2" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/PNxXC7/master" --zk_session_timeout="10secs"
*15:49:24* 3: I1203 15:49:24.431120 24687 master.cpp:465] Master only allowing authenticated frameworks to register
*15:49:24* 3: I1203 15:49:24.431131 24687 master.cpp:471] Master only allowing authenticated agents to register
*15:49:24* 3: I1203 15:49:24.431139 24687 master.cpp:477] Master only allowing authenticated HTTP frameworks to register
*15:49:24* 3: I1203 15:49:24.431149 24687 credentials.hpp:37] Loading credentials for authentication from '/tmp/PNxXC7/credentials'
*15:49:24* 3: I1203 15:49:24.431355 24687 master.cpp:521] Using default 'crammd5' authenticator
*15:49:24* 3: I1203 15:49:24.431514 24687 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
*15:49:24* 3: I1203 15:49:24.431659 24687 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
*15:49:24* 3: I1203 15:49:24.431778 24687 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
*15:49:24* 3: I1203 15:49:24.431896 24687 master.cpp:602] Authorization enabled
*15:49:24* 3: I1203 15:49:24.432276 24688 hierarchical.cpp:175] Initialized hierarchical allocator process
*15:49:24* 3: I1203 15:49:24.432498 24688 whitelist_watcher.cpp:77] No whitelist given
*15:49:24* 3: I1203 15:49:24.444337 24690 master.cpp:2105] Elected as the leading master!
*15:49:24* 3: I1203 15:49:24.444366 24690 master.cpp:1660] Recovering from registrar
*15:49:24* 3: I1203 15:49:24.445142 24687 registrar.cpp:339] Recovering registrar
*15:49:24* 3: I1203 15:49:24.445669 24687 registrar.cpp:383] Successfully fetched the registry (0B) in 472064ns
*15:49:24* 3: I1203 15:49:24.445785 24687 registrar.cpp:487] Applied 1 operations in 40517ns; attempting to update the registry
*15:49:24* 3: I1203 15:49:24.446497 24687 registrar.cpp:544] Successfully updated the registry in 660992ns
*15:49:24* 3: I1203 15:49:24.453212 24687 registrar.cpp:416] Successfully recovered registrar
*15:49:24* 3: I1203 15:49:24.453722 24692 master.cpp:1774] Recovered 0 agents from the registry (135B); allowing 10mins for agents to reregister
*15:49:24* 3: I1203 15:49:24.453984 24692 hierarchical.cpp:215] Skipping recovery of hierarchical allocator: nothing to recover
*15:49:24* 3: I1203 15:49:24.468710 24686 containerizer.cpp:305] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
*15:49:24* 3: W1203 15:49:24.481513 24686 backend.cpp:76] Failed to create 'aufs' backend: AufsBackend requires root privileges
*15:49:24* 3: W1203 15:49:24.481549 24686 backend.cpp:76] Failed to create 'bind' backend: BindBackend requires root privileges
*15:49:24* 3: I1203 15:49:24.481591 24686 provisioner.cpp:298] Using default backend 'copy'
*15:49:24* 3: W1203 15:49:24.498661 24686 process.cpp:2829] Attempted to spawn already running process 

[jira] [Assigned] (MESOS-9022) Race condition in task updates could cause missing event in streaming

2018-11-30 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9022:
-

   Assignee: Benno Evers
 Labels: events foundations mesos mesosphere race-condition streaming  
(was: events mesos mesosphere race-condition streaming)
Component/s: HTTP API

Oh great. [~bennoe] can you confirm and resolve?

> Race condition in task updates could cause missing event in streaming
> -
>
> Key: MESOS-9022
> URL: https://issues.apache.org/jira/browse/MESOS-9022
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, master
>Affects Versions: 1.6.0
>Reporter: Evelyn Liu
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: events, foundations, mesos, mesosphere, race-condition, 
> streaming
>
> Master sends update event of {{TASK_STARTING}} when task's latest state is 
> already {{TASK_FAILED}}. Then when it handles the update of {{TASK_FAILED}}, 
> {{sendSubscribersUpdate}} is set to {{false}} because of 
> [this|https://github.com/apache/mesos/blob/1.6.x/src/master/master.cpp#L10805].
>  The subscriber would not receive update event of {{TASK_FAILED}}.
> This happened when a task failed very fast. Is there a race condition while 
> handling task updates?
> {{*master log:*}}
> {code:java}
> I0622 13:08:29.189771 84079 master.cpp:8345] Status update TASK_STARTING 
> (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.189801 84079 master.cpp:8402] Forwarding status update 
> TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.190004 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_STARTING, 
> status update state: TASK_STARTING)
>  I0622 13:08:29.603857 84079 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status eb091093-d303-4e82-b69f-e2ba1011ba76 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.615643 84079 master.cpp:8345] Status update TASK_STARTING 
> (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.615669 84079 master.cpp:8402] Forwarding status update 
> TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.615783 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status 
> update state: TASK_STARTING)
>  I0622 13:08:29.620837 84079 master.cpp:8345] Status update TASK_FAILED 
> (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.620853 84079 master.cpp:8402] Forwarding status update 
> TASK_FAILED (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.620923 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status 
> update state: TASK_FAILED)
>  I0622 13:08:29.630455 84079 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status eb091093-d303-4e82-b69f-e2ba1011ba76 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.673051 84095 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status ac34f1e9-eaa4-4765-82ac-7398c2e6c835 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587{code}
>  
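
The sequence in the master log above can be replayed with a small toy model. This is only a sketch under stated assumptions: `ToyMaster`, `StatusUpdate` and `replayLog` are hypothetical names, not Mesos classes, and the rule "notify subscribers only when the stored task state changes" is a simplification of whatever the linked line in master.cpp actually does.

```cpp
#include <cassert>
#include <vector>

// Toy model only: these types are hypothetical and are not Mesos classes.
enum class TaskState { STAGING, STARTING, FAILED };

struct StatusUpdate {
  TaskState statusState;  // "status update state" in the log
  TaskState latestState;  // "latest state" in the log
};

struct ToyMaster {
  TaskState taskState = TaskState::STAGING;
  std::vector<TaskState> eventsSent;  // status states seen by subscribers

  void updateTask(const StatusUpdate& update) {
    // Assumption: subscribers are notified only when the stored state changes.
    const bool sendSubscribersUpdate = (update.latestState != taskState);
    taskState = update.latestState;
    if (sendSubscribersUpdate) {
      // The event carries the status being forwarded, not the latest state.
      eventsSent.push_back(update.statusState);
    }
  }
};

// Replays the three "Updating the state of task" lines from the log and
// returns the status states that subscribers would have received.
std::vector<TaskState> replayLog() {
  ToyMaster master;
  master.updateTask({TaskState::STARTING, TaskState::STARTING});
  master.updateTask({TaskState::STARTING, TaskState::FAILED});
  master.updateTask({TaskState::FAILED, TaskState::FAILED});
  return master.eventsSent;
}
```

In this model the second update already moves the stored state to TASK_FAILED while announcing TASK_STARTING, so the third update looks like a no-op and no TASK_FAILED event is emitted.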



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-2554) Slave flaps when using --slave_subsystems that are not used for isolation.

2018-11-30 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705406#comment-16705406
 ] 

Vinod Kone commented on MESOS-2554:
---

[~jieyu] Is this still an issue? cc [~gilbert]

> Slave flaps when using --slave_subsystems that are not used for isolation.
> --
>
> Key: MESOS-2554
> URL: https://issues.apache.org/jira/browse/MESOS-2554
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.21.1, 0.22.0
>Reporter: Jie Yu
>Priority: Critical
>
> Say one uses --slave_subsystems=cpuacct.
> However, if they do not use the cpuacct cgroup for isolation, all processes 
> forked by the slave (e.g., tasks) will be part of the slave cgroup. This is 
> not expected. Also, more importantly, this will cause the slave to flap when 
> restarted because there are task processes in the slave's cgroup.
> We should add a check during slave startup at least!





[jira] [Commented] (MESOS-5989) Libevent SSL Socket downgrade code accesses uninitialized memory / assumes single peek is sufficient.

2018-11-30 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705403#comment-16705403
 ] 

Vinod Kone commented on MESOS-5989:
---

[~bmahler] Is this still an issue?

> Libevent SSL Socket downgrade code accesses uninitialized memory / assumes 
> single peek is sufficient.
> -
>
> Key: MESOS-5989
> URL: https://issues.apache.org/jira/browse/MESOS-5989
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Priority: Critical
>
> See the XXX comment below.
> https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L912-L920
> {code}
> void LibeventSSLSocketImpl::peek_callback(
>     evutil_socket_t fd,
>     short what,
>     void* arg)
> {
>   CHECK(__in_event_loop__);
>   CHECK(what & EV_READ);
>   char data[6];
>   // Try to peek the first 6 bytes of the message.
>   ssize_t size = ::recv(fd, data, 6, MSG_PEEK);
>   // Based on the function 'ssl23_get_client_hello' in openssl, we
>   // test whether to dispatch to the SSL or non-SSL based accept based
>   // on the following rules:
>   //   1. If there are fewer than 3 bytes: non-SSL.
>   //   2. If the 1st bit of the 1st byte is set AND the 3rd byte is
>   //  equal to SSL2_MT_CLIENT_HELLO: SSL.
>   //   3. If the 1st byte is equal to SSL3_RT_HANDSHAKE AND the 2nd
>   //  byte is equal to SSL3_VERSION_MAJOR and the 6th byte is
>   //  equal to SSL3_MT_CLIENT_HELLO: SSL.
>   //   4. Otherwise: non-SSL.
>   // For an ascii based protocol to falsely get dispatched to SSL it
>   // needs to:
>   //   1. Start with an invalid ascii character (0x80).
>   //   2. OR have the first 2 characters be a SYN followed by ETX, and
>   //  then the 6th character be SOH.
>   // These conditions clearly do not constitute valid HTTP requests,
>   // and are unlikely to collide with other existing protocols.
>   bool ssl = false; // Default to rule 4.
>   // XXX: data[0] data[1] are guaranteed to be set, but not data[>=2]
>   if (size < 2) { // Rule 1.
>     ssl = false;
>   } else if ((data[0] & 0x80) && data[2] == SSL2_MT_CLIENT_HELLO) { // Rule 2.
>     ssl = true;
>   } else if (data[0] == SSL3_RT_HANDSHAKE &&
>              data[1] == SSL3_VERSION_MAJOR &&
>              data[5] == SSL3_MT_CLIENT_HELLO) { // Rule 3.
>     ssl = true;
>   }
>   AcceptRequest* request = reinterpret_cast<AcceptRequest*>(arg);
>   // We call 'event_free()' here because it ensures the event is made
>   // non-pending and inactive before it gets deallocated.
>   event_free(request->peek_event);
>   request->peek_event = nullptr;
>   if (ssl) {
>     accept_SSL_callback(request);
>   } else {
>     // Downgrade to a non-SSL socket.
>     Try<Socket> create = Socket::create(Socket::POLL, fd);
>     if (create.isError()) {
>       request->promise.fail(create.error());
>     } else {
>       request->promise.set(create.get());
>     }
>     delete request;
>   }
> }
> {code}
> This code accesses potentially uninitialized memory. Secondly, the code 
> assumes that a single peek is sufficient for determining whether the incoming 
> data is an SSL connection. There seems to be an assumption that in the SSL 
> path, we are guaranteed to peek a sufficient number of bytes when the socket 
> is ready to read. It's not clear what is providing this guarantee, or if this 
> is incorrect.
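
One way to avoid reading bytes that were never peeked is to make the classification depend on the number of bytes actually returned. The following is a minimal sketch, not the libprocess fix: `Peek` and `classify` are hypothetical names, the byte values follow the rule comments above (SYN 0x16, ETX 0x03, SOH 0x01), and the `Undetermined` result is an assumption about how a caller might decide to peek again.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte values as described in the rule comments above; the constant names
// are local to this sketch and are not the OpenSSL macros.
const unsigned char kSsl2ClientHello = 0x01;
const unsigned char kSsl3Handshake = 0x16;
const unsigned char kSsl3VersionMajor = 0x03;
const unsigned char kSsl3ClientHello = 0x01;

enum class Peek { NonSSL, SSL, Undetermined };

// Classifies only the bytes that were really peeked. `peeked` holds exactly
// the bytes returned by the peek, so its size is the number of valid bytes.
Peek classify(const std::string& peeked) {
  const std::size_t size = peeked.size();
  if (size == 0) {
    return Peek::Undetermined;  // Nothing to look at yet.
  }

  const unsigned char b0 = static_cast<unsigned char>(peeked[0]);

  if (b0 & 0x80) {  // Candidate for rule 2, which needs the 3rd byte.
    if (size < 3) {
      return Peek::Undetermined;
    }
    const unsigned char b2 = static_cast<unsigned char>(peeked[2]);
    return b2 == kSsl2ClientHello ? Peek::SSL : Peek::NonSSL;
  }

  if (b0 == kSsl3Handshake) {  // Candidate for rule 3, which needs 6 bytes.
    if (size < 2) {
      return Peek::Undetermined;
    }
    if (static_cast<unsigned char>(peeked[1]) != kSsl3VersionMajor) {
      return Peek::NonSSL;
    }
    if (size < 6) {
      return Peek::Undetermined;
    }
    const unsigned char b5 = static_cast<unsigned char>(peeked[5]);
    return b5 == kSsl3ClientHello ? Peek::SSL : Peek::NonSSL;
  }

  return Peek::NonSSL;  // Rule 4.
}
```

A caller could treat `Undetermined` as "wait for the socket to become readable again and peek once more", which also addresses the single-peek assumption raised in the description.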





[jira] [Commented] (MESOS-6632) ContainerLogger might leak FD if container launch fails.

2018-11-30 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705400#comment-16705400
 ] 

Vinod Kone commented on MESOS-6632:
---

[~kaysoky], [~gilbert]: Is this still an issue?

> ContainerLogger might leak FD if container launch fails.
> 
>
> Key: MESOS-6632
> URL: https://issues.apache.org/jira/browse/MESOS-6632
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.2, 1.0.1, 1.1.0
>Reporter: Jie Yu
>Priority: Critical
>
> In MesosContainerizer, if logger->prepare() succeeds but its continuation 
> fails, the pipe fd allocated in the logger will get leaked. We cannot add a 
> destructor in ContainerLogger::SubprocessInfo to close the fd because 
> subprocess might close the OWNED fd.
> A FD abstraction might help here. In other words, subprocess will no longer 
> be responsible for closing external FDs, instead, the FD destructor will be 
> doing so.
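
The "FD abstraction" suggested above could look roughly like a move-only RAII wrapper. This is a sketch only: `OwnedFd` is a hypothetical class, not an existing stout or libprocess type, and it assumes a POSIX platform.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>
#include <utility>

// Hypothetical move-only owner of a file descriptor. Whoever holds the
// object is responsible for the fd; the destructor closes it exactly once.
class OwnedFd {
public:
  explicit OwnedFd(int fd = -1) : fd_(fd) {}
  ~OwnedFd() { reset(); }

  OwnedFd(const OwnedFd&) = delete;
  OwnedFd& operator=(const OwnedFd&) = delete;

  OwnedFd(OwnedFd&& other) noexcept : fd_(other.release()) {}

  OwnedFd& operator=(OwnedFd&& other) noexcept {
    if (this != &other) {
      reset(other.release());
    }
    return *this;
  }

  int get() const { return fd_; }

  // Gives up ownership without closing; the caller must close the fd.
  int release() {
    const int fd = fd_;
    fd_ = -1;
    return fd;
  }

  // Closes the currently held fd (if any) and takes ownership of `fd`.
  void reset(int fd = -1) {
    if (fd_ >= 0) {
      ::close(fd_);
    }
    fd_ = fd;
  }

private:
  int fd_;
};
```

With such a type, a failed continuation simply drops the object holding the pipe end and the fd is closed, while a successful launch would move ownership to whichever component is meant to keep it.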





[jira] [Assigned] (MESOS-5396) After failover, master does not remove agents with same UPID.

2018-11-30 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-5396:
-

Assignee: (was: Neil Conway)

> After failover, master does not remove agents with same UPID.
> -
>
> Key: MESOS-5396
> URL: https://issues.apache.org/jira/browse/MESOS-5396
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Neil Conway
>Priority: Critical
>  Labels: mesosphere
>
> Scenario:
> * master fails over
> * an agent host is restarted; the agent attempts to *register* (not 
> reregister) with Mesos using the same UPID as the previous agent instance; 
> this means it will get a new agent ID
> * framework isn't notified about the status of the tasks on the *old* agentID 
> until the {{agent_reregister_timeout}} expires (10 mins)
> This isn't necessarily wrong but it is suboptimal: when the agent attempts to 
> register with the same UPID that was used by the previous agent instance, we 
> know that a *reregistration* attempt for the old <agent ID, UPID> pair will 
> never be seen. Hence we can declare the old agentID to be gone-forever and 
> notify frameworks appropriately, without waiting for the full 
> {{agent_reregister_timeout}} to expire.
> Note that we already implement the proposed behavior for the case when the 
> master does *not* failover 
> (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
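
The proposed behavior can be illustrated with a toy model. Everything here is hypothetical (`ToyMaster`, its members, and the idea of indexing recovered agents by UPID are assumptions made for the sketch, not how the Mesos master stores its state).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model; not Mesos code. All names are hypothetical.
struct ToyMaster {
  // UPID -> agent ID for agents recovered from the registry after failover
  // that have not reregistered yet.
  std::map<std::string, std::string> recovered;

  // UPID -> agent ID for agents currently registered.
  std::map<std::string, std::string> registered;

  // Agent IDs declared gone without waiting for agent_reregister_timeout.
  std::vector<std::string> gone;

  // Handles a fresh *registration* (not a reregistration).
  void registerAgent(const std::string& upid, const std::string& newAgentId) {
    std::map<std::string, std::string>::iterator it = recovered.find(upid);
    if (it != recovered.end()) {
      // A new instance now owns this UPID, so the old agent ID can never
      // reregister from it; frameworks can be told right away.
      gone.push_back(it->second);
      recovered.erase(it);
    }
    registered[upid] = newAgentId;
  }
};
```

Agents recovered under other UPIDs are left untouched and would still be subject to the normal reregistration timeout.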





[jira] [Commented] (MESOS-9022) Race condition in task updates could cause missing event in streaming

2018-11-30 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705368#comment-16705368
 ] 

Vinod Kone commented on MESOS-9022:
---

cc [~greggomann]

> Race condition in task updates could cause missing event in streaming
> -
>
> Key: MESOS-9022
> URL: https://issues.apache.org/jira/browse/MESOS-9022
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.6.0
>Reporter: Evelyn Liu
>Priority: Blocker
>  Labels: events, mesos, mesosphere, race-condition, streaming
>
> Master sends update event of {{TASK_STARTING}} when task's latest state is 
> already {{TASK_FAILED}}. Then when it handles the update of {{TASK_FAILED}}, 
> {{sendSubscribersUpdate}} is set to {{false}} because of 
> [this|https://github.com/apache/mesos/blob/1.6.x/src/master/master.cpp#L10805].
>  The subscriber would not receive update event of {{TASK_FAILED}}.
> This happened when a task failed very fast. Is there a race condition while 
> handling task updates?
> {{*master log:*}}
> {code:java}
> I0622 13:08:29.189771 84079 master.cpp:8345] Status update TASK_STARTING 
> (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.189801 84079 master.cpp:8402] Forwarding status update 
> TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.190004 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_STARTING, 
> status update state: TASK_STARTING)
>  I0622 13:08:29.603857 84079 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status eb091093-d303-4e82-b69f-e2ba1011ba76 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.615643 84079 master.cpp:8345] Status update TASK_STARTING 
> (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.615669 84079 master.cpp:8402] Forwarding status update 
> TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.615783 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status 
> update state: TASK_STARTING)
>  I0622 13:08:29.620837 84079 master.cpp:8345] Status update TASK_FAILED 
> (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.620853 84079 master.cpp:8402] Forwarding status update 
> TASK_FAILED (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.620923 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status 
> update state: TASK_FAILED)
>  I0622 13:08:29.630455 84079 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status eb091093-d303-4e82-b69f-e2ba1011ba76 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.673051 84095 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status ac34f1e9-eaa4-4765-82ac-7398c2e6c835 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587{code}
>  





[jira] [Commented] (MESOS-9157) cannot pull docker image from dockerhub

2018-11-30 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705365#comment-16705365
 ] 

Vinod Kone commented on MESOS-9157:
---

cc [~gilbert] [~abudnik] [~qianzhang]

> cannot pull docker image from dockerhub
> ---
>
> Key: MESOS-9157
> URL: https://issues.apache.org/jira/browse/MESOS-9157
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.6.1
>Reporter: Michael Bowie
>Priority: Blocker
>  Labels: containerization
>
> I am not able to pull docker images from docker hub through marathon/mesos. 
> I get one of two errors:
>  * `Aug 15 10:11:02 michael-b-dcos-agent-1 dockerd[5974]: 
> time="2018-08-15T10:11:02.770309104-04:00" level=error msg="Not continuing 
> with pull after error: context canceled"`
>  * `Failed to run docker -H ... Error: No such object: 
> mesos-d2f333a8-fef2-48fb-8b99-28c52c327790`
> However, I can manually ssh into one of the agents and successfully pull the 
> image from the command line. 
> Any pointers in the right direction?
> Thank you!
> Similar Issues:
> https://github.com/mesosphere/marathon/issues/3869





[jira] [Assigned] (MESOS-9247) MasterAPITest.EventAuthorizationFiltering is flaky

2018-11-28 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9247:
-

Assignee: Till Toenshoff

> MasterAPITest.EventAuthorizationFiltering is flaky
> --
>
> Key: MESOS-9247
> URL: https://issues.apache.org/jira/browse/MESOS-9247
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.7.0
>Reporter: Greg Mann
>Assignee: Till Toenshoff
>Priority: Major
>  Labels: flaky, flaky-test, integration, mesosphere
> Attachments: MasterAPITest.EventAuthorizationFiltering.txt
>
>
> Saw this failure on a CentOS 6 SSL build in our internal CI. Build log 
> attached. For some reason, it seems that the initial {{TASK_ADDED}} event is 
> missed:
> {code}
> ../../src/tests/api_tests.cpp:2922
>   Expected: v1::master::Event::TASK_ADDED
>   Which is: TASK_ADDED
> To be equal to: event->get().type()
>   Which is: TASK_UPDATED
> {code}





[jira] [Commented] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.

2018-11-27 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700904#comment-16700904
 ] 

Vinod Kone commented on MESOS-7564:
---

{quote}

1) The SUBSCRIBE Call is one persistent connection where the executor sends one 
Call, and receives a stream of Events. There is currently no Executor->Agent 
traffic except the first request. This connection could probably use 
heartbeating in both directions. Agent->Executor heartbeats may come in the 
form of Events. Executor->Agent heartbeats will need to be something else (like 
the heartbeating suggested here: [https://reviews.apache.org/r/69183/] ).

{quote}

Do we really need heartbeats in both directions given it is a single 
connection? I would imagine agent -> executor heartbeat events should be 
enough, as we did with the v1 scheduler API.


> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: Anand Mazumdar
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: api, mesosphere, v1_api
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios where IPFilters are enabled, since 
> the default conntrack keepalive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process is completed.
> Enabling application-level heartbeats or TCP keepalives is a possible 
> way to fix this issue.
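
As an illustration of application-level heartbeats on the receiving side, here is a minimal sketch. `HeartbeatMonitor`, the 15 second interval and the "two missed heartbeats" threshold used in the usage note are assumptions chosen for the example; they are not values or types taken from Mesos.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical receiver-side monitor: the connection is considered stale
// once more than `maxMissed` heartbeat intervals pass without any event.
class HeartbeatMonitor {
public:
  typedef std::chrono::steady_clock Clock;

  HeartbeatMonitor(
      Clock::duration interval,
      int maxMissed,
      Clock::time_point start)
    : interval_(interval), maxMissed_(maxMissed), lastSeen_(start) {}

  // Call for every heartbeat (or any other event) received from the peer.
  void onHeartbeat(Clock::time_point now) { lastSeen_ = now; }

  // True if the peer has been silent for longer than the allowed window.
  bool isStale(Clock::time_point now) const {
    return now - lastSeen_ > interval_ * maxMissed_;
  }

private:
  Clock::duration interval_;
  int maxMissed_;
  Clock::time_point lastSeen_;
};
```

An executor holding such a monitor with, say, a 15 second interval and a threshold of 2 would notice a silently dropped connection after about 30 seconds and could reconnect, rather than waiting for the conntrack timeout.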





[jira] [Commented] (MESOS-8930) THREADSAFE_SnapshotTimeout is flaky.

2018-11-26 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699577#comment-16699577
 ] 

Vinod Kone commented on MESOS-8930:
---

Still seeing this in CI.

[~bmahler] Do we have any abstractions/techniques in place that allow us to 
ensure the HTTP request is enqueued in a more robust manner? Sounds like the 
10ms is sometimes not enough in ASF CI.

A somewhat unrelated bug here is that the code does a "response->body" on a 
(possibly pending) future, causing it to hang forever. This will block the whole 
test suite!

{code}
  AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response);

  // Parse the response.
  Try<JSON::Object> responseJSON = JSON::parse<JSON::Object>(response->body);
  ASSERT_SOME(responseJSON);
{code}

I think we should at least change the `AWAIT_EXPECT_*` above to `AWAIT_ASSERT` 
so that the rest of the test code is skipped. cc [~greggomann] [~bmahler]
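
The hazard of dereferencing a future that may still be pending can be shown with the standard library. This is an analogy using `std::future`, not libprocess's `Future`, and `bodyIfReady` is a hypothetical helper written for this sketch.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <string>

// Returns true and fills `body` only if the response became ready within
// `timeout`. Calling response.get() unconditionally would block for as long
// as the future stays pending.
bool bodyIfReady(
    std::future<std::string>& response,
    std::chrono::milliseconds timeout,
    std::string* body) {
  if (response.wait_for(timeout) != std::future_status::ready) {
    return false;  // Still pending: bail out instead of blocking.
  }
  *body = response.get();
  return true;
}
```

A fatal assertion plays the role of the early `return false` here: once the wait fails, nothing after it touches the pending future.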

 

> THREADSAFE_SnapshotTimeout is flaky.
> 
>
> Key: MESOS-8930
> URL: https://issues.apache.org/jira/browse/MESOS-8930
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: flaky-test, mesosphere
>
> Observed on ASF CI, might be related to a recent test change 
> https://reviews.apache.org/r/66831/
> {noformat}
> 18:23:31 2: [ RUN  ] MetricsTest.THREADSAFE_SnapshotTimeout
> 18:23:31 2: I0516 18:23:31.747611 16246 process.cpp:3583] Handling HTTP event 
> for process 'metrics' with path: '/metrics/snapshot'
> 18:23:31 2: I0516 18:23:31.796871 16251 process.cpp:3583] Handling HTTP event 
> for process 'metrics' with path: '/metrics/snapshot'
> 18:23:46 2: /tmp/SRC/3rdparty/libprocess/src/tests/metrics_tests.cpp:425: 
> Failure
> 18:23:46 2: Failed to wait 15secs for response
> 22:57:13 Build timed out (after 300 minutes). Marking the build as failed.
> {noformat}





[jira] [Commented] (MESOS-9287) DockerFetcherPluginTest.INTERNET_CURL_FetchImage is flaky

2018-11-16 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689639#comment-16689639
 ] 

Vinod Kone commented on MESOS-9287:
---

Observed this via Windows Reviewbot.

{noformat}
[ RUN      ] DockerFetcherPluginTest.INTERNET_CURL_FetchImage
'hadoop' is not recognized as an internal or external command,
operable program or batch file.
d:\dcos\mesos\mesos\src\tests\uri_fetcher_tests.cpp(358): error: 
(fetcher.get()->fetch(uri, dir)).failure(): Collect failed: Unexpected 'curl' 
output: 
d:\dcos\mesos\mesos\3rdparty\stout\include\stout\tests\utils.hpp(46): error: 
TearDownMixin(): Failed to rmdir 'C:\Users\jenkins\AppData\Local\Temp\XpsPZ0': 
The process cannot access the file because it is being used by another process.
[  FAILED  ] DockerFetcherPluginTest.INTERNET_CURL_FetchImage (7460 ms)
{noformat}

> DockerFetcherPluginTest.INTERNET_CURL_FetchImage is flaky
> -
>
> Key: MESOS-9287
> URL: https://issues.apache.org/jira/browse/MESOS-9287
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0
> Environment: Windows tests on Azure
>Reporter: Andrew Schwartzmeyer
>Priority: Minor
>  Labels: ci, flaky, integration
>
> The test DockerFetcherPluginTest.INTERNET_CURL_FetchImage is flaky on the CI, 
> probably due to the 60 second timeout. A 10 minute timeout would probably be 
> sufficient for slow Azure networks and big Docker images.





[jira] [Assigned] (MESOS-9332) Debug container should run as the same user of its parent container by default

2018-10-26 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9332:
-

Assignee: Qian Zhang

> Debug container should run as the same user of its parent container by default
> --
>
> Key: MESOS-9332
> URL: https://issues.apache.org/jira/browse/MESOS-9332
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerizer, mesosphere
>
> Currently, when launching a debug container, the Mesos agent by default uses 
> the executor's user as the debug container's user if the `user` field is not 
> specified in the debug container's `commandInfo` (see [this 
> code|https://github.com/apache/mesos/blob/1.7.0/src/slave/http.cpp#L2559] for 
> details). This is OK for a command task, since the command executor's user 
> is the same as the command task's user (see [this 
> code|https://github.com/apache/mesos/blob/1.7.0/src/slave/slave.cpp#L6068:L6070]
>  for details), so the debug container will be launched as the same user as 
> the task. But for a task in a task group, the default executor's user is 
> the same as the framework user (see [this 
> code|https://github.com/apache/mesos/blob/1.7.0/src/slave/slave.cpp#L8959] 
> for details), so in this case the debug container will be launched as the 
> framework's user rather than the task's. So in a scenario where the 
> framework user is a normal user but the task user is root, the debug 
> container will be launched as the normal user, which is not desired; the 
> expectation is that the debug container runs as the same user as the 
> container it debugs.
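
The expected selection order can be written down as a small function. This is a sketch of the desired behavior only: `debugContainerUser` is a hypothetical helper, an empty string stands in for an unset field, and falling back to the executor's user when the parent container has no user is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper; an empty string means "not set".
// Order: explicit `user` in the debug container's commandInfo, then the
// user of the parent container being debugged, then the executor's user.
std::string debugContainerUser(
    const std::string& commandUser,
    const std::string& parentContainerUser,
    const std::string& executorUser) {
  if (!commandUser.empty()) {
    return commandUser;
  }
  if (!parentContainerUser.empty()) {
    return parentContainerUser;
  }
  return executorUser;
}
```

In the task-group scenario described above, with a root task under a default executor running as a normal user, this order yields root for the debug container.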





[jira] [Assigned] (MESOS-9356) Make agent atomically checkpoint operations and resources

2018-10-25 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9356:
-

Assignee: Gastón Kleiman

> Make agent atomically checkpoint operations and resources
> -
>
> Key: MESOS-9356
> URL: https://issues.apache.org/jira/browse/MESOS-9356
> Project: Mesos
>  Issue Type: Task
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: agent, mesosphere, operation-feedback
>
> See 
> https://docs.google.com/document/d/1HxMBCfzU9OZ-5CxmPG3TG9FJjZ_-xDUteLz64GhnBl0/edit
>  for more details.





[jira] [Assigned] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns

2018-10-22 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9334:
-

Shepherd: Gilbert Song
Assignee: Qian Zhang
  Sprint: Mesosphere RI-6 Sprint 2018-31
Story Points: 5

> Container stuck at ISOLATING state due to libevent poll never returns
> -
>
> Key: MESOS-9334
> URL: https://issues.apache.org/jira/browse/MESOS-9334
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Critical
>
> We found that a UCR container may get stuck in the `ISOLATING` state:
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] 
> Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 
> from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted 
> '/proc/5244/ns/net' to 
> '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' 
> for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459] 
> Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
> {code}
>  In the above logs, the state of container 
> `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` transitioned to `ISOLATING` at 
> 09:13:23, but it did not transition to any other state until it was destroyed 
> due to the executor registration timeout (10 mins). The destroy can never 
> complete, since it needs to wait for the container to finish isolating.





[jira] [Commented] (MESOS-9253) Reviewbot is failing when posting a review

2018-10-08 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641991#comment-16641991
 ] 

Vinod Kone commented on MESOS-9253:
---

[~ArmandGrillet] Can you please send a review with the above fix?

> Reviewbot is failing when posting a review
> --
>
> Key: MESOS-9253
> URL: https://issues.apache.org/jira/browse/MESOS-9253
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Priority: Critical
>
> Observed this in CI.
> [https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23425/console]
>  
> {code}
> 09-23-18_02:12:05 - Running /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py
> Checking if review 68640 needs verification
> Skipping blocking review 68640
> Checking if review 68641 needs verification
> Patch never verified, needs verification
> Dependent review: [https://reviews.apache.org/api/review-requests/68640/]
> Verifying review 68641
> Dependent review: [https://reviews.apache.org/api/review-requests/68640/]
> Applying review 68640
> python support/apply-reviews.py -n -r 68640
> Traceback (most recent call last):
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 156, in verify_review
>     apply_reviews(review_request, reviews, handler)
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 120, in apply_reviews
>     reviews, handler)
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 124, in apply_reviews
>     apply_review(review_request["id"])
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 97, in apply_review
>     shell("python support/apply-reviews.py -n -r %s" % review_id)
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 91, in shell
>     command, stderr=subprocess.STDOUT, shell=True)
>   File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
>     **kwargs).stdout
>   File "/usr/lib/python3.5/subprocess.py", line 708, in run
>     output=stdout, stderr=stderr)
> subprocess.CalledProcessError: Command 'python support/apply-reviews.py -n -r 68640' returned non-zero exit status 1
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 292, in <module>
>     main()
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 286, in main
>     verify_review(review_request, handler)
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 217, in verify_review
>     output += "\nFull log: "
> TypeError: can't concat bytes to str
> Build step 'Execute shell' marked build as failure
> Sending e-mails to: bui...@mesos.apache.org
> Finished: FAILURE
>  {code}
>  





[jira] [Commented] (MESOS-9253) Reviewbot is failing when posting a review

2018-10-08 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641939#comment-16641939
 ] 

Vinod Kone commented on MESOS-9253:
---

Reviewbot is still failing

{noformat}
10-08-18_14:40:29 - Running /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py
Checking if review 68929 needs verification
Skipping blocking review 68929
Checking if review 68941 needs verification
Patch never verified, needs verification
Dependent review: [https://reviews.apache.org/api/review-requests/68929/]
Verifying review 68941
Dependent review: [https://reviews.apache.org/api/review-requests/68929/]
Applying review 68929
python support/apply-reviews.py -n -r 68929
Posting review: Bad patch!

Reviews applied: [68941, 68929]

Failed command: python support/apply-reviews.py -n -r 68929

Error:
Traceback (most recent call last):
  File "support/apply-reviews.py", line 35, in <module>
    import urllib.request
ImportError: No module named request

Full log: [https://builds.apache.org/job/Mesos-Reviewbot/23458/console]
1 review requests need verification
{noformat}

Maybe it's a Python 2 vs. 3 issue. [~ArmandGrillet] [~andschwa] Can you take a 
look?
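
A minimal sketch of one way such a mismatch could be avoided, assuming the failure comes from the child script being started with a bare `python` that resolves to Python 2 (the log above does not confirm which interpreter ran it). The helper name `run_with_same_interpreter` is hypothetical and is not part of the Mesos support scripts; it only shows that `sys.executable` makes the child run under the same interpreter as the parent.

```python
import subprocess
import sys


def run_with_same_interpreter(code):
    """Run a Python snippet in a child process using the interpreter that
    is running this script, rather than whatever `python` resolves to."""
    output = subprocess.check_output(
        [sys.executable, "-c", code], stderr=subprocess.STDOUT)
    # check_output() returns bytes on Python 3; decode to get text.
    return output.decode("utf-8")


# `urllib.request` exists only on Python 3. The import succeeds in the
# child because it runs under the same (Python 3) interpreter as the parent.
result = run_with_same_interpreter(
    "import urllib.request; print('import ok')")
```

Whether the real script should use `sys.executable` or an explicit `python3` is a design choice for the maintainers; the sketch only illustrates the first option.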

 

 

> Reviewbot is failing when posting a review
> --
>
> Key: MESOS-9253
> URL: https://issues.apache.org/jira/browse/MESOS-9253
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Priority: Critical
>





[jira] [Assigned] (MESOS-9295) Nested container launch could fail if the agent upgrade with new cgroup subsystems.

2018-10-05 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9295:
-

Assignee: Gilbert Song

> Nested container launch could fail if the agent upgrade with new cgroup 
> subsystems.
> ---
>
> Key: MESOS-9295
> URL: https://issues.apache.org/jira/browse/MESOS-9295
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Major
>
> Nested container launch could fail if the agent is upgraded with new cgroup 
> subsystems, because the new cgroup subsystems do not exist in the parent 
> container's cgroup hierarchy.





[jira] [Assigned] (MESOS-9293) OperationStatus messages sent to framework should include both agent ID and resource provider ID

2018-10-04 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9293:
-

Assignee: Gastón Kleiman

> OperationStatus messages sent to framework should include both agent ID and 
> resource provider ID
> 
>
> Key: MESOS-9293
> URL: https://issues.apache.org/jira/browse/MESOS-9293
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0
>Reporter: James DeFelice
>Assignee: Gastón Kleiman
>Priority: Major
>  Labels: mesosphere, operation-feedback
>
> Normally, frameworks are expected to checkpoint agent ID and resource 
> provider ID before accepting an offer with an OfferOperation. From this 
> expectation comes the requirement in the v1 scheduler API that a framework 
> must provide the agent ID and resource provider ID when acknowledging an 
> offer operation status update. However, this expectation breaks down:
> 1. the framework might lose its checkpointed data; it no longer remembers the 
> agent ID or the resource provider ID
> 2. even if the framework checkpoints data, it could be sent a stale update: 
> maybe the original ACK it sent to Mesos was lost, and it needs to re-ACK. If 
> a framework deleted its checkpointed data after sending the ACK (that's 
> dropped) then upon replay of the status update it no longer has the agent ID 
> or resource provider ID for the operation.
> An easy remedy would be to add the agent ID and resource provider ID to the 
> OperationStatus message received by the scheduler so that a framework can 
> build a proper ACK for the update, even if it doesn't have access to its 
> previously checkpointed information.
> I'm filing this as a BUG because there's no way to reliably use the offer 
> operation status API until this has been fixed.
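
The proposed remedy can be sketched as follows. This is an illustration only: the `OperationStatus` class and the dictionary returned by `build_ack` are hypothetical stand-ins, not the Mesos protobuf messages, and the field names are assumptions chosen to mirror the description above. The point is that a framework with no checkpointed data can still build the ACK from the update alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OperationStatus:
    """Illustrative stand-in for a status update; not the Mesos protobuf."""
    operation_id: str
    uuid: str
    agent_id: Optional[str] = None              # proposed addition
    resource_provider_id: Optional[str] = None  # proposed addition


def build_ack(status):
    """Build an acknowledgement using only the fields carried by the update."""
    if status.agent_id is None:
        raise ValueError("status update carries no agent ID; cannot build ACK")
    ack = {
        "operation_id": status.operation_id,
        "uuid": status.uuid,
        "agent_id": status.agent_id,
    }
    # Only include the resource provider ID when the update carries one.
    if status.resource_provider_id is not None:
        ack["resource_provider_id"] = status.resource_provider_id
    return ack


# A replayed update arrives after the framework deleted its checkpoint.
replayed = OperationStatus(
    operation_id="op-1", uuid="u-1",
    agent_id="agent-A", resource_provider_id="rp-X")
ack = build_ack(replayed)
```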





[jira] [Commented] (MESOS-9253) Reviewbot is failing when posting a review

2018-09-24 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626200#comment-16626200
 ] 

Vinod Kone commented on MESOS-9253:
---

cc [~andschwa] [~bbannier]

> Reviewbot is failing when posting a review
> --
>
> Key: MESOS-9253
> URL: https://issues.apache.org/jira/browse/MESOS-9253
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Priority: Critical
>





[jira] [Created] (MESOS-9253) Reviewbot is failing when posting a review

2018-09-24 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9253:
-

 Summary: Reviewbot is failing when posting a review
 Key: MESOS-9253
 URL: https://issues.apache.org/jira/browse/MESOS-9253
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone


Observed this in CI.

[https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Reviewbot/23425/console]

 

{code}
09-23-18_02:12:05 - Running /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py
Checking if review 68640 needs verification
Skipping blocking review 68640
Checking if review 68641 needs verification
Patch never verified, needs verification
Dependent review: [https://reviews.apache.org/api/review-requests/68640/]
Verifying review 68641
Dependent review: [https://reviews.apache.org/api/review-requests/68640/]
Applying review 68640
python support/apply-reviews.py -n -r 68640
Traceback (most recent call last):
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 156, in verify_review
    apply_reviews(review_request, reviews, handler)
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 120, in apply_reviews
    reviews, handler)
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 124, in apply_reviews
    apply_review(review_request["id"])
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 97, in apply_review
    shell("python support/apply-reviews.py -n -r %s" % review_id)
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 91, in shell
    command, stderr=subprocess.STDOUT, shell=True)
  File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 708, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'python support/apply-reviews.py -n -r 68640' returned non-zero exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 292, in <module>
    main()
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 286, in main
    verify_review(review_request, handler)
  File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 217, in verify_review
    output += "\nFull log: "
TypeError: can't concat bytes to str
Build step 'Execute shell' marked build as failure
Sending e-mails to: bui...@mesos.apache.org
Finished: FAILURE
 {code}
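
The final `TypeError` is consistent with `subprocess` output being `bytes` on Python 3, so appending a `str` to it fails. A minimal sketch of the decode-before-concatenate pattern follows; it is not the actual fix applied to `verify-reviews.py`, and the function name `run_and_report`, the command, and the URL are all hypothetical.

```python
import subprocess


def run_and_report(command, log_url):
    """Run `command`; on failure return its output plus a log pointer, as str."""
    try:
        output = subprocess.check_output(
            command, stderr=subprocess.STDOUT, shell=True)
        return output.decode("utf-8", errors="replace")
    except subprocess.CalledProcessError as err:
        # On Python 3, err.output is bytes; decode it before appending str.
        report = err.output.decode("utf-8", errors="replace")
        report += "\nFull log: " + log_url
        return report


# Hypothetical command and URL, used only to exercise the failure path.
report = run_and_report(
    "echo boom; exit 1", "https://ci.example.invalid/console")
```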
 





[jira] [Assigned] (MESOS-9232) verify-reviews.py broken after enabling python3 support scripts

2018-09-14 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9232:
-

Assignee: Andrew Schwartzmeyer

> verify-reviews.py broken after enabling python3 support scripts
> ---
>
> Key: MESOS-9232
> URL: https://issues.apache.org/jira/browse/MESOS-9232
> Project: Mesos
>  Issue Type: Bug
>  Components: reviewbot, test
>Affects Versions: 1.8.0
>Reporter: Benjamin Bannier
>Assignee: Andrew Schwartzmeyer
>Priority: Blocker
>
> Reviewbot is failing since {{support/verify-reviews.py}} was upgraded to use 
> the python3 instead of the python2 implementation. I see this was completely 
> refactored in {{590a75d0c9d61b0b07f8a3807225c40eb8189a0b}} and replaced the 
> existing impl with {{9c7eb909aad99e6ea6de0b1fd2a55a798764b00b}}.
> We already fixed how the script gets invoked by Jenkins (it uses a completely 
> different way to pass arguments), but now see failures like
> {noformat}
> 09-14-18_08:43:03 - Running /home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py
> 0 review requests need verification
> Traceback (most recent call last):
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 250, in verification_needed_write
>     with open(parameters.out_file, 'w') as f:
> AttributeError: 'Namespace' object has no attribute 'out_file'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 301, in <module>
>     main()
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 297, in main
>     verification_needed_write(review_ids, parameters)
>   File "/home/jenkins/jenkins-slave/workspace/Mesos-Reviewbot/support/verify-reviews.py", line 253, in verification_needed_write
>     print("Failed opening file '%s' for writing" % parameters.out_file)
> AttributeError: 'Namespace' object has no attribute 'out_file'
> Build step 'Execute shell' marked build as failure
> Sending e-mails to: bui...@mesos.apache.org
> Finished: FAILURE
> {noformat}
> It looks like the script needs some additional modifications and 
> possibly tests.
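
The `AttributeError` above is what `argparse` produces when code reads an attribute that no `add_argument` call ever defined. A minimal sketch of a more defensive shape, assuming the script is meant to take an optional output file: the `--out-file` flag and both function bodies are hypothetical and do not reproduce the real `verify-reviews.py`.

```python
import argparse


def parse_parameters(argv):
    """Parse arguments; `--out-file` is a hypothetical flag for this sketch."""
    parser = argparse.ArgumentParser(description="Illustrative parser only.")
    # Declaring the option with a default guarantees the attribute exists.
    parser.add_argument("-o", "--out-file", dest="out_file", default=None,
                        help="File to write the review IDs to.")
    return parser.parse_args(argv)


def verification_needed_write(review_ids, parameters):
    """Write the IDs if an output file was given; report whether we wrote."""
    # getattr() with a default also tolerates a Namespace built elsewhere.
    out_file = getattr(parameters, "out_file", None)
    if out_file is None:
        return False
    with open(out_file, "w") as f:
        f.write("\n".join(str(review_id) for review_id in review_ids))
    return True


no_file = parse_parameters([])
with_file = parse_parameters(["--out-file", "ids.txt"])
```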





[jira] [Created] (MESOS-9210) Mesos v1 scheduler library does not properly handle SUBSCRIBE retries

2018-09-05 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9210:
-

 Summary: Mesos v1 scheduler library does not properly handle 
SUBSCRIBE retries
 Key: MESOS-9210
 URL: https://issues.apache.org/jira/browse/MESOS-9210
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.6.1, 1.5.1, 1.7.0
Reporter: Vinod Kone
Assignee: Till Toenshoff


After the authentication-related refactor done as part of 
[https://reviews.apache.org/r/62594/], the state of the scheduler is checked in 
`send` 
([https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L234])
 but it is changed in `_send` 
([https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L234]).
 As a result, we can have 2 SUBSCRIBE calls in flight at the same time on the 
same connection! This is not good and not spec-compliant for an HTTP client that 
is expecting a streaming response.

We need to fix the library to either drop the retried SUBSCRIBE call if one is 
in progress (as it was before the refactor) or close the old connection and 
start a new connection to send the retried SUBSCRIBE call.
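
The first option (dropping a retried SUBSCRIBE while one is in progress) can be sketched as a toy state machine. This is not the scheduler library's code: the class, state names, and methods below are hypothetical, and the real library is asynchronous C++. The sketch only shows why the state must change in the same step in which it is checked.

```python
class SchedulerConnection:
    """Toy model of a scheduler library connection (names are hypothetical)."""

    def __init__(self):
        self.state = "CONNECTED"
        self.sent = []

    def send(self, call_type):
        """Send a call; return False if the call was dropped."""
        if call_type == "SUBSCRIBE":
            if self.state != "CONNECTED":
                # A SUBSCRIBE is already in flight or completed: drop the retry.
                return False
            # Change the state in the same step as the check, so a retry
            # arriving before the response sees SUBSCRIBING and is dropped.
            self.state = "SUBSCRIBING"
        self.sent.append(call_type)
        return True

    def on_subscribed(self):
        """Called when the streaming response confirms the subscription."""
        self.state = "SUBSCRIBED"


connection = SchedulerConnection()
first = connection.send("SUBSCRIBE")
retry = connection.send("SUBSCRIBE")  # arrives before any response
```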

 

 

 





[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-30 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598016#comment-16598016
 ] 

Vinod Kone commented on MESOS-8568:
---

[~qianzhang] Can you please set the affects and target versions?

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Andrei Budnik
>Assignee: Qian Zhang
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>
> After a successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in the checker library, it calls 
> [waitNestedContainer|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. The checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, the 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been 
> launched, and the following attempt to remove the container without calling 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.
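
The ordering requirement can be sketched with a fake agent. Everything here is hypothetical: `FakeAgent` and `cleanup_previous_check` are illustration names, not the checker library's API, and the real calls are asynchronous HTTP requests. The sketch only shows that removal succeeds once a wait has always been issued first, on the failure path as well.

```python
class FakeAgent:
    """Hypothetical stand-in for the agent's nested container API."""

    def __init__(self):
        self.terminated = set()
        self.removed = []

    def wait_nested_container(self, container_id):
        # In this toy model, waiting is what makes termination observable.
        self.terminated.add(container_id)

    def remove_nested_container(self, container_id):
        if container_id not in self.terminated:
            raise RuntimeError("Nested container has not terminated yet")
        self.removed.append(container_id)


def cleanup_previous_check(agent, container_id):
    """Always wait for the container before trying to remove it."""
    agent.wait_nested_container(container_id)
    agent.remove_nested_container(container_id)


agent = FakeAgent()
cleanup_previous_check(agent, "check-1")
```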





[jira] [Assigned] (MESOS-9191) Docker command executor may stuck at infinite unkillable loop.

2018-08-29 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9191:
-

Shepherd: Qian Zhang
Assignee: Andrei Budnik
  Sprint: Mesosphere Sprint 2018-28

[~abudnik] Would you have cycles in the next sprint to work on this?

> Docker command executor may stuck at infinite unkillable loop.
> --
>
> Key: MESOS-9191
> URL: https://issues.apache.org/jira/browse/MESOS-9191
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Reporter: Gilbert Song
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: containerizer
>
> Due to the change from https://issues.apache.org/jira/browse/MESOS-8574, the 
> behavior of the docker command executor when discarding the future of docker 
> stop was changed. If a new killTask() is invoked while an existing docker 
> stop is pending, the old one is discarded and the new one is executed. This 
> is OK in most cases.
> However, docker stop could take a long time (depending on the grace period and 
> whether the application can handle SIGTERM). If the framework retries killTask 
> more frequently than the grace period (which depends on the killpolicy API, env 
> var, or agent flags), then the executor may be stuck forever with unkillable 
> tasks, because every time, before the docker stop finishes, its future is 
> discarded by the new incoming killTask.
> We should consider reusing the grace period before calling discard() on a 
> pending docker stop future.
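
The suggestion can be sketched as follows, under the assumption that a pending stop should be left alone until its own grace period has elapsed. The class `DockerStopper`, its counters, and the injected clock are hypothetical and only model the decision; they are not the docker executor's implementation.

```python
class DockerStopper:
    """Toy model: keep a pending stop until its grace period has elapsed."""

    def __init__(self, grace_period_secs, clock):
        self.grace_period_secs = grace_period_secs
        self.clock = clock          # callable returning the current time
        self.pending_since = None   # start time of the pending stop, if any
        self.stops_started = 0
        self.stops_discarded = 0

    def kill_task(self):
        now = self.clock()
        if self.pending_since is not None:
            if now - self.pending_since < self.grace_period_secs:
                # The pending stop still has time left: do not discard it.
                return "kept pending stop"
            self.stops_discarded += 1
        self.pending_since = now
        self.stops_started += 1
        return "started new stop"


# The framework retries every 2 seconds against a 10 second grace period.
times = iter([0, 2, 4, 11])
stopper = DockerStopper(grace_period_secs=10, clock=lambda: next(times))
results = [stopper.kill_task() for _ in range(4)]
```

With the behavior described in the issue, each retry would discard the pending stop; here only the retry that arrives after the grace period does.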





[jira] [Commented] (MESOS-9185) An attempt to remove or destroy container in composing containerizer leads to segfault

2018-08-28 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595174#comment-16595174
 ] 

Vinod Kone commented on MESOS-9185:
---

What versions are affected by this bug? Should this be backported?

> An attempt to remove or destroy container in composing containerizer leads to 
> segfault
> --
>
> Key: MESOS-9185
> URL: https://issues.apache.org/jira/browse/MESOS-9185
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.8.0
>
>
> `LAUNCH_NESTED_CONTAINER` and `LAUNCH_NESTED_CONTAINER_SESSION` lead to a 
> segfault in the agent when the parent container is unknown to the composing 
> containerizer. If the parent container cannot be found during an attempt to 
> launch a nested container via `ComposingContainerizerProcess::launch()`, the 
> composing containerizer returns an error without cleaning up the container. On 
> `launch()` failures, the agent calls `destroy()`, which accesses the 
> uninitialized `containerizer` field.





[jira] [Created] (MESOS-9181) Fix the comment in JNI libraries regarding weak reference and GC

2018-08-23 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9181:
-

 Summary: Fix the comment in JNI libraries regarding weak reference 
and GC
 Key: MESOS-9181
 URL: https://issues.apache.org/jira/browse/MESOS-9181
 Project: Mesos
  Issue Type: Documentation
Reporter: Vinod Kone


Our JNI libraries for MesosSchedulerDriver, v0Mesos and v1Mesos all use weak 
global references to the underlying Java objects, but they incorrectly state 
that this will prevent the JVM from GC'ing them. We need to fix these comments.

e.g., 
[https://github.com/apache/mesos/blob/master/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L213]

 

See the JNI spec for details: 
[https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#weak]
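
As a rough analogy only (this is Python's `weakref`, not JNI, and the class name is hypothetical), the following sketch shows the general property the comments get wrong: a weak reference does not keep its referent alive once the last strong reference is gone.

```python
import gc
import weakref


class SchedulerObject:
    """Hypothetical stand-in for the object a native library points at."""


strong = SchedulerObject()
weak = weakref.ref(strong)

# While a strong reference exists, the weak reference resolves to the object.
alive_before = weak() is not None

# Dropping the last strong reference makes the object collectable; the
# weak reference does nothing to prevent that.
del strong
gc.collect()
alive_after = weak() is not None
```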



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-23 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590513#comment-16590513
 ] 

Vinod Kone commented on MESOS-8568:
---

Great repro!

One orthogonal question though: it seems unfortunate that IOSwitchboard takes 
5s to complete its cleanup for a container that has failed to launch. IIRC 
there was a 5s timeout in IOSwitchboard for some unexpected corner cases, which 
is what we seem to be hitting here, but this is an *expected* case in some 
sense. Is there any way we can speed that up?

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Andrei Budnik
>Assignee: Qian Zhang
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>





[jira] [Assigned] (MESOS-9177) Mesos master segfaults when responding to /state requests.

2018-08-22 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-9177:
-

Shepherd: Alexander Rukletsov
Assignee: Benno Evers
  Sprint: Mesosphere Sprint 2018-27
Story Points: 3
Target Version/s: 1.7.0

> Mesos master segfaults when responding to /state requests.
> --
>
> Key: MESOS-9177
> URL: https://issues.apache.org/jira/browse/MESOS-9177
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
> stack trace: ***
>  @ 0x7f367e7226d0 (unknown)
>  @ 0x7f3681266913 
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @ 0x7f3681266af0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36812882d0 
> mesos::internal::master::FullFrameworkWriter::operator()()
>  @ 0x7f36812889d0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f368121aef0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f3681241be3 
> _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
>  @ 0x7f3681242760 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
>  @ 0x7f368215f60e process::http::OK::OK()
>  @ 0x7f3681219061 
> _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
>  @ 0x7f36812212c0 
> _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
>  @ 0x7f36812215ac 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_
>  @ 0x7f36821f3541 process::ProcessBase::consume()
>  @ 0x7f3682209fbc process::ProcessManager::resume()
>  @ 0x7f368220fa76 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>  @ 0x7f367eefc2b0 (unknown)
>  @ 0x7f367e71ae25 start_thread
>  @ 0x7f367e444bad __clone
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2018-08-21 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588096#comment-16588096
 ] 

Vinod Kone commented on MESOS-4065:
---

Looks like we need to upgrade ZooKeeper to at least 3.5.4 to get this.

> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.24.1, 0.25.0, 1.2.2
> Reporter: James DeFelice
> Priority: Major
> Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ? Rsl  21:32  13:51 ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ? Sl   21:32   0:03 ./etcd --data-dir=etcd_data --name=etcd-1449178273 --listen-peer-urls=http://10.0.0.45:1025 --initial-advertise-peer-urls=http://10.0.0.45:1025 --listen-client-urls=http://10.0.0.45:1026 --advertise-client-urls=http://10.0.0.45:1026 --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025 --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0 S+   21:46   0:00 grep --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0 TCP ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181 (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ? Ssl  21:11   0:04 /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0 S+   21:46   0:00 grep --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0 TCP ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181 (ESTABLISHED)
> {code}
> I only tested against Mesos 0.24.1 and 0.25.0.
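
The `lsof` output above shows the same socket (descriptor 10u, node 21973) open in both the agent and the executor. The following is a minimal, generic sketch of how a descriptor can end up in a child process, and how the close-on-exec flag prevents that. It is not Mesos or ZooKeeper code, and it assumes (the report does not confirm this) that the leak comes from a socket lacking close-on-exec. The helper names `child_sees_fd` and `demo` are hypothetical.

```python
import fcntl
import os
import socket
import subprocess
import sys


def child_sees_fd(fd: int) -> bool:
    """Spawn a child process and report whether `fd` is open inside it."""
    probe = (
        "import os, sys\n"
        "try:\n"
        "    os.fstat(int(sys.argv[1]))\n"
        "    print('open')\n"
        "except OSError:\n"
        "    print('closed')\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", probe, str(fd)],
        close_fds=False,  # do not bulk-close; rely on per-descriptor flags
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "open"


def demo():
    """Return (leaked, protected) for one socket with and without close-on-exec."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Duplicate onto a high descriptor number so the child cannot happen to
    # have an unrelated descriptor open under the same number.
    fd = fcntl.fcntl(sock.fileno(), fcntl.F_DUPFD, 100)
    try:
        os.set_inheritable(fd, True)    # clears FD_CLOEXEC: survives exec
        leaked = child_sees_fd(fd)
        os.set_inheritable(fd, False)   # sets FD_CLOEXEC: closed on exec
        protected = child_sees_fd(fd)
        return leaked, protected
    finally:
        os.close(fd)
        sock.close()


if __name__ == "__main__":
    print(demo())
```

In this sketch the child inherits the descriptor only while close-on-exec is cleared; once the flag is set, the same descriptor number is closed in the child.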





[jira] [Assigned] (MESOS-8729) Libprocess: deadlock in process::finalize

2018-08-20 Thread Vinod Kone (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8729:
-

Shepherd: Benjamin Mahler
Assignee: Andrei Budnik
  Sprint: Mesosphere Sprint 2018-28
Story Points: 3

> Libprocess: deadlock in process::finalize
> -
>
> Key: MESOS-8729
> URL: https://issues.apache.org/jira/browse/MESOS-8729
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Affects Versions: 1.6.0
> Environment: The issue has been reproduced on Ubuntu 16.04, master 
> branch, commit `42848653b2`. 
> Reporter: Andrei Budnik
> Assignee: Andrei Budnik
> Priority: Major
> Labels: deadlock, libprocess
> Attachments: deadlock.txt
>
>
> Since we call 
> [`process::finalize()`|https://github.com/apache/mesos/blob/02ebf9986ab5ce883a71df72e9e3392a3e37e40e/src/slave/containerizer/mesos/io/switchboard_main.cpp#L157]
>  before returning from the IOSwitchboard's main function, we expect all 
> HTTP responses to be sent back to clients before the IOSwitchboard 
> terminates. However, after [adding|https://reviews.apache.org/r/66147/] 
> `process::finalize()`, we have seen that the IOSwitchboard might get stuck in 
> `process::finalize()`. See the attached stack trace (deadlock.txt).




