[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.

2020-04-21 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089005#comment-17089005
 ] 

Charles Natali edited comment on MESOS-8038 at 4/21/20, 7:46 PM:
-

[~bmahler]

I have a way to reproduce it systematically, albeit very contrived: using 
syscall fault injection.

 

Basically, I continuously start tasks that allocate 1 GPU and just run "exit 0" 
(see the attached Python framework).

 

Then I run the following to inject a few seconds of delay into every rmdir syscall 
made by the agent:

 
{noformat}
# strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null
{noformat}
 

After less than a minute, tasks start failing with this error:
{noformat}
Failed to launch container: Requested 1 gpus but only 0 available{noformat}
 

I'll try to see if I can find a simpler reproducer, but this seems to fail 
systematically for me.

 


was (Author: cf.natali):
[~bmahler]

I have a way to reproduce it systematically, albeit very contrived: using 
syscall fault injection.

 

Basically, I continuously start tasks that allocate 1 GPU and just run "exit 0" 
(see the attached Python framework).

 

Then I run the following to inject a few seconds of delay into every rmdir syscall 
made by the agent:

 
{noformat}
# strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null
{noformat}
 

After a few minutes, tasks start failing with this error:
{noformat}
Failed to launch container: Requested 1 gpus but only 0 available{noformat}
 

I'll try to see if I can find a simpler reproducer, but this seems to fail 
systematically for me.

 

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> This is followed by the executor getting killed and the tasks getting lost. This happens 
> even before the job starts. A quick search in the code base points to 
> something related to the GPU resource as the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.





[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.

2020-04-21 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089005#comment-17089005
 ] 

Charles Natali edited comment on MESOS-8038 at 4/21/20, 7:32 PM:
-

[~bmahler]

I have a way to reproduce it systematically, albeit very contrived: using 
syscall fault injection.

 

Basically, I continuously start tasks that allocate 1 GPU and just run "exit 0" 
(see the attached Python framework).

 

Then I run the following to inject a few seconds of delay into every rmdir syscall 
made by the agent:

 
{noformat}
# strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null
{noformat}
 

After a few minutes, tasks start failing with this error:
{noformat}
Failed to launch container: Requested 1 gpus but only 0 available{noformat}
 

I'll try to see if I can find a simpler reproducer, but this seems to fail 
systematically for me.

 


was (Author: cf.natali):
[~bmahler]

I have a way to reproduce it systematically, albeit very contrived: using 
syscall fault injection.

 

Basically, I continuously start tasks that allocate 1 GPU and just run "exit 0" 
(see the attached Python framework).

 

Then I run the following to inject a few seconds of delay into every rmdir syscall 
made by the agent:

 
{noformat}
# strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null
{noformat}
 

After a few minutes, tasks start failing with this error:

Failed to launch container: Requested 1 gpus but only 0 available

 

I'll try to see if I can find a simpler reproducer, but this seems to fail 
systematically for me.

 

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> This is followed by the executor getting killed and the tasks getting lost. This happens 
> even before the job starts. A quick search in the code base points to 
> something related to the GPU resource as the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.





[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2020-04-21 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089005#comment-17089005
 ] 

Charles Natali commented on MESOS-8038:
---

[~bmahler]

I have a way to reproduce it systematically, albeit very contrived: using 
syscall fault injection.

 

Basically, I continuously start tasks that allocate 1 GPU and just run "exit 0" 
(see the attached Python framework).

 

Then I run the following to inject a few seconds of delay into every rmdir syscall 
made by the agent:

 
{noformat}
# strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null
{noformat}
 

After a few minutes, tasks start failing with this error:

Failed to launch container: Requested 1 gpus but only 0 available

 

I'll try to see if I can find a simpler reproducer, but this seems to fail 
systematically for me.

 

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> This is followed by the executor getting killed and the tasks getting lost. This happens 
> even before the job starts. A quick search in the code base points to 
> something related to the GPU resource as the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.





[jira] [Commented] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU

2020-04-21 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088973#comment-17088973
 ] 

Charles Natali commented on MESOS-10119:


Some good news: I couldn't reproduce it - it turned out to be a bug in one of 
our legacy systems, which caused it to remove the agent's cgroups...

 

However, I did observe this particular failure as a consequence of the now-fixed 
https://issues.apache.org/jira/browse/MESOS-10107

 

> Marking as a duplicate of MESOS-8038.

 

Ah, let's close this one then.

 

> failure to destroy container can cause the agent to "leak" a GPU
> 
>
> Key: MESOS-10119
> URL: https://issues.apache.org/jira/browse/MESOS-10119
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Charles Natali
>Priority: Major
>
> At work we hit the following problem:
>  # cgroup for a task using the GPU isolation failed to be destroyed on OOM
>  # the agent continued advertising the GPU as available
> # all subsequent attempts to start tasks using a GPU fail with "Requested 1 
> gpus but only 0 available"
> Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950, so it can 
> be tackled separately; however, the fact that the agent leaks the 
> GPU is pretty bad, because the agent basically turns into /dev/null, failing all 
> subsequent tasks requesting a GPU.
>  
> See the logs:
>  
>  
> {noformat}
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 
> slave.cpp:6994] Termination of executor 
> 'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to clean up an 
> isolator when destroying container: Failed to destroy cgroups: Failed to get 
> nested cgroups: Failed to determine canonical path of 
> '/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such 
> file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 
> containerizer.cpp:2567] Skipping status for container 
> 8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 8ef00748-b640-4620-97dc-f719e9775e88
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 
> slave.cpp:6994] Termination of executor 
> 'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device 
> or resource busy
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 
> containerizer.cpp:2567] Skipping status for container 
> 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 
> slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor 
> 'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus 
> but only 0 available
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 
> memory.cpp:637] Listening on OOM events failed for container 
> 87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 17:00:07.966552 2150 
> containerizer.cpp:2421] Ignoring 

[jira] [Assigned] (MESOS-10113) OpenSSLSocketImpl with 'support_downgrade' waits for incoming bytes before accepting new connection.

2020-04-21 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10113:
---

Assignee: Benjamin Mahler

https://reviews.apache.org/r/72352/

> OpenSSLSocketImpl with 'support_downgrade' waits for incoming bytes before 
> accepting new connection.
> 
>
> Key: MESOS-10113
> URL: https://issues.apache.org/jira/browse/MESOS-10113
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Critical
>
> The accept loop in OpenSSLSocketImpl, when {{support_downgrade}} is enabled, 
> will wait for incoming bytes on the accepted socket before allowing 
> another socket to be accepted. This will lead to significant throughput 
> issues when accepting new connections (e.g. during a master failover), or may 
> block entirely if a client doesn't send any data for whatever reason.
> Marking as a bug due to the potential for blocking incoming connections.
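
To illustrate the shape of the problem described above, here is a generic sketch 
(plain POSIX sockets and std::thread, not OpenSSLSocketImpl or the libprocess 
API; the helper names are made up). The point is only structural: the accept 
loop hands each accepted socket off immediately, so peeking at one silent 
client's first bytes never stalls the next accept().
{code}
// Generic sketch only, not libprocess code.
#include <sys/socket.h>
#include <unistd.h>

#include <thread>

// Peek at the first byte without consuming it; 0x16 is the TLS handshake
// record type, anything else is treated as a plaintext (downgraded) client.
static void detectAndServe(int client)
{
  unsigned char first = 0;
  ssize_t n = recv(client, &first, 1, MSG_PEEK);
  if (n == 1 && first == 0x16) {
    // ... perform the SSL handshake on `client` ...
  } else {
    // ... serve `client` as a plaintext connection ...
  }
  close(client);
}

static void acceptLoop(int listener)
{
  while (true) {
    int client = accept(listener, nullptr, nullptr);
    if (client < 0) {
      continue;  // Transient error; keep accepting.
    }

    // Crucially, do NOT wait for this client's first bytes here: a silent
    // client would otherwise block every other incoming connection.
    std::thread(detectAndServe, client).detach();
  }
}
{code}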





[jira] [Assigned] (MESOS-10124) OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling for read readiness.

2020-04-21 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-10124:
---

Assignee: Benjamin Mahler

> OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling 
> for read readiness.
> 
>
> Key: MESOS-10124
> URL: https://issues.apache.org/jira/browse/MESOS-10124
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: windows
>
> OpenSSLSocket is currently using the zero byte read trick on Windows to poll 
> for read readiness when peeking at the data to determine whether the incoming 
> connection is performing an SSL handshake. However, io::read is designed to 
> provide consistent semantics for a zero byte read across posix and windows, 
> which is to return immediately.
> To fix this, we can either:
> (1) Have different semantics for zero byte io::read on posix / windows, where 
> we just let it fall through to the system calls. This might be confusing for 
> users, but it's unlikely that a caller would perform a zero byte read in 
> typical code so the confusion is probably avoided.
> (2) Implement io::poll for reads on windows. This would make the caller code 
> consistent and is probably less confusing to users.





[jira] [Created] (MESOS-10124) OpenSSLSocketImpl on Windows with 'support_downgrade' is incorrectly polling for read readiness.

2020-04-21 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10124:
---

 Summary: OpenSSLSocketImpl on Windows with 'support_downgrade' is 
incorrectly polling for read readiness.
 Key: MESOS-10124
 URL: https://issues.apache.org/jira/browse/MESOS-10124
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Mahler


OpenSSLSocket is currently using the zero byte read trick on Windows to poll 
for read readiness when peeking at the data to determine whether the incoming 
connection is performing an SSL handshake. However, io::read is designed to 
provide consistent semantics for a zero byte read across posix and windows, 
which is to return immediately.

To fix this, we can either:

(1) Have different semantics for zero byte io::read on posix / windows, where 
we just let it fall through to the system calls. This might be confusing for 
users, but it's unlikely that a caller would perform a zero byte read in 
typical code so the confusion is probably avoided.

(2) Implement io::poll for reads on windows. This would make the caller code 
consistent and is probably less confusing to users.
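
As a rough illustration of what option (2) could amount to at the Winsock level 
(a standalone sketch, not the libprocess io::poll API; the helper names are made 
up and WSAStartup is assumed to have been called), read readiness can be awaited 
with WSAPoll before the peek:
{code}
// Standalone Winsock sketch: wait for read readiness with WSAPoll instead of
// a zero byte read, then peek without consuming the data.
#include <winsock2.h>
#pragma comment(lib, "ws2_32.lib")

// Hypothetical helper: returns true once `s` is readable (or hung up/errored),
// false on timeout. WSAPoll's timeout is in milliseconds; -1 waits forever.
static bool awaitReadReady(SOCKET s, INT timeoutMs)
{
  WSAPOLLFD pfd = {};
  pfd.fd = s;
  pfd.events = POLLRDNORM;

  int ready = WSAPoll(&pfd, 1, timeoutMs);
  return ready > 0 && (pfd.revents & (POLLRDNORM | POLLHUP | POLLERR)) != 0;
}

// Once readable, peek at the first byte to classify the connection:
// 0x16 is the TLS handshake record type, anything else is plaintext.
static bool looksLikeTlsHandshake(SOCKET s)
{
  char first = 0;
  int n = recv(s, &first, 1, MSG_PEEK);
  return n == 1 && static_cast<unsigned char>(first) == 0x16;
}
{code}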





[jira] [Created] (MESOS-10123) Windows overlapped IO discard handling can drop data.

2020-04-21 Thread Benjamin Mahler (Jira)
Benjamin Mahler created MESOS-10123:
---

 Summary: Windows overlapped IO discard handling can drop data.
 Key: MESOS-10123
 URL: https://issues.apache.org/jira/browse/MESOS-10123
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


When a discard request arrives for an I/O operation on Windows, a cancellation 
is requested [1], and when the I/O operation completes we check whether the 
future had a discard request to decide whether to discard it [2]:

{code}
template <typename T>
static void set_io_promise(Promise<T>* promise, const T& data, DWORD error)
{
  if (promise->future().hasDiscard()) {
    promise->discard();
  } else if (error == ERROR_SUCCESS) {
    promise->set(data);
  } else {
    promise->fail("IO failed with error code: " + WindowsError(error).message);
  }
}
{code}

However, it's possible the operation completed successfully, in which case we 
did not succeed at canceling it. We need to check for 
{{ERROR_OPERATION_ABORTED}} [3]:

{code}
template <typename T>
static void set_io_promise(Promise<T>* promise, const T& data, DWORD error)
{
  if (promise->future().hasDiscard() && error == ERROR_OPERATION_ABORTED) {
    promise->discard();
  } else if (error == ERROR_SUCCESS) {
    promise->set(data);
  } else {
    promise->fail("IO failed with error code: " + WindowsError(error).message);
  }
}
{code}

I don't think there are currently any major consequences to this issue, since 
most callers tend to be discarding only when they're essentially abandoning the 
entire process of reading or writing.

[1] 
https://github.com/apache/mesos/blob/1.9.0/3rdparty/libprocess/src/windows/libwinio.cpp#L448
[2] 
https://github.com/apache/mesos/blob/1.9.0/3rdparty/libprocess/src/windows/libwinio.cpp#L141-L151
[3] https://docs.microsoft.com/en-us/windows/win32/fileio/cancelioex-func
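
For reference, a minimal standalone Win32 sketch (not Mesos code; the file name 
and flow are hypothetical) of the cancellation race the fix has to handle: 
CancelIoEx only requests cancellation, so a completion that wins the race still 
carries valid data, and only ERROR_OPERATION_ABORTED means the operation was 
actually aborted.
{code}
// Standalone sketch: a cancelled overlapped read may still complete successfully.
#include <windows.h>

#include <cstdio>

int main()
{
  HANDLE file = CreateFileA(
      "input.dat", GENERIC_READ, FILE_SHARE_READ, nullptr,
      OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
  if (file == INVALID_HANDLE_VALUE) {
    return 1;
  }

  char buffer[4096];
  OVERLAPPED overlapped = {};

  if (!ReadFile(file, buffer, sizeof(buffer), nullptr, &overlapped) &&
      GetLastError() != ERROR_IO_PENDING) {
    return 1;  // Immediate failure.
  }

  // A discard request would trigger this; it only *requests* cancellation.
  CancelIoEx(file, &overlapped);

  DWORD transferred = 0;
  if (GetOverlappedResult(file, &overlapped, &transferred, TRUE)) {
    // The read won the race: `transferred` bytes are valid. Treating this
    // completion as discarded is exactly how data gets dropped.
    printf("completed with %lu bytes\n", transferred);
  } else if (GetLastError() == ERROR_OPERATION_ABORTED) {
    // Cancellation actually took effect; safe to treat as discarded.
    printf("aborted\n");
  } else {
    printf("failed: %lu\n", GetLastError());
  }

  CloseHandle(file);
  return 0;
}
{code}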





[jira] [Created] (MESOS-10122) cmake+MSBuild is incapable of building all Mesos sources in parallel.

2020-04-21 Thread Andrei Sekretenko (Jira)
Andrei Sekretenko created MESOS-10122:
-

 Summary: cmake+MSBuild is incapable of building all Mesos sources 
in parallel.
 Key: MESOS-10122
 URL: https://issues.apache.org/jira/browse/MESOS-10122
 Project: Mesos
  Issue Type: Bug
  Components: build
 Environment: When a library (in cmake's sense) contains several 
sources with different paths but the same filename (for example,  
slave/validation.cpp and resource_provider/validation.cpp), the build generated 
by CMake for MSVC does not allow for building those files in parallel 
(presumably, because the .obj files will be located in the same directory).

This has been observed with both cmake 3.9 and 3.17, with the "Visual Studio 
15 2017 Win64" generator. 

It seems to be a known behaviour - see 
https://stackoverflow.com/questions/7033855/msvc10-mp-builds-not-multicore-across-folders-in-a-project.

Two options for fixing this in a way that will work with these cmake/MSVC 
configurations are:
 - splitting the build into small static libraries (a library per directory)
 - introducing an intermediate code-generation-like step optionally flattening 
the directory structure (slave/validation.cpp -> slave_validation.cpp)

Both options have their drawbacks:
 - The first will change the layout of the static build artifacts (mesos.lib 
will be replaced with a ton of smaller libraries), which will pose integration 
challenges and potentially result in worse parallelism.
 - The second will mean being unable to use #include without a path (right now 
there are three or four such #includes in the whole Mesos/libprocess code 
buildable on Windows) and a changed value of the __FILE__ macro (as a 
consequence, in the example above, `validation.cpp` in logs would be replaced 
either with `slave_validation.cpp` or with `resource_provider_validation.cpp`).

Note that the second approach will need to deal with potential collisions when 
the source tree has filenames with underscores. If, for example, we had both 
slave/validation.cpp and slave_validation.cpp, then either some additional 
escaping would be needed, or alternatively such a layout could simply be 
forbidden (and made to fail the build).

Preliminary testing shows that on an 8-core AWS instance, flattening the source 
trees of libprocess, mesos-protobufs and libmesos speeds up a clean build from 
54 minutes to 33 minutes.
Reporter: Andrei Sekretenko
Assignee: Andrei Sekretenko








[jira] [Created] (MESOS-10121) stdout/stderr not rotating

2020-04-21 Thread Evgeny (Jira)
Evgeny created MESOS-10121:
--

 Summary: stdout/stderr not rotating
 Key: MESOS-10121
 URL: https://issues.apache.org/jira/browse/MESOS-10121
 Project: Mesos
  Issue Type: Task
Reporter: Evgeny


Hello, I am trying to rotate my stdout/stderr files in Mesos containers.

Starting mesos-slave:

{code}
docker run -d  mesos-slave:1.4 
--container_logger=org_apache_mesos_LogrotateContainerLogger 
--modules=/var/tmp/mesos/mesos-slave-modules.json
{code}
config for rotation:
{code}
cat /var/tmp/mesos/mesos-slave-modules.json:
{
 "libraries": [{
 "file": "/usr/lib/liblogrotate_container_logger.so",
 "modules": [{
 "name": "org_apache_mesos_LogrotateContainerLogger",
 "parameters": [{
 "key": "launcher_dir",
 "value": "/usr/libexec/mesos"
 }, {
 "key": "logrotate_path",
 "value": "/usr/sbin/logrotate"
 }, {
 "key": "max_stdout_size",
 "value": "10240MB"
 }, {
 "key": "max_stderr_size",
 "value": "10240MB"
 }, {
 "key": "logrotate_stdout_options",
 "value": "rotate 5\nmissingok\nnotifempty\ncompress\nnomail\n"
 }, {
 "key": "logrotate_stderr_options",
 "value": "rotate 5\nmissingok\nnotifempty\ncompress\nnomail\n"
 }]
 }]
 }]
}
{code}

mesos-logrotate-logger is running for all containers:
{code}
root 733 0.1 0.0 841724 25020 ? Ssl Apr19 4:02 mesos-logrotate-logger 
--help=false 
--log_filename=/var/tmp/mesos/slaves/c931deec-e65a-4362-9c6f-d4b278f52f5b-S0/frameworks/cb0d4342-fcf5-4e6d-abf2-764b8c5b8cf3-/executors/gateway.478dac1a-8276-11ea-b32f-0242ac110002/runs/60bc8179-3051-4214-8902-7d9747f1713e/stdout
 --logrotate_options=rotate 5 missingok notifempty compress nomail 
--logrotate_path=/usr/sbin/logrotate --max_size=10GB --user=root
...
{code}

The 10 GB limit has been exceeded:
{code}
ls -lh 
/var/tmp/mesos/slaves/c931deec-e65a-4362-9c6f-d4b278f52f5b-S0/frameworks/cb0d4342-fcf5-4e6d-abf2-764b8c5b8cf3-/executors/gateway.478dac1a-8276-11ea-b32f-0242ac110002/runs/60bc8179-3051-4214-8902-7d9747f1713e/stdout
-rw-r--r-- 1 root root 12G Apr 21 10:57 
/var/tmp/mesos/slaves/c931deec-e65a-4362-9c6f-d4b278f52f5b-S0/frameworks/cb0d4342-fcf5-4e6d-abf2-764b8c5b8cf3-/executors/gateway.478dac1a-8276-11ea-b32f-0242ac110002/runs/60bc8179-3051-4214-8902-7d9747f1713e/stdout
{code}

logrotate is available (--logrotate_path=/usr/sbin/logrotate).

But rotation does not occur. I can't find the problem. Any tips, please?





[jira] [Comment Edited] (MESOS-10054) Update Docker containerizer to set Docker container’s resource limits and `oom_score_adj`

2020-04-21 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087845#comment-17087845
 ] 

Qian Zhang edited comment on MESOS-10054 at 4/21/20, 1:22 PM:
--

RR: 

[https://reviews.apache.org/r/72401/]

[https://reviews.apache.org/r/72391/]


was (Author: qianzhang):
RR: [https://reviews.apache.org/r/72391/]

> Update Docker containerizer to set Docker container’s resource limits and 
> `oom_score_adj`
> -
>
> Key: MESOS-10054
> URL: https://issues.apache.org/jira/browse/MESOS-10054
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>
> This is to set resource limits for executor which will run as a Docker 
> container.





[jira] [Commented] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU

2020-04-21 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088568#comment-17088568
 ] 

Andrei Budnik commented on MESOS-10119:
---

Could you reproduce the cgroups destruction problem consistently?
What are the kernel and systemd versions installed on your agents?

> failure to destroy container can cause the agent to "leak" a GPU
> 
>
> Key: MESOS-10119
> URL: https://issues.apache.org/jira/browse/MESOS-10119
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Charles Natali
>Priority: Major
>
> At work we hit the following problem:
>  # cgroup for a task using the GPU isolation failed to be destroyed on OOM
>  # the agent continued advertising the GPU as available
> # all subsequent attempts to start tasks using a GPU fail with "Requested 1 
> gpus but only 0 available"
> Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950, so it can 
> be tackled separately; however, the fact that the agent leaks the 
> GPU is pretty bad, because the agent basically turns into /dev/null, failing all 
> subsequent tasks requesting a GPU.
>  
> See the logs:
>  
>  
> {noformat}
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 
> slave.cpp:6994] Termination of executor 
> 'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to clean up an 
> isolator when destroying container: Failed to destroy cgroups: Failed to get 
> nested cgroups: Failed to determine canonical path of 
> '/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such 
> file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 
> containerizer.cpp:2567] Skipping status for container 
> 8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 8ef00748-b640-4620-97dc-f719e9775e88
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 
> slave.cpp:6994] Termination of executor 
> 'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device 
> or resource busy
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 
> containerizer.cpp:2567] Skipping status for container 
> 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 
> slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor 
> 'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus 
> but only 0 available
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 
> memory.cpp:637] Listening on OOM events failed for container 
> 87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 17:00:07.966552 2150 
> containerizer.cpp:2421] Ignoring update for unknown container 
> 87253521-8d39-47ea-b4d1-febe527d230c
> Apr 17 17:00:08 engpuc006 mesos-slave[2068]: W0417 17:00:08.109067 2154 
> process.cpp:1480] Failed to link to '172.16.22.201:34059', connect: Failed 
> connect: