[jira] [Updated] (MESOS-4869) /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory

2016-04-12  haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-4869:

Labels: health-check  (was: )

> /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory
> ---
>
> Key: MESOS-4869
> URL: https://issues.apache.org/jira/browse/MESOS-4869
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.1
>Reporter: Anthony Scalisi
>Priority: Critical
>  Labels: health-check
>

[jira] [Updated] (MESOS-4869) /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory

2016-03-04  Anthony Scalisi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Scalisi updated MESOS-4869:
---
Description: 
We switched our health checks in Marathon from HTTP to COMMAND:

{noformat}
"healthChecks": [
{
  "protocol": "COMMAND",
  "path": "/ops/ping",
  "command": { "value": "curl --silent -f -X GET 
http://$HOST:$PORT0/ops/ping > /dev/null" },
  "gracePeriodSeconds": 90,
  "intervalSeconds": 2,
  "portIndex": 0,
  "timeoutSeconds": 5,
  "maxConsecutiveFailures": 3
}
  ]
{noformat}

All our applications have the same health check (and /ops/ping endpoint).
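
To make the switch concrete: a COMMAND check just runs the configured curl and uses its exit status, with 0 meaning healthy and anything else counting toward maxConsecutiveFailures (the -f flag is what turns an HTTP error into a non-zero exit). A minimal sketch of what one check run amounts to, with hypothetical values standing in for Marathon's $HOST/$PORT0 substitution:

{noformat}
# Hypothetical values for illustration; Marathon substitutes the task's real
# host and allocated host port for $HOST and $PORT0.
HOST=10.92.32.63
PORT0=31839
curl --silent -f -X GET "http://$HOST:$PORT0/ops/ping" > /dev/null
echo "health check exit code: $?"   # non-zero is treated as an unhealthy check
{noformat}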

Even though we have the issue on all our Mesos slaves, I'm going to focus on a 
particular one: *mesos-slave-i-e3a9c724*.

The slave has 16 gigs of memory, with about 12 gigs allocated for 8 tasks:

!https://i.imgur.com/gbRf804.png!

Here is a *docker ps* on it:

{noformat}
root@mesos-slave-i-e3a9c724 # docker ps
CONTAINER ID   IMAGE    COMMAND                  CREATED        STATUS        PORTS                     NAMES
4f7c0aa8d03a   java:8   "/bin/sh -c 'JAVA_OPT"   6 hours ago    Up 6 hours    0.0.0.0:31926->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
66f2fc8f8056   java:8   "/bin/sh -c 'JAVA_OPT"   6 hours ago    Up 6 hours    0.0.0.0:31939->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
f7382f241fce   java:8   "/bin/sh -c 'JAVA_OPT"   6 hours ago    Up 6 hours    0.0.0.0:31656->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
880934c0049e   java:8   "/bin/sh -c 'JAVA_OPT"   24 hours ago   Up 24 hours   0.0.0.0:31371->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
5eab1f8dac4a   java:8   "/bin/sh -c 'JAVA_OPT"   46 hours ago   Up 46 hours   0.0.0.0:31500->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5ac75198-283f-4349-a220-9e9645b313e7
b63740fe56e7   java:8   "/bin/sh -c 'JAVA_OPT"   46 hours ago   Up 46 hours   0.0.0.0:31382->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
5c7a9ea77b0e   java:8   "/bin/sh -c 'JAVA_OPT"   2 days ago     Up 2 days     0.0.0.0:31186->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
53065e7a31ad   java:8   "/bin/sh -c 'JAVA_OPT"   2 days ago     Up 2 days     0.0.0.0:31839->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
{noformat}

Here is a *docker stats* on it:

{noformat}
root@mesos-slave-i-e3a9c724 # docker stats
CONTAINER      CPU %    MEM USAGE / LIMIT     MEM %    NET I/O               BLOCK I/O
4f7c0aa8d03a   2.93%    797.3 MB / 1.611 GB   49.50%   1.277 GB / 1.189 GB   155.6 kB / 151.6 kB
53065e7a31ad   8.30%    738.9 MB / 1.611 GB   45.88%   419.6 MB / 554.3 MB   98.3 kB / 61.44 kB
5c7a9ea77b0e   4.91%    1.081 GB / 1.611 GB   67.10%   423 MB / 526.5 MB     3.219 MB / 61.44 kB
5eab1f8dac4a   3.13%    1.007 GB / 1.611 GB   62.53%   2.737 GB / 2.564 GB   6.566 MB / 118.8 kB
66f2fc8f8056   3.15%    768.1 MB / 1.611 GB   47.69%   258.5 MB / 252.8 MB   1.86 MB / 151.6 kB
880934c0049e   10.07%   735.1 MB / 1.611 GB   45.64%   1.451 GB / 1.399 GB   573.4 kB / 94.21 kB
b63740fe56e7   12.04%   629 MB / 1.611 GB     39.06%   10.29 GB / 9.344 GB   8.102 MB / 61.44 kB
f7382f241fce   6.21%    505 MB / 1.611 GB     31.36%   153.4 MB / 151.9 MB   5.837 MB / 94.21 kB
{noformat}

Not much else is running on the slave, yet the used memory doesn't map to the tasks' memory:

{noformat}
Mem:16047M used:13340M buffers:1139M cache:776M
{noformat}
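
To make the gap concrete, the host view and the per-process resident memory can be compared with something along these lines (a rough sketch; it assumes standard procps tools and that this Docker version supports docker stats --no-stream):

{noformat}
# Host view: used vs buffers/cache (the summary above came from a similar tool).
free -m

# Per-container view (--no-stream is assumed to be supported by this Docker version).
docker stats --no-stream

# Sum resident memory per command name to see which processes hold the memory
# that the task containers do not account for (ps reports RSS in KB).
ps -eo rss,comm --no-headers \
  | awk '{ sum[$2] += $1 } END { for (c in sum) printf "%8.0f MB  %s\n", sum[c]/1024, c }' \
  | sort -rn | head
{noformat}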


If I exec into one of the containers (*java:8* image), I can see the shell calls that 
execute the curl specified in the health check; they run as expected and exit correctly.

The only change that coincided with the memory usage woes was moving to Mesos-run 
health checks, so I decided to take a look:

{noformat}
root@mesos-slave-i-e3a9c724 # ps awwx | grep health_check | grep -v grep
 2504 ?        Sl    47:33 /usr/libexec/mesos/mesos-health-check --executor=(1)@10.92.32.63:53432 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping >
