[jira] [Assigned] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-07-04 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner reassigned MESOS-5294:


Assignee: haosdent  (was: Travis Hegner)

> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: haosdent
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> -Digging through src/docker/executor.cpp, I found that the 
> {{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
> status instance with-
> {code}status.mutable_task_id()->CopyFrom(taskID);{code}
> -but other instances of status updates have a similar line-
> {code}status.mutable_task_id()->CopyFrom(taskID.get());{code}
> -My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.-
> -I'll try to get a patch together soon.-
> UPDATE:
> None of the above assumptions are correct. Something else is causing the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-04-28 Thread Travis Hegner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262187#comment-15262187
 ] 

Travis Hegner commented on MESOS-5294:
--

While marathon shows the task in an "unknown" state during its grace period, 
the {{statuses}} array looks like this:
{code}
  "statuses": [
{
  "state": "TASK_RUNNING",
  "timestamp": 1461850431.76103,
  "labels": [
{
  "key": "Docker.NetworkSettings.IPAddress",
  "value": "10.1.9.16"
}
  ],
  "container_status": {
"network_infos": [
  {
"ip_address": "10.1.9.16",
"ip_addresses": [
  {
"ip_address": "10.1.9.16"
  }
]
  }
]
  }
}
  ],
{code}

Mind you, I'm running some custom networking, and I do _want_ docker's IP 
address to come through, not the host's IP. Therefore my {{config.json}} for 
{{mesos-dns}} contains {{"IPSources":\["docker"\]}}.

After the task status is marked healthy, my {{statuses}} array looks like this:
{code}
  "statuses": [
{
  "state": "TASK_RUNNING",
  "timestamp": 1461850492.06737,
  "container_status": {
"network_infos": [
  {
"ip_address": "10.1.8.54",
"ip_addresses": [
  {
"ip_address": "10.1.8.54"
  }
]
  }
]
  },
  "healthy": true
}
  ],
{code}

Notice the IP change in {{network_infos}}. This new IP belongs to the host, not 
to the container itself. It's as if the status update that gets generated by 
the health check process is doing its own IP address discovery, different from 
the IP discovery code in MESOS-4370.

> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> -Digging through src/docker/executor.cpp, I found that the 
> {{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
> status instance with-
> {code}status.mutable_task_id()->CopyFrom(taskID);{code}
> -but other instances of status updates have a similar line-
> {code}status.mutable_task_id()->CopyFrom(taskID.get());{code}
> -My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.-
> -I'll try to get a patch together soon.-
> UPDATE:
> None of the above assumptions are correct. Something else is causing the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-04-28 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-5294:
-
Remaining Estimate: (was: 2h)
 Original Estimate: (was: 2h)

> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> -Digging through src/docker/executor.cpp, I found that the 
> {{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
> status instance with-
> {code}status.mutable_task_id()->CopyFrom(taskID);{code}
> -but other instances of status updates have a similar line-
> {code}status.mutable_task_id()->CopyFrom(taskID.get());{code}
> -My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.-
> -I'll try to get a patch together soon.-
> UPDATE:
> None of the above assumptions are correct. Something else is causing the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-04-28 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-5294:
-
Description: 
With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

-Digging through src/docker/executor.cpp, I found that the 
{{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
status instance with-

{code}status.mutable_task_id()->CopyFrom(taskID);{code}

-but other instances of status updates have a similar line-

{code}status.mutable_task_id()->CopyFrom(taskID.get());{code}

-My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.-

-I'll try to get a patch together soon.-

UPDATE:
None of the above assumptions are correct. Something else is causing the issue.

  was:
With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

-Digging through src/docker/executor.cpp, I found that the 
{{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
status instance with-

{code}-status.mutable_task_id()->CopyFrom(taskID);-{code}

-but other instances of status updates have a similar line-

{code}-status.mutable_task_id()->CopyFrom(taskID.get());-{code}

-My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.-

-I'll try to get a patch together soon.-


> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> -Digging through src/docker/executor.cpp, I found that the 
> {{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
> status instance with-
> {code}status.mutable_task_id()->CopyFrom(taskID);{code}
> -but other instances of status updates have a similar line-
> {code}status.mutable_task_id()->CopyFrom(taskID.get());{code}
> -My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.-
> -I'll try to get a patch together soon.-
> UPDATE:
> None of the above assumptions are correct. Something else is causing the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-04-28 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-5294:
-
Description: 
With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

-Digging through src/docker/executor.cpp, I found that the 
{{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
status instance with-

{code}-status.mutable_task_id()->CopyFrom(taskID);-{code}

-but other instances of status updates have a similar line-

{code}-status.mutable_task_id()->CopyFrom(taskID.get());-{code}

-My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.-

-I'll try to get a patch together soon.-

  was:
With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

-Digging through src/docker/executor.cpp, I found that the 
{{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
status instance with-

-{code}status.mutable_task_id()->CopyFrom(taskID);{code}-

 but other instances of status updates have a similar line

-{code}status.mutable_task_id()->CopyFrom(taskID.get());{code}-

-My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.-

-I'll try to get a patch together soon.-


> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> -Digging through src/docker/executor.cpp, I found that the 
> {{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
> status instance with-
> {code}-status.mutable_task_id()->CopyFrom(taskID);-{code}
> -but other instances of status updates have a similar line-
> {code}-status.mutable_task_id()->CopyFrom(taskID.get());-{code}
> -My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.-
> -I'll try to get a patch together soon.-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-04-28 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-5294:
-
Description: 
With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

-Digging through src/docker/executor.cpp, I found that the 
{{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
status instance with-

-{code}status.mutable_task_id()->CopyFrom(taskID);{code}-

 but other instances of status updates have a similar line

-{code}status.mutable_task_id()->CopyFrom(taskID.get());{code}-

-My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.-

-I'll try to get a patch together soon.-

  was:
With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

-Digging through src/docker/executor.cpp, I found that the 
{{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
status instance with

{code}status.mutable_task_id()->CopyFrom(taskID);{code}

 but other instances of status updates have a similar line

{code}status.mutable_task_id()->CopyFrom(taskID.get());{code}

My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.

I'll try to get a patch together soon.-


> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> -Digging through src/docker/executor.cpp, I found that the 
> {{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
> status instance with-
> -{code}status.mutable_task_id()->CopyFrom(taskID);{code}-
>  but other instances of status updates have a similar line
> -{code}status.mutable_task_id()->CopyFrom(taskID.get());{code}-
> -My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.-
> -I'll try to get a patch together soon.-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-04-28 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-5294:
-
Description: 
With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

-Digging through src/docker/executor.cpp, I found that the 
{{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
status instance with

{code}status.mutable_task_id()->CopyFrom(taskID);{code}

 but other instances of status updates have a similar line

{code}status.mutable_task_id()->CopyFrom(taskID.get());{code}

My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.

I'll try to get a patch together soon.-

  was:
With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

Digging through src/docker/executor.cpp, I found that the 
{{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
status instance with

{code}status.mutable_task_id()->CopyFrom(taskID);{code}

 but other instances of status updates have a similar line

{code}status.mutable_task_id()->CopyFrom(taskID.get());{code}

My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.

I'll try to get a patch together soon.


> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> -Digging through src/docker/executor.cpp, I found that the 
> {{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
> status instance with
> {code}status.mutable_task_id()->CopyFrom(taskID);{code}
>  but other instances of status updates have a similar line
> {code}status.mutable_task_id()->CopyFrom(taskID.get());{code}
> My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.
> I'll try to get a patch together soon.-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-04-28 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-5294:
-
Description: 
With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

Digging through src/docker/executor.cpp, I found that the 
{{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
status instance with

{code}status.mutable_task_id()->CopyFrom(taskID);{code}

 but other instances of status updates have a similar line

{code}status.mutable_task_id()->CopyFrom(taskID.get());{code}

My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.

I'll try to get a patch together soon.

  was:
With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

Digging through src/docker/executor.cpp, I found that the 
"taskHealthUpdated()" function is attempting to copy the taskID to the new 
status instance with "status.mutable_task_id()->CopyFrom(taskID);", but other 
instances of status updates have a similar line 
"status.mutable_task_id()->CopyFrom(taskID.get());".

My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.

I'll try to get a patch together soon.


> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> Digging through src/docker/executor.cpp, I found that the 
> {{taskHealthUpdated()}} function is attempting to copy the taskID to the new 
> status instance with
> {code}status.mutable_task_id()->CopyFrom(taskID);{code}
>  but other instances of status updates have a similar line
> {code}status.mutable_task_id()->CopyFrom(taskID.get());{code}
> My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.
> I'll try to get a patch together soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-04-28 Thread Travis Hegner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262112#comment-15262112
 ] 

Travis Hegner commented on MESOS-5294:
--

[~gilbert] It's hard to know exactly when this started, as I was also affected 
by MESOS-4370. I was running my own patch to fix that issue on 0.27, and I know 
that health checks were causing the IP addresses to not be resolved with that 
version and docker 1.10, and perhaps 1.9 as well.

[~kaysoky] I understand that it's not directly dependent. Something about 
command-based health checks that run inside the docker container is causing the 
`/state` endpoint to lose the IP address only after the task is marked healthy. 
The IP address exists during the startup process, and is removed after the 
"healthy" status update. I'll try to get the requested info posted soon.

My original assumption in the description of this issue is slightly off too. As 
I looked closer, I realized that there is a local variable in the 
`taskHealthUpdated()` function called `taskID` of type `TaskID`, while the other 
references to similar lines of code were using the class-scoped `taskId` 
variable of type `Option<TaskID>`. Hence, the `.get()` call on those.

Obviously my original fix wouldn't compile, so I'm kind of back to the 
beginning of my troubleshooting.
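
A minimal sketch of that distinction (simplified; this is not the actual 
src/docker/executor.cpp code, just an illustration of why one call site needs 
`.get()` and the other doesn't):

{code}
// Sketch only: a plain TaskID can be copied directly, while an
// Option<TaskID> has to be unwrapped with .get() first.
#include <mesos/mesos.hpp>    // mesos::TaskID, mesos::TaskStatus
#include <stout/option.hpp>   // Option<T>

// Analogous to the health-check path, which holds a local, plain TaskID.
void healthCheckPath(const mesos::TaskID& taskID)
{
  mesos::TaskStatus status;
  status.mutable_task_id()->CopyFrom(taskID);           // no .get() needed
}

// Analogous to the other call sites, which use a class-scoped Option member.
void regularStatusPath(const Option<mesos::TaskID>& taskId)
{
  if (taskId.isSome()) {
    mesos::TaskStatus status;
    status.mutable_task_id()->CopyFrom(taskId.get());   // unwrap the Option
  }
}
{code}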

> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> Digging through src/docker/executor.cpp, I found that the 
> "taskHealthUpdated()" function is attempting to copy the taskID to the new 
> status instance with "status.mutable_task_id()->CopyFrom(taskID);", but other 
> instances of status updates have a similar line 
> "status.mutable_task_id()->CopyFrom(taskID.get());".
> My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.
> I'll try to get a patch together soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5294) Status updates after a health check are incomplete or invalid

2016-04-27 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-5294:
-
Summary: Status updates after a health check are incomplete or invalid  
(was: Status updates after a health check and incomplete)

> Status updates after a health check are incomplete or invalid
> -
>
> Key: MESOS-5294
> URL: https://issues.apache.org/jira/browse/MESOS-5294
> Project: Mesos
>  Issue Type: Bug
> Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
> ubuntu 14.04
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> With command health checks enabled via marathon, mesos-dns will resolve the 
> task correctly until the task is reported as "healthy". At that point, 
> mesos-dns stops resolving the task correctly.
> Digging through src/docker/executor.cpp, I found that the 
> "taskHealthUpdated()" function is attempting to copy the taskID to the new 
> status instance with "status.mutable_task_id()->CopyFrom(taskID);", but other 
> instances of status updates have a similar line 
> "status.mutable_task_id()->CopyFrom(taskID.get());".
> My assumption is that this difference is causing the status update after a 
> health check to not have a proper taskID, which in turn is causing an 
> incorrect state.json output.
> I'll try to get a patch together soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5294) Status updates after a health check and incomplete

2016-04-27 Thread Travis Hegner (JIRA)
Travis Hegner created MESOS-5294:


 Summary: Status updates after a health check and incomplete
 Key: MESOS-5294
 URL: https://issues.apache.org/jira/browse/MESOS-5294
 Project: Mesos
  Issue Type: Bug
 Environment: mesos 0.28.0, docker 1.11, marathon 0.15.3, mesos-dns, 
ubuntu 14.04
Reporter: Travis Hegner
Assignee: Travis Hegner


With command health checks enabled via marathon, mesos-dns will resolve the 
task correctly until the task is reported as "healthy". At that point, 
mesos-dns stops resolving the task correctly.

Digging through src/docker/executor.cpp, I found that the 
"taskHealthUpdated()" function is attempting to copy the taskID to the new 
status instance with "status.mutable_task_id()->CopyFrom(taskID);", but other 
instances of status updates have a similar line 
"status.mutable_task_id()->CopyFrom(taskID.get());".

My assumption is that this difference is causing the status update after a 
health check to not have a proper taskID, which in turn is causing an incorrect 
state.json output.

I'll try to get a patch together soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4370) NetworkSettings.IPAddress field is deprecated in Docker

2016-03-08 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-4370:
-
Affects Version/s: 0.28.0

> NetworkSettings.IPAddress field is deprecated in Docker
> ---
>
> Key: MESOS-4370
> URL: https://issues.apache.org/jira/browse/MESOS-4370
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
> Environment: Ubuntu 14.04
> Docker 1.9.1, Docker 1.10.x
>Reporter: Clint Armstrong
>Assignee: Travis Hegner
>  Labels: Blocker
>
> The latest docker API deprecates the NetworkSettings.IPAddress field, in 
> favor of the NetworkSettings.Networks field.
> https://docs.docker.com/engine/reference/api/docker_remote_api/#v1-21-api-changes
> With this deprecation, NetworkSettings.IPAddress is not populated for 
> containers running with networks that use new network plugins.
> As a result the mesos API has no data in 
> container_status.network_infos.ip_address or 
> container_status.network_infos.ipaddresses.
> The immediate impact of this is that mesos-dns is unable to retrieve a 
> container's IP from the netinfo interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4370) NetworkSettings.IPAddress field is deprecated in Docker

2016-03-08 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-4370:
-
Affects Version/s: (was: 0.28.0)

> NetworkSettings.IPAddress field is deprecated in Docker
> ---
>
> Key: MESOS-4370
> URL: https://issues.apache.org/jira/browse/MESOS-4370
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
> Environment: Ubuntu 14.04
> Docker 1.9.1, Docker 1.10.x
>Reporter: Clint Armstrong
>Assignee: Travis Hegner
>  Labels: Blocker
>
> The latest docker API deprecates the NetworkSettings.IPAddress field, in 
> favor of the NetworkSettings.Networks field.
> https://docs.docker.com/engine/reference/api/docker_remote_api/#v1-21-api-changes
> With this deprecation, NetworkSettings.IPAddress is not populated for 
> containers running with networks that use new network plugins.
> As a result the mesos API has no data in 
> container_status.network_infos.ip_address or 
> container_status.network_infos.ipaddresses.
> The immediate impact of this is that mesos-dns is unable to retrieve a 
> container's IP from the netinfo interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4370) NetworkSettings.IPAddress field is deprecated in Docker

2016-03-08 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-4370:
-
Environment: 
Ubuntu 14.04
Docker 1.9.1, Docker 1.10.x

  was:
Ubuntu 14.04
Docker 1.9.1


> NetworkSettings.IPAddress field is deprecated in Docker
> ---
>
> Key: MESOS-4370
> URL: https://issues.apache.org/jira/browse/MESOS-4370
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
> Environment: Ubuntu 14.04
> Docker 1.9.1, Docker 1.10.x
>Reporter: Clint Armstrong
>Assignee: Travis Hegner
>  Labels: Blocker
>
> The latest docker API deprecates the NetworkSettings.IPAddress field, in 
> favor of the NetworkSettings.Networks field.
> https://docs.docker.com/engine/reference/api/docker_remote_api/#v1-21-api-changes
> With this deprecation, NetworkSettings.IPAddress is not populated for 
> containers running with networks that use new network plugins.
> As a result the mesos API has no data in 
> container_status.network_infos.ip_address or 
> container_status.network_infos.ipaddresses.
> The immediate impact of this is that mesos-dns is unable to retrieve a 
> container's IP from the netinfo interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4370) NetworkSettings.IPAddress field is deprecated in Docker

2016-03-08 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-4370:
-
Labels: Blocker  (was: )

> NetworkSettings.IPAddress field is deprecated in Docker
> ---
>
> Key: MESOS-4370
> URL: https://issues.apache.org/jira/browse/MESOS-4370
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
> Environment: Ubuntu 14.04
> Docker 1.9.1, Docker 1.10.x
>Reporter: Clint Armstrong
>Assignee: Travis Hegner
>  Labels: Blocker
>
> The latest docker API deprecates the NetworkSettings.IPAddress field, in 
> favor of the NetworkSettings.Networks field.
> https://docs.docker.com/engine/reference/api/docker_remote_api/#v1-21-api-changes
> With this deprecation, NetworkSettings.IPAddress is not populated for 
> containers running with networks that use new network plugins.
> As a result the mesos API has no data in 
> container_status.network_infos.ip_address or 
> container_status.network_infos.ipaddresses.
> The immediate impact of this is that mesos-dns is unable to retrieve a 
> container's IP from the netinfo interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4370) NetworkSettings.IPAddress field is deprecated in Docker

2016-03-08 Thread Travis Hegner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184905#comment-15184905
 ] 

Travis Hegner commented on MESOS-4370:
--

Here is the latest comment on the review for this patch (but spell checked now 
:/): https://reviews.apache.org/r/43093/

I think I understand the miscommunication better now. As you know, beginning 
with docker 1.9, the docker inspect output changed the location of the 
IPAddress. The original location was still populated for backwards 
compatibility, but only for the common "bridge" and "host" network types. Mesos 
is written to fail with any other network type. With the new user-defined 
networks feature, the old location was not populated. My patch was originally 
intended to address the fact that with user-defined networks, the original IP 
location was null.

In order to utilize user-defined networks in my environment, we are passing 
arbitrary docker parameters to mesos with the docker containerizer from 
marathon. This results in multiple "--net" parameters passed to docker. The 
luck comes into play because mesos interprets the first --net parameter of 
"bridge" and succeeds, and docker interprets the second --net parameter of my 
UDN, and connects to the right network. I would consider this behavior unstable 
at best.

Based on the sudden uptick in interest in this patch, I am speculating that 
docker 1.10 is no longer populating the original IP address field (I would be 
unaware, because I've been running my cluster with this patch), which this 
patch will successfully fix, and even be stable for the typical "host" and 
"bridge" networks.

All that said, I can see why this patch is now more important, even though it 
should be re-structured after review 42516 is implemented. I'll see if I can 
spend some time today and address the remaining issues with this patch.
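
To make that concrete, here is a rough sketch of the kind of fallback involved: 
read the deprecated NetworkSettings.IPAddress first, then fall back to the 
per-network entries under NetworkSettings.Networks. It assumes stout's JSON 
helpers and is not the code on review 43093:

{code}
// Sketch only -- illustrates the fallback idea, not the patch under review.
#include <string>

#include <stout/json.hpp>
#include <stout/none.hpp>
#include <stout/option.hpp>
#include <stout/result.hpp>
#include <stout/try.hpp>

// Given the raw `docker inspect` JSON of one container, pick an IP address.
Option<std::string> containerIP(const std::string& output)
{
  Try<JSON::Object> parse = JSON::parse<JSON::Object>(output);
  if (parse.isError()) {
    return None();
  }

  const JSON::Object json = parse.get();

  // Old location: only populated for the default "bridge"/"host" networks.
  Result<JSON::String> ip =
    json.find<JSON::String>("NetworkSettings.IPAddress");
  if (ip.isSome() && !ip.get().value.empty()) {
    return ip.get().value;
  }

  // New location: one entry per network under NetworkSettings.Networks,
  // which is where user-defined networks report their addresses.
  Result<JSON::Object> networks =
    json.find<JSON::Object>("NetworkSettings.Networks");

  if (networks.isSome()) {
    for (const auto& network : networks.get().values) {
      if (network.second.is<JSON::Object>()) {
        Result<JSON::String> address =
          network.second.as<JSON::Object>().find<JSON::String>("IPAddress");
        if (address.isSome() && !address.get().value.empty()) {
          return address.get().value;
        }
      }
    }
  }

  return None();
}
{code}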

> NetworkSettings.IPAddress field is deprecated in Docker
> ---
>
> Key: MESOS-4370
> URL: https://issues.apache.org/jira/browse/MESOS-4370
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
> Environment: Ubuntu 14.04
> Docker 1.9.1
>Reporter: Clint Armstrong
>Assignee: Travis Hegner
>
> The latest docker API deprecates the NetworkSettings.IPAddress field, in 
> favor of the NetworkSettings.Networks field.
> https://docs.docker.com/engine/reference/api/docker_remote_api/#v1-21-api-changes
> With this deprecation, NetworkSettings.IPAddress is not populated for 
> containers running with networks that use new network plugins.
> As a result the mesos API has no data in 
> container_status.network_infos.ip_address or 
> container_status.network_infos.ipaddresses.
> The immediate impact of this is that mesos-dns is unable to retrieve a 
> container's IP from the netinfo interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4370) NetworkSettings.IPAddress field is deprecated in Docker

2016-03-07 Thread Travis Hegner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183395#comment-15183395
 ] 

Travis Hegner commented on MESOS-4370:
--

Thank you [~robbrockb...@gmail.com] for your testing and interest in this patch. 
I've discovered that this patch only works out of pure luck in the way docker 
interprets multiple "--net" parameters. I have been stalling this patch as it 
will have to be re-worked to account for official user defined network support 
in mesos, via https://reviews.apache.org/r/42516/.

I'd be happy to get a working fix merged in myself, but would prefer it be 
based on the patch linked above.

> NetworkSettings.IPAddress field is deprecated in Docker
> ---
>
> Key: MESOS-4370
> URL: https://issues.apache.org/jira/browse/MESOS-4370
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
> Environment: Ubuntu 14.04
> Docker 1.9.1
>Reporter: Clint Armstrong
>Assignee: Travis Hegner
>
> The latest docker API deprecates the NetworkSettings.IPAddress field, in 
> favor of the NetworkSettings.Networks field.
> https://docs.docker.com/engine/reference/api/docker_remote_api/#v1-21-api-changes
> With this deprecation, NetworkSettings.IPAddress is not populated for 
> containers running with networks that use new network plugins.
> As a result the mesos API has no data in 
> container_status.network_infos.ip_address or 
> container_status.network_infos.ipaddresses.
> The immediate impact of this is that mesos-dns is unable to retrieve a 
> container's IP from the netinfo interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4370) NetworkSettings.IPAddress field is deprectaed in Docker

2016-02-04 Thread Travis Hegner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133099#comment-15133099
 ] 

Travis Hegner commented on MESOS-4370:
--

I am running the latest version of this patch, and it seems to be working as 
intended. The patch is currently in review at: 
https://reviews.apache.org/r/43093/

The above mentioned PR #90 is up to date with the latest version of this patch 
as well.

> NetworkSettings.IPAddress field is deprectaed in Docker
> ---
>
> Key: MESOS-4370
> URL: https://issues.apache.org/jira/browse/MESOS-4370
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
> Environment: Ubuntu 14.04
> Docker 1.9.1
>Reporter: Clint Armstrong
>Assignee: Travis Hegner
>
> The latest docker API deprecates the NetworkSettings.IPAddress field, in 
> favor of the NetworkSettings.Networks field.
> https://docs.docker.com/engine/reference/api/docker_remote_api/#v1-21-api-changes
> With this deprecation, NetworkSettings.IPAddress is not populated for 
> containers running with networks that use new network plugins.
> As a result the mesos API has no data in 
> container_status.network_infos.ip_address or 
> container_status.network_infos.ipaddresses.
> The immediate impact of this is that mesos-dns is unable to retrieve a 
> container's IP from the netinfo interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4370) NetworkSettings.IPAddress field is deprectaed in Docker

2016-02-04 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-4370:
-
Flags: Patch

> NetworkSettings.IPAddress field is deprectaed in Docker
> ---
>
> Key: MESOS-4370
> URL: https://issues.apache.org/jira/browse/MESOS-4370
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
> Environment: Ubuntu 14.04
> Docker 1.9.1
>Reporter: Clint Armstrong
>Assignee: Travis Hegner
>
> The latest docker API deprecates the NetworkSettings.IPAddress field, in 
> favor of the NetworkSettings.Networks field.
> https://docs.docker.com/engine/reference/api/docker_remote_api/#v1-21-api-changes
> With this deprecation, NetworkSettings.IPAddress is not populated for 
> containers running with networks that use new network plugins.
> As a result the mesos API has no data in 
> container_status.network_infos.ip_address or 
> container_status.network_infos.ipaddresses.
> The immediate impact of this is that mesos-dns is unable to retrieve a 
> container's IP from the netinfo interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4370) NetworkSettings.IPAddress field is deprecated in Docker

2016-02-04 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner updated MESOS-4370:
-
Summary: NetworkSettings.IPAddress field is deprecated in Docker  (was: 
NetworkSettings.IPAddress field is deprectaed in Docker)

> NetworkSettings.IPAddress field is deprecated in Docker
> ---
>
> Key: MESOS-4370
> URL: https://issues.apache.org/jira/browse/MESOS-4370
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
> Environment: Ubuntu 14.04
> Docker 1.9.1
>Reporter: Clint Armstrong
>Assignee: Travis Hegner
>
> The latest docker API deprecates the NetworkSettings.IPAddress field, in 
> favor of the NetworkSettings.Networks field.
> https://docs.docker.com/engine/reference/api/docker_remote_api/#v1-21-api-changes
> With this deprecation, NetworkSettings.IPAddress is not populated for 
> containers running with networks that use new network plugins.
> As a result the mesos API has no data in 
> container_status.network_infos.ip_address or 
> container_status.network_infos.ipaddresses.
> The immediate impact of this is that mesos-dns is unable to retrieve a 
> container's IP from the netinfo interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4581) mesos-docker-executor has a race condition causing docker tasks to be stuck in staging when trying to launch

2016-02-04 Thread Travis Hegner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133093#comment-15133093
 ] 

Travis Hegner commented on MESOS-4581:
--

I've been working on a patch for this issue as we've discussed. However, 
annoyingly, my environment has stopped producing this condition, so I have been 
unable to test the specific code path.

I am currently running a vanilla 0.27.0 with my patch for MESOS-4370 added. If 
my environment triggers this issue again, then I will attempt to verify that 
the patch for this issue works correctly.

> mesos-docker-executor has a race condition causing docker tasks to be stuck 
> in staging when trying to launch
> 
>
> Key: MESOS-4581
> URL: https://issues.apache.org/jira/browse/MESOS-4581
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.26.0, 0.27.0
> Environment: Ubuntu 14.04, Docker 1.9.1, Marathon 0.15.0
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>
> We are still working to understand the root cause of this issue, but here is 
> what we know so far:
> Symptoms:
> Launching docker containers from marathon in mesos results in the marathon 
> app being stuck in a "staged" status, and the mesos task being stuck in a 
> "staging" status until a timeout, at which point it will launch on another 
> host, with approximately a 50/50 chance of working or being stuck staging 
> again.
> We have a lot of custom containers, custom networking configs, and custom 
> docker run parameters, but we can't seem to narrow this down to any one 
> particular aspect of our environment. This happens randomly per marathon app 
> while it's attempting to start or restart an instance, whether the app's 
> config has changed or not. I can't seem to find anyone else having a similar 
> issue, which leads me to believe that it is a combination of aspects within 
> our environment that triggers this race condition.
> Deeper analysis:
> The mesos-docker-executor fires the "docker run ..." command in a future. It 
> simultaneously (for all intents and purposes) fires a "docker inspect" 
> against the container which it is trying to start at that moment. When we see 
> this bug, the container starts normally, but the docker inspect command hangs 
> forever. It never re-tries, and never times out.
> When the task launches successfully, the docker inspect command fails once 
> with an exit code, and retries 500ms later, working successfully and flagging 
> the task as "RUNNING" in both mesos and marathon simultaneously.
> If you watch the docker log, you'll notice that a "docker run" via the 
> command line actually triggers 3 docker API calls in succession. "create", 
> "attach", and "start", in that order. It's been fairly consistent that when 
> we see this bug triggered, the docker log has the "create" from the run 
> command, then a GET for the inspect command, then an "attach", and "start" 
> later. When we see this work successfully, we see the GET first (failing, of 
> course because the container doesn't exist yet), and then the "create", 
> "attach", and "start".
> Rudimentary Solution:
> We have written a very basic patch which uses ".after()" to delay that initial 
> inspect call on the container until at least one DOCKER_INSPECT_DELAY (500ms) 
> after the docker run command. This has eliminated the bug as far as we can tell.
> I am not sure if this one time initial delay is the most appropriate fix, or 
> if it would be better to add a timeout to the inspect call in the 
> mesos-docker-executor, which destroys the current inspect thread and starts a 
> new one. The timeout/retry may be appropriate whether the initial delay 
> exists or not.
> In Summary:
> It appears that mesos-docker-executor does not have a race condition itself, 
> but it seems to be triggering one in docker. Since we haven't found this 
> issue anywhere else with any substance, we understand that it is likely 
> related to our environment. Our custom network driver for docker does some 
> cluster-wide coordination, and may introduce just enough delay between the 
> "create" and "attach" calls that are causing us to witness this bug at about 
> a 50-60% rate of attempted container start.
> The inspectDelay patch that I've written for this issue is located in my 
> inspectDelay branch at:
> https://github.com/travishegner/mesos/tree/inspectDelay
> I am happy to supply this patch as a pull request, or put it through the 
> review board if the maintainers feel this is an appropriate fix, or at least 
> as a stop-gap measure until a better fix can be written.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4581) mesos-docker-executor has a race condition causing docker tasks to be stuck in staging when trying to launch

2016-02-03 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner reassigned MESOS-4581:


Assignee: Travis Hegner

> mesos-docker-executor has a race condition causing docker tasks to be stuck 
> in staging when trying to launch
> 
>
> Key: MESOS-4581
> URL: https://issues.apache.org/jira/browse/MESOS-4581
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.26.0, 0.27.0
> Environment: Ubuntu 14.04, Docker 1.9.1, Marathon 0.15.0
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>
> We are still working to understand the root cause of this issue, but here is 
> what we know so far:
> Symptoms:
> Launching docker containers from marathon in mesos results in the marathon 
> app being stuck in a "staged" status, and the mesos task being stuck in a 
> "staging" status until a timeout, at which point it will launch on another 
> host, with approximately a 50/50 chance of working or being stuck staging 
> again.
> We have a lot of custom containers, custom networking configs, and custom 
> docker run parameters, but we can't seem to narrow this down to any one 
> particular aspect of our environment. This happens randomly per marathon app 
> while it's attempting to start or restart an instance, whether the app's 
> config has changed or not. I can't seem to find anyone else having a similar 
> issue, which leads me to believe that it is a combination of aspects within 
> our environment that triggers this race condition.
> Deeper analysis:
> The mesos-docker-executor fires the "docker run ..." command in a future. It 
> simultaneously (for all intents and purposes) fires a "docker inspect" 
> against the container which it is trying to start at that moment. When we see 
> this bug, the container starts normally, but the docker inspect command hangs 
> forever. It never re-tries, and never times out.
> When the task launches successfully, the docker inspect command fails once 
> with an exit code, and retries 500ms later, working successfully and flagging 
> the task as "RUNNING" in both mesos and marathon simultaneously.
> If you watch the docker log, you'll notice that a "docker run" via the 
> command line actually triggers 3 docker API calls in succession. "create", 
> "attach", and "start", in that order. It's been fairly consistent that when 
> we see this bug triggered, the docker log has the "create" from the run 
> command, then a GET for the inspect command, then an "attach", and "start" 
> later. When we see this work successfully, we see the GET first (failing, of 
> course because the container doesn't exist yet), and then the "create", 
> "attach", and "start".
> Rudimentary Solution:
> We have written a very basic patch which uses ".after()" to delay that initial 
> inspect call on the container until at least one DOCKER_INSPECT_DELAY (500ms) 
> after the docker run command. This has eliminated the bug as far as we can tell.
> I am not sure if this one time initial delay is the most appropriate fix, or 
> if it would be better to add a timeout to the inspect call in the 
> mesos-docker-executor, which destroys the current inspect thread and starts a 
> new one. The timeout/retry may be appropriate whether the initial delay 
> exists or not.
> In Summary:
> It appears that mesos-docker-executor does not have a race condition itself, 
> but it seems to be triggering one in docker. Since we haven't found this 
> issue anywhere else with any substance, we understand that it is likely 
> related to our environment. Our custom network driver for docker does some 
> cluster-wide coordination, and may introduce just enough delay between the 
> "create" and "attach" calls that are causing us to witness this bug at about 
> a 50-60% rate of attempted container start.
> The inspectDelay patch that I've written for this issue is located in my 
> inspectDelay branch at:
> https://github.com/travishegner/mesos/tree/inspectDelay
> I am happy to supply this patch as a pull request, or put it through the 
> review board if the maintainers feel this is an appropriate fix, or at least 
> as a stop-gap measure until a better fix can be written.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4370) NetworkSettings.IPAddress field is deprectaed in Docker

2016-02-03 Thread Travis Hegner (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Hegner reassigned MESOS-4370:


Assignee: Travis Hegner

> NetworkSettings.IPAddress field is deprectaed in Docker
> ---
>
> Key: MESOS-4370
> URL: https://issues.apache.org/jira/browse/MESOS-4370
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.0
> Environment: Ubuntu 14.04
> Docker 1.9.1
>Reporter: Clint Armstrong
>Assignee: Travis Hegner
>
> The latest docker API deprecates the NetworkSettings.IPAddress field, in 
> favor of the NetworkSettings.Networks field.
> https://docs.docker.com/engine/reference/api/docker_remote_api/#v1-21-api-changes
> With this deprecation, NetworkSettings.IPAddress is not populated for 
> containers running with networks that use new network plugins.
> As a result the mesos API has no data in 
> container_status.network_infos.ip_address or 
> container_status.network_infos.ipaddresses.
> The immediate impact of this is that mesos-dns is unable to retrieve a 
> container's IP from the netinfo interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-4581) mesos-docker-executor has a race condition causing docker tasks to be stuck in staging when trying to launch

2016-02-02 Thread Travis Hegner (JIRA)
Travis Hegner created MESOS-4581:


 Summary: mesos-docker-executor has a race condition causing docker 
tasks to be stuck in staging when trying to launch
 Key: MESOS-4581
 URL: https://issues.apache.org/jira/browse/MESOS-4581
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 0.26.0, 0.27.0
 Environment: Ubuntu 14.04, Docker 1.9.1, Marathon 0.15.0
Reporter: Travis Hegner


We are still working to understand the root cause of this issue, but here is 
what we know so far:

Symptoms:

Launching docker containers from marathon in mesos results in the marathon app 
being stuck in a "staged" status, and the mesos task being stuck in a "staging" 
status until a timeout, at which point it will launch on another host, with 
approximately a 50/50 chance of working or being stuck staging again.

We have a lot of custom containers, custom networking configs, and custom 
docker run parameters, but we can't seem to narrow this down to any one 
particular aspect of our environment. This happens randomly per marathon app 
while it's attempting to start or restart an instance, whether the app's config 
has changed or not. I can't seem to find anyone else having a similar issue, 
which leads me to believe that it is a combination of aspects within our 
environment that triggers this race condition.

Deeper analysis:
The mesos-docker-executor fires the "docker run ..." command in a future. It 
simultaneously (for all intents and purposes) fires a "docker inspect" against 
the container which it is trying to start at that moment. When we see this bug, 
the container starts normally, but the docker inspect command hangs forever. It 
never re-tries, and never times out.

When the task launches successfully, the docker inspect command fails once with 
an exit code, and retries 500ms later, working successfully and flagging the 
task as "RUNNING" in both mesos and marathon simultaneously.

If you watch the docker log, you'll notice that a "docker run" via the command 
line actually triggers 3 docker API calls in succession. "create", "attach", 
and "start", in that order. It's been fairly consistent that when we see this 
bug triggered, the docker log has the "create" from the run command, then a GET 
for the inspect command, then an "attach", and "start" later. When we see this 
work successfully, we see the GET first (failing, of course because the 
container doesn't exist yet), and then the "create", "attach", and "start".

Rudimentary Solution:
We have written a very basic patch which uses ".after()" to delay that initial 
inspect call on the container until at least one DOCKER_INSPECT_DELAY (500ms) 
after the docker run command. This has eliminated the bug as far as we can tell.
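
A minimal sketch of that kind of delayed first inspect, using libprocess's 
delay() primitive (this is only an illustration with invented names, not the 
actual code in the inspectDelay branch):

{code}
// Sketch only: schedule the first `docker inspect` one DOCKER_INSPECT_DELAY
// after `docker run` is fired, instead of issuing both at effectively the
// same time. Class and member names are invented for illustration.
#include <string>

#include <process/delay.hpp>
#include <process/process.hpp>

#include <stout/duration.hpp>

class InspectDelaySketch : public process::Process<InspectDelaySketch>
{
public:
  void launchTask(const std::string& containerName)
  {
    // `docker run ...` would be fired asynchronously here (elided). The
    // initial inspect is deferred so that run's create/attach/start calls
    // get a head start on the racing inspect.
    process::delay(
        DOCKER_INSPECT_DELAY,
        self(),
        &InspectDelaySketch::inspect,
        containerName);
  }

private:
  void inspect(const std::string& containerName)
  {
    // `docker inspect <containerName>` would be issued here; on failure it
    // can be retried every DOCKER_INSPECT_DELAY, as described above.
  }

  static const Duration DOCKER_INSPECT_DELAY;
};

const Duration InspectDelaySketch::DOCKER_INSPECT_DELAY = Milliseconds(500);
{code}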

I am not sure if this one time initial delay is the most appropriate fix, or if 
it would be better to add a timeout to the inspect call in the 
mesos-docker-executor, which destroys the current inspect thread and starts a 
new one. The timeout/retry may be appropriate whether the initial delay exists 
or not.

In Summary:
It appears that mesos-docker-executor does not have a race condition itself, 
but it seems to be triggering one in docker. Since we haven't found this issue 
anywhere else with any substance, we understand that it is likely related to 
our environment. Our custom network driver for docker does some cluster-wide 
coordination, and may introduce just enough delay between the "create" and 
"attach" calls that are causing us to witness this bug at about a 50-60% rate 
of attempted container start.

The inspectDelay patch that I've written for this issue is located in my 
inspectDelay branch at:
https://github.com/travishegner/mesos/tree/inspectDelay

I am happy to supply this patch as a pull request, or put it through the review 
board if the maintainers feel this is an appropriate fix, or at least as a 
stop-gap measure until a better fix can be written.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3706) Tasks stuck in staging.

2016-01-25 Thread Travis Hegner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115286#comment-15115286
 ] 

Travis Hegner commented on MESOS-3706:
--

Taking a closer look at your issue, I see that your container never starts at 
all. This is different from our issue, but I still wonder if they are related 
to the same underlying cause. Perhaps your version of docker fails in a 
different way as a result of the race condition that we are seeing.

> Tasks stuck in staging.
> ---
>
> Key: MESOS-3706
> URL: https://issues.apache.org/jira/browse/MESOS-3706
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Affects Versions: 0.23.0, 0.24.1
>Reporter: Jord Sonneveld
> Attachments: Screen Shot 2015-10-12 at 9.08.30 AM.png, Screen Shot 
> 2015-10-12 at 9.24.32 AM.png, docker.txt, mesos-slave.INFO, 
> mesos-slave.INFO.2, mesos-slave.INFO.3, stderr, stdout
>
>
> I have a docker image which starts fine on all my slaves except for one.  On 
> that one, it is stuck in STAGING for a long time and never starts.  The INFO 
> log is full of messages like this:
> I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task 
> kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72 of framework 
> 20150109-172016-504433162-5050-19367-0002
> E1012 16:02:09.211272 34907 socket.hpp:174] Shutdown failed on fd=12: 
> Transport endpoint is not connected [107]
> kwe-vinland-work is the task that is stuck in staging.  It is launched by 
> marathon.  I have launched 161 instances successfully on my cluster.  But it 
> refuses to launch on this specific slave.
> These machines are all managed via ansible so their configurations are / 
> should be identical.  I have re-run my ansible scripts and rebooted the 
> machines to no avail.
> It's been in this state for almost 30 minutes.  You can see the mesos docker 
> executor is still running:
> jord@dalstgmesos03:~$ date
> Mon Oct 12 16:13:55 UTC 2015
> jord@dalstgmesos03:~$ ps auwx | grep kwe-vinland
> root 35360  0.0  0.0 1070576 21476 ?   Ssl  15:46   0:00 
> mesos-docker-executor 
> --container=mesos-20151012-082619-4145023498-5050-22623-S0.0695c9e0-0adf-4dfb-bc2a-6060245dcabe
>  --docker=docker --help=false --mapped_directory=/mnt/mesos/sandbox 
> --sandbox_directory=/data/mesos/mesos/work/slaves/20151012-082619-4145023498-5050-22623-S0/frameworks/20150109-172016-504433162-5050-19367-0002/executors/kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72/runs/0695c9e0-0adf-4dfb-bc2a-6060245dcabe
>  --stop_timeout=0ns
> According to docker ps -a, nothing was ever even launched:
> jord@dalstgmesos03:/data/mesos$ sudo docker ps -a
> CONTAINER ID    IMAGE                                               COMMAND                  CREATED         STATUS          PORTS                                            NAMES
> 5c858b90b0a0    registry.roger.dal.moz.com:5000/moz-statsd-v0.22    "/bin/sh -c ./start.s"   39 minutes ago  Up 39 minutes   0.0.0.0:9125->8125/udp, 0.0.0.0:9126->8126/tcp   statsd-fe-influxdb
> d765ba3829fd    registry.roger.dal.moz.com:5000/moz-statsd-v0.22    "/bin/sh -c ./start.s"   41 minutes ago  Up 41 minutes   0.0.0.0:8125->8125/udp, 0.0.0.0:8126->8126/tcp   statsd-repeater
> Those are the only two entries. Nothing about the kwe-vinland job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3706) Tasks stuck in staging.

2016-01-25 Thread Travis Hegner (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115249#comment-15115249
 ] 

Travis Hegner commented on MESOS-3706:
--

This may be related to an issue we've been experiencing with mesos, marathon, 
and docker. In our case, there is a combination of issues with both the 
mesos-docker-executor and the docker daemon itself.

We are running mesos 0.27 (master branch), marathon 0.14, and docker 1.9.1.

Also, in our case, the container would actually start and run perfectly 
normally, even though the mesos interface showed it stuck in "STAGING". 
Typically, the task would reach a timeout and retry somewhere else, where it 
may or may not get stuck staging again.

I'm very curious whether this has the same cause as the issue described here. 
You can check by watching your docker log when your task tries to launch. When 
the `docker run` command is issued, the client actually does a `create`, 
`attach`, and `start` against the docker API. The `docker inspect` command 
issues a single GET against the container name to get its JSON configuration. 
The threaded nature of the `mesos-docker-executor` causes the `run` and 
`inspect` commands to be issued simultaneously. Whenever we experienced the 
issue, the docker log would indicate that the internal commands were `create`, 
`inspect`, `attach`, then `start`, in that order. I believe, but have not 
verified for certain, that the inspect command hangs indefinitely because the 
container was not completely started, and as a result, the 
`mesos-docker-executor` never receives the `running` state. The `run` command 
would still complete normally, and the container would start without issue, 
even though the task was never reported as running.
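
To make the "issued simultaneously" part concrete, here is a toy, 
self-contained sketch of the concurrency shape (not executor code; run() and 
inspect() only print their steps). Depending on scheduling, the inspect GET 
can land either before `create` or between `create` and `start`, which matches 
the two orderings we see in the docker log:
{code}
// Toy illustration: "run" (which performs create/attach/start) and
// "inspect" are fired at effectively the same time, so their steps can
// interleave differently from one launch to the next.
#include <future>
#include <iostream>
#include <mutex>
#include <string>

std::mutex logMutex;

void logStep(const std::string& step)
{
  std::lock_guard<std::mutex> lock(logMutex);
  std::cout << step << std::endl;
}

void run()     { logStep("create"); logStep("attach"); logStep("start"); }
void inspect() { logStep("inspect (GET)"); }

int main()
{
  // Both futures start at roughly the same moment, as in the executor.
  auto r = std::async(std::launch::async, run);
  auto i = std::async(std::launch::async, inspect);

  r.wait();
  i.wait();
  return 0;
}
{code}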

I was able to create a rudimentary fix for our issue by injecting a small 
delay (only 500ms) between the `docker run` command and the `docker inspect` 
command. This allowed the container to fully start before the inspect was 
attempted, thereby avoiding whatever in the docker daemon was causing the 
inspect command to hang indefinitely.

I'd like to know whether this is the exact cause of this issue, in which case 
I could submit a pull request against it; otherwise, if it's a separate issue, 
I could file a new one and submit a pull request against that. If you'd like 
to try out our fix, there is a branch here: 
https://github.com/travishegner/mesos/tree/inspectDelay. 
Beware that this branch also contains our fix for #4370.

> Tasks stuck in staging.
> ---
>
> Key: MESOS-3706
> URL: https://issues.apache.org/jira/browse/MESOS-3706
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Affects Versions: 0.23.0, 0.24.1
>Reporter: Jord Sonneveld
> Attachments: Screen Shot 2015-10-12 at 9.08.30 AM.png, Screen Shot 
> 2015-10-12 at 9.24.32 AM.png, docker.txt, mesos-slave.INFO, 
> mesos-slave.INFO.2, mesos-slave.INFO.3, stderr, stdout
>
>
> I have a docker image which starts fine on all my slaves except for one.  On 
> that one, it is stuck in STAGING for a long time and never starts.  The INFO 
> log is full of messages like this:
> I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task 
> kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72 of framework 
> 20150109-172016-504433162-5050-19367-0002
> E1012 16:02:09.211272 34907 socket.hpp:174] Shutdown failed on fd=12: 
> Transport endpoint is not connected [107]
> kwe-vinland-work is the task that is stuck in staging.  It is launched by 
> marathon.  I have launched 161 instances successfully on my cluster.  But it 
> refuses to launch on this specific slave.
> These machines are all managed via ansible so their configurations are / 
> should be identical.  I have re-run my ansible scripts and rebooted the 
> machines to no avail.
> It's been in this state for almost 30 minutes.  You can see the mesos docker 
> executor is still running:
> jord@dalstgmesos03:~$ date
> Mon Oct 12 16:13:55 UTC 2015
> jord@dalstgmesos03:~$ ps auwx | grep kwe-vinland
> root 35360  0.0  0.0 1070576 21476 ?   Ssl  15:46   0:00 
> mesos-docker-executor 
> --container=mesos-20151012-082619-4145023498-5050-22623-S0.0695c9e0-0adf-4dfb-bc2a-6060245dcabe
>  --docker=docker --help=false --mapped_directory=/mnt/mesos/sandbox 
> --sandbox_directory=/data/mesos/mesos/work/slaves/20151012-082619-4145023498-5050-22623-S0/frameworks/20150109-172016-504433162-5050-19367-0002/executors/kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72/runs/0695c9e0-0adf-4dfb-bc2a-6060245dcabe
>  --stop_timeout=0ns
> According to docker ps -a, nothing was ever even launched:
> jord@dalstgmesos03:/data/mesos$ sudo docker ps -a
> CONTAINER IDIMAGE  
> COMMAND  CREATED STATUS  PORTS