[jira] [Commented] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-12-01 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714038#comment-15714038
 ] 

Zhitao Li commented on MESOS-4945:
--

[~gilbert] [~jieyu], I've put up a short design doc for this in 
https://docs.google.com/document/d/1TSn7HOFLWpF3TLRVe4XyLpv6B__A1tk-tU16B1ZbsCI/edit#.

Please take a look and let me know if you see issues.

If it looks good, I'll add more issues to this epic.

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.





[jira] [Issue Comment Deleted] (MESOS-4945) Garbage collect unused docker layers in the store.

2016-12-01 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-4945:
-
Comment: was deleted

(was: Revised plan in rough steps:
* For each image, checkpoint a) the container ids using it, b) the time at 
which the last container using it was destroyed, and c) the size of each layer;
** TODO: how to deal with migration? The idea is to pass more info through the 
recover() chain of containerizer -> provisioner -> store;
* Change the store interface:
** Change "get(Image)" to "get(Image, ContainerID)": the added ContainerID 
field can be used to implement ref counting and further bookkeeping (e.g. 
getting local image information);
** Add a "remove(Image, ContainerID)" virtual function: this is optional, in 
that a store which does not do ref counting can provide an empty implementation.
* Make sure provisioner::destroy() calls store::remove(Image, ContainerID);
* Add a command line flag for the docker store capacity limit (in bytes);
* In (docker) store::get(Image, ContainerID), after a pull is done, calculate 
the total layer size; if it is above the store capacity, remove unused images 
(those whose container id sets are empty), ordered by the time they were last 
used. Any layer not shared by the remaining images is also removed, until the 
total size drops below the capacity (a sketch of the ref-counting piece follows 
below).
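
A minimal C++ sketch of the ref-counting piece (hypothetical stand-ins for the 
Mesos types; the real store interfaces differ):
{code}
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for mesos::Image and mesos::ContainerID.
using Image = std::string;
using ContainerID = std::string;

// Tracks which containers reference each image. An image whose
// container set is empty is unused and eligible for eviction.
class ReferenceTable
{
public:
  // Called from store::get(Image, ContainerID).
  void add(const Image& image, const ContainerID& containerId)
  {
    references[image].insert(containerId);
  }

  // Called from store::remove(Image, ContainerID), i.e. from
  // provisioner::destroy().
  void remove(const Image& image, const ContainerID& containerId)
  {
    references[image].erase(containerId);
  }

  // Unused images are the eviction candidates when the store is
  // above its capacity limit.
  bool unused(const Image& image) const
  {
    auto it = references.find(image);
    return it == references.end() || it->second.empty();
  }

private:
  std::map<Image, std::set<ContainerID>> references;
};
{code}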

Open questions:

1) In this design, we keep an explicit reference count between {{Container}} 
and {{Image}} in the store. However, this information could also be constructed 
on the fly from all containers in the {{Containerizer}} class. Do we consider 
this "double accounting" problematic, or error-prone?
2) Is calling the new {{remove(Image, ContainerID)}} from 
{{Provisioner::destroy()}} sufficient to make sure all bookkeeping is properly 
done?)

> Garbage collect unused docker layers in the store.
> --
>
> Key: MESOS-4945
> URL: https://issues.apache.org/jira/browse/MESOS-4945
> Project: Mesos
>  Issue Type: Epic
>Reporter: Jie Yu
>Assignee: Zhitao Li
>
> Right now, we don't have any garbage collection in place for docker layers. 
> It's not straightforward to implement because we don't know what container is 
> currently using the layer. We probably need a way to track the current usage 
> of layers.





[jira] [Updated] (MESOS-6668) can't fetch uris

2016-12-01 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-6668:

Description: 
When fetching two URIs, "https://nginx.org/download/nginx-1.6.3.tar.gz" and 
"https://nginx.org/download/nginx-1.8.1.tar.gz", the fetch sometimes succeeds, 
but most of the time it fails!

{code}
I1202 11:38:37.758714 1959038976 fetcher.cpp:498] Fetcher Info: 
{"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/dfad11fe-c83a-40aa-abc5-6390a7615545-S3","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/nginx.org\/download\/nginx-1.6.3.tar.gz"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/nginx.org\/download\/nginx-1.8.1.tar.gz"}}],"sandbox_directory":"\/mesos\/slaves\/dfad11fe-c83a-40aa-abc5-6390a7615545-S3\/frameworks\/dfad11fe-c83a-40aa-abc5-6390a7615545-0039\/executors\/1480649917664199938-0.0001.defaultGroup.Unnamed\/runs\/037e56bc-8416-49c9-bc72-bb43c87a97d7"}
I1202 11:38:37.764912 1959038976 fetcher.cpp:409] Fetching URI 
'https://nginx.org/download/nginx-1.6.3.tar.gz'
I1202 11:38:37.764940 1959038976 fetcher.cpp:250] Fetching directly into the 
sandbox directory
I1202 11:38:37.764974 1959038976 fetcher.cpp:187] Fetching URI 
'https://nginx.org/download/nginx-1.6.3.tar.gz'
I1202 11:38:37.764999 1959038976 fetcher.cpp:134] Downloading resource from 
'https://nginx.org/download/nginx-1.6.3.tar.gz' to 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
I1202 11:39:07.293943 1959038976 fetcher.cpp:84] Extracting with command: tar 
-C 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7'
 -xf 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
I1202 11:39:07.385437 1959038976 fetcher.cpp:92] Extracted 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
 into 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7'
I1202 11:39:07.385507 1959038976 fetcher.cpp:547] Fetched 
'https://nginx.org/download/nginx-1.6.3.tar.gz' to 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
I1202 11:39:07.385517 1959038976 fetcher.cpp:409] Fetching URI 
'https://nginx.org/download/nginx-1.8.1.tar.gz'
I1202 11:39:07.385524 1959038976 fetcher.cpp:250] Fetching directly into the 
sandbox directory
I1202 11:39:07.385546 1959038976 fetcher.cpp:187] Fetching URI 
'https://nginx.org/download/nginx-1.8.1.tar.gz'
I1202 11:39:07.385560 1959038976 fetcher.cpp:134] Downloading resource from 
'https://nginx.org/download/nginx-1.8.1.tar.gz' to 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.8.1.tar.gz'

End fetcher log for container 037e56bc-8416-49c9-bc72-bb43c87a97d7
E1202 11:39:37.843308 3211264 fetcher.cpp:568] Failed to run mesos-fetcher: 
Failed to fetch all URIs for container '037e56bc-8416-49c9-bc72-bb43c87a97d7' 
with exit status: 9
E1202 11:39:37.843600 1601536 slave.cpp:4423] Container 
'037e56bc-8416-49c9-bc72-bb43c87a97d7' for executor 
'1480649917664199938-0.0001.defaultGroup.Unnamed' of framework 
dfad11fe-c83a-40aa-abc5-6390a7615545-0039 failed to start: Failed to fetch all 
URIs for container '037e56bc-8416-49c9-bc72-bb43c87a97d7' with exit status: 9
W1202 11:39:37.843725 1064960 composing.cpp:600] Attempted to destroy unknown 
container 037e56bc-8416-49c9-bc72-bb43c87a97d7
E1202 11:39:37.843735 1601536 slave.cpp:4529] Termination of executor 
'1480649917664199938-0.0001.defaultGroup.Unnamed' of framework 
dfad11fe-c83a-40aa-abc5-6390a7615545-0039 failed: unknown container
I1202 11:39:37.843849 1601536 slave.cpp:3634] Handling status update 
TASK_FAILED (UUID: e308b104-087d-4c9e-9606-c11e62dd14ad) for task 
1480649917664199938-0.0001.defaultGroup.Unnamed of framework 
dfad11fe-c83a-40aa-abc5-6390a7615545-0039 from @0.0.0.0:0
{code}

  was:
when with two uris: "https://nginx.org/download/nginx-1.6.3.tar.gz" 

[jira] [Assigned] (MESOS-6668) can't fetch uris

2016-12-01 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent reassigned MESOS-6668:
---

Assignee: haosdent

> can't fetch uris
> 
>
> Key: MESOS-6668
> URL: https://issues.apache.org/jira/browse/MESOS-6668
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.1.0
>Reporter: pwzgorilla
>Assignee: haosdent
>Priority: Minor
>
> When fetching two URIs, "https://nginx.org/download/nginx-1.6.3.tar.gz" and 
> "https://nginx.org/download/nginx-1.8.1.tar.gz", the fetch sometimes succeeds, 
> but most of the time it fails!
> I1202 11:38:37.758714 1959038976 fetcher.cpp:498] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/dfad11fe-c83a-40aa-abc5-6390a7615545-S3","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/nginx.org\/download\/nginx-1.6.3.tar.gz"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/nginx.org\/download\/nginx-1.8.1.tar.gz"}}],"sandbox_directory":"\/mesos\/slaves\/dfad11fe-c83a-40aa-abc5-6390a7615545-S3\/frameworks\/dfad11fe-c83a-40aa-abc5-6390a7615545-0039\/executors\/1480649917664199938-0.0001.defaultGroup.Unnamed\/runs\/037e56bc-8416-49c9-bc72-bb43c87a97d7"}
> I1202 11:38:37.764912 1959038976 fetcher.cpp:409] Fetching URI 
> 'https://nginx.org/download/nginx-1.6.3.tar.gz'
> I1202 11:38:37.764940 1959038976 fetcher.cpp:250] Fetching directly into the 
> sandbox directory
> I1202 11:38:37.764974 1959038976 fetcher.cpp:187] Fetching URI 
> 'https://nginx.org/download/nginx-1.6.3.tar.gz'
> I1202 11:38:37.764999 1959038976 fetcher.cpp:134] Downloading resource from 
> 'https://nginx.org/download/nginx-1.6.3.tar.gz' to 
> '/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
> I1202 11:39:07.293943 1959038976 fetcher.cpp:84] Extracting with command: tar 
> -C 
> '/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7'
>  -xf 
> '/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
> I1202 11:39:07.385437 1959038976 fetcher.cpp:92] Extracted 
> '/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
>  into 
> '/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7'
> I1202 11:39:07.385507 1959038976 fetcher.cpp:547] Fetched 
> 'https://nginx.org/download/nginx-1.6.3.tar.gz' to 
> '/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
> I1202 11:39:07.385517 1959038976 fetcher.cpp:409] Fetching URI 
> 'https://nginx.org/download/nginx-1.8.1.tar.gz'
> I1202 11:39:07.385524 1959038976 fetcher.cpp:250] Fetching directly into the 
> sandbox directory
> I1202 11:39:07.385546 1959038976 fetcher.cpp:187] Fetching URI 
> 'https://nginx.org/download/nginx-1.8.1.tar.gz'
> I1202 11:39:07.385560 1959038976 fetcher.cpp:134] Downloading resource from 
> 'https://nginx.org/download/nginx-1.8.1.tar.gz' to 
> '/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.8.1.tar.gz'
> End fetcher log for container 037e56bc-8416-49c9-bc72-bb43c87a97d7
> E1202 11:39:37.843308 3211264 fetcher.cpp:568] Failed to run mesos-fetcher: 
> Failed to fetch all URIs for container '037e56bc-8416-49c9-bc72-bb43c87a97d7' 
> with exit status: 9
> E1202 11:39:37.843600 1601536 slave.cpp:4423] Container 
> '037e56bc-8416-49c9-bc72-bb43c87a97d7' for executor 
> '1480649917664199938-0.0001.defaultGroup.Unnamed' of framework 
> dfad11fe-c83a-40aa-abc5-6390a7615545-0039 failed to start: Failed to fetch 
> all URIs for container '037e56bc-8416-49c9-bc72-bb43c87a97d7' with exit 
> status: 9
> W1202 11:39:37.843725 1064960 composing.cpp:600] Attempted to destroy unknown 
> container 037e56bc-8416-49c9-bc72-bb43c87a97d7
> E1202 11:39:37.843735 1601536 slave.cpp:4529] Termination of executor 
> 

[jira] [Created] (MESOS-6668) can't fetch uris

2016-12-01 Thread pwzgorilla (JIRA)
pwzgorilla created MESOS-6668:
-

 Summary: can't fetch uris
 Key: MESOS-6668
 URL: https://issues.apache.org/jira/browse/MESOS-6668
 Project: Mesos
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.1.0
Reporter: pwzgorilla
Priority: Minor


When fetching two URIs, "https://nginx.org/download/nginx-1.6.3.tar.gz" and 
"https://nginx.org/download/nginx-1.8.1.tar.gz", the fetch sometimes succeeds, 
but most of the time it fails!

I1202 11:38:37.758714 1959038976 fetcher.cpp:498] Fetcher Info: 
{"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/dfad11fe-c83a-40aa-abc5-6390a7615545-S3","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/nginx.org\/download\/nginx-1.6.3.tar.gz"}},{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"https:\/\/nginx.org\/download\/nginx-1.8.1.tar.gz"}}],"sandbox_directory":"\/mesos\/slaves\/dfad11fe-c83a-40aa-abc5-6390a7615545-S3\/frameworks\/dfad11fe-c83a-40aa-abc5-6390a7615545-0039\/executors\/1480649917664199938-0.0001.defaultGroup.Unnamed\/runs\/037e56bc-8416-49c9-bc72-bb43c87a97d7"}
I1202 11:38:37.764912 1959038976 fetcher.cpp:409] Fetching URI 
'https://nginx.org/download/nginx-1.6.3.tar.gz'
I1202 11:38:37.764940 1959038976 fetcher.cpp:250] Fetching directly into the 
sandbox directory
I1202 11:38:37.764974 1959038976 fetcher.cpp:187] Fetching URI 
'https://nginx.org/download/nginx-1.6.3.tar.gz'
I1202 11:38:37.764999 1959038976 fetcher.cpp:134] Downloading resource from 
'https://nginx.org/download/nginx-1.6.3.tar.gz' to 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
I1202 11:39:07.293943 1959038976 fetcher.cpp:84] Extracting with command: tar 
-C 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7'
 -xf 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
I1202 11:39:07.385437 1959038976 fetcher.cpp:92] Extracted 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
 into 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7'
I1202 11:39:07.385507 1959038976 fetcher.cpp:547] Fetched 
'https://nginx.org/download/nginx-1.6.3.tar.gz' to 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.6.3.tar.gz'
I1202 11:39:07.385517 1959038976 fetcher.cpp:409] Fetching URI 
'https://nginx.org/download/nginx-1.8.1.tar.gz'
I1202 11:39:07.385524 1959038976 fetcher.cpp:250] Fetching directly into the 
sandbox directory
I1202 11:39:07.385546 1959038976 fetcher.cpp:187] Fetching URI 
'https://nginx.org/download/nginx-1.8.1.tar.gz'
I1202 11:39:07.385560 1959038976 fetcher.cpp:134] Downloading resource from 
'https://nginx.org/download/nginx-1.8.1.tar.gz' to 
'/mesos/slaves/dfad11fe-c83a-40aa-abc5-6390a7615545-S3/frameworks/dfad11fe-c83a-40aa-abc5-6390a7615545-0039/executors/1480649917664199938-0.0001.defaultGroup.Unnamed/runs/037e56bc-8416-49c9-bc72-bb43c87a97d7/nginx-1.8.1.tar.gz'

End fetcher log for container 037e56bc-8416-49c9-bc72-bb43c87a97d7
E1202 11:39:37.843308 3211264 fetcher.cpp:568] Failed to run mesos-fetcher: 
Failed to fetch all URIs for container '037e56bc-8416-49c9-bc72-bb43c87a97d7' 
with exit status: 9
E1202 11:39:37.843600 1601536 slave.cpp:4423] Container 
'037e56bc-8416-49c9-bc72-bb43c87a97d7' for executor 
'1480649917664199938-0.0001.defaultGroup.Unnamed' of framework 
dfad11fe-c83a-40aa-abc5-6390a7615545-0039 failed to start: Failed to fetch all 
URIs for container '037e56bc-8416-49c9-bc72-bb43c87a97d7' with exit status: 9
W1202 11:39:37.843725 1064960 composing.cpp:600] Attempted to destroy unknown 
container 037e56bc-8416-49c9-bc72-bb43c87a97d7
E1202 11:39:37.843735 1601536 slave.cpp:4529] Termination of executor 
'1480649917664199938-0.0001.defaultGroup.Unnamed' of framework 
dfad11fe-c83a-40aa-abc5-6390a7615545-0039 failed: unknown container
I1202 11:39:37.843849 1601536 slave.cpp:3634] Handling status update 
TASK_FAILED (UUID: e308b104-087d-4c9e-9606-c11e62dd14ad) for task 

[jira] [Updated] (MESOS-6487) Define RestartPolicy in TaskInfo

2016-12-01 Thread Megha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Megha updated MESOS-6487:
-
Shepherd: Yan Xu

> Define RestartPolicy in TaskInfo
> 
>
> Key: MESOS-6487
> URL: https://issues.apache.org/jira/browse/MESOS-6487
> Project: Mesos
>  Issue Type: Task
>Reporter: Yan Xu
>Assignee: Megha
>






[jira] [Commented] (MESOS-6665) io::redirect might cause stack overflow.

2016-12-01 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15713512#comment-15713512
 ] 

Neil Conway commented on MESOS-6665:


The same error happens on {{IOSwitchboardTest.ServerRedirectLog}}.

> io::redirect might cause stack overflow.
> 
>
> Key: MESOS-6665
> URL: https://issues.apache.org/jira/browse/MESOS-6665
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> Can reproduce this on macOS sierra:
> {noformat}
> [--] 6 tests from IOTest
> [ RUN  ] IOTest.Poll
> [   OK ] IOTest.Poll (0 ms)
> [ RUN  ] IOTest.Read
> [   OK ] IOTest.Read (3 ms)
> [ RUN  ] IOTest.BufferedRead
> [   OK ] IOTest.BufferedRead (5 ms)
> [ RUN  ] IOTest.Write
> [   OK ] IOTest.Write (1 ms)
> [ RUN  ] IOTest.Redirect
> make[6]: *** [check-local] Illegal instruction: 4
> make[5]: *** [check-am] Error 2
> make[4]: *** [check-recursive] Error 1
> make[3]: *** [check] Error 2
> make[2]: *** [check-recursive] Error 1
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> (reverse-i-search)`k': make check -j3
> Jies-MacBook-Pro:build jie$ lldb 3rdparty/libprocess/libprocess-tests
> (lldb) target create "3rdparty/libprocess/libprocess-tests"
> Current executable set to '3rdparty/libprocess/libprocess-tests' (x86_64).
> (lldb) run --gtest_filter=IOTest.Redirect
> Process 26064 launched: 
> '/Users/jie/workspace/dist/mesos/build/3rdparty/libprocess/libprocess-tests' 
> (x86_64)
> Note: Google Test filter = IOTest.Redirect
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from IOTest
> [ RUN  ] IOTest.Redirect
> Process 26064 stopped
> * thread #2: tid = 0x152c5c, 0x7fffd6d463e0 
> libsystem_malloc.dylib`szone_malloc_should_clear + 78, stop reason = 
> EXC_BAD_ACCESS (code=2, address=0x7eb16ff8)
> frame #0: 0x7fffd6d463e0 
> libsystem_malloc.dylib`szone_malloc_should_clear + 78
> libsystem_malloc.dylib`szone_malloc_should_clear:
> ->  0x7fffd6d463e0 <+78>: movq   %rax, -0x78(%rbp)
> 0x7fffd6d463e4 <+82>: movq   0x10f0(%r12), %r13
> 0x7fffd6d463ec <+90>: leaq   (%rax,%rax,4), %r14
> 0x7fffd6d463f0 <+94>: shlq   $0x9, %r14
> (lldb) bt
> .
> frame #2794: 0x7fffd6ddb221 libsystem_pthread.dylib`thread_start + 13
> {noformat}
> Changing the test to redirect just 1KB of data will hide the issue.





[jira] [Updated] (MESOS-6667) Update vendored ZooKeeper to 3.4.9

2016-12-01 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-6667:
---
Target Version/s: 1.2.0

> Update vendored ZooKeeper to 3.4.9
> --
>
> Key: MESOS-6667
> URL: https://issues.apache.org/jira/browse/MESOS-6667
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Neil Conway
>  Labels: mesosphere
>
> 3.4.9 has a few notable fixes for the C client library.





[jira] [Created] (MESOS-6667) Update vendored ZooKeeper to 3.4.9

2016-12-01 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6667:
--

 Summary: Update vendored ZooKeeper to 3.4.9
 Key: MESOS-6667
 URL: https://issues.apache.org/jira/browse/MESOS-6667
 Project: Mesos
  Issue Type: Improvement
Reporter: Neil Conway


3.4.9 has a few notable fixes for the C client library.





[jira] [Updated] (MESOS-6662) Some HTTP scheduler calls are missing from the docs

2016-12-01 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6662:
--
Labels: apidocs documentation http newbie scheduler  (was: apidocs 
documentation http scheduler)

> Some HTTP scheduler calls are missing from the docs
> ---
>
> Key: MESOS-6662
> URL: https://issues.apache.org/jira/browse/MESOS-6662
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Reporter: Greg Mann
>  Labels: apidocs, documentation, http, newbie, scheduler
>
> Some of the calls available to HTTP schedulers are missing from the HTTP 
> scheduler API documentation. We should make sure that all of the calls 
> available in the {{Master::Http::scheduler}} handler are in the documentation 
> [here|https://github.com/apache/mesos/blob/master/docs/scheduler-http-api.md].





[jira] [Updated] (MESOS-6665) io::redirect might cause stack overflow.

2016-12-01 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6665:
--
Issue Type: Bug  (was: Task)

> io::redirect might cause stack overflow.
> 
>
> Key: MESOS-6665
> URL: https://issues.apache.org/jira/browse/MESOS-6665
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> Can reproduce this on macOS sierra:
> {noformat}
> [--] 6 tests from IOTest
> [ RUN  ] IOTest.Poll
> [   OK ] IOTest.Poll (0 ms)
> [ RUN  ] IOTest.Read
> [   OK ] IOTest.Read (3 ms)
> [ RUN  ] IOTest.BufferedRead
> [   OK ] IOTest.BufferedRead (5 ms)
> [ RUN  ] IOTest.Write
> [   OK ] IOTest.Write (1 ms)
> [ RUN  ] IOTest.Redirect
> make[6]: *** [check-local] Illegal instruction: 4
> make[5]: *** [check-am] Error 2
> make[4]: *** [check-recursive] Error 1
> make[3]: *** [check] Error 2
> make[2]: *** [check-recursive] Error 1
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> (reverse-i-search)`k': make check -j3
> Jies-MacBook-Pro:build jie$ lldb 3rdparty/libprocess/libprocess-tests
> (lldb) target create "3rdparty/libprocess/libprocess-tests"
> Current executable set to '3rdparty/libprocess/libprocess-tests' (x86_64).
> (lldb) run --gtest_filter=IOTest.Redirect
> Process 26064 launched: 
> '/Users/jie/workspace/dist/mesos/build/3rdparty/libprocess/libprocess-tests' 
> (x86_64)
> Note: Google Test filter = IOTest.Redirect
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from IOTest
> [ RUN  ] IOTest.Redirect
> Process 26064 stopped
> * thread #2: tid = 0x152c5c, 0x7fffd6d463e0 
> libsystem_malloc.dylib`szone_malloc_should_clear + 78, stop reason = 
> EXC_BAD_ACCESS (code=2, address=0x7eb16ff8)
> frame #0: 0x7fffd6d463e0 
> libsystem_malloc.dylib`szone_malloc_should_clear + 78
> libsystem_malloc.dylib`szone_malloc_should_clear:
> ->  0x7fffd6d463e0 <+78>: movq   %rax, -0x78(%rbp)
> 0x7fffd6d463e4 <+82>: movq   0x10f0(%r12), %r13
> 0x7fffd6d463ec <+90>: leaq   (%rax,%rax,4), %r14
> 0x7fffd6d463f0 <+94>: shlq   $0x9, %r14
> (lldb) bt
> .
> frame #2794: 0x7fffd6ddb221 libsystem_pthread.dylib`thread_start + 13
> {noformat}
> Changing the test to redirect just 1KB of data will hide the issue.





[jira] [Updated] (MESOS-6666) HttpServeTest.Discard failed on OSX sierra

2016-12-01 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6666:
--
Description: 
{noformat}
[ RUN  ] HttpServeTest.Discard
/Users/jie/workspace/vagrant/trusty/mesos/3rdparty/libprocess/src/tests/http_tests.cpp:1926:
 Failure
Failed to wait 15secs for response
[  FAILED  ] HttpServeTest.Discard (15003 ms)
{noformat}

  was:
{noformat}
[ RUN  ] HttpServeTest.Discard
/Users/jie/workspace/vagrant/trusty/mesos/3rdparty/libprocess/src/tests/http_tests.cpp:1926:
 Failure
Failed to wait 15secs for response
[  FAILED  ] HttpServeTest.Discard (15003 ms)
{nofromat}


> HttpServeTest.Discard failed on OSX sierra
> --
>
> Key: MESOS-6666
> URL: https://issues.apache.org/jira/browse/MESOS-6666
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> {noformat}
> [ RUN  ] HttpServeTest.Discard
> /Users/jie/workspace/vagrant/trusty/mesos/3rdparty/libprocess/src/tests/http_tests.cpp:1926:
>  Failure
> Failed to wait 15secs for response
> [  FAILED  ] HttpServeTest.Discard (15003 ms)
> {noformat}





[jira] [Updated] (MESOS-6666) HttpServeTest.Discard failed on OSX sierra

2016-12-01 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6666:
--
Issue Type: Bug  (was: Task)

> HttpServeTest.Discard failed on OSX sierra
> --
>
> Key: MESOS-6666
> URL: https://issues.apache.org/jira/browse/MESOS-6666
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jie Yu
>
> {noformat}
> [ RUN  ] HttpServeTest.Discard
> /Users/jie/workspace/vagrant/trusty/mesos/3rdparty/libprocess/src/tests/http_tests.cpp:1926:
>  Failure
> Failed to wait 15secs for response
> [  FAILED  ] HttpServeTest.Discard (15003 ms)
> {nofromat}





[jira] [Updated] (MESOS-6665) io::redirect might cause stack overflow.

2016-12-01 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6665:
--
Description: 
Can reproduce this on macOS sierra:
{noformat}
[--] 6 tests from IOTest
[ RUN  ] IOTest.Poll
[   OK ] IOTest.Poll (0 ms)
[ RUN  ] IOTest.Read
[   OK ] IOTest.Read (3 ms)
[ RUN  ] IOTest.BufferedRead
[   OK ] IOTest.BufferedRead (5 ms)
[ RUN  ] IOTest.Write
[   OK ] IOTest.Write (1 ms)
[ RUN  ] IOTest.Redirect
make[6]: *** [check-local] Illegal instruction: 4
make[5]: *** [check-am] Error 2
make[4]: *** [check-recursive] Error 1
make[3]: *** [check] Error 2
make[2]: *** [check-recursive] Error 1
make[1]: *** [check] Error 2
make: *** [check-recursive] Error 1
(reverse-i-search)`k': make check -j3
Jies-MacBook-Pro:build jie$ lldb 3rdparty/libprocess/libprocess-tests
(lldb) target create "3rdparty/libprocess/libprocess-tests"
Current executable set to '3rdparty/libprocess/libprocess-tests' (x86_64).
(lldb) run --gtest_filter=IOTest.Redirect
Process 26064 launched: 
'/Users/jie/workspace/dist/mesos/build/3rdparty/libprocess/libprocess-tests' 
(x86_64)
Note: Google Test filter = IOTest.Redirect
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from IOTest
[ RUN  ] IOTest.Redirect
Process 26064 stopped
* thread #2: tid = 0x152c5c, 0x7fffd6d463e0 
libsystem_malloc.dylib`szone_malloc_should_clear + 78, stop reason = 
EXC_BAD_ACCESS (code=2, address=0x7eb16ff8)
frame #0: 0x7fffd6d463e0 
libsystem_malloc.dylib`szone_malloc_should_clear + 78
libsystem_malloc.dylib`szone_malloc_should_clear:
->  0x7fffd6d463e0 <+78>: movq   %rax, -0x78(%rbp)
0x7fffd6d463e4 <+82>: movq   0x10f0(%r12), %r13
0x7fffd6d463ec <+90>: leaq   (%rax,%rax,4), %r14
0x7fffd6d463f0 <+94>: shlq   $0x9, %r14
(lldb) bt
.
frame #2794: 0x7fffd6ddb221 libsystem_pthread.dylib`thread_start + 13
{noformat}

Changing the test to redirect just 1KB of data will hide the issue.

  was:
Can reproduce this on macOS sierra:
{noformat}
[--] 6 tests from IOTest
[ RUN  ] IOTest.Poll
[   OK ] IOTest.Poll (0 ms)
[ RUN  ] IOTest.Read
[   OK ] IOTest.Read (3 ms)
[ RUN  ] IOTest.BufferedRead
[   OK ] IOTest.BufferedRead (5 ms)
[ RUN  ] IOTest.Write
[   OK ] IOTest.Write (1 ms)
[ RUN  ] IOTest.Redirect
make[6]: *** [check-local] Illegal instruction: 4
make[5]: *** [check-am] Error 2
make[4]: *** [check-recursive] Error 1
make[3]: *** [check] Error 2
make[2]: *** [check-recursive] Error 1
make[1]: *** [check] Error 2
make: *** [check-recursive] Error 1
(reverse-i-search)`k': make check -j3
Jies-MacBook-Pro:build jie$ lldb 3rdparty/libprocess/libprocess-tests
(lldb) target create "3rdparty/libprocess/libprocess-tests"
Current executable set to '3rdparty/libprocess/libprocess-tests' (x86_64).
(lldb) run --gtest_filter=IOTest.Redirect
Process 26064 launched: 
'/Users/jie/workspace/dist/mesos/build/3rdparty/libprocess/libprocess-tests' 
(x86_64)
Note: Google Test filter = IOTest.Redirect
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from IOTest
[ RUN  ] IOTest.Redirect
Process 26064 stopped
* thread #2: tid = 0x152c5c, 0x7fffd6d463e0 
libsystem_malloc.dylib`szone_malloc_should_clear + 78, stop reason = 
EXC_BAD_ACCESS (code=2, address=0x7eb16ff8)
frame #0: 0x7fffd6d463e0 
libsystem_malloc.dylib`szone_malloc_should_clear + 78
libsystem_malloc.dylib`szone_malloc_should_clear:
->  0x7fffd6d463e0 <+78>: movq   %rax, -0x78(%rbp)
0x7fffd6d463e4 <+82>: movq   0x10f0(%r12), %r13
0x7fffd6d463ec <+90>: leaq   (%rax,%rax,4), %r14
0x7fffd6d463f0 <+94>: shlq   $0x9, %r14
(lldb)
{noformat}

Changing the test to redirect just 1KB of data will hide the issue.


> io::redirect might cause stack overflow.
> 
>
> Key: MESOS-6665
> URL: https://issues.apache.org/jira/browse/MESOS-6665
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>
> Can reproduce this on macOS sierra:
> {noformat}
> [--] 6 tests from IOTest
> [ RUN  ] IOTest.Poll
> [   OK ] IOTest.Poll (0 ms)
> [ RUN  ] IOTest.Read
> [   OK ] IOTest.Read (3 ms)
> [ RUN  ] IOTest.BufferedRead
> [   OK ] IOTest.BufferedRead (5 ms)
> [ RUN  ] IOTest.Write
> [   OK ] IOTest.Write (1 ms)
> [ RUN  ] IOTest.Redirect
> make[6]: *** [check-local] Illegal instruction: 4
> make[5]: *** [check-am] Error 2
> make[4]: *** [check-recursive] Error 1
> make[3]: *** [check] Error 2
> make[2]: *** [check-recursive] Error 1
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1
> (reverse-i-search)`k': make check -j3
> Jies-MacBook-Pro:build jie$ lldb 3rdparty/libprocess/libprocess-tests
> (lldb) target create 

[jira] [Created] (MESOS-6665) io::redirect might cause stack overflow.

2016-12-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6665:
-

 Summary: io::redirect might cause stack overflow.
 Key: MESOS-6665
 URL: https://issues.apache.org/jira/browse/MESOS-6665
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu


Can reproduce this on macOS sierra:
{noformat}
[--] 6 tests from IOTest
[ RUN  ] IOTest.Poll
[   OK ] IOTest.Poll (0 ms)
[ RUN  ] IOTest.Read
[   OK ] IOTest.Read (3 ms)
[ RUN  ] IOTest.BufferedRead
[   OK ] IOTest.BufferedRead (5 ms)
[ RUN  ] IOTest.Write
[   OK ] IOTest.Write (1 ms)
[ RUN  ] IOTest.Redirect
make[6]: *** [check-local] Illegal instruction: 4
make[5]: *** [check-am] Error 2
make[4]: *** [check-recursive] Error 1
make[3]: *** [check] Error 2
make[2]: *** [check-recursive] Error 1
make[1]: *** [check] Error 2
make: *** [check-recursive] Error 1
(reverse-i-search)`k': make check -j3
Jies-MacBook-Pro:build jie$ lldb 3rdparty/libprocess/libprocess-tests
(lldb) target create "3rdparty/libprocess/libprocess-tests"
Current executable set to '3rdparty/libprocess/libprocess-tests' (x86_64).
(lldb) run --gtest_filter=IOTest.Redirect
Process 26064 launched: 
'/Users/jie/workspace/dist/mesos/build/3rdparty/libprocess/libprocess-tests' 
(x86_64)
Note: Google Test filter = IOTest.Redirect
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from IOTest
[ RUN  ] IOTest.Redirect
Process 26064 stopped
* thread #2: tid = 0x152c5c, 0x7fffd6d463e0 
libsystem_malloc.dylib`szone_malloc_should_clear + 78, stop reason = 
EXC_BAD_ACCESS (code=2, address=0x7eb16ff8)
frame #0: 0x7fffd6d463e0 
libsystem_malloc.dylib`szone_malloc_should_clear + 78
libsystem_malloc.dylib`szone_malloc_should_clear:
->  0x7fffd6d463e0 <+78>: movq   %rax, -0x78(%rbp)
0x7fffd6d463e4 <+82>: movq   0x10f0(%r12), %r13
0x7fffd6d463ec <+90>: leaq   (%rax,%rax,4), %r14
0x7fffd6d463f0 <+94>: shlq   $0x9, %r14
(lldb)
{noformat}

Changing the test to redirect just 1KB of data will hide the issue.
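
For intuition, a standalone C++ sketch (not Mesos code) of how a synchronous 
completion chain can overflow the stack, which would also explain why 
redirecting only 1KB hides the problem:
{code}
#include <unistd.h>

#include <vector>

// Each completed chunk immediately recurses into the next read instead
// of returning to an event loop. With enough data, and without a
// guaranteed tail-call optimization, the stack eventually overflows.
void pump(int from, int to)
{
  std::vector<char> buffer(4096);

  ssize_t n = read(from, buffer.data(), buffer.size());
  if (n <= 0) {
    return;  // EOF or error: the whole recursion unwinds at once.
  }

  write(to, buffer.data(), n);
  pump(from, to);
}
{code}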





[jira] [Commented] (MESOS-6467) Build a Container I/O Switchboard

2016-12-01 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712942#comment-15712942
 ] 

Jie Yu commented on MESOS-6467:
---

commit 18845afb7f02b0ec92d106e445827f92d3b02329
Author: Kevin Klues 
Date:   Thu Dec 1 11:21:47 2016 -0800

Updated IOSwitchboard to block IO until connected for DEBUG containers.

Review: https://reviews.apache.org/r/54241/

> Build a Container I/O Switchboard
> -
>
> Key: MESOS-6467
> URL: https://issues.apache.org/jira/browse/MESOS-6467
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: debugging, mesosphere
> Fix For: 1.2.0
>
>
> In order to facilitate attach operations for a running container, we plan to 
> introduce a new component into Mesos known as an “I/O switchboard”. The goal 
> of this switchboard is to allow external components to *dynamically* 
> interpose on the {{stdin}}, {{stdout}} and {{stderr}} of the init process of 
> a running Mesos container. It will be implemented as a per-container, 
> stand-alone process launched by the mesos containerizer at the time a 
> container is first launched.
> Each per-container switchboard will be responsible for the following:
>  * Accepting a single dynamic request to register an fd for streaming data to 
> the {{stdin}} of a container’s init process.
>  * Accepting *multiple* dynamic requests to register fds for streaming data 
> from the {{stdout}} and {{stderr}} of a container’s init process to those fds.
>  * Allocating a pty for the new process (if requested), and directing data 
> through the master fd of the pty as necessary.
>  * Passing the *actual* set of file descriptors that should be dup’d onto the 
> {{stdin}}, {{stdout}} and {{stderr}} of a container’s init process back to 
> the containerizer. 
> The idea being that the switchboard will maintain three asynchronous loops 
> (one each for {{stdin}}, {{stdout}} and {{stderr}}) that constantly pipe data 
> to/from a container’s init process to/from all of the file descriptors that 
> have been dynamically registered with it.
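
For illustration, a hedged POSIX sketch of one such output loop (not the actual 
switchboard implementation; names are made up):
{code}
#include <unistd.h>

#include <set>

// Fan data from the container's stdout (or stderr) out to every fd
// that has been dynamically registered with the switchboard.
void outputLoop(int source, const std::set<int>& sinks)
{
  char buffer[4096];
  ssize_t n;

  while ((n = read(source, buffer, sizeof(buffer))) > 0) {
    for (int fd : sinks) {
      write(fd, buffer, n);  // Real code must handle partial writes and
                             // slow consumers asynchronously.
    }
  }
}
{code}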





[jira] [Commented] (MESOS-6658) Mesos tests generated with cmake build fail to unload libraries properly

2016-12-01 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712936#comment-15712936
 ] 

Benjamin Bannier commented on MESOS-6658:
-

Adding a move ctor and a move assignment operator to {{Owned}}, and then 
{{std::move}}'ing the {{Owned}} into the storage map inside the body of 
{{ModuleManager::loadManifest}}, was not enough to make this problem go away. 
It might be worthwhile to look into related upstream issues, e.g., 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60731 or 
https://sourceware.org/bugzilla/show_bug.cgi?id=17833 (there are more).
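
For reference, a rough sketch of the attempted change (a hypothetical, 
simplified {{Owned}}; the real {{process::Owned}} wraps a shared pointer):
{code}
template <typename T>
class Owned
{
public:
  explicit Owned(T* t) : t_(t) {}

  // The added move constructor/assignment, so the pointer can be
  // std::move'd into the storage map without copying.
  Owned(Owned&& that) noexcept : t_(that.t_) { that.t_ = nullptr; }

  Owned& operator=(Owned&& that) noexcept
  {
    if (this != &that) {
      delete t_;
      t_ = that.t_;
      that.t_ = nullptr;
    }
    return *this;
  }

  Owned(const Owned&) = delete;
  Owned& operator=(const Owned&) = delete;

  ~Owned() { delete t_; }

private:
  T* t_;
};
{code}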

> Mesos tests generated with cmake build fail to unload libraries properly
> 
>
> Key: MESOS-6658
> URL: https://issues.apache.org/jira/browse/MESOS-6658
> Project: Mesos
>  Issue Type: Bug
>  Components: cmake, tests
>Affects Versions: 1.2.0
>Reporter: Benjamin Bannier
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> A default cmake build created from {{ec0546e}} creates a {{mesos-tests}} 
> which cannot unload a dependency without an error:
> {code}
> $ ./src/mesos-tests  --gtest_filter=''
> Source directory: /vagrant
> Build directory: /home/vagrant/mesos
> Note: Google Test filter =
> [==] Running 0 tests from 0 test cases.
> [==] 0 tests from 0 test cases ran. (0 ms total)
> [  PASSED  ] 0 tests.
> Inconsistency detected by ld.so: dl-close.c: 762: _dl_close: Assertion 
> `map->l_init_called' failed!
> {code}
> This problem appears on e.g. ubuntu-14.04 with cmake-2.8.12, but also on 
> debian-8 or ubuntu-16.





[jira] [Created] (MESOS-6664) Force cleanup of IOSwitchboard server if it does not terminate after the container terminates.

2016-12-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6664:
-

 Summary: Force cleanup of IOSwitchboard server if it does not 
terminate after the container terminates.
 Key: MESOS-6664
 URL: https://issues.apache.org/jira/browse/MESOS-6664
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu


In the normal case, the IOSwitchboard server terminates after the container 
terminates. However, we should be more defensive and always clean up the 
IOSwitchboard server if it does not terminate within a reasonable grace period.

The reason for the grace period is to allow the IOSwitchboard server to finish 
redirecting the stdout/stderr to the logger.
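
An illustrative POSIX sketch of the defensive cleanup (not actual Mesos code; 
the function name and polling interval are made up):
{code}
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Give the switchboard server a grace period to exit on its own,
// then force-kill it.
void cleanupSwitchboard(pid_t server, int gracePeriodSecs)
{
  for (int i = 0; i < gracePeriodSecs * 10; i++) {
    if (waitpid(server, nullptr, WNOHANG) == server) {
      return;  // Exited on its own within the grace period.
    }
    usleep(100 * 1000);  // Poll every 100ms.
  }

  kill(server, SIGKILL);        // Grace period expired: force cleanup.
  waitpid(server, nullptr, 0);  // Reap the killed process.
}
{code}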





[jira] [Created] (MESOS-6663) Container should be destroyed if IOSwitchboard server terminates unexpectedly.

2016-12-01 Thread Jie Yu (JIRA)
Jie Yu created MESOS-6663:
-

 Summary: Container should be destroyed if IOSwitchboard server 
terminates unexpectedly.
 Key: MESOS-6663
 URL: https://issues.apache.org/jira/browse/MESOS-6663
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu


If the IOSwitchboard server terminates unexpectedly, we should destroy the 
corresponding container because its I/O might not be redirected correctly.

We can leverage the 'watch' method in the Isolator interface.
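
A rough sketch of the idea (hypothetical isolator code; the bookkeeping and 
names here are assumptions, not the actual implementation):
{code}
// Hypothetical fragment: complete watch() when the switchboard process
// exits, which the containerizer treats as a reason to destroy the
// corresponding container.
Future<ContainerLimitation> IOSwitchboard::watch(const ContainerID& containerId)
{
  pid_t pid = switchboardPids[containerId];  // Assumed bookkeeping.

  return process::reap(pid)
    .then([](const Option<int>& status) {
      ContainerLimitation limitation;
      limitation.set_message("I/O switchboard server exited unexpectedly");
      return limitation;
    });
}
{code}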





[jira] [Comment Edited] (MESOS-6661) maintenance status page has inconsistent formatting

2016-12-01 Thread Andrew Blanchard (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712678#comment-15712678
 ] 

Andrew Blanchard edited comment on MESOS-6661 at 12/1/16 6:31 PM:
--

That makes perfect sense. I did not realize that additional key/values were 
included in the {{draining_machines}} list. Setting the "statuses" key would 
certainly help make it clearer. Alternatively, it might make sense to include 
some other information in the {{down_machines}} list, perhaps a timestamp of 
when the machine was set to "down", since that may differ from the scheduled 
start time in the maintenance window.

edit: granted, that would be a bit more of an involved change.


was (Author: blancharda):
That makes perfect sense. I did not realize that additional key/values were 
included in the {{draining_machines}} list. Setting the "statuses" key would 
certainly help make it clearer. Alternatively, it might make sense to include 
some other information in the {{down_machines}} list, perhaps a timestamp of 
when the machine was set to "down", since that may differ from the scheduled 
start time in the maintenance window.

> maintenance status page has inconsistent formatting
> ---
>
> Key: MESOS-6661
> URL: https://issues.apache.org/jira/browse/MESOS-6661
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.0.1
> Environment: Mesos Version: 1.0.1
> OS: Ubuntu 16.04 (both on the client and cluster nodes)
> HTTP Interface: Chrome browser
>Reporter: Andrew Blanchard
>Priority: Trivial
>  Labels: formatting, http
>
> *Description*:
> The MachineID lists on the /maintenance/status page are formatted 
> differently. Specifically, with one machine `down` and one machine 
> `draining`, the page is formatted as shown below. Note that the 
> `draining_machines` list has a label ("id") on the machine ID. This 
> formatting extends to multiple items in each respective list.
> {code:xml}
> {
>   "down_machines": [
> {
>   "hostname": "agent2",
>   "ip": "172.17.0.2"
> }
>   ],
>   "draining_machines": [
> {
>   "id": {
> "hostname": "agent3",
> "ip": "172.17.0.3"
>   }
> }
>   ]
> }
> {code}
> *Steps to reproduce*:
> # Add two agents to the maintenance schedule
> # Move one agent to the down state
> # View the status page at /maintenance/status
> *Proposed solution*:
> Unify the format. Either way is fine - personally I would prefer them to both 
> be in the current form of `down_machines` with no label on the MachineIDs - 
> but regardless, they should be the same.





[jira] [Commented] (MESOS-6661) maintenance status page has inconsistent formatting

2016-12-01 Thread Andrew Blanchard (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712678#comment-15712678
 ] 

Andrew Blanchard commented on MESOS-6661:
-

That makes perfect sense. I did not realize that additional key/values were 
included in the {{draining_machines}} list. Setting the "statuses" key would 
certainly help make it clearer. Alternatively, it might make sense to include 
some other information in the {{down_machines}} list, perhaps a timestamp of 
when the machine was set to "down", since that may differ from the scheduled 
start time in the maintenance window.

> maintenance status page has inconsistent formatting
> ---
>
> Key: MESOS-6661
> URL: https://issues.apache.org/jira/browse/MESOS-6661
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.0.1
> Environment: Mesos Version: 1.0.1
> OS: Ubuntu 16.04 (both on the client and cluster nodes)
> HTTP Interface: Chrome browser
>Reporter: Andrew Blanchard
>Priority: Trivial
>  Labels: formatting, http
>
> *Description*:
> The MachineID lists on the /maintenance/status page are formatted 
> differently. Specifically, with one machine `down` and one machine 
> `draining`, the page is formatted as shown below. Note that the 
> `draining_machines` list has a label ("id") on the machine ID. This 
> formatting extends to multiple items in each respective list.
> {code:xml}
> {
>   "down_machines": [
> {
>   "hostname": "agent2",
>   "ip": "172.17.0.2"
> }
>   ],
>   "draining_machines": [
> {
>   "id": {
> "hostname": "agent3",
> "ip": "172.17.0.3"
>   }
> }
>   ]
> }
> {code}
> *Steps to reproduce*:
> # Add two agents to the maintenance schedule
> # Move one agent to the down state
> # View the status page at /maintenance/status
> *Proposed solution*:
> Unify the format. Either way is fine - personally I would prefer them to both 
> be in the current form of `down_machines` with no label on the MachineIDs - 
> but regardless, they should be the same.





[jira] [Commented] (MESOS-6658) Mesos tests generated with cmake build fail to unload libraries properly

2016-12-01 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712623#comment-15712623
 ] 

Benjamin Bannier commented on MESOS-6658:
-

This error appears when unloading {{libtestanonymous.so}}.

A verbose run showing unloading progression:
{code}
$ LD_DEBUG=files ./src/mesos-tests  --gtest_filter=''
  4467: 
  4467: file=libmesos-1.2.0.so.0 [0];  needed by ./src/mesos-tests [0]
  4467: file=libmesos-1.2.0.so.0 [0];  generating link map
  4467:   dynamic: 0x7f563f66c4e8  base: 0x7f563dd0   size: 
0x0199fe18
  4467: entry: 0x7f563e557a50  phdr: 0x7f563dd00040  phnum: 
 8
  4467: 
  4467: 
  4467: file=libprocess-0.0.1.so.0 [0];  needed by ./src/mesos-tests [0]
  4467: file=libprocess-0.0.1.so.0 [0];  generating link map
  4467:   dynamic: 0x7f563dcf8098  base: 0x7f563d848000   size: 
0x004b7528
  4467: entry: 0x7f563d9374f0  phdr: 0x7f563d848040  phnum: 
 8
  4467: 
  4467: 
  4467: file=libload_qos_controller.so [0];  needed by 
./src/mesos-tests [0]
  4467: file=libload_qos_controller.so [0];  generating link map
  4467:   dynamic: 0x7f563d846a38  base: 0x7f563d621000   size: 
0x00226680
  4467: entry: 0x7f563d633e80  phdr: 0x7f563d621040  phnum: 
 7
  4467: 
  4467: 
  4467: file=libmesos-protobufs.so [0];  needed by ./src/mesos-tests [0]
  4467: file=libmesos-protobufs.so [0];  generating link map
  4467:   dynamic: 0x7f563d60c0b0  base: 0x7f563cd57000   size: 
0x008c9bb0
  4467: entry: 0x7f563cfd43f0  phdr: 0x7f563cd57040  phnum: 
 7
  4467: 
  4467: 
  4467: file=libglog.so.0 [0];  needed by ./src/mesos-tests [0]
  4467: file=libglog.so.0 [0];  generating link map
  4467:   dynamic: 0x7f563cd45c18  base: 0x7f563cb28000   size: 
0x0022e760
  4467: entry: 0x7f563cb31810  phdr: 0x7f563cb28040  phnum: 
 7
  4467: 
  4467: 
  4467: file=libprotobuf.so.9 [0];  needed by ./src/mesos-tests [0]
  4467: file=libprotobuf.so.9 [0];  generating link map
  4467:   dynamic: 0x7f563cb24810  base: 0x7f563c815000   size: 
0x003128f0
  4467: entry: 0x7f563c868650  phdr: 0x7f563c815040  phnum: 
 7
  4467: 
  4467: 
  4467: file=libdl.so.2 [0];  needed by ./src/mesos-tests [0]
  4467: file=libdl.so.2 [0];  generating link map
  4467:   dynamic: 0x7f563c813d88  base: 0x7f563c611000   size: 
0x00203130
  4467: entry: 0x7f563c611ed0  phdr: 0x7f563c611040  phnum: 
 9
  4467: 
  4467: 
  4467: file=librt.so.1 [0];  needed by ./src/mesos-tests [0]
  4467: file=librt.so.1 [0];  generating link map
  4467:   dynamic: 0x7f563c60fd70  base: 0x7f563c409000   size: 
0x00207c78
  4467: entry: 0x7f563c40b350  phdr: 0x7f563c409040  phnum: 
 9
  4467: 
  4467: 
  4467: file=libpthread.so.0 [0];  needed by ./src/mesos-tests [0]
  4467: file=libpthread.so.0 [0];  generating link map
  4467:   dynamic: 0x7f563c403d50  base: 0x7f563c1eb000   size: 
0x0021d530
  4467: entry: 0x7f563c1f1f70  phdr: 0x7f563c1eb040  phnum: 
 9
  4467: 
  4467: 
  4467: file=libstdc++.so.6 [0];  needed by ./src/mesos-tests [0]
  4467: file=libstdc++.so.6 [0];  generating link map
  4467:   dynamic: 0x7f563c1d34f8  base: 0x7f563bee7000   size: 
0x00303400
  4467: entry: 0x7f563bf42620  phdr: 0x7f563bee7040  phnum: 
 8
  4467: 
  4467: 
  4467: file=libm.so.6 [0];  needed by ./src/mesos-tests [0]
  4467: file=libm.so.6 [0];  generating link map
  4467:   dynamic: 0x7f563bee5da8  base: 0x7f563bbe1000   size: 
0x00305168
  4467: entry: 0x7f563bbe6610  phdr: 0x7f563bbe1040  phnum: 
 9
  4467: 
  4467: 
  4467: file=libgcc_s.so.1 [0];  needed by ./src/mesos-tests [0]
  4467: file=libgcc_s.so.1 [0];  generating link map
  4467:   dynamic: 0x7f563bbe04b0  base: 0x7f563b9cb000   size: 
0x00215b20
  4467: entry: 0x7f563b9cdab0  phdr: 0x7f563b9cb040  phnum: 
 6
  4467: 
  4467: 
  4467: file=libc.so.6 [0];  needed by ./src/mesos-tests [0]
  4467: 

[jira] [Commented] (MESOS-6661) maintenance status page has inconsistent formatting

2016-12-01 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712618#comment-15712618
 ] 

Neil Conway commented on MESOS-6661:


The format results from this protobuf definition:

https://github.com/apache/mesos/blob/c33ba209d226fb91874b00976298faf278a29369/include/mesos/maintenance/maintenance.proto#L73

In particular, the {{draining_machines}} list can also include information 
about inverse offers made for each draining machine.

We could make the {{down_machines}} list more similar to the 
{{draining_machines}} list (by adding an "id" key), but I don't think 
superficial consistency is that important, since the two lists do contain 
different information.

We might consider always setting the "statuses" key (rather than omitting that 
key if there are no inverse offers), which might make the asymmetry more 
obvious.
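
With that change, a draining entry would always look something like this 
(illustrative):
{code}
{
  "id": {
    "hostname": "agent3",
    "ip": "172.17.0.3"
  },
  "statuses": []
}
{code}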

cc [~kaysoky]

> maintenance status page has inconsistent formatting
> ---
>
> Key: MESOS-6661
> URL: https://issues.apache.org/jira/browse/MESOS-6661
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.0.1
> Environment: Mesos Version: 1.0.1
> OS: Ubuntu 16.04 (both on the client and cluster nodes)
> HTTP Interface: Chrome browser
>Reporter: Andrew Blanchard
>Priority: Trivial
>  Labels: formatting, http
>
> *Description*:
> The MachineID lists on the /maintenance/status page are formatted 
> differently. Specifically, with one machine `down` and one machine 
> `draining`, the page is formatted as shown below. Note that the 
> `draining_machines` list has a label ("id") on the machine ID. This 
> formatting extends to multiple items in each respective list.
> {code:xml}
> {
>   "down_machines": [
> {
>   "hostname": "agent2",
>   "ip": "172.17.0.2"
> }
>   ],
>   "draining_machines": [
> {
>   "id": {
> "hostname": "agent3",
> "ip": "172.17.0.3"
>   }
> }
>   ]
> }
> {code}
> *Steps to reproduce*:
> # Add two agents to the maintenance schedule
> # Move one agent to the down state
> # View the status page at /maintenance/status
> *Proposed solution*:
> Unify the format. Either way is fine - personally I would prefer them to both 
> be in the current form of `down_machines` with no label on the MachineIDs - 
> but regardless, they should be the same.





[jira] [Created] (MESOS-6662) Some HTTP scheduler calls are missing from the docs

2016-12-01 Thread Greg Mann (JIRA)
Greg Mann created MESOS-6662:


 Summary: Some HTTP scheduler calls are missing from the docs
 Key: MESOS-6662
 URL: https://issues.apache.org/jira/browse/MESOS-6662
 Project: Mesos
  Issue Type: Bug
  Components: documentation
Reporter: Greg Mann


Some of the calls available to HTTP schedulers are missing from the HTTP 
scheduler API documentation. We should make sure that all of the calls 
available in the {{Master::Http::scheduler}} handler are in the documentation 
[here|https://github.com/apache/mesos/blob/master/docs/scheduler-http-api.md].
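
For context, each such call is a JSON POST to the master's 
{{/api/v1/scheduler}} endpoint; an illustrative (not authoritative) example of 
an explicit RECONCILE call:
{code}
POST /api/v1/scheduler HTTP/1.1
Content-Type: application/json

{
  "framework_id": {"value": "12220-3440-12532-2345"},
  "type": "RECONCILE",
  "reconcile": {
    "tasks": [
      {"task_id": {"value": "312325"}}
    ]
  }
}
{code}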





[jira] [Updated] (MESOS-6661) maintenance status page has inconsistent formatting

2016-12-01 Thread Andrew Blanchard (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Blanchard updated MESOS-6661:

Priority: Trivial  (was: Major)

> maintenance status page has inconsistent formatting
> ---
>
> Key: MESOS-6661
> URL: https://issues.apache.org/jira/browse/MESOS-6661
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.0.1
> Environment: Mesos Version: 1.0.1
> OS: Ubuntu 16.04 (both on the client and cluster nodes)
> HTTP Interface: Chrome browser
>Reporter: Andrew Blanchard
>Priority: Trivial
>  Labels: formatting, http
>
> *Description*:
> The MachineID lists on the /maintenance/status page are formatted 
> differently. Specifically, with one machine `down` and one machine 
> `draining`, the page is formatted as shown below. Note that the 
> `draining_machines` list has a label ("id") on the machine ID. This 
> formatting extends to multiple items in each respective list.
> {code:xml}
> {
>   "down_machines": [
> {
>   "hostname": "agent2",
>   "ip": "172.17.0.2"
> }
>   ],
>   "draining_machines": [
> {
>   "id": {
> "hostname": "agent3",
> "ip": "172.17.0.3"
>   }
> }
>   ]
> }
> {code}
> *Steps to reproduce*:
> # Add two agents to the maintenance schedule
> # Move one agent to the down state
> # View the status page at /maintenance/status
> *Proposed solution*:
> Unify the format. Either way is fine - personally I would prefer them to both 
> be in the current form of `down_machines` with no label on the MachineIDs - 
> but regardless, they should be the same.





[jira] [Assigned] (MESOS-6339) Support docker registry that requires basic auth.

2016-12-01 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song reassigned MESOS-6339:
---

Assignee: Gilbert Song

> Support docker registry that requires basic auth.
> -
>
> Key: MESOS-6339
> URL: https://issues.apache.org/jira/browse/MESOS-6339
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Gilbert Song
>
> Currently, we assume Bearer auth (in the Mesos containerizer) because it's 
> what Docker Hub uses. We also need to support basic auth for some private 
> registries that people deploy.
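
For illustration, the difference is just the {{Authorization}} header on 
registry requests (hypothetical values; the base64 payload below encodes 
"user:password"):
{code}
# Bearer auth (what Docker Hub uses): token obtained from an auth server.
Authorization: Bearer <token>

# Basic auth (what some private registries use): base64 of user:password.
Authorization: Basic dXNlcjpwYXNzd29yZA==
{code}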





[jira] [Created] (MESOS-6661) maintenance status page has inconsistent formatting

2016-12-01 Thread Andrew Blanchard (JIRA)
Andrew Blanchard created MESOS-6661:
---

 Summary: maintenance status page has inconsistent formatting
 Key: MESOS-6661
 URL: https://issues.apache.org/jira/browse/MESOS-6661
 Project: Mesos
  Issue Type: Bug
  Components: HTTP API
Affects Versions: 1.0.1
 Environment: Mesos Version: 1.0.1
OS: Ubuntu 16.04 (both on the client and cluster nodes)
HTTP Interface: Chrome browser

Reporter: Andrew Blanchard


*Description*:
The MachineID lists on the /maintenance/status page are formatted differently. 
Specifically, with one machine `down` and one machine `draining`, the page is 
formatted as shown below. Note that the `draining_machines` list has a label 
("id") on the machine ID. This formatting extends to multiple items in each 
respective list.

{code:xml}
{
  "down_machines": [
{
  "hostname": "agent2",
  "ip": "172.17.0.2"
}
  ],
  "draining_machines": [
{
  "id": {
"hostname": "agent3",
"ip": "172.17.0.3"
  }
}
  ]
}
{code}

*Steps to reproduce*:
# Add two agents to the maintenance schedule
# Move one agent to the down state
# View the status page at /maintenance/status

*Proposed solution*:
Unify the format. Either way is fine - personally I would prefer them to both 
be in the current form of `down_machines` with no label on the MachineIDs - but 
regardless, they should be the same.





[jira] [Commented] (MESOS-6250) Ensure valid task state before connecting with framework on master failover

2016-12-01 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712324#comment-15712324
 ] 

Neil Conway commented on MESOS-6250:


{quote}
Now, on Mesos master failover, Mesos does not guarantee that it first 
re-registers with its slaves before it re-connects to a framework. So it can 
occur that the framework connects before Mesos has finished or started the 
re-registration with the slaves. When the framework then sends reconciliation 
requests directly after re-registration, Mesos will reply with status updates 
where the task state is wrong (TASK_LOST instead of TASK_RUNNING).
{quote}

This is not quite true. This is the scenario:

* Framework F runs task T on agent A. Master M1 is the current leading master.
* M1 fails over and M2 is elected as the new leading master.
* F re-registers with M2 and does explicit reconciliation for task T; agent A 
has not yet re-registered.

At this point, explicit reconciliation for {{T}} will _not_ return 
{{TASK_LOST}} -- it does not return anything, because the master gives the 
agent 10 minutes to re-register (see {{agent_reregister_timeout}} master flag). 
If the agent reregister timeout has expired, the master returns {{TASK_LOST}} 
-- but the task _will_ be terminated if/when the agent re-registers. (Returning 
TASK_LOST here is not "wrong", because it is consistent with what TASK_LOST 
means in general: it does _not_ mean that the task is definitely not running, 
but rather that the master has lost contact with the agent running the task.) 
For more on the behavior here, see 
https://mesos.apache.org/documentation/latest/high-availability-framework-guide/
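
For reference, explicit reconciliation with the driver-based C++ API looks 
roughly like this -- a minimal sketch, with an illustrative task ID and the 
{{state}} field set only because the protobuf requires it:

{code}
#include <vector>

#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

// Ask the master for its latest known state of one task. The master
// answers with its current view, but stays silent for tasks on agents
// that have not yet re-registered (until `agent_reregister_timeout`).
void reconcileTask(mesos::SchedulerDriver* driver,
                   const mesos::TaskID& taskId)
{
  mesos::TaskStatus status;
  status.mutable_task_id()->CopyFrom(taskId);
  status.set_state(mesos::TASK_RUNNING); // Required field; the master
                                         // keys on the task ID.

  driver->reconcileTasks(std::vector<mesos::TaskStatus>{status});
}
{code}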

There are several projects in progress to improve this behavior:

* If the framework is partition-aware (see MESOS-5344 and MESOS-6394), it will 
see a new set of task states that more precisely describe the master's 
knowledge of the state of the task (TASK_UNREACHABLE, TASK_UNKNOWN, TASK_GONE, 
TASK_DROPPED, etc.). An experimental version of this feature is available in 
Mesos 1.1 and will be improved in 1.2.
* To clarify reconciliation behavior after master failover but before the 
{{agent_reregister_timeout}} has expired, we might change the master to instead 
send a different task state (e.g., {{TASK_FAILOVER_UNKNOWN}}); see MESOS-4050.

Therefore I'm going to close this ticket, because I think the current issues 
are covered by other existing JIRAs. Please let me know if you disagree.

> Ensure valid task state before connecting with framework on master failover
> ---
>
> Key: MESOS-6250
> URL: https://issues.apache.org/jira/browse/MESOS-6250
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0, 0.28.1, 1.0.1
> Environment: OS X 10.11.6
>Reporter: Markus Jura
>Priority: Minor
>
> During a Mesos master failover the master re-registers with its slaves to 
> receive the current state of the running tasks. It also reconnects to a 
> framework.
> In the documentation it is recommended that a framework performs an explicit 
> task reconciliation when the Mesos master re-registers: 
> http://mesos.apache.org/documentation/latest/reconciliation/
> When allowing a framework to reconcile, the Mesos master should guarantee 
> that its task state is valid, i.e. the same as on the slaves. Otherwise, 
> Mesos can reply with status updates of state {{TASK_LOST}} even though the 
> task is still running on the slave.
> Now, on Mesos master failover, Mesos does not guarantee that it first 
> re-registers with its slaves before it re-connects to a framework. So it can 
> occur that the framework connects before Mesos has finished or started the 
> re-registration with the slaves. When the framework then sends reconciliation 
> requests directly after re-registration, Mesos will reply with status 
> updates where the task state is wrong ({{TASK_LOST}} instead of 
> {{TASK_RUNNING}}).
> For a reconciliation request, Mesos should guarantee that the task state is 
> consistent with the slaves before it replies with a status update.
> Another possibility would be for Mesos to send a message to the framework 
> once it has re-registered with the slaves, so that the framework then starts 
> the reconciliation. So far, a framework can only delay the reconciliation by 
> a certain amount of time, but it does not know how long the delay should be, 
> because Mesos does not notify the framework when the task state is 
> consistent again.
> *Log: Mesos master - connecting with framework before re-registering with 
> slaves*
> {code:bash}
> I0926 12:39:42.006933 4284416 detector.cpp:152] Detected a new leader: 
> (id='92')
> I0926 12:39:42.007242 1064960 group.cpp:706] Trying to get 
> '/mesos/json.info_92' in ZooKeeper
> I0926 12:39:42.008129 4284416 

[jira] [Commented] (MESOS-6550) Mesos master ui shows a Lost task as Running in Completed tasks section

2016-12-01 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712232#comment-15712232
 ] 

Neil Conway commented on MESOS-6550:


I believe these issues should be resolved by the RR chain that ends here:

https://reviews.apache.org/r/54232/

i.e., these JIRAs: MESOS-6419, MESOS-6619, MESOS-6602.

> Mesos master ui shows a Lost task as Running in Completed tasks section
> ---
>
> Key: MESOS-6550
> URL: https://issues.apache.org/jira/browse/MESOS-6550
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Megha
>Assignee: Neil Conway
> Attachments: screenshot-1.png
>
>
> This happens particularly when an agent is marked unreachable: as a result, 
> the master marks the tasks from partition-unaware frameworks as lost. But 
> when the agent comes back up, the master UI shows another instance of such a 
> task (from the unaware framework) as running in the Completed Tasks section, 
> although the master already sent a kill for this task and the agent acted on 
> it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6660) Status updates and disconnected frameworks

2016-12-01 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712207#comment-15712207
 ] 

Neil Conway commented on MESOS-6660:


[~anandmazumdar] [~vinodkone]

> Status updates and disconnected frameworks
> --
>
> Key: MESOS-6660
> URL: https://issues.apache.org/jira/browse/MESOS-6660
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> When the master receives a status update, it invokes {{Master::forward}} to 
> send the status update to the framework. {{Master::forward}} also updates the 
> task's {{status_update_state}} and {{status_update_uuid}} fields. However, 
> {{Master::forward}} is not invoked for disconnected frameworks.
> This scheme has the following drawbacks:
> # Logically, {{Master::forward}} probably shouldn't be updating the state of 
> a task; {{forward}} should just forward messages.
> # The reasoning for not updating {{status_update_state}} and 
> {{status_update_uuid}} for disconnected frameworks is to try to avoid 
> inconsistencies upon framework reconciliation -- we don't want reconciliation 
> to show update X, but then the framework to receive a non-reconciliation 
> update Y, such that Y comes before X in the agent's order. I'm not sure that 
> this is actually a problem (since we wait for framework acks before 
> forwarding subsequent status updates), but in any case, depending on 
> framework connectivity _cannot_ be a correct fix: whether the master thinks 
> the framework is connected at a given time is unreliable and racy (i.e., the 
> master might think the framework is disconnected but it still might receive 
> messages; similarly, "connected" frameworks might not receive messages).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6660) Status updates and disconnected frameworks

2016-12-01 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6660:
--

 Summary: Status updates and disconnected frameworks
 Key: MESOS-6660
 URL: https://issues.apache.org/jira/browse/MESOS-6660
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Neil Conway


When the master receives a status update, it invokes {{Master::forward}} to 
send the status update to the framework. {{Master::forward}} also updates the 
task's {{status_update_state}} and {{status_update_uuid}} fields. However, 
{{Master::forward}} is not invoked for disconnected frameworks.

This scheme has the following drawbacks:

# Logically, {{Master::forward}} probably shouldn't be updating the state of a 
task; {{forward}} should just forward messages.
# The reasoning for not updating {{status_update_state}} and 
{{status_update_uuid}} for disconnected frameworks is to try to avoid 
inconsistencies upon framework reconciliation -- we don't want reconciliation 
to show update X, but then the framework to receive a non-reconciliation update 
Y, such that Y comes before X in the agent's order. I'm not sure that this is 
actually a problem (since we wait for framework acks before forwarding 
subsequent status updates), but in any case, depending on framework 
connectivity _cannot_ be a correct fix: whether the master thinks the framework 
is connected at a given time is unreliable and racy (i.e., the master might 
think the framework is disconnected but it still might receive messages; 
similarly, "connected" frameworks might not receive messages).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6635) Update allocator to handle multi-role frameworks.

2016-12-01 Thread Qian Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang updated MESOS-6635:
--
Description: 
The allocator needs to be adjusted once we allow frameworks to have multiple 
roles:

(1) When adding a framework, we need to store all of its roles and add it to 
multiple role sorters.

(2) We will CHECK that the framework does not modify its roles when updating 
the framework (much like we do for single-role frameworks).

(3) When performing an allocation, the allocator will set allocation_info.role. 
When recovering resources, the allocator will unset allocation_info.role.

(4) The allocator will send AllocationInfo alongside offers that it sends to 
the master, so that the master can easily augment {{Offer}} with allocation 
info.

  was:
The allocator needs to be adjusted once we allow frameworks to have multiple 
roles:

(1) When adding a framework a framework, we need to store all of its roles and 
add it to multiple role sorters.

(2) We will CHECK that the framework does not modify its roles when updating 
the framework (much like we do for single-role frameworks).

(3) When performing an allocation, the allocator will set allocation_info.role. 
When recovering resources, the allocator will unset allocation_info.role.

(4) The allocator will send AllocationInfo alongside offers that it sends to 
the master, so that the master can easily augment {{Offer}} with allocation 
info.


> Update allocator to handle multi-role frameworks.
> -
>
> Key: MESOS-6635
> URL: https://issues.apache.org/jira/browse/MESOS-6635
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Mahler
>
> The allocator needs to be adjusted once we allow frameworks to have multiple 
> roles:
> (1) When adding a framework, we need to store all of its roles and add it to 
> multiple role sorters.
> (2) We will CHECK that the framework does not modify its roles when updating 
> the framework (much like we do for single-role frameworks).
> (3) When performing an allocation, the allocator will set 
> allocation_info.role. When recovering resources, the allocator will unset 
> allocation_info.role.
> (4) The allocator will send AllocationInfo alongside offers that it sends to 
> the master, so that the master can easily augment {{Offer}} with allocation 
> info.
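
For illustration, a sketch of point (1) with hypothetical names rather than 
the actual allocator code:

{code}
void addFramework(const FrameworkID& frameworkId,
                  const FrameworkInfo& frameworkInfo)
{
  // Store all of the framework's roles, not just a single one.
  frameworks[frameworkId].roles = std::set<std::string>(
      frameworkInfo.roles().begin(), frameworkInfo.roles().end());

  // Register the framework with the sorter of each of its roles.
  for (const std::string& role : frameworks[frameworkId].roles) {
    frameworkSorters.at(role)->add(frameworkId.value());
  }
}
{code}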



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6659) Commit 93fc87c breaks compilation with clang

2016-12-01 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-6659:
---

 Summary: Commit 93fc87c breaks compilation with clang
 Key: MESOS-6659
 URL: https://issues.apache.org/jira/browse/MESOS-6659
 Project: Mesos
  Issue Type: Bug
 Environment: Mesos HEAD on commit 
{{93fc87cde504cc5a38fbfe566f12fac888a61cc0}},
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Reporter: Jan Schlicht
Priority: Blocker


Running {{make}} on macOS with clang fails with
{noformat}
In file included from ../../src/slave/containerizer/mesos/containerizer.cpp:58:
../../src/slave/containerizer/mesos/containerizer.hpp:296:18: error: private 
field 'ioSwitchboard' is not used [-Werror,-Wunused-private-field]
  IOSwitchboard* ioSwitchboard;
 ^
1 error generated.
make[2]: *** 
[slave/containerizer/mesos/libmesos_no_3rdparty_la-containerizer.lo] Error 1
make[2]: *** Waiting for unfinished jobs
{noformat}
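
Two common remedies for this class of error, sketched below (not necessarily 
the fix that landed):

{code}
class IOSwitchboard;

class Example
{
private:
  // Option 1: remove the member until the code that needs it lands.

  // Option 2: keep it, but mark it deliberately unused so that the
  // -Werror build stays clean; clang honors this attribute.
  IOSwitchboard* ioSwitchboard __attribute__((unused));
};
{code}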



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6658) Mesos tests generated with cmake build fail to unload libraries properly

2016-12-01 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-6658:
---

 Summary: Mesos tests generated with cmake build fail to unload 
libraries properly
 Key: MESOS-6658
 URL: https://issues.apache.org/jira/browse/MESOS-6658
 Project: Mesos
  Issue Type: Bug
  Components: cmake, tests
Affects Versions: 1.2.0
Reporter: Benjamin Bannier
Assignee: Benjamin Bannier


A default cmake build created from {{ec0546e}} creates a {{mesos-tests}} binary 
which cannot unload its dependencies without an error,
{code}
$ ./src/mesos-tests  --gtest_filter=''
Source directory: /vagrant
Build directory: /home/vagrant/mesos
Note: Google Test filter =
[==] Running 0 tests from 0 test cases.
[==] 0 tests from 0 test cases ran. (0 ms total)
[  PASSED  ] 0 tests.
Inconsistency detected by ld.so: dl-close.c: 762: _dl_close: Assertion 
`map->l_init_called' failed!
{code}
This problem appears on e.g. ubuntu-14.04 with cmake-2.8.12, but also on 
debian-8 and ubuntu-16.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6474) Add fine-grained ACLs for authorization with the new debugging APIs

2016-12-01 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711253#comment-15711253
 ] 

Adam B commented on MESOS-6474:
---

That was 1) the authz interface and local-authorizer handling, and 2) authz for 
nested containers, but we still need 3) authz for debugging APIs 
(LaunchNestedContainerSession, AttachContainerInput, AttachContainerOutput).
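
The wiring for 3) would presumably follow the existing ACL format. A sketch 
with hypothetical ACL names (the actual names are defined by the patches under 
review):

{code}
{
  "launch_nested_container_sessions": [
    {
      "principals": { "values": ["ops"] },
      "users": { "values": ["root"] }
    }
  ],
  "attach_containers_input": [
    {
      "principals": { "values": ["ops"] },
      "users": { "values": ["root"] }
    }
  ]
}
{code}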

> Add fine-grained ACLs for authorization with the new debugging APIs
> ---
>
> Key: MESOS-6474
> URL: https://issues.apache.org/jira/browse/MESOS-6474
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Alexander Rojas
>  Labels: debugging, mesosphere, security
>
> We already have ACLs in place for determining if a user has access to see a 
> certain task when querying {{state.json}} on the master/agent, or 
> browse/download a task's sandbox. However, we will have to add similar ACLs 
> for making sure they have the correct permissions to execute the new 
> Debugging APIs on behalf of those tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6474) Add fine-grained ACLs for authorization with the new debugging APIs

2016-12-01 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711250#comment-15711250
 ] 

Adam B commented on MESOS-6474:
---

commit de2a7f41407b6b171d10675b7a09bcbfea41564d
Author: Alexander Rojas 
Date:   Wed Nov 30 18:03:40 2016 -0800

Added authorization to Nested Container API.

Makes use of the already existing authorization actions and ACL
definitions and wires them together with the existing API
implementations.

Review: https://reviews.apache.org/r/53851/

commit 19296e0fc2bd28f83bafdf5a7ac48146ee085449
Author: Alexander Rojas 
Date:   Wed Nov 30 17:51:22 2016 -0800

Added authorization actions for Nested Container and Debug API.

Creates new authorization actions for all the APIs related to
nested containers. This patch does not add the code necessary to
use those actions; that is done in a later patch.

Review: https://reviews.apache.org/r/53541/


> Add fine-grained ACLs for authorization with the new debugging APIs
> ---
>
> Key: MESOS-6474
> URL: https://issues.apache.org/jira/browse/MESOS-6474
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Alexander Rojas
>  Labels: debugging, mesosphere, security
>
> We already have ACLs in place for determining if a user has access to see a 
> certain task when querying {{state.json}} on the master/agent, or 
> browse/download a task's sandbox. However, we will have to add similar ACLs 
> for making sure they have the correct permissions to execute the new 
> Debugging APIs on behalf of those tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)