[jira] [Updated] (MESOS-8335) ProvisionerDockerTest fails on Debian 9 and CentOS 6.

2018-01-02 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8335:
--
Sprint: Mesosphere Sprint 71

> ProvisionerDockerTest fails on Debian 9 and CentOS 6.
> --
>
> Key: MESOS-8335
> URL: https://issues.apache.org/jira/browse/MESOS-8335
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Armand Grillet
>Assignee: Armand Grillet
> Attachments: centos-6-curl-7.19.7.txt, centos-6-curl-7.57.txt
>
>
> Version of Docker used: Docker version 17.11.0-ce, build 1caf76c
> Version of Curl used: curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 
> OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) 
> libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3
> Error:
> {code}
> [ RUN  ] 
> ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
> I1215 00:09:28.694677 19343 cluster.cpp:172] Creating default 'local' 
> authorizer
> I1215 00:09:28.697144 30867 master.cpp:456] Master 
> 75b48a47-7b6b-4e60-82d3-dfdc0cf8bff3 (ip-172-16-10-160.ec2.internal) started 
> on 127.0.1.1:35029
> I1215 00:09:28.697163 30867 master.cpp:458] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/4RYdF1/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/4RYdF1/master" 
> --zk_session_timeout="10secs"
> I1215 00:09:28.697413 30867 master.cpp:507] Master only allowing 
> authenticated frameworks to register
> I1215 00:09:28.697422 30867 master.cpp:513] Master only allowing 
> authenticated agents to register
> I1215 00:09:28.697427 30867 master.cpp:519] Master only allowing 
> authenticated HTTP frameworks to register
> I1215 00:09:28.697433 30867 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/4RYdF1/credentials'
> I1215 00:09:28.697654 30867 master.cpp:563] Using default 'crammd5' 
> authenticator
> I1215 00:09:28.697806 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1215 00:09:28.697962 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1215 00:09:28.698076 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1215 00:09:28.698194 30867 master.cpp:642] Authorization enabled
> I1215 00:09:28.698468 30864 hierarchical.cpp:175] Initialized hierarchical 
> allocator process
> I1215 00:09:28.698563 30864 whitelist_watcher.cpp:77] No whitelist given
> I1215 00:09:28.701695 30871 master.cpp:2209] Elected as the leading master!
> I1215 00:09:28.701723 30871 master.cpp:1689] Recovering from registrar
> I1215 00:09:28.701859 30869 registrar.cpp:347] Recovering registrar
> I1215 00:09:28.702401 30869 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 507904ns
> I1215 00:09:28.702495 30869 registrar.cpp:495] Applied 1 operations in 
> 28977ns; attempting to update the registry
> I1215 00:09:28.702997 30869 registrar.cpp:552] Successfully updated the 
> registry in 464896ns
> I1215 00:09:28.703086 30869 registrar.cpp:424] Successfully recovered 
> registrar
> I1215 00:09:28.703640 30865 master.cpp:1802] Recovered 0 agents from the 
> registry (167B); allowing 10mins for agents to re-register
> I1215 00:09:28.703661 30869 hierarchical.cpp:213] Skipping recovery of 
> hierarchical allocator: nothing to recover
> W1215 00:09:28.706816 19343 process.cpp:2756] Attempted to spawn already 
> running process files@127.0.1.1:35029
> I1215 00:09:28.707818 19343 containerizer.cpp:304] Using isolation { 
> environment_secret, volume/sandbox_path, volume/host_path, docker/runtime, 
> 

[jira] [Updated] (MESOS-7967) Make `mesos-execute` work with old-style resources

2018-01-02 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7967:

Target Version/s: 1.5.1

> Make `mesos-execute` work with old-style resources
> --
>
> Key: MESOS-7967
> URL: https://issues.apache.org/jira/browse/MESOS-7967
> Project: Mesos
>  Issue Type: Improvement
>  Components: cli
>Reporter: Michael Park
>
> {{mesos-execute}} should be updated to be able to handle the
> "pre-reservation-refinement" resource format.
> For reservation refinement, a new resource format was introduced.
> The master and agent have been carefully updated to be able to handle
> pre/post reservation-refinement resource formats, whereas the example
> frameworks and {{mesos-execute}} were updated such that they require
> the new resource format. While the example frameworks are probably fine
> being updated to use the new format, {{mesos-execute}} is used as a
> developer tool, and as such we should update it to be more robust in its
> handling of resource formats.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7967) Make `mesos-execute` work with old-style resources

2018-01-02 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308958#comment-16308958
 ] 

Michael Park commented on MESOS-7967:
-

[~jieyu]: yep. I'll push to 1.5.1

> Make `mesos-execute` work with old-style resources
> --
>
> Key: MESOS-7967
> URL: https://issues.apache.org/jira/browse/MESOS-7967
> Project: Mesos
>  Issue Type: Improvement
>  Components: cli
>Reporter: Michael Park
>
> {{mesos-execute}} should be updated to be able to handle the
> "pre-reservation-refinement" resource format.
> For reservation refinement, a new resource format was introduced.
> The master and agent have been carefully updated to be able to handle
> pre/post reservation-refinement resource formats, whereas the example
> frameworks and {{mesos-execute}} were updated such that they require
> the new resource format. While the example frameworks are probably fine
> being updated to use the new format, {{mesos-execute}} is used as a
> developer tool, and as such we should update it to be more robust in its
> handling of resource formats.





[jira] [Created] (MESOS-8372) Stop masters and agents from downgrading resources to pre-reservation-refinement format.

2018-01-02 Thread Michael Park (JIRA)
Michael Park created MESOS-8372:
---

 Summary: Stop masters and agents from downgrading resources to 
pre-reservation-refinement format.
 Key: MESOS-8372
 URL: https://issues.apache.org/jira/browse/MESOS-8372
 Project: Mesos
  Issue Type: Task
  Components: agent, master
Reporter: Michael Park


Due to the 1.x compatibility story, we need to downgrade the resources on 
masters and agents for the rest of 1.x. One option for 2.0 would be for 1.x to 
be upgradable to 2.0 (i.e., 2.0 would read both the pre- and post- reservation 
refinement formats), but for 2.0 not to be downgradable to 1.x in general, 
only to 1.4.x (i.e., 2.0 can checkpoint in post- format only).

If we restrict the upgrade path to 1.x -> 2.0 -> 2.y, then starting from 2.1 
we can also read only the post- format.





[jira] [Commented] (MESOS-8368) Improve HTTP parser to support HTTP/2 messages.

2018-01-02 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308938#comment-16308938
 ] 

James Peach commented on MESOS-8368:


Probably we should implement 
[SSL_CTX_set_next_protos_advertised_cb|https://www.openssl.org/docs/man1.1.0/ssl/SSL_set_alpn_protos.html] 
and only advertise {{http/1.1}}. This ought to prevent HTTP/2 negotiation, 
though it seems pretty aggressive of curl to try HTTP/2 without an explicit 
negotiation.

> Improve HTTP parser to support HTTP/2 messages.
> ---
>
> Key: MESOS-8368
> URL: https://issues.apache.org/jira/browse/MESOS-8368
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Armand Grillet
>
> We currently use [http-parser|https://github.com/nodejs/http-parser] to parse 
> HTTP messages. This parser does not work with HTTP/2 requests and responses, 
> which is an issue as curl enables HTTP/2 by default for HTTPS connections 
> since version 7.47.
> The issue has been discovered in some of our tests (e.g. 
> ProvisionerDockerTest) where it crashes with the message {{Failed to decode 
> HTTP responses: Decoding failed}}. See 
> [MESOS-8335|https://issues.apache.org/jira/browse/MESOS-8335] for more 
> details.
> Possible long-term solutions:
> * Upgrade the parser to be compatible with HTTP/2 messages. 
> [http-parser|https://github.com/nodejs/http-parser] has not been updated 
> regularly this past year in favor of 
> [nghttp2|https://github.com/nghttp2/nghttp2] which has a much broader scope. 
> [There is no equivalent of http-parser for HTTP/2 
> yet|https://users.rust-lang.org/t/is-there-anything-similar-to-http-parser-but-for-http2/10721].
> * Test which version of curl is used at startup and report an error if the 
> version is >= 7.47 and the flag {{--http1.0}} is not used in curl (more 
> details regarding this flag are available 
> [here|https://curl.haxx.se/docs/manpage.html]).
> In the meantime, we are upgrading our testing machines using a recent version 
> of curl to run with the flag {{--http1.0}} 
> ([MESOS-8335|https://issues.apache.org/jira/browse/MESOS-8335]).
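
The startup check proposed in the second bullet could look roughly like the following. This is only a sketch: the function names are hypothetical, and it assumes the version string has the shape printed by `curl --version` as quoted in MESOS-8335 (e.g. "curl 7.52.1 (x86_64-pc-linux-gnu) ..."). The 7.47 threshold is the one stated in the issue.

```python
import re

# Threshold taken from the issue text: curl enables HTTP/2 by default
# for HTTPS connections since version 7.47.
HTTP2_DEFAULT_SINCE = (7, 47)


def parse_curl_version(version_output):
    """Extract (major, minor, patch) from the first line of `curl --version`."""
    match = re.match(r"curl (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("unrecognized curl version output")
    return tuple(int(part) for part in match.groups())


def needs_http1_flag(version_output):
    """Return True if this curl may try HTTP/2 unless told otherwise."""
    return parse_curl_version(version_output)[:2] >= HTTP2_DEFAULT_SINCE
```

A caller would report an error when `needs_http1_flag(...)` is true and the curl invocation does not already carry {{--http1.0}}.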





[jira] [Created] (MESOS-8371) Support External Storage using CSI and External Resource Provider

2018-01-02 Thread Jie Yu (JIRA)
Jie Yu created MESOS-8371:
-

 Summary: Support External Storage using CSI and External Resource 
Provider
 Key: MESOS-8371
 URL: https://issues.apache.org/jira/browse/MESOS-8371
 Project: Mesos
  Issue Type: Epic
Reporter: Jie Yu


MESOS-7235 adds the foundation for supporting CSI based storage plugins. This 
epic captures the work to support external storage using CSI and external 
resource provider.





[jira] [Commented] (MESOS-8221) Use protobuf reflection to simplify downgrading of resources.

2018-01-02 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308867#comment-16308867
 ] 

Michael Park commented on MESOS-8221:
-

[~jieyu]: The protobuf reflection based approach caught cases that I had missed 
in the containerizer code.
Specifically, {{ContainerTermination}} and {{ContainerConfig}} which contain 
{{Resource}}. I believe
these resources also need to be downgraded in order for the agent to remain 
downgradable.

After some digging, I found that the {{Resource}} fields in 
{{ContainerTermination}} were added in 
[98d96ca|https://github.com/apache/mesos/commit/98d96ca96570eb4d0d1604ba738c24ecc7e71f7f#diff-4d34722b5ad4f490a95639a6d441106dR256],
which is after 1.4.x, and the {{Resource}} fields in {{ContainerConfig}} were 
added earlier, but the {{ContainerConfig}} message itself was not checkpointed 
until 
[03a2a4d|https://github.com/apache/mesos/commit/03a2a4dfa47b1d47c5eb23e81f5ef8213e46d545#diff-c8ca6e064a8bf7b1b3c70e6525eabeceR1354],
which is also after 1.4.x. So we're fine on both counts, and
we should get this in for 1.5.0 to make sure these are downgraded, and to 
future-proof.

> Use protobuf reflection to simplify downgrading of resources.
> -
>
> Key: MESOS-8221
> URL: https://issues.apache.org/jira/browse/MESOS-8221
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Blocker
>
> We currently have a {{downgradeResources}} function which is called on every
> {{repeated Resource}} field in every message that we checkpoint. We should 
> leverage
> protobuf reflection to automatically downgrade any instances of {{Resource}} 
> within any
> protobuf message.
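
To illustrate the idea of walking a message generically instead of calling a downgrade function per field, here is a small sketch. It does not use protobuf: Python dataclass introspection stands in for protobuf reflection, and every type and field below (including the `downgraded` marker) is hypothetical and only mimics the nesting described in the issue.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import List


@dataclass
class Resource:
    name: str
    downgraded: bool = False  # hypothetical marker for "old format"


@dataclass
class ContainerConfig:  # hypothetical shape; only the nesting matters
    resources: List[Resource] = field(default_factory=list)


@dataclass
class Checkpoint:  # hypothetical top-level checkpointed message
    config: ContainerConfig
    extra: List[Resource] = field(default_factory=list)


def downgrade_all(message):
    """Recursively visit every field and downgrade each Resource found.

    Returns the number of Resource instances that were visited.
    """
    if isinstance(message, Resource):
        message.downgraded = True
        return 1
    if isinstance(message, list):
        return sum(downgrade_all(item) for item in message)
    if is_dataclass(message):
        return sum(downgrade_all(getattr(message, f.name))
                   for f in fields(message))
    return 0  # scalars and anything else are left untouched
```

The point of the generic walk is that a newly added message containing `Resource` is covered without anyone having to remember to add another explicit call.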





[jira] [Created] (MESOS-8370) `liburi_volume_profile.la` cannot be built standalone.

2018-01-02 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8370:
--

 Summary: `liburi_volume_profile.la` cannot be built standalone.
 Key: MESOS-8370
 URL: https://issues.apache.org/jira/browse/MESOS-8370
 Project: Mesos
  Issue Type: Bug
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


Currently the `liburi_volume_profile.la` module cannot be built standalone. The 
reason is that this module depends on the following three generated header 
files:
{noformat}
../include/csi/csi.grpc.pb.h
../include/csi/csi.pb.h
resource_provider/storage/volume_profile.pb.h
{noformat}
But there is no way in autotools to specify such dependencies.





[jira] [Updated] (MESOS-8221) Use protobuf reflection to simplify downgrading of resources.

2018-01-02 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-8221:

Priority: Blocker  (was: Major)

> Use protobuf reflection to simplify downgrading of resources.
> -
>
> Key: MESOS-8221
> URL: https://issues.apache.org/jira/browse/MESOS-8221
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Michael Park
>Assignee: Michael Park
>Priority: Blocker
>
> We currently have a {{downgradeResources}} function which is called on every
> {{repeated Resource}} field in every message that we checkpoint. We should 
> leverage
> protobuf reflection to automatically downgrade any instances of {{Resource}} 
> within any
> protobuf message.





[jira] [Updated] (MESOS-5362) Add authentication to example frameworks

2018-01-02 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-5362:
--
Sprint: Mesosphere Sprint 71

> Add authentication to example frameworks
> 
>
> Key: MESOS-5362
> URL: https://issues.apache.org/jira/browse/MESOS-5362
> Project: Mesos
>  Issue Type: Improvement
>  Components: security
>Reporter: Greg Mann
>Assignee: Till Toenshoff
>  Labels: authentication, mesosphere, security
>
> Some example frameworks do not have the ability to authenticate with the 
> master. Adding authentication to the example frameworks that don't already 
> have it implemented would allow us to use these frameworks for testing in 
> authenticated/authorized scenarios.





[jira] [Updated] (MESOS-8357) Example frameworks have an inconsistent UX.

2018-01-02 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-8357:
--
Labels: mesosphere  (was: )
Sprint: Mesosphere Sprint 71

> Example frameworks have an inconsistent UX.
> ---
>
> Key: MESOS-8357
> URL: https://issues.apache.org/jira/browse/MESOS-8357
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>Priority: Minor
>  Labels: mesosphere
>
> Our example frameworks are a bit inconsistent when it comes to specifying 
> things like the framework principal, secret, etc. 
> Many of these examples have great value in testing a Mesos cluster. Unifying 
> the parameterization would improve the user experience when testing Mesos.
> {{MESOS_AUTHENTICATE_FRAMEWORKS}} is used by many examples for enabling 
> or disabling authentication. {{load_generator_framework}}, however, uses 
> {{MESOS_AUTHENTICATE}} for that purpose. The credentials 
> themselves are most commonly expected in the environment variables 
> {{DEFAULT_PRINCIPAL}} and {{DEFAULT_SECRET}}, while in some cases we chose to 
> use {{MESOS_PRINCIPAL}} and {{MESOS_SECRET}} instead.
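
One way to picture a unified lookup is a single helper that every example framework would share. The sketch below is hypothetical: the helper name, the precedence order, and the rule that "1" or "true" enables authentication are all assumptions made for illustration, not the behaviour of the existing frameworks. Only the environment variable names come from the issue.

```python
import os


def framework_credentials(env=None):
    """Return (authenticate, principal, secret) from the environment.

    The more common variable names are tried first, then the variants
    mentioned in the issue. The truthiness rule is an assumption.
    """
    env = os.environ if env is None else env
    flag = env.get("MESOS_AUTHENTICATE_FRAMEWORKS",
                   env.get("MESOS_AUTHENTICATE"))
    authenticate = flag is not None and flag.lower() in ("1", "true")
    principal = env.get("DEFAULT_PRINCIPAL", env.get("MESOS_PRINCIPAL"))
    secret = env.get("DEFAULT_SECRET", env.get("MESOS_SECRET"))
    return authenticate, principal, secret
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real process environment.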





[jira] [Updated] (MESOS-8361) Example frameworks to support launching mesos-local.

2018-01-02 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-8361:
--
Labels: mesosphere  (was: )
Sprint: Mesosphere Sprint 71

> Example frameworks to support launching mesos-local.
> 
>
> Key: MESOS-8361
> URL: https://issues.apache.org/jira/browse/MESOS-8361
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework
>Affects Versions: 1.5.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>Priority: Minor
>  Labels: mesosphere
>
> The scheduler driver and library support implicit launching of mesos-local 
> for a convenient test setup. Some of our example frameworks account for this 
> in supporting implicit ACL rendering and more. 
> We should unify the experience by documenting this behaviour and adding it to 
> all example frameworks.





[jira] [Updated] (MESOS-8243) Add feature to have offer to provide GPU memory info

2018-01-02 Thread heng zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

heng zhang updated MESOS-8243:
--
Labels: features  (was: )

> Add feature to have offer to provide GPU memory info
> 
>
> Key: MESOS-8243
> URL: https://issues.apache.org/jira/browse/MESOS-8243
> Project: Mesos
>  Issue Type: Improvement
>  Components: gpu
>Affects Versions: 1.2.0
> Environment: A cluster with 2 nodes, each running CentOS 7 with two Nvidia 
> Titan X (12GB). 
>Reporter: heng zhang
>  Labels: features
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The new feature would enable a Mesos offer to provide not only the number of 
> GPUs but also how much GPU memory is left. For example, if a user needs 10 GB 
> on each GPU to run a deep learning training job on Caffe, but the offered GPU 
> only has 6 GB left, the user should be able to know this and reject the offer.
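
From the framework's side, the requested information would allow a decision like the one sketched below. The per-GPU free-memory list is hypothetical: offers do not carry such a field today, which is exactly what this issue asks for.

```python
def accept_offer(gpu_free_memory_gb, gpus_needed, memory_per_gpu_gb):
    """Return True if enough offered GPUs have the required free memory.

    gpu_free_memory_gb is a list with the free memory of each offered
    GPU in GB (a hypothetical field, not part of current offers).
    """
    usable = [free for free in gpu_free_memory_gb
              if free >= memory_per_gpu_gb]
    return len(usable) >= gpus_needed
```

With the numbers from the description, a job needing 10 GB on one GPU would decline an offer whose GPUs each have only 6 GB left.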





[jira] [Updated] (MESOS-8243) Add feature to have offer to provide GPU memory info

2018-01-02 Thread heng zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

heng zhang updated MESOS-8243:
--
Labels:   (was: easyfix)

> Add feature to have offer to provide GPU memory info
> 
>
> Key: MESOS-8243
> URL: https://issues.apache.org/jira/browse/MESOS-8243
> Project: Mesos
>  Issue Type: Improvement
>  Components: gpu
>Affects Versions: 1.2.0
> Environment: A cluster with 2 nodes, each running CentOS 7 with two Nvidia 
> Titan X (12GB). 
>Reporter: heng zhang
>  Labels: features
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The new feature would enable a Mesos offer to provide not only the number of 
> GPUs but also how much GPU memory is left. For example, if a user needs 10 GB 
> on each GPU to run a deep learning training job on Caffe, but the offered GPU 
> only has 6 GB left, the user should be able to know this and reject the offer.





[jira] [Updated] (MESOS-8341) Agent can become stuck in (re-)registering state during upgrades

2018-01-02 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8341:
--
Fix Version/s: 1.5.0

Cherry picked to 1.5.0

{quote}
commit 4bc5f03497b64ae39d30418e803d9ded88acb74a
Author: Benno Evers 
Date:   Tue Jan 2 10:58:23 2018 -0800

Correctly reset slave status when aborting a registration.

Previously, the slave was not erased from the `registering`
and `reregistering` sets in the master for some code paths
that would result in a failed (re-)registration attempt.

This could lead to a state where the reason of the unsuccessful
(re-)registration attempt is fixed on the agent, but the master
ignores subsequent attempts because it assumes the previous
operation is still in progress.

Review: https://reviews.apache.org/r/64506/
{quote}

> Agent can become stuck in (re-)registering state during upgrades
> 
>
> Key: MESOS-8341
> URL: https://issues.apache.org/jira/browse/MESOS-8341
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
> Fix For: 1.5.0, 1.5.1
>
>
> Currently, an agent will not be erased from the set of currently 
> (re-)registering agents if
>  - it tries to (re-)register with a malformed version string
>  - it tries to (re-)register with a version smaller than the minimum 
> supported version
>  - it tries to (re-)register with a domain when the master has no domain 
> configured
>  - the operator marks the slave as gone while the (re-)registration is ongoing
> Afterwards, all further (re-)registration attempts with the same agent id 
> will be discarded, because the master still thinks that the original 
> (re-)registration is ongoing.
> Since the most realistic way to encounter this issue would be during cluster 
> upgrades, and it will fix itself with a master restart, it is unlikely to be 
> reported externally.
> Review: https://reviews.apache.org/r/64506
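
The bookkeeping problem described above can be modelled in a few lines. This is a toy, synchronous model with hypothetical names, not the master's actual code: it only shows the principle of clearing the in-flight marker on every exit path, including the failing ones.

```python
class Master:
    """Toy model of the master's bookkeeping; all names are illustrative."""

    def __init__(self, min_version=(1, 0, 0)):
        self.min_version = min_version
        self.registering = set()  # agent ids with a registration in flight
        self.registered = set()

    def register(self, agent_id, version):
        if agent_id in self.registering:
            return "ignored"  # a previous attempt is assumed to be ongoing
        self.registering.add(agent_id)
        try:
            parsed = tuple(int(part) for part in version.split("."))
            if parsed < self.min_version:
                return "refused"
            self.registered.add(agent_id)
            return "registered"
        except ValueError:
            return "malformed"
        finally:
            # Always clear the in-flight marker, whatever the outcome.
            self.registering.discard(agent_id)
```

Without the `finally` clause, the "refused" and "malformed" paths would leave the agent id in `registering`, and every later attempt would come back as "ignored".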





[jira] [Updated] (MESOS-8115) Add a master flag to disallow agents that are not configured with fault domain

2018-01-02 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8115:
--
Fix Version/s: 1.5.0

Cherry picked to 1.5.0.

commit 79d189260eb705c73fd7fa356cef6e28351cb490
Author: Benno Evers 
Date:   Tue Jan 2 10:58:35 2018 -0800

Added a master flag to disallow agents without domain.

Added a new `--require_agent_domain` flag and implementation. When
set to true, it will cause the master to refuse (re-)registration
attempts for agents with no configured domain.

This is intended as a safety net for operators, who could forget to
configure the fault domain of a remote agent and let it join the
cluster. If this happens, an agent in a remote region will be
considered a local agent by the master and frameworks (because agent's
fault domain is not configured), causing tasks to potentially land in a
remote agent which is undesirable.

Review: https://reviews.apache.org/r/64507/


> Add a master flag to disallow agents that are not configured with fault domain
> --
>
> Key: MESOS-8115
> URL: https://issues.apache.org/jira/browse/MESOS-8115
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Benno Evers
> Fix For: 1.5.0, 1.5.1
>
>
> Once mesos masters and agents in a cluster are *all* upgraded to a version 
> where the fault domains feature is available, it is beneficial to enforce 
> that agents without a fault domain configured are not allowed to join the 
> cluster. 
> This is a safety net for operators who could forget to configure the fault 
> domain of a remote agent and let it join the cluster. If this happens, an 
> agent in a remote region will be considered a local agent by the master and 
> frameworks (because agent's fault domain is not configured) causing tasks to 
> potentially land in a remote agent which is undesirable.
> Note that this has to be a configurable flag and not enforced by default 
> because otherwise upgrades from a fault domain non-configured cluster to a 
> configured cluster will not be possible.
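
The admission rule behind the `--require_agent_domain` flag can be sketched as follows. The function and the dictionary shape used for the domain are hypothetical; only the flag name and its described effect come from the commit message.

```python
def admit_agent(agent_domain, require_agent_domain=False):
    """Decide whether a (re-)registering agent may join the cluster.

    agent_domain is None when the agent has no fault domain configured,
    otherwise a mapping such as {"region": "...", "zone": "..."}.
    The flag defaults to False so that clusters without configured
    fault domains can still be upgraded.
    """
    if require_agent_domain and agent_domain is None:
        return False
    return True
```

Keeping the default at `False` mirrors the note in the description that the check cannot be enforced by default.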





[jira] [Comment Edited] (MESOS-8115) Add a master flag to disallow agents that are not configured with fault domain

2018-01-02 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308593#comment-16308593
 ] 

Vinod Kone edited comment on MESOS-8115 at 1/2/18 7:49 PM:
---

commit d02d66043e2beedbb7cd1c1bbf78ce8d506d82b7
Author: Benno Evers 
Date:   Tue Jan 2 10:58:35 2018 -0800

Added a master flag to disallow agents without domain.

Added a new `--require_agent_domain` flag and implementation. When
set to true, it will cause the master to refuse (re-)registration
attempts for agents with no configured domain.

This is intended as a safety net for operators, who could forget to
configure the fault domain of a remote agent and let it join the
cluster. If this happens, an agent in a remote region will be
considered a local agent by the master and frameworks (because agent's
fault domain is not configured), causing tasks to potentially land in a
remote agent which is undesirable.

Review: https://reviews.apache.org/r/64507/


was (Author: vinodkone):
commit 79d189260eb705c73fd7fa356cef6e28351cb490
Author: Benno Evers 
Date:   Tue Jan 2 10:58:35 2018 -0800

Added a master flag to disallow agents without domain.

Added a new `--require_agent_domain` flag and implementation. When
set to true, it will cause the master to refuse (re-)registration
attempts for agents with no configured domain.

This is intended as a safety net for operators, who could forget to
configure the fault domain of a remote agent and let it join the
cluster. If this happens, an agent in a remote region will be
considered a local agent by the master and frameworks (because agent's
fault domain is not configured), causing tasks to potentially land in a
remote agent which is undesirable.

Review: https://reviews.apache.org/r/64507/


> Add a master flag to disallow agents that are not configured with fault domain
> --
>
> Key: MESOS-8115
> URL: https://issues.apache.org/jira/browse/MESOS-8115
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Benno Evers
> Fix For: 1.5.0, 1.5.1
>
>
> Once mesos masters and agents in a cluster are *all* upgraded to a version 
> where the fault domains feature is available, it is beneficial to enforce 
> that agents without a fault domain configured are not allowed to join the 
> cluster. 
> This is a safety net for operators who could forget to configure the fault 
> domain of a remote agent and let it join the cluster. If this happens, an 
> agent in a remote region will be considered a local agent by the master and 
> frameworks (because agent's fault domain is not configured) causing tasks to 
> potentially land in a remote agent which is undesirable.
> Note that this has to be a configurable flag and not enforced by default 
> because otherwise upgrades from a fault domain non-configured cluster to a 
> configured cluster will not be possible.





[jira] [Commented] (MESOS-8115) Add a master flag to disallow agents that are not configured with fault domain

2018-01-02 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308593#comment-16308593
 ] 

Vinod Kone commented on MESOS-8115:
---

commit 79d189260eb705c73fd7fa356cef6e28351cb490
Author: Benno Evers 
Date:   Tue Jan 2 10:58:35 2018 -0800

Added a master flag to disallow agents without domain.

Added a new `--require_agent_domain` flag and implementation. When
set to true, it will cause the master to refuse (re-)registration
attempts for agents with no configured domain.

This is intended as a safety net for operators, who could forget to
configure the fault domain of a remote agent and let it join the
cluster. If this happens, an agent in a remote region will be
considered a local agent by the master and frameworks (because agent's
fault domain is not configured), causing tasks to potentially land in a
remote agent which is undesirable.

Review: https://reviews.apache.org/r/64507/


> Add a master flag to disallow agents that are not configured with fault domain
> --
>
> Key: MESOS-8115
> URL: https://issues.apache.org/jira/browse/MESOS-8115
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Benno Evers
> Fix For: 1.5.1
>
>
> Once mesos masters and agents in a cluster are *all* upgraded to a version 
> where the fault domains feature is available, it is beneficial to enforce 
> that agents without a fault domain configured are not allowed to join the 
> cluster. 
> This is a safety net for operators who could forget to configure the fault 
> domain of a remote agent and let it join the cluster. If this happens, an 
> agent in a remote region will be considered a local agent by the master and 
> frameworks (because agent's fault domain is not configured) causing tasks to 
> potentially land in a remote agent which is undesirable.
> Note that this has to be a configurable flag and not enforced by default 
> because otherwise upgrades from a fault domain non-configured cluster to a 
> configured cluster will not be possible.





[jira] [Assigned] (MESOS-8341) Agent can become stuck in (re-)registering state during upgrades

2018-01-02 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8341:
-

   Resolution: Fixed
 Assignee: Benno Evers
Fix Version/s: 1.5.1

commit 3eb57cae3674fc835c784cac9eaa63e1aab7ba1c
Author: Benno Evers 
Date:   Tue Jan 2 10:58:23 2018 -0800

Correctly reset slave status when aborting a registration.

Previously, the slave was not erased from the `registering`
and `reregistering` sets in the master for some code paths
that would result in a failed (re-)registration attempt.

This could lead to a state where the reason of the unsuccessful
(re-)registration attempt is fixed on the agent, but the master
ignores subsequent attempts because it assumes the previous
operation is still in progress.

Review: https://reviews.apache.org/r/64506/

> Agent can become stuck in (re-)registering state during upgrades
> 
>
> Key: MESOS-8341
> URL: https://issues.apache.org/jira/browse/MESOS-8341
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
> Fix For: 1.5.1
>
>
> Currently, an agent will not be erased from the set of currently 
> (re-)registering agents if
>  - it tries to (re-)register with a malformed version string
>  - it tries to (re-)register with a version smaller than the minimum 
> supported version
>  - it tries to (re-)register with a domain when the master has no domain 
> configured
>  - the operator marks the slave as gone while the (re-)registration is ongoing
> Afterwards, all further (re-)registration attempts with the same agent id 
> will be discarded, because the master still thinks that the original 
> (re-)registration is ongoing.
> Since the most realistic way to encounter this issue would be during cluster 
> upgrades, and it will fix itself with a master restart, it is unlikely to be 
> reported externally.
> Review: https://reviews.apache.org/r/64506
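
The failure mode described above can be sketched in isolation. The following is a minimal, synchronous illustration and not the Mesos master code: `RegistrationTracker`, `RegistrationGuard`, `Outcome`, and `tryRegister` are hypothetical names, and the real master handles registration asynchronously. It only shows one way to guarantee that an agent id leaves the in-progress set on every exit path, including rejections.

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for the master's set of in-flight registrations.
class RegistrationTracker {
public:
  // Returns false if an attempt for this agent id is already in flight.
  bool begin(const std::string& agentId) {
    return registering_.insert(agentId).second;
  }

  void finish(const std::string& agentId) { registering_.erase(agentId); }

  bool inProgress(const std::string& agentId) const {
    return registering_.count(agentId) > 0;
  }

private:
  std::set<std::string> registering_;
};

// Erases the agent id when the scope ends, whatever the outcome was.
class RegistrationGuard {
public:
  RegistrationGuard(RegistrationTracker& tracker, std::string agentId)
    : tracker_(tracker), agentId_(std::move(agentId)) {}

  ~RegistrationGuard() { tracker_.finish(agentId_); }

  RegistrationGuard(const RegistrationGuard&) = delete;
  RegistrationGuard& operator=(const RegistrationGuard&) = delete;

private:
  RegistrationTracker& tracker_;
  std::string agentId_;
};

enum class Outcome { Accepted, Rejected, Ignored };

// `versionOk` stands in for any of the validation checks listed in the issue.
Outcome tryRegister(
    RegistrationTracker& tracker,
    const std::string& agentId,
    bool versionOk) {
  if (!tracker.begin(agentId)) {
    return Outcome::Ignored;  // A previous attempt is still marked in flight.
  }

  RegistrationGuard guard(tracker, agentId);

  if (!versionOk) {
    return Outcome::Rejected;  // The guard still erases the agent id here.
  }

  return Outcome::Accepted;
}
```

With this shape, a rejected attempt followed by a corrected attempt from the same agent id is accepted instead of being ignored.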





[jira] [Created] (MESOS-8369) CI build failure compiling volume_profile.proto

2018-01-02 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-8369:
-

 Summary: CI build failure compiling volume_profile.proto 
 Key: MESOS-8369
 URL: https://issues.apache.org/jira/browse/MESOS-8369
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
Assignee: Chun-Hung Hsiao


Found this on ASF CI

https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)&&(!qnode3)&&(!H23)/4681/consoleFull

{code}
  top_distdir="../../mesos-1.5.0" 
distdir="../../mesos-1.5.0/3rdparty/libprocess" \
  dist-hook
make[4]: Entering directory `/mesos/3rdparty/libprocess'
cp -r ./3rdparty ../../mesos-1.5.0/3rdparty/libprocess/
make[4]: Leaving directory `/mesos/3rdparty/libprocess'
make[3]: Leaving directory `/mesos/3rdparty/libprocess'
make[2]: Leaving directory `/mesos/3rdparty'
 (cd src && make  top_distdir=../mesos-1.5.0 distdir=../mesos-1.5.0/src \
 am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
make[2]: Entering directory `/mesos/src'
../3rdparty/protobuf-3.5.0/src/protoc -I../include -I.  --cpp_out=. 
resource_provider/storage/volume_profile.proto
make[2]: ../3rdparty/protobuf-3.5.0/src/protoc: Command not found
make[2]: *** [resource_provider/storage/volume_profile.pb.h] Error 127
make[2]: Leaving directory `/mesos/src'
make[1]: *** [distdir] Error 1
make[1]: Leaving directory `/mesos'
make: *** [dist] Error 2
{code}





[jira] [Updated] (MESOS-8369) CI build failure compiling volume_profile.proto

2018-01-02 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8369:
--
Priority: Blocker  (was: Major)

> CI build failure compiling volume_profile.proto 
> 
>
> Key: MESOS-8369
> URL: https://issues.apache.org/jira/browse/MESOS-8369
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>
> Found this on ASF CI
> https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(ubuntu)&&(!ubuntu-us1)&&(!ubuntu-eu2)&&(!qnode3)&&(!H23)/4681/consoleFull
> {code}
>   top_distdir="../../mesos-1.5.0" 
> distdir="../../mesos-1.5.0/3rdparty/libprocess" \
>   dist-hook
> make[4]: Entering directory `/mesos/3rdparty/libprocess'
> cp -r ./3rdparty ../../mesos-1.5.0/3rdparty/libprocess/
> make[4]: Leaving directory `/mesos/3rdparty/libprocess'
> make[3]: Leaving directory `/mesos/3rdparty/libprocess'
> make[2]: Leaving directory `/mesos/3rdparty'
>  (cd src && make  top_distdir=../mesos-1.5.0 distdir=../mesos-1.5.0/src \
>  am__remove_distdir=: am__skip_length_check=: am__skip_mode_fix=: distdir)
> make[2]: Entering directory `/mesos/src'
> ../3rdparty/protobuf-3.5.0/src/protoc -I../include -I.  --cpp_out=. 
> resource_provider/storage/volume_profile.proto
> make[2]: ../3rdparty/protobuf-3.5.0/src/protoc: Command not found
> make[2]: *** [resource_provider/storage/volume_profile.pb.h] Error 127
> make[2]: Leaving directory `/mesos/src'
> make[1]: *** [distdir] Error 1
> make[1]: Leaving directory `/mesos'
> make: *** [dist] Error 2
> {code}





[jira] [Updated] (MESOS-8243) Add feature to have offer to provide GPU memory info

2018-01-02 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-8243:
--
Target Version/s:   (was: 1.4.0)

> Add feature to have offer to provide GPU memory info
> 
>
> Key: MESOS-8243
> URL: https://issues.apache.org/jira/browse/MESOS-8243
> Project: Mesos
>  Issue Type: Improvement
>  Components: gpu
>Affects Versions: 1.2.0
> Environment: A cluster with 2 nodes, each running CentOS 7 with two Nvidia 
> Titan X (12GB). 
>Reporter: heng zhang
>  Labels: easyfix
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The new feature would enable a Mesos offer to provide not only the number of 
> GPUs but also how much GPU memory is left. For example, if a user needs 10 GB on 
> each GPU to run a deep learning training job on Caffe, but the offered GPU 
> has only 6 GB left, the user should be able to detect this and reject the offer.
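
As a rough illustration of the check a framework could perform if offers carried this information, here is a minimal sketch. `GpuInfo` and `offerSatisfies` are hypothetical names; Mesos offers do not expose per-GPU free memory in the form assumed here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-GPU information that an offer would need to carry.
struct GpuInfo {
  double freeMemoryGB;
};

// Returns true if at least `neededGpus` of the offered GPUs each have
// `neededGB` or more of free memory.
bool offerSatisfies(
    const std::vector<GpuInfo>& gpus,
    std::size_t neededGpus,
    double neededGB) {
  std::size_t usable = 0;
  for (const GpuInfo& gpu : gpus) {
    if (gpu.freeMemoryGB >= neededGB) {
      ++usable;
    }
  }
  return usable >= neededGpus;
}
```

In the scenario from the description, a job needing 10 GB on one GPU would reject an offer whose only GPU has 6 GB free.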





[jira] [Updated] (MESOS-1739) Allow slave reconfiguration on restart

2018-01-02 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1739:
--
Target Version/s: 1.5.0

> Allow slave reconfiguration on restart
> --
>
> Key: MESOS-1739
> URL: https://issues.apache.org/jira/browse/MESOS-1739
> Project: Mesos
>  Issue Type: Epic
>Reporter: Patrick Reilly
>Assignee: Benno Evers
>  Labels: external-volumes, mesosphere, myriad
>
> Make it so that, either via a slave restart or an out-of-process "reconfigure" 
> ping, the attributes and resources of a slave can be updated to be a superset 
> of what they used to be.





[jira] [Updated] (MESOS-1739) Allow slave reconfiguration on restart

2018-01-02 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1739:
--
Fix Version/s: (was: 1.5.0)

> Allow slave reconfiguration on restart
> --
>
> Key: MESOS-1739
> URL: https://issues.apache.org/jira/browse/MESOS-1739
> Project: Mesos
>  Issue Type: Epic
>Reporter: Patrick Reilly
>Assignee: Benno Evers
>  Labels: external-volumes, mesosphere, myriad
>
> Make it so that, either via a slave restart or an out-of-process "reconfigure" 
> ping, the attributes and resources of a slave can be updated to be a superset 
> of what they used to be.





[jira] [Updated] (MESOS-8243) Add feature to have offer to provide GPU memory info

2018-01-02 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8243:
--
Fix Version/s: (was: 1.5.0)

> Add feature to have offer to provide GPU memory info
> 
>
> Key: MESOS-8243
> URL: https://issues.apache.org/jira/browse/MESOS-8243
> Project: Mesos
>  Issue Type: Improvement
>  Components: gpu
>Affects Versions: 1.2.0
> Environment: A cluster with 2 nodes, each running CentOS 7 with two Nvidia 
> Titan X (12GB). 
>Reporter: heng zhang
>  Labels: easyfix
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The new feature would enable a Mesos offer to provide not only the number of 
> GPUs but also how much GPU memory is left. For example, if a user needs 10 GB on 
> each GPU to run a deep learning training job on Caffe, but the offered GPU 
> has only 6 GB left, the user should be able to detect this and reject the offer.





[jira] [Assigned] (MESOS-8334) PartitionedSlaveReregistrationMasterFailover is flaky.

2018-01-02 Thread Megha Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Megha Sharma reassigned MESOS-8334:
---

Assignee: Megha Sharma

> PartitionedSlaveReregistrationMasterFailover is flaky.
> --
>
> Key: MESOS-8334
> URL: https://issues.apache.org/jira/browse/MESOS-8334
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Assignee: Megha Sharma
>  Labels: flaky-test
> Attachments: PartitionedSlaveReregistrationMasterFailover-badrun.txt
>
>
> This test became extremely flaky on various Linux platforms, presumably after 
> the chain with https://reviews.apache.org/r/64098/ was committed.
> {noformat}
> ../../src/tests/partition_tests.cpp:1032
> Failed to wait 15secs for runningAgainStatus1
> {noformat}
> Full log attached.





[jira] [Updated] (MESOS-8247) Executor registered message is lost

2018-01-02 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-8247:
--
Target Version/s: 1.3.2, 1.4.2, 1.5.1  (was: 1.3.2, 1.4.2, 1.5.0)

> Executor registered message is lost
> ---
>
> Key: MESOS-8247
> URL: https://issues.apache.org/jira/browse/MESOS-8247
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>
> h3. Brief description of successful agent-executor communication.
> The executor sends a `RegisterExecutorMessage` to the agent during its 
> initialization step. The agent sends an `ExecutorRegisteredMessage` as a 
> response to the executor in the `registerExecutor()` method. Whenever the 
> executor receives an `ExecutorRegisteredMessage`, it prints `Executor 
> registered on agent...` to its stderr log.
> h3. Problem description.
> The agent launches the built-in docker executor, which is stuck in the 
> `STAGING` state.
> stderr logs of the docker executor:
> {code}
> I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
> {code}
> It doesn't contain a message like `Executor registered on agent...`. At the 
> same time, the agent received `RegisterExecutorMessage` and sent a `runTask` 
> message to the executor.
> The stdout log consists of the same repeating message:
> {code}
> Received killTask for task ...
> {code}
> Also, the docker executor process has no child processes.
> Currently, the executor [doesn't 
> attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
>  to launch a task if it is not registered at the agent, while [task 
> killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
>  doesn't have such a check.
> It looks like `ExecutorRegisteredMessage` has been lost.
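
The asymmetry described above can be illustrated with a small sketch. This is not the code in `exec.cpp`: `ExecutorSketch` and its members are hypothetical, and only the behavior stated in the issue is modeled, namely that task launches are dropped until the registered message arrives while kill requests are logged regardless.

```cpp
#include <string>
#include <vector>

// Hypothetical model of the executor-side behavior described in the issue.
class ExecutorSketch {
public:
  // Called when the `ExecutorRegisteredMessage` arrives.
  void onRegistered() { registered_ = true; }

  // Returns false when the request is dropped because the executor
  // has not been registered yet.
  bool runTask(const std::string& taskId) {
    if (!registered_) {
      return false;
    }
    running_.push_back(taskId);
    return true;
  }

  // No registration check here, which matches the repeated
  // "Received killTask" lines in the stdout log.
  void killTask(const std::string& taskId) {
    log_.push_back("Received killTask for task " + taskId);
  }

  const std::vector<std::string>& log() const { return log_; }
  const std::vector<std::string>& running() const { return running_; }

private:
  bool registered_ = false;
  std::vector<std::string> running_;
  std::vector<std::string> log_;
};
```

If the registered message is lost, `runTask` keeps returning false and the task stays in `STAGING`, while every `killTask` still produces a log line.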





[jira] [Commented] (MESOS-8247) Executor registered message is lost

2018-01-02 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308419#comment-16308419
 ] 

Jie Yu commented on MESOS-8247:
---

OK, I retargeted this to 1.5.1.

> Executor registered message is lost
> ---
>
> Key: MESOS-8247
> URL: https://issues.apache.org/jira/browse/MESOS-8247
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>
> h3. Brief description of successful agent-executor communication.
> The executor sends a `RegisterExecutorMessage` to the agent during its 
> initialization step. The agent sends an `ExecutorRegisteredMessage` as a 
> response to the executor in the `registerExecutor()` method. Whenever the 
> executor receives an `ExecutorRegisteredMessage`, it prints `Executor 
> registered on agent...` to its stderr log.
> h3. Problem description.
> The agent launches the built-in docker executor, which is stuck in the 
> `STAGING` state.
> stderr logs of the docker executor:
> {code}
> I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
> {code}
> It doesn't contain a message like `Executor registered on agent...`. At the 
> same time, the agent received `RegisterExecutorMessage` and sent a `runTask` 
> message to the executor.
> The stdout log consists of the same repeating message:
> {code}
> Received killTask for task ...
> {code}
> Also, the docker executor process has no child processes.
> Currently, the executor [doesn't 
> attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
>  to launch a task if it is not registered at the agent, while [task 
> killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
>  doesn't have such a check.
> It looks like `ExecutorRegisteredMessage` has been lost.





[jira] [Commented] (MESOS-8219) Validate that any offer operation is only applied on resources from a single provider

2018-01-02 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307956#comment-16307956
 ] 

Jan Schlicht commented on MESOS-8219:
-

Sure, will work on this.

> Validate that any offer operation is only applied on resources from a single 
> provider
> -
>
> Key: MESOS-8219
> URL: https://issues.apache.org/jira/browse/MESOS-8219
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Benjamin Bannier
>Assignee: Jan Schlicht
>
> Offer operations can only be applied to resources from one single resource 
> provider. A number of places in the implementation assume that the provider 
> ID obtained from any {Resource} in an offer operation is equivalent to the 
> one from any other resource. We should update the master to validate that 
> invariant and reject malformed operations.
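
A minimal sketch of the invariant, assuming a simplified `Resource` type: the struct below is hypothetical (it is not the Mesos protobuf), and an empty `providerId` is used here to mean "no resource provider". The check only confirms that every resource in one operation carries the same provider ID.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified resource; an empty providerId means the
// resource does not come from a resource provider (an assumption).
struct Resource {
  std::string name;
  std::string providerId;
};

// Returns true if all resources of an operation share one provider ID.
// An operation without resources trivially satisfies the invariant.
bool haveSingleProvider(const std::vector<Resource>& resources) {
  for (std::size_t i = 1; i < resources.size(); ++i) {
    if (resources[i].providerId != resources[0].providerId) {
      return false;
    }
  }
  return true;
}
```

A validation step built on this would reject an operation as malformed as soon as `haveSingleProvider` returns false, rather than trusting the ID of whichever resource happens to be inspected first.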





[jira] [Commented] (MESOS-8247) Executor registered message is lost

2018-01-02 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307902#comment-16307902
 ] 

Alexander Rukletsov commented on MESOS-8247:


[~jieyu] nope.

> Executor registered message is lost
> ---
>
> Key: MESOS-8247
> URL: https://issues.apache.org/jira/browse/MESOS-8247
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>
> h3. Brief description of successful agent-executor communication.
> The executor sends a `RegisterExecutorMessage` to the agent during its 
> initialization step. The agent sends an `ExecutorRegisteredMessage` as a 
> response to the executor in the `registerExecutor()` method. Whenever the 
> executor receives an `ExecutorRegisteredMessage`, it prints `Executor 
> registered on agent...` to its stderr log.
> h3. Problem description.
> The agent launches the built-in docker executor, which is stuck in the 
> `STAGING` state.
> stderr logs of the docker executor:
> {code}
> I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
> {code}
> It doesn't contain a message like `Executor registered on agent...`. At the 
> same time, the agent received `RegisterExecutorMessage` and sent a `runTask` 
> message to the executor.
> The stdout log consists of the same repeating message:
> {code}
> Received killTask for task ...
> {code}
> Also, the docker executor process has no child processes.
> Currently, the executor [doesn't 
> attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
>  to launch a task if it is not registered at the agent, while [task 
> killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
>  doesn't have such a check.
> It looks like `ExecutorRegisteredMessage` has been lost.





[jira] [Commented] (MESOS-8335) ProvisionerDockerTest fails on Debian 9 and CentOS 6.

2018-01-02 Thread Armand Grillet (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307867#comment-16307867
 ] 

Armand Grillet commented on MESOS-8335:
---

Since 7.47.0, the curl tool enables HTTP/2 by default for HTTPS connections. We 
can see in the logs of CentOS 6 with curl 7.57.0 that we indeed receive an 
{{HTTP/2 200}} response.

libcurl has used HTTP/1.1 for clear-text HTTP servers since 7.47.0, and we have 
not seen failures in that case. We can therefore fix the issue by forcing curl 
to use HTTP/1.1 for everything via the option {{CURL_HTTP_VERSION_1_1}}.

Adding a {{.curlrc}} containing {{http1.1}} in {{/root}} indeed fixes the test on 
CentOS 6 with Docker version 1.7.1, build 786b29d and curl 7.57.0 
(x86_64-redhat-linux-gnu) libcurl/7.57.0 OpenSSL/1.0.1e zlib/1.2.3 
c-ares/1.13.0 libssh2/1.8.0 nghttp2/1.6.0. 
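
For callers that invoke the curl command-line tool, the same effect as the {{.curlrc}} entry can be had by passing the `--http1.1` flag, which is the command-line counterpart of {{CURL_HTTP_VERSION_1_1}}. The sketch below only builds an argument list; `curlArgs` is a hypothetical helper and is not how Mesos assembles its curl invocations.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds the argument vector for a curl subprocess.
// `--silent`, `--location` and `--http1.1` are existing curl options.
std::vector<std::string> curlArgs(const std::string& url, bool forceHttp11) {
  std::vector<std::string> args = {"curl", "--silent", "--location"};

  if (forceHttp11) {
    // Pin the protocol so that HTTPS requests do not negotiate HTTP/2.
    args.push_back("--http1.1");
  }

  args.push_back(url);
  return args;
}
```

Passing the flag explicitly keeps the behavior independent of whatever {{.curlrc}} happens to exist on the test host.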

> ProvisionerDockerTest fails on Debian 9  and CentOS 6.
> --
>
> Key: MESOS-8335
> URL: https://issues.apache.org/jira/browse/MESOS-8335
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Armand Grillet
>Assignee: Armand Grillet
> Attachments: centos-6-curl-7.19.7.txt, centos-6-curl-7.57.txt
>
>
> Version of Docker used: Docker version 17.11.0-ce, build 1caf76c
> Version of Curl used: curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 
> OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) 
> libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3
> Error:
> {code}
> [ RUN  ] 
> ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
> I1215 00:09:28.694677 19343 cluster.cpp:172] Creating default 'local' 
> authorizer
> I1215 00:09:28.697144 30867 master.cpp:456] Master 
> 75b48a47-7b6b-4e60-82d3-dfdc0cf8bff3 (ip-172-16-10-160.ec2.internal) started 
> on 127.0.1.1:35029
> I1215 00:09:28.697163 30867 master.cpp:458] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/4RYdF1/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/4RYdF1/master" 
> --zk_session_timeout="10secs"
> I1215 00:09:28.697413 30867 master.cpp:507] Master only allowing 
> authenticated frameworks to register
> I1215 00:09:28.697422 30867 master.cpp:513] Master only allowing 
> authenticated agents to register
> I1215 00:09:28.697427 30867 master.cpp:519] Master only allowing 
> authenticated HTTP frameworks to register
> I1215 00:09:28.697433 30867 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/4RYdF1/credentials'
> I1215 00:09:28.697654 30867 master.cpp:563] Using default 'crammd5' 
> authenticator
> I1215 00:09:28.697806 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1215 00:09:28.697962 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1215 00:09:28.698076 30867 http.cpp:1045] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1215 00:09:28.698194 30867 master.cpp:642] Authorization enabled
> I1215 00:09:28.698468 30864 hierarchical.cpp:175] Initialized hierarchical 
> allocator process
> I1215 00:09:28.698563 30864 whitelist_watcher.cpp:77] No whitelist given
> I1215 00:09:28.701695 30871 master.cpp:2209] Elected as the leading master!
> I1215 00:09:28.701723 30871 master.cpp:1689] Recovering from registrar
> I1215 00:09:28.701859 30869 registrar.cpp:347] Recovering registrar
> I1215 00:09:28.702401 30869 registrar.cpp:391] Successfully fetched the 
> registry (0B) in 507904ns
> I1215 00:09:28.702495 30869 registrar.cpp:495] Applied 1 operations in 
> 28977ns; attempting to update the registry
> I1215 00:09:28.702997 30869 registrar.cpp:552] 

[jira] [Commented] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2018-01-02 Thread Julien Pepy (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307776#comment-16307776
 ] 

Julien Pepy commented on MESOS-7007:


Thanks [~jieyu] and [~chhsia0]!

> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pierre Cheynier
>Assignee: Chun-Hung Hsiao
>  Labels: storage
> Fix For: 1.5.0
>
>
> I am facing this issue, which prevents me from upgrading to 1.1.0 (the change 
> was therefore introduced in this version):
> I'm using default_container_info to mount a /tmp volume in the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one on the host!) are 
> trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of Nov, spoke with 
> someone on Slack), but unfortunately had no time to dig into the symptoms 
> further.
> I found nothing interesting even using GLOG_v=3.
> Maybe it's a misuse of isolators that triggers this issue? If that's the 
> case, then at least the documentation should be updated.
> Let me know if more information is needed.


