date:20170113

[jira] [Created] (MESOS-6925) Break down the `mesos-protobufs` target in CMake further.

2017-01-13 Thread Michael Park (JIRA)

Michael Park created MESOS-6925:
---

 Summary: Break down the `mesos-protobufs` target in CMake further.
 Key: MESOS-6925
 URL: https://issues.apache.org/jira/browse/MESOS-6925
 Project: Mesos
  Issue Type: Task
  Components: cmake
Reporter: Michael Park


In the {{mesos-tidy}} setup, we need to perform the protobuf generation, but we 
don't need to compile the generated files. If we had separate targets for 
protobuf generation and compilation, we would be able to do less work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (MESOS-6924) Add a target for external dependencies in CMake.

2017-01-13 Thread Michael Park (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-6924:

Description: 
It would be nice to have a target for external dependencies, i.e. 
3rdparty dependencies except {{stout}}/{{libprocess}}. This would help 
{{mesos-tidy}} in particular to do less work. Currently we have to spell out 
all of them, plus a hack around {{libev}} vs {{libevent}}; it would be cleaner 
to simply build an external dependencies target.

{code}
# Build the external dependencies.
# TODO(mpark): Use an external dependencies target once MESOS-6924 is resolved.
cmake --build 3rdparty --target boost-1.53.0
cmake --build 3rdparty --target elfio-3.2
cmake --build 3rdparty --target glog-0.3.3
cmake --build 3rdparty --target gmock-1.7.0
cmake --build 3rdparty --target http_parser-2.6.2

# TODO(mpark): The `|| true` is a hack to try both `libev` and `libevent` and
#  use whichever one happens to be configured. This would also go
#  away with MESOS-6924.
cmake --build 3rdparty --target libev-4.22 || true
cmake --build 3rdparty --target libevent-2.1.5-beta || true

cmake --build 3rdparty --target leveldb-1.4
cmake --build 3rdparty --target nvml-352.79
cmake --build 3rdparty --target picojson-1.3.0
cmake --build 3rdparty --target protobuf-2.6.1
cmake --build 3rdparty --target zookeeper-3.4.8
{code}

  was:
It would be nice to have a target for external dependencies, i.e. 
3rdparty dependencies except {{stout}}/{{libprocess}}. This would help 
{{mesos-tidy}} in particular to do less work. Currently we have to spell out 
all of them, plus a hack around {{libev}} vs {{libevent}}; it would be cleaner 
to simply build an external dependencies target.

{code}
cmake --build 3rdparty --target boost-1.53.0
cmake --build 3rdparty --target elfio-3.2
cmake --build 3rdparty --target glog-0.3.3
cmake --build 3rdparty --target gmock-1.7.0
cmake --build 3rdparty --target http_parser-2.6.2

# NOTE: Try both `libev` and `libevent`. This is a terrible hack.
cmake --build 3rdparty --target libev-4.22 || true
cmake --build 3rdparty --target libevent-2.1.5-beta || true

cmake --build 3rdparty --target leveldb-1.4
cmake --build 3rdparty --target nvml-352.79
cmake --build 3rdparty --target picojson-1.3.0
cmake --build 3rdparty --target protobuf-2.6.1
cmake --build 3rdparty --target zookeeper-3.4.8
{code}



[jira] [Commented] (MESOS-6789) SSL socket's 'shutdown()' method is broken

2017-01-13 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822560#comment-15822560
 ] 

Joseph Wu commented on MESOS-6789:
--

{code}
commit c600d12a01865daad8ba7607b53eff35686f0f35
Author: Greg Mann 
Date:   Fri Jan 13 15:56:34 2017 -0800

Fixed SSL socket 'shutdown()'.

Recently, a change was made to the signature of
`Socket::shutdown`, but the corresponding override in
`LibeventSSLSocketImpl` was not updated, so that the
implementation-specific method is no longer being
executed. Further, the SSL socket's `shutdown` code
did not actually shutdown the socket; rather, the
shutdown was performed in the destructor.

This patch updates the function's signature to match
that of the base class's method, adds the `override`
specifier to the implementation's method declaration,
and updates the function to properly shutdown the
SSL socket.

Review: https://reviews.apache.org/r/55343/
{code}

> SSL socket's 'shutdown()' method is broken
> --
>
> Key: MESOS-6789
> URL: https://issues.apache.org/jira/browse/MESOS-6789
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: encryption, libprocess, ssl
> Fix For: 1.2.0
>
>
> We recently uncovered two issues with the {{LibeventSSLSocketImpl::shutdown}} 
> method:
> * The introduction of a shutdown method parameter with [this 
> commit|https://reviews.apache.org/r/54113/] means that the implementation's 
> method is no longer overriding the default implementation. In addition to 
> fixing the implementation method's signature, we should add the {{override}} 
> specifier to all of our socket implementations' methods to ensure that this 
> doesn't happen in the future.
> * The {{LibeventSSLSocketImpl::shutdown}} function does not actually shut down 
> the SSL socket. The proper function to shut down an SSL socket is 
> {{SSL_shutdown}}, which is called in the implementation's destructor. We 
> should move this into {{shutdown()}} so that by the time that method returns, 
> the socket has actually been shut down.




[jira] [Updated] (MESOS-6789) SSL socket's 'shutdown()' method is broken

2017-01-13 Thread Joseph Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-6789:
-
Fix Version/s: 1.2.0


[jira] [Created] (MESOS-6924) Add a target for external dependencies in CMake.

2017-01-13 Thread Michael Park (JIRA)

Michael Park created MESOS-6924:
---

 Summary: Add a target for external dependencies in CMake.
 Key: MESOS-6924
 URL: https://issues.apache.org/jira/browse/MESOS-6924
 Project: Mesos
  Issue Type: Task
  Components: cmake
Reporter: Michael Park


It would be nice to have a target for external dependencies, i.e. 
3rdparty dependencies except {{stout}}/{{libprocess}}. This would help 
{{mesos-tidy}} in particular to do less work. Currently we have to spell out 
all of them, plus a hack around {{libev}} vs {{libevent}}; it would be cleaner 
to simply build an external dependencies target.

{code}
cmake --build 3rdparty --target boost-1.53.0
cmake --build 3rdparty --target elfio-3.2
cmake --build 3rdparty --target glog-0.3.3
cmake --build 3rdparty --target gmock-1.7.0
cmake --build 3rdparty --target http_parser-2.6.2
# NOTE: Try both `libev` and `libevent`. This is a terrible hack.
cmake --build 3rdparty --target libev-4.22 || true
cmake --build 3rdparty --target libevent-2.1.5-beta || true
cmake --build 3rdparty --target leveldb-1.4
cmake --build 3rdparty --target nvml-352.79
cmake --build 3rdparty --target picojson-1.3.0
cmake --build 3rdparty --target protobuf-2.6.1
cmake --build 3rdparty --target zookeeper-3.4.8
{code}
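
Until such a target exists, the command list above could be driven from a small script instead of being spelled out by hand. The following is only a Python sketch under stated assumptions: the target names are copied from the commands above, while `build_command` and `build_external_dependencies` are hypothetical helper names, not part of the Mesos build. Unlike the `|| true` hack, it stops at the first event library that builds and fails if neither {{libev}} nor {{libevent}} can be built.

```python
import subprocess
from typing import Callable, List

# Target names copied from the commands listed above.
REQUIRED_TARGETS = [
    "boost-1.53.0",
    "elfio-3.2",
    "glog-0.3.3",
    "gmock-1.7.0",
    "http_parser-2.6.2",
    "leveldb-1.4",
    "nvml-352.79",
    "picojson-1.3.0",
    "protobuf-2.6.1",
    "zookeeper-3.4.8",
]

# Only one of these is expected to be configured in a given build tree.
EVENT_LIBRARY_TARGETS = ["libev-4.22", "libevent-2.1.5-beta"]


def build_command(target: str, build_dir: str = "3rdparty") -> List[str]:
    """Return the argv for `cmake --build <build_dir> --target <target>`."""
    return ["cmake", "--build", build_dir, "--target", target]


def build_external_dependencies(
    run: Callable[[List[str]], int] = subprocess.call,
) -> None:
    """Build every required target, then one of the event libraries.

    `run` takes an argv list and returns the exit code; it defaults to
    subprocess.call and can be replaced for dry runs.
    """
    for target in REQUIRED_TARGETS:
        if run(build_command(target)) != 0:
            raise RuntimeError("failed to build " + target)

    # Stop at the first event library that builds; fail if none does.
    if not any(run(build_command(t)) == 0 for t in EVENT_LIBRARY_TARGETS):
        raise RuntimeError("neither libev nor libevent could be built")
```

Calling `build_external_dependencies()` from the build directory would then replace the list of individual commands.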




[jira] [Updated] (MESOS-6924) Add a target for external dependencies in CMake.

2017-01-13 Thread Michael Park (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-6924:

Description: 
It would be nice to have a target for external dependencies, i.e. 
3rdparty dependencies except {{stout}}/{{libprocess}}. This would help 
{{mesos-tidy}} in particular to do less work. Currently we have to spell out 
all of them, plus a hack around {{libev}} vs {{libevent}}; it would be cleaner 
to simply build an external dependencies target.

{code}
cmake --build 3rdparty --target boost-1.53.0
cmake --build 3rdparty --target elfio-3.2
cmake --build 3rdparty --target glog-0.3.3
cmake --build 3rdparty --target gmock-1.7.0
cmake --build 3rdparty --target http_parser-2.6.2

# NOTE: Try both `libev` and `libevent`. This is a terrible hack.
cmake --build 3rdparty --target libev-4.22 || true
cmake --build 3rdparty --target libevent-2.1.5-beta || true

cmake --build 3rdparty --target leveldb-1.4
cmake --build 3rdparty --target nvml-352.79
cmake --build 3rdparty --target picojson-1.3.0
cmake --build 3rdparty --target protobuf-2.6.1
cmake --build 3rdparty --target zookeeper-3.4.8
{code}

  was:
It would be nice to have a target for external dependencies, i.e. 
3rdparty dependencies except {{stout}}/{{libprocess}}. This would help 
{{mesos-tidy}} in particular to do less work. Currently we have to spell out 
all of them, plus a hack around {{libev}} vs {{libevent}}; it would be cleaner 
to simply build an external dependencies target.

{code}
cmake --build 3rdparty --target boost-1.53.0
cmake --build 3rdparty --target elfio-3.2
cmake --build 3rdparty --target glog-0.3.3
cmake --build 3rdparty --target gmock-1.7.0
cmake --build 3rdparty --target http_parser-2.6.2
# NOTE: Try both `libev` and `libevent`. This is a terrible hack.
cmake --build 3rdparty --target libev-4.22 || true
cmake --build 3rdparty --target libevent-2.1.5-beta || true
cmake --build 3rdparty --target leveldb-1.4
cmake --build 3rdparty --target nvml-352.79
cmake --build 3rdparty --target picojson-1.3.0
cmake --build 3rdparty --target protobuf-2.6.1
cmake --build 3rdparty --target zookeeper-3.4.8
{code}



[jira] [Commented] (MESOS-6802) SSL socket can lose bytes in the case of EOF

2017-01-13 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822555#comment-15822555
 ] 

Joseph Wu commented on MESOS-6802:
--

{code}
commit 5023e004030e6018ea64f6824c353ffe4165c907
Author: Greg Mann 
Date:   Fri Jan 13 15:47:57 2017 -0800

Added new libprocess socket tests.

This patch adds NetSocketTest.EOFBeforeRecv and
NetSocketTest.EOFAfterRecv to verify that EOFs are
reliably received whether or not there is a pending recv()
request at the time the EOF is received.

Review: https://reviews.apache.org/r/53803/
{code}

> SSL socket can lose bytes in the case of EOF
> 
>
> Key: MESOS-6802
> URL: https://issues.apache.org/jira/browse/MESOS-6802
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: libevent, libprocess, ssl
> Fix For: 1.2.0
>
>
> During recent work on SSL-enabled tests in libprocess (MESOS-5966), we 
> discovered a bug in {{LibeventSSLSocketImpl}}, wherein the socket can either 
> fail to receive an EOF, or lose data when an EOF is received.
> The {{LibeventSSLSocketImpl::event_callback(short events)}} method 
> immediately sets any pending {{RecvRequest}}'s promise to zero upon receipt 
> of an EOF. However, at the time the promise is set, there may actually be 
> data waiting to be read by libevent. Upon receipt of an EOF, we should 
> attempt to read the socket's bufferevent first to ensure that we aren't 
> losing any data previously received by the socket.
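
The ordering described above (drain buffered data before reporting EOF) can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: `BufferedReceiver`, `on_data`, `on_eof`, and `recv` are hypothetical names, and nothing here uses libevent or reflects the actual {{LibeventSSLSocketImpl}} code.

```python
class BufferedReceiver:
    """Toy model of a socket whose data is buffered below the application."""

    def __init__(self) -> None:
        self._buffer = bytearray()
        self._eof = False

    def on_data(self, data: bytes) -> None:
        """Called when bytes arrive from the peer."""
        self._buffer.extend(data)

    def on_eof(self) -> None:
        """Called when the peer closes; only records the fact."""
        self._eof = True

    def recv(self, size: int) -> bytes:
        """Return buffered bytes first; b"" (EOF) only once the buffer is empty."""
        if self._buffer:
            chunk = bytes(self._buffer[:size])
            del self._buffer[:size]
            return chunk
        if self._eof:
            return b""
        raise BlockingIOError("no data available yet")
```

The point of the design is that `on_eof` never completes a pending read by itself; a reader sees the empty result only after every previously received byte has been handed out.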




[jira] [Commented] (MESOS-4705) Linux 'perf' parsing logic may fail when OS distribution has perf backports.

2017-01-13 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822425#comment-15822425
 ] 

Benjamin Mahler commented on MESOS-4705:


[~fan.du] It appears that this doesn't handle my version of perf on CentOS 
7.3.1611 with perf version 3.10.0-514.2.2.el7.x86_64.debug:

{noformat}
(statistics).failure(): Failed to parse perf sample: Failed to parse perf 
sample line '3710583015,,cycles,mesos_test,1459686383,100.00,2.539,GHz': 
Unexpected number of fields
{noformat}

> Linux 'perf' parsing logic may fail when OS distribution has perf backports.
> 
>
> Key: MESOS-4705
> URL: https://issues.apache.org/jira/browse/MESOS-4705
> Project: Mesos
>  Issue Type: Bug
>  Components: cgroups, isolation
>Affects Versions: 0.27.1
>Reporter: Fan Du
>Assignee: Fan Du
> Fix For: 0.26.2, 0.27.3, 0.28.2, 1.0.0
>
>
> When sampling a container with perf events on CentOS 7 with kernel 
> 3.10.0-123.el7.x86_64, the slave complained with the error below:
> {code}
> E0218 16:32:00.591181  8376 perf_event.cpp:408] Failed to get perf sample: 
> Failed to parse perf sample: Failed to parse perf sample line 
> '25871993253,,cycles,mesos/5f23ffca-87ed-4ff6-84f2-6ec3d4098ab8,10059827422,100.00':
>  Unexpected number of fields
> {code}
> This is caused by the current perf format [assumption | 
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/linux/perf.cpp;h=1c113a2b3f57877e132bbd65e01fb2f045132128;hb=HEAD#l430]
>  about kernel versions below 3.12. 
> On the 3.10.0-123.el7.x86_64 kernel, the format has 6 tokens, as below:
> value,unit,event,cgroup,running,ratio
> A local modification fixed this error on my test bed; please review this 
> ticket.
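
The two sample lines quoted in this thread differ in field count: the one in the issue description has six fields (value,unit,event,cgroup,running,ratio), while the CentOS 7.3 line in the comment has eight. A minimal Python sketch of a more tolerant parser follows, assuming the first six fields keep the order given above and that any trailing fields can be ignored (the meaning of the two extra fields is not stated in the thread). `PerfSample` and `parse_perf_sample` are hypothetical names; this is not the code in src/linux/perf.cpp.

```python
from typing import NamedTuple


class PerfSample(NamedTuple):
    value: int
    unit: str
    event: str
    cgroup: str
    running: int
    ratio: float


def parse_perf_sample(line: str) -> PerfSample:
    """Parse one comma-separated perf sample line with six or more fields."""
    tokens = line.strip().split(",")
    if len(tokens) < 6:
        raise ValueError("Unexpected number of fields: %d" % len(tokens))

    # Assumption: fields beyond the sixth are optional extras and are ignored.
    value, unit, event, cgroup, running, ratio = tokens[:6]
    return PerfSample(
        int(value), unit, event, cgroup, int(running), float(ratio))
```

Both sample lines from the thread parse with this sketch; a line with fewer than six fields is still rejected.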




[jira] [Commented] (MESOS-6843) Fetcher should not assume stdout/stderr in the sandbox.

2017-01-13 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822406#comment-15822406
 ] 

Joseph Wu commented on MESOS-6843:
--

Thinking out loud:

The ideal solution would be to pipe logs from the fetcher into the container 
logger.  In the past, this would have required a pretty large refactor, as the 
container logger simply outputs a description of FDs (or FDs that can only be 
inherited once).

But now, in the Mesos containerizer at least, we have the IO Switchboard 
sitting in between the container and the container logger.  (Logs go container 
-> IO Switchboard -> container logger.)  It is conceivable to add a way of 
injecting stdout/stderr to the IO Switchboard (cc [~klueska]).
We'd still need another solution for docker containers though.

> Fetcher should not assume stdout/stderr in the sandbox.
> ---
>
> Key: MESOS-6843
> URL: https://issues.apache.org/jira/browse/MESOS-6843
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Jie Yu
>Priority: Critical
>  Labels: mesosphere
>
> If a container logger is used, this assumption might not be true. For instance, 
> a journald logger might redirect all task logs to journald. So in theory, the 
> fetcher log should go to journald as well, rather than writing to 
> sandbox/stdout and sandbox/stderr.




[jira] [Issue Comment Deleted] (MESOS-6923) mesos 1.0.1-2.0.94.ubuntu1604 + python 2.7 install == no bueno

2017-01-13 Thread Alan Scherger (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Scherger updated MESOS-6923:
-
Comment: was deleted

(was: this hack seems to work:

{code}
mv /usr/lib/python2.7/site-packages /tmp/
apt-get install python-minimal
mv /tmp/site-packages /usr/lib/python2.7/site-packages
{code})

> mesos 1.0.1-2.0.94.ubuntu1604 + python 2.7 install == no bueno 
> ---
>
> Key: MESOS-6923
> URL: https://issues.apache.org/jira/browse/MESOS-6923
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.0.1
>Reporter: Alan Scherger
>
> When you install the mesos package on xenial we drop files into:
> # ls -al /usr/lib/python2.7/site-packages/
> total 64
> drwxr-xr-x 9 root root  4096 Sep 21 02:54 .
> drwxr-xr-x 4 root root 12288 Jan 13 21:28 ..
> drwxr-xr-x 6 root root  4096 Jan 13 21:02 mesos
> drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos-1.0.1.dist-info
> drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos.cli-1.0.1.dist-info
> -rw-rw-r-- 1 root root   302 Sep  2 03:02 mesos.cli-1.0.1-py2.7-nspkg.pth
> drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos.executor-1.0.1.dist-info
> -rw-rw-r-- 1 root root   302 Sep  2 03:02 mesos.executor-1.0.1-py2.7-nspkg.pth
> drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos.interface-1.0.1.dist-info
> -rw-rw-r-- 1 root root   302 Sep  2 03:02 mesos.interface-1.0.1-py2.7-nspkg.pth
> drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos.native-1.0.1.dist-info
> -rw-rw-r-- 1 root root   302 Sep  2 03:02 mesos.native-1.0.1-py2.7-nspkg.pth
> drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos.scheduler-1.0.1.dist-info
> -rw-rw-r-- 1 root root   302 Sep  2 03:02 mesos.scheduler-1.0.1-py2.7-nspkg.pth
> When you go to install "python-minimal" after the fact, it fails with:
> new installation of python2.7-minimal; /usr/lib/python2.7/site-packages is a 
> directory
> which is expected a symlink to /usr/local/lib/python2.7/dist-packages.
> please find the package shipping files in /usr/lib/python2.7/site-packages and
> file a bug report to ship these in /usr/lib/python2.7/dist-packages instead
> aborting installation of python2.7-minimal
> idk if we care, but it makes my life :sadpanda:




[jira] [Commented] (MESOS-6923) mesos 1.0.1-2.0.94.ubuntu1604 + python 2.7 install == no bueno

2017-01-13 Thread Alan Scherger (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822390#comment-15822390
 ] 

Alan Scherger commented on MESOS-6923:
--

this hack seems to work:

{code}
mv /usr/lib/python2.7/site-packages /tmp/
apt-get install python-minimal
mv /tmp/site-packages /usr/lib/python2.7/site-packages
{code}


[jira] [Created] (MESOS-6923) mesos 1.0.1-2.0.94.ubuntu1604 + python 2.7 install == no bueno

2017-01-13 Thread Alan Scherger (JIRA)

Alan Scherger created MESOS-6923:


 Summary: mesos 1.0.1-2.0.94.ubuntu1604 + python 2.7 install == no 
bueno 
 Key: MESOS-6923
 URL: https://issues.apache.org/jira/browse/MESOS-6923
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.0.1
Reporter: Alan Scherger


When you install the mesos package on xenial we drop files into:

# ls -al /usr/lib/python2.7/site-packages/
total 64
drwxr-xr-x 9 root root  4096 Sep 21 02:54 .
drwxr-xr-x 4 root root 12288 Jan 13 21:28 ..
drwxr-xr-x 6 root root  4096 Jan 13 21:02 mesos
drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos-1.0.1.dist-info
drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos.cli-1.0.1.dist-info
-rw-rw-r-- 1 root root   302 Sep  2 03:02 mesos.cli-1.0.1-py2.7-nspkg.pth
drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos.executor-1.0.1.dist-info
-rw-rw-r-- 1 root root   302 Sep  2 03:02 mesos.executor-1.0.1-py2.7-nspkg.pth
drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos.interface-1.0.1.dist-info
-rw-rw-r-- 1 root root   302 Sep  2 03:02 mesos.interface-1.0.1-py2.7-nspkg.pth
drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos.native-1.0.1.dist-info
-rw-rw-r-- 1 root root   302 Sep  2 03:02 mesos.native-1.0.1-py2.7-nspkg.pth
drwxrwxr-x 2 root root  4096 Sep 21 02:54 mesos.scheduler-1.0.1.dist-info
-rw-rw-r-- 1 root root   302 Sep  2 03:02 mesos.scheduler-1.0.1-py2.7-nspkg.pth


When you go to install "python-minimal" after the fact, it fails with:

new installation of python2.7-minimal; /usr/lib/python2.7/site-packages is a 
directory
which is expected a symlink to /usr/local/lib/python2.7/dist-packages.
please find the package shipping files in /usr/lib/python2.7/site-packages and
file a bug report to ship these in /usr/lib/python2.7/dist-packages instead
aborting installation of python2.7-minimal

idk if we care, but it makes my life :sadpanda:




[jira] [Commented] (MESOS-6843) Fetcher should not assume stdout/stderr in the sandbox.

2017-01-13 Thread Adam B (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822326#comment-15822326
 ] 

Adam B commented on MESOS-6843:
---

This is rather annoying with a custom container logger, because even if my task 
doesn't fetch anything, there's still an empty stdout/stderr file in the 
sandbox, but all my task output is actually in journald. It's confusing as a 
user.
Would this be complicated to do, or just a couple of new if checks?


[jira] [Updated] (MESOS-6843) Fetcher should not assume stdout/stderr in the sandbox.

2017-01-13 Thread Adam B (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6843:
--
Target Version/s: 1.2.0
  Labels: mesosphere  (was: )
Priority: Critical  (was: Major)
 Component/s: fetcher


[jira] [Updated] (MESOS-6652) Perf version not correctly parsed on Fedora 24 (and probably others)

2017-01-13 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-6652:
---
Shepherd: Yan Xu

> Perf version not correctly parsed on Fedora 24 (and probably others)
> 
>
> Key: MESOS-6652
> URL: https://issues.apache.org/jira/browse/MESOS-6652
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.1.0
> Environment: Fedora 24
>Reporter: Jan Schlicht
>Assignee: James Peach
>Priority: Minor
>
> Happened on a current Fedora 24 machine, when trying to run tests.
> {noformat}
> $ perf --version
> perf version 4.8.10.200.fc24.x86_64.gc23c
> {noformat}
> doesn't seem to be parsed correctly by {{perf::supported()}}, because when 
> running {{./bin/mesos-tests.sh}} it reads
> {noformat}
> -
> Could not find the 'perf' command or its version lower that 2.6.39 so tests 
> using it to sample the 'cpu-cycles' hardware event will not be run.
> -
> -
> require 'perf' version >= 2.6.39 so no 'perf' tests will be run
> -
> {noformat}




[jira] [Assigned] (MESOS-6652) Perf version not correctly parsed on Fedora 24 (and probably others)

2017-01-13 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-6652:
--

Assignee: James Peach


[jira] [Commented] (MESOS-6652) Perf version not correctly parsed on Fedora 24 (and probably others)

2017-01-13 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822315#comment-15822315
 ] 

James Peach commented on MESOS-6652:


| Handle perf versions with more than 3 components. | 
https://reviews.apache.org/r/55521/ |
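
One way to tolerate version strings with more than three components is to read only the leading numeric components and ignore the rest. The Python sketch below illustrates the idea; it is not the patch under review, and `parse_perf_version` is a hypothetical name.

```python
import re
from typing import Optional, Tuple


def parse_perf_version(output: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from the output of `perf --version`.

    Anything after the third numeric component (extra numbers, distribution
    tags, architecture suffixes) is ignored. A missing patch component is
    treated as 0. Returns None if the output does not look like a version.
    """
    match = re.match(r"\s*perf version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        return None

    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)
```

The resulting tuple can be compared directly against a minimum such as `(2, 6, 39)`, since Python compares tuples component by component.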


[jira] [Updated] (MESOS-6907) FutureTest.After3 is flaky

2017-01-13 Thread Adam B (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6907:
--
Target Version/s:   (was: 1.2.0)

> FutureTest.After3 is flaky
> --
>
> Key: MESOS-6907
> URL: https://issues.apache.org/jira/browse/MESOS-6907
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Alexander Rojas
>
> There is apparently a race condition between the time an instance of 
> {{Future<T>}} goes out of scope and when the enclosing data is actually 
> deleted, if {{Future<T>::after(Duration, lambda::function<Future<T>(const Future<T>&)>)}} is called.
> The issue is more likely to occur if the machine is under load or if it is 
> not a very powerful one. The easiest way to reproduce it is to run:
> {code}
> $ stress -c 4 -t 2600 -d 2 -i 2 &
> $ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 \
>     --gtest_break_on_failure
> {code}
> An exploratory fix for the issue is to change the test to:
> {code}
> TEST(FutureTest, After3)
> {
>   Future<Nothing> future;
>   process::WeakFuture<Nothing> weak_future(future);
>   EXPECT_SOME(weak_future.get());
>   {
>     Clock::pause();
>     // The original future disappears here. After this call the
>     // original future goes out of scope and should not be reachable
>     // anymore.
>     future = future
>       .after(Milliseconds(1), [](Future<Nothing> f) {
>         f.discard();
>         return Nothing();
>       });
>     Clock::advance(Seconds(2));
>     Clock::settle();
>     AWAIT_READY(future);
>   }
>   if (weak_future.get().isSome()) {
>     os::sleep(Seconds(1));
>   }
>   EXPECT_NONE(weak_future.get());
>   EXPECT_FALSE(future.hasDiscard());
> }
> {code}
> The interesting thing about the fix is that both extra snippets are needed 
> (neither one alone is enough) to prevent the issue from happening.




[jira] [Updated] (MESOS-6907) FutureTest.After3 is flaky

2017-01-13 Thread Adam B (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6907:
--
Target Version/s: 1.2.0





[jira] [Commented] (MESOS-6914) Command 'hadoop version 2>&1' failed

2017-01-13 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822267#comment-15822267
 ] 

Kevin Klues commented on MESOS-6914:


Maybe you could give a bit more context about how to reproduce the error?

> Command 'hadoop version 2>&1' failed
> 
>
> Key: MESOS-6914
> URL: https://issues.apache.org/jira/browse/MESOS-6914
> Project: Mesos
>  Issue Type: Bug
>Reporter: yangjunfeng
>
> I am new to Spark on Mesos.
> When I run spark-shell on Mesos, I get the following error:
> Command 'hadoop version 2>&1' failed; this is the output:
> sh: hadoop: command not found
> Failed to fetch 
> 'hdfs://188.188.0.189:9000/usr/yjf/spark-2.1.0-bin-hadoop2.7.tgz': Failed to 
> create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was 
> either not found or exited with a non-zero exit status: 127
> Failed to synchronize with agent (it's probably exited)
> How can I fix this problem?
> Thanks a lot!




[jira] [Commented] (MESOS-6915) Encountered a problem while starting mesos-master

2017-01-13 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822262#comment-15822262
 ] 

Kevin Klues commented on MESOS-6915:


I'm not sure which error you think you are encountering. Those appear to be 
ordinary log messages printed while the Mesos master is running.

> Encountered a problem while starting mesos-master
> -
>
> Key: MESOS-6915
> URL: https://issues.apache.org/jira/browse/MESOS-6915
> Project: Mesos
>  Issue Type: Wish
>  Components: agent, master
>Affects Versions: 1.1.0
>Reporter: Jijo Joy
>Assignee: Kevin Klues
>
> I0112 17:23:43.639902 17432 http.cpp:391] HTTP GET for /master/state from 
> 192.168.10.35:44407 with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) 
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
> I0112 17:23:51.350908 17432 http.cpp:391] HTTP GET for /master/state from 
> 192.168.10.35:29323 with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) 
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
> I0112 17:23:52.892664 17430 http.cpp:391] HTTP GET for /master/state from 
> 192.168.10.35:29323 with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64; 
> Trident/7.0; rv:11.0) like Gecko'
> I get the above messages while running mesos-master.sh,
> but I am still able to run the Java and Python examples successfully.
> I am new to Apache Mesos and clustered environments. Kindly help!




[jira] [Assigned] (MESOS-6915) Encountered a problem while starting mesos-master

2017-01-13 Thread Kevin Klues (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues reassigned MESOS-6915:
--

Assignee: Kevin Klues





[jira] [Commented] (MESOS-6914) Command 'hadoop version 2>&1' failed

2017-01-13 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822255#comment-15822255
 ] 

Kevin Klues commented on MESOS-6914:


The resolution depends on exactly what you'd like to accomplish. Do 
you actually want/need HDFS running on your system, or are you just bothered by 
the error message being there? The agent should still start up despite this 
error, correct? 





[jira] [Commented] (MESOS-6054) Agent Crash with Malformed UUID when doing TaskUpdate

2017-01-13 Thread Aaron Wood (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822240#comment-15822240
 ] 

Aaron Wood commented on MESOS-6054:
---

[~dvonthenen] It seems that Mesos expects valid v4 UUIDs. I've submitted a 
patch to fix the segfault that occurs: https://reviews.apache.org/r/55480/
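
As a rough illustration of the kind of check involved (not the code in the patch above), the sketch below validates a raw 16-byte UUID and reports an error string instead of aborting. The function names {{validateUuidV4}} and {{acceptStatusUuid}} are hypothetical; the only facts relied on are the RFC 4122 layout rules: the high nibble of byte 6 holds the version, and the top two bits of byte 8 are binary 10 for this variant. Whether Mesos itself checks the version and variant bits, rather than only the length, is an assumption here.

```cpp
#include <cassert>
#include <string>

// Hypothetical validator (not the Mesos/stout API): returns an empty
// string if `bytes` is a raw 16-byte RFC 4122 version 4 UUID, or a
// description of the problem otherwise.
std::string validateUuidV4(const std::string& bytes)
{
  if (bytes.size() != 16) {
    return "Expected 16 bytes, got " + std::to_string(bytes.size());
  }

  // The high nibble of byte 6 encodes the UUID version.
  const int version = static_cast<unsigned char>(bytes[6]) >> 4;
  if (version != 4) {
    return "Expected UUID version 4, got " + std::to_string(version);
  }

  // The top two bits of byte 8 must be binary 10 (RFC 4122 variant).
  const int variant = static_cast<unsigned char>(bytes[8]) >> 6;
  if (variant != 2) {
    return "Not an RFC 4122 variant UUID";
  }

  return "";
}

// Hypothetical caller: rejects a bad UUID by returning false and
// filling in `error`, rather than dereferencing a failed result.
bool acceptStatusUuid(const std::string& bytes, std::string* error)
{
  assert(error != nullptr);
  *error = validateUuidV4(bytes);
  return error->empty();
}
```

The design point is simply that a malformed UUID arriving over the HTTP API becomes an error the caller can report back to the executor, so invalid input cannot take the agent down.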

> Agent Crash with Malformed UUID when doing TaskUpdate
> -
>
> Key: MESOS-6054
> URL: https://issues.apache.org/jira/browse/MESOS-6054
> Project: Mesos
>  Issue Type: Bug
>  Components: framework api
>Affects Versions: 1.0.0
> Environment: Ubuntu 14.04, Mesos 1.0.0-2.0.89.ubuntu1404, Marathon 
> 1.1.2
>Reporter: David vonThenen
>Priority: Minor
> Attachments: _usr_sbin_mesos-slave.0.crash
>
>
> When using the HTTP API with protobufs, if the UUID in a TaskUpdate is 
> malformed (in this case, a UUID that was base64 encoded), the agent on which 
> the executor is running crashes and restarts.
> Here is a JSON dump of the protobuf used:
> {code}
> {
>   "executor_id": {
>     "value": "executor-scaleio1"
>   },
>   "framework_id": {
>     "value": "ac8545a7-f8fc-431e-bc36-0239c4460658-0002"
>   },
>   "type": 2,
>   "update": {
>     "status": {
>       "task_id": {
>         "value": "scaleio1"
>       },
>       "state": 1,
>       "source": 2,
>       "executor_id": {
>         "value": "executor-scaleio1"
>       },
>       "uuid": 
> "WVdVd01EQTFNakF0TkdVeU9TMDBNell3TFdJMk4yUXRPR05sT1RFNU56VmlPREUw"
>     }
>   }
> }
> {code}
> The master appears to process the ACCEPT calls, but after it has 
> processed all of them, the agents are immediately disconnected:
> {code}
> ...
> ...
> I0816 17:53:09.974340  4010 master.cpp:3342] Processing ACCEPT call for 
> offers: [ 2bf179c3-004a-49e3-98ab-5a75fa773522-O80 ] on agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com) for framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (ScaleIO Framework)
> W0816 17:53:09.974578  4010 validation.cpp:647] Executor executor-scaleio4 
> for task scaleio4 uses less CPUs (None) than the minimum required (0.01). 
> Please update your executor, as this will be mandatory in future releases.
> W0816 17:53:09.974604  4010 validation.cpp:659] Executor executor-scaleio4 
> for task scaleio4 uses less memory (None) than the minimum required (32MB). 
> Please update your executor, as this will be mandatory in future releases.
> I0816 17:53:09.974645  4010 master.cpp:7439] Adding task scaleio4 with 
> resources cpus(*):1; mem(*):2048 on agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:09.974668  4010 master.cpp:3831] Launching task scaleio4 of 
> framework 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (ScaleIO Framework) with 
> resources cpus(*):1; mem(*):2048 on agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:11.306182  4010 master.cpp:1245] Agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com) disconnected
> I0816 17:53:11.306335  4010 master.cpp:2784] Disconnecting agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:11.306520  4010 master.cpp:2803] Deactivating agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:11.306676  4010 master.cpp:1264] Removing framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (ScaleIO Framework) from 
> disconnected agent 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at 
> slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com) because the framework is 
> not checkpointing
> I0816 17:53:11.306798  4010 master.cpp:6448] Removing framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (ScaleIO Framework) from agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at slave(1)@172.31.22.211:5051 
> (ec2-52-89-227-184.us-west-2.compute.amazonaws.com)
> I0816 17:53:11.306882  4010 master.cpp:6833] Updating the state of task 
> scaleio4 of framework 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 (latest 
> state: TASK_LOST, status update state: TASK_LOST)
> I0816 17:53:11.306778  4013 hierarchical.cpp:571] Agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 deactivated
> I0816 17:53:11.307140  4010 master.cpp:6899] Removing task scaleio4 with 
> resources cpus(*):1; mem(*):2048 of framework 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-0001 on agent 
> 2bf179c3-004a-49e3-98ab-5a75fa773522-S7 at

[jira] [Updated] (MESOS-6917) Segfault when the executor sets an invalid UUID when sending a status update.

2017-01-13 Thread Aaron Wood (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Wood updated MESOS-6917:
--
Description: 
A segfault occurs when an executor sets a UUID that's not a valid v4 UUID and 
sends it off to the agent:

{code}
ABORT: (../../3rdparty/stout/include/stout/try.hpp:77): Try::get() but state == 
ERROR: Not a valid UUID
*** Aborted at 1484262968 (unix time) try "date -d @1484262968" if you are 
using GNU date ***
PC: @ 0x7efeb6101428 (unknown)
*** SIGABRT (@0x36b7) received by PID 14007 (TID 0x7efeabd29700) from PID 
14007; stack trace: ***
@ 0x7efeb64a6390 (unknown)
@ 0x7efeb6101428 (unknown)
@ 0x7efeb610302a (unknown)
@ 0x560df739fa6e _Abort()
@ 0x560df739fa9c _Abort()
@ 0x7efebb53a5ad Try<>::get()
@ 0x7efebb5363d6 Try<>::get()
@ 0x7efebbd84809 
mesos::internal::slave::validation::executor::call::validate()
@ 0x7efebbb59b36 mesos::internal::slave::Slave::Http::executor()
@ 0x7efebbc773b8 
_ZZN5mesos8internal5slave5Slave10initializeEvENKUlRKN7process4http7RequestEE1_clES7_
@ 0x7efebbcb5808 
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E1_E9_M_invokeERKSt9_Any_dataS7_
@ 0x7efebbfb2aea std::function<>::operator()()
@ 0x7efebcb158b8 
_ZZZN7process11ProcessBase6_visitERKNS0_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSD_14authentication20AuthenticationResultEEE0_clESN_ENKUlbE1_clEb
@ 0x7efebcb1a10a 
_ZZZNK7process9_DeferredIZZNS_11ProcessBase6_visitERKNS1_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSE_14authentication20AuthenticationResultEEE0_clESO_EUlbE1_EcvSt8functionIFT_T0_EEINS_6FutureINSE_8ResponseEEERKbEEvENKUlS12_E_clES12_ENKUlvE_clEv
@ 0x7efebcb1c5f8 
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEEvEZZNKS0_9_DeferredIZZNS0_11ProcessBase6_visitERKNS7_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_5OwnedINS2_7RequestNKUlRK6OptionINS2_14authentication20AuthenticationResultEEE0_clEST_EUlbE1_EcvSt8functionIFT_T0_EEIS4_RKbEEvENKUlS14_E_clES14_EUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7efebb5ce8ca std::function<>::operator()()
@ 0x7efebb5c4b27 
_ZZN7process8internal8DispatchINS_6FutureINS_4http8ResponseclIRSt8functionIFS5_vS5_RKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clESI_
@ 0x7efebb5d4e1e 
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchINS0_6FutureINS0_4http8ResponseclIRSt8functionIFS9_vS9_RKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
@ 0x7efebcb30baf std::function<>::operator()()
@ 0x7efebcb13fd6 process::ProcessBase::visit()
@ 0x7efebcb1f3c8 process::DispatchEvent::visit()
@ 0x7efebb3ab2ea process::ProcessBase::serve()
@ 0x7efebcb0fe8a process::ProcessManager::resume()
@ 0x7efebcb0c5a3 _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
@ 0x7efebcb1ea34 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7efebcb1e98a 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
@ 0x7efebcb1e91a 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7efeb6980c80 (unknown)
@ 0x7efeb649c6ba start_thread
@ 0x7efeb61d282d (unknown)
Aborted (core dumped)
{code}

https://reviews.apache.org/r/55480/

  was:
A segfault occurs when an executor sends an update with a UUID that's not a 
valid v4 UUID and sends it off to the agent:

{code}
ABORT: (../../3rdparty/stout/include/stout/try.hpp:77): Try::get() but state == 
ERROR: Not a valid UUID
*** Aborted at 1484262968 (unix time) try "date -d @1484262968" if you are 
using GNU date ***
PC: @ 0x7efeb6101428 (unknown)
*** SIGABRT (@0x36b7) received by PID 14007 (TID 0x7efeabd29700) from PID 
14007; stack trace: ***
@ 0x7efeb64a6390 (unknown)
@ 0x7efeb6101428 (unknown)
@ 0x7efeb610302a (unknown)
@ 0x560df739fa6e _Abort()
@ 0x560df739fa9c _Abort()
@ 0x7efebb53a5ad Try<>::get()
@ 0x7efebb5363d6 Try<>::get()
@ 0x7efebbd84809 
mesos::internal::slave::validation::executor::call::validate()
@ 0x7efebbb59b36 mesos::internal::slave::Slave::Http::executor()
@ 0x7efebbc773b8 
_ZZN5mesos8internal5slave5Slave10initializeEvENKUlRKN7process4http7RequestEE1_clES7_
@ 0x7efebbcb5808 
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E1_E9_M_invokeERKSt9_Any_dataS7_
@ 0x7efebbfb2aea std::function<>::operator()()
@

[jira] [Updated] (MESOS-6917) Segfault when the executor sets an invalid UUID when sending a status update.

2017-01-13 Thread Anand Mazumdar (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-6917:
--
Target Version/s: 1.1.1, 1.2.0, 1.0.3  (was: 1.0.0, 1.1.0, 1.2.0)
  Labels: mesosphere  (was: )
 Description: 
A segfault occurs when an executor sends an update with a UUID that's not a 
valid v4 UUID and sends it off to the agent:

{code}
ABORT: (../../3rdparty/stout/include/stout/try.hpp:77): Try::get() but state == 
ERROR: Not a valid UUID
*** Aborted at 1484262968 (unix time) try "date -d @1484262968" if you are 
using GNU date ***
PC: @ 0x7efeb6101428 (unknown)
*** SIGABRT (@0x36b7) received by PID 14007 (TID 0x7efeabd29700) from PID 
14007; stack trace: ***
@ 0x7efeb64a6390 (unknown)
@ 0x7efeb6101428 (unknown)
@ 0x7efeb610302a (unknown)
@ 0x560df739fa6e _Abort()
@ 0x560df739fa9c _Abort()
@ 0x7efebb53a5ad Try<>::get()
@ 0x7efebb5363d6 Try<>::get()
@ 0x7efebbd84809 
mesos::internal::slave::validation::executor::call::validate()
@ 0x7efebbb59b36 mesos::internal::slave::Slave::Http::executor()
@ 0x7efebbc773b8 
_ZZN5mesos8internal5slave5Slave10initializeEvENKUlRKN7process4http7RequestEE1_clES7_
@ 0x7efebbcb5808 
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E1_E9_M_invokeERKSt9_Any_dataS7_
@ 0x7efebbfb2aea std::function<>::operator()()
@ 0x7efebcb158b8 
_ZZZN7process11ProcessBase6_visitERKNS0_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSD_14authentication20AuthenticationResultEEE0_clESN_ENKUlbE1_clEb
@ 0x7efebcb1a10a 
_ZZZNK7process9_DeferredIZZNS_11ProcessBase6_visitERKNS1_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSE_14authentication20AuthenticationResultEEE0_clESO_EUlbE1_EcvSt8functionIFT_T0_EEINS_6FutureINSE_8ResponseEEERKbEEvENKUlS12_E_clES12_ENKUlvE_clEv
@ 0x7efebcb1c5f8 
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEEvEZZNKS0_9_DeferredIZZNS0_11ProcessBase6_visitERKNS7_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_5OwnedINS2_7RequestNKUlRK6OptionINS2_14authentication20AuthenticationResultEEE0_clEST_EUlbE1_EcvSt8functionIFT_T0_EEIS4_RKbEEvENKUlS14_E_clES14_EUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7efebb5ce8ca std::function<>::operator()()
@ 0x7efebb5c4b27 
_ZZN7process8internal8DispatchINS_6FutureINS_4http8ResponseclIRSt8functionIFS5_vS5_RKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clESI_
@ 0x7efebb5d4e1e 
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchINS0_6FutureINS0_4http8ResponseclIRSt8functionIFS9_vS9_RKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
@ 0x7efebcb30baf std::function<>::operator()()
@ 0x7efebcb13fd6 process::ProcessBase::visit()
@ 0x7efebcb1f3c8 process::DispatchEvent::visit()
@ 0x7efebb3ab2ea process::ProcessBase::serve()
@ 0x7efebcb0fe8a process::ProcessManager::resume()
@ 0x7efebcb0c5a3 _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
@ 0x7efebcb1ea34 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7efebcb1e98a 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
@ 0x7efebcb1e91a 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7efeb6980c80 (unknown)
@ 0x7efeb649c6ba start_thread
@ 0x7efeb61d282d (unknown)
Aborted (core dumped)
{code}

  was:
A segfault occurs when an executor sets a UUID that's not a valid v4 UUID and 
sends it off to the agent:

{code}
ABORT: (../../3rdparty/stout/include/stout/try.hpp:77): Try::get() but state == 
ERROR: Not a valid UUID
*** Aborted at 1484262968 (unix time) try "date -d @1484262968" if you are 
using GNU date ***
PC: @ 0x7efeb6101428 (unknown)
*** SIGABRT (@0x36b7) received by PID 14007 (TID 0x7efeabd29700) from PID 
14007; stack trace: ***
@ 0x7efeb64a6390 (unknown)
@ 0x7efeb6101428 (unknown)
@ 0x7efeb610302a (unknown)
@ 0x560df739fa6e _Abort()
@ 0x560df739fa9c _Abort()
@ 0x7efebb53a5ad Try<>::get()
@ 0x7efebb5363d6 Try<>::get()
@ 0x7efebbd84809 
mesos::internal::slave::validation::executor::call::validate()
@ 0x7efebbb59b36 mesos::internal::slave::Slave::Http::executor()
@ 0x7efebbc773b8 
_ZZN5mesos8internal5slave5Slave10initializeEvENKUlRKN7process4http7RequestEE1_clES7_
@ 0x7efebbcb5808

[jira] [Updated] (MESOS-6917) Segfault when the executor sets a UUID that is not a valid v4 UUID

2017-01-13 Thread Kevin Klues (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-6917:
---
Affects Version/s: (was: 1.2.0)

> Segfault when the executor sets a UUID that is not a valid v4 UUID
> --
>
> Key: MESOS-6917
> URL: https://issues.apache.org/jira/browse/MESOS-6917
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>Priority: Blocker
>
> A segfault occurs when an executor sets a UUID that's not a valid v4 UUID and 
> sends it off to the agent:
> {code}
> ABORT: (../../3rdparty/stout/include/stout/try.hpp:77): Try::get() but state 
> == ERROR: Not a valid UUID
> *** Aborted at 1484262968 (unix time) try "date -d @1484262968" if you are 
> using GNU date ***
> PC: @ 0x7efeb6101428 (unknown)
> *** SIGABRT (@0x36b7) received by PID 14007 (TID 0x7efeabd29700) from PID 
> 14007; stack trace: ***
> @ 0x7efeb64a6390 (unknown)
> @ 0x7efeb6101428 (unknown)
> @ 0x7efeb610302a (unknown)
> @ 0x560df739fa6e _Abort()
> @ 0x560df739fa9c _Abort()
> @ 0x7efebb53a5ad Try<>::get()
> @ 0x7efebb5363d6 Try<>::get()
> @ 0x7efebbd84809 
> mesos::internal::slave::validation::executor::call::validate()
> @ 0x7efebbb59b36 mesos::internal::slave::Slave::Http::executor()
> @ 0x7efebbc773b8 
> _ZZN5mesos8internal5slave5Slave10initializeEvENKUlRKN7process4http7RequestEE1_clES7_
> @ 0x7efebbcb5808 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E1_E9_M_invokeERKSt9_Any_dataS7_
> @ 0x7efebbfb2aea std::function<>::operator()()
> @ 0x7efebcb158b8 
> _ZZZN7process11ProcessBase6_visitERKNS0_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSD_14authentication20AuthenticationResultEEE0_clESN_ENKUlbE1_clEb
> @ 0x7efebcb1a10a 
> _ZZZNK7process9_DeferredIZZNS_11ProcessBase6_visitERKNS1_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSE_14authentication20AuthenticationResultEEE0_clESO_EUlbE1_EcvSt8functionIFT_T0_EEINS_6FutureINSE_8ResponseEEERKbEEvENKUlS12_E_clES12_ENKUlvE_clEv
> @ 0x7efebcb1c5f8 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEEvEZZNKS0_9_DeferredIZZNS0_11ProcessBase6_visitERKNS7_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_5OwnedINS2_7RequestNKUlRK6OptionINS2_14authentication20AuthenticationResultEEE0_clEST_EUlbE1_EcvSt8functionIFT_T0_EEIS4_RKbEEvENKUlS14_E_clES14_EUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7efebb5ce8ca std::function<>::operator()()
> @ 0x7efebb5c4b27 
> _ZZN7process8internal8DispatchINS_6FutureINS_4http8ResponseclIRSt8functionIFS5_vS5_RKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clESI_
> @ 0x7efebb5d4e1e 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchINS0_6FutureINS0_4http8ResponseclIRSt8functionIFS9_vS9_RKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7efebcb30baf std::function<>::operator()()
> @ 0x7efebcb13fd6 process::ProcessBase::visit()
> @ 0x7efebcb1f3c8 process::DispatchEvent::visit()
> @ 0x7efebb3ab2ea process::ProcessBase::serve()
> @ 0x7efebcb0fe8a process::ProcessManager::resume()
> @ 0x7efebcb0c5a3 
> _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
> @ 0x7efebcb1ea34 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7efebcb1e98a 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
> @ 0x7efebcb1e91a 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7efeb6980c80 (unknown)
> @ 0x7efeb649c6ba start_thread
> @ 0x7efeb61d282d (unknown)
> Aborted (core dumped)
> {code}




[jira] [Updated] (MESOS-6917) Segfault when the executor sets a UUID that is not a valid v4 UUID

2017-01-13 Thread Kevin Klues (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-6917:
---
Affects Version/s: 1.2.0

> Segfault when the executor sets a UUID that is not a valid v4 UUID
> --
>
> Key: MESOS-6917
> URL: https://issues.apache.org/jira/browse/MESOS-6917
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.2.0
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>Priority: Blocker
>
> A segfault occurs when an executor sets a UUID that's not a valid v4 UUID and 
> sends it off to the agent:
> {code}
> ABORT: (../../3rdparty/stout/include/stout/try.hpp:77): Try::get() but state 
> == ERROR: Not a valid UUID
> *** Aborted at 1484262968 (unix time) try "date -d @1484262968" if you are 
> using GNU date ***
> PC: @ 0x7efeb6101428 (unknown)
> *** SIGABRT (@0x36b7) received by PID 14007 (TID 0x7efeabd29700) from PID 
> 14007; stack trace: ***
> @ 0x7efeb64a6390 (unknown)
> @ 0x7efeb6101428 (unknown)
> @ 0x7efeb610302a (unknown)
> @ 0x560df739fa6e _Abort()
> @ 0x560df739fa9c _Abort()
> @ 0x7efebb53a5ad Try<>::get()
> @ 0x7efebb5363d6 Try<>::get()
> @ 0x7efebbd84809 
> mesos::internal::slave::validation::executor::call::validate()
> @ 0x7efebbb59b36 mesos::internal::slave::Slave::Http::executor()
> @ 0x7efebbc773b8 
> _ZZN5mesos8internal5slave5Slave10initializeEvENKUlRKN7process4http7RequestEE1_clES7_
> @ 0x7efebbcb5808 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E1_E9_M_invokeERKSt9_Any_dataS7_
> @ 0x7efebbfb2aea std::function<>::operator()()
> @ 0x7efebcb158b8 
> _ZZZN7process11ProcessBase6_visitERKNS0_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSD_14authentication20AuthenticationResultEEE0_clESN_ENKUlbE1_clEb
> @ 0x7efebcb1a10a 
> _ZZZNK7process9_DeferredIZZNS_11ProcessBase6_visitERKNS1_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSE_14authentication20AuthenticationResultEEE0_clESO_EUlbE1_EcvSt8functionIFT_T0_EEINS_6FutureINSE_8ResponseEEERKbEEvENKUlS12_E_clES12_ENKUlvE_clEv
> @ 0x7efebcb1c5f8 
> _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEEvEZZNKS0_9_DeferredIZZNS0_11ProcessBase6_visitERKNS7_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_5OwnedINS2_7RequestNKUlRK6OptionINS2_14authentication20AuthenticationResultEEE0_clEST_EUlbE1_EcvSt8functionIFT_T0_EEIS4_RKbEEvENKUlS14_E_clES14_EUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7efebb5ce8ca std::function<>::operator()()
> @ 0x7efebb5c4b27 
> _ZZN7process8internal8DispatchINS_6FutureINS_4http8ResponseclIRSt8functionIFS5_vS5_RKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clESI_
> @ 0x7efebb5d4e1e 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchINS0_6FutureINS0_4http8ResponseclIRSt8functionIFS9_vS9_RKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7efebcb30baf std::function<>::operator()()
> @ 0x7efebcb13fd6 process::ProcessBase::visit()
> @ 0x7efebcb1f3c8 process::DispatchEvent::visit()
> @ 0x7efebb3ab2ea process::ProcessBase::serve()
> @ 0x7efebcb0fe8a process::ProcessManager::resume()
> @ 0x7efebcb0c5a3 
> _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
> @ 0x7efebcb1ea34 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7efebcb1e98a 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
> @ 0x7efebcb1e91a 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7efeb6980c80 (unknown)
> @ 0x7efeb649c6ba start_thread
> @ 0x7efeb61d282d (unknown)
> Aborted (core dumped)
> {code}




[jira] [Commented] (MESOS-6907) FutureTest.After3 is flaky

2017-01-13 Thread Alexander Rojas (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822216#comment-15822216
 ] 

Alexander Rojas commented on MESOS-6907:


From the behavior of the tests, and the snippets that _fix_ the issue, there 
must be someone keeping a reference to {{future.data}} around for longer than 
expected. The known references are kept in copies of the future: [one in the 
caller|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/tests/future_tests.cpp#L275-L279],
which is destroyed with the call to {{after()}}. The [other copy of the 
future|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1481]
is kept in a {{Timer}} instance. However, copies of this timer are moved 
around. One copy of the timer is controlled by the {{future}} itself, and it is 
stored in the vector of [{{onAny()}} 
callbacks|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1483].
This copy of the timer is destroyed when the [timer 
expires|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1345]
or when the original [future is 
satisfied|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1371-L1372]
(which doesn't happen in this test).

There is at least one more known copy of the {{timer}}, which is kept by the 
[{{Clock}} 
itself|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/clock.cpp#L281-L294].
 However, libprocess itself gets involved in managing the lifetime of the 
timers through a [callback 
function|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/clock.cpp#L70-L71]
 which is set when [libprocess is 
starting|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L1069].
 

My theory is that libprocess is keeping these callbacks for longer than 
expected, but so far I haven't been able to prove it. However, I think it is 
perfectly normal that this behavior occurs and the test probably needs to be 
updated (this last paragraph is just a conjecture at this point).
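
To illustrate the conjecture above: a weak reference stays resolvable for as 
long as any pending timer callback still holds a strong reference to the 
shared data, even after the caller's copy is gone. Below is a minimal sketch 
of that effect, written in Python rather than libprocess C++. {{FutureData}}, 
{{Clock}}, {{make_callback}} and {{weak_is_alive_after_scope}} are 
hypothetical stand-ins and do not correspond to real libprocess classes.

```python
import gc
import weakref


class FutureData:
    """Stand-in for the shared state behind a future (hypothetical)."""


class Clock:
    """Toy clock that owns pending timer callbacks (hypothetical)."""

    def __init__(self):
        self.timers = []

    def schedule(self, callback):
        self.timers.append(callback)

    def expire_all(self):
        # Run every pending callback, then drop the clock's references.
        pending, self.timers = self.timers, []
        for callback in pending:
            callback()


def make_callback(data):
    # The returned closure keeps a strong reference to `data`.
    def on_timer():
        return data
    return on_timer


def weak_is_alive_after_scope(clock, release_timers):
    data = FutureData()
    weak = weakref.ref(data)
    clock.schedule(make_callback(data))
    del data  # the caller's copy goes out of scope
    if release_timers:
        clock.expire_all()
    gc.collect()
    return weak() is not None
```

With {{release_timers=False}} the weak reference still resolves, which mirrors 
the flaky {{EXPECT_NONE}}; once the clock has dropped its callbacks, it no 
longer does.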

> FutureTest.After3 is flaky
> --
>
> Key: MESOS-6907
> URL: https://issues.apache.org/jira/browse/MESOS-6907
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Alexander Rojas
>
> There is apparently a race condition between the time an instance of 
> {{Future}} goes out of scope and when the enclosing data is actually 
> deleted, if {{Future<T>::after(Duration, lambda::function<Future<T>(const Future<T>&)>)}} is called.
> The issue is more likely to occur if the machine is under load or if it is 
> not a very powerful one. The easiest way to reproduce it is to run:
> {code}
> $ stress -c 4 -t 2600 -d 2 -i 2 &
> $ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 
> --gtest_break_on_failure
> {code}
> An exploratory fix for the issue is to change the test to:
> {code}
> TEST(FutureTest, After3)
> {
>   Future<Nothing> future;
>   process::WeakFuture<Nothing> weak_future(future);
>   EXPECT_SOME(weak_future.get());
>   {
>     Clock::pause();
>     // The original future disappears here. After this call the
>     // original future goes out of scope and should not be reachable
>     // anymore.
>     future = future
>       .after(Milliseconds(1), [](Future<Nothing> f) {
>         f.discard();
>         return Nothing();
>       });
>     Clock::advance(Seconds(2));
>     Clock::settle();
>     AWAIT_READY(future);
>   }
>   if (weak_future.get().isSome()) {
>     os::sleep(Seconds(1));
>   }
>   EXPECT_NONE(weak_future.get());
>   EXPECT_FALSE(future.hasDiscard());
> }
> {code}
> The interesting thing about the fix is that both extra snippets are needed 
> (neither one alone is enough) to prevent the issue from happening.




[jira] [Comment Edited] (MESOS-6010) Docker registry puller shows decode error "No response decoded".

2017-01-13 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822158#comment-15822158
 ] 

Jie Yu edited comment on MESOS-6010 at 1/13/17 7:11 PM:


Did some digging today on this issue.

1) As [~nfnt] mentioned, there is a proxy between agent and docker registry 
(e.g., squid) that is doing HTTP CONNECT tunneling (e.g., 
http://wiki.squid-cache.org/Features/HTTPS).
2) Recent versions of curl (after 7.11.1) include the proxy response 
(200 Connection established) in the response as well 
(https://curl.haxx.se/mail/lib-2005-10/0024.html).
3) According to RFC 2616, it's ok to have zero headers (See 4.1, "zero or more 
header fields", https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.1).
4) According to RFC 2616, if no header is specified, the client should read the 
response body till EOF (See 4.4, 
https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.4).
5) I think our http parser does the right thing: since we didn't feed the 
decoder an EOF (i.e., decoder->decode("", 0)), the parser thinks that there 
will be more body content from the socket.
6) That results in "No response decoded".
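
One possible way to cope with the extra proxy reply described in the points 
above is to drop it before the buffer reaches the decoder. The sketch below 
is in Python and is not the Mesos fix; {{strip_connect_responses}} is a 
hypothetical helper. It assumes the proxy reply can be recognised by its 
"200 Connection established" status line, which is a heuristic because the 
reason phrase is not standardized.

```python
def strip_connect_responses(raw: bytes) -> bytes:
    """Drop leading proxy replies to CONNECT (hypothetical helper)."""
    separator = b"\r\n\r\n"
    while raw.startswith(b"HTTP/"):
        head, found, rest = raw.partition(separator)
        status_line = head.split(b"\r\n", 1)[0]
        parts = status_line.split(b" ", 2)
        # Assumption: the proxy reply is identified by code and reason phrase.
        is_connect_reply = (
            bool(found)
            and len(parts) == 3
            and parts[1] == b"200"
            and parts[2].lower() == b"connection established"
        )
        if not is_connect_reply:
            break
        raw = rest
    return raw
```

Only the status line is inspected, so a proxy reply that carries its own 
headers is skipped as well, and anything that does not look like a CONNECT 
reply is passed through unchanged.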

Some relevant threads:
https://github.com/nodejs/node-v0.x-archive/issues/1711
https://github.com/nodejs/http-parser/issues/327
https://github.com/nodejs/node-v0.x-archive/issues/1956
https://curl.haxx.se/mail/lib-2005-10/0023.html
http://stackoverflow.com/questions/16965530/what-to-do-with-extra-http-header-from-proxy







> Docker registry puller shows decode error "No response decoded".
> 
>
> Key: MESOS-6010
> URL: https://issues.apache.org/jira/browse/MESOS-6010
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Sunzhe
>Assignee: Jan Schlicht
>Priority: Critical
>  Labels: Docker, mesos-containerizer
>
> The {{mesos-agent}} flags:
> {code}
>  GLOG_v=1 ./bin/mesos-agent.sh \
>   --master=zk://${MESOS_MASTER_IP}:2181/mesos  \
>   --ip=10.100.3.3  \
>   --work_dir=${MESOS_WORK_DIR} \
>   --isolation=cgroups/devices,gpu/nvidia,disk/du,docker/runtime,filesystem/linux \
>   --enforce_container_disk_quota \
>   --containerizers=mesos \
>   --image_providers=docker \
>   --executor_environment_variables="{}"
> {code}
> And the {{mesos-execute}} flags:
> {code}
>  ./src/mesos-execute \
>--master=${MESOS_MASTER_IP}:5050 \
>--name=${INSTANCE_NAME} \
>--docker_image=${DOCKER_IMAGE} \
>--framework_capabilities=GPU_RESOURCES \
>--shell=false
> {code}
> But when running {{./src/mesos-execute}}, errors like the following appear:
> {code}
> I0809 16:11:46.207875 25583 scheduler.cpp:172] Version: 1.0.0
> I0809 16:11:46.212442 25582 scheduler.cpp:461] New master detected at 
> master@10.103.0.125:5050
> Subscribed with ID '168ab900-ee7e-4829-a59a-d16de956637e-0009'
> Submitted task 'test' to agent '168ab900-ee7e-4829-a59a-d16de956637e-S1'
> Received status update TASK_FAILED for task 'test'
>   message: 'Failed to launch container: Failed to decode HTTP responses: No 
> response decoded
> HTTP/1.1 200 Connection established
> HTTP/1.1 401 Unauthorized
> Content-Type: application/json; charset=utf-8
> Docker-Distribution-Api-Version: registry/2.0
> Www-Authenticate: Bearer 
> realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/redis:pull;
> Date: Tue, 09 Aug 2016 08:10:32 GMT
>

[jira] [Comment Edited] (MESOS-6010) Docker registry puller shows decode error "No response decoded".

2017-01-13 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822158#comment-15822158
 ] 

Jie Yu edited comment on MESOS-6010 at 1/13/17 7:07 PM:


Did some digging today on this issue.

1) As [~nfnt] mentioned, there is a proxy between agent and docker registry 
(e.g., squid) that is doing HTTP CONNECT tunneling (e.g., 
http://wiki.squid-cache.org/Features/HTTPS).
2) Recent versions of curl (after 7.11.1) include the proxy response 
(200 Connection established) in the response as well 
(https://curl.haxx.se/mail/lib-2005-10/0024.html).
3) According to RFC 2616, it's ok to have zero headers (See 4.1, "zero or more 
header fields", https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.1).
4) According to RFC 2616, if no header is specified, the client should read the 
response body till EOF (See 4.4, 
https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.4).
5) I think our http parser does the right thing: since we didn't feed the 
decoder an EOF (i.e., decoder->decode("", 0)), the parser thinks that there 
will be more body content from the socket.
6) That results in "No response decoded".

Some relevant threads:
https://github.com/nodejs/node-v0.x-archive/issues/1711
https://github.com/nodejs/http-parser/issues/327
https://github.com/nodejs/node-v0.x-archive/issues/1956
https://curl.haxx.se/mail/lib-2005-10/0023.html







> Docker registry puller shows decode error "No response decoded".
> 
>
> Key: MESOS-6010
> URL: https://issues.apache.org/jira/browse/MESOS-6010
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Sunzhe
>Assignee: Jan Schlicht
>Priority: Critical
>  Labels: Docker, mesos-containerizer
>
> The {{mesos-agent}} flags:
> {code}
>  GLOG_v=1 ./bin/mesos-agent.sh \
>   --master=zk://${MESOS_MASTER_IP}:2181/mesos  \
>   --ip=10.100.3.3  \
>   --work_dir=${MESOS_WORK_DIR} \
>   --isolation=cgroups/devices,gpu/nvidia,disk/du,docker/runtime,filesystem/linux \
>   --enforce_container_disk_quota \
>   --containerizers=mesos \
>   --image_providers=docker \
>   --executor_environment_variables="{}"
> {code}
> And the {{mesos-execute}} flags:
> {code}
>  ./src/mesos-execute \
>--master=${MESOS_MASTER_IP}:5050 \
>--name=${INSTANCE_NAME} \
>--docker_image=${DOCKER_IMAGE} \
>--framework_capabilities=GPU_RESOURCES \
>--shell=false
> {code}
> But when running {{./src/mesos-execute}}, errors like the following appear:
> {code}
> I0809 16:11:46.207875 25583 scheduler.cpp:172] Version: 1.0.0
> I0809 16:11:46.212442 25582 scheduler.cpp:461] New master detected at 
> master@10.103.0.125:5050
> Subscribed with ID '168ab900-ee7e-4829-a59a-d16de956637e-0009'
> Submitted task 'test' to agent '168ab900-ee7e-4829-a59a-d16de956637e-S1'
> Received status update TASK_FAILED for task 'test'
>   message: 'Failed to launch container: Failed to decode HTTP responses: No 
> response decoded
> HTTP/1.1 200 Connection established
> HTTP/1.1 401 Unauthorized
> Content-Type: application/json; charset=utf-8
> Docker-Distribution-Api-Version: registry/2.0
> Www-Authenticate: Bearer 
> realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/redis:pull;
> Date: Tue, 09 Aug 2016 08:10:32 GMT
> Content-Length: 145
> Strict-Transport-Security: max-age=31536000
> {"errors":[{"code":"UNAUTHORIZED","message":"authentication 
>

[jira] [Comment Edited] (MESOS-6010) Docker registry puller shows decode error "No response decoded".

2017-01-13 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822158#comment-15822158
 ] 

Jie Yu edited comment on MESOS-6010 at 1/13/17 6:57 PM:


Did some digging today on this issue.

1) As [~nfnt] mentioned, there is a proxy between agent and docker registry 
(e.g., squid) that is doing HTTP CONNECT tunneling (e.g., 
http://wiki.squid-cache.org/Features/HTTPS).
2) Recent versions of curl (after 7.11.1) include the proxy response 
(200 Connection established) in the response as well 
(https://curl.haxx.se/mail/lib-2005-10/0024.html).
3) According to RFC 2616, it's ok to have zero headers (See 4.1, "zero or more 
header fields", https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.1).
4) According to RFC 2616, if no header is specified, the client should read the 
response body till EOF (See 4.4, 
https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.4).
5) I think our http parser does the right thing: since we didn't feed the 
decoder an EOF (i.e., decoder->decode("", 0)), the parser thinks that there 
will be more body content from the socket.
6) That results in "No response decoded".

Some relevant threads:
https://github.com/nodejs/node-v0.x-archive/issues/1711
https://github.com/nodejs/http-parser/issues/327
https://github.com/nodejs/node-v0.x-archive/issues/1956







> Docker registry puller shows decode error "No response decoded".
> 
>
> Key: MESOS-6010
> URL: https://issues.apache.org/jira/browse/MESOS-6010
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Sunzhe
>Assignee: Jan Schlicht
>Priority: Critical
>  Labels: Docker, mesos-containerizer
>
> The {{mesos-agent}} flags:
> {code}
>  GLOG_v=1 ./bin/mesos-agent.sh \
>   --master=zk://${MESOS_MASTER_IP}:2181/mesos  \
>   --ip=10.100.3.3  \
>   --work_dir=${MESOS_WORK_DIR} \
>   --isolation=cgroups/devices,gpu/nvidia,disk/du,docker/runtime,filesystem/linux \
>   --enforce_container_disk_quota \
>   --containerizers=mesos \
>   --image_providers=docker \
>   --executor_environment_variables="{}"
> {code}
> And the {{mesos-execute}} flags:
> {code}
>  ./src/mesos-execute \
>--master=${MESOS_MASTER_IP}:5050 \
>--name=${INSTANCE_NAME} \
>--docker_image=${DOCKER_IMAGE} \
>--framework_capabilities=GPU_RESOURCES \
>--shell=false
> {code}
> But when running {{./src/mesos-execute}}, errors like the following appear:
> {code}
> I0809 16:11:46.207875 25583 scheduler.cpp:172] Version: 1.0.0
> I0809 16:11:46.212442 25582 scheduler.cpp:461] New master detected at 
> master@10.103.0.125:5050
> Subscribed with ID '168ab900-ee7e-4829-a59a-d16de956637e-0009'
> Submitted task 'test' to agent '168ab900-ee7e-4829-a59a-d16de956637e-S1'
> Received status update TASK_FAILED for task 'test'
>   message: 'Failed to launch container: Failed to decode HTTP responses: No 
> response decoded
> HTTP/1.1 200 Connection established
> HTTP/1.1 401 Unauthorized
> Content-Type: application/json; charset=utf-8
> Docker-Distribution-Api-Version: registry/2.0
> Www-Authenticate: Bearer 
> realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/redis:pull;
> Date: Tue, 09 Aug 2016 08:10:32 GMT
> Content-Length: 145
> Strict-Transport-Security: max-age=31536000
> {"errors":[{"code":"UNAUTHORIZED","message":"authentication 
>

[jira] [Commented] (MESOS-6010) Docker registry puller shows decode error "No response decoded".

2017-01-13 Thread Jie Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822158#comment-15822158
 ] 

Jie Yu commented on MESOS-6010:
---

Did some digging today on this issue.

1) As [~nfnt] mentioned, there is a proxy between agent and docker registry 
(e.g., squid) that is doing HTTP CONNECT tunneling (e.g., 
http://wiki.squid-cache.org/Features/HTTPS).
2) Recent versions of curl (after 7.11.1) include the proxy response 
(200 Connection established) in the response as well 
(https://curl.haxx.se/mail/lib-2005-10/0024.html).
3) According to RFC 2616, it's ok to have zero headers (See 4.1, "zero or more 
header fields", https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.1).
4) According to RFC 2616, if no header is specified, the client should read the 
response body till EOF (See 4.4, 
https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.4).
5) I think our http parser does the right thing: since we didn't feed the 
decoder an EOF (i.e., decoder->decode("", 0)), the parser thinks that there 
will be more body content from the socket.
6) That results in "No response decoded".
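
To make points 4 to 6 concrete, here is a toy model of a decoder that treats 
a body without a declared length as running until EOF, where EOF is signalled 
by feeding an empty chunk (the analogue of decoder->decode("", 0)). It is a 
Python sketch under simplifying assumptions: {{ToyResponseDecoder}} is 
hypothetical, ignores Content-Length entirely, and is not the libprocess 
decoder.

```python
class ToyResponseDecoder:
    """Toy decoder for the read-until-EOF case only (hypothetical)."""

    def __init__(self):
        self.buffer = b""
        self.responses = []

    def decode(self, chunk: bytes) -> list:
        if chunk:
            # More data may follow, so nothing can be completed yet.
            self.buffer += chunk
            return list(self.responses)
        # An empty chunk signals EOF: the buffered bytes form one response.
        head, found, body = self.buffer.partition(b"\r\n\r\n")
        if found:
            status_line = head.split(b"\r\n", 1)[0]
            self.responses.append((status_line, body))
        self.buffer = b""
        return list(self.responses)
```

Until the empty chunk arrives the decoder reports no responses at all, which 
is the toy equivalent of "No response decoded".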

Some relevant threads:
https://github.com/nodejs/node-v0.x-archive/issues/1711
https://github.com/nodejs/http-parser/issues/327
https://github.com/nodejs/node-v0.x-archive/issues/1956



> Docker registry puller shows decode error "No response decoded".
> 
>
> Key: MESOS-6010
> URL: https://issues.apache.org/jira/browse/MESOS-6010
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Sunzhe
>Assignee: Jan Schlicht
>Priority: Critical
>  Labels: Docker, mesos-containerizer
>
> The {{mesos-agent}} flags:
> {code}
>  GLOG_v=1 ./bin/mesos-agent.sh \
>   --master=zk://${MESOS_MASTER_IP}:2181/mesos  \
>   --ip=10.100.3.3  \
>   --work_dir=${MESOS_WORK_DIR} \
>   --isolation=cgroups/devices,gpu/nvidia,disk/du,docker/runtime,filesystem/linux \
>   --enforce_container_disk_quota \
>   --containerizers=mesos \
>   --image_providers=docker \
>   --executor_environment_variables="{}"
> {code}
> And the {{mesos-execute}} flags:
> {code}
>  ./src/mesos-execute \
>--master=${MESOS_MASTER_IP}:5050 \
>--name=${INSTANCE_NAME} \
>--docker_image=${DOCKER_IMAGE} \
>--framework_capabilities=GPU_RESOURCES \
>--shell=false
> {code}
> But when running {{./src/mesos-execute}}, errors like the following appear:
> {code}
> I0809 16:11:46.207875 25583 scheduler.cpp:172] Version: 1.0.0
> I0809 16:11:46.212442 25582 scheduler.cpp:461] New master detected at 
> master@10.103.0.125:5050
> Subscribed with ID '168ab900-ee7e-4829-a59a-d16de956637e-0009'
> Submitted task 'test' to agent '168ab900-ee7e-4829-a59a-d16de956637e-S1'
> Received status update TASK_FAILED for task 'test'
>   message: 'Failed to launch container: Failed to decode HTTP responses: No 
> response decoded
> HTTP/1.1 200 Connection established
> HTTP/1.1 401 Unauthorized
> Content-Type: application/json; charset=utf-8
> Docker-Distribution-Api-Version: registry/2.0
> Www-Authenticate: Bearer 
> realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/redis:pull;
> Date: Tue, 09 Aug 2016 08:10:32 GMT
> Content-Length: 145
> Strict-Transport-Security: max-age=31536000
> {"errors":[{"code":"UNAUTHORIZED","message":"authentication 
> required","detail":[{"Type":"repository","Name":"library/redis","Action":"pull"}]}]}
> ; Container destroyed while provisioning images'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LAUNCH_FAILED
> {code}




[jira] [Created] (MESOS-6922) SlaveRecoveryTest/0.RecoverTerminatedExecutor is flaky

2017-01-13 Thread Greg Mann (JIRA)

Greg Mann created MESOS-6922:


 Summary: SlaveRecoveryTest/0.RecoverTerminatedExecutor is flaky
 Key: MESOS-6922
 URL: https://issues.apache.org/jira/browse/MESOS-6922
 Project: Mesos
  Issue Type: Bug
  Components: tests
 Environment: CentOS 7
Reporter: Greg Mann


This was observed on ASF CI. Find attached the log from a failed run; it 
appears that too many status updates are being received:
{code}
/mesos/src/tests/slave_recovery_tests.cpp:1350: Failure
Mock function called more times than expected - returning directly.
Function call: statusUpdate(0x7ffcf00155b8, @0x2b3f4f7ab8c0 120-byte object 
<50-66 6A-45 3F-2B 00-00 00-00 00-00 00-00 00-00 DF-13 00-00 00-00 00-00 70-59 
01-90 3F-2B 00-00 A0-D7 00-90 3F-2B 00-00 05-00 00-00 01-00 00-00 D0-01 91-04 
00-00 00-00 D0-9C 00-90 3F-2B 00-00 C0-EB 01-90 3F-2B 00-00 18-00 00-00 00-2B 
00-00 47-98 7C-B9 92-29 D6-41 90-5B 02-90 3F-2B 00-00 00-00 00-00 00-00 00-00 
70-6E 01-90 3F-2B 00-00 00-00 00-00 00-00 00-00>)
 Expected: to be called once
   Actual: called twice - over-saturated and active
{code}




[jira] [Updated] (MESOS-6922) SlaveRecoveryTest/0.RecoverTerminatedExecutor is flaky

2017-01-13 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6922:
-
Attachment: SlaveRecoveryTest.RecoverTerminatedExecutor.txt

> SlaveRecoveryTest/0.RecoverTerminatedExecutor is flaky
> --
>
> Key: MESOS-6922
> URL: https://issues.apache.org/jira/browse/MESOS-6922
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
> Environment: CentOS 7
>Reporter: Greg Mann
>  Labels: tests
> Attachments: SlaveRecoveryTest.RecoverTerminatedExecutor.txt
>
>
> This was observed on ASF CI. Find attached the log from a failed run; it 
> appears that too many status updates are being received:
> {code}
> /mesos/src/tests/slave_recovery_tests.cpp:1350: Failure
> Mock function called more times than expected - returning directly.
> Function call: statusUpdate(0x7ffcf00155b8, @0x2b3f4f7ab8c0 120-byte 
> object <50-66 6A-45 3F-2B 00-00 00-00 00-00 00-00 00-00 DF-13 00-00 00-00 
> 00-00 70-59 01-90 3F-2B 00-00 A0-D7 00-90 3F-2B 00-00 05-00 00-00 01-00 00-00 
> D0-01 91-04 00-00 00-00 D0-9C 00-90 3F-2B 00-00 C0-EB 01-90 3F-2B 00-00 18-00 
> 00-00 00-2B 00-00 47-98 7C-B9 92-29 D6-41 90-5B 02-90 3F-2B 00-00 00-00 00-00 
> 00-00 00-00 70-6E 01-90 3F-2B 00-00 00-00 00-00 00-00 00-00>)
>  Expected: to be called once
>Actual: called twice - over-saturated and active
> {code}




[jira] [Created] (MESOS-6921) Document posix isolators could not isolate resources in configuration.md

2017-01-13 Thread haosdent huang (JIRA)

haosdent huang created MESOS-6921:
-

 Summary: Document posix isolators could not isolate resources in 
configuration.md
 Key: MESOS-6921
 URL: https://issues.apache.org/jira/browse/MESOS-6921
 Project: Mesos
  Issue Type: Improvement
  Components: documentation
Reporter: haosdent huang
Priority: Trivial


POSIX isolators only report resource usage without doing any actual isolation. 
We should make this more obvious in {{slave/flags.cpp}} and configuration.md.




[jira] [Commented] (MESOS-6920) Validate the UUID in Master::statusUpdate.

2017-01-13 Thread James Peach (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822023#comment-15822023
 ] 

James Peach commented on MESOS-6920:


| Validate the StatusUpdate UUID in Master::statusUpdate. | 
https://reviews.apache.org/r/55509/ |

> Validate the UUID in Master::statusUpdate.
> --
>
> Key: MESOS-6920
> URL: https://issues.apache.org/jira/browse/MESOS-6920
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Assignee: James Peach
>
> Validate the UUID in Master::statusUpdate() to avoid the possibility of 
> triggering a CHECK when logging the {{StatusUpdate}}.




[jira] [Updated] (MESOS-6920) Validate the UUID in Master::statusUpdate.

2017-01-13 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-6920:
---
Shepherd: Yan Xu

> Validate the UUID in Master::statusUpdate.
> --
>
> Key: MESOS-6920
> URL: https://issues.apache.org/jira/browse/MESOS-6920
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Assignee: James Peach
>
> Validate the UUID in Master::statusUpdate() to avoid the possibility of 
> triggering a CHECK when logging the {{StatusUpdate}}.




[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2017-01-13 Thread haosdent huang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821985#comment-15821985
 ] 

haosdent huang commented on MESOS-5342:
---

Hi [~klueska], [~ct.clmsn]. I read the file briefly before and am going to 
read it again tomorrow. The CgroupsCpushareIsolatorProcess has been changed to 
CpuSubsystem. Also, some people from Huawei have recently been adding 
NUMA/cpuset support to Mesos; their implementation treats cpuset like network 
ports, which is simpler than the proposal in 
https://docs.google.com/document/d/1G3L1Tdulg5iW7hZ2WXbG-bqROILu7zdBh2aWYu3An6A/
 . I will try to add some comments tomorrow and see if we can merge the work 
of both the Huawei folks and [~ct.clmsn] into the proposal.

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>  Labels: cgroups, cpu, cpu-usage, gpu, isolation, isolator, 
> mentor, perfomance
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.
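
As a rough illustration of selecting cores from latency and a utilization 
histogram, here is a small Python sketch. It substitutes a plain greedy 
ranking for the mathematical optimization the ticket mentions, and 
everything in it is an assumption: the {{select_cores}} name, the dictionary 
inputs, and the {{weight}} that trades latency against load.

```python
def select_cores(latency, utilization, count, weight=1.0):
    """Pick `count` cores with the lowest combined score (hypothetical heuristic).

    latency:     core id -> cost of reaching the device (e.g. a GPU)
    utilization: core id -> number of containers already placed on the core
    """
    if count > len(latency):
        raise ValueError("not enough cores")

    def score(core):
        # Assumed cost model: latency plus weighted load; ties break on core id.
        return (latency[core] + weight * utilization.get(core, 0), core)

    return sorted(latency, key=score)[:count]
```

For example, with cores 0 and 1 close to the GPU and core 0 already hosting 
two containers, asking for two cores picks core 1 first and then core 0.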




[jira] [Created] (MESOS-6920) Validate the UUID in Master::statusUpdate.

2017-01-13 Thread James Peach (JIRA)

James Peach created MESOS-6920:
--

 Summary: Validate the UUID in Master::statusUpdate.
 Key: MESOS-6920
 URL: https://issues.apache.org/jira/browse/MESOS-6920
 Project: Mesos
  Issue Type: Bug
Reporter: James Peach


Validate the UUID in Master::statusUpdate() to avoid the possibility of 
triggering a CHECK when logging the {{StatusUpdate}}.
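
The general idea is to reject a malformed UUID at the boundary so that later 
code never has to assert on it. The following is a minimal Python analogy, 
not the Mesos patch; {{parse_status_update_uuid}} is a hypothetical helper 
and the only library call used is Python's standard {{uuid.UUID(bytes=...)}}.

```python
import uuid
from typing import Optional


def parse_status_update_uuid(raw: bytes) -> Optional[uuid.UUID]:
    """Return a UUID, or None if the bytes are malformed (hypothetical helper)."""
    try:
        # uuid.UUID raises ValueError unless exactly 16 bytes are given.
        return uuid.UUID(bytes=raw)
    except (ValueError, TypeError):
        return None
```

A caller can then drop or reject the update when the result is None instead 
of crashing while formatting it for the log.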




[jira] [Assigned] (MESOS-6920) Validate the UUID in Master::statusUpdate.

2017-01-13 Thread James Peach (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-6920:
--

Assignee: James Peach

> Validate the UUID in Master::statusUpdate.
> --
>
> Key: MESOS-6920
> URL: https://issues.apache.org/jira/browse/MESOS-6920
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Assignee: James Peach
>
> Validate the UUID in Master::statusUpdate() to avoid the possibility of 
> triggering a CHECK when logging the {{StatusUpdate}}.




[jira] [Updated] (MESOS-6010) Docker registry puller shows decode error "No response decoded".

2017-01-13 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6010:
--
Target Version/s: 1.1.1, 1.2.0  (was: 1.1.1, 1.2.0, 1.0.3)

> Docker registry puller shows decode error "No response decoded".
> 
>
> Key: MESOS-6010
> URL: https://issues.apache.org/jira/browse/MESOS-6010
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Sunzhe
>Assignee: Jan Schlicht
>Priority: Critical
>  Labels: Docker, mesos-containerizer
>
> The {{mesos-agent}} flags:
> {code}
>  GLOG_v=1 ./bin/mesos-agent.sh \
>   --master=zk://${MESOS_MASTER_IP}:2181/mesos  \
>   --ip=10.100.3.3  \
>   --work_dir=${MESOS_WORK_DIR} \
>   --isolation=cgroups/devices,gpu/nvidia,disk/du,docker/runtime,filesystem/linux \
>   --enforce_container_disk_quota \
>   --containerizers=mesos \
>   --image_providers=docker \
>   --executor_environment_variables="{}"
> {code}
> And the {{mesos-execute}} flags:
> {code}
>  ./src/mesos-execute \
>--master=${MESOS_MASTER_IP}:5050 \
>--name=${INSTANCE_NAME} \
>--docker_image=${DOCKER_IMAGE} \
>--framework_capabilities=GPU_RESOURCES \
>--shell=false
> {code}
> But when running {{./src/mesos-execute}}, errors like the following appear:
> {code}
> I0809 16:11:46.207875 25583 scheduler.cpp:172] Version: 1.0.0
> I0809 16:11:46.212442 25582 scheduler.cpp:461] New master detected at 
> master@10.103.0.125:5050
> Subscribed with ID '168ab900-ee7e-4829-a59a-d16de956637e-0009'
> Submitted task 'test' to agent '168ab900-ee7e-4829-a59a-d16de956637e-S1'
> Received status update TASK_FAILED for task 'test'
>   message: 'Failed to launch container: Failed to decode HTTP responses: No 
> response decoded
> HTTP/1.1 200 Connection established
> HTTP/1.1 401 Unauthorized
> Content-Type: application/json; charset=utf-8
> Docker-Distribution-Api-Version: registry/2.0
> Www-Authenticate: Bearer 
> realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/redis:pull;
> Date: Tue, 09 Aug 2016 08:10:32 GMT
> Content-Length: 145
> Strict-Transport-Security: max-age=31536000
> {"errors":[{"code":"UNAUTHORIZED","message":"authentication 
> required","detail":[{"Type":"repository","Name":"library/redis","Action":"pull"}]}]}
> ; Container destroyed while provisioning images'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LAUNCH_FAILED
> {code}




[jira] [Updated] (MESOS-6010) Docker registry puller shows decode error "No response decoded".

2017-01-13 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-6010:
--
Target Version/s: 1.1.1, 1.2.0, 1.0.3  (was: 1.2.0)

> Docker registry puller shows decode error "No response decoded".
> 
>
> Key: MESOS-6010
> URL: https://issues.apache.org/jira/browse/MESOS-6010
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Sunzhe
>Assignee: Jan Schlicht
>Priority: Critical
>  Labels: Docker, mesos-containerizer
>
> The {{mesos-agent}} flags:
> {code}
>  GLOG_v=1 ./bin/mesos-agent.sh \
>   --master=zk://${MESOS_MASTER_IP}:2181/mesos  \
>   --ip=10.100.3.3  \
>   --work_dir=${MESOS_WORK_DIR} \
>   --isolation=cgroups/devices,gpu/nvidia,disk/du,docker/runtime,filesystem/linux \
>   --enforce_container_disk_quota \
>   --containerizers=mesos \
>   --image_providers=docker \
>   --executor_environment_variables="{}"
> {code}
> And the {{mesos-execute}} flags:
> {code}
>  ./src/mesos-execute \
>--master=${MESOS_MASTER_IP}:5050 \
>--name=${INSTANCE_NAME} \
>--docker_image=${DOCKER_IMAGE} \
>--framework_capabilities=GPU_RESOURCES \
>--shell=false
> {code}
> But when running {{./src/mesos-execute}}, errors like the following appear:
> {code}
> I0809 16:11:46.207875 25583 scheduler.cpp:172] Version: 1.0.0
> I0809 16:11:46.212442 25582 scheduler.cpp:461] New master detected at 
> master@10.103.0.125:5050
> Subscribed with ID '168ab900-ee7e-4829-a59a-d16de956637e-0009'
> Submitted task 'test' to agent '168ab900-ee7e-4829-a59a-d16de956637e-S1'
> Received status update TASK_FAILED for task 'test'
>   message: 'Failed to launch container: Failed to decode HTTP responses: No 
> response decoded
> HTTP/1.1 200 Connection established
> HTTP/1.1 401 Unauthorized
> Content-Type: application/json; charset=utf-8
> Docker-Distribution-Api-Version: registry/2.0
> Www-Authenticate: Bearer 
> realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/redis:pull;
> Date: Tue, 09 Aug 2016 08:10:32 GMT
> Content-Length: 145
> Strict-Transport-Security: max-age=31536000
> {"errors":[{"code":"UNAUTHORIZED","message":"authentication 
> required","detail":[{"Type":"repository","Name":"library/redis","Action":"pull"}]}]}
> ; Container destroyed while provisioning images'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LAUNCH_FAILED
> {code}




[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2017-01-13 Thread Kevin Klues (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821948#comment-15821948
 ] 

Kevin Klues commented on MESOS-5342:


Is anyone currently shepherding this? [~jieyu] [~kaysoky]
Has anyone reviewed Chris's design doc?

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>  Labels: cgroups, cpu, cpu-usage, gpu, isolation, isolator, 
> mentor, perfomance
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (e.g., SLURM) provide 
> both cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor cores; this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.
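
For readers unfamiliar with binding, the following is a minimal Linux-only sketch of pinning the calling thread to one core with `sched_setaffinity`. It is an illustration under assumptions, not the design proposed in the ticket: the helper names (`firstAllowedCpu`, `pinToCpu`, `isPinnedTo`) are made up for this example, and the ticket does not say which kernel interface the isolator would use.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (glibc).

// Returns the first CPU the calling thread is currently allowed to run on,
// or -1 if the affinity mask cannot be read.
int firstAllowedCpu()
{
  cpu_set_t allowed;
  CPU_ZERO(&allowed);
  if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
    return -1;
  }

  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
    if (CPU_ISSET(cpu, &allowed)) {
      return cpu;
    }
  }

  return -1;
}

// Restricts the calling thread to `cpu`. Returns true on success.
bool pinToCpu(int cpu)
{
  if (cpu < 0 || cpu >= CPU_SETSIZE) {
    return false;
  }

  cpu_set_t mask;
  CPU_ZERO(&mask);
  CPU_SET(cpu, &mask);
  return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}

// Returns true if the calling thread's affinity mask is exactly {cpu}.
bool isPinnedTo(int cpu)
{
  cpu_set_t mask;
  CPU_ZERO(&mask);
  if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
    return false;
  }

  return CPU_COUNT(&mask) == 1 && CPU_ISSET(cpu, &mask);
}
```

Calling `pinToCpu(firstAllowedCpu())` binds the current thread to a single core; choosing *which* core (by topology, utilization, or GPU proximity) is the part the ticket sets out to solve.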




[jira] [Assigned] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD

2017-01-13 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-6919:


Assignee: Greg Mann

> Libprocess reinit code leaks SSL server socket FD
> -
>
> Key: MESOS-6919
> URL: https://issues.apache.org/jira/browse/MESOS-6919
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: libprocess, ssl
>
> After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was 
> discovered that tests which use {{process::reinitialize}} to switch between 
> SSL and non-SSL modes will leak the file descriptor associated with the 
> server socket {{\_\_s\_\_}}. This can be reproduced by running the following 
> trivial test in repetition:
> {code}
> diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp
> index 1ff423f..d5fd575 100644
> --- a/src/tests/scheduler_tests.cpp
> +++ b/src/tests/scheduler_tests.cpp
> @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P(
>  #endif // USE_SSL_SOCKET
> +TEST_P(SchedulerSSLTest, LeakTest)
> +{
> +  ::sleep(1);
> +}
> +
> +
>  // Tests that a scheduler can subscribe, run a task, and then tear itself down.
>  TEST_P(SchedulerSSLTest, RunTaskAndTeardown)
>  {
> {code}
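
One way to observe a leak like this is to count the open descriptors before and after the suspect step. The sketch below does that on Linux by listing `/proc/self/fd`. The names `countOpenFds` and `leakyReinit` are hypothetical stand-ins for this example; `leakyReinit` only imitates a step that forgets to close a descriptor and is not the libprocess code.

```cpp
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

// Counts the file descriptors currently open in this process by listing
// /proc/self/fd (Linux-specific). Returns -1 if the directory cannot be read.
int countOpenFds()
{
  DIR* dir = opendir("/proc/self/fd");
  if (dir == nullptr) {
    return -1;
  }

  int count = 0;
  while (dirent* entry = readdir(dir)) {
    // Skip the "." and ".." entries.
    if (entry->d_name[0] != '.') {
      ++count;
    }
  }

  closedir(dir);

  // The directory stream itself held one descriptor while we were counting.
  return count - 1;
}

// Hypothetical stand-in for a reinitialization step that opens a descriptor
// and never closes it. The caller receives the descriptor only so that this
// example can clean up after itself.
int leakyReinit()
{
  return open("/dev/null", O_RDONLY);
}
```

If `countOpenFds()` grows by one on every call to the step under test, a descriptor is being leaked.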




[jira] [Updated] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD

2017-01-13 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6919:
-
Description: 
After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was 
discovered that tests which use {{process::reinitialize}} to switch between SSL 
and non-SSL modes will leak the file descriptor associated with the server 
socket {{\_\_s\_\_}}. This can be reproduced by running the following trivial 
test in repetition:
{code}
diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp
index 1ff423f..d5fd575 100644
--- a/src/tests/scheduler_tests.cpp
+++ b/src/tests/scheduler_tests.cpp
@@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P(
 #endif // USE_SSL_SOCKET


+TEST_P(SchedulerSSLTest, LeakTest)
+{
+  ::sleep(1);
+}
+
+
 // Tests that a scheduler can subscribe, run a task, and then tear itself down.
 TEST_P(SchedulerSSLTest, RunTaskAndTeardown)
 {
{code}

  was:
After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was 
discovered that tests which use {{process::reinitialize}} to switch between SSL 
and non-SSL modes will leak the file descriptor associated with the server 
socket {{__s__}}. This can be reproduced by running the following trivial test 
in repetition:
{code}
diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp
index 1ff423f..d5fd575 100644
--- a/src/tests/scheduler_tests.cpp
+++ b/src/tests/scheduler_tests.cpp
@@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P(
 #endif // USE_SSL_SOCKET


+TEST_P(SchedulerSSLTest, LeakTest)
+{
+  ::sleep(1);
+}
+
+
 // Tests that a scheduler can subscribe, run a task, and then tear itself down.
 TEST_P(SchedulerSSLTest, RunTaskAndTeardown)
 {
{code}


> Libprocess reinit code leaks SSL server socket FD
> -
>
> Key: MESOS-6919
> URL: https://issues.apache.org/jira/browse/MESOS-6919
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Greg Mann
>  Labels: libprocess, ssl
>
> After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was 
> discovered that tests which use {{process::reinitialize}} to switch between 
> SSL and non-SSL modes will leak the file descriptor associated with the 
> server socket {{\_\_s\_\_}}. This can be reproduced by running the following 
> trivial test in repetition:
> {code}
> diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp
> index 1ff423f..d5fd575 100644
> --- a/src/tests/scheduler_tests.cpp
> +++ b/src/tests/scheduler_tests.cpp
> @@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P(
>  #endif // USE_SSL_SOCKET
> +TEST_P(SchedulerSSLTest, LeakTest)
> +{
> +  ::sleep(1);
> +}
> +
> +
>  // Tests that a scheduler can subscribe, run a task, and then tear itself down.
>  TEST_P(SchedulerSSLTest, RunTaskAndTeardown)
>  {
> {code}




[jira] [Created] (MESOS-6919) Libprocess reinit code leaks SSL server socket FD

2017-01-13 Thread Greg Mann (JIRA)

Greg Mann created MESOS-6919:


 Summary: Libprocess reinit code leaks SSL server socket FD
 Key: MESOS-6919
 URL: https://issues.apache.org/jira/browse/MESOS-6919
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Greg Mann


After [this commit|https://github.com/apache/mesos/commit/789e9f7], it was 
discovered that tests which use {{process::reinitialize}} to switch between SSL 
and non-SSL modes will leak the file descriptor associated with the server 
socket {{__s__}}. This can be reproduced by running the following trivial test 
in repetition:
{code}
diff --git a/src/tests/scheduler_tests.cpp b/src/tests/scheduler_tests.cpp
index 1ff423f..d5fd575 100644
--- a/src/tests/scheduler_tests.cpp
+++ b/src/tests/scheduler_tests.cpp
@@ -1821,6 +1821,12 @@ INSTANTIATE_TEST_CASE_P(
 #endif // USE_SSL_SOCKET


+TEST_P(SchedulerSSLTest, LeakTest)
+{
+  ::sleep(1);
+}
+
+
 // Tests that a scheduler can subscribe, run a task, and then tear itself down.
 TEST_P(SchedulerSSLTest, RunTaskAndTeardown)
 {
{code}




[jira] [Created] (MESOS-6918) Prometheus exporter endpoints for metrics

2017-01-13 Thread James Peach (JIRA)

James Peach created MESOS-6918:
--

 Summary: Prometheus exporter endpoints for metrics
 Key: MESOS-6918
 URL: https://issues.apache.org/jira/browse/MESOS-6918
 Project: Mesos
  Issue Type: Bug
  Components: statistics
Reporter: James Peach


There are a couple of [Prometheus|https://prometheus.io] metrics exporters for 
Mesos, of varying quality. Since the Mesos stats system actually knows about 
statistics data types and semantics, and Mesos has reasonable HTTP support, we 
could add Prometheus metrics endpoints that directly expose statistics in the 
[Prometheus wire 
format|https://prometheus.io/docs/instrumenting/exposition_formats/], removing 
the need for operators to run separate exporter processes.
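
As a rough illustration of what such an endpoint would emit, the sketch below renders a single metric in the Prometheus text exposition format. The function names (`sanitizeName`, `renderMetric`) and the sample metric are assumptions made for this example and are not part of Mesos; labels and escaping of the HELP text are omitted.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Maps an arbitrary metric key to a valid Prometheus metric name, i.e. one
// matching [a-zA-Z_:][a-zA-Z0-9_:]*. Every other character becomes '_'.
std::string sanitizeName(const std::string& name)
{
  std::string result;
  for (char c : name) {
    const bool valid =
      std::isalnum(static_cast<unsigned char>(c)) || c == '_' || c == ':';
    result += valid ? c : '_';
  }

  // A name must not be empty or start with a digit.
  if (result.empty() || std::isdigit(static_cast<unsigned char>(result[0]))) {
    result.insert(result.begin(), '_');
  }

  return result;
}

// Renders one sample with its HELP and TYPE comment lines.
// `type` is expected to be "counter" or "gauge" in this sketch.
std::string renderMetric(
    const std::string& name,
    const std::string& type,
    const std::string& help,
    double value)
{
  const std::string metric = sanitizeName(name);

  std::ostringstream out;
  out << "# HELP " << metric << " " << help << "\n"
      << "# TYPE " << metric << " " << type << "\n"
      << metric << " " << value << "\n";
  return out.str();
}
```

The sanitizing step matters because slash-separated keys such as `master/tasks_running` are not valid Prometheus metric names.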




[jira] [Created] (MESOS-6917) Segfault when the executor sets a UUID that is not a valid v4 UUID

2017-01-13 Thread Aaron Wood (JIRA)

Aaron Wood created MESOS-6917:
-

 Summary: Segfault when the executor sets a UUID that is not a 
valid v4 UUID
 Key: MESOS-6917
 URL: https://issues.apache.org/jira/browse/MESOS-6917
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.1.0, 1.0.2, 1.0.1, 1.0.0
Reporter: Aaron Wood
Assignee: Aaron Wood
Priority: Blocker


A segfault occurs when an executor sets a UUID that's not a valid v4 UUID and 
sends it off to the agent:

{code}
ABORT: (../../3rdparty/stout/include/stout/try.hpp:77): Try::get() but state == 
ERROR: Not a valid UUID
*** Aborted at 1484262968 (unix time) try "date -d @1484262968" if you are 
using GNU date ***
PC: @ 0x7efeb6101428 (unknown)
*** SIGABRT (@0x36b7) received by PID 14007 (TID 0x7efeabd29700) from PID 
14007; stack trace: ***
@ 0x7efeb64a6390 (unknown)
@ 0x7efeb6101428 (unknown)
@ 0x7efeb610302a (unknown)
@ 0x560df739fa6e _Abort()
@ 0x560df739fa9c _Abort()
@ 0x7efebb53a5ad Try<>::get()
@ 0x7efebb5363d6 Try<>::get()
@ 0x7efebbd84809 
mesos::internal::slave::validation::executor::call::validate()
@ 0x7efebbb59b36 mesos::internal::slave::Slave::Http::executor()
@ 0x7efebbc773b8 
_ZZN5mesos8internal5slave5Slave10initializeEvENKUlRKN7process4http7RequestEE1_clES7_
@ 0x7efebbcb5808 
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_E1_E9_M_invokeERKSt9_Any_dataS7_
@ 0x7efebbfb2aea std::function<>::operator()()
@ 0x7efebcb158b8 
_ZZZN7process11ProcessBase6_visitERKNS0_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSD_14authentication20AuthenticationResultEEE0_clESN_ENKUlbE1_clEb
@ 0x7efebcb1a10a 
_ZZZNK7process9_DeferredIZZNS_11ProcessBase6_visitERKNS1_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_5OwnedINS_4http7RequestNKUlRK6OptionINSE_14authentication20AuthenticationResultEEE0_clESO_EUlbE1_EcvSt8functionIFT_T0_EEINS_6FutureINSE_8ResponseEEERKbEEvENKUlS12_E_clES12_ENKUlvE_clEv
@ 0x7efebcb1c5f8 
_ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEEvEZZNKS0_9_DeferredIZZNS0_11ProcessBase6_visitERKNS7_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_5OwnedINS2_7RequestNKUlRK6OptionINS2_14authentication20AuthenticationResultEEE0_clEST_EUlbE1_EcvSt8functionIFT_T0_EEIS4_RKbEEvENKUlS14_E_clES14_EUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7efebb5ce8ca std::function<>::operator()()
@ 0x7efebb5c4b27 
_ZZN7process8internal8DispatchINS_6FutureINS_4http8ResponseclIRSt8functionIFS5_vS5_RKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clESI_
@ 0x7efebb5d4e1e 
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchINS0_6FutureINS0_4http8ResponseclIRSt8functionIFS9_vS9_RKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
@ 0x7efebcb30baf std::function<>::operator()()
@ 0x7efebcb13fd6 process::ProcessBase::visit()
@ 0x7efebcb1f3c8 process::DispatchEvent::visit()
@ 0x7efebb3ab2ea process::ProcessBase::serve()
@ 0x7efebcb0fe8a process::ProcessManager::resume()
@ 0x7efebcb0c5a3 _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
@ 0x7efebcb1ea34 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7efebcb1e98a 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
@ 0x7efebcb1e91a 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7efeb6980c80 (unknown)
@ 0x7efeb649c6ba start_thread
@ 0x7efeb61d282d (unknown)
Aborted (core dumped)
{code}
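
The trace shows the process aborting in {{Try<>::get()}} because the result was in the error state. A common way to avoid this kind of abort is to validate the bytes and return an error to the caller instead of dereferencing the result. The sketch below is only an illustration of that idea: `validateUuidV4` is a hypothetical helper, not the stout or Mesos API, and it checks the binary layout described in RFC 4122.

```cpp
#include <string>

// Hypothetical helper: returns an empty string if `bytes` looks like a
// version 4 (random) UUID in its 16-byte binary form, and a description of
// the problem otherwise.
std::string validateUuidV4(const std::string& bytes)
{
  if (bytes.size() != 16) {
    return "Not a valid UUID: expected 16 bytes, got " +
           std::to_string(bytes.size());
  }

  // The version is stored in the four most significant bits of octet 6.
  const unsigned version = static_cast<unsigned char>(bytes[6]) >> 4;
  if (version != 4) {
    return "Not a version 4 UUID";
  }

  // The RFC 4122 variant is marked by the bit pattern 10 at the top of
  // octet 8.
  const unsigned variant = static_cast<unsigned char>(bytes[8]) >> 6;
  if (variant != 2) {
    return "Not an RFC 4122 variant UUID";
  }

  return "";
}
```

A caller would reject the executor call (for example with an error response) whenever the returned string is non-empty, rather than continuing with an unchecked value.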




[jira] [Updated] (MESOS-6907) FutureTest.After3 is flaky

2017-01-13 Thread Alexander Rojas (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas updated MESOS-6907:
---
Description: 
There is apparently a race condition between the time an instance of 
{{Future<T>}} goes out of scope and when the enclosing data is actually 
deleted, if {{Future<T>::after(Duration, lambda::function<Future<T>(const 
Future<T>&)>)}} is called.

The issue is more likely to occur if the machine is under load or if it is not 
a very powerful one. The easiest way to reproduce it is to run:

{code}
$ stress -c 4 -t 2600 -d 2 -i 2 &
$ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 \
    --gtest_break_on_failure
{code}

An exploratory fix for the issue is to change the test to:

{code}
TEST(FutureTest, After3)
{
  Future<Nothing> future;
  process::WeakFuture<Nothing> weak_future(future);

  EXPECT_SOME(weak_future.get());

  {
    Clock::pause();
    // The original future disappears here. After this call the
    // original future goes out of scope and should not be reachable
    // anymore.
    future = future
      .after(Milliseconds(1), [](Future<Nothing> f) {
        f.discard();
        return Nothing();
      });

    Clock::advance(Seconds(2));
    Clock::settle();

    AWAIT_READY(future);
  }

  if (weak_future.get().isSome()) {
    os::sleep(Seconds(1));
  }

  EXPECT_NONE(weak_future.get());
  EXPECT_FALSE(future.hasDiscard());
}
{code}

The interesting thing about the fix is that both extra snippets are needed 
(neither one alone is enough) to prevent the issue from happening.
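
To make the symptom concrete: a weak reference stays valid for as long as anything else still holds a strong reference to the shared state, for example a callback that has been queued but not yet run and destroyed. The sketch below shows that general effect with `std::shared_ptr` and `std::weak_ptr`. It is an analogy under assumptions and not a diagnosis: `Lifetimes`, `observeLifetime` and the `pending` queue are names invented for this example, and nothing here claims that this is what libprocess does internally.

```cpp
#include <functional>
#include <memory>
#include <vector>

struct Lifetimes
{
  bool aliveWhilePending;  // State observed while a callback still holds it.
  bool aliveAfterDrain;    // State observed after the callback is destroyed.
};

Lifetimes observeLifetime()
{
  // Stand-in for a queue of timer callbacks that have not run yet.
  std::vector<std::function<void()>> pending;

  std::shared_ptr<int> data = std::make_shared<int>(42);
  std::weak_ptr<int> weak = data;

  // The queued callback captures its own strong reference to the state.
  pending.push_back([data]() { (void) *data; });

  // The owner drops its reference, as when the original future goes out
  // of scope.
  data.reset();

  Lifetimes result;
  result.aliveWhilePending = !weak.expired();

  // Run and destroy the queued callbacks, releasing the last reference.
  for (std::function<void()>& callback : pending) {
    callback();
  }
  pending.clear();

  result.aliveAfterDrain = !weak.expired();
  return result;
}
```

In this single-threaded sketch the order is fixed, so the state is alive while the callback is pending and gone afterwards. When the release happens on another thread, a check made immediately after the owner lets go can land on either side of it.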


  was:
After playing with the latest patch solving MESOS-6484 we found out that the 
modifications introduce flakiness in the test {{FutureTest.After3}}. Depending 
on the machine and its load, the failure occurs between once every 1 runs and 
once every 50 runs, most likely because of a race condition in the code.

To reproduce run:

{code}
${MESOS_BUILD_DIR}/3rdparty/libprocess/libprocess-tests \
    --gtest_filter="*.After3" --gtest_repeat=-1 --gtest_break_on_failure
{code}


> FutureTest.After3 is flaky
> --
>
> Key: MESOS-6907
> URL: https://issues.apache.org/jira/browse/MESOS-6907
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Alexander Rojas
>
> There is apparently a race condition between the time an instance of 
> {{Future<T>}} goes out of scope and when the enclosing data is actually 
> deleted, if {{Future<T>::after(Duration, lambda::function<Future<T>(const 
> Future<T>&)>)}} is called.
> The issue is more likely to occur if the machine is under load or if it is 
> not a very powerful one. The easiest way to reproduce it is to run:
> {code}
> $ stress -c 4 -t 2600 -d 2 -i 2 &
> $ ./libprocess-tests --gtest_filter="FutureTest.After3" --gtest_repeat=-1 \
>     --gtest_break_on_failure
> {code}
> An exploratory fix for the issue is to change the test to:
> {code}
> TEST(FutureTest, After3)
> {
>   Future<Nothing> future;
>   process::WeakFuture<Nothing> weak_future(future);
>   EXPECT_SOME(weak_future.get());
>   {
>     Clock::pause();
>     // The original future disappears here. After this call the
>     // original future goes out of scope and should not be reachable
>     // anymore.
>     future = future
>       .after(Milliseconds(1), [](Future<Nothing> f) {
>         f.discard();
>         return Nothing();
>       });
>     Clock::advance(Seconds(2));
>     Clock::settle();
>     AWAIT_READY(future);
>   }
>   if (weak_future.get().isSome()) {
>     os::sleep(Seconds(1));
>   }
>   EXPECT_NONE(weak_future.get());
>   EXPECT_FALSE(future.hasDiscard());
> }
> {code}
> The interesting thing about the fix is that both extra snippets are needed 
> (neither one alone is enough) to prevent the issue from happening.




[jira] [Commented] (MESOS-6864) Container Exec should be possible with tasks belonging to a task group

2017-01-13 Thread JIRA


[ 
https://issues.apache.org/jira/browse/MESOS-6864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821809#comment-15821809
 ] 

Gastón Kleiman commented on MESOS-6864:
---

https://reviews.apache.org/r/55464/

> Container Exec should be possible with tasks belonging to a task group
> --
>
> Key: MESOS-6864
> URL: https://issues.apache.org/jira/browse/MESOS-6864
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Blocker
>  Labels: debugging, mesosphere
>
> {{LaunchNestedContainerSession}} currently requires the parent container to 
> be an Executor 
> (https://github.com/apache/mesos/blob/f89f28724f5837ff414dc6cc84e1afb63f3306e5/src/slave/http.cpp#L2189-L2211).
> This works for command tasks, because the task container id is the same as 
> the executor container id.
> But it won't work for pod tasks whose container id is different from 
> executor’s container id.
> In order to resolve this ticket, we need to allow launching a child container 
> at an arbitrary level.




[jira] [Created] (MESOS-6916) Improve health checks validation

2017-01-13 Thread JIRA

Gastón Kleiman created MESOS-6916:
-

 Summary: Improve health checks validation
 Key: MESOS-6916
 URL: https://issues.apache.org/jira/browse/MESOS-6916
 Project: Mesos
  Issue Type: Bug
Reporter: Gastón Kleiman


The "general" fields (e.g., `timeout_seconds`) should also be validated, 
similar to what is done in https://reviews.apache.org/r/55458/




[jira] [Commented] (MESOS-6010) Docker registry puller shows decode error "No response decoded".

2017-01-13 Thread Jan Schlicht (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821580#comment-15821580
 ] 

Jan Schlicht commented on MESOS-6010:
-

The root cause isn't a bug in {{http-parser}} but an ambiguity in how to deal 
with HTTP responses. An HTTP response _should_ indicate the length of its body 
by setting the {{Content-Length}} header. When this header isn't set, it could 
mean different things: a) the response doesn't have a body, or b) we have to 
somehow figure out the length of the body.
b) is something we cannot do, because {{process::ResponseDecoder}} should 
support parsing multiple HTTP responses in a single string and we wouldn't be 
able to tell where a body ends and a new response starts.
Hence a), assuming that a response doesn't have a body when {{Content-Length}} 
isn't set, can resolve this problem.
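
Option a) can be sketched as follows: split a buffer holding several responses, use {{Content-Length}} to find where each body ends, and treat a missing header as a zero-length body. This is a simplified illustration under assumptions, not the {{process::ResponseDecoder}} implementation: `Response`, `contentLength` and `splitResponses` are hypothetical names, and chunked encoding and malformed input are ignored.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

struct Response
{
  std::string head;  // Status line and headers, without the blank line.
  std::string body;
};

// Returns the value of the Content-Length header in `head`, or 0 if the
// header is absent (option a: no header means no body).
std::size_t contentLength(const std::string& head)
{
  std::string lower = head;
  std::transform(
      lower.begin(), lower.end(), lower.begin(),
      [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

  const std::string key = "\r\ncontent-length:";
  const std::size_t pos = lower.find(key);
  if (pos == std::string::npos) {
    return 0;
  }

  // strtoul skips the optional whitespace after the colon.
  return std::strtoul(lower.c_str() + pos + key.size(), nullptr, 10);
}

// Splits `data` into consecutive responses. Stops at the first incomplete
// head; a body shorter than its declared length is truncated to what exists.
std::vector<Response> splitResponses(const std::string& data)
{
  std::vector<Response> responses;
  std::size_t offset = 0;

  while (offset < data.size()) {
    const std::size_t end = data.find("\r\n\r\n", offset);
    if (end == std::string::npos) {
      break;
    }

    Response response;
    response.head = data.substr(offset, end - offset);

    const std::size_t bodyStart = end + 4;
    const std::size_t length =
      std::min(contentLength(response.head), data.size() - bodyStart);

    response.body = data.substr(bodyStart, length);
    responses.push_back(response);
    offset = bodyStart + length;
  }

  return responses;
}
```

With the input from the report, a header-less `200 Connection established` response followed by a `401 Unauthorized` response that carries a {{Content-Length}}, this yields two responses, the first with an empty body.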

> Docker registry puller shows decode error "No response decoded".
> 
>
> Key: MESOS-6010
> URL: https://issues.apache.org/jira/browse/MESOS-6010
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Sunzhe
>Assignee: Jan Schlicht
>Priority: Critical
>  Labels: Docker, mesos-containerizer
>
> The {{mesos-agent}} flags:
> {code}
>  GLOG_v=1 ./bin/mesos-agent.sh \
>   --master=zk://${MESOS_MASTER_IP}:2181/mesos  \
>   --ip=10.100.3.3  \
>   --work_dir=${MESOS_WORK_DIR} \
>   --isolation=cgroups/devices,gpu/nvidia,disk/du,docker/runtime,filesystem/linux \
>   --enforce_container_disk_quota \
>   --containerizers=mesos \
>   --image_providers=docker \
>   --executor_environment_variables="{}"
> {code}
> And the {{mesos-execute}} flags:
> {code}
>  ./src/mesos-execute \
>--master=${MESOS_MASTER_IP}:5050 \
>--name=${INSTANCE_NAME} \
>--docker_image=${DOCKER_IMAGE} \
>--framework_capabilities=GPU_RESOURCES \
>--shell=false
> {code}
> But when running {{./src/mesos-execute}}, errors like the following appear:
> {code}
> I0809 16:11:46.207875 25583 scheduler.cpp:172] Version: 1.0.0
> I0809 16:11:46.212442 25582 scheduler.cpp:461] New master detected at 
> master@10.103.0.125:5050
> Subscribed with ID '168ab900-ee7e-4829-a59a-d16de956637e-0009'
> Submitted task 'test' to agent '168ab900-ee7e-4829-a59a-d16de956637e-S1'
> Received status update TASK_FAILED for task 'test'
>   message: 'Failed to launch container: Failed to decode HTTP responses: No 
> response decoded
> HTTP/1.1 200 Connection established
> HTTP/1.1 401 Unauthorized
> Content-Type: application/json; charset=utf-8
> Docker-Distribution-Api-Version: registry/2.0
> Www-Authenticate: Bearer 
> realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:library/redis:pull;
> Date: Tue, 09 Aug 2016 08:10:32 GMT
> Content-Length: 145
> Strict-Transport-Security: max-age=31536000
> {"errors":[{"code":"UNAUTHORIZED","message":"authentication 
> required","detail":[{"Type":"repository","Name":"library/redis","Action":"pull"}]}]}
> ; Container destroyed while provisioning images'
>   source: SOURCE_AGENT
>   reason: REASON_CONTAINER_LAUNCH_FAILED
> {code}



