[jira] [Updated] (MESOS-5340) SSL-downgrading support may prevent new connections

2016-05-07 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-5340:
--
Summary: SSL-downgrading support may prevent new connections  (was: 
SSL-downgrading support may hang libprocess)

> SSL-downgrading support may prevent new connections
> ---
>
> Key: MESOS-5340
> URL: https://issues.apache.org/jira/browse/MESOS-5340
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.29.0, 0.28.1
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: ssl
>
> When using an SSL-enabled build of Mesos in combination with SSL-downgrading 
> support, any connection that does not actually transmit data will hang the 
> running binary (e.g. the master).
> For reproducing the issue (on any platform)...
> Spin up a master with enabled SSL-downgrading:
> {noformat}
> $ export SSL_ENABLED=true
> $ export SSL_SUPPORT_DOWNGRADE=true
> $ export SSL_KEY_FILE=/path/to/your/foo.key
> $ export SSL_CERT_FILE=/path/to/your/foo.crt
> $ export SSL_CA_FILE=/path/to/your/ca.crt
> $ ./bin/mesos-master.sh --work_dir=/tmp/foo
> {noformat}
> Create some artificial HTTP request load for quickly spotting the problem in 
> both the master logs and the output of curl itself:
> {noformat}
> $ while true; do sleep 0.1; echo $( date +">%H:%M:%S.%3N"; curl -s -k -A "SSL Debug" http://localhost:5050/master/slaves; echo ;date +"<%H:%M:%S.%3N"; echo); done
> {noformat}
> Now create a connection to the master that does not transmit any data:
> {noformat}
> $ telnet localhost 5050
> {noformat}
> You should now see the curl requests hang; the master stops responding to 
> new connections. This persists until either some data is transmitted via the 
> above telnet connection or the connection is closed.
> This problem was initially observed when running Mesos on an AWS cluster 
> with internal ELB health-checks enabled for the master node. Those 
> health-checks use long-lived connections that do not transmit any data and 
> are closed after a configurable duration. In our test environment, this 
> duration was set to 60 seconds, so we saw our master become unresponsive for 
> 60 seconds at a time, get "unstuck" for a brief period, and then get stuck 
> again.





[jira] [Created] (MESOS-5340) SSL-downgrading support may hang libprocess

2016-05-07 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-5340:
-

 Summary: SSL-downgrading support may hang libprocess
 Key: MESOS-5340
 URL: https://issues.apache.org/jira/browse/MESOS-5340
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.28.1, 0.29.0
Reporter: Till Toenshoff
Priority: Blocker


When using an SSL-enabled build of Mesos in combination with SSL-downgrading 
support, any connection that does not actually transmit data will hang the 
running binary (e.g. the master).

For reproducing the issue (on any platform)...

Spin up a master with enabled SSL-downgrading:
{noformat}
$ export SSL_ENABLED=true
$ export SSL_SUPPORT_DOWNGRADE=true
$ export SSL_KEY_FILE=/path/to/your/foo.key
$ export SSL_CERT_FILE=/path/to/your/foo.crt
$ export SSL_CA_FILE=/path/to/your/ca.crt
$ ./bin/mesos-master.sh --work_dir=/tmp/foo
{noformat}

Create some artificial HTTP request load for quickly spotting the problem in 
both the master logs and the output of curl itself:
{noformat}
$ while true; do sleep 0.1; echo $( date +">%H:%M:%S.%3N"; curl -s -k -A "SSL Debug" http://localhost:5050/master/slaves; echo ;date +"<%H:%M:%S.%3N"; echo); done
{noformat}

Now create a connection to the master that does not transmit any data:
{noformat}
$ telnet localhost 5050
{noformat}

You should now see the curl requests hang; the master stops responding to new 
connections. This persists until either some data is transmitted via the above 
telnet connection or the connection is closed.
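
The following is a standalone sketch (not the libprocess implementation) of the
failure shape described above: if the SSL-vs-plaintext decision requires peeking
at the first bytes of each accepted socket, and that peek happens inline on the
accept path, then a single client that connects and sends nothing parks the
whole accept loop. Port 5050 and the 0x16 (TLS handshake) check are
illustrative only.
{code}
// Standalone sketch, NOT the libprocess implementation: protocol detection
// done inline on the accept path. The server peeks at the first byte of every
// accepted socket to decide "TLS or plaintext?" before it returns to accept().
// A client that connects and sends nothing blocks that peek indefinitely, so
// no further connections are accepted.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdio>

int main()
{
  int listener = socket(AF_INET, SOCK_STREAM, 0);

  sockaddr_in addr = {};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(5050);

  bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listener, 16);

  while (true) {
    int client = accept(listener, nullptr, nullptr);
    if (client < 0) {
      continue;
    }

    // Peek at the first byte to distinguish a TLS ClientHello (0x16) from
    // plaintext HTTP. This recv() blocks until the client sends *something*,
    // so a silent connection (the telnet session above, or an idle ELB
    // health-check) parks the entire accept loop right here.
    char byte;
    ssize_t peeked = recv(client, &byte, 1, MSG_PEEK);

    if (peeked == 1 && static_cast<unsigned char>(byte) == 0x16) {
      std::printf("detected TLS handshake\n");
    } else {
      std::printf("detected plaintext\n");
    }

    close(client);  // Sketch only: a real server would hand the fd off.
  }
}
{code}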

This problem was initially observed when running Mesos on an AWS cluster with 
internal ELB health-checks enabled for the master node. Those health-checks use 
long-lived connections that do not transmit any data and are closed after a 
configurable duration. In our test environment, this duration was set to 60 
seconds, so we saw our master become unresponsive for 60 seconds at a time, get 
"unstuck" for a brief period, and then get stuck again.






[jira] [Updated] (MESOS-4658) process::Connection can lead to process::wait deadlock

2016-05-07 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-4658:
---
   Assignee: Benjamin Mahler  (was: Anand Mazumdar)
Description: 
The {{Connection}} abstraction is prone to deadlocks arising from the last 
reference to a {{Connection}} getting destructed in the {{ConnectionProcess}} 
execution context, at which point the {{ConnectionProcess}} waits on itself 
(deadlock).

Consider this example:

{code}
Option<Connection> connection = process::http::connect(...).get();

// When the ConnectionProcess completes the Future, if 'connection'
// is the last copy of the Connection it will wait on itself!
connection->disconnected()
  .onAny(defer(self(), &Self::SomeFunc, connection.get()));

connection->disconnect();
connection = None();
{code}

In the above snippet, deadlock can occur as follows:

1. {{connection = None()}} executes; the last copy of the {{Connection}} 
remains within the disconnected Future.
2. {{ConnectionProcess::disconnect}} completes the disconnection Future and 
executes SomeFunc. The Future then clears its callbacks, which destructs the 
last copy of the {{Connection}}.
3. {{Connection::~Data}} waits on the {{ConnectionProcess}} from within the 
{{ConnectionProcess}} execution context. Deadlock (see the standalone sketch 
below).
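
For illustration, here is a standalone sketch (plain C++, not libprocess code)
of the "wait on yourself" shape from step 3: a thread is handed a reference to
itself and tries to join it. The C++ runtime happens to detect this particular
self-join and throws; libprocess has no such guard, so the equivalent
{{process::wait}} simply never returns.
{code}
#include <future>
#include <iostream>
#include <memory>
#include <system_error>
#include <thread>

int main()
{
  // Hand the thread a handle to itself once it has been constructed.
  std::promise<std::shared_ptr<std::thread>> handoff;
  std::shared_future<std::shared_ptr<std::thread>> self = handoff.get_future();

  auto thread = std::make_shared<std::thread>([self]() {
    try {
      // Analogous to Connection::~Data waiting on the ConnectionProcess
      // from within the ConnectionProcess execution context.
      self.get()->join();
    } catch (const std::system_error& e) {
      // The standard library detects the self-join and reports
      // 'resource_deadlock_would_occur'; a plain wait-on-self would
      // never return.
      std::cerr << "self-wait: " << e.what() << std::endl;
    }
  });

  handoff.set_value(thread);

  thread->join();
  return 0;
}
{code}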

We do have a snippet in our existing code that alludes to such occurrences 
happening: 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1325

{code}
  // This is a one time request which will close the connection when
  // the response is received. Since 'Connection' is reference-counted,
  // we must keep a copy around until the disconnection occurs. Note
  // that in order to avoid a deadlock (Connection destruction occurring
  // from the ConnectionProcess execution context), we use 'async'.
{code}
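
A minimal sketch of the shape that comment describes, using a hypothetical
helper name and assuming {{process::async}} accepts a callable plus its
arguments: keep a copy of the {{Connection}} until disconnection, then hand a
copy to {{async}} so that the last reference is dropped on an async execution
context rather than inside the {{ConnectionProcess}}.
{code}
#include <process/async.hpp>
#include <process/future.hpp>
#include <process/http.hpp>

#include <stout/nothing.hpp>

using process::Future;
using process::http::Connection;

// Hypothetical helper (not existing Mesos code): hold 'connection' until it
// disconnects, then release the last reference outside the ConnectionProcess
// execution context.
void holdUntilDisconnected(const Connection& connection)
{
  connection.disconnected()
    .onAny([connection](const Future<Nothing>&) {
      // 'async' takes its own copy of the Connection. That copy outlives the
      // lambda's copy (which the Future destroys in the ConnectionProcess
      // context) and is destroyed on an async worker, so Connection::~Data
      // never waits on the ConnectionProcess from within its own context.
      process::async([](Connection) {}, connection);
    });
}
{code}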

  was:
The {{Connection}} abstraction is prone to deadlocks arising from the object 
being destroyed inside the same execution context.

Consider this example:

{code}
Option<Connection> connection = process::http::connect(...).get();
connection->disconnected()
  .onAny(defer(self(), &Self::SomeFunc, connection.get()));

connection->disconnect();
connection = None();
{code}

In the above snippet, if {{connection = None()}} gets executed before the 
actual dispatch to {{ConnectionProcess}} happens, you might lose the only 
existing reference to the {{Connection}} object inside 
{{ConnectionProcess::disconnect}}. This would lead to the destruction of the 
{{Connection}} object in the {{ConnectionProcess}} execution context.

We do have a snippet in our existing code that alludes to such occurrences 
happening: 
https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/http.cpp#L1325

{code}
  // This is a one time request which will close the connection when
  // the response is received. Since 'Connection' is reference-counted,
  // we must keep a copy around until the disconnection occurs. Note
  // that in order to avoid a deadlock (Connection destruction occurring
  // from the ConnectionProcess execution context), we use 'async'.
{code}

AFAICT, for scenarios where we need to hold on to the {{Connection}} object for 
later, this approach does not suffice.


Summary: process::Connection can lead to process::wait deadlock  (was: 
process::Connection can lead to deadlock around execution in the same context.)

> process::Connection can lead to process::wait deadlock
> --
>
> Key: MESOS-4658
> URL: https://issues.apache.org/jira/browse/MESOS-4658
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, libprocess
>Reporter: Anand Mazumdar
>Assignee: Benjamin Mahler
>  Labels: mesosphere
>
> The {{Connection}} abstraction is prone to deadlocks arising from the last 
> reference to a {{Connection}} getting destructed in the {{ConnectionProcess}} 
> execution context, at which point the {{ConnectionProcess}} waits on itself 
> (deadlock).
> Consider this example:
> {code}
> Option<Connection> connection = process::http::connect(...).get();
> // When the ConnectionProcess completes the Future, if 'connection'
> // is the last copy of the Connection it will wait on itself!
> connection->disconnected()
>   .onAny(defer(self(), &Self::SomeFunc, connection.get()));
> connection->disconnect();
> connection = None();
> {code}
> In the above snippet, deadlock can occur as follows:
> 1. {{connection = None()}} executes; the last copy of the {{Connection}} 
> remains within the disconnected Future.
> 2. {{ConnectionProcess::disconnect}} completes the disconnection Future and 
> executes SomeFunc. The Future then clears its callbacks, which destructs the 
> last copy of the {{Connection}}.
> 3. {{Connection::~Data}} waits on the {{ConnectionProcess}} from within the 
> {{ConnectionProcess}} execution context. Deadlock.
> We do have a snippet in our existing code that alludes to such occurrences 

[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition

2016-05-07 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275412#comment-15275412
 ] 

Anand Mazumdar commented on MESOS-5332:
---

It doesn't. {{send}} just provides at-most-once semantics; messages can be 
dropped at any point in time.

The responsibility to retry is left to the business logic of the client/library 
invoking the {{send}} operation. We don't explicitly retry here in the 
{{ExecutorDriver}} implementation since the communication happens on the same 
host.
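
To make that division of responsibility concrete, here is a minimal,
self-contained sketch of the retry-until-acknowledged pattern a caller would
layer on top of an at-most-once {{send}}. This is not the libprocess API;
{{trySend}} and {{acked}} are hypothetical callbacks supplied by the caller.
{code}
#include <chrono>
#include <functional>
#include <thread>

// Sketch only: 'trySend' performs one at-most-once send attempt (which may be
// silently dropped); 'acked' reports whether the peer has acknowledged it.
// Retries with simple exponential backoff until acknowledged or the attempt
// budget is exhausted.
bool sendWithRetries(
    const std::function<void()>& trySend,
    const std::function<bool()>& acked,
    int attempts,
    std::chrono::milliseconds backoff)
{
  for (int i = 0; i < attempts; ++i) {
    trySend();
    std::this_thread::sleep_for(backoff);
    if (acked()) {
      return true;   // Peer confirmed receipt.
    }
    backoff *= 2;    // Back off before the next attempt.
  }
  return false;      // Caller decides how to surface the failure.
}
{code}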

> TASK_LOST on slave restart potentially due to executor race condition
> -
>
> Key: MESOS-5332
> URL: https://issues.apache.org/jira/browse/MESOS-5332
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, slave
>Affects Versions: 0.26.0
> Environment: Mesos 0.26
> Aurora 0.13
>Reporter: Stephan Erb
> Attachments: executor-logs.tar.gz, executor-stderr.log, 
> executor-stderrV2.log, mesos-slave.log
>
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose 
> from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
> executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-28854-0-6a88d62e-656
> 4-4e33-b0bb-1d8039d97afc' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
> 4-4cba-a9df-3dfc1552667f' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
> -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at 
> executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent 
> during startup. When stopping the agent using supervisorctl stop, the 
> executors are still running (verified via {{ps aux}}). They are only killed 
> once the agent tries to reregister.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for 
> the first restart.
> * It is much more likely to occur if the agent binary has been running for a 
> longer period of time (> 7 days)
> * It tends to be more likely if the host has many cores (30-40) and thus many 
> libprocess workers. 
> Mesos is correctly sticking to the 2 seconds wait time before killing 
> un-reregistered executors. The failed executors receive the reregistration 
> request, but it seems like they fail to send a reply.
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 
> 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with 
> fd 11: Transport endpoint is not connected
> I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> {code}
> All tasks ending up in LOST have output similar to the one posted above, 
> i.e. messages seem to be received in the wrong order.
> Does anyone have an idea what might be going on here? 





[jira] [Commented] (MESOS-3709) Modularize the containerizer interface.

2016-05-07 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275378#comment-15275378
 ] 

Till Toenshoff commented on MESOS-3709:
---

As discussed in private, I will get back to you after our sprint planning on 
Monday.

> Modularize the containerizer interface.
> ---
>
> Key: MESOS-3709
> URL: https://issues.apache.org/jira/browse/MESOS-3709
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization, modules
>Reporter: Jie Yu
>Assignee: haosdent
>  Labels: containerizer, modularization, module
>
> So that people can implement their own containerizer as a module. That's more 
> efficient than having an external containerizer and shelling out. The module 
> system also provides versioning support, which is definitely better than an 
> unversioned external containerizer.
> Design Doc: 
> https://docs.google.com/document/d/1fj3G2-YFprqauQUd7fbHsD03vGAGg_k_EtH-s6fRkDo/edit?usp=sharing





[jira] [Updated] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition

2016-05-07 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated MESOS-5332:
---
Attachment: executor-logs.tar.gz

All executor logs (surviving and failed ones)

> TASK_LOST on slave restart potentially due to executor race condition
> -
>
> Key: MESOS-5332
> URL: https://issues.apache.org/jira/browse/MESOS-5332
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, slave
>Affects Versions: 0.26.0
> Environment: Mesos 0.26
> Aurora 0.13
>Reporter: Stephan Erb
> Attachments: executor-logs.tar.gz, executor-stderr.log, 
> executor-stderrV2.log, mesos-slave.log
>
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose 
> from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
> executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-28854-0-6a88d62e-656
> 4-4e33-b0bb-1d8039d97afc' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
> 4-4cba-a9df-3dfc1552667f' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
> -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at 
> executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent 
> during startup. When stopping the agent using supervisorctl stop, the 
> executors are still running (verified via {{ps aux}}). They are only killed 
> once the agent tries to reregister.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for 
> the first restart.
> * It is much more likely to occur if the agent binary has been running for a 
> longer period of time (> 7 days)
> * It tends to be more likely if the host has many cores (30-40) and thus many 
> libprocess workers. 
> Mesos is correctly sticking to the 2 seconds wait time before killing 
> un-reregistered executors. The failed executors receive the reregistration 
> request, but it seems like they fail to send a reply.
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 
> 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with 
> fd 11: Transport endpoint is not connected
> I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> {code}
> All tasks ending up in LOST have output similar to the one posted above, 
> i.e. messages seem to be received in the wrong order.
> Does anyone have an idea what might be going on here? 





[jira] [Updated] (MESOS-5222) Create a benchmark for scale testing HTTP frameworks

2016-05-07 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5222:
--
Sprint:   (was: Mesosphere Sprint 34)

> Create a benchmark for scale testing HTTP frameworks
> 
>
> Key: MESOS-5222
> URL: https://issues.apache.org/jira/browse/MESOS-5222
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> It would be good to add a benchmark for scale testing the HTTP frameworks 
> relative to driver-based frameworks. The benchmark can be as simple as trying 
> to launch N tasks (parameterized) with the old/new API. We can then focus on 
> fixing the performance issues that we find as a result of this exercise.
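
For illustration, such a benchmark could follow the usual gtest
parameterized-test skeleton sketched below; the class name, task counts, and
the benchmark body are hypothetical rather than existing Mesos test code. The
parameter is the number of tasks to launch, so the same body can be timed once
against the scheduler-driver path and once against the HTTP (v1) API.
{code}
#include <cstddef>

#include <gtest/gtest.h>

// Hypothetical skeleton: the parameter is the number of tasks to launch.
class SchedulerScaleBenchmark
  : public ::testing::TestWithParam<std::size_t> {};

TEST_P(SchedulerScaleBenchmark, LaunchTasks)
{
  const std::size_t taskCount = GetParam();

  // Sketch only: start an in-process master/agent, launch 'taskCount' tasks
  // through the API under test, and measure the wall-clock time until all of
  // them reach TASK_RUNNING.
  (void) taskCount;
}

INSTANTIATE_TEST_CASE_P(
    TaskCounts,
    SchedulerScaleBenchmark,
    ::testing::Values(1000u, 10000u, 50000u));
{code}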





[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition

2016-05-07 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275314#comment-15275314
 ] 

Stephan Erb commented on MESOS-5332:


The observation that it takes 5 seconds for a faulty executor to learn that 
the agent is down holds for all failed executors:

{code}
$ ./analyse-executor-logs.sh
SURVIVING EXECUTOR: I0505 08:41:59.581074 21098 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.580765 10388 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.581061  8492 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.580983 17068 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.581048 27932 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.581203  3738 exec.cpp:456] Slave exited,
SURVIVING EXECUTOR: I0505 08:41:59.580838 32272 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:05.510376  3335 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:05.509397  2985 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:04.784864  8158 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:05.481689  2536 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:05.509380  2941 exec.cpp:456] Slave exited,
FAILED EXECUTOR: I0505 08:42:04.783973 17405 exec.cpp:456] Slave exited,
{code}

Question regarding Step 4: Shouldn't libprocess try to retry the send 
operation? 

> TASK_LOST on slave restart potentially due to executor race condition
> -
>
> Key: MESOS-5332
> URL: https://issues.apache.org/jira/browse/MESOS-5332
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, slave
>Affects Versions: 0.26.0
> Environment: Mesos 0.26
> Aurora 0.13
>Reporter: Stephan Erb
> Attachments: executor-stderr.log, executor-stderrV2.log, 
> mesos-slave.log
>
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose 
> from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
> executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-28854-0-6a88d62e-656
> 4-4e33-b0bb-1d8039d97afc' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
> 4-4cba-a9df-3dfc1552667f' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
> -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at 
> executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent 
> during startup. When stopping the agent using supervisorctl stop, the 
> executors are still running (verified via {{ps aux}}). They are only killed 
> once the agent tries to reregister.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for 
> the first restart.
> * It is much more likely to occur if the agent binary has been running for a 
> longer period of time (> 7 days)
> * It tends to be more likely if the host has many cores (30-40) and thus many 
> libprocess workers. 
> Mesos is correctly sticking to the 2 seconds wait time before killing 
> un-reregistered executors. The failed executors receive the reregistration 
> request, but it seems like they fail to send a reply.
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 
> 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with 
> fd 11: Transport endpoint is not connected
> I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework 

[jira] [Commented] (MESOS-5337) Add Master Flag to enable fine-grained filtering of HTTP endpoints.

2016-05-07 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275145#comment-15275145
 ] 

Joerg Schad commented on MESOS-5337:


Then we still incur the performance penalty. But if ACLs are permissive, it 
would mean that the user cannot see any frameworks/tasks...

I will add a warning in that case.
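
For reference, such an on/off switch would typically follow the usual stout
flags pattern, roughly as sketched below; {{filter_http_endpoints}} and its
help text are hypothetical, not an actual master flag.
{code}
#include <stout/flags.hpp>

// Hypothetical sketch: a boolean master flag (with a default) declared in the
// usual stout-flags style. The flag name and semantics are illustrative.
struct TestFlags : public flags::FlagsBase
{
  TestFlags()
  {
    add(&TestFlags::filter_http_endpoints,
        "filter_http_endpoints",
        "If 'true', apply fine-grained (per-framework/per-task) filtering\n"
        "to HTTP endpoint responses. This filtering can be expensive on\n"
        "large clusters, so operators may want to disable it.",
        true);
  }

  bool filter_http_endpoints;
};
{code}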

> Add Master Flag to enable fine-grained filtering of HTTP endpoints.
> ---
>
> Key: MESOS-5337
> URL: https://issues.apache.org/jira/browse/MESOS-5337
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>
> As the fine-grained filtering of endpoints can be rather expensive, we 
> should create a master flag to enable/disable this feature.


