[jira] [Assigned] (MESOS-5078) Document TaskStatus reasons

2017-08-08 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-5078:
--

Assignee: Benno Evers

> Document TaskStatus reasons
> ---
>
> Key: MESOS-5078
> URL: https://issues.apache.org/jira/browse/MESOS-5078
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Benno Evers
>  Labels: documentation, mesosphere, newbie++
>
> We should document the possible {{reason}} values that can be found in the 
> {{TaskStatus}} message.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-5078) Document TaskStatus reasons

2017-08-09 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-5078:
---
Sprint: Mesosphere Sprint 61

> Document TaskStatus reasons
> ---
>
> Key: MESOS-5078
> URL: https://issues.apache.org/jira/browse/MESOS-5078
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Benno Evers
>  Labels: documentation, mesosphere, newbie++
>
> We should document the possible {{reason}} values that can be found in the 
> {{TaskStatus}} message.





[jira] [Created] (MESOS-7876) Investigate jemalloc as a possible malloc for mesos

2017-08-10 Thread Benno Evers (JIRA)
Benno Evers created MESOS-7876:
--

 Summary: Investigate jemalloc as a possible malloc for mesos
 Key: MESOS-7876
 URL: https://issues.apache.org/jira/browse/MESOS-7876
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers
Assignee: Benno Evers


It is currently very hard to debug memory issues, in particular memory leaks, 
in mesos.

An alluring way to improve the situation would be to change the default malloc 
to jemalloc, which has built-in heap-tracking capabilities.

However, some care needs to be taken when considering changing such a 
fundamental part of mesos:

  * Would such a switch have any adverse impact on performance?
  * Is it available and will it compile on all our target platforms?
  * Is the jemalloc license compatible with bundling it as a third-party library?







[jira] [Commented] (MESOS-5078) Document TaskStatus reasons

2017-08-11 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123313#comment-16123313
 ] 

Benno Evers commented on MESOS-5078:


Review: https://reviews.apache.org/r/61495/

> Document TaskStatus reasons
> ---
>
> Key: MESOS-5078
> URL: https://issues.apache.org/jira/browse/MESOS-5078
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Benno Evers
>  Labels: documentation, mesosphere, newbie++
>
> We should document the possible {{reason}} values that can be found in the 
> {{TaskStatus}} message.





[jira] [Commented] (MESOS-7773) HTTP request validation stage is not explicit.

2017-08-14 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125532#comment-16125532
 ] 

Benno Evers commented on MESOS-7773:


While we're at it, we should also make sure that we always return BadRequest on 
malformed user input instead of `CHECK`-ing and aborting. Right now, there are 
some places where it looks like we're asserting certain properties of 
user-passed protobuf messages; for example, the local authorizer seems to 
`CHECK` that certain fields of the passed protobuf message were set 
(src/authorizer/local/authorizer.cpp:312).
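
To illustrate the pattern, here is a minimal sketch of validating first and answering with a 400 instead of asserting on user input. The `AuthorizationRequest` and `Response` types, their fields, and the `validate()`/`handle()` functions are hypothetical stand-ins and are not taken from the Mesos source.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a user-supplied protobuf message; the field
// names are illustrative only.
struct AuthorizationRequest
{
  bool hasSubject = false;
  std::string subject;
  bool hasAction = false;
  std::string action;
};

// Hypothetical HTTP-style result: 200 on success, 400 on malformed input.
struct Response
{
  int status;
  std::string body;
};

// Returns an error message if a required field is missing, and an empty
// string if the request is well-formed.
std::string validate(const AuthorizationRequest& request)
{
  if (!request.hasSubject) {
    return "Missing required field 'subject'";
  }
  if (!request.hasAction) {
    return "Missing required field 'action'";
  }
  return "";
}

// Validates first and rejects malformed input with a 400 response, so a
// bad request cannot abort the process.
Response handle(const AuthorizationRequest& request)
{
  const std::string error = validate(request);
  if (!error.empty()) {
    return Response{400, error};
  }

  // Past this point the fields are known to be set, so an assertion here
  // documents an internal invariant rather than checking user input.
  assert(request.hasSubject && request.hasAction);
  return Response{200, "OK"};
}
```

With this split, a default-constructed request yields status 400, and a request with both fields set yields status 200.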

> HTTP request validation stage is not explicit.
> --
>
> Key: MESOS-7773
> URL: https://issues.apache.org/jira/browse/MESOS-7773
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Alexander Rukletsov
>  Labels: mesosphere, reliability
>
> Currently we validate HTTP requests in multiple places in libprocess, for 
> instance {{ProcessManager::handle()}}, {{StreamingRequestDecoder::decode()}}, 
> {{process::parse()}}. To improve error handling when dealing with malformed 
> HTTP requests (including libprocess messages), consider introducing a 
> validation stage and / or make sure {{Request}} and all its components are in 
> valid state before we start using it.





[jira] [Updated] (MESOS-7876) Investigate alternative malloc's for mesos

2017-08-15 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-7876:
---
Summary: Investigate alternative malloc's for mesos  (was: Investigate 
jemalloc as a possible malloc for mesos)

> Investigate alternative malloc's for mesos
> --
>
> Key: MESOS-7876
> URL: https://issues.apache.org/jira/browse/MESOS-7876
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
>
> It is currently very hard to debug memory issues, in particular memory leaks, 
> in mesos.
> An alluring way to improve the situation would be to change the default 
> malloc to jemalloc, which has built-in heap-tracking capabilities.
> However, some care needs to be taken when considering changing such a 
> fundamental part of mesos:
>   * Would such a switch have any adverse impact on performance?
>   * Is it available and will it compile on all our target platforms?
>   * Is the jemalloc license compatible with bundling it as a third-party library?





[jira] [Commented] (MESOS-7876) Investigate alternative malloc's for mesos

2017-08-15 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127194#comment-16127194
 ] 

Benno Evers commented on MESOS-7876:


Licensing: 2-clause BSD, there should be no problem.

Availability: jemalloc uses a standard autotools-based build, so adding it to 
our build should be no problem. As far as I know, mesos allocates all memory 
using operator new which is a standard interface, so there should be no 
platform-specific problems.

Performance: To test malloc performance, I compiled two versions of jemalloc 
4.5.0 with the default configuration options used in FreeBSD ( 
[https://www.freebsd.org/cgi/man.cgi?jemalloc(3)] ), i.e. `--enable-fill 
--enable-lazy-lock --enable-munmap --enable-tcache --enable-tls --enable-utrace 
--enable-xmalloc`. For one of them, I additionally specified the flags 
`--enable-stats --enable-prof` to enable heap statistics gathering and 
profiling options; for the other I specified `--disable-stats --disable-prof`.

Next, I spawned n threads per allocator (i.e. 3*n threads in total) and had 
each thread do 125,000 allocation and deallocation operations with memory 
regions uniformly distributed between 1 byte and 64 MiB. All three allocators 
were running at the same time to ensure the system base load was the same for 
all of them.
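
For readers without access to the attached malloc.cpp, the following is a rough sketch of what such a benchmark loop could look like. It is not the attached program: the function names `allocationLoop()` and `benchmark()` are made up for illustration, and the sketch simply uses whichever malloc the binary is linked against rather than running several allocators side by side.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <random>
#include <thread>
#include <vector>

// Runs `operations` malloc/free pairs with sizes drawn uniformly from
// [1, maxSize] bytes and returns how many allocations succeeded.
std::size_t allocationLoop(
    std::size_t operations, std::size_t maxSize, unsigned seed)
{
  assert(maxSize >= 1);

  std::mt19937_64 generator(seed);
  std::uniform_int_distribution<std::size_t> size(1, maxSize);

  std::size_t succeeded = 0;
  for (std::size_t i = 0; i < operations; ++i) {
    void* region = std::malloc(size(generator));
    if (region != nullptr) {
      // Write one byte through a volatile pointer so the compiler cannot
      // elide the allocation.
      static_cast<volatile char*>(region)[0] = 1;
      ++succeeded;
    }
    std::free(region);
  }
  return succeeded;
}

// Spawns `threads` workers that each run allocationLoop() and returns the
// elapsed wall-clock time in milliseconds.
double benchmark(unsigned threads, std::size_t operations, std::size_t maxSize)
{
  const auto start = std::chrono::steady_clock::now();

  std::vector<std::thread> workers;
  for (unsigned t = 0; t < threads; ++t) {
    // Each worker gets its own seed so the size sequences differ.
    workers.emplace_back([=]() { allocationLoop(operations, maxSize, t); });
  }
  for (std::thread& worker : workers) {
    worker.join();
  }

  const std::chrono::duration<double, std::milli> elapsed =
    std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

A call such as `benchmark(n, 125000, 64 * 1024 * 1024)` would correspond to the parameters described above; comparing allocators would then mean running the same binary against each of them.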

!noprof.png|Results run 1!

!prof.png|Results run 2!

More or less as predicted by other people's experience, these results show that 
the heap tracking functionality has almost no runtime impact when enabled but 
not actively used, and as a bonus jemalloc actually seems to have a substantial 
speedup for multi-threaded allocations, although it's debatable if this will be 
noticeable during normal operation. I didn't manage to get a clean measurement 
from mesos' own benchmark tests yet.

This post by Facebook describes some implementation details of jemalloc, along 
with a very extensive comparison of several malloc implementations, although it 
seems the actual results are missing from the page:   
https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/

> Investigate alternative malloc's for mesos
> --
>
> Key: MESOS-7876
> URL: https://issues.apache.org/jira/browse/MESOS-7876
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
> Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, 
> prof.png
>
>
> It is currently very hard to debug memory issues, in particular memory leaks, 
> in mesos.
> An alluring way to improve the situation would be to change the default 
> malloc to jemalloc, which has built-in heap-tracking capabilities.
> However, some care needs to be taken when considering changing such a 
> fundamental part of mesos:
>   * Would such a switch have any adverse impact on performance?
>   * Is it available and will it compile on all our target platforms?
>   * Is the jemalloc license compatible with bundling it as a third-party library?





[jira] [Updated] (MESOS-7876) Investigate alternative malloc's for mesos

2017-08-15 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-7876:
---
Attachment: noprof.png
prof.png
malloc.cpp
jemalloc_benchmark_raw.txt

> Investigate alternative malloc's for mesos
> --
>
> Key: MESOS-7876
> URL: https://issues.apache.org/jira/browse/MESOS-7876
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
> Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, 
> prof.png
>
>
> It is currently very hard to debug memory issues, in particular memory leaks, 
> in mesos.
> An alluring way to improve the situation would be to change the default 
> malloc to jemalloc, which has built-in heap-tracking capabilities.
> However, some care needs to be taken when considering changing such a 
> fundamental part of mesos:
>   * Would such a switch have any adverse impact on performance?
>   * Is it available and will it compile on all our target platforms?
>   * Is the jemalloc license compatible with bundling it as a third-party library?





[jira] [Commented] (MESOS-7876) Investigate alternative malloc's for mesos

2017-08-15 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127205#comment-16127205
 ] 

Benno Evers commented on MESOS-7876:


One other option we should probably keep in mind is _tcmalloc_, another malloc 
implementation, created at Google, that makes many of the same promises as 
`jemalloc` (i.e. drastically faster allocation times and optional heap 
tracking support) and is already included in our dependencies because it is 
part of {{gperftools}}.

On the one hand, this would avoid adding an additional dependency. On the other 
hand, it could also lead to additional problems, because some other 
3rdparty dependencies also try to link against tcmalloc if it is available at 
build time. We might therefore end up using several different versions of it if 
the bundled version differs from the one installed on the system and one of 
the build systems involved doesn't handle this situation correctly.

> Investigate alternative malloc's for mesos
> --
>
> Key: MESOS-7876
> URL: https://issues.apache.org/jira/browse/MESOS-7876
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
> Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, 
> prof.png
>
>
> It is currently very hard to debug memory issues, in particular memory leaks, 
> in mesos.
> An alluring way to improve the situation would be to change the default 
> malloc to jemalloc, which has built-in heap-tracking capabilities.
> However, some care needs to be taken when considering changing such a 
> fundamental part of mesos:
>   * Would such a switch have any adverse impact on performance?
>   * Is it available and will it compile on all our target platforms?
>   * Is the jemalloc license compatible with bundling it as a third-party library?





[jira] [Comment Edited] (MESOS-7876) Investigate alternative malloc's for mesos

2017-08-15 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127194#comment-16127194
 ] 

Benno Evers edited comment on MESOS-7876 at 8/15/17 1:24 PM:
-

Licensing: 2-clause BSD, there should be no problem.

Availability: jemalloc uses a standard autotools-based build, so adding it to 
our build should be no problem. As far as I know, mesos allocates all memory 
using operator new which is a standard interface, so there should be no 
platform-specific problems.

Performance: To test malloc performance, I compiled two versions of jemalloc 
4.5.0 with the default configuration options used in FreeBSD ( 
[https://www.freebsd.org/cgi/man.cgi?jemalloc(3)] ), i.e. {{--enable-fill 
--enable-lazy-lock --enable-munmap --enable-tcache --enable-tls --enable-utrace 
--enable-xmalloc}}. For one of them, I additionally specified the flags 
`--enable-stats --enable-prof` to enable heap statistics gathering and 
profiling options; for the other I specified `--disable-stats --disable-prof`.

Next, I spawned n threads per allocator (i.e. 3*n threads in total) and had 
each thread do 125,000 allocation and deallocation operations with memory 
regions uniformly distributed between 1 byte and 64 MiB. All three allocators 
were running at the same time to ensure the system base load was the same for 
all of them.

!noprof.png|Results run 1!

!prof.png|Results run 2!

More or less as predicted by other people's experience, these results show that 
the heap tracking functionality has almost no runtime impact when enabled but 
not actively used, and as a bonus jemalloc actually seems to have a substantial 
speedup for multi-threaded allocations, although it's debatable if this will be 
noticeable during normal operation. I didn't manage to get a clean measurement 
from mesos' own benchmark tests yet.

This post by Facebook describes some implementation details of jemalloc, along 
with a very extensive comparison of several malloc implementations, although it 
seems the actual results are missing from the page:   
https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/


> Investigate alternative malloc's for mesos
> --
>
> Key: MESOS-7876
> URL: https://issues.apache.org/jira/browse/MESOS-7876
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
> Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, 
> prof.png
>
>
> It is currently very hard to debug memory issues, in particular memory leaks, 
> in mesos.
> An alluring way to improve the situation would be to change the default 
> malloc to jemalloc, which has built-in heap-tracking capabilities.
> However, some care needs to be taken when considering to change such

[jira] [Commented] (MESOS-7819) Libprocess internal state is not monitored by metrics.

2017-08-15 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127362#comment-16127362
 ] 

Benno Evers commented on MESOS-7819:


For the metrics that we think might occasionally be useful for debugging but 
where we are worried about exposing too much internal state (points 1, 2, 5), 
another idea would be to introduce something like private metrics. These would 
essentially be something like a {{volatile static int64_t}} (so all 
modifications are preserved even at high optimization levels, but the only way 
to actually see the value would be through a debugger).
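
A minimal sketch of such a private metric follows. The namespace, the counter name `pendingMessages`, and the two update functions are hypothetical and not part of libprocess. Note that `volatile` only keeps the compiler from optimizing the stores away; it does not make the updates atomic, so a counter touched from several threads would need additional synchronization.

```cpp
#include <cassert>
#include <cstdint>

namespace internal {

// Hypothetical "private metric": never exported through an endpoint.
// `volatile` forces every update to be written to memory, so the current
// value can be inspected from a debugger even in optimized builds.
static volatile std::int64_t pendingMessages = 0;

// Called when a message is placed on a queue.
void messageEnqueued()
{
  pendingMessages = pendingMessages + 1;
}

// Called when a message is taken off a queue.
void messageDequeued()
{
  assert(pendingMessages > 0);
  pendingMessages = pendingMessages - 1;
}

} // namespace internal
```

In a debugger session, the value could then be read with something like `print internal::pendingMessages`, without it ever appearing in the public metrics.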

Some thoughts about the individual proposed metrics: it seems to me that any 
single one wouldn't be very useful, because it's hard to say in isolation how 
many actors/connections/messages are "normal" for the different parts of mesos. 
With several of them, however, it would become possible to compare their ratios 
to known "normal" ranges and maybe pinpoint the fault location more precisely.

In particular, the average number of pending messages might be useful not only 
for debugging but also for performance regression tests in the future.

> Libprocess internal state is not monitored by metrics.
> --
>
> Key: MESOS-7819
> URL: https://issues.apache.org/jira/browse/MESOS-7819
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Alexander Rukletsov
>  Labels: metrics, newbie++
>
> Libprocess does not expose its internal state via metrics. Active sockets, 
> number of HTTP proxies, number of running actors, number of pending messages 
> for all active sockets, etc — may be of interest when monitoring and 
> debugging Mesos clusters.





[jira] [Updated] (MESOS-7876) Investigate alternative malloc implementations for mesos

2017-08-15 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-7876:
---
Summary: Investigate alternative malloc implementations for mesos  (was: 
Investigate alternative malloc's for mesos)

> Investigate alternative malloc implementations for mesos
> 
>
> Key: MESOS-7876
> URL: https://issues.apache.org/jira/browse/MESOS-7876
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
> Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, 
> prof.png
>
>
> It is currently very hard to debug memory issues, in particular memory leaks, 
> in mesos.
> An alluring way to improve the situation would be to change the default 
> malloc to jemalloc, which has built-in heap-tracking capabilities.
> However, some care needs to be taken when considering changing such a 
> fundamental part of mesos:
>   * Would such a switch have any adverse impact on performance?
>   * Is it available and will it compile on all our target platforms?
>   * Is the jemalloc license compatible with bundling it as a third-party library?





[jira] [Commented] (MESOS-7876) Investigate alternative malloc implementations for mesos

2017-08-15 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127588#comment-16127588
 ] 

Benno Evers commented on MESOS-7876:


Spot-checking some of the mesos benchmarks using jemalloc vs. system malloc, I 
can observe a small speedup of roughly 1% to 7% using jemalloc over glibc in 
all but one of the runs below (SchedulerLibrary/0 was slightly slower). There 
certainly is no indication that switching to jemalloc would lead to 
performance regressions.

With jemalloc:
{code}
[   OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.DeclineOffers/1 (575213 ms)
[   OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/1 (1963 ms)
[   OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/10 (18756 ms)
[   OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/11 (37044 ms)
[   OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/12 (97298 ms)
[   OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0 (302 ms)
[   OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1 (2311 ms)
[   OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2 (12104 ms)
{code}

With default malloc:
{code}
[   OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.DeclineOffers/1 (610002 ms)
[   OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/1 (2065 ms)
[   OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/10 (20207 ms)
[   OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/11 (38086 ms)
[   OK ] SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.Metrics/12 (98475 ms)
[   OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/0 (281 ms)
[   OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/1 (2448 ms)
[   OK ] Tasks/SchedulerReconcileTasks_BENCHMARK_Test.SchedulerLibrary/2 (12673 ms)
{code}
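
The relative difference between the two runs can be computed per benchmark from the timings listed above. The helper below is only an illustration of that arithmetic; the name `speedupPercent()` is made up here.

```cpp
#include <cassert>

// Relative speedup of jemalloc over the default malloc, in percent.
// Positive values mean the jemalloc run was faster, negative values mean
// it was slower.
double speedupPercent(double jemallocMs, double defaultMs)
{
  assert(defaultMs > 0.0);
  return (defaultMs - jemallocMs) / defaultMs * 100.0;
}
```

For example, `speedupPercent(575213, 610002)` (DeclineOffers/1) is about 5.7, `speedupPercent(97298, 98475)` (Metrics/12) is about 1.2, and `speedupPercent(302, 281)` (SchedulerLibrary/0) is negative because that jemalloc run took longer.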

> Investigate alternative malloc implementations for mesos
> 
>
> Key: MESOS-7876
> URL: https://issues.apache.org/jira/browse/MESOS-7876
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
> Attachments: jemalloc_benchmark_raw.txt, malloc.cpp, noprof.png, 
> prof.png
>
>
> It is currently very hard to debug memory issues, in particular memory leaks, 
> in mesos.
> An alluring way to improve the situation would be to change the default 
> malloc to jemalloc, which has built-in heap-tracking capabilities.
> However, some care needs to be taken when considering changing such a 
> fundamental part of mesos:
>   * Would such a switch have any adverse impact on performance?
>   * Is it available and will it compile on all our target platforms?
>   * Is the jemalloc license compatible with bundling it as a third-party library?





[jira] [Commented] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)

2017-08-30 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147024#comment-16147024
 ] 

Benno Evers commented on MESOS-7699:


I also experienced this, and I think the correct way to handle it is to revert 
the usage of `-isystem` back to `-I`, and then to either disable building with 
`-Werror` by default (my preferred choice) or to add 
`-Wno-deprecated-declarations` to the default build flags.

The reasoning here is that using `-Werror` implies that we're committing to fix 
at least all warnings that occur with our supported list of compilers and 
dependencies, but as the original boost bug showed we are not willing and don't 
have the resources to do that. (I think the fact that we have a 
`--disable-werror` configure flag also shows that this would be a useful thing 
to do)

Alternatively, while I agree with the view that `-Wno-deprecated-declarations` 
will potentially hide useful warnings, having these warnings is in my opinion 
less important than being able to build with non-bundled versions of boost and 
protobuf.

> "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable 
> freshly released)
> ---
>
> Key: MESOS-7699
> URL: https://issues.apache.org/jira/browse/MESOS-7699
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2.0
>Reporter: Adam Cecile
>  Labels: autotools
>
> Hi,
> It seems the issue comes from a workaround added a while ago:
> https://reviews.apache.org/r/40326/
> https://reviews.apache.org/r/40327/
> When building with external libraries, the build ends up creating command 
> lines with -isystem /usr/include, which is clearly stated as being wrong 
> according to the GCC developers:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129
> I'll do some testing by reverting all -isystem to -I and I'll let you know if 
> it builds.
> Regards, Adam.
> {noformat}
> configure:21642: result: no
> configure:21642: checking glog/logging.h presence
> configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 
> -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include 
> -I/usr/include conftest.cpp
> In file included from /usr/include/c++/6/ext/string_conversions.h:41:0,
>  from /usr/include/c++/6/bits/basic_string.h:5417,
>  from /usr/include/c++/6/string:52,
>  from /usr/include/c++/6/bits/locale_classes.h:40,
>  from /usr/include/c++/6/bits/ios_base.h:41,
>  from /usr/include/c++/6/ios:42,
>  from /usr/include/c++/6/ostream:38,
>  from /usr/include/glog/logging.h:43,
>  from conftest.cpp:32:
> /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or 
> directory
>  #include_next <stdlib.h>
>  ^
> compilation terminated.
> configure:21642: $? = 1
> configure: failed program was:
> | /* confdefs.h */
> | #define PACKAGE_NAME "mesos"
> | #define PACKAGE_TARNAME "mesos"
> | #define PACKAGE_VERSION "1.2.0"
> | #define PACKAGE_STRING "mesos 1.2.0"
> | #define PACKAGE_BUGREPORT ""
> | #define PACKAGE_URL ""
> | #define PACKAGE "mesos"
> | #define VERSION "1.2.0"
> | #define STDC_HEADERS 1
> | #define HAVE_SYS_TYPES_H 1
> | #define HAVE_SYS_STAT_H 1
> | #define HAVE_STDLIB_H 1
> | #define HAVE_STRING_H 1
> | #define HAVE_MEMORY_H 1
> | #define HAVE_STRINGS_H 1
> | #define HAVE_INTTYPES_H 1
> | #define HAVE_STDINT_H 1
> | #define HAVE_UNISTD_H 1
> | #define HAVE_DLFCN_H 1
> | #define LT_OBJDIR ".libs/"
> | #define HAVE_CXX11 1
> | #define HAVE_PTHREAD_PRIO_INHERIT 1
> | #define HAVE_PTHREAD 1
> | #define HAVE_LIBZ 1
> | #define HAVE_FTS_H 1
> | #define HAVE_APR_POOLS_H 1
> | #define HAVE_LIBAPR_1 1
> | #define HAVE_BOOST_VERSION_HPP 1
> | #define HAVE_LIBCURL 1
> | /* end confdefs.h.  */
> | #include <glog/logging.h>
> configure:21642: result: no
> configure:21642: checking for glog/logging.h
> configure:21642: result: no
> configure:21674: error: cannot find glog
> ---
> You have requested the use of a non-bundled glog but no suitable
> glog could be found.
> You may want specify the location of glog by providing a prefix
> path via --with-glog=DIR, or check that the path you provided is
> correct if you're already doing this.
> ---
> {noformat}





[jira] [Created] (MESOS-7941) Send TASK_STARTING status from built-in executors

2017-09-06 Thread Benno Evers (JIRA)
Benno Evers created MESOS-7941:
--

 Summary: Send TASK_STARTING status from built-in executors
 Key: MESOS-7941
 URL: https://issues.apache.org/jira/browse/MESOS-7941
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers
Assignee: Benno Evers


All executors have the option to send out a TASK_STARTING status update to 
signal to the scheduler that they received the command to launch the task.

It would be good if our built-in executors would do this, for reasons laid out 
in 
https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E

This will also fix MESOS-6790.





[jira] [Commented] (MESOS-7941) Send TASK_STARTING status from built-in executors

2017-09-06 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155483#comment-16155483
 ] 

Benno Evers commented on MESOS-7941:


Review: https://reviews.apache.org/r/62123/

> Send TASK_STARTING status from built-in executors
> -
>
> Key: MESOS-7941
> URL: https://issues.apache.org/jira/browse/MESOS-7941
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>
> All executors have the option to send out a TASK_STARTING status update to 
> signal to the scheduler that they received the command to launch the task.
> It would be good if our built-in executors would do this, for reasons laid 
> out in 
> https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E
> This will also fix MESOS-6790.





[jira] [Commented] (MESOS-7941) Send TASK_STARTING status from built-in executors

2017-09-06 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155487#comment-16155487
 ] 

Benno Evers commented on MESOS-7941:


PR to update Chronos to correctly handle these new updates: 
https://github.com/mesos/chronos/pull/854

> Send TASK_STARTING status from built-in executors
> -
>
> Key: MESOS-7941
> URL: https://issues.apache.org/jira/browse/MESOS-7941
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>
> All executors have the option to send out a TASK_STARTING status update to 
> signal to the scheduler that they received the command to launch the task.
> It would be good if our built-in executors would do this, for reasons laid 
> out in 
> https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E
> This will also fix MESOS-6790.





[jira] [Created] (MESOS-7944) Implement jemalloc support for Mesos

2017-09-07 Thread Benno Evers (JIRA)
Benno Evers created MESOS-7944:
--

 Summary: Implement jemalloc support for Mesos
 Key: MESOS-7944
 URL: https://issues.apache.org/jira/browse/MESOS-7944
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers
Assignee: Benno Evers


After investigation in MESOS-7876 and discussion on the mailing list, this task 
is for tracking progress on adding out-of-the-box memory profiling support 
using jemalloc to Mesos.





[jira] [Commented] (MESOS-7944) Implement jemalloc support for Mesos

2017-09-11 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161001#comment-16161001
 ] 

Benno Evers commented on MESOS-7944:


Since I've started to work on this, I now have a much sharper idea of what 
needs to be done.

First of all, since the added features are not mesos-specific, I think it's 
best to add them directly to libprocess. However, the choice of preferred 
malloc should be up to the binary, not enforced by a shared library, so instead 
of compiling against jemalloc we should detect at runtime whether we're running 
under jemalloc or not (similar to what folly does here: 
https://github.com/facebook/folly/blob/master/folly/Malloc.h#L150).

At the endpoint, the minimum features I would like are the ability to get the 
(exact) heap allocation statistics as JSON, or download current (stochastic) 
heap profile dumps as files. Depending on the complexity of it, we should also 
think about providing a way to have the master dump profiles periodically and 
store them on disk, and a way to generate jeprof-graphs automatically.

Finally, the new `--enable-memory-profiling` configure option (tentative name) 
for mesos would build a bundled version of jemalloc with all the necessary 
configuration options enabled, and link the mesos-master and mesos-slave 
binaries against this library.
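
The runtime detection idea can be sketched in a few lines. This is only an analogue, written in Python for brevity: it checks whether a `mallctl` symbol is visible in the running process, assuming jemalloc was built without a symbol prefix (a prefixed build would export a different name). The function name `running_under_jemalloc` is made up for this example, and the real implementation in libprocess would be C++ and may work differently.

```python
import ctypes


def running_under_jemalloc():
    """Return True if a jemalloc `mallctl` symbol is visible in this process."""
    try:
        # CDLL(None) opens the running program itself (dlopen(NULL) on POSIX),
        # so symbol lookup covers every library already loaded into the process.
        process = ctypes.CDLL(None)
    except (OSError, TypeError):
        # Platforms where the main program cannot be opened this way.
        return False
    # Assumption: jemalloc exports an unprefixed mallctl(); other common
    # malloc implementations do not provide this symbol.
    return hasattr(process, "mallctl")
```

Endpoints that depend on jemalloc could then be enabled only when this check succeeds, while the binary itself decides which malloc gets linked or preloaded.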

> Implement jemalloc support for Mesos
> 
>
> Key: MESOS-7944
> URL: https://issues.apache.org/jira/browse/MESOS-7944
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>
> After investigation in MESOS-7876 and discussion on the mailing list, this 
> task is for tracking progress on adding out-of-the-box memory profiling 
> support using jemalloc to Mesos.





[jira] [Comment Edited] (MESOS-7944) Implement jemalloc support for Mesos

2017-09-11 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161001#comment-16161001
 ] 

Benno Evers edited comment on MESOS-7944 at 9/11/17 9:59 AM:
-

Since I've started to work on this, I now have a much better idea of what needs 
to be done.

First of all, since the added features are not mesos-specific, I think it's 
best to add them directly to libprocess. However, the choice of preferred 
malloc should be up to the binary, not enforced by a shared library, so instead 
of compiling against jemalloc we should detect at runtime whether we're running 
under jemalloc or not (similar to what folly does here: 
https://github.com/facebook/folly/blob/master/folly/Malloc.h#L150).

At the endpoint, the minimum features I would like are the ability to get the 
(exact) heap allocation statistics as JSON, or download current (stochastic) 
heap profile dumps as files. Depending on the complexity of it, we should also 
think about providing a way to have the master dump profiles periodically and 
store them on disk, and a way to generate jeprof-graphs automatically.

Finally, the new `--enable-memory-profiling` configure option (tentative name) 
for mesos would build a bundled version of jemalloc with all the necessary 
configuration options enabled, and link the mesos-master and mesos-slave 
binaries against this library.


was (Author: bennoe):
Since I've started to work on this, I now have a much sharper idea of what 
needs to be done.

First of all, since the added features are not mesos-specific, I think it's 
best to add them directly to libprocess. However, the choice of preferred 
malloc should be up to the binary, not enforced by a shared library, so instead 
of compiling against jemalloc we should detect at runtime whether we're running 
under jemalloc or not (similar to what folly does here: 
https://github.com/facebook/folly/blob/master/folly/Malloc.h#L150).

At the endpoint, the minimum features I would like are the ability to get the 
(exact) heap allocation statistics as JSON, or download current (stochastic) 
heap profile dumps as files. Depending on the complexity of it, we should also 
think about providing a way to have the master dump profiles periodically and 
store them on disk, and a way to generate jeprof-graphs automatically.

Finally, the new `--enable-memory-profiling` configure option (tentative name) 
for mesos would build a bundled version of jemalloc with all the necessary 
configuration options enabled, and link the mesos-master and mesos-slave 
binaries against this library.

> Implement jemalloc support for Mesos
> 
>
> Key: MESOS-7944
> URL: https://issues.apache.org/jira/browse/MESOS-7944
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>
> After investigation in MESOS-7876 and discussion on the mailing list, this 
> task is for tracking progress on adding out-of-the-box memory profiling 
> support using jemalloc to Mesos.





[jira] [Comment Edited] (MESOS-7941) Send TASK_STARTING status from built-in executors

2017-09-15 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155483#comment-16155483
 ] 

Benno Evers edited comment on MESOS-7941 at 9/15/17 8:51 AM:
-

Review: https://reviews.apache.org/r/62212/


was (Author: bennoe):
Review: https://reviews.apache.org/r/62123/

> Send TASK_STARTING status from built-in executors
> -
>
> Key: MESOS-7941
> URL: https://issues.apache.org/jira/browse/MESOS-7941
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>
> All executors have the option to send out a TASK_STARTING status update to 
> signal to the scheduler that they received the command to launch the task.
> It would be good if our built-in executors would do this, for reasons laid 
> out in 
> https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E
> This will also fix MESOS-6790.





[jira] [Created] (MESOS-7994) Hard-coded protobuf version in mesos.pom.in

2017-09-21 Thread Benno Evers (JIRA)
Benno Evers created MESOS-7994:
--

 Summary: Hard-coded protobuf version in mesos.pom.in
 Key: MESOS-7994
 URL: https://issues.apache.org/jira/browse/MESOS-7994
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Currently, the version of protobuf.jar used by Maven is hardcoded in 
`src/java/mesos.pom.in` to be 3.3.0.

When building against a non-bundled version of protobuf, this will likely cause 
a version mismatch, which can lead to build errors because the Java build 
tries to compile the Java source files generated by the protoc of the 
non-bundled protobuf against the hardcoded protobuf.jar.
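
To make the mismatch concrete, here is a small sketch of the kind of check a build could perform: it compares the version reported by `protoc --version` (which prints text such as `libprotoc 3.3.0`) with the version pinned for protobuf.jar. The function names are hypothetical and this is not part of the Mesos build system.

```python
import re


def protoc_version(version_output):
    """Extract 'X.Y.Z' from text such as 'libprotoc 3.3.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version found in %r" % version_output)
    return match.group(0)


def check_protobuf_versions(version_output, jar_version):
    """Return None if the versions agree, otherwise a human-readable message."""
    found = protoc_version(version_output)
    if found == jar_version:
        return None
    return "protoc is %s but protobuf.jar is pinned to %s" % (found, jar_version)
```

For example, `check_protobuf_versions("libprotoc 3.5.1", "3.3.0")` reports the mismatch, while matching versions return `None`.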





[jira] [Created] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky

2017-09-22 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8005:
--

 Summary: Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky
 Key: MESOS-8005
 URL: https://issues.apache.org/jira/browse/MESOS-8005
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Executed on Ubuntu 17.04 w/ SSL enabled:

{code}
../../src/tests/cluster.cpp:580
Value of: containers->empty()
  Actual: false
Expected: true
Failed to destroy containers: { 86d690bc-4248-4d26-bdc7-28901d8cf2ab }
{code}





[jira] [Updated] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky

2017-09-22 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-8005:
---
Attachment: jenkins.log.gz

Full log.

> Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky
> -
>
> Key: MESOS-8005
> URL: https://issues.apache.org/jira/browse/MESOS-8005
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
> Attachments: jenkins.log.gz
>
>
> Executed on Ubuntu 17.04 w/ SSL enabled:
> {code}
> ../../src/tests/cluster.cpp:580
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { 86d690bc-4248-4d26-bdc7-28901d8cf2ab }
> {code}





[jira] [Comment Edited] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky

2017-09-22 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176745#comment-16176745
 ] 

Benno Evers edited comment on MESOS-8005 at 9/22/17 5:19 PM:
-

Sure, I attached the full log. The build was started for commit 
548aaee3a8f5935457767db1e3b761d873b09cbf on the master branch, but I highly 
doubt that this commit caused the test failure.


was (Author: bennoe):
Full log.

> Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky
> -
>
> Key: MESOS-8005
> URL: https://issues.apache.org/jira/browse/MESOS-8005
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
> Attachments: jenkins.log.gz
>
>
> Executed on Ubuntu 17.04 w/ SSL enabled:
> {code}
> ../../src/tests/cluster.cpp:580
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { 86d690bc-4248-4d26-bdc7-28901d8cf2ab }
> {code}





[jira] [Comment Edited] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky

2017-09-22 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176745#comment-16176745
 ] 

Benno Evers edited comment on MESOS-8005 at 9/22/17 5:20 PM:
-

Sure, I attached the full log. The build was started for commit 
548aaee3a8f5935457767db1e3b761d873b09cbf on the master branch, but I highly 
doubt that this commit caused the test failure:

{code}
bevers@poincare:~/src/mesos/worktrees/master$ git show 548aaee3a8f5935457767db1e3b761d873b09cbf --stat
commit 548aaee3a8f5935457767db1e3b761d873b09cbf
Author: Tomasz Janiszewski 
Date:   Thu Sep 21 16:16:06 2017 -0700

Display task state counters in the framework page.

Fixes MESOS-7962.

This closes #234

 src/webui/master/static/framework.html    | 42 ++
 src/webui/master/static/js/controllers.js | 30 ++
 2 files changed, 72 insertions(+)
{code}


was (Author: bennoe):
Sure, I attached the full log. The build was started for commit 
548aaee3a8f5935457767db1e3b761d873b09cbf on the master branch, but I highly 
doubt that this commit caused the test failure.

> Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky
> -
>
> Key: MESOS-8005
> URL: https://issues.apache.org/jira/browse/MESOS-8005
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
> Attachments: jenkins.log.gz
>
>
> Executed on Ubuntu 17.04 w/ SSL enabled:
> {code}
> ../../src/tests/cluster.cpp:580
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { 86d690bc-4248-4d26-bdc7-28901d8cf2ab }
> {code}





[jira] [Created] (MESOS-8023) Warn users trying to use HTTP Basic Authentication over non-secure channels

2017-09-27 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8023:
--

 Summary: Warn users trying to use HTTP Basic Authentication over 
non-secure channels
 Key: MESOS-8023
 URL: https://issues.apache.org/jira/browse/MESOS-8023
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


Since HTTP Basic authentication transmits usernames and passwords in plain text 
(merely base64-encoded), it should only be used when the connection is already 
secured through another layer, e.g. when using HTTPS.

Since many users are not aware of this fact, Mesos should try to detect and warn 
about this situation where possible, to prevent accidental leaking of passwords.
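
The following sketch shows why this matters and what such a check might look like. Basic credentials are only base64-encoded, so anyone who can read the header can recover them. Both function names are invented for this example, and the logic is an assumption about how a warning could be triggered, not the Mesos implementation.

```python
import base64


def decode_basic_credentials(header_value):
    """Recover (username, password) from an 'Authorization: Basic ...' value."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization header")
    decoded = base64.b64decode(encoded).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


def should_warn(url_scheme, header_value):
    """True when Basic credentials would travel over an unencrypted channel."""
    is_basic = header_value.lower().startswith("basic ")
    return is_basic and url_scheme.lower() != "https"
```

Decoding the header value `Basic YWxpY2U6c2VjcmV0` yields the username `alice` and the password `secret`, which is exactly what an eavesdropper on a plain HTTP connection would see.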





[jira] [Created] (MESOS-8047) SubprocessTest.Status does not always receive a signal

2017-10-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8047:
--

 Summary: SubprocessTest.Status does not always receive a signal
 Key: MESOS-8047
 URL: https://issues.apache.org/jira/browse/MESOS-8047
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


This one seems to be different from MESOS-1705 and MESOS-1738. It might be that 
previous test runs leave a mesos process running in the background, but I 
didn't investigate very deeply:

{code}
[ RUN  ] SubprocessTest.Status
/home/bevers/src/mesos/worktrees/master/3rdparty/libprocess/src/tests/subprocess_tests.cpp:281: Failure
Expecting WIFSIGNALED(s.get().status()()->get()) but WIFEXITED(s.get().status()()->get()) is true and WEXITSTATUS(s.get().status()()->get()) is 0
{code}
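
For readers unfamiliar with the macros in the failure message, the sketch below shows the distinction on a POSIX system: a child that dies from a signal satisfies WIFSIGNALED, whereas a child that returns normally satisfies WIFEXITED. This is not the libprocess test; the helper `kill_and_describe` is hypothetical and uses a sleeping Python child as a stand-in for the test subprocess.

```python
import os
import signal
import sys


def kill_and_describe(sig=signal.SIGKILL):
    """Start a sleeping child, signal it, and describe its wait status."""
    child = [sys.executable, "-c", "import time; time.sleep(30)"]
    pid = os.spawnv(os.P_NOWAIT, sys.executable, child)
    os.kill(pid, sig)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        # The child was terminated by a signal; report which one.
        return ("signaled", os.WTERMSIG(status))
    if os.WIFEXITED(status):
        # The child returned on its own; report its exit code.
        return ("exited", os.WEXITSTATUS(status))
    return ("other", status)
```

The failure above corresponds to the second branch: the test expected a signaled child but observed a normal exit with status 0.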





[jira] [Updated] (MESOS-7941) Send TASK_STARTING status from built-in executors

2017-10-06 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-7941:
---
Sprint: Mesosphere Sprint 65

> Send TASK_STARTING status from built-in executors
> -
>
> Key: MESOS-7941
> URL: https://issues.apache.org/jira/browse/MESOS-7941
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>
> All executors have the option to send out a TASK_STARTING status update to 
> signal to the scheduler that they received the command to launch the task.
> It would be good if our built-in executors would do this, for reasons laid 
> out in 
> https://mail-archives.apache.org/mod_mbox/mesos-dev/201708.mbox/%3CCA%2B9TLTzkEVM0CKvY%2B%3D0%3DwjrN6hYFAt0401Y7b8tysDWx1WZzdw%40mail.gmail.com%3E
> This will also fix MESOS-6790.





[jira] [Updated] (MESOS-7944) Implement jemalloc support for Mesos

2017-10-06 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-7944:
---
Sprint: Mesosphere Sprint 63, Mesosphere Sprint 65  (was: Mesosphere Sprint 
63)

> Implement jemalloc support for Mesos
> 
>
> Key: MESOS-7944
> URL: https://issues.apache.org/jira/browse/MESOS-7944
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>
> After investigation in MESOS-7876 and discussion on the mailing list, this 
> task is for tracking progress on adding out-of-the-box memory profiling 
> support using jemalloc to Mesos.





[jira] [Updated] (MESOS-6790) Wrong task started time in webui

2017-10-11 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-6790:
---
Sprint: Mesosphere Sprint 65

> Wrong task started time in webui
> 
>
> Key: MESOS-6790
> URL: https://issues.apache.org/jira/browse/MESOS-6790
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Reporter: haosdent
>Assignee: Benno Evers
>  Labels: health-check, mesosphere, observability, webui
>
> Reported by [~janisz]
> {quote}
> Hi
> When a task has a Mesos health check enabled, the start time in the UI can
> be wrong. This happens because the UI assumes that the first status is the
> one for the task start [0]. This is not always true, because Mesos keeps only
> recent task statuses [1], so when a health check updates the task status it
> can override the task start time displayed in the webui.
> Best
> Tomek
> [0]
> https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L140
> [1]
> https://github.com/apache/mesos/blob/f2adc8a95afda943f6a10e771aad64300da19047/src/common/protobuf_utils.cpp#L263-L265
> {quote}
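
One way to avoid the positional assumption described in the report is to select the start time by state rather than by index. The sketch below assumes each status is a dictionary with `state` and `timestamp` keys; that shape and the function name are assumptions made for illustration. It cannot recover a start time whose status has already been dropped from the list.

```python
def task_start_time(statuses):
    """Return the timestamp of the earliest TASK_RUNNING status, or None."""
    running = [s["timestamp"] for s in statuses if s["state"] == "TASK_RUNNING"]
    return min(running) if running else None
```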





[jira] [Assigned] (MESOS-6790) Wrong task started time in webui

2017-10-11 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-6790:
--

Assignee: Benno Evers  (was: Tomasz Janiszewski)

> Wrong task started time in webui
> 
>
> Key: MESOS-6790
> URL: https://issues.apache.org/jira/browse/MESOS-6790
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Reporter: haosdent
>Assignee: Benno Evers
>  Labels: health-check, mesosphere, observability, webui
>
> Reported by [~janisz]
> {quote}
> Hi
> When a task has a Mesos health check enabled, the start time in the UI can
> be wrong. This happens because the UI assumes that the first status is the
> one for the task start [0]. This is not always true, because Mesos keeps only
> recent task statuses [1], so when a health check updates the task status it
> can override the task start time displayed in the webui.
> Best
> Tomek
> [0]
> https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L140
> [1]
> https://github.com/apache/mesos/blob/f2adc8a95afda943f6a10e771aad64300da19047/src/common/protobuf_utils.cpp#L263-L265
> {quote}





[jira] [Assigned] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)

2017-10-12 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-7699:
--

Assignee: Benno Evers

> "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable 
> freshly released)
> ---
>
> Key: MESOS-7699
> URL: https://issues.apache.org/jira/browse/MESOS-7699
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2.0
>Reporter: Adam Cecile
>Assignee: Benno Evers
>  Labels: autotools
>
> Hi,
> It seems the issue comes from a workaround added a while ago:
> https://reviews.apache.org/r/40326/
> https://reviews.apache.org/r/40327/
> When building with external libraries, the build ends up creating command 
> lines with -isystem /usr/include, which is clearly stated as being wrong 
> according to the GCC developers:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129
> I'll do some testing by reverting all -isystem to -I and I'll let you know if 
> it builds.
> Regards, Adam.
> {noformat}
> configure:21642: result: no
> configure:21642: checking glog/logging.h presence
> configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 
> -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include 
> -I/usr/include conftest.cpp
> In file included from /usr/include/c++/6/ext/string_conversions.h:41:0,
>  from /usr/include/c++/6/bits/basic_string.h:5417,
>  from /usr/include/c++/6/string:52,
>  from /usr/include/c++/6/bits/locale_classes.h:40,
>  from /usr/include/c++/6/bits/ios_base.h:41,
>  from /usr/include/c++/6/ios:42,
>  from /usr/include/c++/6/ostream:38,
>  from /usr/include/glog/logging.h:43,
>  from conftest.cpp:32:
> /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or 
> directory
>  #include_next <stdlib.h>
>  ^
> compilation terminated.
> configure:21642: $? = 1
> configure: failed program was:
> | /* confdefs.h */
> | #define PACKAGE_NAME "mesos"
> | #define PACKAGE_TARNAME "mesos"
> | #define PACKAGE_VERSION "1.2.0"
> | #define PACKAGE_STRING "mesos 1.2.0"
> | #define PACKAGE_BUGREPORT ""
> | #define PACKAGE_URL ""
> | #define PACKAGE "mesos"
> | #define VERSION "1.2.0"
> | #define STDC_HEADERS 1
> | #define HAVE_SYS_TYPES_H 1
> | #define HAVE_SYS_STAT_H 1
> | #define HAVE_STDLIB_H 1
> | #define HAVE_STRING_H 1
> | #define HAVE_MEMORY_H 1
> | #define HAVE_STRINGS_H 1
> | #define HAVE_INTTYPES_H 1
> | #define HAVE_STDINT_H 1
> | #define HAVE_UNISTD_H 1
> | #define HAVE_DLFCN_H 1
> | #define LT_OBJDIR ".libs/"
> | #define HAVE_CXX11 1
> | #define HAVE_PTHREAD_PRIO_INHERIT 1
> | #define HAVE_PTHREAD 1
> | #define HAVE_LIBZ 1
> | #define HAVE_FTS_H 1
> | #define HAVE_APR_POOLS_H 1
> | #define HAVE_LIBAPR_1 1
> | #define HAVE_BOOST_VERSION_HPP 1
> | #define HAVE_LIBCURL 1
> | /* end confdefs.h.  */
> | #include <glog/logging.h>
> configure:21642: result: no
> configure:21642: checking for glog/logging.h
> configure:21642: result: no
> configure:21674: error: cannot find glog
> ---
> You have requested the use of a non-bundled glog but no suitable
> glog could be found.
> You may want specify the location of glog by providing a prefix
> path via --with-glog=DIR, or check that the path you provided is
> correct if you're already doing this.
> ---
> {noformat}
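
The experiment the reporter describes, turning every -isystem into -I, can be sketched as a simple rewrite of the compiler flag list. The helper name `isystem_to_include` is hypothetical; this is an illustration of the idea, not the change that was made to the Mesos build files.

```python
def isystem_to_include(flags):
    """Rewrite '-isystem DIR' and '-isystemDIR' as '-IDIR'; keep other flags."""
    result = []
    args = iter(flags)
    for flag in args:
        if flag == "-isystem":
            # Separate-argument form: the directory is the next flag.
            directory = next(args, None)
            if directory is None:
                result.append(flag)  # dangling option, leave untouched
            else:
                result.append("-I" + directory)
        elif flag.startswith("-isystem"):
            # Joined form: the directory follows the option directly.
            result.append("-I" + flag[len("-isystem"):])
        else:
            result.append(flag)
    return result
```

Applied to the failing command line from the log, the `-isystem /usr/include` pair becomes a second `-I/usr/include`, which no longer changes the position of /usr/include among the system include directories.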





[jira] [Commented] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)

2017-10-12 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202283#comment-16202283
 ] 

Benno Evers commented on MESOS-7699:


I posted a review chain to fix this (along with follow-up issues when building 
against unbundled versions of boost and protobuf) at 
https://reviews.apache.org/r/62160/

> "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable 
> freshly released)
> ---
>
> Key: MESOS-7699
> URL: https://issues.apache.org/jira/browse/MESOS-7699
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2.0
>Reporter: Adam Cecile
>Assignee: Benno Evers
>  Labels: autotools
>
> Hi,
> It seems the issue comes from a workaround added a while ago:
> https://reviews.apache.org/r/40326/
> https://reviews.apache.org/r/40327/
> When building with external libraries, the build ends up creating command 
> lines with -isystem /usr/include, which is clearly stated as being wrong 
> according to the GCC developers:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129
> I'll do some testing by reverting all -isystem to -I and I'll let you know if 
> it builds.
> Regards, Adam.
> {noformat}
> configure:21642: result: no
> configure:21642: checking glog/logging.h presence
> configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 
> -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include 
> -I/usr/include conftest.cpp
> In file included from /usr/include/c++/6/ext/string_conversions.h:41:0,
>  from /usr/include/c++/6/bits/basic_string.h:5417,
>  from /usr/include/c++/6/string:52,
>  from /usr/include/c++/6/bits/locale_classes.h:40,
>  from /usr/include/c++/6/bits/ios_base.h:41,
>  from /usr/include/c++/6/ios:42,
>  from /usr/include/c++/6/ostream:38,
>  from /usr/include/glog/logging.h:43,
>  from conftest.cpp:32:
> /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or 
> directory
>  #include_next <stdlib.h>
>  ^
> compilation terminated.
> configure:21642: $? = 1
> configure: failed program was:
> | /* confdefs.h */
> | #define PACKAGE_NAME "mesos"
> | #define PACKAGE_TARNAME "mesos"
> | #define PACKAGE_VERSION "1.2.0"
> | #define PACKAGE_STRING "mesos 1.2.0"
> | #define PACKAGE_BUGREPORT ""
> | #define PACKAGE_URL ""
> | #define PACKAGE "mesos"
> | #define VERSION "1.2.0"
> | #define STDC_HEADERS 1
> | #define HAVE_SYS_TYPES_H 1
> | #define HAVE_SYS_STAT_H 1
> | #define HAVE_STDLIB_H 1
> | #define HAVE_STRING_H 1
> | #define HAVE_MEMORY_H 1
> | #define HAVE_STRINGS_H 1
> | #define HAVE_INTTYPES_H 1
> | #define HAVE_STDINT_H 1
> | #define HAVE_UNISTD_H 1
> | #define HAVE_DLFCN_H 1
> | #define LT_OBJDIR ".libs/"
> | #define HAVE_CXX11 1
> | #define HAVE_PTHREAD_PRIO_INHERIT 1
> | #define HAVE_PTHREAD 1
> | #define HAVE_LIBZ 1
> | #define HAVE_FTS_H 1
> | #define HAVE_APR_POOLS_H 1
> | #define HAVE_LIBAPR_1 1
> | #define HAVE_BOOST_VERSION_HPP 1
> | #define HAVE_LIBCURL 1
> | /* end confdefs.h.  */
> | #include <glog/logging.h>
> configure:21642: result: no
> configure:21642: checking for glog/logging.h
> configure:21642: result: no
> configure:21674: error: cannot find glog
> ---
> You have requested the use of a non-bundled glog but no suitable
> glog could be found.
> You may want specify the location of glog by providing a prefix
> path via --with-glog=DIR, or check that the path you provided is
> correct if you're already doing this.
> ---
> {noformat}





[jira] [Updated] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)

2017-10-12 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-7699:
---
Shepherd: Benjamin Bannier
  Sprint: Mesosphere Sprint 66
Story Points: 3

> "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable 
> freshly released)
> ---
>
> Key: MESOS-7699
> URL: https://issues.apache.org/jira/browse/MESOS-7699
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2.0
>Reporter: Adam Cecile
>Assignee: Benno Evers
>  Labels: autotools
>
> Hi,
> It seems the issue comes from a workaround added a while ago:
> https://reviews.apache.org/r/40326/
> https://reviews.apache.org/r/40327/
> When building with external libraries it turns out creating build commands 
> line with -isystem /usr/include which is clearly stated as being wrong, 
> according to GCC guys:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129
> I'll do some testing by reverting all -isystem to -I and I'll let it know if 
> it gets built.
> Regards, Adam.
> {noformat}
> configure:21642: result: no
> configure:21642: checking glog/logging.h presence
> configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 
> -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include 
> -I/usr/include conftest.cpp
> In file included from /usr/include/c++/6/ext/string_conversions.h:41:0,
>  from /usr/include/c++/6/bits/basic_string.h:5417,
>  from /usr/include/c++/6/string:52,
>  from /usr/include/c++/6/bits/locale_classes.h:40,
>  from /usr/include/c++/6/bits/ios_base.h:41,
>  from /usr/include/c++/6/ios:42,
>  from /usr/include/c++/6/ostream:38,
>  from /usr/include/glog/logging.h:43,
>  from conftest.cpp:32:
> /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or 
> directory
>  #include_next <stdlib.h>
>  ^
> compilation terminated.
> configure:21642: $? = 1
> configure: failed program was:
> | /* confdefs.h */
> | #define PACKAGE_NAME "mesos"
> | #define PACKAGE_TARNAME "mesos"
> | #define PACKAGE_VERSION "1.2.0"
> | #define PACKAGE_STRING "mesos 1.2.0"
> | #define PACKAGE_BUGREPORT ""
> | #define PACKAGE_URL ""
> | #define PACKAGE "mesos"
> | #define VERSION "1.2.0"
> | #define STDC_HEADERS 1
> | #define HAVE_SYS_TYPES_H 1
> | #define HAVE_SYS_STAT_H 1
> | #define HAVE_STDLIB_H 1
> | #define HAVE_STRING_H 1
> | #define HAVE_MEMORY_H 1
> | #define HAVE_STRINGS_H 1
> | #define HAVE_INTTYPES_H 1
> | #define HAVE_STDINT_H 1
> | #define HAVE_UNISTD_H 1
> | #define HAVE_DLFCN_H 1
> | #define LT_OBJDIR ".libs/"
> | #define HAVE_CXX11 1
> | #define HAVE_PTHREAD_PRIO_INHERIT 1
> | #define HAVE_PTHREAD 1
> | #define HAVE_LIBZ 1
> | #define HAVE_FTS_H 1
> | #define HAVE_APR_POOLS_H 1
> | #define HAVE_LIBAPR_1 1
> | #define HAVE_BOOST_VERSION_HPP 1
> | #define HAVE_LIBCURL 1
> | /* end confdefs.h.  */
> | #include <glog/logging.h>
> configure:21642: result: no
> configure:21642: checking for glog/logging.h
> configure:21642: result: no
> configure:21674: error: cannot find glog
> ---
> You have requested the use of a non-bundled glog but no suitable
> glog could be found.
> You may want specify the location of glog by providing a prefix
> path via --with-glog=DIR, or check that the path you provided is
> correct if you're already doing this.
> ---
> {noformat}





[jira] [Created] (MESOS-8217) Don't run linters on every commit

2017-11-13 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8217:
--

 Summary: Don't run linters on every commit
 Key: MESOS-8217
 URL: https://issues.apache.org/jira/browse/MESOS-8217
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


The mesos `pre-commit` hook currently runs several linters on the source 
code, some of which are even dynamically installed from the internet during a 
commit.

This can hinder development because it also applies to local commits that are 
never intended to be published, and it can quickly become annoying when 
rebasing old branches.

Instead, we should think about moving these checks into a separate 
`support/verify-reviews.py`, which would be executed when trying to post a 
review, since at that point the patches should be cleaned up and pass all 
linter checks.





[jira] [Created] (MESOS-8273) Incorrect master state due to fast agent re-registration

2017-11-28 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8273:
--

 Summary: Incorrect master state due to fast agent re-registration
 Key: MESOS-8273
 URL: https://issues.apache.org/jira/browse/MESOS-8273
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Currently, when a mesos agent attempts to reregister while a previous 
reregistration attempt is still ongoing, the new attempt is discarded and the 
old one is allowed to continue. This can leave the master in an inconsistent 
state if the agent gained new capabilities or a new version between restarts, 
since these are only present in the newer reregistration message.

Ideally, we should abort the old reregistration attempt and let the new one 
continue, but this requires some restructuring of the agent reregistration 
code path.
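
The "newest attempt wins" behaviour proposed here can be sketched with a small bookkeeping class. Everything in it, including the class name `ReregistrationTracker` and the use of a simple attempt counter, is a hypothetical illustration and does not reflect the structure of the master code.

```python
class ReregistrationTracker:
    """Keeps at most one in-flight reregistration attempt per agent."""

    def __init__(self):
        self._in_flight = {}  # agent id -> attempt number currently in progress
        self._next_attempt = 0

    def begin(self, agent_id):
        """Start a new attempt; any older attempt for the agent is superseded."""
        self._next_attempt += 1
        self._in_flight[agent_id] = self._next_attempt
        return self._next_attempt

    def complete(self, agent_id, attempt):
        """Return True only if `attempt` is still the newest one for the agent."""
        if self._in_flight.get(agent_id) != attempt:
            return False  # stale attempt: its result must be discarded
        del self._in_flight[agent_id]
        return True
```

With this scheme, the result of an older attempt is dropped when it finishes, so the master state is built only from the most recent reregistration message.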





[jira] [Updated] (MESOS-8303) Add user doc for agent reconfiguration

2017-12-07 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-8303:
---
Sprint: Mesosphere Sprint 70

> Add user doc for agent reconfiguration
> --
>
> Key: MESOS-8303
> URL: https://issues.apache.org/jira/browse/MESOS-8303
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Vinod Kone
>Assignee: Benno Evers
>






[jira] [Updated] (MESOS-8291) Add documentation about fault domains

2017-12-07 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-8291:
---
Sprint: Mesosphere Sprint 70

> Add documentation about fault domains
> -
>
> Key: MESOS-8291
> URL: https://issues.apache.org/jira/browse/MESOS-8291
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Vinod Kone
>Assignee: Benno Evers
>
> We need some user docs for fault domains.





[jira] [Updated] (MESOS-8245) SlaveRecoveryTest/0.ReconnectExecutor is flaky.

2017-12-13 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-8245:
---
  Sprint: Mesosphere Sprint 70
Story Points: 3

> SlaveRecoveryTest/0.ReconnectExecutor is flaky.
> ---
>
> Key: MESOS-8245
> URL: https://issues.apache.org/jira/browse/MESOS-8245
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 17.04
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>  Labels: flaky-test
> Attachments: ReconnectExecutor-badrun.txt, 
> ReconnectExecutor-goodrun.txt
>
>
> Observed it today in our CI. Logs attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8115) Add a master flag to disallow agents that are not configured with fault domain

2017-12-14 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290982#comment-16290982
 ] 

Benno Evers commented on MESOS-8115:


Review: https://reviews.apache.org/r/64507/

> Add a master flag to disallow agents that are not configured with fault domain
> --
>
> Key: MESOS-8115
> URL: https://issues.apache.org/jira/browse/MESOS-8115
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Benno Evers
>
> Once mesos masters and agents in a cluster are *all* upgraded to a version 
> where the fault domains feature is available, it is beneficial to enforce 
> that agents without a fault domain configured are not allowed to join the 
> cluster. 
> This is a safety net for operators who could forget to configure the fault 
> domain of a remote agent and let it join the cluster. If this happens, an 
> agent in a remote region will be considered a local agent by the master and 
> frameworks (because agent's fault domain is not configured) causing tasks to 
> potentially land in a remote agent which is undesirable.
> Note that this has to be a configurable flag and not enforced by default 
> because otherwise upgrades from a fault domain non-configured cluster to a 
> configured cluster will not be possible.





[jira] [Created] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

2017-12-15 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8336:
--

 Summary: MasterTest.RegistryUpdateAfterReconfiguration is flaky
 Key: MESOS-8336
 URL: https://issues.apache.org/jira/browse/MESOS-8336
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Observed here: 
https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/

The test here failed because the registry contained 2 slaves when it should 
have contained only one.

Looking through the log, everything seems normal (in particular, only 1 slave 
id appears throughout this test). The only thing out of the ordinary seems to 
be the agent sending two `RegisterSlaveMessage`s and two 
`ReregisterSlaveMessage`s. However, judging by the code that generates the 
random backoff factor in the slave, that seems to be more or less normal and 
shouldn't break the test.





[jira] [Created] (MESOS-8341) Agent can become stuck in (re-)registering state during upgrades

2017-12-18 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8341:
--

 Summary: Agent can become stuck in (re-)registering state during 
upgrades
 Key: MESOS-8341
 URL: https://issues.apache.org/jira/browse/MESOS-8341
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Currently, an agent will not be erased from the set of currently 
(re-)registering agents if

 - it tries to (re-)register with a malformed version string
 - it tries to (re-)register with a version smaller than the minimum supported 
version
 - it tries to (re-)register with a domain when the master has no domain 
configured
 - the operator marks the slave as gone while the (re-)registration is ongoing

Afterwards, all further (re-)registration attempts with the same agent id will 
be discarded, because the master still thinks that the original 
(re-)registration is ongoing.

Since the most realistic way to encounter this issue is during cluster 
upgrades, and it fixes itself with a master restart, it is unlikely to be 
reported externally.

Review: https://reviews.apache.org/r/64506
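
One way to avoid this class of bug is to tie the cleanup to scope exit, so 
that no early return can forget it. The following is only a minimal sketch 
under that assumption, not the change from the review above: `Registrar`, 
`Guard` and `tryRegister` are hypothetical names, and the real master handles 
registration asynchronously rather than within a single function call.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the master's bookkeeping of agents whose
// (re-)registration is in progress.
struct Registrar
{
  std::set<std::string> registering;

  // Erases `id` from `ids` on destruction, so every return path of
  // `tryRegister` below cleans up after itself.
  struct Guard
  {
    std::set<std::string>& ids;
    std::string id;
    ~Guard() { ids.erase(id); }
  };

  // Returns true if the attempt is admitted, false if it is rejected
  // or discarded as a duplicate of an attempt still in progress.
  bool tryRegister(
      const std::string& id,
      bool validVersion,
      bool domainAllowed)
  {
    if (!registering.insert(id).second) {
      return false; // An earlier attempt is still marked as ongoing.
    }

    Guard guard{registering, id};

    if (!validVersion) {
      return false; // Malformed or unsupported version.
    }

    if (!domainAllowed) {
      return false; // Agent has a domain, master has none.
    }

    return true;
  }
};
```

With the guard in place, a rejected attempt leaves `registering` empty, so a 
later attempt with the same agent id is no longer discarded.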





[jira] [Commented] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-04 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311352#comment-16311352
 ] 

Benno Evers commented on MESOS-8391:


I could confirm this behaviour with mesos 1.5 on a DC/OS 1.11 cluster.

For case (2), while the system state eventually returns to normal and marathon 
correctly re-schedules the two tasks, the original task seems to stay in the 
`TASK_KILLING` state indefinitely.

From a quick look at the logs, the agent gets as far as "Checkpointing 
termination state to nested container's runtime directory", but never attempts 
to destroy the parent container afterwards. I'm currently looking at the 
container destruction code path to see what the expected behaviour would be.

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Priority: Critical
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 1}}.
> # Restart the Mesos agent on the node the pod got launched.
> # Kill one of the pod tasks
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks got crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks got crashed, 
> forwards the corresponding status update, but the other task stays in 
> `TASK_KILLING` state forever.
> Please note, that after another agent restart, the other tasks gets finally 
> killed and the correct status updates get propagated all the way to Marathon.





[jira] [Commented] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-04 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311619#comment-16311619
 ] 

Benno Evers commented on MESOS-8391:


Attached a log covering the following timeline:

Jan 04 11:30:06 <- Started two sleep tasks
Jan 04 11:33:14 <- Agent restart
Jan 04 11:33:53 <- Killed one of the tasks
Jan 04 11:35:08 <- Second agent restart

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Priority: Critical
>





[jira] [Updated] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-04 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-8391:
---
Attachment: agent.log.gz

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Priority: Critical
> Attachments: agent.log.gz
>
>





[jira] [Commented] (MESOS-8359) Health checks are flapping for all tasks on the slave if one task has no enough resources to run

2018-01-08 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316519#comment-16316519
 ] 

Benno Evers commented on MESOS-8359:


From what I gather, the following conditions need to be met to reproduce:

- The other tasks on the slave need to be health-checked by a `COMMAND`-type 
health check
- Docker executor must be used for all launched tasks

I'm also wondering which command was actually used for the command health 
check, and if the executor and/or master logs at the time the bug is observed 
show anything interesting? 

Finally, since I'm not very experienced with Marathon, can you give some more 
details on what exactly it means to "create a marathon application from your 
image"?

> Health checks are flapping for all tasks on the slave if one task has no 
> enough resources to run
> 
>
> Key: MESOS-8359
> URL: https://issues.apache.org/jira/browse/MESOS-8359
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.2
>Reporter: Viacheslav Valyavskiy
> Attachments: logs2
>
>
> I have attached some logs from the affected 
> slave(newappmv_qagame_testapp.green_csahttp - name of the 'bad' application)
> Steps to reproduce:
> 1. Run multiple tasks on the slave
> 2. Create marathon application from our image ( docker pull 
> vvalyavskiy/csa-http ) and set memory limit to 16MB for it.
> 3. Wait some time and then observe flapping of all tasks on the slave where 
> our task is started





[jira] [Commented] (MESOS-8410) Reconfiguration policy fails to handle mount disk resources.

2018-01-10 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320849#comment-16320849
 ] 

Benno Evers commented on MESOS-8410:


The issue was caused by incorrect handling of multiple resources with the 
same name. I've opened a review with a fix at 
https://reviews.apache.org/r/65074/
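
To illustrate the failure mode: if scalar resources are compared by name 
only, the 868000 work directory disk and the 183 mount disks all collapse 
into one `disk` entry. The sketch below shows one possible way to keep them 
apart by keying on the mount root as well. It is not the code from the 
review; `ScalarResource`, `totals` and `isAdditive` are hypothetical names, 
and the real `Resource` message has many more fields.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified, hypothetical model of a scalar resource.
struct ScalarResource
{
  std::string name;
  std::string mountRoot; // Empty for the work directory disk.
  double value;
};

using Key = std::pair<std::string, std::string>;

// Sums scalar values per (name, mount root) instead of per name alone.
std::map<Key, double> totals(const std::vector<ScalarResource>& resources)
{
  std::map<Key, double> result;
  for (const ScalarResource& r : resources) {
    result[{r.name, r.mountRoot}] += r.value;
  }
  return result;
}

// Returns true if no resource in `before` shrank or vanished in `after`.
bool isAdditive(
    const std::vector<ScalarResource>& before,
    const std::vector<ScalarResource>& after)
{
  const std::map<Key, double> newTotals = totals(after);
  for (const auto& entry : totals(before)) {
    auto it = newTotals.find(entry.first);
    if (it == newTotals.end() || it->second < entry.second) {
      return false;
    }
  }
  return true;
}
```

Because the key includes the mount root, the order in which the resources 
are listed no longer affects the comparison.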

> Reconfiguration policy fails to handle mount disk resources.
> 
>
> Key: MESOS-8410
> URL: https://issues.apache.org/jira/browse/MESOS-8410
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Assignee: Benno Evers
>
> We deployed {{--reconfiguration_policy="additive"}} on a number of Mesos 
> agents that had mount disk resources configured, and it looks like the agent 
> confused the size of the mount disk with the size of the work directory 
> resource:
> {noformat}
> E0106 01:54:15.000123 1310889 slave.cpp:6733] EXIT with status 1: Failed to 
> perform recovery: Configuration change not permitted under 'additive' policy: 
> Value of scalar resource 'disk' decreased from 183 to 868000
> {noformat}
> The {{--resources}} flag is
> {noformat}
> --resources="[
>   {
>     "name": "disk",
>     "type": "SCALAR",
>     "scalar": {
>       "value": 868000
>     }
>   },
>   {
>     "name": "disk",
>     "type": "SCALAR",
>     "scalar": {
>       "value": 183
>     },
>     "disk": {
>       "source": {
>         "type": "MOUNT",
>         "mount": {
>           "root": "/srv/mesos/volumes/a"
>         }
>       }
>     }
>   },
>   {
>     "name": "disk",
>     "type": "SCALAR",
>     "scalar": {
>       "value": 183
>     },
>     "disk": {
>       "source": {
>         "type": "MOUNT",
>         "mount": {
>           "root": "/srv/mesos/volumes/b"
>         }
>       }
>     }
>   },
>   {
>     "name": "disk",
>     "type": "SCALAR",
>     "scalar": {
>       "value": 183
>     },
>     "disk": {
>       "source": {
>         "type": "MOUNT",
>         "mount": {
>           "root": "/srv/mesos/volumes/c"
>         }
>       }
>     }
>   },
>   {
>     "name": "disk",
>     "type": "SCALAR",
>     "scalar": {
>       "value": 183
>     },
>     "disk": {
>       "source": {
>         "type": "MOUNT",
>         "mount": {
>           "root": "/srv/mesos/volumes/d"
>         }
>       }
>     }
>   },
>   {
>     "name": "disk",
>     "type": "SCALAR",
>     "scalar": {
>       "value": 183
>     },
>     "disk": {
>       "source": {
>         "type": "MOUNT",
>         "mount": {
>           "root": "/srv/mesos/volumes/e"
>         }
>       }
>     }
>   },
>   {
>     "name": "disk",
>     "type": "SCALAR",
>     "scalar": {
>       "value": 183
>     },
>     "disk": {
>       "source": {
>         "type": "MOUNT",
>         "mount": {
>           "root": "/srv/mesos/volumes/f"
>         }
>       }
>     }
>   },
>   {
>     "name": "disk",
>     "type": "SCALAR",
>     "scalar": {
>       "value": 183
>     },
>     "disk": {
>       "source": {
>         "type": "MOUNT",
>         "mount": {
>           "root": "/srv/mesos/volumes/g"
>         }
>       }
>     }
>   },
>   {
>     "name": "disk",
>     "type": "SCALAR",
>     "scalar": {
>       "value": 183
>     },
>     "disk": {
>       "source": {
>         "type": "MOUNT",
>         "mount": {
>           "root": "/srv/mesos/volumes/h"
>         }
>       }
>     }
>   }
> ]
> {noformat}





[jira] [Updated] (MESOS-8410) Reconfiguration policy fails to handle mount disk resources.

2018-01-11 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-8410:
---
Priority: Blocker  (was: Major)

> Reconfiguration policy fails to handle mount disk resources.
> 
>
> Key: MESOS-8410
> URL: https://issues.apache.org/jira/browse/MESOS-8410
> Project: Mesos
>  Issue Type: Bug
>Reporter: James Peach
>Assignee: Benno Evers
>Priority: Blocker
>





[jira] [Updated] (MESOS-7944) Implement jemalloc support for Mesos

2018-01-11 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-7944:
---
Sprint: Mesosphere Sprint 63, Mesosphere Sprint 65, Mesosphere Sprint 66, 
Mesosphere Sprint 67, Mesosphere Sprint 68, Mesosphere Sprint 72  (was: 
Mesosphere Sprint 63, Mesosphere Sprint 65, Mesosphere Sprint 66, Mesosphere 
Sprint 67, Mesosphere Sprint 68)

> Implement jemalloc support for Mesos
> 
>
> Key: MESOS-7944
> URL: https://issues.apache.org/jira/browse/MESOS-7944
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>  Labels: mesosphere
>
> After investigation in MESOS-7876 and discussion on the mailing list, this 
> task is for tracking progress on adding out-of-the-box memory profiling 
> support using jemalloc to Mesos.





[jira] [Commented] (MESOS-6238) SSL / libevent support broken in IPv6 patch from https://github.com/lava/mesos/tree/bennoe/ipv6

2016-10-12 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569370#comment-15569370
 ] 

Benno Evers commented on MESOS-6238:


Hm, the `url` seems pretty random. I can't remember putting it there for a 
specific reason, so I guess it's some merge artifact from a previous revision.

I pushed a new commit to github (d2d122ab057c93e9136577db5030f9976eb623c3) 
which fixes this issue; at least for me, mesos now builds with --enable-ssl on 
ubuntu trusty and xenial.

> SSL / libevent support broken in IPv6 patch from 
> https://github.com/lava/mesos/tree/bennoe/ipv6
> ---
>
> Key: MESOS-6238
> URL: https://issues.apache.org/jira/browse/MESOS-6238
> Project: Mesos
>  Issue Type: Bug
>Reporter: Lukas Loesche
>Assignee: Benno Evers
>
> Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit 
> 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9 
> make fails when configure options --enable-ssl --enable-libevent were given.
> Error message:
> {noformat}
> ...
> ...
> ../../../3rdparty/libprocess/src/process.cpp: In member function ‘void 
> process::SocketManager::link_connect(const process::Future&, 
> process::network::Socket, const process::UPID&)’:
> ../../../3rdparty/libprocess/src/process.cpp:1457:25: error: ‘url’ was not 
> declared in this scope
>Try ip = url.ip;
>  ^
> Makefile:997: recipe for target 'libprocess_la-process.lo' failed
> make[5]: *** [libprocess_la-process.lo] Error 1
> ...
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6237) Agent Sandbox inaccessible when using IPv6 address in patch from https://github.com/lava/mesos/tree/bennoe/ipv6

2016-10-12 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569389#comment-15569389
 ] 

Benno Evers commented on MESOS-6237:


Hm, one place that definitely needs to be fixed is in master/http/http.cpp:

Try<string> hostname = info.has_hostname()
  ? info.hostname()
  : net::getHostname(net::IP(ntohl(info.ip())));

However, this shouldn't affect the agent display if I understand the code 
correctly.

Can I ask how you are getting a raw IP displayed in the mesos UI anyways? I 
found it hard to start an agent for testing purposes without mesos figuring out 
the hostname automatically, 


> Agent Sandbox inaccessible when using IPv6 address in patch from 
> https://github.com/lava/mesos/tree/bennoe/ipv6
> ---
>
> Key: MESOS-6237
> URL: https://issues.apache.org/jira/browse/MESOS-6237
> Project: Mesos
>  Issue Type: Bug
>Reporter: Lukas Loesche
>Assignee: Benno Evers
>
> Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit 
> 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9
> When using IPs instead of hostnames the Agent Sandbox is inaccessible in the 
> Web UI. The problem seems to be that there's no brackets around the IP so it 
> tries to access e.g. http://2001:41d0:1000:ab9:::5051 instead of 
> http://[2001:41d0:1000:ab9::]:5051
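
The bracket rule from the description can be sketched as follows. This is an 
illustration only: `hostPort` is a hypothetical helper, not an existing Mesos 
or stout function, and it is written in C++ for consistency with the rest of 
this thread; equivalent logic would be needed wherever the sandbox URL is 
actually assembled. It assumes that any host containing a colon is an IPv6 
literal, since neither hostnames nor IPv4 literals contain one.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: formats "host:port", wrapping IPv6 literals in
// brackets so the port separator stays unambiguous.
std::string hostPort(const std::string& host, int port)
{
  const bool ipv6 = host.find(':') != std::string::npos;
  return (ipv6 ? "[" + host + "]" : host) + ":" + std::to_string(port);
}
```

For the address above, `hostPort("2001:41d0:1000:ab9::", 5051)` yields 
`[2001:41d0:1000:ab9::]:5051`, while hostnames and IPv4 addresses are left 
without brackets.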





[jira] [Comment Edited] (MESOS-6237) Agent Sandbox inaccessible when using IPv6 address in patch from https://github.com/lava/mesos/tree/bennoe/ipv6

2016-10-12 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15569389#comment-15569389
 ] 

Benno Evers edited comment on MESOS-6237 at 10/12/16 5:46 PM:
--

So, one place that definitely needs to be fixed is in master/http/http.cpp:

Try<string> hostname = info.has_hostname()
  ? info.hostname()
  : net::getHostname(net::IP(ntohl(info.ip())));

However, this shouldn't affect the agent display if I understand the code 
correctly.

Can I ask how you are getting a raw IP displayed in the mesos UI anyways? I 
found it hard to start an agent for testing purposes without mesos figuring out 
the hostname automatically, 



was (Author: bennoe):
Hm, one place that definitely needs to be fixed is in master/http/http.cpp:

Try<string> hostname = info.has_hostname()
  ? info.hostname()
  : net::getHostname(net::IP(ntohl(info.ip())));

However, this shouldn't affect the agent display if I understand the code 
correctly.

Can I ask how you are getting a raw IP displayed in the mesos UI anyways? I 
found it hard to start an agent for testing purposes without mesos figuring out 
the hostname automatically, 


> Agent Sandbox inaccessible when using IPv6 address in patch from 
> https://github.com/lava/mesos/tree/bennoe/ipv6
> ---
>
> Key: MESOS-6237
> URL: https://issues.apache.org/jira/browse/MESOS-6237
> Project: Mesos
>  Issue Type: Bug
>Reporter: Lukas Loesche
>Assignee: Benno Evers
>
> Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit 
> 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9
> When using IPs instead of hostnames the Agent Sandbox is inaccessible in the 
> Web UI. The problem seems to be that there's no brackets around the IP so it 
> tries to access e.g. http://2001:41d0:1000:ab9:::5051 instead of 
> http://[2001:41d0:1000:ab9::]:5051





[jira] [Created] (MESOS-4606) Add IPv6 support to net::IP and net::IPNetwork

2016-02-05 Thread Benno Evers (JIRA)
Benno Evers created MESOS-4606:
--

 Summary: Add IPv6 support to net::IP and net::IPNetwork
 Key: MESOS-4606
 URL: https://issues.apache.org/jira/browse/MESOS-4606
 Project: Mesos
  Issue Type: Improvement
  Components: stout
Reporter: Benno Evers
Assignee: Benno Evers
Priority: Minor


The classes net::IP and net::IPNetwork should be able to store IPv6 
addresses.





[jira] [Commented] (MESOS-4606) Add IPv6 support to net::IP and net::IPNetwork

2016-09-07 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470145#comment-15470145
 ] 

Benno Evers commented on MESOS-4606:


Yes, an implementation is available at 
https://github.com/lava/mesos/commit/8b83489a5cd5e3fe81c98cae3dfe58a7e945376f

There were no shepherds willing to take on this task. Maybe this will change 
after a design document for the bigger issue (IPv6 support in mesos) is 
finished, which should be ready in the next few days to weeks.

> Add IPv6 support to net::IP and net::IPNetwork
> --
>
> Key: MESOS-4606
> URL: https://issues.apache.org/jira/browse/MESOS-4606
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Minor
>  Labels: network, stout
>
> The classes net::IP and net::IPNetwork should be able to store IPv6 
> addresses.





[jira] [Assigned] (MESOS-6237) Agent Sandbox inaccessible when using IPv6 address in patch from https://github.com/lava/mesos/tree/bennoe/ipv6

2016-09-26 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-6237:
--

Assignee: Benno Evers

> Agent Sandbox inaccessible when using IPv6 address in patch from 
> https://github.com/lava/mesos/tree/bennoe/ipv6
> ---
>
> Key: MESOS-6237
> URL: https://issues.apache.org/jira/browse/MESOS-6237
> Project: Mesos
>  Issue Type: Bug
>Reporter: Lukas Loesche
>Assignee: Benno Evers
>
> Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit 
> 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9
> When using IPs instead of hostnames the Agent Sandbox is inaccessible in the 
> Web UI. The problem seems to be that there's no brackets around the IP so it 
> tries to access e.g. http://2001:41d0:1000:ab9:::5051 instead of 
> http://[2001:41d0:1000:ab9::]:5051





[jira] [Assigned] (MESOS-6238) SSL / libevent support broken in IPv6 patch from https://github.com/lava/mesos/tree/bennoe/ipv6

2016-10-06 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-6238:
--

Assignee: Benno Evers

> SSL / libevent support broken in IPv6 patch from 
> https://github.com/lava/mesos/tree/bennoe/ipv6
> ---
>
> Key: MESOS-6238
> URL: https://issues.apache.org/jira/browse/MESOS-6238
> Project: Mesos
>  Issue Type: Bug
>Reporter: Lukas Loesche
>Assignee: Benno Evers
>
> Affects https://github.com/lava/mesos/tree/bennoe/ipv6 at commit 
> 2199a24c0b7a782a0381aad8cceacbc95ec3d5c9 
> make fails when configure options --enable-ssl --enable-libevent were given.
> Error message:
> {noformat}
> ...
> ...
> ../../../3rdparty/libprocess/src/process.cpp: In member function ‘void 
> process::SocketManager::link_connect(const process::Future&, 
> process::network::Socket, const process::UPID&)’:
> ../../../3rdparty/libprocess/src/process.cpp:1457:25: error: ‘url’ was not 
> declared in this scope
>Try ip = url.ip;
>  ^
> Makefile:997: recipe for target 'libprocess_la-process.lo' failed
> make[5]: *** [libprocess_la-process.lo] Error 1
> ...
> ...
> {noformat}





[jira] [Assigned] (MESOS-243) driver stop() should block until outstanding requests have been persisted

2016-06-07 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-243:
-

Assignee: Benno Evers

> driver stop() should block until outstanding requests have been persisted
> -
>
> Key: MESOS-243
> URL: https://issues.apache.org/jira/browse/MESOS-243
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 0.14.1, 
> 0.14.2, 0.15.0
>Reporter: brian wickman
>Assignee: Benno Evers
>
> in our executor, we send a terminal status update message and immediately 
> call driver.stop().  it turns out that the status update is dispatched 
> asynchronously and races with driver shutdown, causing tasks to instead 
> periodically go into LOST state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-243) driver stop() should block until outstanding requests have been persisted

2016-06-13 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-243:
--
Assignee: Vladimir Petrovic  (was: Benno Evers)

> driver stop() should block until outstanding requests have been persisted
> -
>
> Key: MESOS-243
> URL: https://issues.apache.org/jira/browse/MESOS-243
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 0.14.1, 
> 0.14.2, 0.15.0
>Reporter: brian wickman
>Assignee: Vladimir Petrovic
>
> in our executor, we send a terminal status update message and immediately 
> call driver.stop().  it turns out that the status update is dispatched 
> asynchronously and races with driver shutdown, causing tasks to instead 
> periodically go into LOST state.





[jira] [Created] (MESOS-8450) SlaveInfo comparison is unnecessarily expensive

2018-01-17 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8450:
--

 Summary: SlaveInfo comparison is unnecessarily expensive
 Key: MESOS-8450
 URL: https://issues.apache.org/jira/browse/MESOS-8450
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Currently, the comparison operator of `struct SlaveInfo` creates two 
temporary `Resources` objects and two temporary `Attributes` objects. All of 
these constructors do a bunch of work and allocate memory.

Instead of passing around `SlaveInfo` in the master, we should probably use a 
wrapper that stores the raw message and caches the lazily generated 
`Resources` and `Attributes` objects associated with that `SlaveInfo`.
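
A rough sketch of such a wrapper, using stand-in types rather than the real 
Mesos classes: `RawInfo` plays the role of the `SlaveInfo` message, `Parsed` 
the role of an expensive derived object such as `Resources`, and `CachedInfo` 
is a hypothetical name for the wrapper. The cache is not thread-safe; that is 
an assumption of this sketch.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Stand-in for the raw message.
struct RawInfo
{
  std::string resources;
};

// Stand-in for an object that is expensive to construct.
struct Parsed
{
  explicit Parsed(const std::string& s) : text(s) {}
  std::string text;
};

class CachedInfo
{
public:
  explicit CachedInfo(RawInfo info) : raw(std::move(info)) {}

  // Builds the derived object on first use and reuses it afterwards.
  const Parsed& parsed() const
  {
    if (!cache) {
      cache.reset(new Parsed(raw.resources));
      ++builds;
    }
    return *cache;
  }

  // Number of times the derived object had to be constructed.
  int buildCount() const { return builds; }

private:
  RawInfo raw;
  mutable std::unique_ptr<Parsed> cache;
  mutable int builds = 0;
};

// Comparison reuses the cached objects instead of creating temporaries.
bool operator==(const CachedInfo& left, const CachedInfo& right)
{
  return left.parsed().text == right.parsed().text;
}
```

Repeated comparisons of the same two wrappers then construct each derived 
object once instead of once per comparison.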



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8451) Unhandled Interference between registration and reregistration code paths

2018-01-17 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8451:
--

 Summary: Unhandled Interference between registration and 
reregistration code paths
 Key: MESOS-8451
 URL: https://issues.apache.org/jira/browse/MESOS-8451
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Right now, the code paths for agent registration and agent re-registration run 
independent of each other, probably on the assumption that re-registration 
requires an agent ID from the master which is only given out after successful 
registration, so the code paths cannot interfere.

 

However, it is not so hard to construct some examples where this fails, e.g.

 

- Agent sends out registration message 1

- Timeout expires, agent sends out registration message 2

- Agent gets the reply to registration message 1, updates agent id, is restarted

- Agent sends reregistration message 1 after restart

 

 

Most likely, a proper solution will require introducing some kind of counter 
or uuid in the (re-)registration messages, which is also required for proper 
handling of multiple reregistration messages as described in MESOS-8273.
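
To illustrate the counter idea, here is a rough agent-side sketch. `RegistrationReply` and `RegistrationTracker` are made-up names, and "only accept the reply to the most recent attempt" is just one possible policy, not something the Mesos protocol does today.

```cpp
#include <cstdint>

// Hypothetical reply that echoes the attempt number it answers.
struct RegistrationReply {
  std::uint64_t attempt;
};

// Tracks outgoing (re-)registration attempts on the agent side.
class RegistrationTracker {
public:
  // Called whenever a (re-)registration message is sent; the returned
  // number would be embedded in the outgoing message.
  std::uint64_t nextAttempt() { return ++latest_; }

  // A reply is only acted upon if it answers the most recent attempt;
  // replies to earlier, superseded attempts are treated as stale.
  bool accept(const RegistrationReply& reply) const {
    return latest_ != 0 && reply.attempt == latest_;
  }

private:
  std::uint64_t latest_ = 0;
};
```

With something like this, the reply to registration message 1 in the example above could be recognized as stale once message 2 has been sent. The master would need a matching check to drop superseded requests.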





[jira] [Created] (MESOS-8452) Prevent zero-length timeout for exponential backoff

2018-01-17 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8452:
--

 Summary: Prevent zero-length timeout for exponential backoff
 Key: MESOS-8452
 URL: https://issues.apache.org/jira/browse/MESOS-8452
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


The current implementation of exponential backoff for registration attempts in 
the agent seems to have a high probability of generating zero-length timeouts, 
producing registration attempts that the master has no chance of responding to 
in time.

 

Most likely, a minimum time between attempts should be introduced.
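
A sketch of randomized exponential backoff with a lower bound, to show the idea. This is not the agent's current code; `backoffDelay` and its parameters are assumptions made for the example.

```cpp
#include <algorithm>
#include <chrono>
#include <random>

using Millis = std::chrono::milliseconds;

// Picks a random delay in [minimum, upper], where `upper` doubles with
// every attempt starting from `base`, is capped at `cap`, and is never
// smaller than `minimum`. Assumes non-negative durations.
template <typename Rng>
Millis backoffDelay(
    unsigned attempt, Millis base, Millis cap, Millis minimum, Rng& rng) {
  Millis upper = base;
  for (unsigned i = 0; i < attempt && upper < cap; ++i) {
    upper *= 2;
  }
  upper = std::min(upper, cap);
  upper = std::max(upper, minimum);

  // The lower bound of the distribution is what rules out zero-length
  // (or near-zero) timeouts.
  std::uniform_int_distribution<long long> dist(
      minimum.count(), upper.count());
  return Millis(dist(rng));
}
```

For example, `backoffDelay(0, Millis(10), Millis(1000), Millis(5), rng)` with a `std::mt19937 rng` yields a delay between 5ms and 10ms, never 0ms.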





[jira] [Created] (MESOS-8482) Signed/Unsigned comparisons in tests

2018-01-24 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8482:
--

 Summary: Signed/Unsigned comparisons in tests
 Key: MESOS-8482
 URL: https://issues.apache.org/jira/browse/MESOS-8482
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


Many tests in mesos currently have comparisons between signed and unsigned 
integers, e.g.
{noformat}
    ASSERT_EQ(4, v1Response->read_file().size());
{noformat}
or comparisons between values of different enums, e.g. TaskState and 
v1::TaskState:
{noformat}
  ASSERT_EQ(TASK_STARTING, startingUpdate->status().state());
{noformat}
Usually, the compiler would catch these and emit a warning, but these are 
currently silenced because gtest headers are included using the `-isystem` 
command line flag.





[jira] [Assigned] (MESOS-8485) MasterTest.RegistryGcByCount is flaky

2018-01-24 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-8485:
--

Assignee: Benno Evers

> MasterTest.RegistryGcByCount is flaky
> -
>
> Key: MESOS-8485
> URL: https://issues.apache.org/jira/browse/MESOS-8485
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed this while testing Mesos 1.5.0-rc1 in ASF CI.
>  
> {code}
> 3: [ RUN      ] MasterTest.RegistryGcByCount
> ..snip...
> 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master
> 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master 
> master@172.17.0.2:45634
> 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 
> authenticatee
> 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client 
> SASL connection
> 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating 
> slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication 
> session for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL 
> connection
> 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL 
> authentication mechanisms: CRAM-MD5
> 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to 
> authenticate with mechanism 'CRAM-MD5'
> 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL 
> authentication start
> 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires 
> more steps
> 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL 
> authentication step
> 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL 
> authentication step
> 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: false 
> 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*userPassword'
> 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*cmusaslsecretCRAM-MD5'
> 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: true 
> 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*userPassword' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934324 15987 authenticator.cpp:318] Authentication success
> 3: I0123 19:22:05.934463 15995 authenticatee.cpp:299] Authentication success
> 3: I0123 19:22:05.934563 16002 master.cpp:8988] Successfully authenticated 
> principal 'test-principal' at slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.934708 15993 authenticator.cpp:432] Authentication session 
> cleanup for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.934891 15995 slave.cpp:1320] Successfully authenticated 
> with master master@172.17.0.2:45634
> 3: I0123 19:22:05.935261 15995 slave.cpp:1764] Will retry registration in 
> 2.234083ms if necessary
> 3: I0123 19:22:05.935436 15999 master.cpp:6061] Received register agent 
> message from slave(442)@172.17.0.2:45634 (455912973e2c)
> 3: I0123 19:22:05.935662 15999 master.cpp:3867] Authorizing agent with 
> principal 'test-principal'
> 3: I0123 19:22:05.936161 15992 master.cpp:6123] Authorized registration of 
> agent at slave(442)@172.17.0.2:45634 (455912973e2c)
> 3: I0123 19:22:05.936261 15992 master.cpp:6234] Registering agent at 
> slave(442)@172.17.0.2:45634 (455912973e2c) with id 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0
> 3: I0123 19:22:05.936993 15989 registrar.cpp:495] Applied 1 operations in 
> 227911ns; attempting to update the registry
> 3: I0123 19:22:05.937814 15989 registrar.cpp:552] Successfully updated the 
> registry in 743168ns
> 3: I0123 19:22:05.938057 15991 master.cpp:6282] Admitted agent 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 
> (455912973e2c)
> 3: I0123 19:22:05.938891 15991 master.cpp:6331] Registered agent 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 
> (455912973e2c) with cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> 3: I0123 19:22:05.939159 16002 slave.cpp:1764] Will retry registration in 
> 26.332876ms if necessary
>

[jira] [Commented] (MESOS-8484) stout test NumifyTest.HexNumberTest fails.

2018-01-24 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338220#comment-16338220
 ] 

Benno Evers commented on MESOS-8484:


In boost 1.53, lexical_cast implements its own parser that doesn't handle the 
'0x' prefix, therefore parsing the two strings in the test would return an 
error.

 

In boost 1.65, lexical_cast calls std::istream::operator>>, which on mac (i.e. 
using libc++) can successfully parse strings of the form "0x10.9" or "0x1p-5", 
and returns the correct number. On linux platforms (i.e. using libstdc++), 
std::istream::operator>> is not able to parse these strings and thus returns an 
error.

 

The function stout::numify wants to achieve platform independence by forbidding 
these kinds of literals on all platforms. However, the checks are only 
happening *after* boost was already given the chance to parse the string, which 
has platform-dependent behaviour.
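
A rough sketch of doing the check first, so that the platform-dependent parser never sees such input. `hasHexPrefix` and `parseDecimal` are hypothetical names and this is not the stout implementation. Note that it rejects every "0x"-prefixed string, which would be too strict if hexadecimal integers are supposed to stay valid.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// True if the string starts with an optional sign followed by "0x"/"0X".
inline bool hasHexPrefix(const std::string& s) {
  const std::size_t i =
      (!s.empty() && (s[0] == '-' || s[0] == '+')) ? 1 : 0;
  return s.size() >= i + 2 && s[i] == '0' &&
         (s[i + 1] == 'x' || s[i + 1] == 'X');
}

// Parses a decimal floating point number. Returns false on error.
inline bool parseDecimal(const std::string& s, double* out) {
  // Decide *before* parsing, so the result does not depend on how the
  // standard library treats hexadecimal floating point input.
  if (hasHexPrefix(s)) {
    return false;
  }

  std::istringstream in(s);
  double value = 0;
  if (!(in >> value)) {
    return false;
  }

  // Reject trailing characters after the number.
  char rest;
  if (in >> rest) {
    return false;
  }

  *out = value;
  return true;
}
```

With this ordering, `parseDecimal("0x10.9", &d)` and `parseDecimal("0x1p-5", &d)` fail regardless of whether libc++ or libstdc++ is used.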

> stout test NumifyTest.HexNumberTest fails. 
> ---
>
> Key: MESOS-8484
> URL: https://issues.apache.org/jira/browse/MESOS-8484
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.6.0
> Environment: macOS 10.13.2 (17C88)
> Apple LLVM version 9.0.0 (clang-900.0.37)
> ../configure && make check -j6
>Reporter: Till Toenshoff
>Assignee: Benjamin Bannier
>Priority: Blocker
>
> The current Mesos master shows the following on my machine:
> {noformat}
> [ RUN  ] NumifyTest.HexNumberTest
> ../../../3rdparty/stout/tests/numify_tests.cpp:57: Failure
> Value of: numify("0x10.9").isError()
>   Actual: false
> Expected: true
> ../../../3rdparty/stout/tests/numify_tests.cpp:58: Failure
> Value of: numify("0x1p-5").isError()
>   Actual: false
> Expected: true
> [  FAILED  ] NumifyTest.HexNumberTest (0 ms)
> {noformat}
> This problem disappears for me when reverting the latest boost upgrade.





[jira] [Commented] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)

2018-01-25 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339357#comment-16339357
 ] 

Benno Evers commented on MESOS-7699:


Updated review chain after fixing a bunch of other issues blocking this: 
https://reviews.apache.org/r/62447/

> "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable 
> freshly released)
> ---
>
> Key: MESOS-7699
> URL: https://issues.apache.org/jira/browse/MESOS-7699
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2.0
>Reporter: Adam Cecile
>Assignee: Benno Evers
>Priority: Major
>  Labels: autotools
>
> Hi,
> It seems the issue comes from a workaround added a while ago:
> https://reviews.apache.org/r/40326/
> https://reviews.apache.org/r/40327/
> When building with external libraries it turns out creating build commands 
> line with -isystem /usr/include which is clearly stated as being wrong, 
> according to GCC guys:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70129
> I'll do some testing by reverting all -isystem to -I and I'll let it know if 
> it gets built.
> Regards, Adam.
> {noformat}
> configure:21642: result: no
> configure:21642: checking glog/logging.h presence
> configure:21642: g++ -E -I/usr/include -I/usr/include/apr-1 
> -I/usr/include/apr-1.0 -Wdate-time -D_FORTIFY_SOURCE=2 -isystem /usr/include 
> -I/usr/include conftest.cpp
> In file included from /usr/include/c++/6/ext/string_conversions.h:41:0,
>  from /usr/include/c++/6/bits/basic_string.h:5417,
>  from /usr/include/c++/6/string:52,
>  from /usr/include/c++/6/bits/locale_classes.h:40,
>  from /usr/include/c++/6/bits/ios_base.h:41,
>  from /usr/include/c++/6/ios:42,
>  from /usr/include/c++/6/ostream:38,
>  from /usr/include/glog/logging.h:43,
>  from conftest.cpp:32:
> /usr/include/c++/6/cstdlib:75:25: fatal error: stdlib.h: No such file or 
> directory
>  #include_next <stdlib.h>
>  ^
> compilation terminated.
> configure:21642: $? = 1
> configure: failed program was:
> | /* confdefs.h */
> | #define PACKAGE_NAME "mesos"
> | #define PACKAGE_TARNAME "mesos"
> | #define PACKAGE_VERSION "1.2.0"
> | #define PACKAGE_STRING "mesos 1.2.0"
> | #define PACKAGE_BUGREPORT ""
> | #define PACKAGE_URL ""
> | #define PACKAGE "mesos"
> | #define VERSION "1.2.0"
> | #define STDC_HEADERS 1
> | #define HAVE_SYS_TYPES_H 1
> | #define HAVE_SYS_STAT_H 1
> | #define HAVE_STDLIB_H 1
> | #define HAVE_STRING_H 1
> | #define HAVE_MEMORY_H 1
> | #define HAVE_STRINGS_H 1
> | #define HAVE_INTTYPES_H 1
> | #define HAVE_STDINT_H 1
> | #define HAVE_UNISTD_H 1
> | #define HAVE_DLFCN_H 1
> | #define LT_OBJDIR ".libs/"
> | #define HAVE_CXX11 1
> | #define HAVE_PTHREAD_PRIO_INHERIT 1
> | #define HAVE_PTHREAD 1
> | #define HAVE_LIBZ 1
> | #define HAVE_FTS_H 1
> | #define HAVE_APR_POOLS_H 1
> | #define HAVE_LIBAPR_1 1
> | #define HAVE_BOOST_VERSION_HPP 1
> | #define HAVE_LIBCURL 1
> | /* end confdefs.h.  */
> | #include <glog/logging.h>
> configure:21642: result: no
> configure:21642: checking for glog/logging.h
> configure:21642: result: no
> configure:21674: error: cannot find glog
> ---
> You have requested the use of a non-bundled glog but no suitable
> glog could be found.
> You may want specify the location of glog by providing a prefix
> path via --with-glog=DIR, or check that the path you provided is
> correct if you're already doing this.
> ---
> {noformat}





[jira] [Comment Edited] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)

2018-01-25 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339357#comment-16339357
 ] 

Benno Evers edited comment on MESOS-7699 at 1/25/18 3:25 PM:
-

Updated review chain after fixing a bunch of other issues blocking this:

 

[https://reviews.apache.org/r/62447/]

[https://reviews.apache.org/r/65289/]

[https://reviews.apache.org/r/65290/]

 


was (Author: bennoe):
Updated review chain after fixing a bunch of other issues blocking this: 
https://reviews.apache.org/r/62447/

> "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable 
> freshly released)
> ---
>
> Key: MESOS-7699
> URL: https://issues.apache.org/jira/browse/MESOS-7699
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2.0
>Reporter: Adam Cecile
>Assignee: Benno Evers
>Priority: Major
>  Labels: autotools
>


[jira] [Commented] (MESOS-8485) MasterTest.RegistryGcByCount is flaky

2018-01-26 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341216#comment-16341216
 ] 

Benno Evers commented on MESOS-8485:


This is fairly reproducible when putting the test machine under heavy load 
(i.e. ca. 1 failure per 3000 runs when I'm compiling Mesos with 24 threads at 
the same time)

 

What happens is the following:

The test case is starting two different instances of `mesos-agent`, marking 
both of them as gone, and forcing one of them to be garbage collected. It 
expects that after this is done, one of the slaves will be marked as "gone" and 
the other be unknown. To get the agent id of the agents it registers, the 
following code is used:

 
{noformat}
  Future<SlaveRegisteredMessage> slaveRegisteredMessage =
    FUTURE_PROTOBUF(SlaveRegisteredMessage(), master.get()->pid, _);
  Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), slaveFlags);
  AWAIT_READY(slaveRegisteredMessage);

  [...] (the slave is marked as gone here)

  Future<SlaveRegisteredMessage> slaveRegisteredMessage2 =
    FUTURE_PROTOBUF(SlaveRegisteredMessage(), master.get()->pid, _);
  Try<Owned<cluster::Slave>> slave2 = StartSlave(detector.get(), slaveFlags2);
  AWAIT_READY(slaveRegisteredMessage2);
{noformat}
 

In the failure case, the registration of the first agent works as follows:
{noformat}
agent0: Sends RegisterSlaveMessage
master: Does registration, adds SlaveRegisteredMessage to outbound message queue
agent0: Didn't get an answer after timeout, resends RegisterSlaveMessage
agent0: Gets the previously sent SlaveRegisteredMessage
master: Gets the second RegisterSlaveMessage, notices that agent0 is already 
registered and just resends the SlaveRegisteredMessage
test: Proceeds to mark agent0 as gone, creates the 
Future<SlaveRegisteredMessage> for agent1
test: The future is satisfied by the second SlaveRegisteredMessage sent by the 
master
{noformat}
This leads the test code to think that agent1 has the agent id of agent0, which 
causes the subsequent test failure.

 

Mesos basically works correctly here, so the correct fix seems to be to rewrite 
the test to wait for a `SlaveRegisteredMessage` that is actually destined for 
the correct pid.
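
The difference between matching any recipient and matching a specific one can be modelled in a few lines. This is a toy model with made-up types (`Message`, `firstFor`), not the libprocess/gmock filter machinery used by the real test.

```cpp
#include <string>
#include <vector>

// Toy model of a message in flight: who it is for and which agent id
// it carries.
struct Message {
  std::string to;
  std::string agentId;
};

// Returns the first queued message addressed to `to`, or nullptr if
// there is none. Matching on the recipient skips stale duplicates that
// are still in flight for a different agent.
inline const Message* firstFor(
    const std::vector<Message>& queue, const std::string& to) {
  for (const Message& message : queue) {
    if (message.to == to) {
      return &message;
    }
  }
  return nullptr;
}
```

If the queue holds a resent message for agent0 followed by the message for agent1, taking the first message regardless of recipient yields agent0's id, while `firstFor(queue, "agent1")` yields the intended one.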

 

 

> MasterTest.RegistryGcByCount is flaky
> -
>
> Key: MESOS-8485
> URL: https://issues.apache.org/jira/browse/MESOS-8485
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
[jira] [Commented] (MESOS-8485) MasterTest.RegistryGcByCount is flaky

2018-01-26 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341377#comment-16341377
 ] 

Benno Evers commented on MESOS-8485:


Review posted at: https://reviews.apache.org/r/65354

> MasterTest.RegistryGcByCount is flaky
> -
>
> Key: MESOS-8485
> URL: https://issues.apache.org/jira/browse/MESOS-8485
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>

[jira] [Created] (MESOS-8508) Missing map header when compiling against unbundled protobuf

2018-01-30 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8508:
--

 Summary: Missing map header when compiling against unbundled 
protobuf
 Key: MESOS-8508
 URL: https://issues.apache.org/jira/browse/MESOS-8508
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


When compiling mesos against the system-default version of protobuf on Ubuntu 
17.04, the build fails due to a missing include.

 

Explanation for the error by [~kaysoky]:
Note that the reason why this doesn't compile in protobuf 3.0.x is due to how 
the c++ files are generated.  In protobuf 3.0.x (and 3.1.x and 3.2.x) generated 
code only includes the protobuf map headers if there is a map present in the 
.proto file:
[https://github.com/google/protobuf/blob/3.0.x/src/google/protobuf/compiler/cpp/cpp_file.cc#L817-L827]

From 3.3.x onwards, all generated files include 
{{google/protobuf/generated_message_table_driven.h}}, which in turn includes 
the map headers:
[https://github.com/google/protobuf/blob/3.3.x/src/google/protobuf/compiler/cpp/cpp_file.cc#L1006]





[jira] [Updated] (MESOS-3915) Upgrade vendored Boost

2018-02-01 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-3915:
---
  Sprint: Mesosphere Sprint 73
Story Points: 5

> Upgrade vendored Boost
> --
>
> Key: MESOS-3915
> URL: https://issues.apache.org/jira/browse/MESOS-3915
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Benno Evers
>Priority: Minor
>  Labels: boost, mesosphere, tech-debt
> Fix For: 1.6.0
>
>
> We should upgrade the vendored version of Boost to a newer version. Benefits:
> * -Should properly fix MESOS-688-
> * -Should fix MESOS-3799-
> * Generally speaking, using a more modern version of Boost means we can take 
> advantage of bug fixes, optimizations, and new features.





[jira] [Updated] (MESOS-7699) "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable freshly released)

2018-02-01 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-7699:
---
Sprint: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68, 
Mesosphere Sprint 73  (was: Mesosphere Sprint 66, Mesosphere Sprint 67, 
Mesosphere Sprint 68)

> "stdlib.h: No such file or directory" when building with GCC 6 (Debian stable 
> freshly released)
> ---
>
> Key: MESOS-7699
> URL: https://issues.apache.org/jira/browse/MESOS-7699
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.2.0
>Reporter: Adam Cecile
>Assignee: Benno Evers
>Priority: Major
>  Labels: autotools
> Fix For: 1.6.0
>
>

[jira] [Updated] (MESOS-8508) Missing map header when compiling against unbundled protobuf

2018-02-01 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers updated MESOS-8508:
---
  Sprint: Mesosphere Sprint 73
Story Points: 1

> Missing map header when compiling against unbundled protobuf
> 
>
> Key: MESOS-8508
> URL: https://issues.apache.org/jira/browse/MESOS-8508
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
> Fix For: 1.6.0
>
>
> When compiling mesos against the system-default version of protobuf on Ubuntu 
> 17.04, the build fails due to a missing include.
>  
> Explanation for the error by [~kaysoky]:
> Note that the reason why this doesn't compile in protobuf 3.0.x is due to how 
> the c++ files are generated.  In protobuf 3.0.x (and 3.1.x and 3.2.x) 
> generated code only includes the protobuf map headers if there is a map 
> present in the .proto 
> file: [https://github.com/google/protobuf/blob/3.0.x/src/google/protobuf/compiler/cpp/cpp_file.cc#L817-L827]
> From 3.3.x onwards, all generated files include 
> {{google/protobuf/generated_message_table_driven.h}}, which in turn includes 
> the map 
> headers: [https://github.com/google/protobuf/blob/3.3.x/src/google/protobuf/compiler/cpp/cpp_file.cc#L1006]





[jira] [Commented] (MESOS-8485) MasterTest.RegistryGcByCount is flaky

2018-02-05 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352625#comment-16352625
 ] 

Benno Evers commented on MESOS-8485:


[~abudnik] I did not attempt to do that, because it requires suspending and 
resuming most of the involved individual processes separately, and as far as 
I'm aware our test tools don't provide such fine-grained control.

> MasterTest.RegistryGcByCount is flaky
> -
>
> Key: MESOS-8485
> URL: https://issues.apache.org/jira/browse/MESOS-8485
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed this while testing Mesos 1.5.0-rc1 in ASF CI.
>  
> {code}
> 3: [ RUN      ] MasterTest.RegistryGcByCount
> ..snip...
> 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master
> 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master 
> master@172.17.0.2:45634
> 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 
> authenticatee
> 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client 
> SASL connection
> 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating 
> slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication 
> session for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL 
> connection
> 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL 
> authentication mechanisms: CRAM-MD5
> 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to 
> authenticate with mechanism 'CRAM-MD5'
> 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL 
> authentication start
> 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires 
> more steps
> 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL 
> authentication step
> 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL 
> authentication step
> 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: false 
> 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*userPassword'
> 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*cmusaslsecretCRAM-MD5'
> 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: true 
> 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*userPassword' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934324 15987 authenticator.cpp:318] Authentication success
> 3: I0123 19:22:05.934463 15995 authenticatee.cpp:299] Authentication success
> 3: I0123 19:22:05.934563 16002 master.cpp:8988] Successfully authenticated 
> principal 'test-principal' at slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.934708 15993 authenticator.cpp:432] Authentication session 
> cleanup for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.934891 15995 slave.cpp:1320] Successfully authenticated 
> with master master@172.17.0.2:45634
> 3: I0123 19:22:05.935261 15995 slave.cpp:1764] Will retry registration in 
> 2.234083ms if necessary
> 3: I0123 19:22:05.935436 15999 master.cpp:6061] Received register agent 
> message from slave(442)@172.17.0.2:45634 (455912973e2c)
> 3: I0123 19:22:05.935662 15999 master.cpp:3867] Authorizing agent with 
> principal 'test-principal'
> 3: I0123 19:22:05.936161 15992 master.cpp:6123] Authorized registration of 
> agent at slave(442)@172.17.0.2:45634 (455912973e2c)
> 3: I0123 19:22:05.936261 15992 master.cpp:6234] Registering agent at 
> slave(442)@172.17.0.2:45634 (455912973e2c) with id 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0
> 3: I0123 19:22:05.936993 15989 registrar.cpp:495] Applied 1 operations in 
> 227911ns; attempting to update the registry
> 3: I0123 19:22:05.937814 15989 registrar.cpp:552] Successfully updated the 
> registry in 743168ns
> 3: I0123 19:22:05.938057 15991 master.cpp:6282] Admitted agent 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 
> (455912973e2c)
> 3: I0123 19:22:05.938891 15991 master.cpp:6331] Registered agent 
> ee

[jira] [Commented] (MESOS-8359) Health checks are flapping for all tasks on the slave if one task has no enough resources to run

2018-02-06 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354054#comment-16354054
 ] 

Benno Evers commented on MESOS-8359:


I'm afraid I can't reproduce this:

 - I started a task `python3 -m http.server` along with a `MESOS_HTTP` health 
check that is succeeding

 - I started the `csa-http` container from the JSON app definition supplied 
above.

 - This fails almost immediately with the log output
{noformat}
I0206 15:35:04.468765 21810 exec.cpp:162] Version: 1.5.0
I0206 15:35:04.480106 21817 exec.cpp:236] Executor registered on agent 
c4c3a4b7-afa1-4e8d-a723-51777de3d429-S1
I0206 15:35:04.480873 21813 executor.cpp:120] Registered docker executor on 
10.0.3.249
I0206 15:35:04.481027 21817 executor.cpp:160] Starting task 
newappmv_qagame_testapp.green_csahttp.4cb3e8b2-0b53-11e8-b4d2-d6e8ac5e6a60
Picked up JAVA_TOOL_OPTIONS: -Xmx32m
Killed
I0206 15:35:12.424013 21814 executor.cpp:552] Container exited with status 137
I0206 15:35:13.424458 21810 checker_process.cpp:247] Stopped HTTP health check 
for task 
'newappmv_qagame_testapp.green_csahttp.4cb3e8b2-0b53-11e8-b4d2-d6e8ac5e6a60'{noformat}
 - After waiting for 15 minutes, the health check for the python3 task is not 
flapping but is stably succeeding

Is it maybe possible to further reduce the test case? In particular, do you 
still observe the same behaviour if you remove docker and marathon from the 
picture by just starting the jar directly on the slave node?

> Health checks are flapping for all tasks on the slave if one task has no 
> enough resources to run
> 
>
> Key: MESOS-8359
> URL: https://issues.apache.org/jira/browse/MESOS-8359
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.2
>Reporter: Viacheslav Valyavskiy
>Priority: Major
> Attachments: logs2
>
>
> I have attached some logs from the affected 
> slave(newappmv_qagame_testapp.green_csahttp - name of the 'bad' application)
> Steps to reproduce:
> 1. Run multiple tasks on the slave
> 2. Create marathon application from our image ( docker pull 
> vvalyavskiy/csa-http ) and set memory limit to 16MB for it.
> 3. Wait some time and then observe flapping of all tasks on the slave where 
> our task is started





[jira] [Commented] (MESOS-8550) Bug in `Master::detected()` leads to coredump in `MasterZooKeeperTest.MasterInfoAddress`

2018-02-08 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357207#comment-16357207
 ] 

Benno Evers commented on MESOS-8550:


Andrei's analysis seems to be right: the code was indeed calling 
`leader->has_domain()` on an `Option<MasterInfo>` without checking that it was 
not `None` first.

 

I posted a fix in the following review: https://reviews.apache.org/r/65571/
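
For illustration only, here is a minimal sketch of the bug pattern, using `std::optional` as a stand-in for stout's `Option`. The `MasterInfo` struct and the helpers `getOrAbort` and `leaderHasDomain` are hypothetical names for this sketch and are not the code from the review; only the shape of the check is the point.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for mesos::MasterInfo; only the field needed
// for the illustration is modelled.
struct MasterInfo
{
  std::optional<std::string> domain;

  bool has_domain() const { return domain.has_value(); }
};

// Mirrors the contract of `Option::get()`: calling it on an empty
// value trips an assertion, as seen in the crash log.
const MasterInfo& getOrAbort(const std::optional<MasterInfo>& leader)
{
  assert(leader.has_value());
  return *leader;
}

// Returns true only if a leader is known *and* it has a domain.
// Skipping the first check is the analogue of calling
// `leader->has_domain()` on a `None` option.
bool leaderHasDomain(const std::optional<MasterInfo>& leader)
{
  if (!leader.has_value()) {
    return false; // No leader, e.g. after a ZooKeeper session expiry.
  }

  return getOrAbort(leader).has_domain();
}
```

With this shape, a "newly elected leader is None" event takes the early return instead of reaching the assertion.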

> Bug in `Master::detected()` leads to coredump in 
> `MasterZooKeeperTest.MasterInfoAddress`
> 
>
> Key: MESOS-8550
> URL: https://issues.apache.org/jira/browse/MESOS-8550
> Project: Mesos
>  Issue Type: Bug
>  Components: leader election, master
>Reporter: Andrei Budnik
>Priority: Major
> Attachments: MasterZooKeeperTest.MasterInfoAddress-badrun.txt
>
>
> {code:java}
> 15:55:17 Assertion failed: (isSome()), function get, file 
> ../../3rdparty/stout/include/stout/option.hpp, line 119.
> 15:55:17 *** Aborted at 1518018924 (unix time) try "date -d @1518018924" if 
> you are using GNU date ***
> 15:55:17 PC: @ 0x7fff4f8f2e3e __pthread_kill
> 15:55:17 *** SIGABRT (@0x7fff4f8f2e3e) received by PID 39896 (TID 
> 0x70427000) stack trace: ***
> 15:55:17 @ 0x7fff4fa24f5a _sigtramp
> 15:55:17 I0207 07:55:24.945252 4890624 group.cpp:511] ZooKeeper session 
> expired
> 15:55:17 @ 0x70425500 (unknown)
> 15:55:17 2018-02-07 07:55:24,945:39896(0x70633000):ZOO_INFO@log_env@794: 
> Client 
> environment:user.dir=/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/1mHCvU
> 15:55:17 @ 0x7fff4f84f312 abort
> 15:55:17 2018-02-07 
> 07:55:24,945:39896(0x70633000):ZOO_INFO@zookeeper_init@827: Initiating 
> client connection, host=127.0.0.1:52197 sessionTimeout=1 
> watcher=0x10d916590 sessionId=0 sessionPasswd= context=0x7fe1bda706a0 
> flags=0
> 15:55:17 @ 0x7fff4f817368 __assert_rtn
> 15:55:17 @0x10b9cff97 _ZNR6OptionIN5mesos10MasterInfoEE3getEv
> 15:55:17 @0x10bbb04b5 Option<>::operator->()
> 15:55:17 @0x10bd4514a mesos::internal::master::Master::detected()
> 15:55:17 @0x10bf54558 
> _ZZN7process8dispatchIN5mesos8internal6master6MasterERKNS_6FutureI6OptionINS1_10MasterInfoSB_EEvRKNS_3PIDIT_EEMSD_FvT0_EOT1_ENKUlOS9_PNS_11ProcessBaseEE_clESM_SO_
> 15:55:17 @0x10bf54310 
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINS3_10MasterInfoSD_EEvRKNS1_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS1_11ProcessBaseEE_JSB_SQ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSF_DpOSS_
> 15:55:17 @0x10bf542bb 
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1E13invoke_expandISS_NST_5tupleIJSC_SW_EEENSZ_IJOSR_EEEJLm0ELm1DTclsr5cpp17E6invokeclsr3stdE7forwardISG_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISK_Efp0_EEclsr3stdE7forwardISN_Efp2_OSG_OSK_N5cpp1416integer_sequenceImJXspT2_SO_
> 15:55:17 @0x10bf541f3 
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EclIJSR_EEEDTcl13invoke_expandclL_ZNST_4moveIRSS_EEONST_16remove_referenceISG_E4typeEOSG_EdtdefpT1fEclL_ZNSZ_IRNST_5tupleIJSC_SW_ES14_S15_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOS1C_
> 15:55:17 @0x10bf540bd 
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS4_6FutureI6OptionINS6_10MasterInfoSG_EEvRKNS4_3PIDIT_EEMSI_FvT0_EOT1_EUlOSE_PNS4_11ProcessBaseEE_JSE_NSt3__112placeholders4__phILi1EEJST_EEEDTclclsr3stdE7forwardISI_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSI_DpOS10_
> 15:55:17 @0x10bf54081 
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS5_6FutureI6OptionINS7_10MasterInfoSH_EEvRKNS5_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_PNS5_11ProcessBaseEE_JSF_NSt3__112placeholders4__phILi1EEJSU_EEEvOSJ_DpOT0_
> 15:55:17 @0x10bf53e06 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINSA_10MasterInfoSK_EEvRKNS1_3PIDIT_EEMSM_FvT0_EOT1_EUlOSI_S3_E_JSI_NSt3__112placeholders4__phILi1EEEclEOS3_
> 15:55:17 @0x10ebf464f 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> 15:55:17 @0x10ebf44c4 process::ProcessBase:

[jira] [Commented] (MESOS-8594) Mesos master crash (under load)

2018-02-19 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369238#comment-16369238
 ] 

Benno Evers commented on MESOS-8594:


The analysis by [~abudnik] seems to be correct. The actual site of the crash 
looks completely harmless, with no dangling pointers or anything, but the call 
stack is very deep, going repeatedly through `process::internal::send()` and 
`process::internal::_send()`.

 

The root cause seems to be this ancient TODO in `Future::onAny()`
{noformat}
  synchronized (data->lock) {
    if (data->state == PENDING) {
      data->onAnyCallbacks.emplace_back(std::move(callback));
    } else {
      run = true;
    }
  }

  // TODO(*): Invoke callback in another execution context.
  if (run) {
    std::move(callback)(*this); // NOLINT(misc-use-after-move)
  }{noformat}
 

so whenever we arrive in `send()` and the future returned by the socket is 
already finished, we add another 5-10 frames to the call stack.

 

Most likely, due to the large number of big packets being sent over a loopback 
interface, there is always enough data ready to allow a build-up large enough 
to cause the program to run out of stack space.
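
To make the stack build-up concrete, here is a small self-contained sketch. It is not libprocess code: `DepthTracker`, `sendInline` and `sendDeferred` are hypothetical names, and the "future" is modelled as always being ready. It compares running each continuation inline with queueing it and draining the queue from a loop, which is one possible reading of "invoke callback in another execution context".

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Tracks how deeply continuations are nested on the call stack.
struct DepthTracker
{
  std::size_t current = 0;
  std::size_t max = 0;

  void enter()
  {
    ++current;
    if (current > max) {
      max = current;
    }
  }

  void leave()
  {
    assert(current > 0);
    --current;
  }
};

// Inline style: the result is always ready, so each continuation runs
// immediately inside its caller. Depth grows linearly with `remaining`.
void sendInline(std::size_t remaining, DepthTracker& tracker)
{
  tracker.enter();
  if (remaining > 0) {
    sendInline(remaining - 1, tracker); // Callback runs in the same context.
  }
  tracker.leave();
}

// Deferred style: continuations are queued and drained by a loop, so
// the stack unwinds between steps and the depth stays constant.
void sendDeferred(std::size_t remaining, DepthTracker& tracker)
{
  std::deque<std::function<void()>> queue;

  std::function<void(std::size_t)> step = [&](std::size_t left) {
    tracker.enter();
    if (left > 0) {
      queue.push_back([&step, left]() { step(left - 1); });
    }
    tracker.leave();
  };

  queue.push_back([&step, remaining]() { step(remaining); });

  while (!queue.empty()) {
    std::function<void()> next = std::move(queue.front());
    queue.pop_front();
    next();
  }
}
```

In the inline variant the maximum depth is one frame per pending send, while the deferred variant never nests more than one step, at the cost of an extra queue.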

 

> Mesos master crash (under load)
> ---
>
> Key: MESOS-8594
> URL: https://issues.apache.org/jira/browse/MESOS-8594
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.0, 1.6.0
>Reporter: A. Dukhovniy
>Priority: Major
> Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, 
> lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached are some infos from the `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8)
> frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
> 32 template <typename T>
> 33 struct _Some
> 34 {
> -> 35 _Some(T _t) : t(std::move(_t)) {}
> 36
> 37 T t;
> 38 };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik] 
> {quote}
> it’s the stack overflow bug in libprocess due to a way `internal::send()` and 
> `internal::_send()` are implemented in `process.cpp`
> {quote}





[jira] [Commented] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

2018-02-21 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371439#comment-16371439
 ] 

Benno Evers commented on MESOS-8336:


The root cause here is a very familiar one, that has already rendered countless 
other tests flaky. In particular, I'm talking about this line in `slave.cpp`:
{noformat}
    // Wait for a random amount of time before authentication or
    // registration.
    Duration duration =
      flags.registration_backoff_factor * ((double) os::random() / RAND_MAX);{noformat}
Here, the agent is sending the re-tried `RegisterSlaveMessage` after 9ms, 
*just* before shutting down, and the master notices that the network link is 
down before it gets to processing the message.

This leads to the master assigning a second slave ID, almost immediately 
removing the slave again because the network link is broken as well, and 
finally the test seeing the remnants of this second slave in the registry.
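
As a simplified illustration of why such short delays can occur, the sketch below reproduces the arithmetic of the quoted line. `initialDelayMs` is a hypothetical helper; the backoff factor is modelled as plain milliseconds and the random value is passed in, both assumptions made to keep the example deterministic.

```cpp
#include <cassert>
#include <cstdlib>

// Computes the delay in milliseconds. `backoffFactorMs` plays the role
// of `flags.registration_backoff_factor` and `randomValue` the role of
// `os::random()`.
double initialDelayMs(double backoffFactorMs, long randomValue)
{
  assert(randomValue >= 0 && randomValue <= RAND_MAX);
  return backoffFactorMs * (static_cast<double>(randomValue) / RAND_MAX);
}
```

Since the random value can be anything between 0 and `RAND_MAX`, the delay can be anywhere between zero and the full factor, so a test cannot rely on the agent waiting any minimum amount of time before it sends the message.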

 

> MasterTest.RegistryUpdateAfterReconfiguration is flaky
> --
>
> Key: MESOS-8336
> URL: https://issues.apache.org/jira/browse/MESOS-8336
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
> Attachments: RegistryUpdateAfterReconfiguration-badrun.txt
>
>
> Observed here: 
> https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/
> The test here failed because the registry contained 2 slaves, when it should 
> have only one.
> Looking through the log, everything seems normal (in particular, only 1 slave 
> id appears throughout this test). The only thing out of the ordinary seems to 
> be the agent sending two `RegisterSlaveMessage`s and two 
> `ReregisterSlaveMessage`s, but looking at the code for generating the random 
> backoff factor in the slave that seems to be more or less normal, and 
> shouldn't break the test.





[jira] [Commented] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

2018-02-22 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16373066#comment-16373066
 ] 

Benno Evers commented on MESOS-8336:


https://reviews.apache.org/r/65758/

> MasterTest.RegistryUpdateAfterReconfiguration is flaky
> --
>
> Key: MESOS-8336
> URL: https://issues.apache.org/jira/browse/MESOS-8336
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: flaky-test
> Attachments: RegistryUpdateAfterReconfiguration-badrun.txt
>
>
> Observed here: 
> https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/
> The test here failed because the registry contained 2 slaves, when it should 
> have only one.
> Looking through the log, everything seems normal (in particular, only 1 slave 
> id appears throughout this test). The only thing out of the ordinary seems to 
> be the agent sending two `RegisterSlaveMessage`s and two 
> `ReregisterSlaveMessage`s, but looking at the code for generating the random 
> backoff factor in the slave that seems to be more or less normal, and 
> shouldn't break the test.





[jira] [Assigned] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

2018-02-22 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-8336:
--

Assignee: Benno Evers

> MasterTest.RegistryUpdateAfterReconfiguration is flaky
> --
>
> Key: MESOS-8336
> URL: https://issues.apache.org/jira/browse/MESOS-8336
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
> Attachments: RegistryUpdateAfterReconfiguration-badrun.txt
>
>
> Observed here: 
> https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/
> The test here failed because the registry contained 2 slaves, when it should 
> have only one.
> Looking through the log, everything seems normal (in particular, only 1 slave 
> id appears throughout this test). The only thing out of the ordinary seems to 
> be the agent sending two `RegisterSlaveMessage`s and two 
> `ReregisterSlaveMessage`s, but looking at the code for generating the random 
> backoff factor in the slave that seems to be more or less normal, and 
> shouldn't break the test.





[jira] [Created] (MESOS-8600) Add more permissive reconfiguration policies

2018-02-22 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8600:
--

 Summary: Add more permissive reconfiguration policies
 Key: MESOS-8600
 URL: https://issues.apache.org/jira/browse/MESOS-8600
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


With Mesos 1.5, the `reconfiguration_policy` flag was added to allow 
reconfiguration of agents without necessarily draining all tasks.

However, the current implementation only allows a limited set of changes, with 
the `--reconfiguration_policy=all` setting laid out in the original design doc 
not yet being implemented.

This ticket is intended to track progress on implementing this.





[jira] [Created] (MESOS-8704) Removing `work_dir` can trigger assertion failure in the mesos containerizer

2018-03-21 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8704:
--

 Summary: Removing `work_dir` can trigger assertion failure in the 
mesos containerizer
 Key: MESOS-8704
 URL: https://issues.apache.org/jira/browse/MESOS-8704
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


This was reported to me by [~jeschkies], so I might be missing some details.

Starting a Mesos agent with the flag `--containerizers=mesos,docker`, using 
Marathon to run a task group on this agent, then stopping the agent, removing 
the `work_dir` folder, and restarting the agent with the flag 
`--containerizers=mesos` leads to the following crash during recovery:
{noformat}
I0319 15:58:03.865108 121364480 containerizer.cpp:674] Recovering containerizer
F0319 15:58:03.867717 121364480 containerizer.cpp:919] 
CHECK_SOME(container->directory): is NONE
*** Check failure stack trace: ***{noformat}
After a reboot, things seemed to be working fine again.

 

Since we're reading container IDs from `runtime_dir` during recovery, and that 
wasn't cleaned between agent restarts, it seems like we're missing some 
validation for the case where the agent restarts from a half-dirty state.
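
A rough sketch of what such validation could look like is shown below. This is not the containerizer's actual recovery code: `RecoveredContainer`, `RecoveryPlan` and `planRecovery` are hypothetical names, and `std::optional` stands in for stout's `Option`. The idea is only to classify containers with a missing directory rather than asserting on them.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified view of a container found during recovery.
struct RecoveredContainer
{
  std::string id;
  // Empty if the sandbox under `work_dir` no longer exists.
  std::optional<std::string> directory;
};

// Containers split by whether their sandbox directory is still known.
struct RecoveryPlan
{
  std::vector<std::string> recoverable;
  std::vector<std::string> orphaned;
};

// Classifies containers instead of aborting when a directory is missing.
RecoveryPlan planRecovery(const std::vector<RecoveredContainer>& containers)
{
  RecoveryPlan plan;

  for (const RecoveredContainer& container : containers) {
    if (container.directory.has_value()) {
      plan.recoverable.push_back(container.id);
    } else {
      plan.orphaned.push_back(container.id);
    }
  }

  assert(plan.recoverable.size() + plan.orphaned.size() == containers.size());
  return plan;
}
```

What should happen to the orphaned entries (cleanup, logging, or a clear error) is left open here.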





[jira] [Commented] (MESOS-8703) Mesos master can`t reconnect to zookeeper

2018-03-21 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408442#comment-16408442
 ] 

Benno Evers commented on MESOS-8703:


The original zookeeper crash might well be caused by MESOS-8550.

However, usually this should just result in a crash and subsequent restart of 
the master. Instead, the master seems to lock up during shutdown. The cause 
might be an issue similar to MESOS-1477, although I couldn't see any 
suspicious changes to the related files for version 1.4.1.

If this issue is somewhat reproducible, it would probably be helpful to include 
stack traces for all threads when the master becomes unresponsive.

 

> Mesos master can`t reconnect to zookeeper 
> --
>
> Key: MESOS-8703
> URL: https://issues.apache.org/jira/browse/MESOS-8703
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.1
>Reporter: Anton Malevich
>Priority: Blocker
>
> Mesos master can`t reconnect to zookeeper after zookeeper hangs.
> {noformat}
> 2018-03-20 
> 10:16:45,608:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1666: Socket 
> [:2181] zk retcode=-7, errno=110(Connection timed out): connection 
> to :2181 timed out (exceeded timeout by 3ms)
> 2018-03-20 10:16:45,609:1(0x2ae675db6700):ZOO_INFO@check_events@1728: 
> initiated connection to server [:2181]
> 2018-03-20 
> 10:16:45,619:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1764: Socket 
> [:2181] zk retcode=-112, errno=116(Stale file handle): 
> sessionId=0x5623d0e483dd435 has expired.
> I0320 10:16:45.62060418 group.cpp:511] ZooKeeper session expired
> I0320 10:16:45.62080216 detector.cpp:152] Detected a new leader: None
> I0320 10:16:45.62095716 master.cpp:2176] The newly elected leader is None
> mesos-master: ../../3rdparty/stout/include/stout/option.hpp:112: T& 
> Option::get() & [with T = mesos::MasterInfo]: Assertion `isSome()' failed.
> *** Aborted at 1521541005 (unix time) try "date -d @1521541005" if you are 
> using GNU date ***
> PC: @ 0x2ae63d2b9428 (unknown)
> *** SIGABRT (@0x1) received by PID 1 (TID 0x2ae648ffa700) from PID 1; stack 
> trace: ***
> @ 0x2ae63d078390 (unknown)
> @ 0x2ae63d2b9428 (unknown)
> @ 0x2ae63d2bb02a (unknown)
> @ 0x2ae63d2b1bd7 (unknown)
> @ 0x2ae63d2b1c82 (unknown)
> 2018-03-20 10:16:45,622:1(0x2ae649ffc700):ZOO_INFO@zookeeper_close@2543: 
> Freeing zookeeper resources for sessionId=0x5623d0e483dd435
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@726: Client 
> environment:zookeeper.version=zookeeper C client 3.4.8
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@730: Client 
> environment:host.name=
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@737: Client 
> environment:os.name=Linux
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@738: Client 
> environment:os.arch=4.8.15-1.el7.wg.x86_64
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@739: Client 
> environment:os.version=#1 SMP Mon Dec 26 14:34:45 UTC 2016
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@747: Client 
> environment:user.name=(null)
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@755: Client 
> environment:user.home=/root
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@767: Client 
> environment:user.dir=/
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@zookeeper_init@800: 
> Initiating client connection, host= sessionTimeout=1 
> watcher=0x2ae63b3711e0 sessionId=0 sessionPasswd= 
> context=0x2ae6900036f8 flags=0
> @ 0x2ae63ad6b55b mesos::internal::master::Master::detected()
> @ 0x2ae63b9e4cfc process::ProcessBase::visit()
> 2018-03-20 10:16:45,634:1(0x2ae6765b7700):ZOO_INFO@check_events@1728: 
> initiated connection to server [:2181]
> @ 0x2ae63b9fac84 process::ProcessManager::resume()
> @ 0x2ae63b9fd5e6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x2ae63c87ec80 (unknown)
> @ 0x2ae63d06e6ba start_thread
> @ 0x2ae63d38b3dd (unknown)
> 2018-03-20 10:16:45,651:1(0x2ae6765b7700):ZOO_INFO@check_events@1775: session 
> establishment complete on server [:2181], 
> sessionId=0x1623f43348692c7, negotiated timeout=1
> I0320 10:16:45.65168415 group.cpp:341] Group process 
> (zookeeper-group(2)@:5050) connected to ZooKeeper
> I0320 10:16:45.65173315 group.cpp:831] Syncing group operations: queue 
> size (joins, cancels, datas) = (0, 0, 0)
> I0320 10:16:45.65174315 group.cpp:419] Trying to create path '/mesos' in 
> ZooKeeper
> I0320 10:16:45.67673615 detector.cpp:152] Detected a new leader: 
> (id='704')
> I0320 10:16:45.67684415 group.cpp:700] Trying to get 
> '/mesos/json.info

[jira] [Created] (MESOS-8721) Unnecessary cropping of agent id's in the web ui

2018-03-22 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8721:
--

 Summary: Unnecessary cropping of agent id's in the web ui
 Key: MESOS-8721
 URL: https://issues.apache.org/jira/browse/MESOS-8721
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers
 Attachments: cropped_ids.png

As seen in the attached image (captured from Firefox 59 and Mesos 1.2.3), the 
agents page of the web ui appears to be cropping agent ids even if the column 
would have enough space to display the full name.





[jira] [Created] (MESOS-8722) Hard-coded timeout for authentication failures

2018-03-22 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8722:
--

 Summary: Hard-coded timeout for authentication failures
 Key: MESOS-8722
 URL: https://issues.apache.org/jira/browse/MESOS-8722
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


In the mesos agent there is a hard-coded 5 second timeout for any 
authentication attempt:
{noformat}
void Slave::authenticate()
{
 [...]

  delay(Seconds(5), self(), &Self::authenticationTimeout, authenticating.get());
}
{noformat}
When the network is poor, this can lead to a situation where an agent fails to 
authenticate for a long time, preventing it from re-joining the cluster.
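
One conceivable alternative to the fixed value, sketched here purely as an assumption and not as an agreed design, would be a timeout that starts at a configurable minimum and doubles after each failed attempt up to a configurable maximum. The function `authenticationTimeout` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using std::chrono::seconds;

// Hypothetical policy: begin with `min` and double the timeout for
// every failed attempt so far, never exceeding `max`.
seconds authenticationTimeout(unsigned failedAttempts, seconds min, seconds max)
{
  assert(min.count() > 0 && min <= max);

  seconds timeout = min;
  for (unsigned i = 0; i < failedAttempts && timeout < max; ++i) {
    timeout = std::min(timeout * 2, max);
  }

  return timeout;
}
```

With a minimum of 5 seconds and a maximum of 60 seconds, the attempts would time out after 5, 10, 20, 40 and then 60 seconds, giving a slow network progressively more time to complete the handshake.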





[jira] [Created] (MESOS-8724) G++ Warning about libc system macros `major` and `minor` prevents Mesos build

2018-03-22 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8724:
--

 Summary: G++ Warning about libc system macros `major` and `minor` 
prevents Mesos build
 Key: MESOS-8724
 URL: https://issues.apache.org/jira/browse/MESOS-8724
 Project: Mesos
  Issue Type: Bug
Reporter: Benno Evers


On linux systems, the header `<sys/sysmacros.h>` defines three macros called 
makedev(), major() and minor(). (See also 
http://man7.org/linux/man-pages/man3/makedev.3.html)

Trying to compile Mesos using g++ 7.2.0 leads to the following warning:
{noformat}
../include/csi/csi.pb.h:6042:13: error: In the GNU C Library, "minor" is defined
 by <sys/sysmacros.h>. For historical compatibility, it is
 currently defined by <sys/types.h> as well, but we plan to
 remove this soon. To use "minor", include <sys/sysmacros.h>
 directly. If you did not intend to use a system-defined macro
 "minor", you should undefine it after including <sys/types.h>. [-Werror]
 inline ::google::protobuf::uint32 Version::minor() const {
{noformat}
The root cause is that csi.proto defines the following protobuf message:
{noformat}
message Version {
  uint32 major = 1;  // This field is REQUIRED.
  uint32 minor = 2;  // This field is REQUIRED.
  uint32 patch = 3;  // This field is REQUIRED.
}
{noformat}
The generated C++ header `csi.pb.h` will contain, amongst others, the 
following function:
{noformat}
#include <string>

// [6000 lines of code...]

inline ::google::protobuf::uint32 Version::major() const {
  // @@protoc_insertion_point(field_get:csi.Version.major)
  return major_;
}
{noformat}
And the recursive include structure of the header `<string>` leads to 
`stdlib.h` as follows:
{noformat}
.   /usr/include/c++/7/string
..  /usr/include/c++/7/bits/basic_string.h
... /usr/include/c++/7/ext/string_conversions.h
.... /usr/include/c++/7/cstdlib
..... /usr/include/stdlib.h{noformat}
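
For reference, the workaround suggested by the glibc message itself can be sketched as follows. The `Version` class is a hand-written stand-in for the generated protobuf class, not actual protoc output, and `example` is a hypothetical usage helper.

```cpp
#include <cassert>
#include <cstdint>

// On affected glibc versions this header also provides the
// function-like macros `major` and `minor`.
#include <sys/types.h>

// Undefining a macro that is not defined is harmless, so these lines
// are safe on systems that do not provide the macros at all.
#undef major
#undef minor

// Hand-written stand-in for the generated `csi::Version` accessors.
// With the macros still defined, `major()` and `minor()` would be
// rewritten by the preprocessor before the compiler sees them.
class Version
{
public:
  Version(std::uint32_t major, std::uint32_t minor)
    : major_(major), minor_(minor) {}

  std::uint32_t major() const { return major_; }
  std::uint32_t minor() const { return minor_; }

private:
  std::uint32_t major_;
  std::uint32_t minor_;
};

// Usage: the accessors keep their intended names.
inline void example()
{
  Version version(1, 2);
  assert(version.major() == 1);
  assert(version.minor() == 2);
}
```

The `#undef` lines have to come after the last include that may define the macros and before the first use of the names.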
 





[jira] [Commented] (MESOS-8724) G++ Warning about libc system macros `major` and `minor` prevents Mesos build

2018-03-22 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409916#comment-16409916
 ] 

Benno Evers commented on MESOS-8724:


One subtle thing to keep in mind: if we decide to "properly" fix it by getting 
protoc to add the correct #undefs for minor and major, we should take care 
*not* to backport the patch to older Mesos versions, since that would remove the 
previously defined function `csi::Version::gnu_dev_major()`, causing an ABI 
incompatibility for people upgrading libmesos.so.
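
For illustration, the effect of such #undefs can be sketched in a few lines. This is a minimal sketch and not the generated protobuf code: the `Version` struct is a hypothetical stand-in for `csi::Version`, and the `#ifdef` guards are an assumption so that the snippet also compiles on systems where `<sys/types.h>` does not define the macros.

```cpp
#include <cassert>
#include <sys/types.h> // On affected glibc versions this defines major() and minor().

// Undefine the system macros so the preprocessor does not rewrite the
// identifiers below (e.g. `major()` into `gnu_dev_major()`).
#ifdef major
#undef major
#endif
#ifdef minor
#undef minor
#endif

// Hypothetical stand-in for the accessors generated for `csi::Version`.
struct Version
{
  unsigned int major_;
  unsigned int minor_;
  unsigned int patch_;

  unsigned int major() const { return major_; }
  unsigned int minor() const { return minor_; }
  unsigned int patch() const { return patch_; }
};
```

With the #undefs in place, `Version v{1, 2, 3}; v.major();` refers to a member function that is really named `major`, which is exactly why the exported symbol name would change compared to a build where the macro was still active.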

> G++ Warning about libc system macros `major` and `minor` prevents Mesos build
> -
>
> Key: MESOS-8724
> URL: https://issues.apache.org/jira/browse/MESOS-8724
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>
> On Linux systems, the header `<sys/sysmacros.h>` defines three macros called 
> makedev(), major() and minor(). (See also 
> [http://man7.org/linux/man-pages/man3/makedev.3.html])
> Trying to compile Mesos using g++ 7.2.0 leads to the following warning, which `-Werror` promotes to an error:
> {noformat}
> ../include/csi/csi.pb.h:6042:13: error: In the GNU C Library, "minor" is 
> defined
>  by <sys/sysmacros.h>. For historical compatibility, it is
>  currently defined by <sys/types.h> as well, but we plan to
>  remove this soon. To use "minor", include <sys/sysmacros.h>
>  directly. If you did not intend to use a system-defined macro
>  "minor", you should undefine it after including <sys/types.h>. [-Werror]
>  inline ::google::protobuf::uint32 Version::minor() const {
> {noformat}
> The root cause is that csi.proto defines the following protobuf message:
> {noformat}
> message Version {
>   uint32 major = 1;  // This field is REQUIRED.
>   uint32 minor = 2;  // This field is REQUIRED.
>   uint32 patch = 3;  // This field is REQUIRED.
> }
> {noformat}
> The generated C++ header `csi.pb.h` will contain, amongst others, the 
> following function:
> {noformat}
> #include <string>
> // [6000 lines of code...]
> inline ::google::protobuf::uint32 Version::major() const {
>   // @@protoc_insertion_point(field_get:csi.Version.major)
>   return major_;
> }
> {noformat}
> And the recursive include structure of the header `<string>` leads to 
> `stdlib.h` as follows:
> {noformat}
> .   /usr/include/c++/7/string
> ..  /usr/include/c++/7/bits/basic_string.h
> ... /usr/include/c++/7/ext/string_conversions.h
>     /usr/include/c++/7/cstdlib
> .   /usr/include/stdlib.h
> ..  /usr/include/x86_64-linux-gnu/sys/types.h
> ... /usr/include/x86_64-linux-gnu/sys/sysmacros.h{noformat}
>  





[jira] [Created] (MESOS-8728) Don't print full usage for invocation errors

2018-03-23 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8728:
--

 Summary: Don't print full usage for invocation errors
 Key: MESOS-8728
 URL: https://issues.apache.org/jira/browse/MESOS-8728
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


The current usage string for mesos-master comes in at 399 lines, and for 
mesos-agent at 685 lines.

 

Printing such a wall of text will overflow most terminal windows, making it 
necessary to scroll up to see the actual error when invoking mesos with an 
incorrect command line.





[jira] [Commented] (MESOS-8728) Don't print full usage for invocation errors

2018-03-23 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411655#comment-16411655
 ] 

Benno Evers commented on MESOS-8728:


https://reviews.apache.org/r/63733/

> Don't print full usage for invocation errors
> 
>
> Key: MESOS-8728
> URL: https://issues.apache.org/jira/browse/MESOS-8728
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>
> The current usage string for mesos-master comes in at 399 lines, and for 
> mesos-agent at 685 lines.
>  
> Printing such a wall of text will overflow most terminal windows, making it 
> necessary to scroll up to see the actual error when invoking mesos with an 
> incorrect command line.





[jira] [Commented] (MESOS-8711) SlaveTest.ChangeDomain is disabled.

2018-03-23 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411712#comment-16411712
 ] 

Benno Evers commented on MESOS-8711:


https://reviews.apache.org/r/66248/

> SlaveTest.ChangeDomain is disabled.
> ---
>
> Key: MESOS-8711
> URL: https://issues.apache.org/jira/browse/MESOS-8711
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Major
>  Labels: disabled-test, flaky-test
>
> This test has been disabled in 
> https://github.com/apache/mesos/commit/c0468b240842d4aaf04249cb0a58c59c43d1850d.
>  We should either fix or remove it.





[jira] [Commented] (MESOS-7616) Consider supporting changes to agent's domain without full drain.

2018-03-27 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415463#comment-16415463
 ] 

Benno Evers commented on MESOS-7616:


Bookkeeping note: I've assigned the same number of story points to this and the 
corresponding epic MESOS-1739; please correct me if this isn't the right 
accounting method, @[~vinodkone].

> Consider supporting changes to agent's domain without full drain.
> -
>
> Key: MESOS-7616
> URL: https://issues.apache.org/jira/browse/MESOS-7616
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Neil Conway
>Assignee: Benno Evers
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.5.0
>
>
> In the initial review chain, any change to an agent's domain requires a full 
> drain. This is simple and straightforward, but it makes it more difficult for 
> operators to opt-in to using fault domains.
> We should consider allowing agents to transition from "no configured domain" 
> to "configured domain" without requiring an agent drain. This has some 
> complications, however: e.g., without an API for communicating changes in an 
> agent's configuration to frameworks, they might not realize that an agent's 
> domain has changed.





[jira] [Assigned] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2018-03-27 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-1466:
--

Resolution: Fixed
  Assignee: Meng Zhu

> Race between executor exited event and launch task can cause overcommit of 
> resources
> 
>
> Key: MESOS-1466
> URL: https://issues.apache.org/jira/browse/MESOS-1466
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Major
>  Labels: reliability, twitter
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave which needs to commit for resources 
> for task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's 
> resources causing an overcommit of resources.





[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2018-03-27 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415612#comment-16415612
 ] 

Benno Evers commented on MESOS-1466:


If I understand the issue correctly, this race seems to have been eliminated as 
a side-effect of introducing the `launch_executor` flag in Mesos 1.5:

When the master sends the `RunTaskMessage` to the agent, it thinks that the 
specified executor is still running on the agent, so it will set 
`launch_executor = false`:
{noformat}
// src/master/master.cpp:3841
bool Master::isLaunchExecutor(
    const ExecutorID& executorId,
    Framework* framework,
    Slave* slave) const
{
  CHECK_NOTNULL(framework);
  CHECK_NOTNULL(slave);

  if (!slave->hasExecutor(framework->id(), executorId)) {
    CHECK(!framework->hasExecutor(slave->id, executorId))
      << "Executor '" << executorId
      << "' known to the framework " << *framework
      << " but unknown to the agent " << *slave;

    return true;
  }

  return false;
}{noformat}
On the slave, when the executor doesn't exist anymore, the task is dropped with 
reason `REASON_EXECUTOR_TERMINATED`:
{noformat}
// src/slave/slave.cpp:2881

    // Master does not want to launch executor.
    if (executor == nullptr) {
      // Master wants no new executor launched and there is none running on
      // the agent. This could happen if the task expects some previous
      // tasks to launch the executor. However, the earlier task got killed
      // or dropped hence did not launch the executor but the master doesn't
      // know about it yet because the `ExitedExecutorMessage` is still in
      // flight. In this case, we will drop the task.
      //
      // We report TASK_DROPPED to the framework because the task was
      // never launched. For non-partition-aware frameworks, we report
      // TASK_LOST for backward compatibility.
      mesos::TaskState taskState = TASK_DROPPED;
      if (!protobuf::frameworkHasCapability(
              frameworkInfo, FrameworkInfo::Capability::PARTITION_AWARE)) {
        taskState = TASK_LOST;
      }

      foreach (const TaskInfo& _task, tasks) {
        const StatusUpdate update = protobuf::createStatusUpdate(
            frameworkId,
            info.id(),
            _task.task_id(),
            taskState,
            TaskStatus::SOURCE_SLAVE,
            id::UUID::random(),
            "No executor is expected to launch and there is none running",
            TaskStatus::REASON_EXECUTOR_TERMINATED,
            executorId);

        statusUpdate(update, UPID());
      }

      // We do not send `ExitedExecutorMessage` here because the expectation
      // is that there is already one on the fly to master. If the message
      // gets dropped, we will hopefully reconcile with the master later.

      return;
    }{noformat}
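
To make the interplay of the two snippets easier to follow, here is a heavily simplified model of the decision logic. It is only a sketch under stated assumptions: `Outcome`, `isLaunchExecutor()` as a free function and `handleRunTask()` are hypothetical names that do not exist in this form in Mesos, executors are identified by plain strings, and the case where the master requests a launch while an executor with the same ID is still running is not modeled.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical outcomes of handling a task on the agent.
enum class Outcome { LAUNCH_TASK_AND_EXECUTOR, LAUNCH_TASK, DROP_TASK };

// Master side: request a new executor only if the master does not know of a
// running one (the idea behind `Master::isLaunchExecutor()`).
bool isLaunchExecutor(
    const std::set<std::string>& executorsKnownToMaster,
    const std::string& executorId)
{
  return executorsKnownToMaster.count(executorId) == 0;
}

// Agent side: decide what to do with an incoming task.
Outcome handleRunTask(
    const std::set<std::string>& executorsOnAgent,
    const std::string& executorId,
    bool launchExecutor)
{
  if (launchExecutor) {
    // Not modeled: an executor with this ID already running on the agent.
    return Outcome::LAUNCH_TASK_AND_EXECUTOR;
  }

  // The master expects a running executor. If it has exited in the
  // meantime, the task is dropped instead of silently starting a new
  // executor whose resources the master never accounted for.
  return executorsOnAgent.count(executorId) > 0
    ? Outcome::LAUNCH_TASK
    : Outcome::DROP_TASK;
}
```

In the race described in this ticket, the master still lists the executor while the agent no longer has it, so `isLaunchExecutor()` yields `false` and `handleRunTask()` yields `Outcome::DROP_TASK`.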

> Race between executor exited event and launch task can cause overcommit of 
> resources
> 
>
> Key: MESOS-1466
> URL: https://issues.apache.org/jira/browse/MESOS-1466
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Vinod Kone
>Priority: Major
>  Labels: reliability, twitter
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave which needs to commit for resources 
> for task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's 
> resources causing an overcommit of resources.





[jira] [Created] (MESOS-8801) Add jemalloc as optional third-party memory allocator

2018-04-18 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8801:
--

 Summary: Add jemalloc as optional third-party memory allocator
 Key: MESOS-8801
 URL: https://issues.apache.org/jira/browse/MESOS-8801
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


As seen in MESOS-7876, using jemalloc over the default memory allocator can have 
performance benefits.

 

Additionally, this also supports the use case of MESOS-7944 by providing an 
out-of-the-box option to enable memory profiling (which is also the ticket 
referenced in the mailing list discussion about this).





[jira] [Commented] (MESOS-8801) Add jemalloc as optional third-party memory allocator

2018-04-18 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442743#comment-16442743
 ] 

Benno Evers commented on MESOS-8801:


Review: https://reviews.apache.org/r/63366

> Add jemalloc as optional third-party memory allocator
> -
>
> Key: MESOS-8801
> URL: https://issues.apache.org/jira/browse/MESOS-8801
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>
> As seen in MESOS-7876, using jemalloc over the default memory allocator can have 
> performance benefits.
>  
> Additionally, this also supports the use case of MESOS-7944 by providing 
> an out-of-the-box option to enable memory profiling (which is also the 
> ticket referenced in the mailing list discussion about this).





[jira] [Commented] (MESOS-8834) In libprocess, internal::send and internal::_send call each other; when outgoing[socket] always has packets to send, the stack is exhausted and a core dump occurs

2018-04-26 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454219#comment-16454219
 ] 

Benno Evers commented on MESOS-8834:


While I can't really understand the text, judging from the send -> _send -> 
send -> ... -> coredump sequence this looks like it might be the same issue as 
MESOS-8594?
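
As an illustration of the send -> _send -> send pattern mentioned above, the following sketch contrasts a mutually recursive way of draining a queue with an iterative one. It is not the libprocess code and not the actual fix: `Outgoing`, `sendRecursive()`, `onSentRecursive()` and `sendIterative()` are hypothetical names, and writing to a socket is replaced by simply popping from the queue.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>

// Hypothetical stand-in for the per-socket queue of outgoing messages.
typedef std::queue<std::string> Outgoing;

std::size_t onSentRecursive(Outgoing& outgoing, std::size_t depth);

// Recursive draining: every queued message adds another pair of nested
// calls, unless the compiler happens to optimize them into tail calls.
// Precondition: `outgoing` is not empty.
std::size_t sendRecursive(Outgoing& outgoing, std::size_t depth)
{
  outgoing.pop(); // Pretend the message was written to the socket.
  return onSentRecursive(outgoing, depth + 1);
}

std::size_t onSentRecursive(Outgoing& outgoing, std::size_t depth)
{
  if (outgoing.empty()) {
    return depth; // Number of nested sendRecursive() calls reached.
  }
  return sendRecursive(outgoing, depth);
}

// Iterative draining: the stack depth stays constant no matter how many
// messages are queued. Returns the number of messages sent.
std::size_t sendIterative(Outgoing& outgoing)
{
  std::size_t sent = 0;
  while (!outgoing.empty()) {
    outgoing.pop();
    ++sent;
  }
  return sent;
}
```

In the recursive variant the nesting depth grows linearly with the number of queued messages, which matches the reported crash once enough messages are buffered.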

> In libprocess, internal::send and internal::_send call each other; when 
> outgoing[socket] always has packets to send, the stack is exhausted and a 
> core dump occurs
> 
>
> Key: MESOS-8834
> URL: https://issues.apache.org/jira/browse/MESOS-8834
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.5.0
>Reporter: liwuqi
>Priority: Blocker
>  Labels: core, libprocess, send
>
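> If a process sends messages in a while(true) loop, a large number of messages 
> get buffered in outgoing[socket]. Since the actual sending is performed at the 
> lower level by internal::send and internal::_send, this results in recursive 
> calls:
> _send -> send -> _send -> send -> ... -> _send -> send -> 
> This makes the call stack grow continuously until the stack is exhausted and a 
> core dump occurs.
> In my local tests, the core dump occurred once the stack depth reached 40,000+.
> To solve this problem, the underlying message sending mechanism needs to be 
> modified.
>  
> Please look into this issue, thanks.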
>  





[jira] [Commented] (MESOS-8797) Check failed in the default executor while running `MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0` test.

2018-04-26 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454390#comment-16454390
 ] 

Benno Evers commented on MESOS-8797:


https://reviews.apache.org/r/66815/

> Check failed in the default executor while running 
> `MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0` test.
> 
>
> Key: MESOS-8797
> URL: https://issues.apache.org/jira/browse/MESOS-8797
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
> Environment: Centos 7 SSL (internal CI)
> master-[a95d9b8|https://github.com/apache/mesos/commit/a95d9b8fb53ab8fbf4a7b6d762c9e0749b4c013a]
>  (17-Apr-2018 14:03:14)
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test
> Attachments: DefaultExecutorTest.TaskUsesExecutor-badrun.txt
>
>
> {code:java}
> lt-mesos-default-executor: ../../3rdparty/stout/include/stout/option.hpp:119: 
> T& Option<T>::get() & [with T = std::basic_string<char>]: Assertion 
> `isSome()' failed.
> *** Aborted at 1523976443 (unix time) try "date -d @1523976443" if you are 
> using GNU date ***
> PC: @ 0x7efcfd11f1f7 __GI_raise
> *** SIGABRT (@0x4d44) received by PID 19780 (TID 0x7efcf5adb700) from PID 
> 19780; stack trace: ***
> @ 0x7efcfd9da5e0 (unknown)
> @ 0x7efcfd11f1f7 __GI_raise
> @ 0x7efcfd1208e8 __GI_abort
> @ 0x7efcfd118266 __assert_fail_base
> @ 0x7efcfd118312 __GI___assert_fail
> @ 0x55a05fa269f7 mesos::internal::DefaultExecutor::waited()
> @ 0x7efd002212d1 process::ProcessBase::consume()
> @ 0x7efd0023a52a process::ProcessManager::resume()
> @ 0x7efd0023dfa6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7efd003f9470 execute_native_thread_routine
> @ 0x7efcfd9d2e25 start_thread
> @ 0x7efcfd1e234d __clone
> {code}
> Observed this failure in internal CI for test
> {code:java}
>  MesosContainerizer/DefaultExecutorTest.TaskUsesExecutor/0{code}





[jira] [Commented] (MESOS-8687) Check failure in `ProcessBase::_consume()`.

2018-04-26 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454395#comment-16454395
 ] 

Benno Evers commented on MESOS-8687:


Review for the test fix: https://reviews.apache.org/r/66799/

> Check failure in `ProcessBase::_consume()`.
> ---
>
> Key: MESOS-8687
> URL: https://issues.apache.org/jira/browse/MESOS-8687
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.6.0
> Environment: ec2 CentOS 7 with SSL
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test, reliability
> Attachments: MasterAPITest.MasterFailover-with-CHECK.txt, 
> MasterFailover-badrun.txt
>
>
> Observed a segfault in the {{MasterAPITest.MasterFailover}} test:
> {noformat}
> 10:59:04 I0319 10:59:04.312197  3274 master.cpp:649] Authorization enabled
> 10:59:04 F0319 10:59:04.312772  3274 owned.hpp:110] Check failed: 'get()' 
> Must be non NULL
> 10:59:04 *** Check failure stack trace: ***
> 10:59:04 I0319 10:59:04.313470  3279 hierarchical.cpp:175] Initialized 
> hierarchical allocator process
> 10:59:04 I0319 10:59:04.313500  3279 whitelist_watcher.cpp:77] No whitelist 
> given
> 10:59:04 @ 0x7fe82d44e0cd  google::LogMessage::Fail()
> 10:59:04 @ 0x7fe82d44ff1d  google::LogMessage::SendToLog()
> 10:59:04 @ 0x7fe82d44dcb3  google::LogMessage::Flush()
> 10:59:04 @ 0x7fe82d450919  google::LogMessageFatal::~LogMessageFatal()
> 10:59:04 @ 0x7fe82d3cee16  google::CheckNotNull<>()
> 10:59:04 @ 0x7fe82d3b4253  process::ProcessBase::_consume()
> 10:59:04 @ 0x7fe82d3b4a66  
> _ZNO6lambda12CallableOnceIFN7process6FutureINS1_4http8ResponseEEEvEE10CallableFnINS_8internal7PartialIZNS1_11ProcessBase7consumeEONS1_9HttpEventEEUlRKNS1_5OwnedINS3_7Request_JSG_clEv
> 10:59:04 @ 0x7fe82c39c3ca  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureINS1_4http8ResponseclINS0_IFSE_vESE_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISD_EESt14default_deleteISQ_EEOSI_S3_E_JST_SI_St12_PlaceholderILi1EEclEOS3_
> 10:59:04 @ 0x7fe82d39f2c1  process::ProcessBase::consume()
> 10:59:04 @ 0x7fe82d3b84da  process::ProcessManager::resume()
> 10:59:04 @ 0x7fe82d3bbf56  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 10:59:04 @ 0x7fe82d577870  execute_native_thread_routine
> 10:59:04 @ 0x7fe82a761e25  start_thread
> 10:59:04 @ 0x7fe82986334d  __clone
> {noformat}
> Full test log is attached.





[jira] [Created] (MESOS-8869) Re-think semantics of os::system()

2018-05-02 Thread Benno Evers (JIRA)
Benno Evers created MESOS-8869:
--

 Summary: Re-think semantics of os::system()
 Key: MESOS-8869
 URL: https://issues.apache.org/jira/browse/MESOS-8869
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


The current posix implementation of stout's os::system() has two deficiencies 
that make its use harder than necessary:
 * Contrary to its documentation, in the case of an exec failure we don't 
return None but rather an exit code of 127.
 * The status obtained from waitpid() is returned directly, without 
WEXITSTATUS() being applied

Together, these imply that code relying on some particular return value must 
apply WEXITSTATUS() itself (breaking the platform independence afforded by 
os::system()), and it cannot check if the program returned a value of 127/-1 at 
all.

 

Intuitively, it seems the function might be more useful by only returning 0 if 
the call exited successfully, or None if any kind of error happened. We could 
also think about an additional platform-specific function
{code:java}
os::posix::system()
{code}
 that returns the raw return value of the executed function.
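
To illustrate the difference between the raw status and the decoded exit code, here is a minimal POSIX-only sketch. It is not stout's implementation: `ExitResult`, `decodeWaitStatus()` and `rawStatusOfChildExitingWith()` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical result type: `exited` is false if the child did not
// terminate normally (e.g. it was killed by a signal).
struct ExitResult
{
  bool exited;
  int code;
};

// Decodes a raw status as filled in by waitpid().
ExitResult decodeWaitStatus(int status)
{
  ExitResult result = {false, -1};
  if (WIFEXITED(status)) {
    result.exited = true;
    result.code = WEXITSTATUS(status);
  }
  return result;
}

// Demo helper: forks a child that exits immediately with `code` and
// returns the raw wait status, or -1 if fork() or waitpid() failed.
int rawStatusOfChildExitingWith(int code)
{
  pid_t pid = fork();
  if (pid == -1) {
    return -1;
  }
  if (pid == 0) {
    _exit(code); // Child: terminate without running parent cleanup.
  }
  int status = 0;
  if (waitpid(pid, &status, 0) == -1) {
    return -1;
  }
  return status;
}
```

A caller that compares the raw status directly against an expected exit code will generally get the wrong answer, because the raw status also encodes how the child terminated; only the value obtained via WEXITSTATUS() is the exit code itself.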

 





[jira] [Commented] (MESOS-7966) check for maintenance on agent causes fatal error

2018-05-28 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492785#comment-16492785
 ] 

Benno Evers commented on MESOS-7966:


I tried to reproduce it running a custom Mesos 1.2 (compiled from 
de306b5786de3c221bae1457c6f2ccaeb38eef9f), modifying the provided call.py 
script by changing the hostnames and moving the timestamp into the future and 
then running it via
{noformat}
while :; do
  python call.py;
done
{noformat}
for a few minutes, but could not create a master crash.

Looking at the code, I don't see any obvious race. The 
`Master::updateUnavailability()` handler in the master dispatches deletions for 
all existing inverse offers to the allocator actor, removes the offers from its 
own internal data structures, and afterwards dispatches a deletion for the 
maintenance to the allocator actor.

The assertion triggers because the allocator gets a request to update an 
inverse offer when the maintenance doesn't exist yet/anymore, but I haven't 
really found a code path that could lead to this.

If you could update your filtered log to include the log lines generated by the 
following block in master.cpp, I think this would help to pin down the exact 
sequence of deletions/additions that triggers the crash:

{noformat}
  if (unavailability.isSome()) {
    // TODO(jmlvanre): Add stream operator for unavailability.
    LOG(INFO) << "Updating unavailability of agent " << *slave
              << ", starting at "
              << Nanoseconds(unavailability.get().start().nanoseconds());
  } else {
    LOG(INFO) << "Removing unavailability of agent " << *slave;
  }
{noformat}

> check for maintenance on agent causes fatal error
> -
>
> Key: MESOS-7966
> URL: https://issues.apache.org/jira/browse/MESOS-7966
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Rob Johnson
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: mesosphere, reliability
>
> We interact with the maintenance API frequently to orchestrate gracefully 
> draining agents of tasks without impacting service availability.
> Occasionally we seem to trigger a fatal error in Mesos when interacting with 
> the api. This happens relatively frequently, and impacts us when downstream 
> frameworks (marathon) react badly to leader elections.
> Here is the log line that we see when the master dies:
> {code}
> F0911 12:18:49.543401 123748 hierarchical.cpp:872] Check failed: 
> slaves[slaveId].maintenance.isSome()
> {code}
> It's quite possible we're using the maintenance API in the wrong way. We're 
> happy to provide any other logs you need - please let me know what would be 
> useful for debugging.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

