[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-28 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184109#comment-16184109
 ] 

Andrei Budnik commented on MESOS-7500:
--

Examples of related failing tests:
[ FAILED ] CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled
[ FAILED ] CommandExecutorCheckTest.CommandCheckStatusChange
[ FAILED ] DefaultExecutorCheckTest.CommandCheckDeliveredAndReconciled
[ FAILED ] DefaultExecutorCheckTest.CommandCheckStatusChange
[ FAILED ] DefaultExecutorCheckTest.CommandCheckSeesParentsEnv
[ FAILED ] DefaultExecutorCheckTest.CommandCheckSharesWorkDirWithTask

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
> Issue Type: Bug
> Reporter: Alexander Rukletsov
> Assignee: Andrei Budnik
> Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via the agent are flaky on Apache CI. Here
> is an example from one of the failed runs: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-28 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184072#comment-16184072
 ] 

Andrei Budnik commented on MESOS-7500:
--

Command health checks are executed via the `LAUNCH_NESTED_CONTAINER_SESSION`
call and are launched inside a DEBUG container.
A DEBUG container is always launched together with a `mesos-io-switchboard`
process. After spawning `mesos-io-switchboard`, the agent tries to connect to
it via a Unix domain socket. If the DEBUG container exits before
`mesos-io-switchboard` exits, the agent sends SIGTERM to the switchboard
process after a 5 second delay. If `mesos-io-switchboard` exits because it was
killed by a signal, the `LAUNCH_NESTED_CONTAINER_SESSION` call is considered
failed, and so is the corresponding health check.
It turned out that `mesos-io-switchboard` is not an executable but a special
wrapper script generated by libtool. The first time this script is executed,
it triggers relinking of the executable. Relinking takes quite a while on slow
machines (e.g. in Apache CI): I've seen 8 seconds and more. So when the DEBUG
container exits, the agent sends SIGTERM (as described above) to a process
that is still relinking. This happens each time a health check is launched,
and as a result we see a bunch of failed tests in Apache CI.
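
To make the timing concrete, here is a minimal sketch of the timeline described above. It is a simplified model and not Mesos code: the function name and the sample numbers are assumptions for illustration, and only the 5 second SIGTERM delay and the 8 second relink time come from the observations in this comment.

```python
# Simplified model of the race between libtool relinking and the agent's SIGTERM.
# All times are seconds measured from the moment the agent spawns the switchboard.

SIGTERM_DELAY_SECONDS = 5.0  # delay before the agent signals the switchboard


def switchboard_killed_while_relinking(
    relink_seconds: float,
    debug_container_exit_seconds: float,
    sigterm_delay: float = SIGTERM_DELAY_SECONDS,
) -> bool:
    """Return True if SIGTERM arrives while the wrapper script is still relinking."""
    sigterm_at = debug_container_exit_seconds + sigterm_delay
    return relink_seconds > sigterm_at


if __name__ == "__main__":
    # Hypothetical fast machine: relinking finishes well before the signal.
    print(switchboard_killed_while_relinking(0.5, 1.0))
    # Hypothetical slow CI host: an 8 second relink outlasts a check that exits after 1 second.
    print(switchboard_killed_while_relinking(8.0, 1.0))
```

In this model a short check command makes the problem worse, because the 5 second countdown starts as soon as the DEBUG container exits.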
To fix this issue we need to force libtool/autotools to generate a binary
instead of a wrapper script, see:
1. https://autotools.io/libtool/wrappers.html
2. `info libtool`
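
A quick way to tell whether a given file in the build tree is a real binary or a wrapper script is to look at its first bytes. The sketch below is only a heuristic: the function name is made up for illustration, and the check for the word "libtool" in the script header is an assumption about what libtool writes into its wrappers, not something confirmed here.

```python
ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF binary


def classify_executable(path: str) -> str:
    """Classify a file as 'elf', 'libtool-wrapper', 'script', or 'other'."""
    with open(path, "rb") as f:
        head = f.read(4096)
    if head.startswith(ELF_MAGIC):
        return "elf"
    if head.startswith(b"#!"):
        # Assumption: libtool wrappers mention "libtool" in their header comments.
        if b"libtool" in head:
            return "libtool-wrapper"
        return "script"
    return "other"
```

Calling `classify_executable` on `mesos-io-switchboard` in the build directory (the exact path depends on the build setup) should report something other than `elf` while the wrapper is still in place.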



[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-25 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179000#comment-16179000
 ] 

Andrei Budnik commented on MESOS-7500:
--

The issue is caused by the libtool wrapper script recompiling/relinking an
executable. E.g. when we launch `mesos-io-switchboard` for the first time, the
executable might be missing, so the wrapper script starts to compile/link the
corresponding executable. On slow machines this takes quite a while, hence
these tests become flaky.

One possible solution is to pass
[--disable-fast-install|http://mdcc.cx/pub/autobook/autobook-latest/html/autobook_85.html]
via the $CONFIGURATION environment variable to the docker helper script.
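
As a rough sketch of that idea, the helper below builds an environment in which `CONFIGURATION` carries the flag, keeping any flags that are already set. The function name is hypothetical, and the assumption that the helper script reads `CONFIGURATION` as a space-separated list of configure flags is mine.

```python
import os
from typing import Mapping, Optional

FLAG = "--disable-fast-install"


def with_disable_fast_install(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of env whose CONFIGURATION includes --disable-fast-install."""
    result = dict(os.environ if env is None else env)
    # Assumption: CONFIGURATION holds space-separated configure flags.
    flags = result.get("CONFIGURATION", "").split()
    if FLAG not in flags:
        flags.append(FLAG)
    result["CONFIGURATION"] = " ".join(flags)
    return result
```

The returned dictionary can be passed as `env=` to `subprocess.run` when invoking the docker helper script; the script path is deliberately left out here.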

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
> Issue Type: Bug
> Reporter: Alexander Rukletsov
> Assignee: Gastón Kleiman
> Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via the agent are flaky on Apache CI. Here
> is an example from one of the failed runs: https://pastebin.com/g2mPgYzu





[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-21 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174785#comment-16174785
 ] 

Andrei Budnik commented on MESOS-7500:
--

Another example from a failed run, including debug output 
(https://reviews.apache.org/r/59107):
https://pastebin.com/iKA1WaZB



[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-05-26 Thread Gastón Kleiman (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026856#comment-16026856
 ] 

Gastón Kleiman commented on MESOS-7500:
---

The failures seem to be related to the agent not being able to attach to the 
DEBUG container launched by the health checker.

Attaching is, however, not really necessary for checks, so I created a
[design document|https://docs.google.com/document/d/1YCMtH8i2-ovTVtKDsCTrXdygS7ieaSrJLVnFbR66qfA/]
with two proposals that would make it possible to start DEBUG containers
without an I/O switchboard.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)