[jira] [Updated] (MESOS-7643) The order of isolators provided in '--isolation' flag is not preserved and instead sorted alphabetically

2017-06-27 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-7643:

Target Version/s: 1.2.2, 1.3.1, 1.4.0, 1.1.3
  Labels: isolation  (was: )

> The order of isolators provided in '--isolation' flag is not preserved and 
> instead sorted alphabetically
> 
>
> Key: MESOS-7643
> URL: https://issues.apache.org/jira/browse/MESOS-7643
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.2, 1.2.0, 1.3.0
>Reporter: Michael Cherny
>Assignee: Gilbert Song
>  Labels: isolation
>
> According to the documentation and comments in the code, the order of the 
> entries in the --isolation flag should specify the ordering of the isolators. 
> Specifically, the `create` and `prepare` calls for each isolator should run 
> serially in the order in which they appear in the --isolation flag, while the 
> `cleanup` calls should be serialized in reverse order (with the exception of 
> the filesystem isolator, which always runs first).
> But in fact, the isolators provided in the '--isolation' flag are sorted 
> alphabetically.
> That happens in [this line of 
> code|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L377],
> which collects the isolator names into a 'set' (apparently instead of a list 
> or vector), and 'set' is a sorted container.
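
A minimal standalone sketch of the underlying C++ behavior (the isolator names 
below are illustrative only, not tied to the actual flag parsing): 'std::set' 
keeps its elements sorted, while 'std::vector' preserves insertion order.

{code}
#include <iostream>
#include <set>
#include <string>
#include <vector>

int main()
{
  // The same three names, inserted in the same order.
  std::set<std::string> sorted = {
      "filesystem/posix", "docker/runtime", "cgroups/cpu"};
  std::vector<std::string> ordered = {
      "filesystem/posix", "docker/runtime", "cgroups/cpu"};

  // Prints alphabetically: cgroups/cpu docker/runtime filesystem/posix
  for (const std::string& name : sorted) { std::cout << name << " "; }
  std::cout << std::endl;

  // Prints in insertion order: filesystem/posix docker/runtime cgroups/cpu
  for (const std::string& name : ordered) { std::cout << name << " "; }
  std::cout << std::endl;

  return 0;
}
{code}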



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7173) CMake does not define `GIT_SHA` etc. in build.cpp

2017-06-27 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer updated MESOS-7173:

Shepherd: Joseph Wu  (was: Alex Clemmer)

> CMake does not define `GIT_SHA` etc. in build.cpp
> -
>
> Key: MESOS-7173
> URL: https://issues.apache.org/jira/browse/MESOS-7173
> Project: Mesos
>  Issue Type: Bug
> Environment: CMake
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>Priority: Minor
>  Labels: cmake, microsoft, windows
>
> `build.cpp` expects `BUILD_GIT_{SHA,BRANCH,TAG}` to be defined by the build 
> system (CMake), but currently they are not.
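
A minimal sketch of the expected plumbing (variable names are illustrative; the 
exact code in build.cpp may differ): unless the build system passes these 
macros as preprocessor definitions, e.g. -DBUILD_GIT_SHA="abc123", the guarded 
branches are simply compiled out.

{code}
#include <string>

// Each constant exists only when the build system passes the
// corresponding -D flag to the compiler.
#ifdef BUILD_GIT_SHA
const std::string gitSha = BUILD_GIT_SHA;
#endif

#ifdef BUILD_GIT_BRANCH
const std::string gitBranch = BUILD_GIT_BRANCH;
#endif

#ifdef BUILD_GIT_TAG
const std::string gitTag = BUILD_GIT_TAG;
#endif
{code}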



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7603) longjmp error in libcurl

2017-06-27 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065463#comment-16065463
 ] 

Charles Allen commented on MESOS-7603:
--

This was resolved internally by linking libcurl against c-ares.

> longjmp error in libcurl
> 
>
> Key: MESOS-7603
> URL: https://issues.apache.org/jira/browse/MESOS-7603
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.2.0
>Reporter: Charles Allen
>
> We encountered the following error when the fetcher tried to run on a Mesos 
> 1.2.0 agent launched through systemd:
> {code}
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: *** longjmp causes 
> uninitialized stack frame ***: /usr/sbin/mesos-agent terminated
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: === Backtrace: 
> =
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libc.so.6(+0x71c07)[0x7f8d08f5fc07]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libc.so.6(__fortify_fail+0x47)[0x7f8d08fedb17]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libc.so.6(+0xff56d)[0x7f8d08fed56d]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libc.so.6(__longjmp_chk+0x38)[0x7f8d08fed4c8]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libcurl.so.4(+0xae34)[0x7f8d08519e34]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libpthread.so.0(+0x116b0)[0x7f8d098386b0]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libpthread.so.0(pthread_cond_wait+0xbf)[0x7f8d0983448f]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libstdc++.so.6(_ZNSt18condition_variable4waitERSt11unique_lockISt5mutexE+0x2b)[0x7f8d095968ab]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libmesos-1.2.0.so(_ZN7process14ProcessManager4waitERKNS_4UPIDE+0x328)[0x7f8d0b47f3d8]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libmesos-1.2.0.so(_ZN7process4waitERKNS_4UPIDERK8Duration+0x2e7)[0x7f8d0b486117]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /usr/sbin/mesos-agent(+0x12810)[0x557e1d691810]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /lib64/libc.so.6(__libc_start_main+0xfc)[0x7f8d08f0e93c]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: 
> /usr/sbin/mesos-agent(+0x139c9)[0x557e1d6929c9]
> Jun 01 22:55:53 ip-172-19-68-109 mesos-agent[103454]: === Memory map: 
> 
> {code}
> This looks like the error described here:
> https://stackoverflow.com/questions/9191668/error-longjmp-causes-uninitialized-stack-frame
> where the solution is either to set {{curl_easy_setopt(curl, CURLOPT_NOSIGNAL, 
> 1)}} or to build libcurl with a special config option.
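
A minimal sketch of the first workaround (the URL below is a placeholder): 
{{CURLOPT_NOSIGNAL}} stops libcurl from using signals, and hence {{longjmp}}, 
for resolver timeouts in multi-threaded programs. The "special config option" 
alternative is building libcurl against an asynchronous resolver such as 
c-ares, which matches the resolution noted earlier in this thread.

{code}
#include <curl/curl.h>

int fetch()
{
  CURL* curl = curl_easy_init();
  if (curl == nullptr) {
    return 1;
  }

  // Disable signal/longjmp-based timeout handling; needed for thread
  // safety when libcurl uses the standard (blocking) resolver.
  curl_easy_setopt(curl, CURLOPT_NOSIGNAL, 1L);
  curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/artifact");

  CURLcode result = curl_easy_perform(curl);
  curl_easy_cleanup(curl);

  return result == CURLE_OK ? 0 : 1;
}
{code}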



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7643) The order of isolators provided in '--isolation' flag is not preserved and instead sorted alphabetically

2017-06-27 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-7643:
-

Assignee: Gilbert Song

> The order of isolators provided in '--isolation' flag is not preserved and 
> instead sorted alphabetically
> 
>
> Key: MESOS-7643
> URL: https://issues.apache.org/jira/browse/MESOS-7643
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.2, 1.2.0, 1.3.0
>Reporter: Michael Cherny
>Assignee: Gilbert Song
>
> According to the documentation and comments in the code, the order of the 
> entries in the --isolation flag should specify the ordering of the isolators. 
> Specifically, the `create` and `prepare` calls for each isolator should run 
> serially in the order in which they appear in the --isolation flag, while the 
> `cleanup` calls should be serialized in reverse order (with the exception of 
> the filesystem isolator, which always runs first).
> But in fact, the isolators provided in the '--isolation' flag are sorted 
> alphabetically.
> That happens in [this line of 
> code|https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L377],
> which collects the isolator names into a 'set' (apparently instead of a list 
> or vector), and 'set' is a sorted container.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-4092) Try to re-establish connection on ping timeouts with agent before removing it

2017-06-27 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065357#comment-16065357
 ] 

Ilya Pronin commented on MESOS-4092:


Looks like our problem here is that we use our health check to detect both 
remote-peer failure and link failure, but don't distinguish between them. When 
a connection breaks, libprocess issues an {{ExitedEvent}} and opens a new 
connection when required. But in the case of a network problem, a relatively 
long time may pass before the TCP retransmission limit is reached and the 
connection is declared dead.

One possible solution would be to use the aforementioned "relink" functionality 
at some point during agent pinging. We could use a strategy similar to the one 
used by TCP: after N consecutive failed pings, "relink" before sending the next 
ping. Plus a similar thing on the agent's side.

Another possible solution would be to use the TCP keepalive mechanism, tuned to 
detect broken connections faster than {{agent_ping_timeout * 
max_agent_ping_timeouts}}. Or we could mess with the TCP user timeout, but IMO 
that's a road to hell, and AFAIK the user timeout is available only on Linux.
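
For the keepalive idea, a minimal Linux-only sketch (the values are 
illustrative; nothing like this exists in libprocess today): per-socket 
keepalive settings that would declare a dead link in roughly 
idle + count * interval seconds.

{code}
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void enableFastKeepalive(int fd)
{
  int on = 1;
  int idle = 15;     // Seconds of idleness before the first probe.
  int interval = 5;  // Seconds between probes.
  int count = 3;     // Unanswered probes before the connection is dead.

  setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
}
{code}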

> Try to re-establish connection on ping timeouts with agent before removing it
> -
>
> Key: MESOS-4092
> URL: https://issues.apache.org/jira/browse/MESOS-4092
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Ian Downes
>
> The SlaveObserver will trigger an agent to be removed after 
> {{flags.max_slave_ping_timeouts}} timeouts of {{flags.slave_ping_timeout}}. 
> This can occur because of transient network failures, e.g., gray failures of 
> a switch uplink exhibiting heavy or total packet loss. Some network 
> architectures are designed to tolerate such gray failures and support 
> multiple paths between hosts. This can be implemented with equal-cost 
> multi-path routing (ECMP), where flows are hashed by their 5-tuple to multiple 
> possible uplinks. In such networks, re-establishing a TCP connection will 
> almost certainly use a new source port and thus will likely be hashed to a 
> different uplink, avoiding the failed uplink and restoring connectivity 
> with the agent.
> After failing to receive pongs, the SlaveObserver should next try to 
> re-establish a TCP connection (with exponential back-off) before declaring 
> the agent lost. This can avoid significant disruption, where large numbers 
> of agents reached through a single failed link could otherwise be removed 
> unnecessarily, while still ensuring that agents that are truly lost are 
> recognized as such.
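
A minimal sketch of the proposed reconnect policy (hypothetical; this is not 
libprocess code, and all names are made up): retry the connection with 
exponential back-off and only declare the agent lost after all attempts fail.

{code}
#include <chrono>
#include <functional>
#include <thread>

// Returns true if a connection was re-established. Each attempt uses a
// new source port, so under ECMP it will likely hash onto a different
// (healthy) uplink.
bool reconnectWithBackoff(
    const std::function<bool()>& tryConnect,
    int maxAttempts = 5,
    std::chrono::milliseconds delay = std::chrono::milliseconds(100))
{
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (tryConnect()) {
      return true;
    }
    std::this_thread::sleep_for(delay);
    delay *= 2;  // Exponential back-off between attempts.
  }
  return false;  // Only now should the agent be declared lost.
}
{code}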



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-4092) Try to re-establish connection on ping timeouts with agent before removing it

2017-06-27 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065357#comment-16065357
 ] 

Ilya Pronin edited comment on MESOS-4092 at 6/27/17 7:43 PM:
-

Seems that our problem here is that we use our health check to detect both 
remote-peer failure and link failure, but don't distinguish between them. When 
a connection breaks, libprocess issues an {{ExitedEvent}} and opens a new 
connection when required. But in the case of a network problem, a relatively 
long time may pass before the TCP retransmission limit is reached and the 
connection is declared dead.

One possible solution would be to use the aforementioned "relink" functionality 
at some point during agent pinging. We could use a strategy similar to the one 
used by TCP: after N consecutive failed pings, "relink" before sending the next 
ping. Plus a similar thing on the agent's side.

Another possible solution would be to use the TCP keepalive mechanism, tuned to 
detect broken connections faster than {{agent_ping_timeout * 
max_agent_ping_timeouts}}. Or we could mess with the TCP user timeout, but IMO 
that's a road to hell, and AFAIK the user timeout is available only on Linux.


was (Author: ipronin):
Looks like our problem here is that we use our health check to detect both 
remote-peer failure and link failure, but don't distinguish between them. When 
a connection breaks, libprocess issues an {{ExitedEvent}} and opens a new 
connection when required. But in the case of a network problem, a relatively 
long time may pass before the TCP retransmission limit is reached and the 
connection is declared dead.

One possible solution would be to use the aforementioned "relink" functionality 
at some point during agent pinging. We could use a strategy similar to the one 
used by TCP: after N consecutive failed pings, "relink" before sending the next 
ping. Plus a similar thing on the agent's side.

Another possible solution would be to use the TCP keepalive mechanism, tuned to 
detect broken connections faster than {{agent_ping_timeout * 
max_agent_ping_timeouts}}. Or we could mess with the TCP user timeout, but IMO 
that's a road to hell, and AFAIK the user timeout is available only on Linux.

> Try to re-establish connection on ping timeouts with agent before removing it
> -
>
> Key: MESOS-4092
> URL: https://issues.apache.org/jira/browse/MESOS-4092
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Ian Downes
>
> The SlaveObserver will trigger an agent to be removed after 
> {{flags.max_slave_ping_timeouts}} timeouts of {{flags.slave_ping_timeout}}. 
> This can occur because of transient network failures, e.g., gray failures of 
> a switch uplink exhibiting heavy or total packet loss. Some network 
> architectures are designed to tolerate such gray failures and support 
> multiple paths between hosts. This can be implemented with equal-cost 
> multi-path routing (ECMP), where flows are hashed by their 5-tuple to multiple 
> possible uplinks. In such networks, re-establishing a TCP connection will 
> almost certainly use a new source port and thus will likely be hashed to a 
> different uplink, avoiding the failed uplink and restoring connectivity 
> with the agent.
> After failing to receive pongs, the SlaveObserver should next try to 
> re-establish a TCP connection (with exponential back-off) before declaring 
> the agent lost. This can avoid significant disruption, where large numbers 
> of agents reached through a single failed link could otherwise be removed 
> unnecessarily, while still ensuring that agents that are truly lost are 
> recognized as such.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7634) OsTest.ChownNoAccess fails on s390x machines

2017-06-27 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065327#comment-16065327
 ] 

Vinod Kone commented on MESOS-7634:
---

Sorry for the delay in getting back. 

It definitely did fail in these CI jobs:

https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-s390x-WIP/14/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,label_exp=mesos/console

https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-s390x-WIP/14/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,label_exp=mesos/console

Do you have ssh access to the VM that runs these Jenkins jobs? It is labeled 
"mesos1". Maybe try to see if you can repro on that particular VM (running as 
the Jenkins user).


> OsTest.ChownNoAccess fails on s390x machines
> 
>
> Key: MESOS-7634
> URL: https://issues.apache.org/jira/browse/MESOS-7634
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>
> Running a custom branch of Mesos (with some fixes to the docker build scripts 
> for s390x) on s390x-based CI machines produces the following error when 
> running the stout tests.
> {code}
> [ RUN  ] OsTest.ChownNoAccess
> ../../../../3rdparty/stout/tests/os_tests.cpp:839: Failure
> Value of: os::chown(uid.get(), gid.get(), "one", true).isError()
>   Actual: false
> Expected: true
> ../../../../3rdparty/stout/tests/os_tests.cpp:840: Failure
> Value of: os::chown(uid.get(), gid.get(), "one/two", true).isError()
>   Actual: false
> {code}
> One can repro this by building Mesos from my custom branch here: 
> https://github.com/vinodkone/mesos/tree/vinod/s390x



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7728) Java HTTP adapter crashes JVM when leading master disconnects.

2017-06-27 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7728:
---
Target Version/s: 1.2.2, 1.3.1, 1.4.0, 1.1.3

> Java HTTP adapter crashes JVM when leading master disconnects.
> --
>
> Key: MESOS-7728
> URL: https://issues.apache.org/jira/browse/MESOS-7728
> Project: Mesos
>  Issue Type: Bug
>  Components: java api
>Affects Versions: 1.1.2, 1.2.1, 1.3.0
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> When a Java scheduler using the HTTP v0-to-v1 adapter loses the leading Mesos 
> master, {{V0ToV1AdapterProcess::disconnected()}} is invoked, which in turn 
> invokes the Java scheduler [code via 
> JNI|https://github.com/apache/mesos/blob/87c38b9e2bc5b1030a071ddf0aab69db70d64781/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L446].
> This call uses the wrong object, {{jmesos}} instead of {{jscheduler}}, which 
> crashes the JVM:
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4bca3849bf, pid=21, tid=0x7f4b2ac45700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 
> 1.8.0_131-b11)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x6d39bf]  jni_invoke_nonstatic(JNIEnv_*, JavaValue*, 
> _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, Thread*)+0x1af
> {noformat}
> {noformat}
> Stack: [0x7f4b2a445000,0x7f4b2ac46000],  sp=0x7f4b2ac44a80,  free 
> space=8190k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> V  [libjvm.so+0x6d39bf]  jni_invoke_nonstatic(JNIEnv_*, JavaValue*, 
> _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, Thread*)+0x1af
> V  [libjvm.so+0x6d7fef]  jni_CallVoidMethodV+0x10f
> C  [libmesos-1.2.0.so+0x1aa32d3]  JNIEnv_::CallVoidMethod(_jobject*, 
> _jmethodID*, ...)+0x93
> {noformat}
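
A hedged sketch of the call pattern at fault (the identifiers and the method 
signature below are illustrative, not the actual adapter code): the receiver 
passed to {{CallVoidMethod}} must be the object whose class the method ID was 
resolved from.

{code}
#include <jni.h>

// Hypothetical sketch, not the actual Mesos JNI code.
void invokeDisconnected(JNIEnv* env, jobject jscheduler, jobject jmesos)
{
  jclass clazz = env->GetObjectClass(jscheduler);
  jmethodID disconnected = env->GetMethodID(
      clazz,
      "disconnected",
      "(Lorg/apache/mesos/v1/scheduler/Mesos;)V");

  // The receiver must be 'jscheduler'; passing 'jmesos' here (the bug
  // described above) invokes the method on an object of the wrong
  // class and crashes the JVM.
  env->CallVoidMethod(jscheduler, disconnected, jmesos);
}
{code}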



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7729) ExamplesTest.DynamicReservationFramework is flaky

2017-06-27 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-7729:
-

 Summary: ExamplesTest.DynamicReservationFramework is flaky
 Key: MESOS-7729
 URL: https://issues.apache.org/jira/browse/MESOS-7729
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone


Observed this on ASF CI

{code}
[ RUN  ] ExamplesTest.DynamicReservationFramework
Using temporary directory '/tmp/ExamplesTest_DynamicReservationFramework_uPVIaN'
/mesos/mesos-1.4.0/src/tests/dynamic_reservation_framework_test.sh: line 19: 
/mesos/mesos-1.4.0/_build/src/colors.sh: No such file or directory
/mesos/mesos-1.4.0/src/tests/dynamic_reservation_framework_test.sh: line 20: 
/mesos/mesos-1.4.0/_build/src/atexit.sh: No such file or directory
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0627 16:04:20.661948  8847 process.cpp:1282] libprocess is initialized on 
172.17.0.3:37113 with 16 worker threads
I0627 16:04:20.662199  8847 logging.cpp:199] Logging to STDERR
I0627 16:04:20.674317  8847 leveldb.cpp:174] Opened db in 3.343216ms
I0627 16:04:20.675655  8847 leveldb.cpp:181] Compacted db in 1.264481ms
I0627 16:04:20.675797  8847 leveldb.cpp:196] Created db iterator in 89655ns
I0627 16:04:20.675829  8847 leveldb.cpp:202] Seeked to beginning of db in 5551ns
I0627 16:04:20.675848  8847 leveldb.cpp:271] Iterated through 0 keys in the db 
in 1133ns
I0627 16:04:20.676103  8847 replica.cpp:779] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0627 16:04:20.680465  8873 recover.cpp:451] Starting replica recovery
I0627 16:04:20.681649  8873 recover.cpp:477] Replica is in EMPTY status
I0627 16:04:20.682160  8847 local.cpp:272] Creating default 'local' authorizer
I0627 16:04:20.684504  8884 replica.cpp:676] Replica in EMPTY status received a 
broadcasted recover request from __req_res__(1)@172.17.0.3:37113
I0627 16:04:20.685750  8882 recover.cpp:197] Received a recover response from a 
replica in EMPTY status
I0627 16:04:20.686617  8877 recover.cpp:568] Updating replica status to STARTING
I0627 16:04:20.688508  8877 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 741914ns
I0627 16:04:20.688544  8881 master.cpp:438] Master 
c1c3a180-5bd3-42fa-b84e-e2c30aba7364 (089cb2cc2625) started on 172.17.0.3:37113
I0627 16:04:20.688551  8877 replica.cpp:322] Persisted replica status to 
STARTING
I0627 16:04:20.689095  8878 recover.cpp:477] Replica is in STARTING status
I0627 16:04:20.688582  8881 master.cpp:440] Flags at startup: 
--acls="permissive: true
register_frameworks {
  principals {
type: ANY
  }
  roles {
type: SOME
values: "test"
  }
}
" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="false" --authenticate_frameworks="false" 
--authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticators="crammd5" 
--authorizers="local" 
--credentials="/tmp/ExamplesTest_DynamicReservationFramework_uPVIaN/credentials"
 --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="20secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-1.4.0/src/webui" 
--work_dir="/tmp/mesos-9H3Est/master" --zk_session_timeout="10secs"
I0627 16:04:20.689460  8881 master.cpp:492] Master allowing unauthenticated 
frameworks to register
I0627 16:04:20.689476  8881 master.cpp:506] Master allowing unauthenticated 
agents to register
I0627 16:04:20.689482  8881 master.cpp:520] Master allowing HTTP frameworks to 
register without authentication
I0627 16:04:20.689494  8881 credentials.hpp:37] Loading credentials for 
authentication from 
'/tmp/ExamplesTest_DynamicReservationFramework_uPVIaN/credentials'
W0627 16:04:20.689620  8881 credentials.hpp:52] Permissions on credentials file 
'/tmp/ExamplesTest_DynamicReservationFramework_uPVIaN/credentials' are too 
open; it is recommended that your credentials file is NOT accessible by others
I0627 16:04:20.689817  8881 master.cpp:562] Using default 'crammd5' 
authenticator
I0627 16:04:20.690021  8881 authenticator.cpp:520] Initializing server SASL
I0627 16:04:20.690615  8878 replica.cpp:676] Replica in STARTING status 
received a broadcasted recover request from __req_res__(2)@172.17.0.3:37113
I0627 

[jira] [Created] (MESOS-7728) Java HTTP adapter crashes JVM when leading master disconnects.

2017-06-27 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-7728:
--

 Summary: Java HTTP adapter crashes JVM when leading master 
disconnects.
 Key: MESOS-7728
 URL: https://issues.apache.org/jira/browse/MESOS-7728
 Project: Mesos
  Issue Type: Bug
  Components: java api
Affects Versions: 1.3.0, 1.2.1, 1.1.2
Reporter: Alexander Rukletsov
Assignee: Alexander Rukletsov


When a Java scheduler using the HTTP v0-to-v1 adapter loses the leading Mesos 
master, {{V0ToV1AdapterProcess::disconnected()}} is invoked, which in turn 
invokes the Java scheduler [code via 
JNI|https://github.com/apache/mesos/blob/87c38b9e2bc5b1030a071ddf0aab69db70d64781/src/java/jni/org_apache_mesos_v1_scheduler_V0Mesos.cpp#L446].
This call uses the wrong object, {{jmesos}} instead of {{jscheduler}}, which 
crashes the JVM:
{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f4bca3849bf, pid=21, tid=0x7f4b2ac45700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 
1.8.0_131-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6d39bf]  jni_invoke_nonstatic(JNIEnv_*, JavaValue*, 
_jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, Thread*)+0x1af
{noformat}
{noformat}
Stack: [0x7f4b2a445000,0x7f4b2ac46000],  sp=0x7f4b2ac44a80,  free 
space=8190k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x6d39bf]  jni_invoke_nonstatic(JNIEnv_*, JavaValue*, _jobject*, 
JNICallType, _jmethodID*, JNI_ArgumentPusher*, Thread*)+0x1af
V  [libjvm.so+0x6d7fef]  jni_CallVoidMethodV+0x10f
C  [libmesos-1.2.0.so+0x1aa32d3]  JNIEnv_::CallVoidMethod(_jobject*, 
_jmethodID*, ...)+0x93
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7725) PersistentVolumeEndpointsTest.ReserveAndSlaveRemoval test is flaky

2017-06-27 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7725:
--
Description: 
Observed this on ASF CI.

{code}
[ RUN  ] PersistentVolumeEndpointsTest.ReserveAndSlaveRemoval
I0627 15:20:33.687146 30773 cluster.cpp:162] Creating default 'local' authorizer
I0627 15:20:33.691745 30795 master.cpp:438] Master 
d8d232e5-1689-4780-b232-c91e5c3277b1 (0b1049f05548) started on 172.17.0.2:44357
I0627 15:20:33.691800 30795 master.cpp:440] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="50ms" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/Wg4Ouh/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" --roles="role1" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/Wg4Ouh/master" 
--zk_session_timeout="10secs"
I0627 15:20:33.692142 30795 master.cpp:490] Master only allowing authenticated 
frameworks to register
I0627 15:20:33.692150 30795 master.cpp:504] Master only allowing authenticated 
agents to register
I0627 15:20:33.692154 30795 master.cpp:517] Master only allowing authenticated 
HTTP frameworks to register
I0627 15:20:33.692160 30795 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/Wg4Ouh/credentials'
I0627 15:20:33.692463 30795 master.cpp:562] Using default 'crammd5' 
authenticator
I0627 15:20:33.692612 30795 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0627 15:20:33.692831 30795 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0627 15:20:33.692942 30795 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0627 15:20:33.693061 30795 master.cpp:642] Authorization enabled
W0627 15:20:33.693076 30795 master.cpp:705] The '--roles' flag is deprecated. 
This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
more information
I0627 15:20:33.693354 30780 hierarchical.cpp:169] Initialized hierarchical 
allocator process
I0627 15:20:33.693359 30782 whitelist_watcher.cpp:77] No whitelist given
I0627 15:20:33.695943 30795 master.cpp:2161] Elected as the leading master!
I0627 15:20:33.695960 30795 master.cpp:1700] Recovering from registrar
I0627 15:20:33.696193 30795 registrar.cpp:345] Recovering registrar
I0627 15:20:33.697032 30795 registrar.cpp:389] Successfully fetched the 
registry (0B) in 811008ns
I0627 15:20:33.697147 30795 registrar.cpp:493] Applied 1 operations in 40183ns; 
attempting to update the registry
I0627 15:20:33.697922 30792 registrar.cpp:550] Successfully updated the 
registry in 709120ns
I0627 15:20:33.698020 30792 registrar.cpp:422] Successfully recovered registrar
I0627 15:20:33.698490 30789 master.cpp:1799] Recovered 0 agents from the 
registry (129B); allowing 10mins for agents to re-register
I0627 15:20:33.698511 30784 hierarchical.cpp:207] Skipping recovery of 
hierarchical allocator: nothing to recover
I0627 15:20:33.707849 30773 containerizer.cpp:230] Using isolation: 
posix/cpu,posix/mem,filesystem/posix,network/cni,environment_secret
W0627 15:20:33.708729 30773 backend.cpp:76] Failed to create 'aufs' backend: 
AufsBackend requires root privileges
W0627 15:20:33.708909 30773 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root privileges
I0627 15:20:33.708955 30773 provisioner.cpp:255] Using default backend 'copy'
I0627 15:20:33.711526 30773 cluster.cpp:448] Creating default 'local' authorizer
I0627 15:20:33.714450 30776 slave.cpp:249] Mesos agent started on 
(451)@172.17.0.2:44357
I0627 15:20:33.714649 30776 slave.cpp:250] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/PersistentVolumeEndpointsTest_ReserveAndSlaveRemoval_RnxQRd/store/appc"
 --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
--authenticatee="crammd5" --authentication_backoff_factor="1secs" 

[jira] [Updated] (MESOS-7725) PersistentVolumeEndpointsTest.ReserveAndSlaveRemoval test is flaky

2017-06-27 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7725:
--
Shepherd: Vinod Kone
  Sprint: Mesosphere Sprint 58
Story Points: 3

> PersistentVolumeEndpointsTest.ReserveAndSlaveRemoval test is flaky
> --
>
> Key: MESOS-7725
> URL: https://issues.apache.org/jira/browse/MESOS-7725
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Neil Conway
>  Labels: flaky-test, mesosphere-oncall
>
> Observed this on ASF CI.
> {code}
> [ RUN  ] PersistentVolumeEndpointsTest.ReserveAndSlaveRemoval
> I0627 15:20:33.687146 30773 cluster.cpp:162] Creating default 'local' 
> authorizer
> I0627 15:20:33.691745 30795 master.cpp:438] Master 
> d8d232e5-1689-4780-b232-c91e5c3277b1 (0b1049f05548) started on 
> 172.17.0.2:44357
> I0627 15:20:33.691800 30795 master.cpp:440] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="50ms" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/Wg4Ouh/credentials" 
> --filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" --roles="role1" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/Wg4Ouh/master" 
> --zk_session_timeout="10secs"
> I0627 15:20:33.692142 30795 master.cpp:490] Master only allowing 
> authenticated frameworks to register
> I0627 15:20:33.692150 30795 master.cpp:504] Master only allowing 
> authenticated agents to register
> I0627 15:20:33.692154 30795 master.cpp:517] Master only allowing 
> authenticated HTTP frameworks to register
> I0627 15:20:33.692160 30795 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/Wg4Ouh/credentials'
> I0627 15:20:33.692463 30795 master.cpp:562] Using default 'crammd5' 
> authenticator
> I0627 15:20:33.692612 30795 http.cpp:974] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I0627 15:20:33.692831 30795 http.cpp:974] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I0627 15:20:33.692942 30795 http.cpp:974] Creating default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I0627 15:20:33.693061 30795 master.cpp:642] Authorization enabled
> W0627 15:20:33.693076 30795 master.cpp:705] The '--roles' flag is deprecated. 
> This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
> more information
> I0627 15:20:33.693354 30780 hierarchical.cpp:169] Initialized hierarchical 
> allocator process
> I0627 15:20:33.693359 30782 whitelist_watcher.cpp:77] No whitelist given
> I0627 15:20:33.695943 30795 master.cpp:2161] Elected as the leading master!
> I0627 15:20:33.695960 30795 master.cpp:1700] Recovering from registrar
> I0627 15:20:33.696193 30795 registrar.cpp:345] Recovering registrar
> I0627 15:20:33.697032 30795 registrar.cpp:389] Successfully fetched the 
> registry (0B) in 811008ns
> I0627 15:20:33.697147 30795 registrar.cpp:493] Applied 1 operations in 
> 40183ns; attempting to update the registry
> I0627 15:20:33.697922 30792 registrar.cpp:550] Successfully updated the 
> registry in 709120ns
> I0627 15:20:33.698020 30792 registrar.cpp:422] Successfully recovered 
> registrar
> I0627 15:20:33.698490 30789 master.cpp:1799] Recovered 0 agents from the 
> registry (129B); allowing 10mins for agents to re-register
> I0627 15:20:33.698511 30784 hierarchical.cpp:207] Skipping recovery of 
> hierarchical allocator: nothing to recover
> I0627 15:20:33.707849 30773 containerizer.cpp:230] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni,environment_secret
> W0627 15:20:33.708729 30773 backend.cpp:76] Failed to create 'aufs' backend: 
> AufsBackend requires root privileges
> W0627 15:20:33.708909 30773 backend.cpp:76] Failed to create 'bind' backend: 

[jira] [Commented] (MESOS-3968) DiskQuotaTest.SlaveRecovery is flaky

2017-06-27 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065232#comment-16065232
 ] 

Vinod Kone commented on MESOS-3968:
---

Observed the same thing Neil observed in ASF CI.

{code}
[ RUN  ] DiskQuotaTest.SlaveRecovery
I0627 11:28:25.636018  4587 cluster.cpp:162] Creating default 'local' authorizer
I0627 11:28:25.641643  4609 master.cpp:438] Master 
08082988-3d7b-4a23-8092-66781efb5f6f (18fe836728c1) started on 172.17.0.4:34439
I0627 11:28:25.641682  4609 master.cpp:440] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/I85Ccm/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-1.4.0/_inst/share/mesos/webui" 
--work_dir="/tmp/I85Ccm/master" --zk_session_timeout="10secs"
I0627 11:28:25.642195  4609 master.cpp:490] Master only allowing authenticated 
frameworks to register
I0627 11:28:25.642216  4609 master.cpp:504] Master only allowing authenticated 
agents to register
I0627 11:28:25.642230  4609 master.cpp:517] Master only allowing authenticated 
HTTP frameworks to register
I0627 11:28:25.642247  4609 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/I85Ccm/credentials'
I0627 11:28:25.642676  4609 master.cpp:562] Using default 'crammd5' 
authenticator
I0627 11:28:25.642896  4609 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0627 11:28:25.643079  4609 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0627 11:28:25.643203  4609 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0627 11:28:25.643312  4609 master.cpp:642] Authorization enabled
I0627 11:28:25.643540  4611 hierarchical.cpp:169] Initialized hierarchical 
allocator process
I0627 11:28:25.643767  4613 whitelist_watcher.cpp:77] No whitelist given
I0627 11:28:25.647075  4607 master.cpp:2161] Elected as the leading master!
I0627 11:28:25.647130  4607 master.cpp:1700] Recovering from registrar
I0627 11:28:25.647503  4610 registrar.cpp:345] Recovering registrar
I0627 11:28:25.652940  4610 registrar.cpp:389] Successfully fetched the 
registry (0B) in 5.362176ms
I0627 11:28:25.653300  4610 registrar.cpp:493] Applied 1 operations in 
161908ns; attempting to update the registry
I0627 11:28:25.654299  4610 registrar.cpp:550] Successfully updated the 
registry in 913920ns
I0627 11:28:25.654633  4610 registrar.cpp:422] Successfully recovered registrar
I0627 11:28:25.655278  4611 master.cpp:1799] Recovered 0 agents from the 
registry (129B); allowing 10mins for agents to re-register
I0627 11:28:25.655741  4612 hierarchical.cpp:207] Skipping recovery of 
hierarchical allocator: nothing to recover
I0627 11:28:25.661547  4587 containerizer.cpp:230] Using isolation: 
posix/cpu,posix/mem,disk/du,filesystem/posix,network/cni,environment_secret
W0627 11:28:25.662441  4587 backend.cpp:76] Failed to create 'overlay' backend: 
OverlayBackend requires root privileges
W0627 11:28:25.662691  4587 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root privileges
I0627 11:28:25.662744  4587 provisioner.cpp:255] Using default backend 'copy'
I0627 11:28:25.672664  4587 cluster.cpp:448] Creating default 'local' authorizer
I0627 11:28:25.677569  4612 slave.cpp:249] Mesos agent started on 
(42)@172.17.0.4:34439
I0627 11:28:25.677613  4612 slave.cpp:250] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/DiskQuotaTest_SlaveRecovery_IXTvqp/store/appc" 
--authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
--authenticatee="crammd5" --authentication_backoff_factor="1secs" 
--authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" 
--cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" 

[jira] [Updated] (MESOS-3968) DiskQuotaTest.SlaveRecovery is flaky

2017-06-27 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-3968:
--
Labels: flaky-test mesosphere mesosphere-oncall  (was: flaky-test 
mesosphere)

> DiskQuotaTest.SlaveRecovery is flaky
> 
>
> Key: MESOS-3968
> URL: https://issues.apache.org/jira/browse/MESOS-3968
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>  Labels: flaky-test, mesosphere, mesosphere-oncall
>
> {noformat: title=Failed Run}
> [ RUN  ] DiskQuotaTest.SlaveRecovery
> I1120 12:02:54.015383 29806 leveldb.cpp:176] Opened db in 2.965411ms
> I1120 12:02:54.018033 29806 leveldb.cpp:183] Compacted db in 2.585354ms
> I1120 12:02:54.018175 29806 leveldb.cpp:198] Created db iterator in 27134ns
> I1120 12:02:54.018275 29806 leveldb.cpp:204] Seeked to beginning of db in 
> 3025ns
> I1120 12:02:54.018375 29806 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 679ns
> I1120 12:02:54.018491 29806 replica.cpp:780] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1120 12:02:54.021386 29838 recover.cpp:449] Starting replica recovery
> I1120 12:02:54.021692 29838 recover.cpp:475] Replica is in EMPTY status
> I1120 12:02:54.022189 29827 master.cpp:367] Master 
> 9a3c45ec-28b3-49e6-a83f-1f2035cc1105 (a51e6bb03b55) started on 
> 172.17.5.188:41228
> I1120 12:02:54.022212 29827 master.cpp:369] Flags at startup: --acls="" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/DsMniF/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" 
> --quiet="false" --recovery_slave_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_store_timeout="25secs" --registry_strict="true" 
> --root_submissions="true" --slave_ping_timeout="15secs" 
> --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" 
> --webui_dir="/mesos/mesos-0.26.0/_inst/share/mesos/webui" 
> --work_dir="/tmp/DsMniF/master" --zk_session_timeout="10secs"
> I1120 12:02:54.022557 29827 master.cpp:414] Master only allowing 
> authenticated frameworks to register
> I1120 12:02:54.022569 29827 master.cpp:419] Master only allowing 
> authenticated slaves to register
> I1120 12:02:54.022578 29827 credentials.hpp:37] Loading credentials for 
> authentication from '/tmp/DsMniF/credentials'
> I1120 12:02:54.022896 29827 master.cpp:458] Using default 'crammd5' 
> authenticator
> I1120 12:02:54.023217 29827 master.cpp:495] Authorization enabled
> I1120 12:02:54.023512 29831 whitelist_watcher.cpp:79] No whitelist given
> I1120 12:02:54.023814 29833 replica.cpp:676] Replica in EMPTY status received 
> a broadcasted recover request from (562)@172.17.5.188:41228
> I1120 12:02:54.023519 29832 hierarchical.cpp:153] Initialized hierarchical 
> allocator process
> I1120 12:02:54.025997 29831 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I1120 12:02:54.027042 29832 recover.cpp:566] Updating replica status to 
> STARTING
> I1120 12:02:54.027354 29830 master.cpp:1612] The newly elected leader is 
> master@172.17.5.188:41228 with id 9a3c45ec-28b3-49e6-a83f-1f2035cc1105
> I1120 12:02:54.027385 29830 master.cpp:1625] Elected as the leading master!
> I1120 12:02:54.027403 29830 master.cpp:1385] Recovering from registrar
> I1120 12:02:54.027679 29830 registrar.cpp:309] Recovering registrar
> I1120 12:02:54.028439 29840 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 1.195171ms
> I1120 12:02:54.028539 29840 replica.cpp:323] Persisted replica status to 
> STARTING
> I1120 12:02:54.028944 29840 recover.cpp:475] Replica is in STARTING status
> I1120 12:02:54.030910 29840 replica.cpp:676] Replica in STARTING status 
> received a broadcasted recover request from (563)@172.17.5.188:41228
> I1120 12:02:54.031429 29840 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I1120 12:02:54.032032 29840 recover.cpp:566] Updating replica status to VOTING
> I1120 12:02:54.032816 29840 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 496492ns
> I1120 12:02:54.032982 29840 replica.cpp:323] Persisted replica status to 
> VOTING
> I1120 12:02:54.033254 29840 recover.cpp:580] Successfully joined the Paxos 
> group
> I1120 12:02:54.033562 29840 recover.cpp:464] Recover process terminated
> I1120 12:02:54.034631 29839 log.cpp:661] Attempting to start the writer
> I1120 12:02:54.036386 29834 replica.cpp:496] Replica received implicit 
> promise request from (564)@172.17.5.188:41228 

[jira] [Created] (MESOS-7727) Scheme/HTTPTest.Get segfaults

2017-06-27 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-7727:
-

 Summary: Scheme/HTTPTest.Get segfaults
 Key: MESOS-7727
 URL: https://issues.apache.org/jira/browse/MESOS-7727
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
Assignee: Till Toenshoff


Observed this on ASF CI

{code}
[ RUN  ] Scheme/HTTPTest.Get/0
I0627 09:58:16.931704  2483 openssl.cpp:419] CA file path is unspecified! NOTE: 
Set CA file path with LIBPROCESS_SSL_CA_FILE=
I0627 09:58:16.931727  2483 openssl.cpp:424] CA directory path unspecified! 
NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
I0627 09:58:16.931732  2483 openssl.cpp:429] Will not verify peer certificate!
NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
I0627 09:58:16.931740  2483 openssl.cpp:435] Will only verify peer certificate 
if presented!
NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate verification
I0627 09:58:16.932193  3504 process.cpp:968] Failed to accept socket: future 
discarded
*** Aborted at 1498557496 (unix time) try "date -d @1498557496" if you are 
using GNU date ***
PC: @ 0x7f5397f30912 (unknown)
*** SIGSEGV (@0x7f5349e18068) received by PID 2483 (TID 0x7f53937cd700) from 
PID 1239515240; stack trace: ***
I0627 09:58:16.934547  2483 process.cpp:1282] libprocess is initialized on 
172.17.0.4:50357 with 16 worker threads
@ 0x7f53987ac370 (unknown)
@ 0x7f5397f30912 (unknown)
@ 0x7f5397f30f8c (unknown)
@   0x42b1a3 process::UPID::UPID()
@   0x8fcdec process::DispatchEvent::DispatchEvent()
I0627 09:58:16.940096  3518 process.cpp:3779] Handling HTTP event for process 
'(80)' with path: '/(80)/get'
@   0x8f5275 process::internal::dispatch()
@   0x910002 process::dispatch<>()
I0627 09:58:16.945485  3519 process.cpp:3779] Handling HTTP event for process 
'(80)' with path: '/(80)/get'
@   0x8f4184 process::ProcessBase::route()
[   OK ] Scheme/HTTPTest.Get/0 (463 ms)
[ RUN  ] Scheme/HTTPTest.Get/1
@   0x9e88b9 process::ProcessBase::route<>()
@   0x9e4bb2 process::Help::initialize()
@   0x8ed69a process::ProcessManager::resume()
@   0x8e9a98 _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
@   0x8fc38c 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@   0x8fc2d0 
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
@   0x8fc25a 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7f5397f27230 (unknown)
@ 0x7f53987a4dc5 start_thread
@ 0x7f539769076d __clone
make[7]: *** [check-local] Segmentation fault
{code}

[~tillt] can you triage this? Looks related to SSL.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7726) MasterTest.IgnoreOldAgentReregistration test is flaky

2017-06-27 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-7726:
-

 Summary: MasterTest.IgnoreOldAgentReregistration test is flaky
 Key: MESOS-7726
 URL: https://issues.apache.org/jira/browse/MESOS-7726
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
Assignee: Neil Conway


Observed this on ASF CI.

{code}
[ RUN  ] MasterTest.IgnoreOldAgentReregistration
I0627 05:23:06.031154  4917 cluster.cpp:162] Creating default 'local' authorizer
I0627 05:23:06.033433  4945 master.cpp:438] Master 
a8778782-0da1-49a5-9cb8-9f6d11701733 (c43debbe7e32) started on 172.17.0.4:41747
I0627 05:23:06.033457  4945 master.cpp:440] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/2BARnF/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-1.4.0/_inst/share/mesos/webui" 
--work_dir="/tmp/2BARnF/master" --zk_session_timeout="10secs"
I0627 05:23:06.033771  4945 master.cpp:490] Master only allowing authenticated 
frameworks to register
I0627 05:23:06.033787  4945 master.cpp:504] Master only allowing authenticated 
agents to register
I0627 05:23:06.033798  4945 master.cpp:517] Master only allowing authenticated 
HTTP frameworks to register
I0627 05:23:06.033812  4945 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/2BARnF/credentials'
I0627 05:23:06.034080  4945 master.cpp:562] Using default 'crammd5' 
authenticator
I0627 05:23:06.034221  4945 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0627 05:23:06.034409  4945 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0627 05:23:06.034569  4945 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0627 05:23:06.034688  4945 master.cpp:642] Authorization enabled
I0627 05:23:06.034862  4938 whitelist_watcher.cpp:77] No whitelist given
I0627 05:23:06.034868  4950 hierarchical.cpp:169] Initialized hierarchical 
allocator process
I0627 05:23:06.037211  4957 master.cpp:2161] Elected as the leading master!
I0627 05:23:06.037236  4957 master.cpp:1700] Recovering from registrar
I0627 05:23:06.037333  4938 registrar.cpp:345] Recovering registrar
I0627 05:23:06.038146  4938 registrar.cpp:389] Successfully fetched the 
registry (0B) in 768256ns
I0627 05:23:06.038290  4938 registrar.cpp:493] Applied 1 operations in 30798ns; 
attempting to update the registry
I0627 05:23:06.038861  4938 registrar.cpp:550] Successfully updated the 
registry in 510976ns
I0627 05:23:06.038960  4938 registrar.cpp:422] Successfully recovered registrar
I0627 05:23:06.039364  4941 hierarchical.cpp:207] Skipping recovery of 
hierarchical allocator: nothing to recover
I0627 05:23:06.039594  4958 master.cpp:1799] Recovered 0 agents from the 
registry (129B); allowing 10mins for agents to re-register
I0627 05:23:06.043999  4917 containerizer.cpp:230] Using isolation: 
posix/cpu,posix/mem,filesystem/posix,network/cni,environment_secret
W0627 05:23:06.044456  4917 backend.cpp:76] Failed to create 'aufs' backend: 
AufsBackend requires root privileges
W0627 05:23:06.044548  4917 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root privileges
I0627 05:23:06.044580  4917 provisioner.cpp:255] Using default backend 'copy'
I0627 05:23:06.046222  4917 cluster.cpp:448] Creating default 'local' authorizer
I0627 05:23:06.047572  4950 slave.cpp:249] Mesos agent started on 
(269)@172.17.0.4:41747
I0627 05:23:06.047591  4950 slave.cpp:250] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/MasterTest_IgnoreOldAgentReregistration_Bgz7OK/store/appc"
 --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
--authenticatee="crammd5" --authentication_backoff_factor="1secs" 
--authorizer="local" 

[jira] [Created] (MESOS-7725) PersistentVolumeEndpointsTest.ReserveAndSlaveRemoval test is flaky

2017-06-27 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-7725:
-

 Summary: PersistentVolumeEndpointsTest.ReserveAndSlaveRemoval test 
is flaky
 Key: MESOS-7725
 URL: https://issues.apache.org/jira/browse/MESOS-7725
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
Assignee: Neil Conway


Observed this on ASF CI.

Will paste the log once I find a failing build whose logs are not rotated out.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7724) MasterAPITest.Subscribe test segfaults

2017-06-27 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-7724:
-

 Summary: MasterAPITest.Subscribe test segfaults
 Key: MESOS-7724
 URL: https://issues.apache.org/jira/browse/MESOS-7724
 Project: Mesos
  Issue Type: Bug
  Components: HTTP API
Reporter: Vinod Kone


Found this on ASF CI

{code}
[ RUN  ] ContentType/MasterAPITest.Subscribe/1
I0625 05:38:37.009217 30646 cluster.cpp:162] Creating default 'local' authorizer
I0625 05:38:37.014230 30650 master.cpp:438] Master 
7395bba4-e83c-4a4a-9010-d2e89629edeb (7bd48084f726) started on 172.17.0.2:41689
I0625 05:38:37.014291 30650 master.cpp:440] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/D1sbIK/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/D1sbIK/master" 
--zk_session_timeout="10secs"
I0625 05:38:37.014972 30650 master.cpp:490] Master only allowing authenticated 
frameworks to register
I0625 05:38:37.014992 30650 master.cpp:504] Master only allowing authenticated 
agents to register
I0625 05:38:37.015008 30650 master.cpp:517] Master only allowing authenticated 
HTTP frameworks to register
I0625 05:38:37.015019 30650 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/D1sbIK/credentials'
I0625 05:38:37.015575 30650 master.cpp:562] Using default 'crammd5' 
authenticator
I0625 05:38:37.016842 30650 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0625 05:38:37.017230 30650 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0625 05:38:37.017542 30650 http.cpp:974] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0625 05:38:37.017822 30650 master.cpp:642] Authorization enabled
I0625 05:38:37.018196 30651 hierarchical.cpp:169] Initialized hierarchical 
allocator process
I0625 05:38:37.018357 30653 whitelist_watcher.cpp:77] No whitelist given
*** Aborted at 1498369117 (unix time) try "date -d @1498369117" if you are 
using GNU date ***
PC: @ 0x2b0f5603bc41 std::_Hashtable<>::_M_bucket_begin()
*** SIGSEGV (@0x5d0) received by PID 30646 (TID 0x2b0f641b6700) from PID 1488; 
stack trace: ***
@ 0x2b0f5bfcc330 (unknown)
@ 0x2b0f5603bc41 std::_Hashtable<>::_M_bucket_begin()
@ 0x2b0f5603bb2d std::_Hashtable<>::count()
@ 0x2b0f5603bacd std::unordered_map<>::count()
@ 0x2b0f56033b68 hashmap<>::contains()
@ 0x2b0f56008d5e mesos::internal::master::Master::exited()
@ 0x2b0f5600aeaa 
mesos::internal::master::Master::subscribe()::$_36::operator()()
@ 0x2b0f5600ae71 
_ZZZNK7process9_DeferredIZN5mesos8internal6master6Master9subscribeERKNS3_14HttpConnectionEE4$_36EcvSt8functionIFvT_EEIRKNS_6FutureI7NothingvENKUlSJ_E_clESJ_ENKUlvE_clEv
@ 0x2b0f5600abdd 
_ZNSt17_Function_handlerIFvvEZZNK7process9_DeferredIZN5mesos8internal6master6Master9subscribeERKNS5_14HttpConnectionEE4$_36EcvSt8functionIFvT_EEIRKNS1_6FutureI7NothingvENKUlSL_E_clESL_EUlvE_E9_M_invokeERKSt9_Any_data
@  0x1af707e std::function<>::operator()()
@  0x1eabc09 
_ZZN7process8internal8DispatchIvEclIRSt8functionIFvvvRKNS_4UPIDEOT_ENKUlPNS_11ProcessBaseEE_clESE_
@  0x1eab9c2 
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchIvEclIRSt8functionIFvvvRKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
@ 0x2b0f582e7608 std::function<>::operator()()
@ 0x2b0f582cdab4 process::ProcessBase::visit()
@ 0x2b0f5835da8e process::DispatchEvent::visit()
@  0x1af66b1 process::ProcessBase::serve()
@ 0x2b0f582cb7a4 process::ProcessManager::resume()
@ 0x2b0f582d982c 
process::ProcessManager::init_threads()::$_2::operator()()
@ 0x2b0f582d9735 

[jira] [Created] (MESOS-7723) Support lxcfs for serving special proc files for containers.

2017-06-27 Thread Jie Yu (JIRA)
Jie Yu created MESOS-7723:
-

 Summary: Support lxcfs for serving special proc files for 
containers.
 Key: MESOS-7723
 URL: https://issues.apache.org/jira/browse/MESOS-7723
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Jie Yu


LXCFS is a small FUSE filesystem written with the intention of making Linux 
containers feel more like a virtual machine. It started as a side project of 
LXC but is usable by any runtime.
https://github.com/lxc/lxcfs

Some legacy applications read /proc/cpuinfo or /proc/meminfo to discover the 
available cpus and memory. Without lxcfs, such an application sees the host's 
proc files and assumes it has all the cores and memory of the host, rather 
than its container's limits.

We can potentially build an isolator for this.
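A minimal sketch of what such an isolator's prepare step could do, assuming 
lxcfs is already running and exporting virtualized proc files under 
{{/var/lib/lxcfs}} (the mount point, file list, and standalone {{main}} are 
illustrative assumptions, not the actual Mesos isolator API):

{code}
// Hypothetical sketch: bind-mount lxcfs-provided proc files over the
// container's /proc entries, inside the container's mount namespace.
// Assumes lxcfs serves its virtualized files under /var/lib/lxcfs.
#include <sys/mount.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

int main()
{
  const std::string lxcfs = "/var/lib/lxcfs";

  // The proc files lxcfs virtualizes based on the container's cgroup limits.
  const std::vector<std::string> files = {
      "proc/cpuinfo", "proc/meminfo", "proc/stat",
      "proc/uptime", "proc/diskstats", "proc/swaps"};

  for (const std::string& file : files) {
    const std::string source = lxcfs + "/" + file;
    const std::string target = "/" + file;

    // After this bind mount, reads of e.g. /proc/meminfo return the
    // cgroup-limited values instead of the host's totals.
    if (::mount(source.c_str(), target.c_str(), nullptr, MS_BIND, nullptr) != 0) {
      std::cerr << "Failed to bind-mount '" << source << "' at '"
                << target << "': " << strerror(errno) << std::endl;
      return 1;
    }
  }

  return 0;
}
{code}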



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6345) ExamplesTest.PersistentVolumeFramework failing due to double free corruption on Ubuntu 14.04

2017-06-27 Thread Dmitry Zhuk (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064894#comment-16064894
 ] 

Dmitry Zhuk commented on MESOS-6345:


https://reviews.apache.org/r/60467/

> ExamplesTest.PersistentVolumeFramework failing due to double free corruption 
> on Ubuntu 14.04
> 
>
> Key: MESOS-6345
> URL: https://issues.apache.org/jira/browse/MESOS-6345
> Project: Mesos
>  Issue Type: Bug
>  Components: framework
>Reporter: Avinash Sridharan
>  Labels: mesosphere
>
> The PersistentVolumeFramework test is failing on Ubuntu 14.04
> {code}
> [Step 10/10] *** Error in 
> `/mnt/teamcity/work/4240ba9ddd0997c3/build/src/.libs/lt-persistent-volume-framework':
>  double free or corruption (fasttop): 0x7f1ae0006a20 ***
> [04:56:48]W:   [Step 10/10] *** Aborted at 1475902608 (unix time) try "date 
> -d @1475902608" if you are using GNU date ***
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.592744 25425 state.cpp:57] 
> Recovering state from '/mnt/teamcity/temp/buildTmp/mesos-8KiPML/2/meta'
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.592808 25423 state.cpp:57] 
> Recovering state from '/mnt/teamcity/temp/buildTmp/mesos-8KiPML/1/meta'
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.592952 25425 
> status_update_manager.cpp:203] Recovering status update manager
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.592957 25423 
> status_update_manager.cpp:203] Recovering status update manager
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593010 25424 
> containerizer.cpp:557] Recovering containerizer
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593143 25396 sched.cpp:226] 
> Version: 1.1.0
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593158 25425 master.cpp:2013] 
> Elected as the leading master!
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593173 25425 master.cpp:1560] 
> Recovering from registrar
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593211 25424 registrar.cpp:329] 
> Recovering registrar
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593250 25425 sched.cpp:330] New 
> master detected at master@172.30.2.21:45167
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593282 25425 sched.cpp:341] No 
> credentials provided. Attempting to register without authentication
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593293 25425 sched.cpp:820] 
> Sending SUBSCRIBE call to master@172.30.2.21:45167
> [04:56:48]W:   [Step 10/10] PC: @ 0x7f1b0bbaccc9 (unknown)
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593339 25425 sched.cpp:853] Will 
> retry registration in 32.354951ms if necessary
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593364 25421 master.cpp:1387] 
> Dropping 'mesos.scheduler.Call' message since not recovered yet
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593413 25428 provisioner.cpp:253] 
> Provisioner recovery complete
> [04:56:48]W:   [Step 10/10] *** SIGABRT (@0x6334) received by PID 25396 (TID 
> 0x7f1b02ed6700) from PID 25396; stack trace: ***
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593520 25421 
> containerizer.cpp:557] Recovering containerizer
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593529 25425 slave.cpp:5276] 
> Finished recovery
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593627 25422 leveldb.cpp:304] 
> Persisting metadata (8 bytes) to leveldb took 4.546422ms
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593695 25428 provisioner.cpp:253] 
> Provisioner recovery complete
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593701 25422 replica.cpp:320] 
> Persisted replica status to VOTING
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593760 25424 slave.cpp:5276] 
> Finished recovery
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593864 25427 recover.cpp:582] 
> Successfully joined the Paxos group
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593896 25425 slave.cpp:5448] 
> Querying resource estimator for oversubscribable resources
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593922 25427 recover.cpp:466] 
> Recover process terminated
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593976 25427 slave.cpp:5462] 
> Received oversubscribable resources {} from the resource estimator
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.594002 25424 slave.cpp:5448] 
> Querying resource estimator for oversubscribable resources
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.594017 25422 log.cpp:553] 
> Attempting to start the writer
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.594030 25428 
> status_update_manager.cpp:177] Pausing sending status updates
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.594032 25427 slave.cpp:915] New 
> master detected at master@172.30.2.21:45167
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.594055 25423 slave.cpp:915] New 
> master detected at master@172.30.2.21:45167
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.594048 25428 
> 

[jira] [Commented] (MESOS-6345) ExamplesTest.PersistentVolumeFramework failing due to double free corruption on Ubuntu 14.04

2017-06-27 Thread Dmitry Zhuk (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064891#comment-16064891
 ] 

Dmitry Zhuk commented on MESOS-6345:


A similar crash occurs on CentOS 7 (in ExamplesTest.PersistentVolumeFramework 
and ExamplesTest.DynamicReservationFramework), presumably due to a race 
condition on {{signaledWrapper}} in {{configureSignal}}.
{noformat}
[ RUN  ] ExamplesTest.DynamicReservationFramework
*** Error in `mesos/build/src/.libs/lt-dynamic-reservation-framework': double 
free or corruption (fasttop): 0x7fdfa0002e60 ***
=== Backtrace: =
/lib64/libc.so.6(+0x7c503)[0x7fdfc6da7503]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZNSt14_Function_base13_Base_managerIZN7process5deferIN5mesos8internal5slave5SlaveEiiSt12_PlaceholderILi1EES7_ILi2NS1_9_DeferredIDTcl4bindadsrSt8functionIFvT0_T1_EEclcvSF__Efp1_fp2_RKNS1_3PIDIT_EEMSJ_FvSC_SD_ET2_T3_EUliiE_E10_M_destroyERSt9_Any_dataSt17integral_constantIbLb0EE+0x31)[0x7fdfcca9165c]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZNSt14_Function_base13_Base_managerIZN7process5deferIN5mesos8internal5slave5SlaveEiiSt12_PlaceholderILi1EES7_ILi2NS1_9_DeferredIDTcl4bindadsrSt8functionIFvT0_T1_EEclcvSF__Efp1_fp2_RKNS1_3PIDIT_EEMSJ_FvSC_SD_ET2_T3_EUliiE_E10_M_managerERSt9_Any_dataRKST_St18_Manager_operation+0xa2)[0x7fdfcca79857]
mesos/build/src/.libs/lt-dynamic-reservation-framework(_ZNSt14_Function_baseD1Ev+0x33)[0x560e50f40ae7]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZNSt8functionIFviiEED1Ev+0x18)[0x7fdfcca2ec98]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZNSt10_Head_baseILm0ESt8functionIFviiEELb0EED1Ev+0x18)[0x7fdfcca300ce]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZNSt11_Tuple_implILm0EISt8functionIFviiEESt12_PlaceholderILi1EES3_ILi2D1Ev+0x18)[0x7fdfcca300e8]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZNSt5tupleIISt8functionIFviiEESt12_PlaceholderILi1EES3_ILi2D1Ev+0x18)[0x7fdfcca30102]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZNSt5_BindIFSt7_Mem_fnIMSt8functionIFviiEEKFviiEES3_St12_PlaceholderILi1EES7_ILi2D1Ev+0x1c)[0x7fdfcca30120]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZNSt14_Function_base13_Base_managerISt5_BindIFSt7_Mem_fnIMSt8functionIFviiEEKFviiEES5_St12_PlaceholderILi1EES9_ILi2E10_M_destroyERSt9_Any_dataSt17integral_constantIbLb0EE+0x29)[0x7fdfcca91873]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZNSt14_Function_base13_Base_managerISt5_BindIFSt7_Mem_fnIMSt8functionIFviiEEKFviiEES5_St12_PlaceholderILi1EES9_ILi2E10_M_managerERSt9_Any_dataRKSF_St18_Manager_operation+0xa2)[0x7fdfcca79ba3]
mesos/build/src/.libs/lt-dynamic-reservation-framework(_ZNSt14_Function_baseD1Ev+0x33)[0x560e50f40ae7]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZNSt8functionIFviiEED1Ev+0x18)[0x7fdfcca2ec98]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZN2os8internal15configureSignalEPKSt8functionIFviiEE+0x4a)[0x7fdfcc9db47d]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZN5mesos8internal5slave5Slave10initializeEv+0x3d5e)[0x7fdfcc9e0a78]
mesos/build/src/.libs/libmesos-1.4.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x284)[0x7fdfcd93fedc]
mesos/build/src/.libs/libmesos-1.4.0.so(+0x61152da)[0x7fdfcd93c2da]
mesos/build/src/.libs/libmesos-1.4.0.so(+0x6127bce)[0x7fdfcd94ebce]
mesos/build/src/.libs/libmesos-1.4.0.so(+0x6127b12)[0x7fdfcd94eb12]
mesos/build/src/.libs/libmesos-1.4.0.so(+0x6127a9c)[0x7fdfcd94ea9c]
/lib64/libstdc++.so.6(+0xb5230)[0x7fdfc73b7230]
/lib64/libpthread.so.0(+0x7dc5)[0x7fdfc7612dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fdfc6e2276d]
{noformat}
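For reference, a hypothetical reduction of the suspected race, assuming an 
unguarded shared {{std::function}} (standing in for {{signaledWrapper}}) that 
two threads reassign concurrently; names and structure here are illustrative, 
not the actual Mesos code:

{code}
#include <functional>
#include <thread>

// Shared, unsynchronized handler, standing in for signaledWrapper.
std::function<void(int, int)> signaledWrapper;

void configure(int id)
{
  // Assigning a new target destroys the old one. If two threads race
  // through this assignment, the old target's destructor can run twice,
  // producing "double free or corruption (fasttop)".
  signaledWrapper = [id](int signal, int uid) { /* handle signal */ };
}

int main()
{
  // The example tests run several Slave instances in one binary; each
  // initialization path reconfigures signal handling concurrently.
  std::thread t1(configure, 1);
  std::thread t2(configure, 2);
  t1.join();
  t2.join();
  return 0;
}
{code}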

> ExamplesTest.PersistentVolumeFramework failing due to double free corruption 
> on Ubuntu 14.04
> 
>
> Key: MESOS-6345
> URL: https://issues.apache.org/jira/browse/MESOS-6345
> Project: Mesos
>  Issue Type: Bug
>  Components: framework
>Reporter: Avinash Sridharan
>  Labels: mesosphere
>
> The PersistentVolumeFramework test is failing on Ubuntu 14.04
> {code}
> [Step 10/10] *** Error in 
> `/mnt/teamcity/work/4240ba9ddd0997c3/build/src/.libs/lt-persistent-volume-framework':
>  double free or corruption (fasttop): 0x7f1ae0006a20 ***
> [04:56:48]W:   [Step 10/10] *** Aborted at 1475902608 (unix time) try "date 
> -d @1475902608" if you are using GNU date ***
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.592744 25425 state.cpp:57] 
> Recovering state from '/mnt/teamcity/temp/buildTmp/mesos-8KiPML/2/meta'
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.592808 25423 state.cpp:57] 
> Recovering state from '/mnt/teamcity/temp/buildTmp/mesos-8KiPML/1/meta'
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.592952 25425 
> status_update_manager.cpp:203] Recovering status update manager
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.592957 25423 
> status_update_manager.cpp:203] Recovering status update manager
> [04:56:48]W:   [Step 10/10] I1008 04:56:48.593010 25424 

[jira] [Updated] (MESOS-7160) Parsing of perf version segfaults

2017-06-27 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7160:
-
Sprint: Mesosphere Sprint 58

> Parsing of perf version segfaults
> -
>
> Key: MESOS-7160
> URL: https://issues.apache.org/jira/browse/MESOS-7160
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>Assignee: Andrei Budnik
>
> Parsing the perf version [fails with a segfault in ASF 
> CI|https://builds.apache.org/job/Mesos-Buildbot/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/3294/],
> {noformat}
> E0222 20:54:03.033464   805 perf.cpp:237] Failed to get perf version: Failed 
> to execute perf: terminated with signal Aborted (core dumped)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7709) Add --dns flag to the agent.

2017-06-27 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061870#comment-16061870
 ] 

Qian Zhang edited comment on MESOS-7709 at 6/27/17 6:52 AM:


{quote}
The problem is exacerbated even further when you have a mix of v4 and v6 
containers: if you rely only on `/etc/resolv.conf` to provide the defaults, 
you will have to pick some of the 3 possible nameservers for v4 and some for 
v6, again making it inflexible.
{quote}
Do you mean the case where there are some v4 containers and some v6 
containers on the same agent host? And if we introduce a {{--dns}} agent flag, 
how will the issue you mentioned be resolved? Thanks.

Update:
Had a sync-up with Avinash in Slack. The idea is that in a Mesos cluster with 
both IPv4 and IPv6 containers, without the {{\--dns}} agent flag either the 
frameworks have to explicitly set an IPv6 DNS entry for v6 containers using 
the {{\--dns}} parameter to {{docker run}}, or we need an IPv6 entry for 
{{nameservers}} in our {{/etc/resolv.conf}}. With the introduction of the 
{{\--dns}} flag this problem goes away, since for IPv6 networks the operator 
can just set a nameserver (or multiple, if necessary) for a given network, and 
we can pass these values to the docker daemon when launching the docker 
container on that IPv6 network.


was (Author: qianzhang):
{quote}
The problem is exacerbated even further when you have a mix of v4 and v6 
containers: if you rely only on `/etc/resolv.conf` to provide the defaults, 
you will have to pick some of the 3 possible nameservers for v4 and some for 
v6, again making it inflexible.
{quote}
Do you mean the case where there are some v4 containers and some v6 
containers on the same agent host? And if we introduce a {{--dns}} agent flag, 
how will the issue you mentioned be resolved? Thanks.

> Add --dns flag to the agent.
> 
>
> Key: MESOS-7709
> URL: https://issues.apache.org/jira/browse/MESOS-7709
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>
> Mesos supports both the CNI (through the `network/cni` isolator) and CNM 
> (through docker) specifications. Both specifications allow DNS entries for 
> containers to be set on a per-container and per-network basis.
> Currently, the agent uses the DNS nameservers set in /etc/resolv.conf when 
> the CNI or CNM plugin used to attach the container to the CNI/CNM network 
> doesn't explicitly set the DNS for the container. This is a bit inflexible, 
> especially when we have a mix of v4 and v6 networks.
> The operator should be able to specify DNS nameservers for the networks he 
> installs, either to override the ones provided by the plugin or as defaults 
> when the plugins do not specify DNS nameservers.
> In order to achieve the above goal we need to introduce a `\--dns` flag on 
> the agent. The `\--dns` flag should accept JSON (or a JSON file) with the 
> following schema:
> {code}
> {
>   "mesos": [
>     {
>       "network": "<network-name>",
>       "nameservers": ["<nameserver>", "<nameserver>"]
>     }
>   ],
>   "docker": [
>     {
>       "network": "<network-name>",
>       "nameservers": ["<nameserver>", "<nameserver>"]
>     }
>   ]
> }
> {code}
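As a concrete (hypothetical) instance of the schema quoted above -- the 
network names and nameserver addresses below are made up for illustration:

{code}
{
  "mesos": [
    {
      "network": "cni-ipv6",
      "nameservers": ["2001:4860:4860::8888"]
    }
  ],
  "docker": [
    {
      "network": "user-bridge",
      "nameservers": ["8.8.8.8", "8.8.4.4"]
    }
  ]
}
{code}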



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)