[jira] [Commented] (MESOS-2484) libprocess Clock messages delivered
[ https://issues.apache.org/jira/browse/MESOS-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359108#comment-14359108 ] Dominic Hamon commented on MESOS-2484: -- see also https://issues.apache.org/jira/browse/MESOS-1456 regarding metrics. there's been some discussion about enforcing unique PID ids for every process but I can't find the relevant JIRA ticket. libprocess Clock messages delivered Key: MESOS-2484 URL: https://issues.apache.org/jira/browse/MESOS-2484 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Cody Maloney Found in / discussed at: https://reviews.apache.org/r/30587/#rc118737-72676 When a process is terminated, any outstanding delay() calls destined for that process aren't cancelled, meaning they arrive whenever the clock happens to get there. With uniquely named processes this isn't an issue, but with names that are reused (master), it could potentially lead to odd test flakiness, with artifacts carrying across tests that shouldn't be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
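A minimal sketch of the missing behaviour described above: pending delayed dispatches keyed by PID name, erased when the target process terminates. The class and method names are hypothetical, not libprocess's real API.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical timer bookkeeping: delay() records a pending dispatch
// under the target PID name; terminate() cancels everything still
// outstanding for that name.
class TimerQueue {
 public:
  void delay(const std::string& pid, std::function<void()> fn) {
    pending_.emplace(pid, std::move(fn));
  }

  // Without this erase, a new process reusing the PID name (e.g. the
  // "master" started by the next test) would receive stale timers
  // scheduled for its predecessor.
  void terminate(const std::string& pid) { pending_.erase(pid); }

  std::size_t outstanding(const std::string& pid) const {
    return pending_.count(pid);
  }

 private:
  std::multimap<std::string, std::function<void()>> pending_;
};
```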
[jira] [Commented] (MESOS-2216) The configure phase breaks with the IBM JVM.
[ https://issues.apache.org/jira/browse/MESOS-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359101#comment-14359101 ] Dominic Hamon commented on MESOS-2216: -- see also https://issues.apache.org/jira/browse/HADOOP-9435 so a change in include path or explicit linking against libdl should help. The configure phase breaks with the IBM JVM. -- Key: MESOS-2216 URL: https://issues.apache.org/jira/browse/MESOS-2216 Project: Mesos Issue Type: Bug Affects Versions: 1.0.0, 0.20.1 Environment: Ubuntu / x86_64 Reporter: Tony Reix Priority: Blocker ./configure does not work with the IBM JVM, since it looks for a directory: /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server (x86_64) or /usr/lib/jvm/ibm-java-ppc64le-71/jre/lib/ppc64le/server (PPC64LE) that does not exist for the IBM JVM. This directory does exist for the Oracle JVM and OpenJDK: /usr/lib/jvm/jdk1.7.0_71/jre/lib/amd64/server (Oracle JVM) and /usr/lib/jvm/java-1.7.0-openjdk-amd64/jre/lib/amd64/server (OpenJDK). However, the files libjsig.so and libjvm.so (3 versions) do exist for the IBM JVM. Anyway, creating the server directory and copying the files (tried with all 3 versions of libjvm.so) does not fix the issue: checking whether or not we can build with JNI... /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined reference to `dlopen' /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined reference to `dlclose' /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined reference to `dlerror' /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined reference to `dlsym' /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined reference to `dladdr' Something (dlopen, dlclose, dlerror, dlsym, dladdr) is missing when linking against the IBM JVM's libjvm.so. So, either the configure step relies on a feature that is not in the Java standard but only in the Oracle JVM and OpenJDK, or the IBM JVM lacks part of the Java standard.
I'm not an expert on this, so I'd like the Mesos people to experiment with the IBM JVM. Maybe there is another solution for this step of the Mesos configure that would work with all 3 JVMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-94) Master and Slave HTTP handlers should have unit tests
[ https://issues.apache.org/jira/browse/MESOS-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-94: --- Labels: http json test twitter (was: http json test) Master and Slave HTTP handlers should have unit tests - Key: MESOS-94 URL: https://issues.apache.org/jira/browse/MESOS-94 Project: Mesos Issue Type: Improvement Components: json api, master, slave, test Reporter: Charles Reiss Labels: http, json, test, twitter The Master and Slave have HTTP handlers which serve their state (mainly for the webui to use). There should be unit tests of these. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2293) Implement the Call endpoint on master
[ https://issues.apache.org/jira/browse/MESOS-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2293: - Labels: twitter (was: ) Implement the Call endpoint on master - Key: MESOS-2293 URL: https://issues.apache.org/jira/browse/MESOS-2293 Project: Mesos Issue Type: Task Reporter: Vinod Kone Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1988) Scheduler driver should not generate TASK_LOST when disconnected from master
[ https://issues.apache.org/jira/browse/MESOS-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1988: - Labels: twitter (was: ) Scheduler driver should not generate TASK_LOST when disconnected from master Key: MESOS-1988 URL: https://issues.apache.org/jira/browse/MESOS-1988 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Labels: twitter Currently, the driver replies to launchTasks() with TASK_LOST if it detects that it is disconnected from the master. After MESOS-1972 lands, this will be the only place where driver generates TASK_LOST. See MESOS-1972 for more context. This fix is targeted for 0.22.0 to give frameworks time to implement reconciliation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.
[ https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353182#comment-14353182 ] Dominic Hamon commented on MESOS-2467: -- instead of relying on the first character (valid JSON can also start with '{'), perhaps we can instead: - try JSON parsing, catch failure - fall back to the old parsing This also means we can deprecate the old parsing behaviour more easily. Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a custom format for the --resources flag. As we introduce more and more features (e.g., persistence, reservation) in the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character. If it is '[', then we invoke the JSON parser. Otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
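The try-then-fallback flow suggested in the comment can be sketched as follows. All names here are hypothetical, and the stub parseJson merely stands in for a real JSON parser (e.g. stout's); only the control flow is the point.

```cpp
#include <map>
#include <optional>
#include <sstream>
#include <string>

using Resources = std::map<std::string, double>;

// Stand-in for a real JSON parser: for this sketch it "succeeds" only
// on bracketed input and returns an empty set; a real parser would
// extract the entries.
std::optional<Resources> parseJson(const std::string& s) {
  if (s.empty() || s.front() != '[' || s.back() != ']') return std::nullopt;
  return Resources{};
}

// The pre-existing "name:value;name:value" text format.
std::optional<Resources> parseText(const std::string& s) {
  Resources out;
  std::istringstream in(s);
  std::string item;
  while (std::getline(in, item, ';')) {
    auto colon = item.find(':');
    if (colon == std::string::npos) return std::nullopt;
    out[item.substr(0, colon)] = std::stod(item.substr(colon + 1));
  }
  return out;
}

// Try JSON first; on failure fall back to the legacy format. No
// first-character sniffing, so '{'-rooted JSON works too.
std::optional<Resources> parseResources(const std::string& s) {
  if (auto r = parseJson(s)) return r;
  return parseText(s);
}
```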
[jira] [Updated] (MESOS-2277) Document undocumented HTTP endpoints
[ https://issues.apache.org/jira/browse/MESOS-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2277: - Labels: documentation newbie starter twitter (was: documentation newbie starter) Document undocumented HTTP endpoints Key: MESOS-2277 URL: https://issues.apache.org/jira/browse/MESOS-2277 Project: Mesos Issue Type: Improvement Reporter: Niklas Quarfot Nielsen Priority: Minor Labels: documentation, newbie, starter, twitter Did a quick scan and we are missing documentation for a few endpoints: {code} files/browse.json files/read.json files/download.json files/debug.json master/roles.json master/state.json master/stats.json slave/state.json slave/stats.json {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2294) Implement the Events endpoint on master
[ https://issues.apache.org/jira/browse/MESOS-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2294: - Labels: twitter (was: ) Implement the Events endpoint on master --- Key: MESOS-2294 URL: https://issues.apache.org/jira/browse/MESOS-2294 Project: Mesos Issue Type: Task Reporter: Vinod Kone Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1127) Implement the protobufs for the scheduler API
[ https://issues.apache.org/jira/browse/MESOS-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-1127: Assignee: Vinod Kone (was: Benjamin Hindman) Implement the protobufs for the scheduler API - Key: MESOS-1127 URL: https://issues.apache.org/jira/browse/MESOS-1127 Project: Mesos Issue Type: Task Components: framework Reporter: Benjamin Hindman Assignee: Vinod Kone Labels: twitter The default scheduler/executor interface and implementation in Mesos have a few drawbacks: (1) The interface is fairly high-level which makes it hard to do certain things, for example, handle events (callbacks) in batch. This can have a big impact on the performance of schedulers (for example, writing task updates that need to be persisted). (2) The implementation requires writing a lot of boilerplate JNI and native Python wrappers when adding additional API components. The plan is to provide a lower-level API that can easily be used to implement the higher-level API that is currently provided. This will also open the door to more easily building native-language Mesos libraries (i.e., not needing the C++ shim layer) and building new higher-level abstractions on top of the lower-level API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1023) Replace all static/global variables with non-POD type
[ https://issues.apache.org/jira/browse/MESOS-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353464#comment-14353464 ] Dominic Hamon commented on MESOS-1023: -- commit f780f67717fe0aa25b6870baedd55c43a7017edb (HEAD, origin/master, origin/HEAD, master) Author: Dominic Hamon d...@twitter.com Commit: Dominic Hamon d...@twitter.com Remove static strings from process and split out some source. Review: https://reviews.apache.org/r/30841 Replace all static/global variables with non-POD type - Key: MESOS-1023 URL: https://issues.apache.org/jira/browse/MESOS-1023 Project: Mesos Issue Type: Bug Components: general, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: c++ See http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Static_and_Global_Variables for the background. Real bugs have been seen. For example, in process::ID::generate we have a {{map<string, int>}} that can be accessed within the function after exit has been called. I.e., we can try to access the map after it's been destroyed, but before exit has completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
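One common remedy, per the Google style guidance linked above, is a construct-on-first-use function-local static pointer that is intentionally never destroyed. This is a sketch of the pattern, not the actual change made in the commit; generateId and its "prefix(N)" output are illustrative.

```cpp
#include <map>
#include <string>

// Construct on first use; deliberately leaked (never destroyed), so a
// lookup during process shutdown cannot touch a destructed map. The
// "leak" is a single bounded allocation for the program's lifetime.
static std::map<std::string, int>& prefixCounts() {
  static auto* counts = new std::map<std::string, int>();
  return *counts;
}

// Hypothetical analogue of process::ID::generate: returns "prefix(N)"
// with a per-prefix counter.
std::string generateId(const std::string& prefix) {
  int n = ++prefixCounts()[prefix];
  return prefix + "(" + std::to_string(n) + ")";
}
```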
[jira] [Commented] (MESOS-2457) Update post-reviews to rbtools in 'submit your patch' of developer's guide
[ https://issues.apache.org/jira/browse/MESOS-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350579#comment-14350579 ] Dominic Hamon commented on MESOS-2457: -- Install RBTools, yes. But we still want people to run support/post-reviews.py as it wraps rbt and avoids users having to set parent branches and manage diff chains. Update post-reviews to rbtools in 'submit your patch' of developer's guide --- Key: MESOS-2457 URL: https://issues.apache.org/jira/browse/MESOS-2457 Project: Mesos Issue Type: Bug Components: documentation, project website Reporter: Nancy Ko Priority: Minor Labels: documentation, newbie In the developer's guide (http://mesos.apache.org/documentation/latest/mesos-developers-guide/) post-reviews should be changed to Review Board tools. Specifically: List item 3: "First, install post-review. See Instructions" The "See Instructions" link should redirect to: https://www.reviewboard.org/docs/rbtools/dev/ instead of: https://www.reviewboard.org/docs/manual/dev/users/tools/post-review/ AND List item 5: "From your local branch run support/post-reviews.py." The run command should be changed to: rbt post -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation
[ https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345863#comment-14345863 ] Dominic Hamon commented on MESOS-2422: -- https://reviews.apache.org/r/31502/ https://reviews.apache.org/r/31503/ https://reviews.apache.org/r/31504/ https://reviews.apache.org/r/31505/ Use fq_codel qdisc for egress network traffic isolation --- Key: MESOS-2422 URL: https://issues.apache.org/jira/browse/MESOS-2422 Project: Mesos Issue Type: Task Reporter: Cong Wang Assignee: Cong Wang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2136) Expose per-cgroup memory pressure
[ https://issues.apache.org/jira/browse/MESOS-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2136: - Sprint: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3) Expose per-cgroup memory pressure - Key: MESOS-2136 URL: https://issues.apache.org/jira/browse/MESOS-2136 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Ian Downes Assignee: Chi Zhang Labels: twitter The cgroup memory controller can provide information on the memory pressure of a cgroup. This is in the form of an event based notification where events of (low, medium, critical) are generated when the kernel makes specific actions to allocate memory. This signal is probably more informative than comparing memory usage to memory limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2103) Expose number of processes and threads in a container
[ https://issues.apache.org/jira/browse/MESOS-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2103: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3) Expose number of processes and threads in a container - Key: MESOS-2103 URL: https://issues.apache.org/jira/browse/MESOS-2103 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Chi Zhang Labels: twitter The CFS cpu statistics (cpus_nr_throttled, cpus_nr_periods, cpus_throttled_time) are difficult to interpret. 1) nr_throttled is the number of intervals where *any* throttling occurred 2) throttled_time is the aggregate time *across all runnable tasks* (tasks in the Linux sense). For example, in a typical 60 second sampling interval: nr_periods = 600, nr_throttled could be 60, i.e., 10% of intervals, but throttled_time could be much higher than (60/600) * 60 = 6 seconds if there is more than one task that is runnable but throttled. *Each* throttled task contributes to the total throttled time. Small test to demonstrate throttled_time > nr_periods * quota_interval: 5 x {{'openssl speed'}} running with quota=100ms:
{noformat}
cat cpu.stat
sleep 1
cat cpu.stat
nr_periods 3228
nr_throttled 1276
throttled_time 528843772540
nr_periods 3238
nr_throttled 1286
throttled_time 531668964667
{noformat}
All 10 intervals throttled (100%) for a total time of 2.8 seconds within 1 second (more than 100% of the time interval). It would be helpful to expose the number of processes and tasks in the container cgroup. This would be at a very coarse granularity but would give some guidance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2403) MasterAllocatorTest/0.FrameworkReregistersFirst is flaky
[ https://issues.apache.org/jira/browse/MESOS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2403: - Sprint: Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 3) MasterAllocatorTest/0.FrameworkReregistersFirst is flaky Key: MESOS-2403 URL: https://issues.apache.org/jira/browse/MESOS-2403 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.23.0 Environment: ASF CI (Ubuntu) Reporter: Vinod Kone Assignee: Vinod Kone {code} [ RUN ] MasterAllocatorTest/0.FrameworkReregistersFirst Using temporary directory '/tmp/MasterAllocatorTest_0_FrameworkReregistersFirst_Vy5Nml' I0224 23:22:31.681670 30589 leveldb.cpp:176] Opened db in 2.943518ms I0224 23:22:31.682152 30619 process.cpp:2117] Dropped / Lost event for PID: slave(65)@67.195.81.187:38391 I0224 23:22:31.682732 30589 leveldb.cpp:183] Compacted db in 1.029469ms I0224 23:22:31.682777 30589 leveldb.cpp:198] Created db iterator in 15460ns I0224 23:22:31.682792 30589 leveldb.cpp:204] Seeked to beginning of db in 1832ns I0224 23:22:31.682802 30589 leveldb.cpp:273] Iterated through 0 keys in the db in 319ns I0224 23:22:31.682833 30589 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0224 23:22:31.683228 30605 recover.cpp:449] Starting replica recovery I0224 23:22:31.683537 30605 recover.cpp:475] Replica is in 4 status I0224 23:22:31.684624 30615 replica.cpp:641] Replica in 4 status received a broadcasted recover request I0224 23:22:31.684978 30616 recover.cpp:195] Received a recover response from a replica in 4 status I0224 23:22:31.685405 30610 recover.cpp:566] Updating replica status to 3 I0224 23:22:31.686249 30609 master.cpp:349] Master 20150224-232231-3142697795-38391-30589 (pomona.apache.org) started on 67.195.81.187:38391 I0224 23:22:31.686265 30617 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 717897ns I0224 23:22:31.686319 30617 replica.cpp:323] Persisted 
replica status to 3 I0224 23:22:31.686336 30609 master.cpp:395] Master only allowing authenticated frameworks to register I0224 23:22:31.686357 30609 master.cpp:400] Master only allowing authenticated slaves to register I0224 23:22:31.686390 30609 credentials.hpp:37] Loading credentials for authentication from '/tmp/MasterAllocatorTest_0_FrameworkReregistersFirst_Vy5Nml/credentials' I0224 23:22:31.686511 30606 recover.cpp:475] Replica is in 3 status I0224 23:22:31.686563 30609 master.cpp:442] Authorization enabled I0224 23:22:31.686929 30607 whitelist_watcher.cpp:79] No whitelist given I0224 23:22:31.686954 30603 hierarchical.hpp:287] Initialized hierarchical allocator process I0224 23:22:31.687134 30605 replica.cpp:641] Replica in 3 status received a broadcasted recover request I0224 23:22:31.687731 30609 master.cpp:1356] The newly elected leader is master@67.195.81.187:38391 with id 20150224-232231-3142697795-38391-30589 I0224 23:22:31.839818 30609 master.cpp:1369] Elected as the leading master! 
I0224 23:22:31.839834 30609 master.cpp:1187] Recovering from registrar I0224 23:22:31.839926 30605 registrar.cpp:313] Recovering registrar I0224 23:22:31.84 30613 recover.cpp:195] Received a recover response from a replica in 3 status I0224 23:22:31.840504 30606 recover.cpp:566] Updating replica status to 1 I0224 23:22:31.841599 30611 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 990330ns I0224 23:22:31.841627 30611 replica.cpp:323] Persisted replica status to 1 I0224 23:22:31.841743 30611 recover.cpp:580] Successfully joined the Paxos group I0224 23:22:31.841904 30611 recover.cpp:464] Recover process terminated I0224 23:22:31.842366 30608 log.cpp:660] Attempting to start the writer I0224 23:22:31.843557 30607 replica.cpp:477] Replica received implicit promise request with proposal 1 I0224 23:22:31.844312 30607 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 722368ns I0224 23:22:31.844337 30607 replica.cpp:345] Persisted promised to 1 I0224 23:22:31.844889 30615 coordinator.cpp:230] Coordinator attemping to fill missing position I0224 23:22:31.846043 30614 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2 I0224 23:22:31.846729 30614 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 660024ns I0224 23:22:31.846746 30614 replica.cpp:679] Persisted action at 0 I0224 23:22:31.847671 30611 replica.cpp:511] Replica received write request for position 0 I0224 23:22:31.847723 30611 leveldb.cpp:438] Reading position from leveldb took 27349ns I0224 23:22:31.848429 30611 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 671461ns I0224 23:22:31.848454 30611
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2332: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3) Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Paul Brett Assignee: Paul Brett Labels: features, twitter Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g.
{noformat}
$ tc -s -d qdisc show dev mesos19223
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc ingress : parent :fff1
 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
{noformat}
Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp. tx/rx) and commenting of the protobuf fields so it's clear what these represent and how they differ from the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation
[ https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2422: - Labels: twitter (was: ) Use fq_codel qdisc for egress network traffic isolation --- Key: MESOS-2422 URL: https://issues.apache.org/jira/browse/MESOS-2422 Project: Mesos Issue Type: Task Reporter: Cong Wang Assignee: Cong Wang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2289) Design doc for the HTTP API
[ https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2289: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3) Design doc for the HTTP API --- Key: MESOS-2289 URL: https://issues.apache.org/jira/browse/MESOS-2289 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Vinod Kone This tracks the design of the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2350) Add support for MesosContainerizerLaunch to chroot to a specified path
[ https://issues.apache.org/jira/browse/MESOS-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2350: - Sprint: Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 3) Add support for MesosContainerizerLaunch to chroot to a specified path -- Key: MESOS-2350 URL: https://issues.apache.org/jira/browse/MESOS-2350 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.22.0, 0.21.1 Reporter: Ian Downes Assignee: Ian Downes Labels: twitter In preparation for the MesosContainerizer to support a filesystem isolator the MesosContainerizerLauncher must support chrooting. Optionally, it should also configure the chroot environment by (re-)mounting special filesystems such as /proc and /sys and making device nodes such as /dev/zero, etc., such that the chroot environment is functional. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2058: - Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 1) Deprecate stats.json endpoints for Master and Slave --- Key: MESOS-2058 URL: https://issues.apache.org/jira/browse/MESOS-2058 Project: Mesos Issue Type: Task Components: master, slave Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: twitter Fix For: 0.23.0 With the introduction of the libprocess {{/metrics/snapshot}} endpoint, metrics are now duplicated in the Master and Slave between this and {{stats.json}}. We should deprecate the {{stats.json}} endpoints. Manual inspection of {{stats.json}} shows that all metrics are now covered by the new endpoint for Master and Slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation
[ https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2422: - Sprint: Twitter Mesos Q1 Sprint 4 Use fq_codel qdisc for egress network traffic isolation --- Key: MESOS-2422 URL: https://issues.apache.org/jira/browse/MESOS-2422 Project: Mesos Issue Type: Task Reporter: Cong Wang Assignee: Cong Wang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2418) Remove raw pointers from stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340367#comment-14340367 ] Dominic Hamon commented on MESOS-2418: -- no more boost please. there's also {{std::array<char>}} if it's available on our compiler/platform suite as it is more similar to the existing fixed-size buffer usage. {{std::vector<char>}} has the benefit of working in C++03 compatible compilers. given we can't reach consensus on any use of {{std::unique_ptr}} i doubt it's a good fit here. Remove raw pointers from stout/os.hpp - Key: MESOS-2418 URL: https://issues.apache.org/jira/browse/MESOS-2418 Project: Mesos Issue Type: Improvement Components: stout Reporter: Joerg Schad Priority: Minor In MESOS-2412 a memory leak was found because of a missing {{delete}}. Forgetting to free memory is a common error while manually managing memory. In order to prevent this issue from happening again, another strategy should be used to handle buffers. Among the options there are {{std::vector<char>}}, {{std::unique_ptr<char[]>}}, or {{boost::scoped_array<char>}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2412) Potential memleak(s) in stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338796#comment-14338796 ] Dominic Hamon commented on MESOS-2412: -- https://reviews.apache.org/r/31489/ Potential memleak(s) in stout/os.hpp Key: MESOS-2412 URL: https://issues.apache.org/jira/browse/MESOS-2412 Project: Mesos Issue Type: Bug Components: stout Reporter: Joerg Schad Assignee: Dominic Hamon Labels: coverity, twitter Coverity picked up this potential memleak in os.hpp where we do not delete buffer in the else case. The exact same pattern occurs in getuid(const Option<std::string>& user = None()). The corresponding CID 1230371 and 1230371.
{code}
inline Result<gid_t> getgid(const Option<std::string>& user = None())
...
  while (true) {
    char* buffer = new char[size];
    if (getpwnam_r(user.get().c_str(), &passwd, buffer, size, &result) == 0) {
      ...
      delete[] buffer;
      return gid;
    } else {
      // RHEL7 (and possibly other systems) will return non-zero and
      // set one of the following errors for "The given name or uid
      // was not found." See 'man getpwnam_r'. We only check for the
      // errors explicitly listed, and do not consider the ellipsis.
      if (errno == ENOENT || errno == ESRCH || errno == EBADF || errno == EPERM) {
        return None(); // HERE WE DO NOT DELETE BUFFER.
      }
      ...
      // getpwnam_r set ERANGE so try again with a larger buffer.
      size *= 2;
      delete[] buffer;
    }
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
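A leak-free variant of the loop above can be sketched with std::vector<char> owning the buffer, so every return path releases it automatically. This is a self-contained illustration (the helper name gidOf and the collapsed error handling are assumptions for the sketch, not stout's actual code).

```cpp
#include <cerrno>
#include <optional>
#include <pwd.h>
#include <string>
#include <sys/types.h>
#include <vector>

// Leak-free buffer handling for getpwnam_r: the vector frees itself on
// every return path, so no delete[] can be forgotten.
std::optional<gid_t> gidOf(const std::string& user) {
  std::vector<char> buffer(512);
  struct passwd pwd;
  struct passwd* result = nullptr;
  while (true) {
    int err = getpwnam_r(user.c_str(), &pwd,
                         buffer.data(), buffer.size(), &result);
    if (err == 0) {
      if (result == nullptr) return std::nullopt;  // User not found.
      return pwd.pw_gid;
    }
    if (err != ERANGE) {
      // ENOENT/ESRCH/EBADF/EPERM etc. (the RHEL7 case) all collapse to
      // "not found" in this sketch.
      return std::nullopt;
    }
    buffer.resize(buffer.size() * 2);  // Buffer too small: grow, retry.
  }
}
```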
[jira] [Assigned] (MESOS-2412) Potential memleak(s) in stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-2412: Assignee: Dominic Hamon Potential memleak(s) in stout/os.hpp Key: MESOS-2412 URL: https://issues.apache.org/jira/browse/MESOS-2412 Project: Mesos Issue Type: Bug Components: stout Reporter: Joerg Schad Assignee: Dominic Hamon Labels: coverity, twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2412) Potential memleak(s) in stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2412: - Sprint: Twitter Mesos Q1 Sprint 3 Potential memleak(s) in stout/os.hpp Key: MESOS-2412 URL: https://issues.apache.org/jira/browse/MESOS-2412 Project: Mesos Issue Type: Bug Components: stout Reporter: Joerg Schad Assignee: Dominic Hamon Labels: coverity, twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2412) Potential memleak(s) in stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2412: - Assignee: Joerg Schad (was: Dominic Hamon) Potential memleak(s) in stout/os.hpp Key: MESOS-2412 URL: https://issues.apache.org/jira/browse/MESOS-2412 Project: Mesos Issue Type: Bug Components: stout Reporter: Joerg Schad Assignee: Joerg Schad Labels: coverity, twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2412) Potential memleak(s) in stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2412: - Sprint: (was: Twitter Mesos Q1 Sprint 3) Potential memleak(s) in stout/os.hpp Key: MESOS-2412 URL: https://issues.apache.org/jira/browse/MESOS-2412 Project: Mesos Issue Type: Bug Components: stout Reporter: Joerg Schad Assignee: Joerg Schad Labels: coverity, twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2366: - Story Points: 1 MasterSlaveReconciliationTest.ReconcileLostTask is flaky Key: MESOS-2366 URL: https://issues.apache.org/jira/browse/MESOS-2366 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Dominic Hamon Labels: flaky-test https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes {code} [ RUN ] MasterSlaveReconciliationTest.ReconcileLostTask Using temporary directory '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF' I0218 01:53:26.881561 13918 leveldb.cpp:175] Opened db in 2.891605ms I0218 01:53:26.882547 13918 leveldb.cpp:182] Compacted db in 953447ns I0218 01:53:26.882596 13918 leveldb.cpp:197] Created db iterator in 20629ns I0218 01:53:26.882616 13918 leveldb.cpp:203] Seeked to beginning of db in 2370ns I0218 01:53:26.882627 13918 leveldb.cpp:272] Iterated through 0 keys in the db in 348ns I0218 01:53:26.882664 13918 replica.cpp:743] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0218 01:53:26.883124 13947 recover.cpp:448] Starting replica recovery I0218 01:53:26.883625 13941 recover.cpp:474] Replica is in 4 status I0218 01:53:26.884744 13945 replica.cpp:640] Replica in 4 status received a broadcasted recover request I0218 01:53:26.885118 13939 recover.cpp:194] Received a recover response from a replica in 4 status I0218 01:53:26.885565 13933 recover.cpp:565] Updating replica status to 3 I0218 01:53:26.886548 13932 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 733223ns I0218 01:53:26.886574 13932 replica.cpp:322] Persisted replica status to 3 I0218 01:53:26.886714 13943 master.cpp:347] Master 20150218-015326-3142697795-57268-13918 (pomona.apache.org) started on 67.195.81.187:57268 I0218 01:53:26.886760 13943 master.cpp:393] Master only allowing authenticated 
frameworks to register I0218 01:53:26.886772 13943 master.cpp:398] Master only allowing authenticated slaves to register I0218 01:53:26.886798 13943 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF/credentials' I0218 01:53:26.886826 13934 recover.cpp:474] Replica is in 3 status I0218 01:53:26.887151 13943 master.cpp:440] Authorization enabled I0218 01:53:26.887866 13944 replica.cpp:640] Replica in 3 status received a broadcasted recover request I0218 01:53:26.887969 13942 whitelist_watcher.cpp:78] No whitelist given I0218 01:53:26.888021 13940 hierarchical.hpp:286] Initialized hierarchical allocator process I0218 01:53:26.888178 13934 recover.cpp:194] Received a recover response from a replica in 3 status I0218 01:53:26.889114 13943 master.cpp:1354] The newly elected leader is master@67.195.81.187:57268 with id 20150218-015326-3142697795-57268-13918 I0218 01:53:27.064930 13948 process.cpp:2117] Dropped / Lost event for PID: hierarchical-allocator(183)@67.195.81.187:57268 I0218 01:53:27.911870 13943 master.cpp:1367] Elected as the leading master! I0218 01:53:27.911911 13943 master.cpp:1185] Recovering from registrar I0218 01:53:27.912106 13948 process.cpp:2117] Dropped / Lost event for PID: scheduler-93f78006-5b69-498b-b4e3-87cdf8062263@67.195.81.187:57268 I0218 01:53:27.912255 13932 registrar.cpp:312] Recovering registrar I0218 01:53:27.912307 13948 process.cpp:2117] Dropped / Lost event for PID: hierarchical-allocator(179)@67.195.81.187:57268 I0218 01:53:27.912626 13940 hierarchical.hpp:831] No resources available to allocate! 
I0218 01:53:27.912658 13940 hierarchical.hpp:738] Performed allocation for 0 slaves in 60316ns I0218 01:53:27.912838 13947 recover.cpp:565] Updating replica status to 1 I0218 01:53:27.913966 13947 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 921045ns I0218 01:53:27.913998 13947 replica.cpp:322] Persisted replica status to 1 I0218 01:53:27.914106 13932 recover.cpp:579] Successfully joined the Paxos group I0218 01:53:27.914378 13932 recover.cpp:463] Recover process terminated I0218 01:53:27.914916 13939 log.cpp:659] Attempting to start the writer I0218 01:53:27.916374 13937 replica.cpp:476] Replica received implicit promise request with proposal 1 I0218 01:53:27.916941 13937 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 534122ns I0218 01:53:27.916967 13937 replica.cpp:344] Persisted promised to 1 I0218 01:53:27.917795 13936 coordinator.cpp:229] Coordinator attemping to fill missing position I0218 01:53:27.919147 13941 replica.cpp:377] Replica received explicit promise request for position 0 with proposal 2 I0218
[jira] [Updated] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2366: - Sprint: Twitter Mesos Q1 Sprint 3 MasterSlaveReconciliationTest.ReconcileLostTask is flaky Key: MESOS-2366 URL: https://issues.apache.org/jira/browse/MESOS-2366 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Dominic Hamon Labels: flaky-test https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes
[jira] [Updated] (MESOS-2350) Add support for MesosContainerizerLaunch to chroot to a specified path
[ https://issues.apache.org/jira/browse/MESOS-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2350: - Labels: twitter (was: ) Add support for MesosContainerizerLaunch to chroot to a specified path -- Key: MESOS-2350 URL: https://issues.apache.org/jira/browse/MESOS-2350 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.22.0, 0.21.1 Reporter: Ian Downes Assignee: Ian Downes Labels: twitter In preparation for the MesosContainerizer to support a filesystem isolator the MesosContainerizerLauncher must support chrooting. Optionally, it should also configure the chroot environment by (re-)mounting special filesystems such as /proc and /sys and making device nodes such as /dev/zero, etc., such that the chroot environment is functional. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2359) Expose slave's memory and cpu cgroup metrics
[ https://issues.apache.org/jira/browse/MESOS-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2359: - Component/s: twitter Expose slave's memory and cpu cgroup metrics Key: MESOS-2359 URL: https://issues.apache.org/jira/browse/MESOS-2359 Project: Mesos Issue Type: Improvement Components: isolation, twitter Reporter: Ian Downes Priority: Minor The slave can optionally be placed into its own cgroups (--slave_cgroups=). If this is enabled, we should export the relevant metrics - in preference or in addition to the process based metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333701#comment-14333701 ] Dominic Hamon commented on MESOS-2366: -- looks like waiting for the status update acknowledgement message should be enough. MasterSlaveReconciliationTest.ReconcileLostTask is flaky Key: MESOS-2366 URL: https://issues.apache.org/jira/browse/MESOS-2366 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Dominic Hamon Labels: flaky-test https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes
[jira] [Created] (MESOS-2386) Provide full filesystem isolation as a native mesos isolator
Dominic Hamon created MESOS-2386: Summary: Provide full filesystem isolation as a native mesos isolator Key: MESOS-2386 URL: https://issues.apache.org/jira/browse/MESOS-2386 Project: Mesos Issue Type: Epic Reporter: Dominic Hamon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1397) Rename ResourceStatistics for containers
[ https://issues.apache.org/jira/browse/MESOS-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1397: - Labels: twitter (was: ) Rename ResourceStatistics for containers Key: MESOS-1397 URL: https://issues.apache.org/jira/browse/MESOS-1397 Project: Mesos Issue Type: Improvement Reporter: Ian Downes Labels: twitter Rename to ContainerStatistics, which includes an optional ResourceStatistics and an optional PerfStatistics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1282) Support unprivileged access to cgroups
[ https://issues.apache.org/jira/browse/MESOS-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1282: - Labels: twitter (was: ) Support unprivileged access to cgroups -- Key: MESOS-1282 URL: https://issues.apache.org/jira/browse/MESOS-1282 Project: Mesos Issue Type: Improvement Affects Versions: 0.19.0 Reporter: Ian Downes Priority: Minor Labels: twitter Attachments: MESOS-1282.patch Supporting this would allow running tests with cgroup isolators on CI machines where sudo access is unavailable. This could be achieved by having the subsystems mounted and the mesos (or mesos_test) cgroup created and owned by the unprivileged user.
{noformat}
[vagrant@mesos cpu]$ cat /proc/mounts | grep cgroup
tmpfs /sys/fs/cgroup tmpfs rw,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset,clone_children 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu,clone_children 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct,clone_children 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory,clone_children 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices,clone_children 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer,clone_children 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,relatime,net_cls,clone_children 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio,clone_children 0 0
[vagrant@mesos cpu]$ pwd
/sys/fs/cgroup/cpu
[vagrant@mesos cpu]$ ls -la
total 0
drwxr-xr-x  2 root root   0 May  1 22:11 .
drwxrwxrwt 10 root root 200 Apr 30 23:09 ..
-rw-r--r--  1 root root   0 Apr 30 23:14 cgroup.clone_children
--w--w--w-  1 root root   0 Apr 30 23:09 cgroup.event_control
-rw-r--r--  1 root root   0 Apr 30 23:09 cgroup.procs
-rw-r--r--  1 root root   0 Apr 30 23:09 cpu.cfs_period_us
-rw-r--r--  1 root root   0 Apr 30 23:09 cpu.cfs_quota_us
-rw-r--r--  1 root root   0 Apr 30 23:09 cpu.rt_period_us
-rw-r--r--  1 root root   0 Apr 30 23:09 cpu.rt_runtime_us
-rw-r--r--  1 root root   0 Apr 30 23:09 cpu.shares
-r--r--r--  1 root root   0 Apr 30 23:09 cpu.stat
-rw-r--r--  1 root root   0 Apr 30 23:09 notify_on_release
-rw-r--r--  1 root root   0 Apr 30 23:09 release_agent
-rw-r--r--  1 root root   0 Apr 30 23:09 tasks
{noformat}
User is unprivileged:
{noformat}
[vagrant@mesos cpu]$ id
uid=500(vagrant) gid=500(vagrant) groups=500(vagrant),10(wheel)
[vagrant@mesos cpu]$ mkdir mesos
mkdir: cannot create directory `mesos': Permission denied
{noformat}
Create a cgroup and chown it to the unprivileged user:
{noformat}
[vagrant@mesos cpu]$ sudo mkdir mesos
[vagrant@mesos cpu]$ sudo chown -R vagrant:vagrant mesos
[vagrant@mesos cpu]$ ls -la
total 0
drwxr-xr-x  3 root    root      0 May  1 22:11 .
drwxrwxrwt 10 root    root    200 Apr 30 23:09 ..
-rw-r--r--  1 root    root      0 Apr 30 23:14 cgroup.clone_children
--w--w--w-  1 root    root      0 Apr 30 23:09 cgroup.event_control
-rw-r--r--  1 root    root      0 Apr 30 23:09 cgroup.procs
-rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.cfs_period_us
-rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.cfs_quota_us
-rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.rt_period_us
-rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.rt_runtime_us
-rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.shares
-r--r--r--  1 root    root      0 Apr 30 23:09 cpu.stat
drwxr-xr-x  2 vagrant vagrant   0 May  1 22:12 mesos
-rw-r--r--  1 root    root      0 Apr 30 23:09 notify_on_release
-rw-r--r--  1 root    root      0 Apr 30 23:09 release_agent
-rw-r--r--  1 root    root      0 Apr 30 23:09 tasks
{noformat}
The unprivileged user can now create nested cgroups and move processes into/out of cgroups it owns.
{noformat}
[vagrant@mesos cpu]$ echo $$
2877
[vagrant@mesos cpu]$ echo $$ > mesos/cgroup.procs
[vagrant@mesos cpu]$ cat mesos/cgroup.procs
2877
2957
[vagrant@mesos cpu]$ mkdir mesos/test
[vagrant@mesos cpu]$ echo $$ > mesos/test/cgroup.procs
[vagrant@mesos cpu]$ cat mesos/test/cgroup.procs
2877
2960
[vagrant@mesos cpu]$ echo $$ > mesos/cgroup.procs
[vagrant@mesos cpu]$ cat mesos/cgroup.procs
2877
2977
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-886) Slave should wait until resources are isolated before launching tasks
[ https://issues.apache.org/jira/browse/MESOS-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333741#comment-14333741 ] Dominic Hamon commented on MESOS-886: - Is this still relevant? Slave should wait until resources are isolated before launching tasks - Key: MESOS-886 URL: https://issues.apache.org/jira/browse/MESOS-886 Project: Mesos Issue Type: Bug Components: isolation, slave Affects Versions: 0.14.0 Reporter: Ian Downes Assignee: Yifan Gu Priority: Minor Labels: twitter The slave dispatches to the isolator to update resources and then sends RunTaskMessage to the executor without waiting for the update to complete. This race could, for example, lead to the task using too much RAM (including file cache) and then being OOM killed whenever the resource update completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2386) Provide full filesystem isolation as a native mesos isolator
[ https://issues.apache.org/jira/browse/MESOS-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2386: - Labels: twitter (was: ) Provide full filesystem isolation as a native mesos isolator Key: MESOS-2386 URL: https://issues.apache.org/jira/browse/MESOS-2386 Project: Mesos Issue Type: Epic Reporter: Dominic Hamon Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-886) Slave should wait until resources are isolated before launching tasks
[ https://issues.apache.org/jira/browse/MESOS-886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-886: Labels: twitter (was: ) Slave should wait until resources are isolated before launching tasks - Key: MESOS-886 URL: https://issues.apache.org/jira/browse/MESOS-886 Project: Mesos Issue Type: Bug Components: isolation, slave Affects Versions: 0.14.0 Reporter: Ian Downes Assignee: Yifan Gu Priority: Minor Labels: twitter The slave dispatches to the isolator to update resources and then sends RunTaskMessage to the executor without waiting for the update to complete. This race could, for example, lead to the task using too much RAM (including file cache) and then being OOM killed whenever the resource update completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333701#comment-14333701 ] Dominic Hamon edited comment on MESOS-2366 at 2/23/15 8:10 PM: --- looks like waiting for the status update acknowledgement message should be enough. The master updates the metrics in {{updateTask}}, called from {{statusUpdate}}. It's possible that the StatusUpdate message has been sent (which we check for) but not acted on by the Master yet, hence the metrics have not been updated. Waiting for the explicit acknowledgement is a proxy signal that the Master has updated the metrics. was (Author: dhamon): looks like waiting for the status update acknowledgement message should be enough. MasterSlaveReconciliationTest.ReconcileLostTask is flaky Key: MESOS-2366 URL: https://issues.apache.org/jira/browse/MESOS-2366 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Dominic Hamon Labels: flaky-test https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Sprint: Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Assignee: Yan Xu Priority: Minor Labels: flaky, twitter Occurred on review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (and doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2058: - Fix Version/s: 0.23.0 Deprecate stats.json endpoints for Master and Slave --- Key: MESOS-2058 URL: https://issues.apache.org/jira/browse/MESOS-2058 Project: Mesos Issue Type: Task Components: master, slave Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: twitter Fix For: 0.23.0 With the introduction of the libprocess {{/metrics/snapshot}} endpoint, metrics are now duplicated in the Master and Slave between this and {{stats.json}}. We should deprecate the {{stats.json}} endpoints. Manual inspection of {{stats.json}} shows that all metrics are now covered by the new endpoint for Master and Slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326659#comment-14326659 ] Dominic Hamon commented on MESOS-2366: -- That's curious. That suggests that the status update is not being received at the master, but I see it in the log. We could remove the metrics from the test temporarily, but it suggests that there's some wait missing in the test itself, or some check not present. MasterSlaveReconciliationTest.ReconcileLostTask is flaky Key: MESOS-2366 URL: https://issues.apache.org/jira/browse/MESOS-2366 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Labels: flaky-test https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes {code} [ RUN ] MasterSlaveReconciliationTest.ReconcileLostTask Using temporary directory '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF' I0218 01:53:26.881561 13918 leveldb.cpp:175] Opened db in 2.891605ms I0218 01:53:26.882547 13918 leveldb.cpp:182] Compacted db in 953447ns I0218 01:53:26.882596 13918 leveldb.cpp:197] Created db iterator in 20629ns I0218 01:53:26.882616 13918 leveldb.cpp:203] Seeked to beginning of db in 2370ns I0218 01:53:26.882627 13918 leveldb.cpp:272] Iterated through 0 keys in the db in 348ns I0218 01:53:26.882664 13918 replica.cpp:743] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0218 01:53:26.883124 13947 recover.cpp:448] Starting replica recovery I0218 01:53:26.883625 13941 recover.cpp:474] Replica is in 4 status I0218 01:53:26.884744 13945 replica.cpp:640] Replica in 4 status received a broadcasted recover request I0218 01:53:26.885118 13939 recover.cpp:194] Received a recover response from a replica in 4 status I0218 01:53:26.885565 13933 recover.cpp:565] Updating replica status to 3 I0218 01:53:26.886548 13932 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 733223ns I0218 01:53:26.886574 13932 
replica.cpp:322] Persisted replica status to 3 I0218 01:53:26.886714 13943 master.cpp:347] Master 20150218-015326-3142697795-57268-13918 (pomona.apache.org) started on 67.195.81.187:57268 I0218 01:53:26.886760 13943 master.cpp:393] Master only allowing authenticated frameworks to register I0218 01:53:26.886772 13943 master.cpp:398] Master only allowing authenticated slaves to register I0218 01:53:26.886798 13943 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF/credentials' I0218 01:53:26.886826 13934 recover.cpp:474] Replica is in 3 status I0218 01:53:26.887151 13943 master.cpp:440] Authorization enabled I0218 01:53:26.887866 13944 replica.cpp:640] Replica in 3 status received a broadcasted recover request I0218 01:53:26.887969 13942 whitelist_watcher.cpp:78] No whitelist given I0218 01:53:26.888021 13940 hierarchical.hpp:286] Initialized hierarchical allocator process I0218 01:53:26.888178 13934 recover.cpp:194] Received a recover response from a replica in 3 status I0218 01:53:26.889114 13943 master.cpp:1354] The newly elected leader is master@67.195.81.187:57268 with id 20150218-015326-3142697795-57268-13918 I0218 01:53:27.064930 13948 process.cpp:2117] Dropped / Lost event for PID: hierarchical-allocator(183)@67.195.81.187:57268 I0218 01:53:27.911870 13943 master.cpp:1367] Elected as the leading master! I0218 01:53:27.911911 13943 master.cpp:1185] Recovering from registrar I0218 01:53:27.912106 13948 process.cpp:2117] Dropped / Lost event for PID: scheduler-93f78006-5b69-498b-b4e3-87cdf8062263@67.195.81.187:57268 I0218 01:53:27.912255 13932 registrar.cpp:312] Recovering registrar I0218 01:53:27.912307 13948 process.cpp:2117] Dropped / Lost event for PID: hierarchical-allocator(179)@67.195.81.187:57268 I0218 01:53:27.912626 13940 hierarchical.hpp:831] No resources available to allocate! 
I0218 01:53:27.912658 13940 hierarchical.hpp:738] Performed allocation for 0 slaves in 60316ns I0218 01:53:27.912838 13947 recover.cpp:565] Updating replica status to 1 I0218 01:53:27.913966 13947 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 921045ns I0218 01:53:27.913998 13947 replica.cpp:322] Persisted replica status to 1 I0218 01:53:27.914106 13932 recover.cpp:579] Successfully joined the Paxos group I0218 01:53:27.914378 13932 recover.cpp:463] Recover process terminated I0218 01:53:27.914916 13939 log.cpp:659] Attempting to start the writer I0218 01:53:27.916374 13937 replica.cpp:476] Replica received implicit promise request with proposal 1 I0218 01:53:27.916941 13937 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 534122ns I0218 01:53:27.916967 13937
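The comment above suggests the test is missing a wait or a check rather than the metrics being at fault. A generic poll-until-condition helper (a sketch, not Mesos test code; the name `wait_until` is illustrative) shows the usual fix: replace timing assumptions with an explicit wait on the observable state.

```python
import time

def wait_until(condition, timeout=5.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition held before the deadline, False otherwise.
    A test asserting on asynchronously delivered state (e.g. a status update
    reaching the master) should wait like this instead of assuming the event
    has already been processed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()

# Usage: wait for state flipped by another thread or event loop,
# e.g. wait_until(lambda: master.received_status_update())
```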
[jira] [Resolved] (MESOS-1708) Using the wrong resource name should report a better error.
[ https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon resolved MESOS-1708. -- Resolution: Fixed Fix Version/s: 0.22.0 Using the wrong resource name should report a better error. - Key: MESOS-1708 URL: https://issues.apache.org/jira/browse/MESOS-1708 Project: Mesos Issue Type: Bug Components: framework, master Reporter: Benjamin Hindman Assignee: Dominic Hamon Labels: newbie, twitter Fix For: 0.22.0 If a scheduler launches a task using resources the master doesn't know about the task validator causes the task to fail but the error message is not very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
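The fix described above amounts to naming the offending resources instead of failing generically. A hypothetical sketch of such a validator (the function and the resource set are illustrative, not the actual Mesos task validator):

```python
KNOWN_RESOURCES = {"cpus", "mem", "disk", "ports"}  # illustrative, not exhaustive

def validate_resource_names(requested):
    """Return None if every requested resource name is known to the master,
    else a helpful error message naming the unknown resources.

    A sketch of the kind of message the task validator could produce
    instead of a generic task failure.
    """
    unknown = sorted(set(requested) - KNOWN_RESOURCES)
    if not unknown:
        return None
    return ("Task uses unknown resource(s): %s; resources known to this "
            "master: %s" % (", ".join(unknown),
                            ", ".join(sorted(KNOWN_RESOURCES))))

# e.g. a scheduler typo'ing 'cpu' instead of 'cpus' now gets a pointed message:
msg = validate_resource_names(["cpu", "mem"])
```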
[jira] [Resolved] (MESOS-2185) slave state endpoint does not contain all resources in the resources field
[ https://issues.apache.org/jira/browse/MESOS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon resolved MESOS-2185. -- Resolution: Fixed Fix Version/s: 0.22.0 commit 73ddc21f44e65499d4179bb15edf97243c8ab18c (HEAD, origin/master, origin/HEAD, master) Author: Joerg Schad jo...@mesosphere.io Commit: Dominic Hamon d...@twitter.com Included all resources in state endpoint. Review: https://reviews.apache.org/r/31082 slave state endpoint does not contain all resources in the resources field -- Key: MESOS-2185 URL: https://issues.apache.org/jira/browse/MESOS-2185 Project: Mesos Issue Type: Bug Components: json api, slave Affects Versions: 0.21.0 Environment: Centos 6.5 / Centos 6.6 Reporter: Henning Schmiedehausen Assignee: Joerg Schad Labels: mesosphere Fix For: 0.22.0 fetching status for a slave from the /state.json yields resources: { ports: [31000-32000], mem: 512, disk: 33659, cpus: 1 } but in the flags section, it lists flags: { resources: cpus:1;mem:512;ports:[31000-32000];set:{label_a,label_b,label_c,label_d};range:[0-1000];scalar:108;numbers:{4,8,15,16,23,42}, } so there are additional resources. these resources show up when sending offers from that slave to the frameworks and the frameworks can use and consume them. This may just be a reporting issue with the state.json endpoint. https://gist.github.com/hgschmie/0dc4f599bb0ff2e815ed is the full response. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
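The discrepancy above can be checked mechanically by parsing the semicolon-separated `flags.resources` string and diffing it against the names in the `resources` object from /state.json. A sketch under the formats quoted in the ticket (helper names are mine):

```python
def parse_resources_flag(flag):
    """Parse a Mesos --resources style string such as
    'cpus:1;mem:512;ports:[31000-32000]' into a name -> value-string dict."""
    out = {}
    for item in filter(None, flag.split(";")):
        name, _, value = item.partition(":")
        out[name.strip()] = value.strip()
    return out

def missing_from_state(flag, state_resources):
    """Resource names present in flags.resources but absent from the
    /state.json 'resources' object -- the discrepancy this ticket reports."""
    return sorted(set(parse_resources_flag(flag)) - set(state_resources))

# Values taken from the ticket (abbreviated):
flag = ("cpus:1;mem:512;ports:[31000-32000];"
        "set:{label_a,label_b};range:[0-1000];scalar:108")
state = {"ports": "[31000-32000]", "mem": 512, "disk": 33659, "cpus": 1}
```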
[jira] [Updated] (MESOS-2244) RoutingTest.INETSockets fails
[ https://issues.apache.org/jira/browse/MESOS-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2244: - Assignee: Chi Zhang RoutingTestINETSockets fails - Key: MESOS-2244 URL: https://issues.apache.org/jira/browse/MESOS-2244 Project: Mesos Issue Type: Bug Environment: Ubuntu 14.10, libnl 3.2.25 Reporter: Evelina Dumitrescu Assignee: Chi Zhang [ RUN ] RoutingTest.INETSockets *** stack smashing detected ***: /home/evelina/mesos2/mesos/build/src/.libs/lt-mesos-tests terminated *** Aborted at 1421895912 (unix time) try date -d @1421895912 if you are using GNU date *** PC: @ 0x7f3566460d27 (unknown) *** SIGABRT (@0x3e81633) received by PID 5683 (TID 0x7f356c53a7c0) from PID 5683; stack trace: *** @ 0x7f35667fec90 (unknown) @ 0x7f3566460d27 (unknown) @ 0x7f3566462418 (unknown) @ 0x7f35664a29f4 (unknown) @ 0x7f35665365cc (unknown) @ 0x7f3566536570 (unknown) @ 0x7f3566226753 idiagnl_msg_parse @ 0x7f356622678b idiagnl_msg_parser @ 0x7f3565dac4c9 nl_cache_parse @ 0x7f3565dac51b update_msg_parser @ 0x7f3565db1fbf nl_recvmsgs_report @ 0x7f3565db2329 nl_recvmsgs @ 0x7f3565dab9c9 __cache_pickup @ 0x7f3565dac43d nl_cache_pickup @ 0x7f3565dac66e nl_cache_refill @ 0x7f3566226024 idiagnl_msg_alloc_cache @ 0x7f356a95f455 routing::diagnosis::socket::infos() @ 0x114da90 RoutingTest_INETSockets_Test::TestBody() @ 0x11e6957 testing::internal::HandleSehExceptionsInMethodIfSupported() @ 0x11e151d testing::internal::HandleExceptionsInMethodIfSupported() @ 0x11c7adb testing::Test::Run() @ 0x11c8253 testing::TestInfo::Run() @ 0x11c87f6 testing::TestCase::Run() @ 0x11cd987 testing::internal::UnitTestImpl::RunAllTests() @ 0x11e7905 testing::internal::HandleSehExceptionsInMethodIfSupported() @ 0x11e2304 testing::internal::HandleExceptionsInMethodIfSupported() @ 0x11cc74a testing::UnitTest::Run() @ 0xd7a4ad main @ 0x7f356644bec5 (unknown) @ 0x91ccb9 (unknown) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1708) Using the wrong resource name should report a better error.
[ https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1708: - Sprint: Twitter Mesos Q1 Sprint 3 Using the wrong resource name should report a better error. - Key: MESOS-1708 URL: https://issues.apache.org/jira/browse/MESOS-1708 Project: Mesos Issue Type: Bug Components: framework, master Reporter: Benjamin Hindman Assignee: Dominic Hamon Labels: newbie, twitter If a scheduler launches a task using resources the master doesn't know about the task validator causes the task to fail but the error message is not very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-998) Slave should wait until Containerizer::update() completes successfully
[ https://issues.apache.org/jira/browse/MESOS-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-998: Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Slave should wait until Containerizer::update() completes successfully -- Key: MESOS-998 URL: https://issues.apache.org/jira/browse/MESOS-998 Project: Mesos Issue Type: Bug Components: isolation Affects Versions: 0.18.0, 0.19.0, 0.20.0, 0.21.0, 0.19.1, 0.20.1, 0.21.1 Reporter: Ian Downes Assignee: Jie Yu Container resources are updated in several places in the slave and we don't check the update was successful or even wait until it completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2103) Expose number and state of threads in a container
[ https://issues.apache.org/jira/browse/MESOS-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2103: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Expose number and state of threads in a container - Key: MESOS-2103 URL: https://issues.apache.org/jira/browse/MESOS-2103 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Chi Zhang Labels: twitter The CFS cpu statistics (cpus_nr_throttled, cpus_nr_periods, cpus_throttled_time) are difficult to interpret. 1) nr_throttled is the number of intervals where *any* throttling occurred 2) throttled_time is the aggregate time *across all runnable tasks* (tasks in the Linux sense). For example, in a typical 60 second sampling interval: nr_periods = 600, nr_throttled could be 60, i.e., 10% of intervals, but throttled_time could be much higher than (60/600) * 60 = 6 seconds if there is more than one task that is runnable but throttled. *Each* throttled task contributes to the total throttled time. Small test to demonstrate throttled_time nr_periods * quota_interval: 5 x {{'openssl speed'}} running with quota=100ms: {noformat} cat cpu.stat sleep 1 cat cpu.stat nr_periods 3228 nr_throttled 1276 throttled_time 528843772540 nr_periods 3238 nr_throttled 1286 throttled_time 531668964667 {noformat} All 10 intervals throttled (100%) for total time of 2.8 seconds in 1 second (more than 100% of the time interval) It would be helpful to expose the number and state of tasks in the container cgroup. This would be at a very coarse granularity but would give some guidance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
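The arithmetic above (all intervals throttled, 2.8 seconds of aggregate throttled time inside a 1 second sample) can be reproduced from the two cpu.stat snapshots quoted in the ticket. A sketch that parses the standard cgroup cpu.stat format and computes the deltas:

```python
def parse_cpu_stat(text):
    """Parse cgroup cpu.stat contents ('name value' per line) into a dict."""
    return {k: int(v)
            for k, v in (line.split() for line in text.splitlines() if line.strip())}

def throttle_summary(before, after):
    """Deltas between two cpu.stat snapshots: the fraction of CFS periods in
    which any throttling occurred, and the throttled time (aggregated across
    all runnable tasks, hence possibly more than wall time) in seconds."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    seconds = (after["throttled_time"] - before["throttled_time"]) / 1e9
    return (throttled / periods if periods else 0.0), seconds

# Snapshots quoted in the ticket, taken one second apart:
before = parse_cpu_stat("nr_periods 3228\nnr_throttled 1276\nthrottled_time 528843772540")
after = parse_cpu_stat("nr_periods 3238\nnr_throttled 1286\nthrottled_time 531668964667")
frac, secs = throttle_summary(before, after)  # 1.0 (all 10 periods), ~2.83 s
```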
[jira] [Updated] (MESOS-2289) Design doc for the HTTP API
[ https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2289: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Design doc for the HTTP API --- Key: MESOS-2289 URL: https://issues.apache.org/jira/browse/MESOS-2289 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Vinod Kone This tracks the design of the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2136) Expose per-cgroup memory pressure
[ https://issues.apache.org/jira/browse/MESOS-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2136: - Sprint: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2) Expose per-cgroup memory pressure - Key: MESOS-2136 URL: https://issues.apache.org/jira/browse/MESOS-2136 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Ian Downes Assignee: Chi Zhang Labels: twitter The cgroup memory controller can provide information on the memory pressure of a cgroup. This is in the form of an event based notification where events of (low, medium, critical) are generated when the kernel makes specific actions to allocate memory. This signal is probably more informative than comparing memory usage to memory limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2244) RoutingTest.INETSockets fails
[ https://issues.apache.org/jira/browse/MESOS-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324677#comment-14324677 ] Dominic Hamon commented on MESOS-2244: -- is this an instance of mismatched libnl header/kernel? RoutingTestINETSockets fails - Key: MESOS-2244 URL: https://issues.apache.org/jira/browse/MESOS-2244 Project: Mesos Issue Type: Bug Environment: Ubuntu 14.10, libnl 3.2.25 Reporter: Evelina Dumitrescu Assignee: Chi Zhang [ RUN ] RoutingTest.INETSockets *** stack smashing detected ***: /home/evelina/mesos2/mesos/build/src/.libs/lt-mesos-tests terminated *** Aborted at 1421895912 (unix time) try date -d @1421895912 if you are using GNU date *** PC: @ 0x7f3566460d27 (unknown) *** SIGABRT (@0x3e81633) received by PID 5683 (TID 0x7f356c53a7c0) from PID 5683; stack trace: *** @ 0x7f35667fec90 (unknown) @ 0x7f3566460d27 (unknown) @ 0x7f3566462418 (unknown) @ 0x7f35664a29f4 (unknown) @ 0x7f35665365cc (unknown) @ 0x7f3566536570 (unknown) @ 0x7f3566226753 idiagnl_msg_parse @ 0x7f356622678b idiagnl_msg_parser @ 0x7f3565dac4c9 nl_cache_parse @ 0x7f3565dac51b update_msg_parser @ 0x7f3565db1fbf nl_recvmsgs_report @ 0x7f3565db2329 nl_recvmsgs @ 0x7f3565dab9c9 __cache_pickup @ 0x7f3565dac43d nl_cache_pickup @ 0x7f3565dac66e nl_cache_refill @ 0x7f3566226024 idiagnl_msg_alloc_cache @ 0x7f356a95f455 routing::diagnosis::socket::infos() @ 0x114da90 RoutingTest_INETSockets_Test::TestBody() @ 0x11e6957 testing::internal::HandleSehExceptionsInMethodIfSupported() @ 0x11e151d testing::internal::HandleExceptionsInMethodIfSupported() @ 0x11c7adb testing::Test::Run() @ 0x11c8253 testing::TestInfo::Run() @ 0x11c87f6 testing::TestCase::Run() @ 0x11cd987 testing::internal::UnitTestImpl::RunAllTests() @ 0x11e7905 testing::internal::HandleSehExceptionsInMethodIfSupported() @ 0x11e2304 testing::internal::HandleExceptionsInMethodIfSupported() @ 0x11cc74a testing::UnitTest::Run() @ 0xd7a4ad main @ 0x7f356644bec5 (unknown) @ 0x91ccb9 (unknown) 
[jira] [Updated] (MESOS-2123) Document changes in C++ Resources API in CHANGELOG.
[ https://issues.apache.org/jira/browse/MESOS-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2123: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Document changes in C++ Resources API in CHANGELOG. --- Key: MESOS-2123 URL: https://issues.apache.org/jira/browse/MESOS-2123 Project: Mesos Issue Type: Task Reporter: Jie Yu Labels: twitter With the refactor introduced in MESOS-1974, we need to document those API changes in CHANGELOG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1690) Expose metric for container destroy failures
[ https://issues.apache.org/jira/browse/MESOS-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1690: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Expose metric for container destroy failures Key: MESOS-1690 URL: https://issues.apache.org/jira/browse/MESOS-1690 Project: Mesos Issue Type: Bug Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Vinod Kone Increment counter when container destroy fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2332: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Paul Brett Assignee: Paul Brett Labels: features, twitter Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g. {noformat} $ tc -s -d qdisc show dev mesos19223 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc ingress : parent :fff1 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0) backlog 0b 0p requeues 0 {noformat} Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp tx/rx) + commenting of the protobuf fields so it's clear what these represent and how they are different to the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
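A quick way to see how such metrics could be extracted is to parse the `tc -s -d qdisc show` text quoted above for the dropped/overlimits/requeues triples. This regex sketch is only an illustration of reading the sample output; the isolator itself would use netlink rather than scrape tc:

```python
import re

# The qdisc report quoted in the ticket (reflowed onto separate lines).
sample = """\
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc ingress : parent :fff1
 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
"""

def parse_qdisc_stats(output):
    """Extract (qdisc, dropped, overlimits, requeues) tuples from
    'tc -s -d qdisc show' output."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
            for m in re.finditer(
                r"qdisc (\S+).*?\(dropped (\d+), overlimits (\d+) requeues (\d+)\)",
                output, re.S)]
```

Note the ingress qdisc's drop count is where the packet loss shows up, while overlimits on the egress side can exceed total packets sent, as the ticket warns.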
[jira] [Updated] (MESOS-2031) Manage persistent directories on slave.
[ https://issues.apache.org/jira/browse/MESOS-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2031: - Sprint: Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2) Manage persistent directories on slave. --- Key: MESOS-2031 URL: https://issues.apache.org/jira/browse/MESOS-2031 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Jie Yu Whenever a slave sees a persistent disk resource (in ExecutorInfo or TaskInfo) that is new to it, it will create a persistent directory which is for tasks to store persistent data. The slave needs to do the following after it's created: 1) symlink into the executor sandbox so that tasks/executor can see it 2) garbage collect it once it is released by the framework -- This message was sent by Atlassian JIRA (v6.3.4#6332)
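The two slave-side steps described above (symlink the persistent directory into the sandbox, garbage collect it on release) can be sketched as follows. The directory layout and function names are illustrative, not the slave's actual implementation:

```python
import os
import shutil

def expose_persistent_dir(persistent_root, volume_id, sandbox):
    """Create the persistent directory for `volume_id` if it is new, and
    symlink it into the executor sandbox so tasks/executor can see it."""
    target = os.path.join(persistent_root, volume_id)
    os.makedirs(target, exist_ok=True)
    link = os.path.join(sandbox, volume_id)
    if not os.path.islink(link):
        os.symlink(target, link)
    return link

def gc_persistent_dir(persistent_root, volume_id):
    """Garbage-collect the persistent directory once the framework releases
    it. (The real slave would also clean up the sandbox symlink.)"""
    shutil.rmtree(os.path.join(persistent_root, volume_id), ignore_errors=True)
```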
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Assignee: Yan Xu Priority: Minor Labels: flaky, twitter Occured on review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (And doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2361) Add metrics to status update manager to expose number of outstanding (un-ack'ed) status updates
[ https://issues.apache.org/jira/browse/MESOS-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324975#comment-14324975 ] Dominic Hamon commented on MESOS-2361: -- the queue length is easily exposed as it is on Master and the Scheduler driver already. see src/master/metrics.hpp:157 - 159. Add metrics to status update manager to expose number of outstanding (un-ack'ed) status updates --- Key: MESOS-2361 URL: https://issues.apache.org/jira/browse/MESOS-2361 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen We have experienced custom executors with high volume of status updates cause congestion on the slave due to framework unavailability (either from being disconnected or not processing status updates fast enough). As a first step, it would be helpful to expose the status update stream/queue depths. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
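The comment points at the master's gauges (src/master/metrics.hpp:157-159) as the pattern to follow: a gauge polls a callable at snapshot time rather than being pushed to. A minimal pull-style sketch of that idea (the class and metric name are hypothetical, not the libprocess metrics API):

```python
class Metrics:
    """Minimal pull-style gauge registry, loosely modeled on libprocess
    gauges that evaluate a callable when a snapshot is requested."""

    def __init__(self):
        self._gauges = {}

    def add_gauge(self, name, fn):
        self._gauges[name] = fn

    def snapshot(self):
        return {name: fn() for name, fn in self._gauges.items()}

# The status update manager would register its queue depth:
pending = ["update-1", "update-2"]  # un-acked status updates
m = Metrics()
m.add_gauge("slave/status_updates/pending", lambda: len(pending))
```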
[jira] [Comment Edited] (MESOS-2344) segfaults running make check from ev integration
[ https://issues.apache.org/jira/browse/MESOS-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316853#comment-14316853 ] Dominic Hamon edited comment on MESOS-2344 at 2/11/15 7:45 PM: --- a different one now: {noformat} (gdb) bt #0 boost::range_detail::range_endmultihashmapint, process::Ownedprocess::PromiseOptionintconst (c=...) at 3rdparty/boost-1.53.0/boost/range/end.hpp:44 #1 0x7680c665 in boost::range_adl_barrier::endmultihashmapint, process::Ownedprocess::PromiseOptionint (r=...) at 3rdparty/boost-1.53.0/boost/range/end.hpp:113 #2 0x7680c4f5 in boost::foreach_detail_::endmultihashmapint, process::Ownedprocess::PromiseOptionint , mpl_::bool_true (col=...) at 3rdparty/boost-1.53.0/boost/foreach.hpp:714 #3 0x768096ae in multihashmapint, process::Ownedprocess::PromiseOptionint ::keys (this=0x7fffdc009878) at ../../../3rdparty/libprocess/3rdparty/stout/include/stout/multihashmap.hpp:74 #4 0x7680911e in process::ReaperProcess::wait (this=0x7fffdc009870) at ../../../3rdparty/libprocess/src/reap.cpp:82 #5 0x7680a968 in operator() (this=0x7fffe0004180, process=0x7fffdc0098a8) at ../../../3rdparty/libprocess/include/process/c++11/dispatch.hpp:78 #6 0x7680a612 in std::_Function_handlervoid (process::ProcessBase*), void process::dispatchprocess::ReaperProcess(process::PIDprocess::ReaperProcess const, void (process::ReaperProcess::*)())::{lambda(process::ProcessBase*)#1}::_M_invoke(std::_Any_data const, process::ProcessBase*) (__functor=..., __args=0x7fffdc0098a8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2071 #7 0x767b4388 in std::functionvoid (process::ProcessBase*)::operator()(process::ProcessBase*) const (this=0x7fffe0029f00, __args=0x7fffdc0098a8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2464 #8 0x767a31b4 in process::ProcessBase::visit (this=0x7fffdc0098a8, event=...) 
at ../../../3rdparty/libprocess/src/process.cpp:2764 #9 0x767ece5e in process::DispatchEvent::visit (this=0x7fffe0010a90, visitor=0x7fffdc0098a8) at ../../../3rdparty/libprocess/include/process/event.hpp:141 #10 0x008cb061 in process::ProcessBase::serve (this=0x7fffdc0098a8, event=...) at ../../3rdparty/libprocess/include/process/process.hpp:39 #11 0x7679355d in process::ProcessManager::resume (this=0x3334bb0, process=0x7fffdc0098a8) at ../../../3rdparty/libprocess/src/process.cpp:2238 #12 0x76792d8e in process::schedule (arg=0x0) at ../../../3rdparty/libprocess/src/process.cpp:655 #13 0x721b5182 in start_thread (arg=0x7fffe9dce700) at pthread_create.c:312 #14 0x71ee200d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {noformat} though this might be related if the DispatchEvent holds a function object that's been destroyed. any other non-POD static function objects around the place?
[jira] [Commented] (MESOS-2344) segfaults running make check from ev integration
[ https://issues.apache.org/jira/browse/MESOS-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316896#comment-14316896 ] Dominic Hamon commented on MESOS-2344: -- commit 95c448f77731034114183fc5f5bf6e040d4c0f5d (HEAD, origin/master, origin/HEAD, nonpod.clock, master) Author: Dominic Hamon dha...@twitter.com Commit: Dominic Hamon dha...@twitter.com Remove more non-pod statics from clock Review: https://reviews.apache.org/r/30886 segfaults running make check from ev integration Key: MESOS-2344 URL: https://issues.apache.org/jira/browse/MESOS-2344 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Dominic Hamon Assignee: Joris Van Remoortere Priority: Blocker Running make check on Ubuntu under gdb, I've seen a number of segfaults from the {{process::EventLoop}}. Stack traces and debugging sessions below: {noformat} (gdb) bt #0 0x00789c71 in std::movestd::_Tuple_impl2ul (__t=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/move.h:102 #1 0x76821148 in std::_Tuple_impl1, void (*)()::_Tuple_impl(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273) ( this=0x7fffe00228d8, __in=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:270 #2 0x768210a4 in std::_Tuple_impl0, Duration, void (*)()::_Tuple_impl(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7) (this=0x7fffe00228d8, __in=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:271 #3 0x76821068 in std::tupleDuration, void (*)()::tuple(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71c4) ( this=0x7fffe00228d8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:542 #4 0x76821014 in std::_Bindprocess::FutureNothing 
(*(Duration, void (*)()))(const Duration , void (*)())::_Bind(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d) (this=0x7fffe00228d0, __b=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1342 #5 0x76820f86 in std::_Function_base::_Base_managerstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()) ::_M_init_functor(std::_Any_data, std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()), std::integral_constantbool, false) (__functor=..., __f=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f714b) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1987 #6 0x76820ab0 in std::_Function_base::_Base_managerstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()) ::_M_init_functor(std::_Any_data, std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)())) (__functor=..., __f=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7115) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1958 #7 0x768208e6 in std::functionprocess::FutureNothing ()::functionstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(const Duration , void (*)()), void(std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(const Duration , void (*)())) (this=0x7fffe85ca9d0, __f=...) 
at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2451 #8 0x7681fe55 in process::EventLoop::delay (duration=..., function=0x76729580 process::tick()) at ../../../3rdparty/libprocess/src/libev.cpp:98 #9 0x7672a151 in process::tick () at ../../../3rdparty/libprocess/src/clock.cpp:125 #10 0x7681fcb2 in process::internal::handle_delay (loop=0x77dd91f0 default_loop_struct, timer=0x7fffe00279b0, revents=256) at ../../../3rdparty/libprocess/src/libev.cpp:64 #11 0x7685f8c5 in ev_invoke_pending (loop=0x77dd91f0 default_loop_struct) at ev.c:2994 #12 0x76860803 in ev_run (loop=0x77dd91f0 default_loop_struct, flags=optimized out) at ev.c:3394 #13 0x7681fffb in ev_loop (loop=0x77dd91f0 default_loop_struct, flags=0) at 3rdparty/libev-4.15/ev.h:826 #14 0x7681ff49 in process::EventLoop::run () at ../../../3rdparty/libprocess/src/libev.cpp:114 #15 0x721d2182 in start_thread
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2332: - Story Points: 5 Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Paul Brett Assignee: Paul Brett Labels: features, twitter Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g. {noformat} $ tc -s -d qdisc show dev mesos19223 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc ingress : parent :fff1 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0) backlog 0b 0p requeues 0 {noformat} Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp tx/rx) + commenting of the protobuf fields so it's clear what these represent and how they are different to the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2332: - Component/s: twitter Sprint: Twitter Mesos Q1 Sprint 2 Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation, twitter Reporter: Paul Brett Labels: features Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g. {noformat} $ tc -s -d qdisc show dev mesos19223 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc ingress : parent :fff1 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0) backlog 0b 0p requeues 0 {noformat} Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp tx/rx) + commenting of the protobuf fields so it's clear what these represent and how they are different to the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2332: - Assignee: Paul Brett Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation, twitter Reporter: Paul Brett Assignee: Paul Brett Labels: features Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g. {noformat} $ tc -s -d qdisc show dev mesos19223 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc ingress : parent :fff1 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0) backlog 0b 0p requeues 0 {noformat} Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp tx/rx) + commenting of the protobuf fields so it's clear what these represent and how they are different to the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1956) Add IPv6 ICMPv6 libnl traffic control U32 filters
[ https://issues.apache.org/jira/browse/MESOS-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14314529#comment-14314529 ] Dominic Hamon commented on MESOS-1956: -- When we wrote the port mapping isolator, it was to deal with the constraint of not having enough IP addresses. If we have IPv6 available, we should be able to ensure each container gets its own IP address, so the port mapping isolator shouldn't be needed. When we initialize the port mapping isolator, can we check whether we're in an IPv4 or IPv6 world? I'm OK with the port mapping isolator only working with IPv4. [~idownes] do you agree? Add IPv6 ICMPv6 libnl traffic control U32 filters --- Key: MESOS-1956 URL: https://issues.apache.org/jira/browse/MESOS-1956 Project: Mesos Issue Type: Task Components: isolation Reporter: Evelina Dumitrescu Assignee: Evelina Dumitrescu For IPv6, the filtering should be done by source and destination ports, destination IP, destination MAC. For ICMPv6, the filtering should be done by protocol and destination IP. The IPv6/IPv4 distinction could be made from the source/destination IP type in the classifier. IPv4 packets with options in the header are currently ignored due to a bug in libnl. It should be investigated whether the problem also occurs with IPv6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
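The "check if we're in IPv4 or IPv6 world" idea from the comment above amounts to classifying the slave's configured address by family at isolator initialization. A minimal sketch using POSIX `inet_pton` — the `classify` helper is a hypothetical name, not an existing Mesos function:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>

#include <cassert>
#include <string>

// Hypothetical init-time check: classify an address string so the
// port mapping isolator could refuse to start in an IPv6-only setup.
enum class IpFamily { V4, V6, INVALID };

IpFamily classify(const std::string& address) {
  unsigned char buffer[sizeof(struct in6_addr)];
  if (inet_pton(AF_INET, address.c_str(), buffer) == 1) {
    return IpFamily::V4;
  }
  if (inet_pton(AF_INET6, address.c_str(), buffer) == 1) {
    return IpFamily::V6;
  }
  return IpFamily::INVALID;
}
```

With such a check, the isolator could fail fast with a clear error on IPv6 rather than misbehaving at runtime.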
[jira] [Created] (MESOS-2344) segfaults running make check from ev integration
Dominic Hamon created MESOS-2344: Summary: segfaults running make check from ev integration Key: MESOS-2344 URL: https://issues.apache.org/jira/browse/MESOS-2344 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Dominic Hamon Running make check on Ubuntu under gdb, I've seen a number of segfaults from the {{process::EventLoop}}. Stack traces and debugging sessions below: {noformat} (gdb) bt #0 0x00789c71 in std::movestd::_Tuple_impl2ul (__t=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/move.h:102 #1 0x76821148 in std::_Tuple_impl1, void (*)()::_Tuple_impl(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273) ( this=0x7fffe00228d8, __in=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:270 #2 0x768210a4 in std::_Tuple_impl0, Duration, void (*)()::_Tuple_impl(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7) (this=0x7fffe00228d8, __in=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:271 #3 0x76821068 in std::tupleDuration, void (*)()::tuple(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71c4) ( this=0x7fffe00228d8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:542 #4 0x76821014 in std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(const Duration , void (*)())::_Bind(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d) (this=0x7fffe00228d0, __b=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1342 #5 0x76820f86 in std::_Function_base::_Base_managerstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()) 
::_M_init_functor(std::_Any_data, std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()), std::integral_constantbool, false) (__functor=..., __f=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f714b) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1987 #6 0x76820ab0 in std::_Function_base::_Base_managerstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()) ::_M_init_functor(std::_Any_data, std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)())) (__functor=..., __f=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7115) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1958 #7 0x768208e6 in std::functionprocess::FutureNothing ()::functionstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(const Duration , void (*)()), void(std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(const Duration , void (*)())) (this=0x7fffe85ca9d0, __f=...) 
at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2451 #8 0x7681fe55 in process::EventLoop::delay (duration=..., function=0x76729580 process::tick()) at ../../../3rdparty/libprocess/src/libev.cpp:98 #9 0x7672a151 in process::tick () at ../../../3rdparty/libprocess/src/clock.cpp:125 #10 0x7681fcb2 in process::internal::handle_delay (loop=0x77dd91f0 default_loop_struct, timer=0x7fffe00279b0, revents=256) at ../../../3rdparty/libprocess/src/libev.cpp:64 #11 0x7685f8c5 in ev_invoke_pending (loop=0x77dd91f0 default_loop_struct) at ev.c:2994 #12 0x76860803 in ev_run (loop=0x77dd91f0 default_loop_struct, flags=optimized out) at ev.c:3394 #13 0x7681fffb in ev_loop (loop=0x77dd91f0 default_loop_struct, flags=0) at 3rdparty/libev-4.15/ev.h:826 #14 0x7681ff49 in process::EventLoop::run () at ../../../3rdparty/libprocess/src/libev.cpp:114 #15 0x721d2182 in start_thread (arg=0x7fffe85cb700) at pthread_create.c:312 #16 0x71eff00d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 (gdb) frame 8 #8 0x7681fe55 in process::EventLoop::delay (duration=..., function=0x76729580 process::tick()) at ../../../3rdparty/libprocess/src/libev.cpp:98 98run_in_event_loopNothing( (gdb) list 93 } // namespace internal { 94 95 96 void EventLoop::delay(const Duration duration, void(*function)(void)) 97 { 98run_in_event_loopNothing( 99lambda::bind(internal::delay, duration, function)); 100 } 101 102 (gdb) p duration $1 = (const Duration ) @0x7fffe000da90:
[jira] [Commented] (MESOS-1403) Segfault when starting a slave locally.
[ https://issues.apache.org/jira/browse/MESOS-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315291#comment-14315291 ] Dominic Hamon commented on MESOS-1403: -- cc [~benjaminhindman] [~jvanremoortere] Segfault when starting a slave locally. --- Key: MESOS-1403 URL: https://issues.apache.org/jira/browse/MESOS-1403 Project: Mesos Issue Type: Bug Affects Versions: 0.19.0 Reporter: Benjamin Mahler This is from the build directory on a CentOS machine. {noformat} [bmahler@foobar build]$ sudo ./bin/mesos-slave.sh --master=localhost:5050 [sudo] password for bmahler: I0522 01:01:02.639114 4605 main.cpp:126] Build: 2014-05-06 22:08:34 by root I0522 01:01:02.639277 4605 main.cpp:128] Version: 0.19.0 I0522 01:01:02.639312 4605 mesos_containerizer.cpp:124] Using isolation: posix/cpu,posix/mem I0522 01:01:02.642699 4605 main.cpp:149] Starting Mesos slave I0522 01:01:02.644693 4631 slave.cpp:143] Slave started on 1)@IP:5051 I0522 01:01:02.645560 4631 slave.cpp:255] Slave resources: cpus(*):24; mem(*):71322; disk(*):454895; ports(*):[31000-32000] I0522 01:01:02.647763 4631 slave.cpp:283] Slave hostname: foobar I0522 01:01:02.647790 4631 slave.cpp:284] Slave checkpoint: true I0522 01:01:02.651803 4625 state.cpp:33] Recovering state from '/tmp/mesos/meta' I0522 01:01:02.653393 4625 status_update_manager.cpp:193] Recovering status update manager I0522 01:01:02.654024 4643 mesos_containerizer.cpp:281] Recovering containerizer I0522 01:01:02.655377 4639 slave.cpp:2988] Finished recovery I0522 01:01:02.656368 4639 slave.cpp:536] New master detected at master@127.0.0.1:5050 I0522 01:01:02.656682 4639 slave.cpp:572] No credentials provided. 
Attempting to register without authentication I0522 01:01:02.656744 4629 status_update_manager.cpp:167] New master detected at master@127.0.0.1:5050 I0522 01:01:02.656754 4639 slave.cpp:585] Detecting new master *** Aborted at 1400720462 (unix time) try date -d @1400720462 if you are using GNU date *** I0522 01:01:02.656982 4639 slave.cpp:2194] master@127.0.0.1:5050 exited W0522 01:01:02.657004 4639 slave.cpp:2197] Master disconnected! Waiting for a new master to be elected PC: @ 0x7f4a9e3faff6 std::_Deque_base::_M_destroy_nodes() *** SIGSEGV (@0x31) received by PID 4605 (TID 0x7f4a8c1d0940) from PID 49; stack trace: *** @ 0x7f4a9baefca0 (unknown) @ 0x7f4a9e3faff6 std::_Deque_base::_M_destroy_nodes() @ 0x7f4a9e3ecdaf std::_Deque_base::~_Deque_base() @ 0x7f4a9e3e2bd5 std::deque::~deque() @ 0x7f4a9e3dfe10 process::DataDecoder::~DataDecoder() @ 0x7f4a9e3ba9bc process::receiving_connect() @ 0x7f4a9e506bc5 ev_invoke_pending @ 0x7f4a9e509af5 ev_run @ 0x7f4a9e3b5928 ev_loop @ 0x7f4a9e3bb2d9 process::serve() @ 0x7f4a9bae783d start_thread @ 0x7f4a9a84f26d clone /var/tmp/scltOMGb3: line 8: 4605 Segmentation fault './bin/mesos-slave.sh' '--master=localhost:5050' {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2288) HTTP API for interacting with Mesos
[ https://issues.apache.org/jira/browse/MESOS-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2288: - Labels: twitter (was: ) HTTP API for interacting with Mesos --- Key: MESOS-2288 URL: https://issues.apache.org/jira/browse/MESOS-2288 Project: Mesos Issue Type: Epic Reporter: Vinod Kone Labels: twitter Currently Mesos frameworks (schedulers and executors) interact with Mesos (masters and slaves) via drivers provided by Mesos. While the driver helped in providing some common functionality for all frameworks (master detection, authentication, validation, etc.), it has several drawbacks. -- Frameworks need to depend on a native library, which makes their build/deploy process cumbersome. -- Pure language frameworks cannot use off-the-shelf libraries to interact with the undocumented API used by the driver. -- It makes it hard for developers to implement new APIs (a lot of boilerplate code to write). This proposal is for Mesos to provide a well-documented public HTTP API that frameworks (and maybe operators) can use to interact with Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1708) Using the wrong resource name should report a better error.
[ https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-1708: Assignee: Dominic Hamon Using the wrong resource name should report a better error. - Key: MESOS-1708 URL: https://issues.apache.org/jira/browse/MESOS-1708 Project: Mesos Issue Type: Bug Components: framework, master Reporter: Benjamin Hindman Assignee: Dominic Hamon Labels: newbie, twitter If a scheduler launches a task using resources the master doesn't know about, the task validator causes the task to fail, but the error message is not very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
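One plausible shape for the "better error" the ticket asks for is to name the offending resource and list the resources the master does know about. A hedged sketch — `validateResourceName` is a hypothetical helper, not the actual Mesos task validator:

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Hypothetical validation sketch: when a task requests a resource name
// the master doesn't know (e.g. "cpu" instead of "cpus"), produce an
// error naming the offender and listing the valid names, instead of a
// bare failure. Returns an empty string when the name is valid.
std::string validateResourceName(
    const std::string& requested,
    const std::set<std::string>& known) {
  if (known.count(requested) > 0) {
    return "";  // Valid: no error.
  }
  std::ostringstream error;
  error << "Unknown resource '" << requested << "'; known resources are:";
  for (const std::string& name : known) {
    error << " " << name;
  }
  return error.str();
}
```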
[jira] [Updated] (MESOS-2289) Design doc for the HTTP API
[ https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2289: - Assignee: Vinod Kone Design doc for the HTTP API --- Key: MESOS-2289 URL: https://issues.apache.org/jira/browse/MESOS-2289 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Vinod Kone This tracks the design of the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2289) Design doc for the HTTP API
[ https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2289: - Sprint: Twitter Mesos Q1 Sprint 2 Design doc for the HTTP API --- Key: MESOS-2289 URL: https://issues.apache.org/jira/browse/MESOS-2289 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Vinod Kone This tracks the design of the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2255) SlaveRecoveryTest/0.MasterFailover is flaky
[ https://issues.apache.org/jira/browse/MESOS-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2255: - Labels: flaky twitter (was: flaky) SlaveRecoveryTest/0.MasterFailover is flaky --- Key: MESOS-2255 URL: https://issues.apache.org/jira/browse/MESOS-2255 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Yan Xu Labels: flaky, twitter {noformat:title=} [ RUN ] SlaveRecoveryTest/0.MasterFailover Using temporary directory '/tmp/SlaveRecoveryTest_0_MasterFailover_dtF7o0' I0123 07:45:49.818686 17634 leveldb.cpp:176] Opened db in 31.195549ms I0123 07:45:49.821962 17634 leveldb.cpp:183] Compacted db in 3.190936ms I0123 07:45:49.822049 17634 leveldb.cpp:198] Created db iterator in 47324ns I0123 07:45:49.822069 17634 leveldb.cpp:204] Seeked to beginning of db in 2038ns I0123 07:45:49.822084 17634 leveldb.cpp:273] Iterated through 0 keys in the db in 484ns I0123 07:45:49.822160 17634 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0123 07:45:49.824241 17660 recover.cpp:449] Starting replica recovery I0123 07:45:49.825217 17660 recover.cpp:475] Replica is in EMPTY status I0123 07:45:49.827020 17660 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0123 07:45:49.827453 17659 recover.cpp:195] Received a recover response from a replica in EMPTY status I0123 07:45:49.828047 17659 recover.cpp:566] Updating replica status to STARTING I0123 07:45:49.838543 17659 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 10.24963ms I0123 07:45:49.838580 17659 replica.cpp:323] Persisted replica status to STARTING I0123 07:45:49.848836 17659 recover.cpp:475] Replica is in STARTING status I0123 07:45:49.850039 17659 replica.cpp:641] Replica in STARTING status received a broadcasted recover request I0123 07:45:49.850286 17659 recover.cpp:195] Received a recover response from a replica in STARTING status I0123 07:45:49.850754 
17659 recover.cpp:566] Updating replica status to VOTING I0123 07:45:49.853698 17655 master.cpp:262] Master 20150123-074549-16842879-44955-17634 (utopic) started on 127.0.1.1:44955 I0123 07:45:49.853981 17655 master.cpp:308] Master only allowing authenticated frameworks to register I0123 07:45:49.853997 17655 master.cpp:313] Master only allowing authenticated slaves to register I0123 07:45:49.854038 17655 credentials.hpp:36] Loading credentials for authentication from '/tmp/SlaveRecoveryTest_0_MasterFailover_dtF7o0/credentials' I0123 07:45:49.854557 17655 master.cpp:357] Authorization enabled I0123 07:45:49.859633 17659 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 8.742923ms I0123 07:45:49.859853 17659 replica.cpp:323] Persisted replica status to VOTING I0123 07:45:49.860327 17658 recover.cpp:580] Successfully joined the Paxos group I0123 07:45:49.860703 17654 recover.cpp:464] Recover process terminated I0123 07:45:49.859591 17655 master.cpp:1219] The newly elected leader is master@127.0.1.1:44955 with id 20150123-074549-16842879-44955-17634 I0123 07:45:49.864702 17655 master.cpp:1232] Elected as the leading master! 
I0123 07:45:49.864904 17655 master.cpp:1050] Recovering from registrar I0123 07:45:49.865406 17660 registrar.cpp:313] Recovering registrar I0123 07:45:49.866576 17660 log.cpp:660] Attempting to start the writer I0123 07:45:49.868638 17658 replica.cpp:477] Replica received implicit promise request with proposal 1 I0123 07:45:49.872521 17658 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 3.848859ms I0123 07:45:49.872555 17658 replica.cpp:345] Persisted promised to 1 I0123 07:45:49.873769 17661 coordinator.cpp:230] Coordinator attemping to fill missing position I0123 07:45:49.875474 17658 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2 I0123 07:45:49.880878 17658 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 5.364021ms I0123 07:45:49.880913 17658 replica.cpp:679] Persisted action at 0 I0123 07:45:49.882619 17657 replica.cpp:511] Replica received write request for position 0 I0123 07:45:49.882998 17657 leveldb.cpp:438] Reading position from leveldb took 150092ns I0123 07:45:49.886488 17657 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 3.269189ms I0123 07:45:49.886536 17657 replica.cpp:679] Persisted action at 0 I0123 07:45:49.887181 17657 replica.cpp:658] Replica received learned notice for position 0 I0123 07:45:49.892900 17657 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 5.690093ms I0123 07:45:49.892935 17657 replica.cpp:679] Persisted action at 0 I0123 07:45:49.892956 17657 replica.cpp:664] Replica learned NOP action at
[jira] [Updated] (MESOS-2300) Failing tests on 0.21.1 with Ubuntu 14.10 / Linux 3.16.0-23
[ https://issues.apache.org/jira/browse/MESOS-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2300: - Labels: cgroups test twitter (was: cgroups test) Failing tests on 0.21.1 with Ubuntu 14.10 / Linux 3.16.0-23 --- Key: MESOS-2300 URL: https://issues.apache.org/jira/browse/MESOS-2300 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.1 Environment: (Though the hostname of this box is {{docker1}}, this is not running on a docker container. This box sits on vanilla hardware, and happens to also be used as a docker server. Though not when I ran the offending tests.) {code} huitseeker@docker1:~$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 14.10 Release: 14.10 Codename: utopic {code} {code} huitseeker@docker1:~$ uname -a Linux docker1 3.16.0-23-generic #31-Ubuntu SMP Tue Oct 21 17:56:17 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux }} {code} Mesos retrieved from {{http://git-wip-us.apache.org/repos/asf/mesos.git}} And compiled from git tag {{0.21.1}} (currently resolves to {{2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8}}). Box is a clean, ansible-generated Ubuntu with cgmanager disabled, and the following packages installed on top of the usual mesos dependencies: - cgroup-lite (service is enabled and started) - linux-tools-common - linux-tools-generic - linux-cloud-tools-generic - linux-tools-3.16.0-23-generic - linux-cloud-tools-3.16.0-23-generic Reporter: François Garillot Labels: cgroups, test, twitter During make check : {code} [--] Global test environment tear-down [==] 503 tests from 89 test cases ran. (387352 ms total) [ PASSED ] 499 tests. 
[ FAILED ] 4 tests, listed below: [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups [ FAILED ] NsTest.ROOT_setns [ FAILED ] PerfTest.ROOT_SampleInit {code} Details: {code} [ RUN ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get ../../src/tests/cgroups_tests.cpp:364: Failure Value of: mesos_test2 Expected: cgroups.get()[0] Which is: mesos [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get (10 ms) [ RUN ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups ../../src/tests/cgroups_tests.cpp:392: Failure Value of: path::join(TEST_CGROUPS_ROOT, 2) Actual: mesos_test/2 Expected: cgroups.get()[0] Which is: mesos_test/1 ../../src/tests/cgroups_tests.cpp:393: Failure Value of: path::join(TEST_CGROUPS_ROOT, 1) Actual: mesos_test/1 Expected: cgroups.get()[1] Which is: mesos_test/2 [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups (12 ms) {code} {code} [ RUN ] NsTest.ROOT_setns ../../src/tests/ns_tests.cpp:123: Failure Value of: status.get().get() Actual: 256 Expected: 0 [ FAILED ] NsTest.ROOT_setns (93 ms) {code} {code} [ RUN ] PerfTest.ROOT_SampleInit ../../src/tests/perf_tests.cpp:143: Failure Expected: (0u) (statistics.get().cycles()), actual: 0 vs 0 ../../src/tests/perf_tests.cpp:146: Failure Expected: (0.0) (statistics.get().task_clock()), actual: 0 vs 0 [ FAILED ] PerfTest.ROOT_SampleInit (1078 ms) {code} Those tests have been run in parallel (-j 8) as well as sequentially (-j 1), no difference. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-621) HierarchicalAllocator::slaveRemoved doesn't properly handle framework allocations/resources
[ https://issues.apache.org/jira/browse/MESOS-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon resolved MESOS-621. - Resolution: Won't Fix HierarchicalAllocator::slaveRemoved doesn't properly handle framework allocations/resources --- Key: MESOS-621 URL: https://issues.apache.org/jira/browse/MESOS-621 Project: Mesos Issue Type: Bug Components: allocation, technical debt Reporter: Vinod Kone Labels: twitter Currently a slaveRemoved() simply removes the slave from 'slaves' map and slave's resources from 'roleSorter'. Looking at resourcesRecovered(), more things need to be done when a slave is removed (e.g., framework unallocations). It would be nice to fix this and have a test for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2238) Use Owned for Process pointers in wrapper classes
[ https://issues.apache.org/jira/browse/MESOS-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2238: - Labels: easyfix newbie (was: easyfix) Use Owned for Process pointers in wrapper classes --- Key: MESOS-2238 URL: https://issues.apache.org/jira/browse/MESOS-2238 Project: Mesos Issue Type: Improvement Reporter: Alexander Rukletsov Labels: easyfix, newbie A common pattern in our code (see e.g. {{Isolator}}, {{DockerContainerizer}}, {{Allocator}}) is to wrap a Process-based class in a non-Process one. However, our code base is inconsistent in how we store the pointer to the underlying class: in some places we wrap it in {{Owned}} (see e.g. {{Isolator}}, {{DockerContainerizer}}), in others it is a raw pointer (see e.g. {{Allocator}}, {{ExternalContainerizer}}). Using {{Owned}} for this case is preferable, since it signals the correct ownership semantics and intention to the reader. For consistency, sweep through the code base and replace the raw pointers with their {{Owned}} counterparts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
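The wrapper pattern the ticket describes can be sketched as follows, with `std::unique_ptr` standing in for libprocess's {{Owned}} and a toy `AllocatorProcess` standing in for a real Process-based class (both stand-ins are assumptions for illustration):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in for a Process-based implementation class.
class AllocatorProcess {
 public:
  std::string allocate() { return "allocated"; }
};

// The non-Process wrapper owns its implementation via a smart pointer
// (std::unique_ptr here, Owned in Mesos), so destruction is automatic
// and ownership is visible in the type -- the consistency MESOS-2238
// asks for, versus a raw pointer plus a manual `delete` in ~Allocator().
class Allocator {
 public:
  Allocator() : process(new AllocatorProcess()) {}

  std::string allocate() { return process->allocate(); }

 private:
  std::unique_ptr<AllocatorProcess> process;
};
```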
[jira] [Commented] (MESOS-181) Virtual Machine Isolation Module
[ https://issues.apache.org/jira/browse/MESOS-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305765#comment-14305765 ] Dominic Hamon commented on MESOS-181: - It doesn't seem likely that we're going to integrate this any time soon. Shall we close out the issue? Virtual Machine Isolation Module Key: MESOS-181 URL: https://issues.apache.org/jira/browse/MESOS-181 Project: Mesos Issue Type: Story Components: isolation, slave Environment: Ubuntu 11.04, Ubuntu 11.10 Reporter: Charles Earl Priority: Minor Labels: virtualiztion Earlier in the year I implemented a virtual machine isolation module. This module uses libvirt to launch and manage virtual machine containers. The code is still rough and I have done only basic testing with the Spark example. This code works with the KVM (http://www.linux-kvm.org/page/Main_Page) virtual machine manager. I've placed the relevant code in a branch called mesos-vm, for now located at https://github.com/charlescearl/VirtualMesos. The code is based upon the mesos lxc isolation module that is located in src/slave/lxc_isolation_module.cpp/.hpp. My code is based on the mesos master branch dated Wed Nov 23 12:02:07 2011 -0800, commit 059aabb2ec5bd7b20ed08ab9c439531a352ba3ec. I'll generate a patch soon for this. Suggestions appreciated on whether this is the appropriate branch/commit to patch against. Most of the implementation is contained in vm_isolation_module.cpp and vm_isolation_module.hpp and there are some minor additions in launcher to handle setup of the environment for the virtual machine. I use the libvirt (http://libvirt.org/) library to manage the virtual machine container in which the jobs are executed. Dependencies The code has been tested on Ubuntu 11.04 and 11.10 and depends on libpython2.6 and libvirt0 Configuration of the virtual machine container The virtual machine invocation depends upon a few configuration assumptions: 1. ssh public keys installed on the container.
I assume that the container is set up to allow password-less secure access. 2. Directory structure on the container matches the servant machine. For example, in invoking a spark executor, assume that the paths match the setup on the container host. Running it In the $MESOS_HOME/conf/mesos.conf file add the line isolation=vm to use the virtual machine isolation. The Mesos slave is invoked with the isolation parameter set to vm. For example: sudo bin/mesos-slave -m mesos://master@mesos-host:5050 -w 9839 --isolation=vm Rough description of how it works The `vm_isolation_module` class forks a process that in turn launches a virtual machine. A routine located in bin called find_addr.pl is responsible for figuring out the IP address of the launched virtual machine. This is probably not portable since it is explicitly looking for an entry in the virbr0 network. A script vmLauncherTemplate.sh located in bin assists the vmLauncher method in setting up the environment for launching tasks inside the virtual machine. The vmLauncher method uses vmLauncherTemplate.sh to create a task-specific shell script, vmLauncherTemplate-task_id.sh, which is copied to the running guest and used to run the executor inside the VM. This communicates with the slave on the host. Comments and suggestions on improvements and next directions are appreciated!
[jira] [Updated] (MESOS-2277) Document undocumented HTTP endpoints
[ https://issues.apache.org/jira/browse/MESOS-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2277: - Labels: documentation newbie starter (was: starter) Document undocumented HTTP endpoints Key: MESOS-2277 URL: https://issues.apache.org/jira/browse/MESOS-2277 Project: Mesos Issue Type: Improvement Reporter: Niklas Quarfot Nielsen Priority: Minor Labels: documentation, newbie, starter Did a quick scan and we are missing documentation for a few endpoints: {code} files/browse.json files/read.json files/download.json files/debug.json master/roles.json master/state.json master/stats.json slave/state.json slave/stats.json {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-181) Virtual Machine Isolation Module
[ https://issues.apache.org/jira/browse/MESOS-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon resolved MESOS-181. - Resolution: Won't Fix Sadly, our isolation efforts have diverged from the initial effort here. If we do ever provide VM isolation, we'll need to carefully determine requirements first and then develop a solution. Virtual Machine Isolation Module Key: MESOS-181 URL: https://issues.apache.org/jira/browse/MESOS-181 Project: Mesos Issue Type: Story Components: isolation, slave Environment: Ubuntu 11.04, Ubuntu 11.10 Reporter: Charles Earl Priority: Minor Labels: virtualiztion Earlier in the year I implemented a virtual machine isolation module. This module uses libvirt to launch and manage virtual machine containers. The code is still rough and I have done only basic testing with the Spark example. This code works with the KVM (http://www.linux-kvm.org/page/Main_Page) virtual machine manager. I've placed the relevant code in a branch called mesos-vm, for now located at https://github.com/charlescearl/VirtualMesos. The code is based upon the mesos lxc isolation module that is located in src/slave/lxc_isolation_module.cpp/.hpp. My code is based on the mesos master branch dated Wed Nov 23 12:02:07 2011 -0800, commit 059aabb2ec5bd7b20ed08ab9c439531a352ba3ec. I'll generate a patch soon for this. Suggestions appreciated on whether this is the appropriate branch/commit to patch against. Most of the implementation is contained in vm_isolation_module.cpp and vm_isolation_module.hpp and there are some minor additions in launcher to handle setup of the environment for the virtual machine. I use the libvirt (http://libvirt.org/) library to manage the virtual machine container in which the jobs are executed. Dependencies The code has been tested on Ubuntu 11.04 and 11.10 and depends on libpython2.6 and libvirt0 Configuration of the virtual machine container The virtual machine invocation depends upon a few configuration assumptions: 1.
ssh public keys installed on the container. I assume that the container is set up to allow password-less secure access. 2. Directory structure on the container matches the servant machine. For example, in invoking a spark executor, assume that the paths match the setup on the container host. Running it In the $MESOS_HOME/conf/mesos.conf file add the line isolation=vm to use the virtual machine isolation. The Mesos slave is invoked with the isolation parameter set to vm. For example: sudo bin/mesos-slave -m mesos://master@mesos-host:5050 -w 9839 --isolation=vm Rough description of how it works The `vm_isolation_module` class forks a process that in turn launches a virtual machine. A routine located in bin called find_addr.pl is responsible for figuring out the IP address of the launched virtual machine. This is probably not portable since it is explicitly looking for an entry in the virbr0 network. A script vmLauncherTemplate.sh located in bin assists the vmLauncher method in setting up the environment for launching tasks inside the virtual machine. The vmLauncher method uses vmLauncherTemplate.sh to create a task-specific shell script, vmLauncherTemplate-task_id.sh, which is copied to the running guest and used to run the executor inside the VM. This communicates with the slave on the host. Comments and suggestions on improvements and next directions are appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (MESOS-2138) Add an Offer::Operation message for Dynamic Reservations
[ https://issues.apache.org/jira/browse/MESOS-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reopened MESOS-2138: -- Assignee: Michael Park (was: Benjamin Mahler) Add an Offer::Operation message for Dynamic Reservations - Key: MESOS-2138 URL: https://issues.apache.org/jira/browse/MESOS-2138 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Labels: protobuf Fix For: 0.22.0 A framework now has a notion of *accepting* offers that it was given (via {{acceptOffers}}) and is able to specify a sequence of operations to perform (via a sequence of {{Offer::Operation}}). {{Launch}} is one of the possible {{Offer::Operation}}s, which means {{LaunchTasks}} is equivalent to a sequence of {{Offer::Operation}}s consisting only of {{Launch}} operations. The goal of this ticket is to add {{Reserve}} and {{Unreserve}} messages as possible {{Offer::Operation}}s. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
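The relationship between {{LaunchTasks}} and a sequence of {{Offer::Operation}}s can be sketched as below. This is not the actual Mesos protobuf definition, just an illustrative C++ stand-in for the shape of the API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the Offer::Operation protobuf: each operation has a
// type and (here, simplified to a string) a per-type message body.
enum class OperationType { LAUNCH, RESERVE, UNRESERVE };

struct Operation {
  OperationType type;
  std::string payload;
};

// The old "launch tasks" call is just the all-Launch special case of
// accepting an offer with a sequence of operations.
bool isPlainLaunch(const std::vector<Operation>& operations) {
  for (const Operation& op : operations) {
    if (op.type != OperationType::LAUNCH) {
      return false;
    }
  }
  return true;
}
```

Adding {{Reserve}} and {{Unreserve}} then amounts to new enum cases (and their message bodies) handled by the master when an offer is accepted.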
[jira] [Closed] (MESOS-2138) Add an Offer::Operation message for Dynamic Reservations
[ https://issues.apache.org/jira/browse/MESOS-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon closed MESOS-2138. Resolution: Fixed Add an Offer::Operation message for Dynamic Reservations - Key: MESOS-2138 URL: https://issues.apache.org/jira/browse/MESOS-2138 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Labels: protobuf Fix For: 0.22.0 A framework now has a notion of *accepting* offers that it was given (via {{acceptOffers}}) and is able to specify a sequence of operations to perform (via a sequence of {{Offer::Operation}}). {{Launch}} is one of the possible {{Offer::Operation}}s, which means {{LaunchTasks}} is an alias for a sequence of {{Offer::Operation}} consisting of only {{Launch}}. The goal of this ticket is to add {{Reserve}} and {{Unreserve}} messages as possible {{Offer::Operation}}s. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2308) Task reconciliation API should support data partitioning
[ https://issues.apache.org/jira/browse/MESOS-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2308: - Shepherd: Vinod Kone Story Points: 8 Task reconciliation API should support data partitioning Key: MESOS-2308 URL: https://issues.apache.org/jira/browse/MESOS-2308 Project: Mesos Issue Type: Story Reporter: Bill Farner Assignee: Benjamin Mahler Labels: twitter The {{reconcileTasks}} API call requires the caller to specify a collection of {{TaskStatus}}es, with the option to provide an empty collection to retrieve the master's entire state. Retrieving the entire state is the only mechanism for the scheduler to learn that there are tasks running that it does not know about; however, this call does not allow incremental querying. As a result, the master may need to send many thousands of status updates, and the scheduler would have to handle them all at once. It would be ideal if the scheduler had a means to partition these requests so it can control the pace of these status updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
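As an illustration of the partitioning the ticket asks for, here is a hedged Python sketch of a scheduler batching explicit reconciliation requests. `reconcile()` stands in for the driver's {{reconcileTasks}} call, and the chunk size is arbitrary; this is a client-side workaround sketch, not the master-side support the ticket proposes.

```python
def partitioned_reconcile(task_statuses, reconcile, chunk_size=1000):
    """Send explicit reconciliation requests in fixed-size batches rather
    than one implicit (empty) request returning the master's entire state."""
    for i in range(0, len(task_statuses), chunk_size):
        reconcile(task_statuses[i:i + chunk_size])

# Demonstrate the batching with placeholder statuses: 2500 tasks
# become three requests of 1000, 1000, and 500.
sent = []
partitioned_reconcile(list(range(2500)), sent.append, chunk_size=1000)
```

Note this only paces requests the scheduler already knows about; discovering unknown tasks still requires the full-state query, which is exactly the gap the ticket describes.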
[jira] [Assigned] (MESOS-2314) remove unnecessary constants
[ https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-2314: Assignee: Dominic Hamon remove unnecessary constants Key: MESOS-2314 URL: https://issues.apache.org/jira/browse/MESOS-2314 Project: Mesos Issue Type: Improvement Components: slave, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie In {{src/slave/paths.cpp}} a number of string constants are defined to describe the formats of various paths. However, given there is a 1:1 mapping between the string constant and the functions that build the paths, the code would be more readable if the format strings were inline in the functions. In the cases where one constant depends on another (see the {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} chain, for example) the function calls can just be chained together. This will have the added benefit of removing some statically constructed string constants, which are dangerous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2314) remove unnecessary constants
[ https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302334#comment-14302334 ] Dominic Hamon commented on MESOS-2314: -- https://reviews.apache.org/r/30531 remove unnecessary constants Key: MESOS-2314 URL: https://issues.apache.org/jira/browse/MESOS-2314 Project: Mesos Issue Type: Improvement Components: slave, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie In {{src/slave/paths.cpp}} a number of string constants are defined to describe the formats of various paths. However, given there is a 1:1 mapping between the string constant and the functions that build the paths, the code would be more readable if the format strings were inline in the functions. In the cases where one constant depends on another (see the {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} chain, for example) the function calls can just be chained together. This will have the added benefit of removing some statically constructed string constants, which are dangerous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2314) remove unnecessary constants
[ https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2314: - Story Points: 2 remove unnecessary constants Key: MESOS-2314 URL: https://issues.apache.org/jira/browse/MESOS-2314 Project: Mesos Issue Type: Improvement Components: slave, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie In {{src/slave/paths.cpp}} a number of string constants are defined to describe the formats of various paths. However, given there is a 1:1 mapping between the string constant and the functions that build the paths, the code would be more readable if the format strings were inline in the functions. In the cases where one constant depends on another (see the {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} chain, for example) the function calls can just be chained together. This will have the added benefit of removing some statically constructed string constants, which are dangerous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2314) remove unnecessary constants
[ https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2314: - Sprint: Twitter Mesos Q1 Sprint 2 remove unnecessary constants Key: MESOS-2314 URL: https://issues.apache.org/jira/browse/MESOS-2314 Project: Mesos Issue Type: Improvement Components: slave, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie In {{src/slave/paths.cpp}} a number of string constants are defined to describe the formats of various paths. However, given there is a 1:1 mapping between the string constant and the functions that build the paths, the code would be more readable if the format strings were inline in the functions. In the cases where one constant depends on another (see the {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} chain, for example) the function calls can just be chained together. This will have the added benefit of removing some statically constructed string constants, which are dangerous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2314) remove unnecessary constants
Dominic Hamon created MESOS-2314: Summary: remove unnecessary constants Key: MESOS-2314 URL: https://issues.apache.org/jira/browse/MESOS-2314 Project: Mesos Issue Type: Improvement Components: slave, technical debt Reporter: Dominic Hamon Priority: Minor In {{src/slave/paths.cpp}} a number of string constants are defined to describe the formats of various paths. However, given there is a 1:1 mapping between the string constant and the functions that build the paths, the code would be more readable if the format strings were inline in the functions. In the cases where one constant depends on another (see the {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} chain, for example) the function calls can just be chained together. This will have the added benefit of removing some statically constructed string constants, which are dangerous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
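The refactor this ticket describes can be sketched as follows. This is a Python illustration only (the real code is C++ in {{src/slave/paths.cpp}}), and the path layout shown is simplified, not Mesos's actual on-disk layout: each function inlines its own path segment and chains the function for its parent path instead of sharing dependent format-string constants.

```python
import os

def root_path(root):
    return root

def slave_path(root, slave_id):
    return os.path.join(root_path(root), "slaves", slave_id)

def framework_path(root, slave_id, framework_id):
    return os.path.join(slave_path(root, slave_id), "frameworks", framework_id)

def executor_path(root, slave_id, framework_id, executor_id):
    # Chaining the parent-path function replaces the dependent string
    # constant, and no statically constructed strings remain.
    return os.path.join(framework_path(root, slave_id, framework_id),
                        "executors", executor_id)
```

Each path is now readable at its point of definition, and the dependency chain (root → slave → framework → executor) is expressed by calls rather than by constants referencing constants.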
[jira] [Updated] (MESOS-2305) Refactor validators in Master.
[ https://issues.apache.org/jira/browse/MESOS-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2305: - Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q1 Sprint 1) Refactor validators in Master. -- Key: MESOS-2305 URL: https://issues.apache.org/jira/browse/MESOS-2305 Project: Mesos Issue Type: Bug Reporter: Jie Yu Assignee: Jie Yu There are several motivations for this. We are in the process of adding dynamic reservations and persistent volumes support in the master. To do that, the master needs to validate relevant operations from the framework (see Offer::Operation in mesos.proto). The existing validator style in the master is hard to extend, compose and re-use. Another motivation is unit testing (MESOS-1064): right now we write integration tests for those validators, which is unfortunate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1148) Add support for rate limiting slave removal
[ https://issues.apache.org/jira/browse/MESOS-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1148: - Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q1 Sprint 1) Add support for rate limiting slave removal --- Key: MESOS-1148 URL: https://issues.apache.org/jira/browse/MESOS-1148 Project: Mesos Issue Type: Improvement Components: master Reporter: Bill Farner Assignee: Vinod Kone Labels: twitter To safeguard against unforeseen bugs leading to widespread slave removal, it would be nice to allow for rate limiting of the decision to remove slaves and/or send TASK_LOST messages for tasks on those slaves. Ideally this would allow an operator to be notified soon enough to intervene before causing cluster impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2136) Expose per-cgroup memory pressure
[ https://issues.apache.org/jira/browse/MESOS-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2136: - Sprint: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1) Expose per-cgroup memory pressure - Key: MESOS-2136 URL: https://issues.apache.org/jira/browse/MESOS-2136 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Ian Downes Assignee: Chi Zhang Labels: twitter The cgroup memory controller can provide information on the memory pressure of a cgroup. This comes in the form of event-based notifications, where events of (low, medium, critical) severity are generated when the kernel takes specific actions to allocate memory. This signal is probably more informative than comparing memory usage to the memory limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2058: - Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q1 Sprint 1) Deprecate stats.json endpoints for Master and Slave --- Key: MESOS-2058 URL: https://issues.apache.org/jira/browse/MESOS-2058 Project: Mesos Issue Type: Task Components: master, slave Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: twitter With the introduction of the libprocess {{/metrics/snapshot}} endpoint, metrics are now duplicated in the Master and Slave between this and {{stats.json}}. We should deprecate the {{stats.json}} endpoints. Manual inspection of {{stats.json}} shows that all metrics are now covered by the new endpoint for Master and Slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2100) Implement master to slave protocol for persistent disk resources.
[ https://issues.apache.org/jira/browse/MESOS-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2100: - Sprint: Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1) Implement master to slave protocol for persistent disk resources. - Key: MESOS-2100 URL: https://issues.apache.org/jira/browse/MESOS-2100 Project: Mesos Issue Type: Task Components: master, slave Reporter: Jie Yu Assignee: Jie Yu Labels: twitter We need to do the following: 1) The slave needs to send persisted resources when registering (or re-registering). 2) The master needs to send total persisted resources to the slave, either by re-using RunTask/UpdateFrameworkInfo or by introducing a new type of message (like UpdateResources). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2031) Manage persistent directories on slave.
[ https://issues.apache.org/jira/browse/MESOS-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2031: - Sprint: Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1) Manage persistent directories on slave. --- Key: MESOS-2031 URL: https://issues.apache.org/jira/browse/MESOS-2031 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Jie Yu Whenever a slave sees a persistent disk resource (in ExecutorInfo or TaskInfo) that is new to it, it will create a persistent directory which is for tasks to store persistent data. The slave needs to do the following after it's created: 1) symlink into the executor sandbox so that tasks/executor can see it 2) garbage collect it once it is released by the framework -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1830) Expose master stats differentiating between master-generated and slave-generated LOST tasks
[ https://issues.apache.org/jira/browse/MESOS-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1830: - Sprint: Twitter Q4 Sprint 1, Twitter Mesos Q4 Sprint 2, Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Q4 Sprint 1, Twitter Mesos Q4 Sprint 2, Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q1 Sprint 1) Expose master stats differentiating between master-generated and slave-generated LOST tasks --- Key: MESOS-1830 URL: https://issues.apache.org/jira/browse/MESOS-1830 Project: Mesos Issue Type: Story Components: master Reporter: Bill Farner Assignee: Dominic Hamon Priority: Minor The master exports a monotonically-increasing counter of tasks transitioned to TASK_LOST. This loses fidelity of the source of the lost task. A first step in exposing the source of lost tasks might be to just differentiate between TASK_LOST transitions initiated by the master vs the slave (and maybe bad input from the scheduler). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2241) DiskUsageCollectorTest.SymbolicLink test is flaky
[ https://issues.apache.org/jira/browse/MESOS-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2241: - Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q1 Sprint 1) DiskUsageCollectorTest.SymbolicLink test is flaky - Key: MESOS-2241 URL: https://issues.apache.org/jira/browse/MESOS-2241 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Vinod Kone Assignee: Jie Yu Observed this on a local machine running linux w/ sudo. {code} [ RUN ] DiskUsageCollectorTest.SymbolicLink ../../src/tests/disk_quota_tests.cpp:138: Failure Expected: (usage1.get()) (Kilobytes(16)), actual: 24KB vs 8-byte object 00-40 00-00 00-00 00-00 [ FAILED ] DiskUsageCollectorTest.SymbolicLink (201 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2058: - Sprint: Twitter Mesos Q1 Sprint 1 (was: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2) Deprecate stats.json endpoints for Master and Slave --- Key: MESOS-2058 URL: https://issues.apache.org/jira/browse/MESOS-2058 Project: Mesos Issue Type: Task Components: master, slave Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: twitter With the introduction of the libprocess {{/metrics/snapshot}} endpoint, metrics are now duplicated in the Master and Slave between this and {{stats.json}}. We should deprecate the {{stats.json}} endpoints. Manual inspection of {{stats.json}} shows that all metrics are now covered by the new endpoint for Master and Slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2123) Document changes in C++ Resources API in CHANGELOG.
[ https://issues.apache.org/jira/browse/MESOS-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2123: - Sprint: Twitter Mesos Q1 Sprint 2 Document changes in C++ Resources API in CHANGELOG. --- Key: MESOS-2123 URL: https://issues.apache.org/jira/browse/MESOS-2123 Project: Mesos Issue Type: Task Reporter: Jie Yu Labels: twitter With the refactor introduced in MESOS-1974, we need to document those API changes in CHANGELOG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2103) Expose number and state of tasks in a container
[ https://issues.apache.org/jira/browse/MESOS-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2103: - Sprint: Twitter Mesos Q1 Sprint 2 Expose number and state of tasks in a container --- Key: MESOS-2103 URL: https://issues.apache.org/jira/browse/MESOS-2103 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.20.0 Reporter: Ian Downes Labels: twitter The CFS cpu statistics (cpus_nr_throttled, cpus_nr_periods, cpus_throttled_time) are difficult to interpret. 1) nr_throttled is the number of intervals where *any* throttling occurred 2) throttled_time is the aggregate time *across all runnable tasks* (tasks in the Linux sense). For example, in a typical 60 second sampling interval: nr_periods = 600 and nr_throttled could be 60, i.e., 10% of intervals, but throttled_time could be much higher than (60/600) * 60 = 6 seconds if there is more than one task that is runnable but throttled. *Each* throttled task contributes to the total throttled time. Small test to demonstrate throttled_time exceeding nr_periods * quota_interval: 5 x {{'openssl speed'}} running with quota=100ms: {noformat} cat cpu.stat sleep 1 cat cpu.stat nr_periods 3228 nr_throttled 1276 throttled_time 528843772540 nr_periods 3238 nr_throttled 1286 throttled_time 531668964667 {noformat} All 10 intervals were throttled (100%), for a total throttled time of 2.8 seconds within the 1 second interval (more than 100% of the time interval). It would be helpful to expose the number and state of tasks in the container cgroup. This would be at a very coarse granularity but would give some guidance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
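The arithmetic in the cpu.stat sample above can be checked directly; a small Python computation over the two readings (throttled_time is in nanoseconds):

```python
# Two cpu.stat readings taken one second apart, copied from the sample above.
before = {"nr_periods": 3228, "nr_throttled": 1276, "throttled_time": 528843772540}
after  = {"nr_periods": 3238, "nr_throttled": 1286, "throttled_time": 531668964667}

periods   = after["nr_periods"] - before["nr_periods"]      # elapsed CFS periods
throttled = after["nr_throttled"] - before["nr_throttled"]  # periods with throttling
# Each runnable-but-throttled task adds to throttled_time, so the total can
# exceed the wall-clock interval, as the ticket points out.
seconds = (after["throttled_time"] - before["throttled_time"]) / 1e9
```

All 10 elapsed periods were throttled, and the aggregate throttled time grew by about 2.8 seconds during a 1 second sleep, which is the interpretation problem this ticket describes.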
[jira] [Updated] (MESOS-1807) Disallow executors with cpu only or memory only resources
[ https://issues.apache.org/jira/browse/MESOS-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1807: - Sprint: Twitter Q4 Sprint 1, Twitter Mesos Q4 Sprint 2, Twitter Mesos Q4 Sprint 3, Twitter Mesos Q1 Sprint 2 (was: Twitter Q4 Sprint 1, Twitter Mesos Q4 Sprint 2, Twitter Mesos Q4 Sprint 3) Disallow executors with cpu only or memory only resources - Key: MESOS-1807 URL: https://issues.apache.org/jira/browse/MESOS-1807 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Assignee: Vinod Kone Labels: newbie Currently the master allows executors to be launched with either only cpus or only memory, but we shouldn't allow that. This is because an executor is an actual unix process that is launched by the slave. If an executor doesn't specify cpus, what should the cpu limits be for that executor when there are no tasks running on it? If no cpu limits are set then it might starve other executors/tasks on the slave, violating isolation guarantees. The same goes for memory. Moreover, the current containerizer/isolator code will throw failures when using such an executor, e.g., when the last task on the executor finishes and Containerizer::update() is called with 0 cpus or 0 mem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
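A minimal sketch of the proposed check follows. This is hypothetical Python, assuming resources flattened to a name-to-scalar mapping; the real validation would live in the master's C++ code and operate on Resource protobufs.

```python
def validate_executor_resources(resources):
    """Reject cpu-only or memory-only executors, per MESOS-1807.

    resources: mapping of resource name to scalar amount.
    Returns an error string, or None if the resources are acceptable.
    """
    has_cpus = resources.get("cpus", 0) > 0
    has_mem = resources.get("mem", 0) > 0
    if has_cpus != has_mem:
        return "executor must specify both cpus and mem, or neither"
    return None
```

Rejecting the mismatched case up front means the containerizer never reaches Containerizer::update() with 0 cpus or 0 mem for a still-running executor.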
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Assignee: Yan Xu Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Assignee: Yan Xu Priority: Minor Labels: flaky, twitter Occurred on a review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (and doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Sprint: Twitter Mesos Q1 Sprint 2 Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Priority: Minor Labels: flaky, twitter Occurred on a review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (and doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Shepherd: Vinod Kone Story Points: 8 Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Assignee: Yan Xu Priority: Minor Labels: flaky, twitter Occurred on a review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (and doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Shepherd: Jie Yu (was: Vinod Kone) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Assignee: Yan Xu Priority: Minor Labels: flaky, twitter Occurred on a review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (and doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)