[jira] [Commented] (MESOS-2484) libprocess Clock messages delivered
[ https://issues.apache.org/jira/browse/MESOS-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359108#comment-14359108 ] Dominic Hamon commented on MESOS-2484: -- see also https://issues.apache.org/jira/browse/MESOS-1456 regarding metrics. there's been some discussion about enforcing unique PID ids for every process but I can't find the relevant JIRA ticket. libprocess Clock messages delivered Key: MESOS-2484 URL: https://issues.apache.org/jira/browse/MESOS-2484 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Cody Maloney Found in / discussed at: https://reviews.apache.org/r/30587/#rc118737-72676 When a process is terminated, any outstanding delay() calls destined for that process aren't cancelled, meaning they arrive whenever the clock happens to get there. With uniquely named processes this isn't an issue, but with names that are reused (master), it could potentially lead to odd test flakiness, with artifacts carrying across tests that shouldn't be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
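A minimal sketch of the missing behaviour described above: pending delayed dispatches keyed by PID name, erased when the target process terminates. The class and method names are hypothetical, not libprocess's real API.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical timer bookkeeping: delay() records a pending dispatch
// under the target PID name; terminate() cancels everything still
// outstanding for that name.
class TimerQueue {
 public:
  void delay(const std::string& pid, std::function<void()> fn) {
    pending_.emplace(pid, std::move(fn));
  }

  // Without this erase, a new process reusing the PID name (e.g. the
  // "master" started by the next test) would receive stale timers
  // scheduled for its predecessor.
  void terminate(const std::string& pid) { pending_.erase(pid); }

  std::size_t outstanding(const std::string& pid) const {
    return pending_.count(pid);
  }

 private:
  std::multimap<std::string, std::function<void()>> pending_;
};
```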
[jira] [Commented] (MESOS-2216) The configure phase breaks with the IBM JVM.
[ https://issues.apache.org/jira/browse/MESOS-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359101#comment-14359101 ] Dominic Hamon commented on MESOS-2216: -- see also https://issues.apache.org/jira/browse/HADOOP-9435 so a change in include path or explicit linking against libdl should help. The configure phase breaks with the IBM JVM. -- Key: MESOS-2216 URL: https://issues.apache.org/jira/browse/MESOS-2216 Project: Mesos Issue Type: Bug Affects Versions: 1.0.0, 0.20.1 Environment: Ubuntu / x86_64 Reporter: Tony Reix Priority: Blocker ./configure does not work with the IBM JVM, since it looks for a directory: /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server (x86_64) or /usr/lib/jvm/ibm-java-ppc64le-71/jre/lib/ppc64le/server (PPC64LE) that does not exist for the IBM JVM. This directory does exist for the Oracle JVM and OpenJDK: /usr/lib/jvm/jdk1.7.0_71/jre/lib/amd64/server (Oracle JVM) and /usr/lib/jvm/java-1.7.0-openjdk-amd64/jre/lib/amd64/server (OpenJDK). However, the files libjsig.so and libjvm.so (3 versions) do exist for the IBM JVM. Anyway, creating the server directory and copying the files (tried with all 3 versions of libjvm.so) does not fix the issue: checking whether or not we can build with JNI... /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined reference to `dlopen' /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined reference to `dlclose' /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined reference to `dlerror' /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined reference to `dlsym' /usr/lib/jvm/ibm-java-x86_64-71/jre/lib/amd64/server/libjvm.so: undefined reference to `dladdr' Something (dlopen, dlclose, dlerror, dlsym, dladdr) is missing when linking against the IBM JVM's libjvm.so. So, either the configure step relies on a feature that is not in the Java standard but only in the Oracle JVM and OpenJDK, or the IBM JVM lacks part of the Java standard.
I'm not an expert on this, so I'd like the Mesos people to experiment with the IBM JVM. Maybe there is another solution for this step of the Mesos configure that would work with all 3 JVMs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-94) Master and Slave HTTP handlers should have unit tests
[ https://issues.apache.org/jira/browse/MESOS-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-94: --- Labels: http json test twitter (was: http json test) Master and Slave HTTP handlers should have unit tests - Key: MESOS-94 URL: https://issues.apache.org/jira/browse/MESOS-94 Project: Mesos Issue Type: Improvement Components: json api, master, slave, test Reporter: Charles Reiss Labels: http, json, test, twitter The Master and Slave have HTTP handlers which serve their state (mainly for the webui to use). There should be unit tests of these. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2293) Implement the Call endpoint on master
[ https://issues.apache.org/jira/browse/MESOS-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2293: - Labels: twitter (was: ) Implement the Call endpoint on master - Key: MESOS-2293 URL: https://issues.apache.org/jira/browse/MESOS-2293 Project: Mesos Issue Type: Task Reporter: Vinod Kone Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1988) Scheduler driver should not generate TASK_LOST when disconnected from master
[ https://issues.apache.org/jira/browse/MESOS-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1988: - Labels: twitter (was: ) Scheduler driver should not generate TASK_LOST when disconnected from master Key: MESOS-1988 URL: https://issues.apache.org/jira/browse/MESOS-1988 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Labels: twitter Currently, the driver replies to launchTasks() with TASK_LOST if it detects that it is disconnected from the master. After MESOS-1972 lands, this will be the only place where driver generates TASK_LOST. See MESOS-1972 for more context. This fix is targeted for 0.22.0 to give frameworks time to implement reconciliation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.
[ https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353182#comment-14353182 ] Dominic Hamon commented on MESOS-2467: -- instead of relying on the first character (valid JSON can also start with '{'), perhaps we can instead: - try JSON parsing, catch failure - fall back to the old parsing This also means we can deprecate the old parsing behaviour more easily. Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a custom format for the --resources flag. As we introduce more and more features (e.g., persistence, reservation) in the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character. If it is '[', then we invoke the JSON parser. Otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
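The try-then-fallback flow suggested in the comment can be sketched as follows. All names here are hypothetical, and the stub parseJson merely stands in for a real JSON parser (e.g. stout's); only the control flow is the point.

```cpp
#include <map>
#include <optional>
#include <sstream>
#include <string>

using Resources = std::map<std::string, double>;

// Stand-in for a real JSON parser: for this sketch it "succeeds" only
// on bracketed input and returns an empty set; a real parser would
// extract the entries.
std::optional<Resources> parseJson(const std::string& s) {
  if (s.empty() || s.front() != '[' || s.back() != ']') return std::nullopt;
  return Resources{};
}

// The pre-existing "name:value;name:value" text format.
std::optional<Resources> parseText(const std::string& s) {
  Resources out;
  std::istringstream in(s);
  std::string item;
  while (std::getline(in, item, ';')) {
    auto colon = item.find(':');
    if (colon == std::string::npos) return std::nullopt;
    out[item.substr(0, colon)] = std::stod(item.substr(colon + 1));
  }
  return out;
}

// Try JSON first; on failure fall back to the legacy format. No
// first-character sniffing, so '{'-rooted JSON works too.
std::optional<Resources> parseResources(const std::string& s) {
  if (auto r = parseJson(s)) return r;
  return parseText(s);
}
```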
[jira] [Updated] (MESOS-2277) Document undocumented HTTP endpoints
[ https://issues.apache.org/jira/browse/MESOS-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2277: - Labels: documentation newbie starter twitter (was: documentation newbie starter) Document undocumented HTTP endpoints Key: MESOS-2277 URL: https://issues.apache.org/jira/browse/MESOS-2277 Project: Mesos Issue Type: Improvement Reporter: Niklas Quarfot Nielsen Priority: Minor Labels: documentation, newbie, starter, twitter Did a quick scan and we are missing documentation for a few endpoints: {code} files/browse.json files/read.json files/download.json files/debug.json master/roles.json master/state.json master/stats.json slave/state.json slave/stats.json {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2294) Implement the Events endpoint on master
[ https://issues.apache.org/jira/browse/MESOS-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2294: - Labels: twitter (was: ) Implement the Events endpoint on master --- Key: MESOS-2294 URL: https://issues.apache.org/jira/browse/MESOS-2294 Project: Mesos Issue Type: Task Reporter: Vinod Kone Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1127) Implement the protobufs for the scheduler API
[ https://issues.apache.org/jira/browse/MESOS-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-1127: Assignee: Vinod Kone (was: Benjamin Hindman) Implement the protobufs for the scheduler API - Key: MESOS-1127 URL: https://issues.apache.org/jira/browse/MESOS-1127 Project: Mesos Issue Type: Task Components: framework Reporter: Benjamin Hindman Assignee: Vinod Kone Labels: twitter The default scheduler/executor interface and implementation in Mesos have a few drawbacks: (1) The interface is fairly high-level which makes it hard to do certain things, for example, handle events (callbacks) in batch. This can have a big impact on the performance of schedulers (for example, writing task updates that need to be persisted). (2) The implementation requires writing a lot of boilerplate JNI and native Python wrappers when adding additional API components. The plan is to provide a lower-level API that can easily be used to implement the higher-level API that is currently provided. This will also open the door to more easily building native-language Mesos libraries (i.e., not needing the C++ shim layer) and building new higher-level abstractions on top of the lower-level API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1023) Replace all static/global variables with non-POD type
[ https://issues.apache.org/jira/browse/MESOS-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353464#comment-14353464 ] Dominic Hamon commented on MESOS-1023: -- commit f780f67717fe0aa25b6870baedd55c43a7017edb (HEAD, origin/master, origin/HEAD, master) Author: Dominic Hamon d...@twitter.com Commit: Dominic Hamon d...@twitter.com Remove static strings from process and split out some source. Review: https://reviews.apache.org/r/30841 Replace all static/global variables with non-POD type - Key: MESOS-1023 URL: https://issues.apache.org/jira/browse/MESOS-1023 Project: Mesos Issue Type: Bug Components: general, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: c++ See http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Static_and_Global_Variables for the background. Real bugs have been seen. For example, in process::ID::generate we have a {{map<string, int>}} that can be accessed within the function after exit has been called. I.e., we can try to access the map after it's been destroyed, but before exit has completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
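One common remedy, per the Google style guidance linked above, is a construct-on-first-use function-local static pointer that is intentionally never destroyed. This is a sketch of the pattern, not the actual change made in the commit; generateId and its "prefix(N)" output are illustrative.

```cpp
#include <map>
#include <string>

// Construct on first use; deliberately leaked (never destroyed), so a
// lookup during process shutdown cannot touch a destructed map. The
// "leak" is a single bounded allocation for the program's lifetime.
static std::map<std::string, int>& prefixCounts() {
  static auto* counts = new std::map<std::string, int>();
  return *counts;
}

// Hypothetical analogue of process::ID::generate: returns "prefix(N)"
// with a per-prefix counter.
std::string generateId(const std::string& prefix) {
  int n = ++prefixCounts()[prefix];
  return prefix + "(" + std::to_string(n) + ")";
}
```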
[jira] [Commented] (MESOS-2457) Update post-reviews to rbtools in 'submit your patch' of developer's guide
[ https://issues.apache.org/jira/browse/MESOS-2457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350579#comment-14350579 ] Dominic Hamon commented on MESOS-2457: -- Install RBTools, yes. But we still want people to run support/post-reviews.py as it wraps rbt and avoids users having to set parent branches and manage diff chains. Update post-reviews to rbtools in 'submit your patch' of developer's guide --- Key: MESOS-2457 URL: https://issues.apache.org/jira/browse/MESOS-2457 Project: Mesos Issue Type: Bug Components: documentation, project website Reporter: Nancy Ko Priority: Minor Labels: documentation, newbie In the developer's guide (http://mesos.apache.org/documentation/latest/mesos-developers-guide/) post-reviews should be changed to Review Board tools. Specifically: List item 3: "First, install post-review. See Instructions" The "See Instructions" link should redirect to: https://www.reviewboard.org/docs/rbtools/dev/ instead of: https://www.reviewboard.org/docs/manual/dev/users/tools/post-review/ AND List item 5: "From your local branch run support/post-reviews.py." The run command should be changed to: rbt post -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation
[ https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345863#comment-14345863 ] Dominic Hamon commented on MESOS-2422: -- https://reviews.apache.org/r/31502/ https://reviews.apache.org/r/31503/ https://reviews.apache.org/r/31504/ https://reviews.apache.org/r/31505/ Use fq_codel qdisc for egress network traffic isolation --- Key: MESOS-2422 URL: https://issues.apache.org/jira/browse/MESOS-2422 Project: Mesos Issue Type: Task Reporter: Cong Wang Assignee: Cong Wang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2136) Expose per-cgroup memory pressure
[ https://issues.apache.org/jira/browse/MESOS-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2136: - Sprint: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3) Expose per-cgroup memory pressure - Key: MESOS-2136 URL: https://issues.apache.org/jira/browse/MESOS-2136 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Ian Downes Assignee: Chi Zhang Labels: twitter The cgroup memory controller can provide information on the memory pressure of a cgroup. This is in the form of an event based notification where events of (low, medium, critical) are generated when the kernel makes specific actions to allocate memory. This signal is probably more informative than comparing memory usage to memory limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2103) Expose number of processes and threads in a container
[ https://issues.apache.org/jira/browse/MESOS-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2103: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3) Expose number of processes and threads in a container - Key: MESOS-2103 URL: https://issues.apache.org/jira/browse/MESOS-2103 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Chi Zhang Labels: twitter The CFS cpu statistics (cpus_nr_throttled, cpus_nr_periods, cpus_throttled_time) are difficult to interpret. 1) nr_throttled is the number of intervals where *any* throttling occurred 2) throttled_time is the aggregate time *across all runnable tasks* (tasks in the Linux sense). For example, in a typical 60 second sampling interval: nr_periods = 600, nr_throttled could be 60, i.e., 10% of intervals, but throttled_time could be much higher than (60/600) * 60 = 6 seconds if there is more than one task that is runnable but throttled. *Each* throttled task contributes to the total throttled time. Small test to demonstrate throttled_time > nr_periods * quota_interval: 5 x {{'openssl speed'}} running with quota=100ms:
{noformat}
cat cpu.stat
sleep 1
cat cpu.stat
nr_periods 3228
nr_throttled 1276
throttled_time 528843772540
nr_periods 3238
nr_throttled 1286
throttled_time 531668964667
{noformat}
All 10 intervals throttled (100%) for a total time of 2.8 seconds within 1 second (more than 100% of the time interval). It would be helpful to expose the number of processes and tasks in the container cgroup. This would be at a very coarse granularity but would give some guidance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2403) MasterAllocatorTest/0.FrameworkReregistersFirst is flaky
[ https://issues.apache.org/jira/browse/MESOS-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2403: - Sprint: Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 3) MasterAllocatorTest/0.FrameworkReregistersFirst is flaky Key: MESOS-2403 URL: https://issues.apache.org/jira/browse/MESOS-2403 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.23.0 Environment: ASF CI (Ubuntu) Reporter: Vinod Kone Assignee: Vinod Kone {code} [ RUN ] MasterAllocatorTest/0.FrameworkReregistersFirst Using temporary directory '/tmp/MasterAllocatorTest_0_FrameworkReregistersFirst_Vy5Nml' I0224 23:22:31.681670 30589 leveldb.cpp:176] Opened db in 2.943518ms I0224 23:22:31.682152 30619 process.cpp:2117] Dropped / Lost event for PID: slave(65)@67.195.81.187:38391 I0224 23:22:31.682732 30589 leveldb.cpp:183] Compacted db in 1.029469ms I0224 23:22:31.682777 30589 leveldb.cpp:198] Created db iterator in 15460ns I0224 23:22:31.682792 30589 leveldb.cpp:204] Seeked to beginning of db in 1832ns I0224 23:22:31.682802 30589 leveldb.cpp:273] Iterated through 0 keys in the db in 319ns I0224 23:22:31.682833 30589 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0224 23:22:31.683228 30605 recover.cpp:449] Starting replica recovery I0224 23:22:31.683537 30605 recover.cpp:475] Replica is in 4 status I0224 23:22:31.684624 30615 replica.cpp:641] Replica in 4 status received a broadcasted recover request I0224 23:22:31.684978 30616 recover.cpp:195] Received a recover response from a replica in 4 status I0224 23:22:31.685405 30610 recover.cpp:566] Updating replica status to 3 I0224 23:22:31.686249 30609 master.cpp:349] Master 20150224-232231-3142697795-38391-30589 (pomona.apache.org) started on 67.195.81.187:38391 I0224 23:22:31.686265 30617 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 717897ns I0224 23:22:31.686319 30617 replica.cpp:323] Persisted 
replica status to 3 I0224 23:22:31.686336 30609 master.cpp:395] Master only allowing authenticated frameworks to register I0224 23:22:31.686357 30609 master.cpp:400] Master only allowing authenticated slaves to register I0224 23:22:31.686390 30609 credentials.hpp:37] Loading credentials for authentication from '/tmp/MasterAllocatorTest_0_FrameworkReregistersFirst_Vy5Nml/credentials' I0224 23:22:31.686511 30606 recover.cpp:475] Replica is in 3 status I0224 23:22:31.686563 30609 master.cpp:442] Authorization enabled I0224 23:22:31.686929 30607 whitelist_watcher.cpp:79] No whitelist given I0224 23:22:31.686954 30603 hierarchical.hpp:287] Initialized hierarchical allocator process I0224 23:22:31.687134 30605 replica.cpp:641] Replica in 3 status received a broadcasted recover request I0224 23:22:31.687731 30609 master.cpp:1356] The newly elected leader is master@67.195.81.187:38391 with id 20150224-232231-3142697795-38391-30589 I0224 23:22:31.839818 30609 master.cpp:1369] Elected as the leading master! 
I0224 23:22:31.839834 30609 master.cpp:1187] Recovering from registrar I0224 23:22:31.839926 30605 registrar.cpp:313] Recovering registrar I0224 23:22:31.84 30613 recover.cpp:195] Received a recover response from a replica in 3 status I0224 23:22:31.840504 30606 recover.cpp:566] Updating replica status to 1 I0224 23:22:31.841599 30611 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 990330ns I0224 23:22:31.841627 30611 replica.cpp:323] Persisted replica status to 1 I0224 23:22:31.841743 30611 recover.cpp:580] Successfully joined the Paxos group I0224 23:22:31.841904 30611 recover.cpp:464] Recover process terminated I0224 23:22:31.842366 30608 log.cpp:660] Attempting to start the writer I0224 23:22:31.843557 30607 replica.cpp:477] Replica received implicit promise request with proposal 1 I0224 23:22:31.844312 30607 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 722368ns I0224 23:22:31.844337 30607 replica.cpp:345] Persisted promised to 1 I0224 23:22:31.844889 30615 coordinator.cpp:230] Coordinator attemping to fill missing position I0224 23:22:31.846043 30614 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2 I0224 23:22:31.846729 30614 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 660024ns I0224 23:22:31.846746 30614 replica.cpp:679] Persisted action at 0 I0224 23:22:31.847671 30611 replica.cpp:511] Replica received write request for position 0 I0224 23:22:31.847723 30611 leveldb.cpp:438] Reading position from leveldb took 27349ns I0224 23:22:31.848429 30611 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 671461ns I0224 23:22:31.848454 30611
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2332: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3) Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Paul Brett Assignee: Paul Brett Labels: features, twitter Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g.
{noformat}
$ tc -s -d qdisc show dev mesos19223
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc ingress : parent :fff1
 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
{noformat}
Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp. tx/rx) and commenting of the protobuf fields so it's clear what these represent and how they differ from the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation
[ https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2422: - Labels: twitter (was: ) Use fq_codel qdisc for egress network traffic isolation --- Key: MESOS-2422 URL: https://issues.apache.org/jira/browse/MESOS-2422 Project: Mesos Issue Type: Task Reporter: Cong Wang Assignee: Cong Wang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2289) Design doc for the HTTP API
[ https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2289: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3) Design doc for the HTTP API --- Key: MESOS-2289 URL: https://issues.apache.org/jira/browse/MESOS-2289 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Vinod Kone This tracks the design of the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2350) Add support for MesosContainerizerLaunch to chroot to a specified path
[ https://issues.apache.org/jira/browse/MESOS-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2350: - Sprint: Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 3) Add support for MesosContainerizerLaunch to chroot to a specified path -- Key: MESOS-2350 URL: https://issues.apache.org/jira/browse/MESOS-2350 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.22.0, 0.21.1 Reporter: Ian Downes Assignee: Ian Downes Labels: twitter In preparation for the MesosContainerizer to support a filesystem isolator the MesosContainerizerLauncher must support chrooting. Optionally, it should also configure the chroot environment by (re-)mounting special filesystems such as /proc and /sys and making device nodes such as /dev/zero, etc., such that the chroot environment is functional. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2058: - Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 1) Deprecate stats.json endpoints for Master and Slave --- Key: MESOS-2058 URL: https://issues.apache.org/jira/browse/MESOS-2058 Project: Mesos Issue Type: Task Components: master, slave Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: twitter Fix For: 0.23.0 With the introduction of the libprocess {{/metrics/snapshot}} endpoint, metrics are now duplicated in the Master and Slave between this and {{stats.json}}. We should deprecate the {{stats.json}} endpoints. Manual inspection of {{stats.json}} shows that all metrics are now covered by the new endpoint for Master and Slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation
[ https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2422: - Sprint: Twitter Mesos Q1 Sprint 4 Use fq_codel qdisc for egress network traffic isolation --- Key: MESOS-2422 URL: https://issues.apache.org/jira/browse/MESOS-2422 Project: Mesos Issue Type: Task Reporter: Cong Wang Assignee: Cong Wang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2418) Remove raw pointers from stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340367#comment-14340367 ] Dominic Hamon commented on MESOS-2418: -- no more boost please. there's also {{std::array<char>}} if it's available on our compiler/platform suite as it is more similar to the existing fixed-size buffer usage. {{std::vector<char>}} has the benefit of working in C++03 compatible compilers. given we can't reach consensus on any use of {{std::unique_ptr}} i doubt it's a good fit here. Remove raw pointers from stout/os.hpp - Key: MESOS-2418 URL: https://issues.apache.org/jira/browse/MESOS-2418 Project: Mesos Issue Type: Improvement Components: stout Reporter: Joerg Schad Priority: Minor In MESOS-2412 a memory leak was found because of a missing {{delete}}. Forgetting to free memory is a common error while manually managing memory. In order to prevent this issue from happening again, another strategy should be used to handle buffers. Among the options there are {{std::vector<char>}}, {{std::unique_ptr<char[]>}}, or {{boost::scoped_array<char>}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2412) Potential memleak(s) in stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338796#comment-14338796 ] Dominic Hamon commented on MESOS-2412: -- https://reviews.apache.org/r/31489/ Potential memleak(s) in stout/os.hpp Key: MESOS-2412 URL: https://issues.apache.org/jira/browse/MESOS-2412 Project: Mesos Issue Type: Bug Components: stout Reporter: Joerg Schad Assignee: Dominic Hamon Labels: coverity, twitter Coverity picked up this potential memleak in os.hpp where we do not delete buffer in the else case. The exact same pattern occurs in getuid(const Option<std::string>& user = None()). The corresponding CID 1230371 and 1230371.
{code}
inline Result<gid_t> getgid(const Option<std::string>& user = None())
...
  while (true) {
    char* buffer = new char[size];
    if (getpwnam_r(user.get().c_str(), &passwd, buffer, size, &result) == 0) {
      ...
      delete[] buffer;
      return gid;
    } else {
      // RHEL7 (and possibly other systems) will return non-zero and
      // set one of the following errors for "The given name or uid
      // was not found." See 'man getpwnam_r'. We only check for the
      // errors explicitly listed, and do not consider the ellipsis.
      if (errno == ENOENT || errno == ESRCH || errno == EBADF || errno == EPERM) {
        return None(); // HERE WE DO NOT DELETE BUFFER.
      }
      ...
      // getpwnam_r set ERANGE so try again with a larger buffer.
      size *= 2;
      delete[] buffer;
    }
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
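A leak-free variant of the loop above can be sketched with std::vector<char> owning the buffer, so every return path releases it automatically. This is a self-contained illustration (the helper name gidOf and the collapsed error handling are assumptions for the sketch, not stout's actual code).

```cpp
#include <cerrno>
#include <optional>
#include <pwd.h>
#include <string>
#include <sys/types.h>
#include <vector>

// Leak-free buffer handling for getpwnam_r: the vector frees itself on
// every return path, so no delete[] can be forgotten.
std::optional<gid_t> gidOf(const std::string& user) {
  std::vector<char> buffer(512);
  struct passwd pwd;
  struct passwd* result = nullptr;
  while (true) {
    int err = getpwnam_r(user.c_str(), &pwd,
                         buffer.data(), buffer.size(), &result);
    if (err == 0) {
      if (result == nullptr) return std::nullopt;  // User not found.
      return pwd.pw_gid;
    }
    if (err != ERANGE) {
      // ENOENT/ESRCH/EBADF/EPERM etc. (the RHEL7 case) all collapse to
      // "not found" in this sketch.
      return std::nullopt;
    }
    buffer.resize(buffer.size() * 2);  // Buffer too small: grow, retry.
  }
}
```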
[jira] [Assigned] (MESOS-2412) Potential memleak(s) in stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-2412: Assignee: Dominic Hamon Potential memleak(s) in stout/os.hpp Key: MESOS-2412 URL: https://issues.apache.org/jira/browse/MESOS-2412 Project: Mesos Issue Type: Bug Components: stout Reporter: Joerg Schad Assignee: Dominic Hamon Labels: coverity, twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2412) Potential memleak(s) in stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2412: - Sprint: Twitter Mesos Q1 Sprint 3 Potential memleak(s) in stout/os.hpp Key: MESOS-2412 URL: https://issues.apache.org/jira/browse/MESOS-2412 Project: Mesos Issue Type: Bug Components: stout Reporter: Joerg Schad Assignee: Dominic Hamon Labels: coverity, twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2412) Potential memleak(s) in stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2412: - Assignee: Joerg Schad (was: Dominic Hamon) Potential memleak(s) in stout/os.hpp Key: MESOS-2412 URL: https://issues.apache.org/jira/browse/MESOS-2412 Project: Mesos Issue Type: Bug Components: stout Reporter: Joerg Schad Assignee: Joerg Schad Labels: coverity, twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2412) Potential memleak(s) in stout/os.hpp
[ https://issues.apache.org/jira/browse/MESOS-2412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2412: - Sprint: (was: Twitter Mesos Q1 Sprint 3) Potential memleak(s) in stout/os.hpp Key: MESOS-2412 URL: https://issues.apache.org/jira/browse/MESOS-2412 Project: Mesos Issue Type: Bug Components: stout Reporter: Joerg Schad Assignee: Joerg Schad Labels: coverity, twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2366: - Story Points: 1 MasterSlaveReconciliationTest.ReconcileLostTask is flaky Key: MESOS-2366 URL: https://issues.apache.org/jira/browse/MESOS-2366 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Dominic Hamon Labels: flaky-test https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes {code} [ RUN ] MasterSlaveReconciliationTest.ReconcileLostTask Using temporary directory '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF' I0218 01:53:26.881561 13918 leveldb.cpp:175] Opened db in 2.891605ms I0218 01:53:26.882547 13918 leveldb.cpp:182] Compacted db in 953447ns I0218 01:53:26.882596 13918 leveldb.cpp:197] Created db iterator in 20629ns I0218 01:53:26.882616 13918 leveldb.cpp:203] Seeked to beginning of db in 2370ns I0218 01:53:26.882627 13918 leveldb.cpp:272] Iterated through 0 keys in the db in 348ns I0218 01:53:26.882664 13918 replica.cpp:743] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0218 01:53:26.883124 13947 recover.cpp:448] Starting replica recovery I0218 01:53:26.883625 13941 recover.cpp:474] Replica is in 4 status I0218 01:53:26.884744 13945 replica.cpp:640] Replica in 4 status received a broadcasted recover request I0218 01:53:26.885118 13939 recover.cpp:194] Received a recover response from a replica in 4 status I0218 01:53:26.885565 13933 recover.cpp:565] Updating replica status to 3 I0218 01:53:26.886548 13932 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 733223ns I0218 01:53:26.886574 13932 replica.cpp:322] Persisted replica status to 3 I0218 01:53:26.886714 13943 master.cpp:347] Master 20150218-015326-3142697795-57268-13918 (pomona.apache.org) started on 67.195.81.187:57268 I0218 01:53:26.886760 13943 master.cpp:393] Master only allowing authenticated 
frameworks to register I0218 01:53:26.886772 13943 master.cpp:398] Master only allowing authenticated slaves to register I0218 01:53:26.886798 13943 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF/credentials' I0218 01:53:26.886826 13934 recover.cpp:474] Replica is in 3 status I0218 01:53:26.887151 13943 master.cpp:440] Authorization enabled I0218 01:53:26.887866 13944 replica.cpp:640] Replica in 3 status received a broadcasted recover request I0218 01:53:26.887969 13942 whitelist_watcher.cpp:78] No whitelist given I0218 01:53:26.888021 13940 hierarchical.hpp:286] Initialized hierarchical allocator process I0218 01:53:26.888178 13934 recover.cpp:194] Received a recover response from a replica in 3 status I0218 01:53:26.889114 13943 master.cpp:1354] The newly elected leader is master@67.195.81.187:57268 with id 20150218-015326-3142697795-57268-13918 I0218 01:53:27.064930 13948 process.cpp:2117] Dropped / Lost event for PID: hierarchical-allocator(183)@67.195.81.187:57268 I0218 01:53:27.911870 13943 master.cpp:1367] Elected as the leading master! I0218 01:53:27.911911 13943 master.cpp:1185] Recovering from registrar I0218 01:53:27.912106 13948 process.cpp:2117] Dropped / Lost event for PID: scheduler-93f78006-5b69-498b-b4e3-87cdf8062263@67.195.81.187:57268 I0218 01:53:27.912255 13932 registrar.cpp:312] Recovering registrar I0218 01:53:27.912307 13948 process.cpp:2117] Dropped / Lost event for PID: hierarchical-allocator(179)@67.195.81.187:57268 I0218 01:53:27.912626 13940 hierarchical.hpp:831] No resources available to allocate! 
I0218 01:53:27.912658 13940 hierarchical.hpp:738] Performed allocation for 0 slaves in 60316ns I0218 01:53:27.912838 13947 recover.cpp:565] Updating replica status to 1 I0218 01:53:27.913966 13947 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 921045ns I0218 01:53:27.913998 13947 replica.cpp:322] Persisted replica status to 1 I0218 01:53:27.914106 13932 recover.cpp:579] Successfully joined the Paxos group I0218 01:53:27.914378 13932 recover.cpp:463] Recover process terminated I0218 01:53:27.914916 13939 log.cpp:659] Attempting to start the writer I0218 01:53:27.916374 13937 replica.cpp:476] Replica received implicit promise request with proposal 1 I0218 01:53:27.916941 13937 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 534122ns I0218 01:53:27.916967 13937 replica.cpp:344] Persisted promised to 1 I0218 01:53:27.917795 13936 coordinator.cpp:229] Coordinator attemping to fill missing position I0218 01:53:27.919147 13941 replica.cpp:377] Replica received explicit promise request for position 0 with proposal 2 I0218
[jira] [Updated] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2366: - Sprint: Twitter Mesos Q1 Sprint 3 MasterSlaveReconciliationTest.ReconcileLostTask is flaky Key: MESOS-2366 URL: https://issues.apache.org/jira/browse/MESOS-2366 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Dominic Hamon Labels: flaky-test https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes
[jira] [Updated] (MESOS-2350) Add support for MesosContainerizerLaunch to chroot to a specified path
[ https://issues.apache.org/jira/browse/MESOS-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2350: - Labels: twitter (was: ) Add support for MesosContainerizerLaunch to chroot to a specified path -- Key: MESOS-2350 URL: https://issues.apache.org/jira/browse/MESOS-2350 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.22.0, 0.21.1 Reporter: Ian Downes Assignee: Ian Downes Labels: twitter In preparation for the MesosContainerizer to support a filesystem isolator the MesosContainerizerLauncher must support chrooting. Optionally, it should also configure the chroot environment by (re-)mounting special filesystems such as /proc and /sys and making device nodes such as /dev/zero, etc., such that the chroot environment is functional. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2359) Expose slave's memory and cpu cgroup metrics
[ https://issues.apache.org/jira/browse/MESOS-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2359: - Component/s: twitter Expose slave's memory and cpu cgroup metrics Key: MESOS-2359 URL: https://issues.apache.org/jira/browse/MESOS-2359 Project: Mesos Issue Type: Improvement Components: isolation, twitter Reporter: Ian Downes Priority: Minor The slave can optionally be placed into its own cgroups (--slave_cgroups=). If this is enabled, we should export the relevant metrics - in preference or in addition to the process based metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333701#comment-14333701 ] Dominic Hamon commented on MESOS-2366: -- looks like waiting for the status update acknowledgement message should be enough. MasterSlaveReconciliationTest.ReconcileLostTask is flaky Key: MESOS-2366 URL: https://issues.apache.org/jira/browse/MESOS-2366 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Dominic Hamon Labels: flaky-test https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes
[jira] [Created] (MESOS-2386) Provide full filesystem isolation as a native mesos isolator
Dominic Hamon created MESOS-2386: Summary: Provide full filesystem isolation as a native mesos isolator Key: MESOS-2386 URL: https://issues.apache.org/jira/browse/MESOS-2386 Project: Mesos Issue Type: Epic Reporter: Dominic Hamon -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1397) Rename ResourceStatistics for containers
[ https://issues.apache.org/jira/browse/MESOS-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1397: - Labels: twitter (was: ) Rename ResourceStatistics for containers Key: MESOS-1397 URL: https://issues.apache.org/jira/browse/MESOS-1397 Project: Mesos Issue Type: Improvement Reporter: Ian Downes Labels: twitter Rename to ContainerStatistics, which includes an optional ResourceStatistics and an optional PerfStatistics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1282) Support unprivileged access to cgroups
[ https://issues.apache.org/jira/browse/MESOS-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1282: - Labels: twitter (was: ) Support unprivileged access to cgroups -- Key: MESOS-1282 URL: https://issues.apache.org/jira/browse/MESOS-1282 Project: Mesos Issue Type: Improvement Affects Versions: 0.19.0 Reporter: Ian Downes Priority: Minor Labels: twitter Attachments: MESOS-1282.patch Supporting this would allow running tests with cgroup isolators on CI machines where sudo access is unavailable. This could be achieved by having the subsystems mounted and the mesos (or mesos_test) cgroup created and owned by the unprivileged user.
{noformat}
[vagrant@mesos cpu]$ cat /proc/mounts | grep cgroup
tmpfs /sys/fs/cgroup tmpfs rw,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset,clone_children 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu,clone_children 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct,clone_children 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory,clone_children 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices,clone_children 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer,clone_children 0 0
cgroup /sys/fs/cgroup/net_cls cgroup rw,relatime,net_cls,clone_children 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio,clone_children 0 0
[vagrant@mesos cpu]$ pwd
/sys/fs/cgroup/cpu
[vagrant@mesos cpu]$ ls -la
total 0
drwxr-xr-x  2 root root   0 May  1 22:11 .
drwxrwxrwt 10 root root 200 Apr 30 23:09 ..
-rw-r--r--  1 root root   0 Apr 30 23:14 cgroup.clone_children
--w--w--w-  1 root root   0 Apr 30 23:09 cgroup.event_control
-rw-r--r--  1 root root   0 Apr 30 23:09 cgroup.procs
-rw-r--r--  1 root root   0 Apr 30 23:09 cpu.cfs_period_us
-rw-r--r--  1 root root   0 Apr 30 23:09 cpu.cfs_quota_us
-rw-r--r--  1 root root   0 Apr 30 23:09 cpu.rt_period_us
-rw-r--r--  1 root root   0 Apr 30 23:09 cpu.rt_runtime_us
-rw-r--r--  1 root root   0 Apr 30 23:09 cpu.shares
-r--r--r--  1 root root   0 Apr 30 23:09 cpu.stat
-rw-r--r--  1 root root   0 Apr 30 23:09 notify_on_release
-rw-r--r--  1 root root   0 Apr 30 23:09 release_agent
-rw-r--r--  1 root root   0 Apr 30 23:09 tasks
{noformat}
User is unprivileged:
{noformat}
[vagrant@mesos cpu]$ id
uid=500(vagrant) gid=500(vagrant) groups=500(vagrant),10(wheel)
[vagrant@mesos cpu]$ mkdir mesos
mkdir: cannot create directory `mesos': Permission denied
{noformat}
Create a cgroup and chown it to the unprivileged user:
{noformat}
[vagrant@mesos cpu]$ sudo mkdir mesos
[vagrant@mesos cpu]$ sudo chown -R vagrant:vagrant mesos
[vagrant@mesos cpu]$ ls -la
total 0
drwxr-xr-x  3 root    root      0 May  1 22:11 .
drwxrwxrwt 10 root    root    200 Apr 30 23:09 ..
-rw-r--r--  1 root    root      0 Apr 30 23:14 cgroup.clone_children
--w--w--w-  1 root    root      0 Apr 30 23:09 cgroup.event_control
-rw-r--r--  1 root    root      0 Apr 30 23:09 cgroup.procs
-rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.cfs_period_us
-rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.cfs_quota_us
-rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.rt_period_us
-rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.rt_runtime_us
-rw-r--r--  1 root    root      0 Apr 30 23:09 cpu.shares
-r--r--r--  1 root    root      0 Apr 30 23:09 cpu.stat
drwxr-xr-x  2 vagrant vagrant   0 May  1 22:12 mesos
-rw-r--r--  1 root    root      0 Apr 30 23:09 notify_on_release
-rw-r--r--  1 root    root      0 Apr 30 23:09 release_agent
-rw-r--r--  1 root    root      0 Apr 30 23:09 tasks
{noformat}
The unprivileged user can now create nested cgroups and move processes into/out of cgroups it owns.
{noformat}
[vagrant@mesos cpu]$ echo $$
2877
[vagrant@mesos cpu]$ echo $$ > mesos/cgroup.procs
[vagrant@mesos cpu]$ cat mesos/cgroup.procs
2877
2957
[vagrant@mesos cpu]$ mkdir mesos/test
[vagrant@mesos cpu]$ echo $$ > mesos/test/cgroup.procs
[vagrant@mesos cpu]$ cat mesos/test/cgroup.procs
2877
2960
[vagrant@mesos cpu]$ echo $$ > mesos/cgroup.procs
[vagrant@mesos cpu]$ cat mesos/cgroup.procs
2877
2977
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-886) Slave should wait until resources are isolated before launching tasks
[ https://issues.apache.org/jira/browse/MESOS-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333741#comment-14333741 ] Dominic Hamon commented on MESOS-886: - Is this still relevant? Slave should wait until resources are isolated before launching tasks - Key: MESOS-886 URL: https://issues.apache.org/jira/browse/MESOS-886 Project: Mesos Issue Type: Bug Components: isolation, slave Affects Versions: 0.14.0 Reporter: Ian Downes Assignee: Yifan Gu Priority: Minor Labels: twitter The slave dispatches to the isolator to update resources and then sends RunTaskMessage to the executor without waiting for the update to complete. This race could, for example, lead to the task using too much RAM (including file cache) and then being OOM killed whenever the resource update completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2386) Provide full filesystem isolation as a native mesos isolator
[ https://issues.apache.org/jira/browse/MESOS-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2386: - Labels: twitter (was: ) Provide full filesystem isolation as a native mesos isolator Key: MESOS-2386 URL: https://issues.apache.org/jira/browse/MESOS-2386 Project: Mesos Issue Type: Epic Reporter: Dominic Hamon Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-886) Slave should wait until resources are isolated before launching tasks
[ https://issues.apache.org/jira/browse/MESOS-886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-886: Labels: twitter (was: ) Slave should wait until resources are isolated before launching tasks - Key: MESOS-886 URL: https://issues.apache.org/jira/browse/MESOS-886 Project: Mesos Issue Type: Bug Components: isolation, slave Affects Versions: 0.14.0 Reporter: Ian Downes Assignee: Yifan Gu Priority: Minor Labels: twitter The slave dispatches to the isolator to update resources and then sends RunTaskMessage to the executor without waiting for the update to complete. This race could, for example, lead to the task using too much RAM (including file cache) and then being OOM killed whenever the resource update completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333701#comment-14333701 ] Dominic Hamon edited comment on MESOS-2366 at 2/23/15 8:10 PM: --- looks like waiting for the status update acknowledgement message should be enough. The master updates the metrics in {{updateTask}}, called from {{statusUpdate}}. It's possible that the StatusUpdate message has been sent (which we check for) but not acted on by the Master yet, hence the metrics have not been updated. Waiting for the explicit acknowledgement is a proxy signal that the Master has updated the metrics. was (Author: dhamon): looks like waiting for the status update acknowledgement message should be enough. MasterSlaveReconciliationTest.ReconcileLostTask is flaky Key: MESOS-2366 URL: https://issues.apache.org/jira/browse/MESOS-2366 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Dominic Hamon Labels: flaky-test https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Sprint: Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Assignee: Yan Xu Priority: Minor Labels: flaky, twitter Occurred on review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (and doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2058: - Fix Version/s: 0.23.0 Deprecate stats.json endpoints for Master and Slave --- Key: MESOS-2058 URL: https://issues.apache.org/jira/browse/MESOS-2058 Project: Mesos Issue Type: Task Components: master, slave Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: twitter Fix For: 0.23.0 With the introduction of the libprocess {{/metrics/snapshot}} endpoint, metrics are now duplicated in the Master and Slave between this and {{stats.json}}. We should deprecate the {{stats.json}} endpoints. Manual inspection of {{stats.json}} shows that all metrics are now covered by the new endpoint for Master and Slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2366) MasterSlaveReconciliationTest.ReconcileLostTask is flaky
[ https://issues.apache.org/jira/browse/MESOS-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326659#comment-14326659 ] Dominic Hamon commented on MESOS-2366: -- That's curious. That suggests that the status update is not being received at the master, but I see it in the log. We could remove the metrics from the test temporarily, but it suggests that there's some wait missing in the test itself, or some check not present. MasterSlaveReconciliationTest.ReconcileLostTask is flaky Key: MESOS-2366 URL: https://issues.apache.org/jira/browse/MESOS-2366 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Labels: flaky-test https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2746/changes {code} [ RUN ] MasterSlaveReconciliationTest.ReconcileLostTask Using temporary directory '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF' I0218 01:53:26.881561 13918 leveldb.cpp:175] Opened db in 2.891605ms I0218 01:53:26.882547 13918 leveldb.cpp:182] Compacted db in 953447ns I0218 01:53:26.882596 13918 leveldb.cpp:197] Created db iterator in 20629ns I0218 01:53:26.882616 13918 leveldb.cpp:203] Seeked to beginning of db in 2370ns I0218 01:53:26.882627 13918 leveldb.cpp:272] Iterated through 0 keys in the db in 348ns I0218 01:53:26.882664 13918 replica.cpp:743] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0218 01:53:26.883124 13947 recover.cpp:448] Starting replica recovery I0218 01:53:26.883625 13941 recover.cpp:474] Replica is in 4 status I0218 01:53:26.884744 13945 replica.cpp:640] Replica in 4 status received a broadcasted recover request I0218 01:53:26.885118 13939 recover.cpp:194] Received a recover response from a replica in 4 status I0218 01:53:26.885565 13933 recover.cpp:565] Updating replica status to 3 I0218 01:53:26.886548 13932 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 733223ns I0218 01:53:26.886574 13932 
replica.cpp:322] Persisted replica status to 3 I0218 01:53:26.886714 13943 master.cpp:347] Master 20150218-015326-3142697795-57268-13918 (pomona.apache.org) started on 67.195.81.187:57268 I0218 01:53:26.886760 13943 master.cpp:393] Master only allowing authenticated frameworks to register I0218 01:53:26.886772 13943 master.cpp:398] Master only allowing authenticated slaves to register I0218 01:53:26.886798 13943 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterSlaveReconciliationTest_ReconcileLostTask_Rgb8FF/credentials' I0218 01:53:26.886826 13934 recover.cpp:474] Replica is in 3 status I0218 01:53:26.887151 13943 master.cpp:440] Authorization enabled I0218 01:53:26.887866 13944 replica.cpp:640] Replica in 3 status received a broadcasted recover request I0218 01:53:26.887969 13942 whitelist_watcher.cpp:78] No whitelist given I0218 01:53:26.888021 13940 hierarchical.hpp:286] Initialized hierarchical allocator process I0218 01:53:26.888178 13934 recover.cpp:194] Received a recover response from a replica in 3 status I0218 01:53:26.889114 13943 master.cpp:1354] The newly elected leader is master@67.195.81.187:57268 with id 20150218-015326-3142697795-57268-13918 I0218 01:53:27.064930 13948 process.cpp:2117] Dropped / Lost event for PID: hierarchical-allocator(183)@67.195.81.187:57268 I0218 01:53:27.911870 13943 master.cpp:1367] Elected as the leading master! I0218 01:53:27.911911 13943 master.cpp:1185] Recovering from registrar I0218 01:53:27.912106 13948 process.cpp:2117] Dropped / Lost event for PID: scheduler-93f78006-5b69-498b-b4e3-87cdf8062263@67.195.81.187:57268 I0218 01:53:27.912255 13932 registrar.cpp:312] Recovering registrar I0218 01:53:27.912307 13948 process.cpp:2117] Dropped / Lost event for PID: hierarchical-allocator(179)@67.195.81.187:57268 I0218 01:53:27.912626 13940 hierarchical.hpp:831] No resources available to allocate! 
I0218 01:53:27.912658 13940 hierarchical.hpp:738] Performed allocation for 0 slaves in 60316ns I0218 01:53:27.912838 13947 recover.cpp:565] Updating replica status to 1 I0218 01:53:27.913966 13947 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 921045ns I0218 01:53:27.913998 13947 replica.cpp:322] Persisted replica status to 1 I0218 01:53:27.914106 13932 recover.cpp:579] Successfully joined the Paxos group I0218 01:53:27.914378 13932 recover.cpp:463] Recover process terminated I0218 01:53:27.914916 13939 log.cpp:659] Attempting to start the writer I0218 01:53:27.916374 13937 replica.cpp:476] Replica received implicit promise request with proposal 1 I0218 01:53:27.916941 13937 leveldb.cpp:305] Persisting metadata (8 bytes) to leveldb took 534122ns I0218 01:53:27.916967 13937
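The comment above suggests the test is missing a wait or a check rather than the metrics being at fault. A generic poll-until-condition helper (a sketch, not Mesos test code; the name `wait_until` is illustrative) shows the usual fix: replace timing assumptions with an explicit wait on the observable state.

```python
import time

def wait_until(condition, timeout=5.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition held before the deadline, False otherwise.
    A test asserting on asynchronously delivered state (e.g. a status update
    reaching the master) should wait like this instead of assuming the event
    has already been processed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()

# Usage: wait for state flipped by another thread or event loop,
# e.g. wait_until(lambda: master.received_status_update())
```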
[jira] [Resolved] (MESOS-1708) Using the wrong resource name should report a better error.
[ https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon resolved MESOS-1708. -- Resolution: Fixed Fix Version/s: 0.22.0 Using the wrong resource name should report a better error. - Key: MESOS-1708 URL: https://issues.apache.org/jira/browse/MESOS-1708 Project: Mesos Issue Type: Bug Components: framework, master Reporter: Benjamin Hindman Assignee: Dominic Hamon Labels: newbie, twitter Fix For: 0.22.0 If a scheduler launches a task using resources the master doesn't know about the task validator causes the task to fail but the error message is not very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
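The fix described above amounts to naming the offending resources instead of failing generically. A hypothetical sketch of such a validator (the function and the resource set are illustrative, not the actual Mesos task validator):

```python
KNOWN_RESOURCES = {"cpus", "mem", "disk", "ports"}  # illustrative, not exhaustive

def validate_resource_names(requested):
    """Return None if every requested resource name is known to the master,
    else a helpful error message naming the unknown resources.

    A sketch of the kind of message the task validator could produce
    instead of a generic task failure.
    """
    unknown = sorted(set(requested) - KNOWN_RESOURCES)
    if not unknown:
        return None
    return ("Task uses unknown resource(s): %s; resources known to this "
            "master: %s" % (", ".join(unknown),
                            ", ".join(sorted(KNOWN_RESOURCES))))

# e.g. a scheduler typo'ing 'cpu' instead of 'cpus' now gets a pointed message:
msg = validate_resource_names(["cpu", "mem"])
```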
[jira] [Resolved] (MESOS-2185) slave state endpoint does not contain all resources in the resources field
[ https://issues.apache.org/jira/browse/MESOS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon resolved MESOS-2185. -- Resolution: Fixed Fix Version/s: 0.22.0 commit 73ddc21f44e65499d4179bb15edf97243c8ab18c (HEAD, origin/master, origin/HEAD, master) Author: Joerg Schad jo...@mesosphere.io Commit: Dominic Hamon d...@twitter.com Included all resources in state endpoint. Review: https://reviews.apache.org/r/31082 slave state endpoint does not contain all resources in the resources field -- Key: MESOS-2185 URL: https://issues.apache.org/jira/browse/MESOS-2185 Project: Mesos Issue Type: Bug Components: json api, slave Affects Versions: 0.21.0 Environment: Centos 6.5 / Centos 6.6 Reporter: Henning Schmiedehausen Assignee: Joerg Schad Labels: mesosphere Fix For: 0.22.0 fetching status for a slave from the /state.json yields resources: { ports: [31000-32000], mem: 512, disk: 33659, cpus: 1 } but in the flags section, it lists flags: { resources: cpus:1;mem:512;ports:[31000-32000];set:{label_a,label_b,label_c,label_d};range:[0-1000];scalar:108;numbers:{4,8,15,16,23,42}, } so there are additional resources. these resources show up when sending offers from that slave to the frameworks and the frameworks can use and consume them. This may just be a reporting issue with the state.json endpoint. https://gist.github.com/hgschmie/0dc4f599bb0ff2e815ed is the full response. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
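The discrepancy above can be checked mechanically by parsing the semicolon-separated `flags.resources` string and diffing it against the names in the `resources` object from /state.json. A sketch under the formats quoted in the ticket (helper names are mine):

```python
def parse_resources_flag(flag):
    """Parse a Mesos --resources style string such as
    'cpus:1;mem:512;ports:[31000-32000]' into a name -> value-string dict."""
    out = {}
    for item in filter(None, flag.split(";")):
        name, _, value = item.partition(":")
        out[name.strip()] = value.strip()
    return out

def missing_from_state(flag, state_resources):
    """Resource names present in flags.resources but absent from the
    /state.json 'resources' object -- the discrepancy this ticket reports."""
    return sorted(set(parse_resources_flag(flag)) - set(state_resources))

# Values taken from the ticket (abbreviated):
flag = ("cpus:1;mem:512;ports:[31000-32000];"
        "set:{label_a,label_b};range:[0-1000];scalar:108")
state = {"ports": "[31000-32000]", "mem": 512, "disk": 33659, "cpus": 1}
```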
[jira] [Updated] (MESOS-2244) RoutingTest.INETSockets fails
[ https://issues.apache.org/jira/browse/MESOS-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2244: - Assignee: Chi Zhang RoutingTestINETSockets fails - Key: MESOS-2244 URL: https://issues.apache.org/jira/browse/MESOS-2244 Project: Mesos Issue Type: Bug Environment: Ubuntu 14.10, libnl 3.2.25 Reporter: Evelina Dumitrescu Assignee: Chi Zhang [ RUN ] RoutingTest.INETSockets *** stack smashing detected ***: /home/evelina/mesos2/mesos/build/src/.libs/lt-mesos-tests terminated *** Aborted at 1421895912 (unix time) try date -d @1421895912 if you are using GNU date *** PC: @ 0x7f3566460d27 (unknown) *** SIGABRT (@0x3e81633) received by PID 5683 (TID 0x7f356c53a7c0) from PID 5683; stack trace: *** @ 0x7f35667fec90 (unknown) @ 0x7f3566460d27 (unknown) @ 0x7f3566462418 (unknown) @ 0x7f35664a29f4 (unknown) @ 0x7f35665365cc (unknown) @ 0x7f3566536570 (unknown) @ 0x7f3566226753 idiagnl_msg_parse @ 0x7f356622678b idiagnl_msg_parser @ 0x7f3565dac4c9 nl_cache_parse @ 0x7f3565dac51b update_msg_parser @ 0x7f3565db1fbf nl_recvmsgs_report @ 0x7f3565db2329 nl_recvmsgs @ 0x7f3565dab9c9 __cache_pickup @ 0x7f3565dac43d nl_cache_pickup @ 0x7f3565dac66e nl_cache_refill @ 0x7f3566226024 idiagnl_msg_alloc_cache @ 0x7f356a95f455 routing::diagnosis::socket::infos() @ 0x114da90 RoutingTest_INETSockets_Test::TestBody() @ 0x11e6957 testing::internal::HandleSehExceptionsInMethodIfSupported() @ 0x11e151d testing::internal::HandleExceptionsInMethodIfSupported() @ 0x11c7adb testing::Test::Run() @ 0x11c8253 testing::TestInfo::Run() @ 0x11c87f6 testing::TestCase::Run() @ 0x11cd987 testing::internal::UnitTestImpl::RunAllTests() @ 0x11e7905 testing::internal::HandleSehExceptionsInMethodIfSupported() @ 0x11e2304 testing::internal::HandleExceptionsInMethodIfSupported() @ 0x11cc74a testing::UnitTest::Run() @ 0xd7a4ad main @ 0x7f356644bec5 (unknown) @ 0x91ccb9 (unknown) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1708) Using the wrong resource name should report a better error.
[ https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1708: - Sprint: Twitter Mesos Q1 Sprint 3 Using the wrong resource name should report a better error. - Key: MESOS-1708 URL: https://issues.apache.org/jira/browse/MESOS-1708 Project: Mesos Issue Type: Bug Components: framework, master Reporter: Benjamin Hindman Assignee: Dominic Hamon Labels: newbie, twitter If a scheduler launches a task using resources the master doesn't know about the task validator causes the task to fail but the error message is not very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-998) Slave should wait until Containerizer::update() completes successfully
[ https://issues.apache.org/jira/browse/MESOS-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-998: Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Slave should wait until Containerizer::update() completes successfully -- Key: MESOS-998 URL: https://issues.apache.org/jira/browse/MESOS-998 Project: Mesos Issue Type: Bug Components: isolation Affects Versions: 0.18.0, 0.19.0, 0.20.0, 0.21.0, 0.19.1, 0.20.1, 0.21.1 Reporter: Ian Downes Assignee: Jie Yu Container resources are updated in several places in the slave and we don't check the update was successful or even wait until it completes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2103) Expose number and state of threads in a container
[ https://issues.apache.org/jira/browse/MESOS-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2103: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Expose number and state of threads in a container - Key: MESOS-2103 URL: https://issues.apache.org/jira/browse/MESOS-2103 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Chi Zhang Labels: twitter The CFS cpu statistics (cpus_nr_throttled, cpus_nr_periods, cpus_throttled_time) are difficult to interpret. 1) nr_throttled is the number of intervals where *any* throttling occurred 2) throttled_time is the aggregate time *across all runnable tasks* (tasks in the Linux sense). For example, in a typical 60 second sampling interval: nr_periods = 600, nr_throttled could be 60, i.e., 10% of intervals, but throttled_time could be much higher than (60/600) * 60 = 6 seconds if there is more than one task that is runnable but throttled. *Each* throttled task contributes to the total throttled time. Small test to demonstrate throttled_time nr_periods * quota_interval: 5 x {{'openssl speed'}} running with quota=100ms: {noformat} cat cpu.stat sleep 1 cat cpu.stat nr_periods 3228 nr_throttled 1276 throttled_time 528843772540 nr_periods 3238 nr_throttled 1286 throttled_time 531668964667 {noformat} All 10 intervals throttled (100%) for total time of 2.8 seconds in 1 second (more than 100% of the time interval) It would be helpful to expose the number and state of tasks in the container cgroup. This would be at a very coarse granularity but would give some guidance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
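The arithmetic above (all intervals throttled, 2.8 seconds of aggregate throttled time inside a 1 second sample) can be reproduced from the two cpu.stat snapshots quoted in the ticket. A sketch that parses the standard cgroup cpu.stat format and computes the deltas:

```python
def parse_cpu_stat(text):
    """Parse cgroup cpu.stat contents ('name value' per line) into a dict."""
    return {k: int(v)
            for k, v in (line.split() for line in text.splitlines() if line.strip())}

def throttle_summary(before, after):
    """Deltas between two cpu.stat snapshots: the fraction of CFS periods in
    which any throttling occurred, and the throttled time (aggregated across
    all runnable tasks, hence possibly more than wall time) in seconds."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    seconds = (after["throttled_time"] - before["throttled_time"]) / 1e9
    return (throttled / periods if periods else 0.0), seconds

# Snapshots quoted in the ticket, taken one second apart:
before = parse_cpu_stat("nr_periods 3228\nnr_throttled 1276\nthrottled_time 528843772540")
after = parse_cpu_stat("nr_periods 3238\nnr_throttled 1286\nthrottled_time 531668964667")
frac, secs = throttle_summary(before, after)  # 1.0 (all 10 periods), ~2.83 s
```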
[jira] [Updated] (MESOS-2289) Design doc for the HTTP API
[ https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2289: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Design doc for the HTTP API --- Key: MESOS-2289 URL: https://issues.apache.org/jira/browse/MESOS-2289 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Vinod Kone This tracks the design of the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2136) Expose per-cgroup memory pressure
[ https://issues.apache.org/jira/browse/MESOS-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2136: - Sprint: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2) Expose per-cgroup memory pressure - Key: MESOS-2136 URL: https://issues.apache.org/jira/browse/MESOS-2136 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Ian Downes Assignee: Chi Zhang Labels: twitter The cgroup memory controller can provide information on the memory pressure of a cgroup. This is in the form of an event based notification where events of (low, medium, critical) are generated when the kernel makes specific actions to allocate memory. This signal is probably more informative than comparing memory usage to memory limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2244) RoutingTest.INETSockets fails
[ https://issues.apache.org/jira/browse/MESOS-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324677#comment-14324677 ] Dominic Hamon commented on MESOS-2244: -- is this an instance of mismatched libnl header/kernel? RoutingTestINETSockets fails - Key: MESOS-2244 URL: https://issues.apache.org/jira/browse/MESOS-2244 Project: Mesos Issue Type: Bug Environment: Ubuntu 14.10, libnl 3.2.25 Reporter: Evelina Dumitrescu Assignee: Chi Zhang [ RUN ] RoutingTest.INETSockets *** stack smashing detected ***: /home/evelina/mesos2/mesos/build/src/.libs/lt-mesos-tests terminated *** Aborted at 1421895912 (unix time) try date -d @1421895912 if you are using GNU date *** PC: @ 0x7f3566460d27 (unknown) *** SIGABRT (@0x3e81633) received by PID 5683 (TID 0x7f356c53a7c0) from PID 5683; stack trace: *** @ 0x7f35667fec90 (unknown) @ 0x7f3566460d27 (unknown) @ 0x7f3566462418 (unknown) @ 0x7f35664a29f4 (unknown) @ 0x7f35665365cc (unknown) @ 0x7f3566536570 (unknown) @ 0x7f3566226753 idiagnl_msg_parse @ 0x7f356622678b idiagnl_msg_parser @ 0x7f3565dac4c9 nl_cache_parse @ 0x7f3565dac51b update_msg_parser @ 0x7f3565db1fbf nl_recvmsgs_report @ 0x7f3565db2329 nl_recvmsgs @ 0x7f3565dab9c9 __cache_pickup @ 0x7f3565dac43d nl_cache_pickup @ 0x7f3565dac66e nl_cache_refill @ 0x7f3566226024 idiagnl_msg_alloc_cache @ 0x7f356a95f455 routing::diagnosis::socket::infos() @ 0x114da90 RoutingTest_INETSockets_Test::TestBody() @ 0x11e6957 testing::internal::HandleSehExceptionsInMethodIfSupported() @ 0x11e151d testing::internal::HandleExceptionsInMethodIfSupported() @ 0x11c7adb testing::Test::Run() @ 0x11c8253 testing::TestInfo::Run() @ 0x11c87f6 testing::TestCase::Run() @ 0x11cd987 testing::internal::UnitTestImpl::RunAllTests() @ 0x11e7905 testing::internal::HandleSehExceptionsInMethodIfSupported() @ 0x11e2304 testing::internal::HandleExceptionsInMethodIfSupported() @ 0x11cc74a testing::UnitTest::Run() @ 0xd7a4ad main @ 0x7f356644bec5 (unknown) @ 0x91ccb9 (unknown) 
[jira] [Updated] (MESOS-2123) Document changes in C++ Resources API in CHANGELOG.
[ https://issues.apache.org/jira/browse/MESOS-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2123: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Document changes in C++ Resources API in CHANGELOG. --- Key: MESOS-2123 URL: https://issues.apache.org/jira/browse/MESOS-2123 Project: Mesos Issue Type: Task Reporter: Jie Yu Labels: twitter With the refactor introduced in MESOS-1974, we need to document those API changes in CHANGELOG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1690) Expose metric for container destroy failures
[ https://issues.apache.org/jira/browse/MESOS-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1690: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Expose metric for container destroy failures Key: MESOS-1690 URL: https://issues.apache.org/jira/browse/MESOS-1690 Project: Mesos Issue Type: Bug Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Vinod Kone Increment counter when container destroy fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2332: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Paul Brett Assignee: Paul Brett Labels: features, twitter Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g. {noformat} $ tc -s -d qdisc show dev mesos19223 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc ingress : parent :fff1 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0) backlog 0b 0p requeues 0 {noformat} Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp tx/rx) + commenting of the protobuf fields so it's clear what these represent and how they are different to the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
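A quick way to see how such metrics could be extracted is to parse the `tc -s -d qdisc show` text quoted above for the dropped/overlimits/requeues triples. This regex sketch is only an illustration of reading the sample output; the isolator itself would use netlink rather than scrape tc:

```python
import re

# The qdisc report quoted in the ticket (reflowed onto separate lines).
sample = """\
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc ingress : parent :fff1
 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
"""

def parse_qdisc_stats(output):
    """Extract (qdisc, dropped, overlimits, requeues) tuples from
    'tc -s -d qdisc show' output."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
            for m in re.finditer(
                r"qdisc (\S+).*?\(dropped (\d+), overlimits (\d+) requeues (\d+)\)",
                output, re.S)]
```

Note the ingress qdisc's drop count is where the packet loss shows up, while overlimits on the egress side can exceed total packets sent, as the ticket warns.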
[jira] [Updated] (MESOS-2031) Manage persistent directories on slave.
[ https://issues.apache.org/jira/browse/MESOS-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2031: - Sprint: Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2) Manage persistent directories on slave. --- Key: MESOS-2031 URL: https://issues.apache.org/jira/browse/MESOS-2031 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Jie Yu Whenever a slave sees a persistent disk resource (in ExecutorInfo or TaskInfo) that is new to it, it will create a persistent directory which is for tasks to store persistent data. The slave needs to do the following after it's created: 1) symlink into the executor sandbox so that tasks/executor can see it 2) garbage collect it once it is released by the framework -- This message was sent by Atlassian JIRA (v6.3.4#6332)
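The two slave-side steps described above (symlink the persistent directory into the sandbox, garbage collect it on release) can be sketched as follows. The directory layout and function names are illustrative, not the slave's actual implementation:

```python
import os
import shutil

def expose_persistent_dir(persistent_root, volume_id, sandbox):
    """Create the persistent directory for `volume_id` if it is new, and
    symlink it into the executor sandbox so tasks/executor can see it."""
    target = os.path.join(persistent_root, volume_id)
    os.makedirs(target, exist_ok=True)
    link = os.path.join(sandbox, volume_id)
    if not os.path.islink(link):
        os.symlink(target, link)
    return link

def gc_persistent_dir(persistent_root, volume_id):
    """Garbage-collect the persistent directory once the framework releases
    it. (The real slave would also clean up the sandbox symlink.)"""
    shutil.rmtree(os.path.join(persistent_root, volume_id), ignore_errors=True)
```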
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3 (was: Twitter Mesos Q1 Sprint 2) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Assignee: Yan Xu Priority: Minor Labels: flaky, twitter Occured on review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (And doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2361) Add metrics to status update manager to expose number of outstanding (un-ack'ed) status updates
[ https://issues.apache.org/jira/browse/MESOS-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324975#comment-14324975 ] Dominic Hamon commented on MESOS-2361: -- the queue length is easily exposed as it is on Master and the Scheduler driver already. see src/master/metrics.hpp:157 - 159. Add metrics to status update manager to expose number of outstanding (un-ack'ed) status updates --- Key: MESOS-2361 URL: https://issues.apache.org/jira/browse/MESOS-2361 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen We have experienced custom executors with high volume of status updates cause congestion on the slave due to framework unavailability (either from being disconnected or not processing status updates fast enough). As a first step, it would be helpful to expose the status update stream/queue depths. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
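The comment points at the master's gauges (src/master/metrics.hpp:157-159) as the pattern to follow: a gauge polls a callable at snapshot time rather than being pushed to. A minimal pull-style sketch of that idea (the class and metric name are hypothetical, not the libprocess metrics API):

```python
class Metrics:
    """Minimal pull-style gauge registry, loosely modeled on libprocess
    gauges that evaluate a callable when a snapshot is requested."""

    def __init__(self):
        self._gauges = {}

    def add_gauge(self, name, fn):
        self._gauges[name] = fn

    def snapshot(self):
        return {name: fn() for name, fn in self._gauges.items()}

# The status update manager would register its queue depth:
pending = ["update-1", "update-2"]  # un-acked status updates
m = Metrics()
m.add_gauge("slave/status_updates/pending", lambda: len(pending))
```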
[jira] [Comment Edited] (MESOS-2344) segfaults running make check from ev integration
[ https://issues.apache.org/jira/browse/MESOS-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316853#comment-14316853 ] Dominic Hamon edited comment on MESOS-2344 at 2/11/15 7:45 PM: --- a different one now: {noformat} (gdb) bt #0 boost::range_detail::range_endmultihashmapint, process::Ownedprocess::PromiseOptionintconst (c=...) at 3rdparty/boost-1.53.0/boost/range/end.hpp:44 #1 0x7680c665 in boost::range_adl_barrier::endmultihashmapint, process::Ownedprocess::PromiseOptionint (r=...) at 3rdparty/boost-1.53.0/boost/range/end.hpp:113 #2 0x7680c4f5 in boost::foreach_detail_::endmultihashmapint, process::Ownedprocess::PromiseOptionint , mpl_::bool_true (col=...) at 3rdparty/boost-1.53.0/boost/foreach.hpp:714 #3 0x768096ae in multihashmapint, process::Ownedprocess::PromiseOptionint ::keys (this=0x7fffdc009878) at ../../../3rdparty/libprocess/3rdparty/stout/include/stout/multihashmap.hpp:74 #4 0x7680911e in process::ReaperProcess::wait (this=0x7fffdc009870) at ../../../3rdparty/libprocess/src/reap.cpp:82 #5 0x7680a968 in operator() (this=0x7fffe0004180, process=0x7fffdc0098a8) at ../../../3rdparty/libprocess/include/process/c++11/dispatch.hpp:78 #6 0x7680a612 in std::_Function_handlervoid (process::ProcessBase*), void process::dispatchprocess::ReaperProcess(process::PIDprocess::ReaperProcess const, void (process::ReaperProcess::*)())::{lambda(process::ProcessBase*)#1}::_M_invoke(std::_Any_data const, process::ProcessBase*) (__functor=..., __args=0x7fffdc0098a8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2071 #7 0x767b4388 in std::functionvoid (process::ProcessBase*)::operator()(process::ProcessBase*) const (this=0x7fffe0029f00, __args=0x7fffdc0098a8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2464 #8 0x767a31b4 in process::ProcessBase::visit (this=0x7fffdc0098a8, event=...) 
at ../../../3rdparty/libprocess/src/process.cpp:2764 #9 0x767ece5e in process::DispatchEvent::visit (this=0x7fffe0010a90, visitor=0x7fffdc0098a8) at ../../../3rdparty/libprocess/include/process/event.hpp:141 #10 0x008cb061 in process::ProcessBase::serve (this=0x7fffdc0098a8, event=...) at ../../3rdparty/libprocess/include/process/process.hpp:39 #11 0x7679355d in process::ProcessManager::resume (this=0x3334bb0, process=0x7fffdc0098a8) at ../../../3rdparty/libprocess/src/process.cpp:2238 #12 0x76792d8e in process::schedule (arg=0x0) at ../../../3rdparty/libprocess/src/process.cpp:655 #13 0x721b5182 in start_thread (arg=0x7fffe9dce700) at pthread_create.c:312 #14 0x71ee200d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {noformat} though this might be related if the DispatchEvent holds a function object that's been destroyed. any other non-POD static function objects around the place?
[jira] [Commented] (MESOS-2344) segfaults running make check from ev integration
[ https://issues.apache.org/jira/browse/MESOS-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316896#comment-14316896 ] Dominic Hamon commented on MESOS-2344: -- commit 95c448f77731034114183fc5f5bf6e040d4c0f5d (HEAD, origin/master, origin/HEAD, nonpod.clock, master) Author: Dominic Hamon dha...@twitter.com Commit: Dominic Hamon dha...@twitter.com Remove more non-pod statics from clock Review: https://reviews.apache.org/r/30886 segfaults running make check from ev integration Key: MESOS-2344 URL: https://issues.apache.org/jira/browse/MESOS-2344 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Dominic Hamon Assignee: Joris Van Remoortere Priority: Blocker Running make check on Ubuntu under gdb, I've seen a number of segfaults from the {{process::EventLoop}}. Stack traces and debugging sessions below: {noformat} (gdb) bt #0 0x00789c71 in std::movestd::_Tuple_impl2ul (__t=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/move.h:102 #1 0x76821148 in std::_Tuple_impl1, void (*)()::_Tuple_impl(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273) ( this=0x7fffe00228d8, __in=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:270 #2 0x768210a4 in std::_Tuple_impl0, Duration, void (*)()::_Tuple_impl(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7) (this=0x7fffe00228d8, __in=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:271 #3 0x76821068 in std::tupleDuration, void (*)()::tuple(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71c4) ( this=0x7fffe00228d8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:542 #4 0x76821014 in std::_Bindprocess::FutureNothing 
(*(Duration, void (*)()))(const Duration , void (*)())::_Bind(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d) (this=0x7fffe00228d0, __b=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1342 #5 0x76820f86 in std::_Function_base::_Base_managerstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()) ::_M_init_functor(std::_Any_data, std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()), std::integral_constantbool, false) (__functor=..., __f=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f714b) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1987 #6 0x76820ab0 in std::_Function_base::_Base_managerstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()) ::_M_init_functor(std::_Any_data, std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)())) (__functor=..., __f=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7115) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1958 #7 0x768208e6 in std::functionprocess::FutureNothing ()::functionstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(const Duration , void (*)()), void(std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(const Duration , void (*)())) (this=0x7fffe85ca9d0, __f=...) 
at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2451 #8 0x7681fe55 in process::EventLoop::delay (duration=..., function=0x76729580 process::tick()) at ../../../3rdparty/libprocess/src/libev.cpp:98 #9 0x7672a151 in process::tick () at ../../../3rdparty/libprocess/src/clock.cpp:125 #10 0x7681fcb2 in process::internal::handle_delay (loop=0x77dd91f0 default_loop_struct, timer=0x7fffe00279b0, revents=256) at ../../../3rdparty/libprocess/src/libev.cpp:64 #11 0x7685f8c5 in ev_invoke_pending (loop=0x77dd91f0 default_loop_struct) at ev.c:2994 #12 0x76860803 in ev_run (loop=0x77dd91f0 default_loop_struct, flags=optimized out) at ev.c:3394 #13 0x7681fffb in ev_loop (loop=0x77dd91f0 default_loop_struct, flags=0) at 3rdparty/libev-4.15/ev.h:826 #14 0x7681ff49 in process::EventLoop::run () at ../../../3rdparty/libprocess/src/libev.cpp:114 #15 0x721d2182 in start_thread
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2332: - Story Points: 5 Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Paul Brett Assignee: Paul Brett Labels: features, twitter Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g. {noformat} $ tc -s -d qdisc show dev mesos19223 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc ingress : parent :fff1 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0) backlog 0b 0p requeues 0 {noformat} Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp tx/rx) + commenting of the protobuf fields so it's clear what these represent and how they are different to the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2332: - Component/s: twitter Sprint: Twitter Mesos Q1 Sprint 2 Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation, twitter Reporter: Paul Brett Labels: features Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g. {noformat} $ tc -s -d qdisc show dev mesos19223 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc ingress : parent :fff1 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0) backlog 0b 0p requeues 0 {noformat} Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp tx/rx) + commenting of the protobuf fields so it's clear what these represent and how they are different to the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2332: - Assignee: Paul Brett Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation, twitter Reporter: Paul Brett Assignee: Paul Brett Labels: features Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g. {noformat} $ tc -s -d qdisc show dev mesos19223 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc ingress : parent :fff1 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0) backlog 0b 0p requeues 0 {noformat} Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp tx/rx) + commenting of the protobuf fields so it's clear what these represent and how they are different to the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1956) Add IPv6 ICMPv6 libnl traffic control U32 filters
[ https://issues.apache.org/jira/browse/MESOS-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14314529#comment-14314529 ] Dominic Hamon commented on MESOS-1956: -- When we wrote the port mapping isolator, it was to deal with the constraint of not having enough IP addresses. If we have IPv6 available, we should be able to ensure each container gets its own IP address, so the port mapping isolator shouldn't be needed. When we initialize the port mapping isolator, can we check whether we're in an IPv4 or IPv6 world? I'm OK with the port mapping isolator only working with IPv4. [~idownes] do you agree? Add IPv6 ICMPv6 libnl traffic control U32 filters --- Key: MESOS-1956 URL: https://issues.apache.org/jira/browse/MESOS-1956 Project: Mesos Issue Type: Task Components: isolation Reporter: Evelina Dumitrescu Assignee: Evelina Dumitrescu For IPv6, the filtering should be done by source and destination ports, destination IP, destination MAC. For ICMPv6, the filtering should be done by protocol and destination IP. The IPv6/IPv4 distinction could be made from the source/destination IP type in the classifier. IPv4 packets with options in the header are currently ignored due to a bug in libnl. It should be investigated whether the problem also occurs with IPv6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
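The "check if we're in IPv4 or IPv6 world" idea from the comment above amounts to classifying the slave's configured address by family at isolator initialization. A minimal sketch using POSIX `inet_pton` — the `classify` helper is a hypothetical name, not an existing Mesos function:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>

#include <cassert>
#include <string>

// Hypothetical init-time check: classify an address string so the
// port mapping isolator could refuse to start in an IPv6-only setup.
enum class IpFamily { V4, V6, INVALID };

IpFamily classify(const std::string& address) {
  unsigned char buffer[sizeof(struct in6_addr)];
  if (inet_pton(AF_INET, address.c_str(), buffer) == 1) {
    return IpFamily::V4;
  }
  if (inet_pton(AF_INET6, address.c_str(), buffer) == 1) {
    return IpFamily::V6;
  }
  return IpFamily::INVALID;
}
```

With such a check, the isolator could fail fast with a clear error on IPv6 rather than misbehaving at runtime.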
[jira] [Created] (MESOS-2344) segfaults running make check from ev integration
Dominic Hamon created MESOS-2344: Summary: segfaults running make check from ev integration Key: MESOS-2344 URL: https://issues.apache.org/jira/browse/MESOS-2344 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Dominic Hamon Running make check on Ubuntu under gdb, I've seen a number of segfaults from the {{process::EventLoop}}. Stack traces and debugging sessions below: {noformat} (gdb) bt #0 0x00789c71 in std::movestd::_Tuple_impl2ul (__t=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/move.h:102 #1 0x76821148 in std::_Tuple_impl1, void (*)()::_Tuple_impl(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273) ( this=0x7fffe00228d8, __in=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7273) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:270 #2 0x768210a4 in std::_Tuple_impl0, Duration, void (*)()::_Tuple_impl(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7) (this=0x7fffe00228d8, __in=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71f7) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:271 #3 0x76821068 in std::tupleDuration, void (*)()::tuple(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f71c4) ( this=0x7fffe00228d8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/tuple:542 #4 0x76821014 in std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(const Duration , void (*)())::_Bind(unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d) (this=0x7fffe00228d0, __b=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f718d) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1342 #5 0x76820f86 in std::_Function_base::_Base_managerstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()) 
::_M_init_functor(std::_Any_data, std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()), std::integral_constantbool, false) (__functor=..., __f=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f714b) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1987 #6 0x76820ab0 in std::_Function_base::_Base_managerstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)()) ::_M_init_functor(std::_Any_data, std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(Duration const, void (*)())) (__functor=..., __f=unknown type in build/src/.libs/libmesos-0.22.0.so, CU 0x27e516d, DIE 0x27f7115) at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:1958 #7 0x768208e6 in std::functionprocess::FutureNothing ()::functionstd::_Bindprocess::FutureNothing (*(Duration, void (*)()))(const Duration , void (*)()), void(std::_Bindprocess::FutureNothing (*(Duration, void (*)()))(const Duration , void (*)())) (this=0x7fffe85ca9d0, __f=...) 
at /usr/bin/../lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/functional:2451 #8 0x7681fe55 in process::EventLoop::delay (duration=..., function=0x76729580 process::tick()) at ../../../3rdparty/libprocess/src/libev.cpp:98 #9 0x7672a151 in process::tick () at ../../../3rdparty/libprocess/src/clock.cpp:125 #10 0x7681fcb2 in process::internal::handle_delay (loop=0x77dd91f0 default_loop_struct, timer=0x7fffe00279b0, revents=256) at ../../../3rdparty/libprocess/src/libev.cpp:64 #11 0x7685f8c5 in ev_invoke_pending (loop=0x77dd91f0 default_loop_struct) at ev.c:2994 #12 0x76860803 in ev_run (loop=0x77dd91f0 default_loop_struct, flags=optimized out) at ev.c:3394 #13 0x7681fffb in ev_loop (loop=0x77dd91f0 default_loop_struct, flags=0) at 3rdparty/libev-4.15/ev.h:826 #14 0x7681ff49 in process::EventLoop::run () at ../../../3rdparty/libprocess/src/libev.cpp:114 #15 0x721d2182 in start_thread (arg=0x7fffe85cb700) at pthread_create.c:312 #16 0x71eff00d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 (gdb) frame 8 #8 0x7681fe55 in process::EventLoop::delay (duration=..., function=0x76729580 process::tick()) at ../../../3rdparty/libprocess/src/libev.cpp:98 98run_in_event_loopNothing( (gdb) list 93 } // namespace internal { 94 95 96 void EventLoop::delay(const Duration duration, void(*function)(void)) 97 { 98run_in_event_loopNothing( 99lambda::bind(internal::delay, duration, function)); 100 } 101 102 (gdb) p duration $1 = (const Duration ) @0x7fffe000da90:
[jira] [Commented] (MESOS-1403) Segfault when starting a slave locally.
[ https://issues.apache.org/jira/browse/MESOS-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315291#comment-14315291 ] Dominic Hamon commented on MESOS-1403: -- cc [~benjaminhindman] [~jvanremoortere] Segfault when starting a slave locally. --- Key: MESOS-1403 URL: https://issues.apache.org/jira/browse/MESOS-1403 Project: Mesos Issue Type: Bug Affects Versions: 0.19.0 Reporter: Benjamin Mahler This is from the build directory on a CentOS machine. {noformat} [bmahler@foobar build]$ sudo ./bin/mesos-slave.sh --master=localhost:5050 [sudo] password for bmahler: I0522 01:01:02.639114 4605 main.cpp:126] Build: 2014-05-06 22:08:34 by root I0522 01:01:02.639277 4605 main.cpp:128] Version: 0.19.0 I0522 01:01:02.639312 4605 mesos_containerizer.cpp:124] Using isolation: posix/cpu,posix/mem I0522 01:01:02.642699 4605 main.cpp:149] Starting Mesos slave I0522 01:01:02.644693 4631 slave.cpp:143] Slave started on 1)@IP:5051 I0522 01:01:02.645560 4631 slave.cpp:255] Slave resources: cpus(*):24; mem(*):71322; disk(*):454895; ports(*):[31000-32000] I0522 01:01:02.647763 4631 slave.cpp:283] Slave hostname: foobar I0522 01:01:02.647790 4631 slave.cpp:284] Slave checkpoint: true I0522 01:01:02.651803 4625 state.cpp:33] Recovering state from '/tmp/mesos/meta' I0522 01:01:02.653393 4625 status_update_manager.cpp:193] Recovering status update manager I0522 01:01:02.654024 4643 mesos_containerizer.cpp:281] Recovering containerizer I0522 01:01:02.655377 4639 slave.cpp:2988] Finished recovery I0522 01:01:02.656368 4639 slave.cpp:536] New master detected at master@127.0.0.1:5050 I0522 01:01:02.656682 4639 slave.cpp:572] No credentials provided. 
Attempting to register without authentication I0522 01:01:02.656744 4629 status_update_manager.cpp:167] New master detected at master@127.0.0.1:5050 I0522 01:01:02.656754 4639 slave.cpp:585] Detecting new master *** Aborted at 1400720462 (unix time) try date -d @1400720462 if you are using GNU date *** I0522 01:01:02.656982 4639 slave.cpp:2194] master@127.0.0.1:5050 exited W0522 01:01:02.657004 4639 slave.cpp:2197] Master disconnected! Waiting for a new master to be elected PC: @ 0x7f4a9e3faff6 std::_Deque_base::_M_destroy_nodes() *** SIGSEGV (@0x31) received by PID 4605 (TID 0x7f4a8c1d0940) from PID 49; stack trace: *** @ 0x7f4a9baefca0 (unknown) @ 0x7f4a9e3faff6 std::_Deque_base::_M_destroy_nodes() @ 0x7f4a9e3ecdaf std::_Deque_base::~_Deque_base() @ 0x7f4a9e3e2bd5 std::deque::~deque() @ 0x7f4a9e3dfe10 process::DataDecoder::~DataDecoder() @ 0x7f4a9e3ba9bc process::receiving_connect() @ 0x7f4a9e506bc5 ev_invoke_pending @ 0x7f4a9e509af5 ev_run @ 0x7f4a9e3b5928 ev_loop @ 0x7f4a9e3bb2d9 process::serve() @ 0x7f4a9bae783d start_thread @ 0x7f4a9a84f26d clone /var/tmp/scltOMGb3: line 8: 4605 Segmentation fault './bin/mesos-slave.sh' '--master=localhost:5050' {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2288) HTTP API for interacting with Mesos
[ https://issues.apache.org/jira/browse/MESOS-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2288: - Labels: twitter (was: ) HTTP API for interacting with Mesos --- Key: MESOS-2288 URL: https://issues.apache.org/jira/browse/MESOS-2288 Project: Mesos Issue Type: Epic Reporter: Vinod Kone Labels: twitter Currently Mesos frameworks (schedulers and executors) interact with Mesos (masters and slaves) via drivers provided by Mesos. While the driver helped in providing some common functionality for all frameworks (master detection, authentication, validation, etc.), it has several drawbacks. -- Frameworks need to depend on a native library, which makes their build/deploy process cumbersome. -- Pure language frameworks cannot use off-the-shelf libraries to interact with the undocumented API used by the driver. -- It makes it hard for developers to implement new APIs (a lot of boilerplate code to write). This proposal is for Mesos to provide a well-documented public HTTP API that frameworks (and maybe operators) can use to interact with Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1708) Using the wrong resource name should report a better error.
[ https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-1708: Assignee: Dominic Hamon Using the wrong resource name should report a better error. - Key: MESOS-1708 URL: https://issues.apache.org/jira/browse/MESOS-1708 Project: Mesos Issue Type: Bug Components: framework, master Reporter: Benjamin Hindman Assignee: Dominic Hamon Labels: newbie, twitter If a scheduler launches a task using resources the master doesn't know about, the task validator causes the task to fail, but the error message is not very helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
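One plausible shape for the "better error" the ticket asks for is to name the offending resource and list the resources the master does know about. A hedged sketch — `validateResourceName` is a hypothetical helper, not the actual Mesos task validator:

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Hypothetical validation sketch: when a task requests a resource name
// the master doesn't know (e.g. "cpu" instead of "cpus"), produce an
// error naming the offender and listing the valid names, instead of a
// bare failure. Returns an empty string when the name is valid.
std::string validateResourceName(
    const std::string& requested,
    const std::set<std::string>& known) {
  if (known.count(requested) > 0) {
    return "";  // Valid: no error.
  }
  std::ostringstream error;
  error << "Unknown resource '" << requested << "'; known resources are:";
  for (const std::string& name : known) {
    error << " " << name;
  }
  return error.str();
}
```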
[jira] [Updated] (MESOS-2289) Design doc for the HTTP API
[ https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2289: - Assignee: Vinod Kone Design doc for the HTTP API --- Key: MESOS-2289 URL: https://issues.apache.org/jira/browse/MESOS-2289 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Vinod Kone This tracks the design of the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2289) Design doc for the HTTP API
[ https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2289: - Sprint: Twitter Mesos Q1 Sprint 2 Design doc for the HTTP API --- Key: MESOS-2289 URL: https://issues.apache.org/jira/browse/MESOS-2289 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Vinod Kone This tracks the design of the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2255) SlaveRecoveryTest/0.MasterFailover is flaky
[ https://issues.apache.org/jira/browse/MESOS-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2255: - Labels: flaky twitter (was: flaky) SlaveRecoveryTest/0.MasterFailover is flaky --- Key: MESOS-2255 URL: https://issues.apache.org/jira/browse/MESOS-2255 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Yan Xu Labels: flaky, twitter {noformat:title=} [ RUN ] SlaveRecoveryTest/0.MasterFailover Using temporary directory '/tmp/SlaveRecoveryTest_0_MasterFailover_dtF7o0' I0123 07:45:49.818686 17634 leveldb.cpp:176] Opened db in 31.195549ms I0123 07:45:49.821962 17634 leveldb.cpp:183] Compacted db in 3.190936ms I0123 07:45:49.822049 17634 leveldb.cpp:198] Created db iterator in 47324ns I0123 07:45:49.822069 17634 leveldb.cpp:204] Seeked to beginning of db in 2038ns I0123 07:45:49.822084 17634 leveldb.cpp:273] Iterated through 0 keys in the db in 484ns I0123 07:45:49.822160 17634 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0123 07:45:49.824241 17660 recover.cpp:449] Starting replica recovery I0123 07:45:49.825217 17660 recover.cpp:475] Replica is in EMPTY status I0123 07:45:49.827020 17660 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0123 07:45:49.827453 17659 recover.cpp:195] Received a recover response from a replica in EMPTY status I0123 07:45:49.828047 17659 recover.cpp:566] Updating replica status to STARTING I0123 07:45:49.838543 17659 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 10.24963ms I0123 07:45:49.838580 17659 replica.cpp:323] Persisted replica status to STARTING I0123 07:45:49.848836 17659 recover.cpp:475] Replica is in STARTING status I0123 07:45:49.850039 17659 replica.cpp:641] Replica in STARTING status received a broadcasted recover request I0123 07:45:49.850286 17659 recover.cpp:195] Received a recover response from a replica in STARTING status I0123 07:45:49.850754 
17659 recover.cpp:566] Updating replica status to VOTING I0123 07:45:49.853698 17655 master.cpp:262] Master 20150123-074549-16842879-44955-17634 (utopic) started on 127.0.1.1:44955 I0123 07:45:49.853981 17655 master.cpp:308] Master only allowing authenticated frameworks to register I0123 07:45:49.853997 17655 master.cpp:313] Master only allowing authenticated slaves to register I0123 07:45:49.854038 17655 credentials.hpp:36] Loading credentials for authentication from '/tmp/SlaveRecoveryTest_0_MasterFailover_dtF7o0/credentials' I0123 07:45:49.854557 17655 master.cpp:357] Authorization enabled I0123 07:45:49.859633 17659 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 8.742923ms I0123 07:45:49.859853 17659 replica.cpp:323] Persisted replica status to VOTING I0123 07:45:49.860327 17658 recover.cpp:580] Successfully joined the Paxos group I0123 07:45:49.860703 17654 recover.cpp:464] Recover process terminated I0123 07:45:49.859591 17655 master.cpp:1219] The newly elected leader is master@127.0.1.1:44955 with id 20150123-074549-16842879-44955-17634 I0123 07:45:49.864702 17655 master.cpp:1232] Elected as the leading master! 
I0123 07:45:49.864904 17655 master.cpp:1050] Recovering from registrar I0123 07:45:49.865406 17660 registrar.cpp:313] Recovering registrar I0123 07:45:49.866576 17660 log.cpp:660] Attempting to start the writer I0123 07:45:49.868638 17658 replica.cpp:477] Replica received implicit promise request with proposal 1 I0123 07:45:49.872521 17658 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 3.848859ms I0123 07:45:49.872555 17658 replica.cpp:345] Persisted promised to 1 I0123 07:45:49.873769 17661 coordinator.cpp:230] Coordinator attemping to fill missing position I0123 07:45:49.875474 17658 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2 I0123 07:45:49.880878 17658 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 5.364021ms I0123 07:45:49.880913 17658 replica.cpp:679] Persisted action at 0 I0123 07:45:49.882619 17657 replica.cpp:511] Replica received write request for position 0 I0123 07:45:49.882998 17657 leveldb.cpp:438] Reading position from leveldb took 150092ns I0123 07:45:49.886488 17657 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 3.269189ms I0123 07:45:49.886536 17657 replica.cpp:679] Persisted action at 0 I0123 07:45:49.887181 17657 replica.cpp:658] Replica received learned notice for position 0 I0123 07:45:49.892900 17657 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 5.690093ms I0123 07:45:49.892935 17657 replica.cpp:679] Persisted action at 0 I0123 07:45:49.892956 17657 replica.cpp:664] Replica learned NOP action at
[jira] [Updated] (MESOS-2300) Failing tests on 0.21.1 with Ubuntu 14.10 / Linux 3.16.0-23
[ https://issues.apache.org/jira/browse/MESOS-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2300: - Labels: cgroups test twitter (was: cgroups test) Failing tests on 0.21.1 with Ubuntu 14.10 / Linux 3.16.0-23 --- Key: MESOS-2300 URL: https://issues.apache.org/jira/browse/MESOS-2300 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.1 Environment: (Though the hostname of this box is {{docker1}}, this is not running on a docker container. This box sits on vanilla hardware, and happens to also be used as a docker server. Though not when I ran the offending tests.) {code} huitseeker@docker1:~$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 14.10 Release: 14.10 Codename: utopic {code} {code} huitseeker@docker1:~$ uname -a Linux docker1 3.16.0-23-generic #31-Ubuntu SMP Tue Oct 21 17:56:17 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux }} {code} Mesos retrieved from {{http://git-wip-us.apache.org/repos/asf/mesos.git}} And compiled from git tag {{0.21.1}} (currently resolves to {{2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8}}). Box is a clean, ansible-generated Ubuntu with cgmanager disabled, and the following packages installed on top of the usual mesos dependencies: - cgroup-lite (service is enabled and started) - linux-tools-common - linux-tools-generic - linux-cloud-tools-generic - linux-tools-3.16.0-23-generic - linux-cloud-tools-3.16.0-23-generic Reporter: François Garillot Labels: cgroups, test, twitter During make check : {code} [--] Global test environment tear-down [==] 503 tests from 89 test cases ran. (387352 ms total) [ PASSED ] 499 tests. 
[ FAILED ] 4 tests, listed below: [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups [ FAILED ] NsTest.ROOT_setns [ FAILED ] PerfTest.ROOT_SampleInit {code} Details: {code} [ RUN ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get ../../src/tests/cgroups_tests.cpp:364: Failure Value of: mesos_test2 Expected: cgroups.get()[0] Which is: mesos [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get (10 ms) [ RUN ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups ../../src/tests/cgroups_tests.cpp:392: Failure Value of: path::join(TEST_CGROUPS_ROOT, 2) Actual: mesos_test/2 Expected: cgroups.get()[0] Which is: mesos_test/1 ../../src/tests/cgroups_tests.cpp:393: Failure Value of: path::join(TEST_CGROUPS_ROOT, 1) Actual: mesos_test/1 Expected: cgroups.get()[1] Which is: mesos_test/2 [ FAILED ] CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups (12 ms) {code} {code} [ RUN ] NsTest.ROOT_setns ../../src/tests/ns_tests.cpp:123: Failure Value of: status.get().get() Actual: 256 Expected: 0 [ FAILED ] NsTest.ROOT_setns (93 ms) {code} {code} [ RUN ] PerfTest.ROOT_SampleInit ../../src/tests/perf_tests.cpp:143: Failure Expected: (0u) (statistics.get().cycles()), actual: 0 vs 0 ../../src/tests/perf_tests.cpp:146: Failure Expected: (0.0) (statistics.get().task_clock()), actual: 0 vs 0 [ FAILED ] PerfTest.ROOT_SampleInit (1078 ms) {code} Those tests have been run in parallel (-j 8) as well as sequentially (-j 1), no difference. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-621) HierarchicalAllocator::slaveRemoved doesn't properly handle framework allocations/resources
[ https://issues.apache.org/jira/browse/MESOS-621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon resolved MESOS-621. - Resolution: Won't Fix HierarchicalAllocator::slaveRemoved doesn't properly handle framework allocations/resources --- Key: MESOS-621 URL: https://issues.apache.org/jira/browse/MESOS-621 Project: Mesos Issue Type: Bug Components: allocation, technical debt Reporter: Vinod Kone Labels: twitter Currently a slaveRemoved() simply removes the slave from 'slaves' map and slave's resources from 'roleSorter'. Looking at resourcesRecovered(), more things need to be done when a slave is removed (e.g., framework unallocations). It would be nice to fix this and have a test for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2238) Use Owned for Process pointers in wrapper classes
[ https://issues.apache.org/jira/browse/MESOS-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2238: - Labels: easyfix newbie (was: easyfix) Use Owned for Process pointers in wrapper classes --- Key: MESOS-2238 URL: https://issues.apache.org/jira/browse/MESOS-2238 Project: Mesos Issue Type: Improvement Reporter: Alexander Rukletsov Labels: easyfix, newbie A common pattern in our code (see e.g. {{Isolator}}, {{DockerContainerizer}}, {{Allocator}}) is to wrap a Process-based class in a non-Process one. However, our code base is inconsistent in how we store the pointer to the underlying class: in some places we wrap it in {{Owned}} (see e.g. {{Isolator}}, {{DockerContainerizer}}), in others it is a raw pointer (see e.g. {{Allocator}}, {{ExternalContainerizer}}). Using {{Owned}} for this case is preferable, since it signals the correct ownership semantics and intention to the reader. For consistency, sweep through the code base and replace the raw pointers with their {{Owned}} counterparts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
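The wrapper pattern the ticket describes can be sketched as follows, with `std::unique_ptr` standing in for libprocess's {{Owned}} and a toy `AllocatorProcess` standing in for a real Process-based class (both stand-ins are assumptions for illustration):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in for a Process-based implementation class.
class AllocatorProcess {
 public:
  std::string allocate() { return "allocated"; }
};

// The non-Process wrapper owns its implementation via a smart pointer
// (std::unique_ptr here, Owned in Mesos), so destruction is automatic
// and ownership is visible in the type -- the consistency MESOS-2238
// asks for, versus a raw pointer plus a manual `delete` in ~Allocator().
class Allocator {
 public:
  Allocator() : process(new AllocatorProcess()) {}

  std::string allocate() { return process->allocate(); }

 private:
  std::unique_ptr<AllocatorProcess> process;
};
```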
[jira] [Commented] (MESOS-181) Virtual Machine Isolation Module
[ https://issues.apache.org/jira/browse/MESOS-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14305765#comment-14305765 ] Dominic Hamon commented on MESOS-181: - It doesn't seem likely that we're going to integrate this any time soon. Shall we close out the issue? Virtual Machine Isolation Module Key: MESOS-181 URL: https://issues.apache.org/jira/browse/MESOS-181 Project: Mesos Issue Type: Story Components: isolation, slave Environment: Ubuntu 11.04, Ubuntu 11.10 Reporter: Charles Earl Priority: Minor Labels: virtualiztion Earlier in the year I implemented a virtual machine isolation module. This module uses libvirt to launch and manage virtual machine containers. The code is still rough and I have done only basic testing with the Spark example. This code works with the KVM (http://www.linux-kvm.org/page/Main_Page) virtual machine manager. I've placed the relevant code in a branch called mesos-vm, for now located at https://github.com/charlescearl/VirtualMesos. The code is based upon the mesos lxc isolation module that is located in src/slave/lxc_isolation_module.cpp/.hpp. My code is based on the mesos master branch dated Wed Nov 23 12:02:07 2011 -0800, commit 059aabb2ec5bd7b20ed08ab9c439531a352ba3ec. I'll generate a patch soon for this. Suggestions appreciated on whether this is the appropriate branch/commit to patch against. Most of the implementation is contained in vm_isolation_module.cpp and vm_isolation_module.hpp and there are some minor additions in launcher to handle setup of the environment for the virtual machine. I use the libvirt (http://libvirt.org/) library to manage the virtual machine container in which the jobs are executed. Dependencies The code has been tested on Ubuntu 11.04 and 11.10 and depends on libpython2.6 and libvirt0 Configuration of the virtual machine container The virtual machine invocation depends upon a few configuration assumptions: 1. ssh public keys installed on the container.
I assume that the container is set up to allow password-less secure access. 2. Directory structure on the container matches the servant machine. For example, in invoking a spark executor, assume that the paths match the setup on the container host. Running it In the $MESOS_HOME/conf/mesos.conf file add the line isolation=vm to use the virtual machine isolation. The Mesos slave is invoked with the isolation parameter set to vm. For example: sudo bin/mesos-slave -m mesos://master@mesos-host:5050 -w 9839 --isolation=vm Rough description of how it works The `vm_isolation_module` class forks a process that in turn launches a virtual machine. A routine located in bin called find_addr.pl is responsible for figuring out the IP address of the launched virtual machine. This is probably not portable since it is explicitly looking for an entry in the virbr0 network. A script vmLauncherTemplate.sh located in bin assists the vmLauncher method in setting up the environment for launching tasks inside the virtual machine. The vmLauncher method uses vmLauncherTemplate.sh to create a task-specific shell script, vmLauncherTemplate-task_id.sh, which is copied to the running guest and used to run the executor inside the VM. This communicates with the slave on the host. Comments and suggestions on improvements and next directions are appreciated!
[jira] [Updated] (MESOS-2277) Document undocumented HTTP endpoints
[ https://issues.apache.org/jira/browse/MESOS-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2277: - Labels: documentation newbie starter (was: starter) Document undocumented HTTP endpoints Key: MESOS-2277 URL: https://issues.apache.org/jira/browse/MESOS-2277 Project: Mesos Issue Type: Improvement Reporter: Niklas Quarfot Nielsen Priority: Minor Labels: documentation, newbie, starter Did a quick scan and we are missing documentation for a few endpoints: {code} files/browse.json files/read.json files/download.json files/debug.json master/roles.json master/state.json master/stats.json slave/state.json slave/stats.json {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-181) Virtual Machine Isolation Module
[ https://issues.apache.org/jira/browse/MESOS-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon resolved MESOS-181. - Resolution: Won't Fix Sadly, our isolation efforts have diverged from the initial effort here. If we do ever provide VM isolation, we'll need to carefully determine requirements first and then develop a solution. Virtual Machine Isolation Module Key: MESOS-181 URL: https://issues.apache.org/jira/browse/MESOS-181 Project: Mesos Issue Type: Story Components: isolation, slave Environment: Ubuntu 11.04, Ubuntu 11.10 Reporter: Charles Earl Priority: Minor Labels: virtualiztion Earlier in the year I implemented a virtual machine isolation module. This module uses libvirt to launch and manage virtual machine containers. The code is still rough and I have done only basic testing with the Spark example. This code works with the KVM (http://www.linux-kvm.org/page/Main_Page) virtual machine manager. I've placed the relevant code in a branch called mesos-vm, for now located at https://github.com/charlescearl/VirtualMesos. The code is based upon the mesos lxc isolation module that is located in src/slave/lxc_isolation_module.cpp/.hpp. My code is based on the mesos master branch dated Wed Nov 23 12:02:07 2011 -0800, commit 059aabb2ec5bd7b20ed08ab9c439531a352ba3ec. I'll generate a patch soon for this. Suggestions appreciated on whether this is the appropriate branch/commit to patch against. Most of the implementation is contained in vm_isolation_module.cpp and vm_isolation_module.hpp and there are some minor additions in launcher to handle setup of the environment for the virtual machine. I use the libvirt (http://libvirt.org/) library to manage the virtual machine container in which the jobs are executed. Dependencies The code has been tested on Ubuntu 11.04 and 11.10 and depends on libpython2.6 and libvirt0 Configuration of the virtual machine container The virtual machine invocation depends upon a few configuration assumptions: 1.
ssh public keys installed on the container. I assume that the container is set up to allow password-less secure access. 2. Directory structure on the container matches the servant machine. For example, in invoking a spark executor, assume that the paths match the setup on the container host. Running it In the $MESOS_HOME/conf/mesos.conf file add the line isolation=vm to use the virtual machine isolation. The Mesos slave is invoked with the isolation parameter set to vm. For example: sudo bin/mesos-slave -m mesos://master@mesos-host:5050 -w 9839 --isolation=vm Rough description of how it works The `vm_isolation_module` class forks a process that in turn launches a virtual machine. A routine located in bin called find_addr.pl is responsible for figuring out the IP address of the launched virtual machine. This is probably not portable since it is explicitly looking for an entry in the virbr0 network. A script vmLauncherTemplate.sh located in bin assists the vmLauncher method in setting up the environment for launching tasks inside the virtual machine. The vmLauncher method uses vmLauncherTemplate.sh to create a task-specific shell script, vmLauncherTemplate-task_id.sh, which is copied to the running guest and used to run the executor inside the VM. This communicates with the slave on the host. Comments and suggestions on improvements and next directions are appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (MESOS-2138) Add an Offer::Operation message for Dynamic Reservations
[ https://issues.apache.org/jira/browse/MESOS-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reopened MESOS-2138: -- Assignee: Michael Park (was: Benjamin Mahler) Add an Offer::Operation message for Dynamic Reservations - Key: MESOS-2138 URL: https://issues.apache.org/jira/browse/MESOS-2138 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Labels: protobuf Fix For: 0.22.0 A framework now has a notion of *accepting* offers that it was given (via {{acceptOffers}}) and is able to specify a sequence of operations to perform (via a sequence of {{Offer::Operation}}). {{Launch}} is one of the possible {{Offer::Operation}}s, which means {{LaunchTasks}} is equivalent to a sequence of {{Offer::Operation}}s consisting only of {{Launch}} operations. The goal of this ticket is to add {{Reserve}} and {{Unreserve}} messages as possible {{Offer::Operation}}s. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
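The relationship between {{LaunchTasks}} and a sequence of {{Offer::Operation}}s can be sketched as below. This is not the actual Mesos protobuf definition, just an illustrative C++ stand-in for the shape of the API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the Offer::Operation protobuf: each operation has a
// type and (here, simplified to a string) a per-type message body.
enum class OperationType { LAUNCH, RESERVE, UNRESERVE };

struct Operation {
  OperationType type;
  std::string payload;
};

// The old "launch tasks" call is just the all-Launch special case of
// accepting an offer with a sequence of operations.
bool isPlainLaunch(const std::vector<Operation>& operations) {
  for (const Operation& op : operations) {
    if (op.type != OperationType::LAUNCH) {
      return false;
    }
  }
  return true;
}
```

Adding {{Reserve}} and {{Unreserve}} then amounts to new enum cases (and their message bodies) handled by the master when an offer is accepted.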
[jira] [Closed] (MESOS-2138) Add an Offer::Operation message for Dynamic Reservations
[ https://issues.apache.org/jira/browse/MESOS-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon closed MESOS-2138. Resolution: Fixed Add an Offer::Operation message for Dynamic Reservations - Key: MESOS-2138 URL: https://issues.apache.org/jira/browse/MESOS-2138 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Labels: protobuf Fix For: 0.22.0 A framework now has a notion of *accepting* offers that it was given (via {{acceptOffers}}) and is able to specify a sequence of operations to perform (via a sequence of {{Offer::Operation}}). {{Launch}} is one of the possible {{Offer::Operation}}s, which means {{LaunchTasks}} is an alias for a sequence of {{Offer::Operation}} consisting of only {{Launch}}. The goal of this ticket is to add {{Reserve}} and {{Unreserve}} messages as possible {{Offer::Operation}}s. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2308) Task reconciliation API should support data partitioning
[ https://issues.apache.org/jira/browse/MESOS-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2308: - Shepherd: Vinod Kone Story Points: 8 Task reconciliation API should support data partitioning Key: MESOS-2308 URL: https://issues.apache.org/jira/browse/MESOS-2308 Project: Mesos Issue Type: Story Reporter: Bill Farner Assignee: Benjamin Mahler Labels: twitter The {{reconcileTasks}} API call requires the caller to specify a collection of {{TaskStatus}}es, with the option to provide an empty collection to retrieve the master's entire state. Retrieving the entire state is the only mechanism for the scheduler to learn that there are tasks running that it does not know about; however, this call does not allow incremental querying. As a result, the master may need to send many thousands of status updates, and the scheduler would have to handle them all at once. It would be ideal if the scheduler had a means to partition these requests so it can control the pace of these status updates. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
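As an illustration of the partitioning the ticket asks for, here is a hedged Python sketch of a scheduler batching explicit reconciliation requests. `reconcile()` stands in for the driver's {{reconcileTasks}} call, and the chunk size is arbitrary; this is a client-side workaround sketch, not the master-side support the ticket proposes.

```python
def partitioned_reconcile(task_statuses, reconcile, chunk_size=1000):
    """Send explicit reconciliation requests in fixed-size batches rather
    than one implicit (empty) request returning the master's entire state."""
    for i in range(0, len(task_statuses), chunk_size):
        reconcile(task_statuses[i:i + chunk_size])

# Demonstrate the batching with placeholder statuses: 2500 tasks
# become three requests of 1000, 1000, and 500.
sent = []
partitioned_reconcile(list(range(2500)), sent.append, chunk_size=1000)
```

Note this only paces requests the scheduler already knows about; discovering unknown tasks still requires the full-state query, which is exactly the gap the ticket describes.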
[jira] [Assigned] (MESOS-2314) remove unnecessary constants
[ https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-2314: Assignee: Dominic Hamon remove unnecessary constants Key: MESOS-2314 URL: https://issues.apache.org/jira/browse/MESOS-2314 Project: Mesos Issue Type: Improvement Components: slave, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie In {{src/slave/paths.cpp}} a number of string constants are defined to describe the formats of various paths. However, given there is a 1:1 mapping between the string constant and the functions that build the paths, the code would be more readable if the format strings were inline in the functions. In the cases where one constant depends on another (see the {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} chain, for example) the function calls can just be chained together. This will have the added benefit of removing some statically constructed string constants, which are dangerous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2314) remove unnecessary constants
[ https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302334#comment-14302334 ] Dominic Hamon commented on MESOS-2314: -- https://reviews.apache.org/r/30531 remove unnecessary constants Key: MESOS-2314 URL: https://issues.apache.org/jira/browse/MESOS-2314 Project: Mesos Issue Type: Improvement Components: slave, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie In {{src/slave/paths.cpp}} a number of string constants are defined to describe the formats of various paths. However, given there is a 1:1 mapping between the string constant and the functions that build the paths, the code would be more readable if the format strings were inline in the functions. In the cases where one constant depends on another (see the {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} chain, for example) the function calls can just be chained together. This will have the added benefit of removing some statically constructed string constants, which are dangerous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2314) remove unnecessary constants
[ https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2314: - Story Points: 2 remove unnecessary constants Key: MESOS-2314 URL: https://issues.apache.org/jira/browse/MESOS-2314 Project: Mesos Issue Type: Improvement Components: slave, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie In {{src/slave/paths.cpp}} a number of string constants are defined to describe the formats of various paths. However, given there is a 1:1 mapping between the string constant and the functions that build the paths, the code would be more readable if the format strings were inline in the functions. In the cases where one constant depends on another (see the {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} chain, for example) the function calls can just be chained together. This will have the added benefit of removing some statically constructed string constants, which are dangerous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2314) remove unnecessary constants
[ https://issues.apache.org/jira/browse/MESOS-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2314: - Sprint: Twitter Mesos Q1 Sprint 2 remove unnecessary constants Key: MESOS-2314 URL: https://issues.apache.org/jira/browse/MESOS-2314 Project: Mesos Issue Type: Improvement Components: slave, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie In {{src/slave/paths.cpp}} a number of string constants are defined to describe the formats of various paths. However, given there is a 1:1 mapping between the string constant and the functions that build the paths, the code would be more readable if the format strings were inline in the functions. In the cases where one constant depends on another (see the {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} chain, for example) the function calls can just be chained together. This will have the added benefit of removing some statically constructed string constants, which are dangerous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2314) remove unnecessary constants
Dominic Hamon created MESOS-2314: Summary: remove unnecessary constants Key: MESOS-2314 URL: https://issues.apache.org/jira/browse/MESOS-2314 Project: Mesos Issue Type: Improvement Components: slave, technical debt Reporter: Dominic Hamon Priority: Minor In {{src/slave/paths.cpp}} a number of string constants are defined to describe the formats of various paths. However, given there is a 1:1 mapping between the string constant and the functions that build the paths, the code would be more readable if the format strings were inline in the functions. In the cases where one constant depends on another (see the {{EXECUTOR_INFO_PATH, EXECUTOR_PATH, FRAMEWORK_PATH, SLAVE_PATH, ROOT_PATH}} chain, for example) the function calls can just be chained together. This will have the added benefit of removing some statically constructed string constants, which are dangerous. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
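The refactor this ticket describes can be sketched as follows. This is a Python illustration only (the real code is C++ in {{src/slave/paths.cpp}}), and the path layout shown is simplified, not Mesos's actual on-disk layout: each function inlines its own path segment and chains the function for its parent path instead of sharing dependent format-string constants.

```python
import os

def root_path(root):
    return root

def slave_path(root, slave_id):
    return os.path.join(root_path(root), "slaves", slave_id)

def framework_path(root, slave_id, framework_id):
    return os.path.join(slave_path(root, slave_id), "frameworks", framework_id)

def executor_path(root, slave_id, framework_id, executor_id):
    # Chaining the parent-path function replaces the dependent string
    # constant, and no statically constructed strings remain.
    return os.path.join(framework_path(root, slave_id, framework_id),
                        "executors", executor_id)
```

Each path is now readable at its point of definition, and the dependency chain (root → slave → framework → executor) is expressed by calls rather than by constants referencing constants.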
[jira] [Updated] (MESOS-2305) Refactor validators in Master.
[ https://issues.apache.org/jira/browse/MESOS-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2305: - Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q1 Sprint 1) Refactor validators in Master. -- Key: MESOS-2305 URL: https://issues.apache.org/jira/browse/MESOS-2305 Project: Mesos Issue Type: Bug Reporter: Jie Yu Assignee: Jie Yu There are several motivations for this. We are in the process of adding dynamic reservations and persistent volumes support in the master. To do that, the master needs to validate relevant operations from the framework (see Offer::Operation in mesos.proto). The existing validator style in the master is hard to extend, compose and re-use. Another motivation is unit testing (MESOS-1064): right now we write integration tests for those validators, which is unfortunate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1148) Add support for rate limiting slave removal
[ https://issues.apache.org/jira/browse/MESOS-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1148: - Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q1 Sprint 1) Add support for rate limiting slave removal --- Key: MESOS-1148 URL: https://issues.apache.org/jira/browse/MESOS-1148 Project: Mesos Issue Type: Improvement Components: master Reporter: Bill Farner Assignee: Vinod Kone Labels: twitter To safeguard against unforeseen bugs leading to widespread slave removal, it would be nice to allow for rate limiting of the decision to remove slaves and/or send TASK_LOST messages for tasks on those slaves. Ideally this would allow an operator to be notified soon enough to intervene before causing cluster impact. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2136) Expose per-cgroup memory pressure
[ https://issues.apache.org/jira/browse/MESOS-2136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2136: - Sprint: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1) Expose per-cgroup memory pressure - Key: MESOS-2136 URL: https://issues.apache.org/jira/browse/MESOS-2136 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Ian Downes Assignee: Chi Zhang Labels: twitter The cgroup memory controller can provide information on the memory pressure of a cgroup. This comes in the form of event-based notifications, where events of (low, medium, critical) severity are generated when the kernel takes specific actions to allocate memory. This signal is probably more informative than comparing memory usage to the memory limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2058: - Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q1 Sprint 1) Deprecate stats.json endpoints for Master and Slave --- Key: MESOS-2058 URL: https://issues.apache.org/jira/browse/MESOS-2058 Project: Mesos Issue Type: Task Components: master, slave Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: twitter With the introduction of the libprocess {{/metrics/snapshot}} endpoint, metrics are now duplicated in the Master and Slave between this and {{stats.json}}. We should deprecate the {{stats.json}} endpoints. Manual inspection of {{stats.json}} shows that all metrics are now covered by the new endpoint for Master and Slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2100) Implement master to slave protocol for persistent disk resources.
[ https://issues.apache.org/jira/browse/MESOS-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2100: - Sprint: Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1) Implement master to slave protocol for persistent disk resources. - Key: MESOS-2100 URL: https://issues.apache.org/jira/browse/MESOS-2100 Project: Mesos Issue Type: Task Components: master, slave Reporter: Jie Yu Assignee: Jie Yu Labels: twitter We need to do the following: 1) The slave needs to send persisted resources when registering (or re-registering). 2) The master needs to send total persisted resources to the slave, either by re-using RunTask/UpdateFrameworkInfo or by introducing a new type of message (like UpdateResources). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2031) Manage persistent directories on slave.
[ https://issues.apache.org/jira/browse/MESOS-2031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2031: - Sprint: Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q4 Sprint 6, Twitter Mesos Q1 Sprint 1) Manage persistent directories on slave. --- Key: MESOS-2031 URL: https://issues.apache.org/jira/browse/MESOS-2031 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Jie Yu Whenever a slave sees a persistent disk resource (in ExecutorInfo or TaskInfo) that is new to it, it will create a persistent directory which is for tasks to store persistent data. The slave needs to do the following after it's created: 1) symlink into the executor sandbox so that tasks/executor can see it 2) garbage collect it once it is released by the framework -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1830) Expose master stats differentiating between master-generated and slave-generated LOST tasks
[ https://issues.apache.org/jira/browse/MESOS-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1830: - Sprint: Twitter Q4 Sprint 1, Twitter Mesos Q4 Sprint 2, Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Q4 Sprint 1, Twitter Mesos Q4 Sprint 2, Twitter Mesos Q4 Sprint 3, Twitter Mesos Q4 Sprint 4, Twitter Mesos Q4 Sprint 5, Twitter Mesos Q1 Sprint 1) Expose master stats differentiating between master-generated and slave-generated LOST tasks --- Key: MESOS-1830 URL: https://issues.apache.org/jira/browse/MESOS-1830 Project: Mesos Issue Type: Story Components: master Reporter: Bill Farner Assignee: Dominic Hamon Priority: Minor The master exports a monotonically-increasing counter of tasks transitioned to TASK_LOST. This loses fidelity of the source of the lost task. A first step in exposing the source of lost tasks might be to just differentiate between TASK_LOST transitions initiated by the master vs the slave (and maybe bad input from the scheduler). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2241) DiskUsageCollectorTest.SymbolicLink test is flaky
[ https://issues.apache.org/jira/browse/MESOS-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2241: - Sprint: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2 (was: Twitter Mesos Q1 Sprint 1) DiskUsageCollectorTest.SymbolicLink test is flaky - Key: MESOS-2241 URL: https://issues.apache.org/jira/browse/MESOS-2241 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Vinod Kone Assignee: Jie Yu Observed this on a local machine running linux w/ sudo. {code} [ RUN ] DiskUsageCollectorTest.SymbolicLink ../../src/tests/disk_quota_tests.cpp:138: Failure Expected: (usage1.get()) (Kilobytes(16)), actual: 24KB vs 8-byte object 00-40 00-00 00-00 00-00 [ FAILED ] DiskUsageCollectorTest.SymbolicLink (201 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2058: - Sprint: Twitter Mesos Q1 Sprint 1 (was: Twitter Mesos Q1 Sprint 1, Twitter Mesos Q1 Sprint 2) Deprecate stats.json endpoints for Master and Slave --- Key: MESOS-2058 URL: https://issues.apache.org/jira/browse/MESOS-2058 Project: Mesos Issue Type: Task Components: master, slave Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: twitter With the introduction of the libprocess {{/metrics/snapshot}} endpoint, metrics are now duplicated in the Master and Slave between this and {{stats.json}}. We should deprecate the {{stats.json}} endpoints. Manual inspection of {{stats.json}} shows that all metrics are now covered by the new endpoint for Master and Slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2123) Document changes in C++ Resources API in CHANGELOG.
[ https://issues.apache.org/jira/browse/MESOS-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2123: - Sprint: Twitter Mesos Q1 Sprint 2 Document changes in C++ Resources API in CHANGELOG. --- Key: MESOS-2123 URL: https://issues.apache.org/jira/browse/MESOS-2123 Project: Mesos Issue Type: Task Reporter: Jie Yu Labels: twitter With the refactor introduced in MESOS-1974, we need to document those API changes in CHANGELOG. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2103) Expose number and state of tasks in a container
[ https://issues.apache.org/jira/browse/MESOS-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2103: - Sprint: Twitter Mesos Q1 Sprint 2 Expose number and state of tasks in a container --- Key: MESOS-2103 URL: https://issues.apache.org/jira/browse/MESOS-2103 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.20.0 Reporter: Ian Downes Labels: twitter The CFS cpu statistics (cpus_nr_throttled, cpus_nr_periods, cpus_throttled_time) are difficult to interpret. 1) nr_throttled is the number of intervals where *any* throttling occurred 2) throttled_time is the aggregate time *across all runnable tasks* (tasks in the Linux sense). For example, in a typical 60 second sampling interval: nr_periods = 600 and nr_throttled could be 60, i.e., 10% of intervals, but throttled_time could be much higher than (60/600) * 60 = 6 seconds if there is more than one task that is runnable but throttled. *Each* throttled task contributes to the total throttled time. Small test to demonstrate throttled_time exceeding nr_periods * quota_interval: 5 x {{'openssl speed'}} running with quota=100ms: {noformat} cat cpu.stat sleep 1 cat cpu.stat nr_periods 3228 nr_throttled 1276 throttled_time 528843772540 nr_periods 3238 nr_throttled 1286 throttled_time 531668964667 {noformat} All 10 intervals were throttled (100%), for a total throttled time of 2.8 seconds within the 1 second interval (more than 100% of the time interval). It would be helpful to expose the number and state of tasks in the container cgroup. This would be at a very coarse granularity but would give some guidance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
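The arithmetic in the cpu.stat sample above can be checked directly; a small Python computation over the two readings (throttled_time is in nanoseconds):

```python
# Two cpu.stat readings taken one second apart, copied from the sample above.
before = {"nr_periods": 3228, "nr_throttled": 1276, "throttled_time": 528843772540}
after  = {"nr_periods": 3238, "nr_throttled": 1286, "throttled_time": 531668964667}

periods   = after["nr_periods"] - before["nr_periods"]      # elapsed CFS periods
throttled = after["nr_throttled"] - before["nr_throttled"]  # periods with throttling
# Each runnable-but-throttled task adds to throttled_time, so the total can
# exceed the wall-clock interval, as the ticket points out.
seconds = (after["throttled_time"] - before["throttled_time"]) / 1e9
```

All 10 elapsed periods were throttled, and the aggregate throttled time grew by about 2.8 seconds during a 1 second sleep, which is the interpretation problem this ticket describes.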
[jira] [Updated] (MESOS-1807) Disallow executors with cpu only or memory only resources
[ https://issues.apache.org/jira/browse/MESOS-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1807: - Sprint: Twitter Q4 Sprint 1, Twitter Mesos Q4 Sprint 2, Twitter Mesos Q4 Sprint 3, Twitter Mesos Q1 Sprint 2 (was: Twitter Q4 Sprint 1, Twitter Mesos Q4 Sprint 2, Twitter Mesos Q4 Sprint 3) Disallow executors with cpu only or memory only resources - Key: MESOS-1807 URL: https://issues.apache.org/jira/browse/MESOS-1807 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Assignee: Vinod Kone Labels: newbie Currently the master allows executors to be launched with either only cpus or only memory, but we shouldn't allow that. This is because an executor is an actual unix process that is launched by the slave. If an executor doesn't specify cpus, what should the cpu limits be for that executor when there are no tasks running on it? If no cpu limits are set then it might starve other executors/tasks on the slave, violating isolation guarantees. The same goes for memory. Moreover, the current containerizer/isolator code will throw failures when using such an executor, e.g., when the last task on the executor finishes and Containerizer::update() is called with 0 cpus or 0 mem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
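A minimal sketch of the proposed check follows. This is hypothetical Python, assuming resources flattened to a name-to-scalar mapping; the real validation would live in the master's C++ code and operate on Resource protobufs.

```python
def validate_executor_resources(resources):
    """Reject cpu-only or memory-only executors, per MESOS-1807.

    resources: mapping of resource name to scalar amount.
    Returns an error string, or None if the resources are acceptable.
    """
    has_cpus = resources.get("cpus", 0) > 0
    has_mem = resources.get("mem", 0) > 0
    if has_cpus != has_mem:
        return "executor must specify both cpus and mem, or neither"
    return None
```

Rejecting the mismatched case up front means the containerizer never reaches Containerizer::update() with 0 cpus or 0 mem for a still-running executor.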
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Assignee: Yan Xu Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Assignee: Yan Xu Priority: Minor Labels: flaky, twitter Occurred on a review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (and doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Sprint: Twitter Mesos Q1 Sprint 2 Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Priority: Minor Labels: flaky, twitter Occurred on a review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (and doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Shepherd: Vinod Kone Story Points: 8 Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Assignee: Yan Xu Priority: Minor Labels: flaky, twitter Occurred on a review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (and doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
[ https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2144: - Shepherd: Jie Yu (was: Vinod Kone) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread --- Key: MESOS-2144 URL: https://issues.apache.org/jira/browse/MESOS-2144 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Cody Maloney Assignee: Yan Xu Priority: Minor Labels: flaky, twitter Occurred on a review bot review of: https://reviews.apache.org/r/28262/#review62333 The review doesn't touch code related to the test (and doesn't break libprocess in general) [ RUN ] ExamplesTest.LowLevelSchedulerPthread ../../src/tests/script.cpp:83: Failure Failed low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault [ FAILED ] ExamplesTest.LowLevelSchedulerPthread (7561 ms) The test -- This message was sent by Atlassian JIRA (v6.3.4#6332)