[jira] [Updated] (MESOS-1550) MesosSchedulerDriver should never, ever, call 'stop'.

2014-06-27 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1550:
---

Fix Version/s: 0.19.1

 MesosSchedulerDriver should never, ever, call 'stop'.
 -

 Key: MESOS-1550
 URL: https://issues.apache.org/jira/browse/MESOS-1550
 Project: Mesos
  Issue Type: Bug
  Components: framework, java api, python api
Affects Versions: 0.14.0, 0.14.1, 0.14.2, 0.17.0, 0.16.0, 0.15.0, 0.18.0, 
 0.19.0
Reporter: Benjamin Hindman
Priority: Critical
 Fix For: 0.19.1


 Using MesosSchedulerDriver.stop causes the master to unregister the 
 framework. The library should never make this decision for a framework; it 
 should defer to the framework itself.
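
 As a point of reference (a usage sketch, not the ticket's fix): the C++ driver already leaves this decision to the framework through the failover flag on stop(). The helper below is illustrative only; the function name and the policy around it are assumptions.
 {code}
 #include <mesos/scheduler.hpp>

 // Illustrative sketch: the *framework*, not the library, decides whether to
 // unregister. stop() (i.e. stop(false)) asks the master to tear the framework
 // down; stop(true) leaves the framework registered so a new scheduler
 // instance can fail over within the framework's failover timeout.
 void shutdownScheduler(mesos::MesosSchedulerDriver& driver, bool permanent)
 {
   if (permanent) {
     driver.stop();      // Unregister the framework from the master.
   } else {
     driver.stop(true);  // Keep the framework registered for failover.
   }
 }
 {code}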



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1539) No longer able to spin up Mesos master in local mode

2014-06-27 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1539:
---

Target Version/s: 0.19.1
   Fix Version/s: (was: 0.19.1)

 No longer able to spin up Mesos master in local mode
 

 Key: MESOS-1539
 URL: https://issues.apache.org/jira/browse/MESOS-1539
 Project: Mesos
  Issue Type: Bug
  Components: java api
Affects Versions: 0.19.0
 Environment: Ubuntu 14.04 / Mac OS X  against Mesos 0.19.0
Reporter: Sunil Shah
Assignee: Benjamin Mahler
 Fix For: 0.20.0


 JVM frameworks such as Marathon use the local master mode for testing 
 purposes (passed through as the `--master local` parameter).
 This no longer works in Mesos 0.19.0 because of the new mandatory 
 registry and quorum parameters. There is no way to set these for local 
 masters - the master emits the following message before terminating the framework:
 `--work_dir needed for replicated log based registry`.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1517) Maintain a queue of messages that arrive before the master recovers.

2014-06-27 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046537#comment-14046537
 ] 

Benjamin Mahler commented on MESOS-1517:


There are only a few types of messages involved here. When a master fails over, 
the slaves and frameworks will try to re-register before doing anything else.

This means that if we queue up the messages we'll only be reducing the need for 
frameworks and slaves to retry registration, which is already something that is 
required of them. So I think this change would mostly be beneficial for our 
integration tests where the retries are not desirable. :)

 Maintain a queue of messages that arrive before the master recovers.
 

 Key: MESOS-1517
 URL: https://issues.apache.org/jira/browse/MESOS-1517
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Benjamin Mahler
  Labels: reliability

 Currently when the master is recovering, we drop all incoming messages. If 
 slaves and frameworks knew about the leading master only once it has 
 recovered, then we would only expect to see messages after we've recovered.
 We previously considered enqueuing all messages through the recovery future, 
 but this has the downside of forcing all messages to go through the master's 
 queue twice:
 {code}
   // TODO(bmahler): Consider instead re-enqueing *all* messages
   // through recover(). What are the performance implications of
   // the additional queueing delay and the accumulated backlog
   // of messages post-recovery?
   if (!recovered.get().isReady()) {
     VLOG(1) << "Dropping '" << event.message->name << "' message since "
             << "not recovered yet";
     ++metrics.dropped_messages;
     return;
   }
 {code}
 However, an easy solution to this problem is to maintain an explicit queue of 
 incoming messages that gets flushed once we finish recovery. This ensures 
 that all messages post-recovery are processed normally.
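
 A minimal standalone sketch of the proposed explicit queue (class and member names are illustrative, not the master's actual code): messages that arrive before recovery completes are buffered and flushed in arrival order once recovery finishes, so nothing is dropped and nothing passes through the event queue twice.
 {code}
 #include <deque>
 #include <functional>
 #include <string>

 class RecoveryBuffer
 {
 public:
   explicit RecoveryBuffer(std::function<void(const std::string&)> handler)
     : recovered(false), handle(std::move(handler)) {}

   // Called for every incoming message.
   void receive(const std::string& message)
   {
     if (!recovered) {
       pending.push_back(message);  // Buffer instead of dropping.
       return;
     }
     handle(message);
   }

   // Called once when recovery completes: flush the backlog in order.
   void onRecovered()
   {
     recovered = true;
     while (!pending.empty()) {
       handle(pending.front());
       pending.pop_front();
     }
   }

 private:
   bool recovered;
   std::function<void(const std::string&)> handle;
   std::deque<std::string> pending;
 };
 {code}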



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1566) Support private docker registry.

2014-07-11 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1566:
---

Summary: Support private docker registry.  (was: Support private registry)

 Support private docker registry.
 

 Key: MESOS-1566
 URL: https://issues.apache.org/jira/browse/MESOS-1566
 Project: Mesos
  Issue Type: Technical task
Reporter: Timothy Chen

 Need to support Docker launching images hosted in a private registry service, 
 which requires a docker login.
 Can consider utilizing the .dockercfg file for providing credentials.
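
 For illustration, the legacy ~/.dockercfg format maps each registry endpoint to a base64-encoded "user:password" pair plus an email. The registry URL, credentials, and email below are placeholders, not values from this ticket.
 {code}
 {
   "https://registry.example.com/v1/": {
     "auth": "dXNlcjpwYXNzd29yZA==",
     "email": "user@example.com"
   }
 }
 {code}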



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?

2014-07-14 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1574:
---

Component/s: isolation

 what to do when a rogue process binds to a port mesos didn't allocate to it?
 

 Key: MESOS-1574
 URL: https://issues.apache.org/jira/browse/MESOS-1574
 Project: Mesos
  Issue Type: Improvement
  Components: allocation, isolation
Reporter: Jay Buffington
Priority: Minor

 I recently had an issue where a slave had a process whose parent was init 
 that was bound to a port in the range that mesos thought was a free resource. 
  I'm not sure if this is due to a bug in mesos (it lost track of this process 
 during an upgrade?) or if there was a bad user who started a process on the 
 host manually outside of mesos.  The process is over a month old and I have 
 no history in mesos to ask it if/when it launched the task :(
 If a rogue process binds to a port that mesos-slave has offered to the master 
 as an available resource there should be some sort of reckoning.  Mesos could:
* kill the rogue process
* rescind the offer for that port
* have an api that can be plugged into a monitoring system to alert humans 
 of this inconsistency
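
 An illustrative detection sketch (not part of Mesos): a monitoring hook could find ports in an offered range that some other process already holds by attempting to bind each one, then alert on (or withhold) those ports. Function and parameter names are assumptions.
 {code}
 #include <arpa/inet.h>
 #include <netinet/in.h>
 #include <sys/socket.h>
 #include <unistd.h>

 #include <cerrno>
 #include <cstdint>
 #include <vector>

 // Returns the ports in [begin, end] that something has already bound.
 std::vector<int> portsInUse(int begin, int end)
 {
   std::vector<int> used;

   for (int port = begin; port <= end; ++port) {
     int fd = socket(AF_INET, SOCK_STREAM, 0);
     if (fd < 0) {
       continue;  // Could not create a socket; skip this port.
     }

     sockaddr_in addr = {};
     addr.sin_family = AF_INET;
     addr.sin_addr.s_addr = htonl(INADDR_ANY);
     addr.sin_port = htons(static_cast<uint16_t>(port));

     if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 &&
         errno == EADDRINUSE) {
       used.push_back(port);  // Something else already holds this port.
     }

     close(fd);
   }

   return used;
 }
 {code}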



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1538) A container destruction in the middle of a launch leads to CHECK failure.

2014-07-14 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1538:
---

Summary: A container destruction in the middle of a launch leads to CHECK 
failure.  (was: A container destruction in the middle of a launch leads to 
CHECK failure)

 A container destruction in the middle of a launch leads to CHECK failure.
 -

 Key: MESOS-1538
 URL: https://issues.apache.org/jira/browse/MESOS-1538
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
Assignee: Ian Downes
 Fix For: 0.19.1


 There is a race between destroy() and exec() in the containerizer 
 process when the destroy is called in the middle of the launch.
 In particular, if the destroy completes and the container is removed from 
 the 'promises' map before 'exec()' is called, a CHECK failure happens.
 The fix is to return a Failure instead of doing a CHECK in 'exec()'.
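
 A standalone sketch of the described pattern (not the actual containerizer code; names mirror the description but the types are stand-ins): when the container has already been destroyed and removed from the promises map, exec() reports a failure to the caller instead of aborting on a CHECK.
 {code}
 #include <future>
 #include <map>
 #include <memory>
 #include <stdexcept>
 #include <string>

 using ContainerID = std::string;

 // Stand-in for the containerizer's 'promises' map.
 std::map<ContainerID, std::shared_ptr<std::promise<bool>>> promises;

 std::future<bool> exec(const ContainerID& containerId)
 {
   if (promises.count(containerId) == 0) {
     // Container destroyed during launch: fail gracefully instead of CHECK-ing.
     std::promise<bool> failed;
     failed.set_exception(std::make_exception_ptr(
         std::runtime_error("Container destroyed during launch")));
     return failed.get_future();
   }

   // Normal path (simplified): continue the launch via the stored promise.
   return promises[containerId]->get_future();
 }
 {code}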



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1567) Add logging of the user uid when receiving SIGTERM.

2014-07-14 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1567:
---

Description: We currently do not log the user id when receiving a SIGTERM, 
this makes debugging a bit difficult. It's easy to get this information through 
sigaction.  (was: We currently do not log the user pid when receiving a 
SIGTERM, this makes debugging a bit difficult. It's easy to get this 
information through sigaction.)

 Add logging of the user uid when receiving SIGTERM.
 ---

 Key: MESOS-1567
 URL: https://issues.apache.org/jira/browse/MESOS-1567
 Project: Mesos
  Issue Type: Improvement
  Components: master, slave
Reporter: Benjamin Mahler
Assignee: Alexandra Sava

 We currently do not log the user id when receiving a SIGTERM, this makes 
 debugging a bit difficult. It's easy to get this information through 
 sigaction.
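
 A minimal sketch of the sigaction approach (not the actual Mesos handler): installing the handler with SA_SIGINFO makes the kernel pass a siginfo_t carrying the sender's uid and pid, which can then be logged.
 {code}
 #include <signal.h>
 #include <unistd.h>

 #include <cstdio>
 #include <cstdlib>

 // Note: fprintf is not async-signal-safe; a real handler would use write()
 // or defer the logging. This is illustrative only.
 static void handleSigterm(int sig, siginfo_t* info, void* /*context*/)
 {
   fprintf(stderr,
           "Received SIGTERM from uid %d (pid %d); exiting\n",
           static_cast<int>(info->si_uid),
           static_cast<int>(info->si_pid));
   _exit(EXIT_SUCCESS);
 }

 int main()
 {
   struct sigaction action = {};
   action.sa_sigaction = handleSigterm;
   action.sa_flags = SA_SIGINFO;  // Deliver siginfo_t (carries si_uid, si_pid).
   sigemptyset(&action.sa_mask);

   if (sigaction(SIGTERM, &action, nullptr) != 0) {
     perror("sigaction");
     return EXIT_FAILURE;
   }

   pause();  // Wait for the signal.
   return EXIT_SUCCESS;
 }
 {code}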



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2014-07-15 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1466:
---

Labels: reliability  (was: )

 Race between executor exited event and launch task can cause overcommit of 
 resources
 

 Key: MESOS-1466
 URL: https://issues.apache.org/jira/browse/MESOS-1466
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Vinod Kone
  Labels: reliability

 The following sequence of events can cause an overcommit
 -- Launch task is called for a task whose executor is already running
 -- Executor's resources are not accounted for on the master
 -- Executor exits and the event is enqueued behind launch tasks on the master
 -- Master sends the task to the slave, which needs to commit resources 
 for the task and the (new) executor.
 -- Master processes the executor exited event and re-offers the executor's 
 resources causing an overcommit of resources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1603) SlaveTest.TerminatingSlaveDoesNotReregister is flaky.

2014-07-15 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1603:
---

Sprint: Q3 Sprint 1

 SlaveTest.TerminatingSlaveDoesNotReregister is flaky.
 -

 Key: MESOS-1603
 URL: https://issues.apache.org/jira/browse/MESOS-1603
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 {noformat}
 [ RUN  ] SlaveTest.TerminatingSlaveDoesNotReregister
 Using temporary directory 
 '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_6OCQiU'
 I0715 18:16:03.231495  5857 leveldb.cpp:176] Opened db in 27.552259ms
 I0715 18:16:03.240953  5857 leveldb.cpp:183] Compacted db in 8.801497ms
 I0715 18:16:03.241580  5857 leveldb.cpp:198] Created db iterator in 39823ns
 I0715 18:16:03.241945  5857 leveldb.cpp:204] Seeked to beginning of db in 
 15498ns
 I0715 18:16:03.242385  5857 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 15153ns
 I0715 18:16:03.242780  5857 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0715 18:16:03.243475  5882 recover.cpp:425] Starting replica recovery
 I0715 18:16:03.243540  5882 recover.cpp:451] Replica is in EMPTY status
 I0715 18:16:03.243862  5882 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0715 18:16:03.243919  5882 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0715 18:16:03.244112  5875 recover.cpp:542] Updating replica status to 
 STARTING
 I0715 18:16:03.249405  5880 master.cpp:288] Master 
 20140715-181603-16842879-36514-5857 (trusty) started on 127.0.1.1:36514
 I0715 18:16:03.249445  5880 master.cpp:325] Master only allowing 
 authenticated frameworks to register
 I0715 18:16:03.249454  5880 master.cpp:330] Master only allowing 
 authenticated slaves to register
 I0715 18:16:03.249480  5880 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_6OCQiU/credentials'
 I0715 18:16:03.250130  5880 master.cpp:359] Authorization enabled
 I0715 18:16:03.250900  5880 hierarchical_allocator_process.hpp:301] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:36514
 I0715 18:16:03.250951  5880 master.cpp:122] No whitelist given. Advertising 
 offers for all slaves
 I0715 18:16:03.251145  5880 master.cpp:1128] The newly elected leader is 
 master@127.0.1.1:36514 with id 20140715-181603-16842879-36514-5857
 I0715 18:16:03.251164  5880 master.cpp:1141] Elected as the leading master!
 I0715 18:16:03.251173  5880 master.cpp:959] Recovering from registrar
 I0715 18:16:03.251225  5880 registrar.cpp:313] Recovering registrar
 I0715 18:16:03.254640  5875 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 10.421369ms
 I0715 18:16:03.254683  5875 replica.cpp:320] Persisted replica status to 
 STARTING
 I0715 18:16:03.254770  5875 recover.cpp:451] Replica is in STARTING status
 I0715 18:16:03.255097  5875 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0715 18:16:03.255166  5875 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0715 18:16:03.255280  5875 recover.cpp:542] Updating replica status to VOTING
 I0715 18:16:03.263897  5875 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 8.581313ms
 I0715 18:16:03.263944  5875 replica.cpp:320] Persisted replica status to 
 VOTING
 I0715 18:16:03.264010  5875 recover.cpp:556] Successfully joined the Paxos 
 group
 I0715 18:16:03.264085  5875 recover.cpp:440] Recover process terminated
 I0715 18:16:03.264227  5875 log.cpp:656] Attempting to start the writer
 I0715 18:16:03.264570  5875 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I0715 18:16:03.322881  5875 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 58.31469ms
 I0715 18:16:03.323349  5875 replica.cpp:342] Persisted promised to 1
 I0715 18:16:03.328495  5876 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0715 18:16:03.328910  5876 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I0715 18:16:03.338655  5876 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 9.73834ms
 I0715 18:16:03.338693  5876 replica.cpp:676] Persisted action at 0
 I0715 18:16:03.338964  5876 replica.cpp:508] Replica received write request 
 for position 0
 I0715 18:16:03.338997  5876 leveldb.cpp:438] Reading position from leveldb 
 took 21691ns
 I0715 18:16:03.349257  5876 leveldb.cpp:343] Persisting action (14 bytes) to 
 leveldb took 10.25515ms
 I0715 18:16:03.349551  5876 replica.cpp:676] Persisted action at 0
 I0715 18:16:03.354379  5877 replica.cpp:655] Replica received learned notice 

[jira] [Resolved] (MESOS-1460) SlaveTest.TerminatingSlaveDoesNotRegister is flaky

2014-07-15 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-1460.


Resolution: Fixed

{noformat}
commit ebee9afee7f6e5f04a5f259642c12eb0b99c35e0
Author: Yifan Gu yi...@mesosphere.io
Date:   Thu Jun 12 12:24:46 2014 -0700

Fixed a flaky test: SlaveTest.TerminatingSlaveDoesNotReregister.

Review: https://reviews.apache.org/r/22472
{noformat}

 SlaveTest.TerminatingSlaveDoesNotRegister is flaky
 --

 Key: MESOS-1460
 URL: https://issues.apache.org/jira/browse/MESOS-1460
 Project: Mesos
  Issue Type: Bug
Reporter: Dominic Hamon
Assignee: Yifan Gu

 [ RUN  ] SlaveTest.TerminatingSlaveDoesNotReregister
 Using temporary directory 
 '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_U2FkN5'
 I0605 11:04:21.890828 32082 leveldb.cpp:176] Opened db in 49.661187ms
 I0605 11:04:21.908869 32082 leveldb.cpp:183] Compacted db in 17.671793ms
 I0605 11:04:21.909230 32082 leveldb.cpp:198] Created db iterator in 26848ns
 I0605 11:04:21.909484 32082 leveldb.cpp:204] Seeked to beginning of db in 
 1705ns
 I0605 11:04:21.909740 32082 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 815ns
 I0605 11:04:21.910032 32082 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0605 11:04:21.910549 32105 recover.cpp:425] Starting replica recovery
 I0605 11:04:21.910626 32105 recover.cpp:451] Replica is in EMPTY status
 I0605 11:04:21.910951 32105 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0605 11:04:21.911013 32105 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0605 11:04:21.93 32105 recover.cpp:542] Updating replica status to 
 STARTING
 I0605 11:04:21.914664 32109 master.cpp:272] Master 
 20140605-110421-16842879-56385-32082 (precise) started on 127.0.1.1:56385
 I0605 11:04:21.914690 32109 master.cpp:309] Master only allowing 
 authenticated frameworks to register
 I0605 11:04:21.914695 32109 master.cpp:314] Master only allowing 
 authenticated slaves to register
 I0605 11:04:21.914702 32109 credentials.hpp:35] Loading credentials for 
 authentication from 
 '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_U2FkN5/credentials'
 I0605 11:04:21.914765 32109 master.cpp:340] Master enabling authorization
 I0605 11:04:21.915194 32109 hierarchical_allocator_process.hpp:301] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:56385
 I0605 11:04:21.915230 32109 master.cpp:108] No whitelist given. Advertising 
 offers for all slaves
 I0605 11:04:21.915393 32109 master.cpp:957] The newly elected leader is 
 master@127.0.1.1:56385 with id 20140605-110421-16842879-56385-32082
 I0605 11:04:21.915405 32109 master.cpp:970] Elected as the leading master!
 I0605 11:04:21.915410 32109 master.cpp:788] Recovering from registrar
 I0605 11:04:21.915458 32109 registrar.cpp:313] Recovering registrar
 I0605 11:04:21.931046 32105 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 19.869329ms
 I0605 11:04:21.931084 32105 replica.cpp:320] Persisted replica status to 
 STARTING
 I0605 11:04:21.931169 32105 recover.cpp:451] Replica is in STARTING status
 I0605 11:04:21.931500 32105 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0605 11:04:21.931560 32105 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0605 11:04:21.931656 32105 recover.cpp:542] Updating replica status to VOTING
 I0605 11:04:21.945734 32105 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 14.013731ms
 I0605 11:04:21.945791 32105 replica.cpp:320] Persisted replica status to 
 VOTING
 I0605 11:04:21.945868 32105 recover.cpp:556] Successfully joined the Paxos 
 group
 I0605 11:04:21.945930 32105 recover.cpp:440] Recover process terminated
 I0605 11:04:21.946071 32105 log.cpp:656] Attempting to start the writer
 I0605 11:04:21.946374 32105 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I0605 11:04:21.960847 32105 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 14.444258ms
 I0605 11:04:21.961493 32105 replica.cpp:342] Persisted promised to 1
 I0605 11:04:21.965292 32107 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0605 11:04:21.965626 32107 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I0605 11:04:21.982533 32107 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 16.8754ms
 I0605 11:04:21.982589 32107 replica.cpp:676] Persisted action at 0
 I0605 11:04:21.982921 32107 replica.cpp:508] Replica received write request 
 for position 0
 I0605 11:04:21.982952 32107 leveldb.cpp:438] Reading position from leveldb 
 took 16276ns
 I0605 11:04:21.999135 32107 

[jira] [Created] (MESOS-1603) SlaveTest.TerminatingSlaveDoesNotReregister is flaky.

2014-07-15 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1603:
--

 Summary: SlaveTest.TerminatingSlaveDoesNotReregister is flaky.
 Key: MESOS-1603
 URL: https://issues.apache.org/jira/browse/MESOS-1603
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


{noformat}
[ RUN  ] SlaveTest.TerminatingSlaveDoesNotReregister
Using temporary directory 
'/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_6OCQiU'
I0715 18:16:03.231495  5857 leveldb.cpp:176] Opened db in 27.552259ms
I0715 18:16:03.240953  5857 leveldb.cpp:183] Compacted db in 8.801497ms
I0715 18:16:03.241580  5857 leveldb.cpp:198] Created db iterator in 39823ns
I0715 18:16:03.241945  5857 leveldb.cpp:204] Seeked to beginning of db in 
15498ns
I0715 18:16:03.242385  5857 leveldb.cpp:273] Iterated through 0 keys in the db 
in 15153ns
I0715 18:16:03.242780  5857 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0715 18:16:03.243475  5882 recover.cpp:425] Starting replica recovery
I0715 18:16:03.243540  5882 recover.cpp:451] Replica is in EMPTY status
I0715 18:16:03.243862  5882 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0715 18:16:03.243919  5882 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0715 18:16:03.244112  5875 recover.cpp:542] Updating replica status to STARTING
I0715 18:16:03.249405  5880 master.cpp:288] Master 
20140715-181603-16842879-36514-5857 (trusty) started on 127.0.1.1:36514
I0715 18:16:03.249445  5880 master.cpp:325] Master only allowing authenticated 
frameworks to register
I0715 18:16:03.249454  5880 master.cpp:330] Master only allowing authenticated 
slaves to register
I0715 18:16:03.249480  5880 credentials.hpp:36] Loading credentials for 
authentication from 
'/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_6OCQiU/credentials'
I0715 18:16:03.250130  5880 master.cpp:359] Authorization enabled
I0715 18:16:03.250900  5880 hierarchical_allocator_process.hpp:301] 
Initializing hierarchical allocator process with master : master@127.0.1.1:36514
I0715 18:16:03.250951  5880 master.cpp:122] No whitelist given. Advertising 
offers for all slaves
I0715 18:16:03.251145  5880 master.cpp:1128] The newly elected leader is 
master@127.0.1.1:36514 with id 20140715-181603-16842879-36514-5857
I0715 18:16:03.251164  5880 master.cpp:1141] Elected as the leading master!
I0715 18:16:03.251173  5880 master.cpp:959] Recovering from registrar
I0715 18:16:03.251225  5880 registrar.cpp:313] Recovering registrar
I0715 18:16:03.254640  5875 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 10.421369ms
I0715 18:16:03.254683  5875 replica.cpp:320] Persisted replica status to 
STARTING
I0715 18:16:03.254770  5875 recover.cpp:451] Replica is in STARTING status
I0715 18:16:03.255097  5875 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0715 18:16:03.255166  5875 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0715 18:16:03.255280  5875 recover.cpp:542] Updating replica status to VOTING
I0715 18:16:03.263897  5875 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 8.581313ms
I0715 18:16:03.263944  5875 replica.cpp:320] Persisted replica status to VOTING
I0715 18:16:03.264010  5875 recover.cpp:556] Successfully joined the Paxos group
I0715 18:16:03.264085  5875 recover.cpp:440] Recover process terminated
I0715 18:16:03.264227  5875 log.cpp:656] Attempting to start the writer
I0715 18:16:03.264570  5875 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0715 18:16:03.322881  5875 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 58.31469ms
I0715 18:16:03.323349  5875 replica.cpp:342] Persisted promised to 1
I0715 18:16:03.328495  5876 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0715 18:16:03.328910  5876 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0715 18:16:03.338655  5876 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 9.73834ms
I0715 18:16:03.338693  5876 replica.cpp:676] Persisted action at 0
I0715 18:16:03.338964  5876 replica.cpp:508] Replica received write request for 
position 0
I0715 18:16:03.338997  5876 leveldb.cpp:438] Reading position from leveldb took 
21691ns
I0715 18:16:03.349257  5876 leveldb.cpp:343] Persisting action (14 bytes) to 
leveldb took 10.25515ms
I0715 18:16:03.349551  5876 replica.cpp:676] Persisted action at 0
I0715 18:16:03.354379  5877 replica.cpp:655] Replica received learned notice 
for position 0
I0715 18:16:03.367383  5877 leveldb.cpp:343] Persisting action (16 bytes) to 
leveldb took 12.99789ms
I0715 18:16:03.367434  5877 replica.cpp:676] Persisted action at 0
I0715 18:16:03.367444  5877 replica.cpp:661] Replica learned NOP action at 

[jira] [Commented] (MESOS-1525) Don't require slave id for reconciliation requests.

2014-07-15 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063057#comment-14063057
 ] 

Benjamin Mahler commented on MESOS-1525:


Review: https://reviews.apache.org/r/23542/

 Don't require slave id for reconciliation requests.
 ---

 Key: MESOS-1525
 URL: https://issues.apache.org/jira/browse/MESOS-1525
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 Reconciliation requests currently specify a list of TaskStatuses. SlaveID is 
 optional inside TaskStatus but reconciliation requests are dropped when the 
 SlaveID is not specified.
 We can answer reconciliation requests for a task so long as there are no 
 transient slaves, this is what we should do when the slave id is not 
 specified.
 There's an open question around whether we want the Reconcile Event to 
 specify TaskID/SlaveID instead of TaskStatus, but I'll save that for later.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1603) SlaveTest.TerminatingSlaveDoesNotReregister is flaky.

2014-07-15 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063058#comment-14063058
 ] 

Benjamin Mahler commented on MESOS-1603:


Review: https://reviews.apache.org/r/23543/

 SlaveTest.TerminatingSlaveDoesNotReregister is flaky.
 -

 Key: MESOS-1603
 URL: https://issues.apache.org/jira/browse/MESOS-1603
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 {noformat}
 [ RUN  ] SlaveTest.TerminatingSlaveDoesNotReregister
 Using temporary directory 
 '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_6OCQiU'
 I0715 18:16:03.231495  5857 leveldb.cpp:176] Opened db in 27.552259ms
 I0715 18:16:03.240953  5857 leveldb.cpp:183] Compacted db in 8.801497ms
 I0715 18:16:03.241580  5857 leveldb.cpp:198] Created db iterator in 39823ns
 I0715 18:16:03.241945  5857 leveldb.cpp:204] Seeked to beginning of db in 
 15498ns
 I0715 18:16:03.242385  5857 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 15153ns
 I0715 18:16:03.242780  5857 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0715 18:16:03.243475  5882 recover.cpp:425] Starting replica recovery
 I0715 18:16:03.243540  5882 recover.cpp:451] Replica is in EMPTY status
 I0715 18:16:03.243862  5882 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0715 18:16:03.243919  5882 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0715 18:16:03.244112  5875 recover.cpp:542] Updating replica status to 
 STARTING
 I0715 18:16:03.249405  5880 master.cpp:288] Master 
 20140715-181603-16842879-36514-5857 (trusty) started on 127.0.1.1:36514
 I0715 18:16:03.249445  5880 master.cpp:325] Master only allowing 
 authenticated frameworks to register
 I0715 18:16:03.249454  5880 master.cpp:330] Master only allowing 
 authenticated slaves to register
 I0715 18:16:03.249480  5880 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_6OCQiU/credentials'
 I0715 18:16:03.250130  5880 master.cpp:359] Authorization enabled
 I0715 18:16:03.250900  5880 hierarchical_allocator_process.hpp:301] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:36514
 I0715 18:16:03.250951  5880 master.cpp:122] No whitelist given. Advertising 
 offers for all slaves
 I0715 18:16:03.251145  5880 master.cpp:1128] The newly elected leader is 
 master@127.0.1.1:36514 with id 20140715-181603-16842879-36514-5857
 I0715 18:16:03.251164  5880 master.cpp:1141] Elected as the leading master!
 I0715 18:16:03.251173  5880 master.cpp:959] Recovering from registrar
 I0715 18:16:03.251225  5880 registrar.cpp:313] Recovering registrar
 I0715 18:16:03.254640  5875 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 10.421369ms
 I0715 18:16:03.254683  5875 replica.cpp:320] Persisted replica status to 
 STARTING
 I0715 18:16:03.254770  5875 recover.cpp:451] Replica is in STARTING status
 I0715 18:16:03.255097  5875 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0715 18:16:03.255166  5875 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0715 18:16:03.255280  5875 recover.cpp:542] Updating replica status to VOTING
 I0715 18:16:03.263897  5875 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 8.581313ms
 I0715 18:16:03.263944  5875 replica.cpp:320] Persisted replica status to 
 VOTING
 I0715 18:16:03.264010  5875 recover.cpp:556] Successfully joined the Paxos 
 group
 I0715 18:16:03.264085  5875 recover.cpp:440] Recover process terminated
 I0715 18:16:03.264227  5875 log.cpp:656] Attempting to start the writer
 I0715 18:16:03.264570  5875 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I0715 18:16:03.322881  5875 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 58.31469ms
 I0715 18:16:03.323349  5875 replica.cpp:342] Persisted promised to 1
 I0715 18:16:03.328495  5876 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0715 18:16:03.328910  5876 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I0715 18:16:03.338655  5876 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 9.73834ms
 I0715 18:16:03.338693  5876 replica.cpp:676] Persisted action at 0
 I0715 18:16:03.338964  5876 replica.cpp:508] Replica received write request 
 for position 0
 I0715 18:16:03.338997  5876 leveldb.cpp:438] Reading position from leveldb 
 took 21691ns
 I0715 18:16:03.349257  5876 leveldb.cpp:343] Persisting action (14 bytes) to 
 leveldb took 10.25515ms
 I0715 18:16:03.349551  5876 replica.cpp:676] Persisted action at 0
 

[jira] [Created] (MESOS-1620) Reconciliation does not send back tasks pending validation / authorization.

2014-07-21 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1620:
--

 Summary: Reconciliation does not send back tasks pending 
validation / authorization.
 Key: MESOS-1620
 URL: https://issues.apache.org/jira/browse/MESOS-1620
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


Per Vinod's feedback on https://reviews.apache.org/r/23542/, we do not send 
back TASK_STAGING for those tasks that are pending in the Master (validation / 
authorization still in progress).

For both implicit and explicit task reconciliation, the master could reply with 
TASK_STAGING for these tasks, as this provides additional information to the 
framework.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MESOS-1529) Handle a network partition between Master and Slave

2014-07-23 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072537#comment-14072537
 ] 

Benjamin Mahler edited comment on MESOS-1529 at 7/24/14 2:55 AM:
-

For now we will proceed by adding a ping timeout on the slave to ensure that 
the slave re-registers when the master is no longer pinging it. This will 
resolve the case that motivated this ticket:

https://reviews.apache.org/r/23874/
https://reviews.apache.org/r/23875/
https://reviews.apache.org/r/23866/
https://reviews.apache.org/r/23867/
https://reviews.apache.org/r/23868/

I decided to punt on the failover timeout in the master in the first pass 
because it can be dangerous when ZooKeeper issues are preventing the slave from 
re-registering with the master; we do not want to remove a ton of slaves in 
this situation. Rather, when the slave is health checking correctly but does 
not re-register within a timeout, we could send a registration request from the 
master to the slave, telling the slave that it must re-register. This message 
could also be used when receiving status updates (or other messages) from 
slaves that are disconnected in the master.


was (Author: bmahler):
For now we will proceed by adding a ping timeout on the slave to ensure that 
the slave re-registers when the master is no longer pinging it. This will 
resolve the case that motivated this ticket:

https://reviews.apache.org/r/23866/
https://reviews.apache.org/r/23867/
https://reviews.apache.org/r/23868/

I decided to punt on the failover timeout in the master in the first pass 
because it can be dangerous when ZooKeeper issues are preventing the slave from 
re-registering with the master; we do not want to remove a ton of slaves in 
this situation. Rather, when the slave is health checking correctly but does 
not re-register within a timeout, we could send a registration request from the 
master to the slave, telling the slave that it must re-register. This message 
could also be used when receiving status updates (or other messages) from 
slaves that are disconnected in the master.

 Handle a network partition between Master and Slave
 ---

 Key: MESOS-1529
 URL: https://issues.apache.org/jira/browse/MESOS-1529
 Project: Mesos
  Issue Type: Bug
Reporter: Dominic Hamon
Assignee: Benjamin Mahler

 If a network partition occurs between a Master and Slave, the Master will 
 remove the Slave (as it fails health check) and mark the tasks being run 
 there as LOST. However, the Slave is not aware that it has been removed so 
 the tasks will continue to run.
 (To clarify a little bit: neither the master nor the slave receives an 'exited' 
 event, indicating that the connection between the master and slave is not 
 closed).
 There are at least two possible approaches to solving this issue:
 1. Introduce a health check from Slave to Master so they have a consistent 
 view of a network partition. We may still see this issue should a one-way 
 connection error occur.
 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the 
 Slave reappears and reconcile then. We'd still need to mark Slaves and tasks 
 as potentially lost (zombie state) but maybe the Scheduler can make a more 
 intelligent decision.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1635) zk flag fails when specifying a file and the

2014-07-24 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073707#comment-14073707
 ] 

Benjamin Mahler commented on MESOS-1635:


Looks like there was a TODO left for this:
https://github.com/apache/mesos/blob/0.19.1/src/master/main.cpp#L197

I think we should improve URL::parse per the TODO and update these:
https://github.com/apache/mesos/blob/0.19.1/src/master/detector.cpp#L107
https://github.com/apache/mesos/blob/0.19.1/src/master/contender.cpp#L73

Should be a simple fix, would be happy to shepherd this if someone wants to 
pick it up!
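
A minimal sketch of the kind of fix described (helper name is hypothetical, not Mesos code): when the --zk flag value is a file:// reference, read the file and hand its contents to the zk:// URL parser; otherwise parse the value directly.
{code}
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

std::string resolveZkFlag(const std::string& value)
{
  const std::string prefix = "file://";

  if (value.compare(0, prefix.size(), prefix) == 0) {
    std::ifstream file(value.substr(prefix.size()));
    if (!file) {
      throw std::runtime_error("Cannot read zk file: " + value);
    }

    std::stringstream contents;
    contents << file.rdbuf();
    std::string url = contents.str();

    // Trim trailing newlines so the URL parser sees a clean 'zk://...' string.
    while (!url.empty() && (url.back() == '\n' || url.back() == '\r')) {
      url.pop_back();
    }
    return url;
  }

  return value;  // Already a zk://... URL.
}
{code}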

 zk flag fails when specifying a file and the 
 -

 Key: MESOS-1635
 URL: https://issues.apache.org/jira/browse/MESOS-1635
 Project: Mesos
  Issue Type: Bug
  Components: cli
Affects Versions: 0.19.1
 Environment: Linux ubuntu 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 
 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Ken Sipe

 The zk flag supports referencing a file. It works when the registry is 
 in_memory; however, in a real environment it fails.
 The following starts up just fine:
 /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --registry=in_memory
 However, when the following is executed it fails:
  /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --quorum=1 
 --work_dir=/tmp/mesos
 It uses the same working format for the zk flag, but now we are using the 
 replicated log. It fails with:
 I0723 19:24:34.755506 39856 main.cpp:150] Build: 2014-07-18 18:50:58 by root
 I0723 19:24:34.755580 39856 main.cpp:152] Version: 0.19.1
 I0723 19:24:34.755591 39856 main.cpp:155] Git tag: 0.19.1
 I0723 19:24:34.755601 39856 main.cpp:159] Git SHA: 
 dc0b7bf2a1a7981079b33a16b689892f9cda0d8d
 Error parsing ZooKeeper URL: Expecting 'zk://' at the beginning of the URL



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1406) Master stats.json using boolean instead of integral value for 'elected'.

2014-07-24 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1406:
---

  Description: 
All stats.json values should be numeric, but it looks like a regression was 
introduced here:

{noformat}
commit dee9bd96e88053ab96c84253578ed332d343fe41
Author: Charlie Carson charliecar...@gmail.com
Date:   Thu Feb 20 16:24:09 2014 -0800

Add JSON::Boolean to stout/json.hpp.

If you assign an JSON::Object a bool then it will get coerced into a
JSON::Number w/ value of 0.0 or 1.0.  This is because JSON::True and
JSON::False do not have constructors from bool.

The fix is to introduce a common super class, JSON::Boolean, which
both JSON::True and JSON::False inherit from.  JSON::Boolean has the
necessary constructor which takes a bool.

However, this leads to ambiguity when assigning a cstring to
a JSON::Value, since JSON::String already takes a const char * and
a const char * is implicitly convertable to a bool.

The solution for that is to rename the variant from JSON::Value
to JSON::inner::Variant and to create a new class JSON::Value
which inherits from JSON::inner::Variant.  The new JSON::Value
can have all the conversion constructors in a single place, so
is no ambiguity, and delegate everythign else to the Variant.

Also added a bunch of unit tests.

SEE: https://issues.apache.org/jira/browse/MESOS-939

Review: https://reviews.apache.org/r/17520
{noformat}

This caused all JSON values constructed from booleans to implicitly change from 
0/1 to true/false.

  was:
All stats.json values should be numeric, but it looks like a regression was 
introduced here:

{noformat}
commit f9d1dd819b6cc3843e4d1287ac10276d62cbfed4
Author: Vinod Kone vi...@twitter.com
Date:   Tue Nov 19 10:39:27 2013 -0800

Replaced usage of old detector with new Master contender and detector
abstractions.

From: Jiang Yan Xu y...@jxu.me
Review: https://reviews.apache.org/r/15510
{noformat}

Which appears to have been included since 0.16.0.

Old stats.json:

{code}
{
  ...
  "elected": 0,
  ...
}
{code}

New stats.json:

{code}
{
  ...
  "elected": false,
  ...
}
{code}

Affects Version/s: (was: 0.18.0)
   (was: 0.17.0)
   (was: 0.16.0)
Fix Version/s: (was: 0.19.0)

 Master stats.json using boolean instead of integral value for 'elected'.
 

 Key: MESOS-1406
 URL: https://issues.apache.org/jira/browse/MESOS-1406
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 All stats.json values should be numeric, but it looks like a regression was 
 introduced here:
 {noformat}
 commit dee9bd96e88053ab96c84253578ed332d343fe41
 Author: Charlie Carson charliecar...@gmail.com
 Date:   Thu Feb 20 16:24:09 2014 -0800
 Add JSON::Boolean to stout/json.hpp.
 If you assign an JSON::Object a bool then it will get coerced into a
 JSON::Number w/ value of 0.0 or 1.0.  This is because JSON::True and
 JSON::False do not have constructors from bool.
 The fix is to introduce a common super class, JSON::Boolean, which
 both JSON::True and JSON::False inherit from.  JSON::Boolean has the
 necessary constructor which takes a bool.
 However, this leads to ambiguity when assigning a cstring to
 a JSON::Value, since JSON::String already takes a const char * and
 a const char * is implicitly convertable to a bool.
 The solution for that is to rename the variant from JSON::Value
 to JSON::inner::Variant and to create a new class JSON::Value
 which inherits from JSON::inner::Variant.  The new JSON::Value
 can have all the conversion constructors in a single place, so
 is no ambiguity, and delegate everythign else to the Variant.
 Also added a bunch of unit tests.
 SEE: https://issues.apache.org/jira/browse/MESOS-939
 Review: https://reviews.apache.org/r/17520
 {noformat}
 This caused all JSON values constructed from booleans to implicitly change 
 from 0/1 to true/false.
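
 An illustrative sketch of keeping the endpoint numeric, assuming stout's JSON::Object with its public 'values' map (the function name is hypothetical): assign an integer explicitly rather than a bool, which after the JSON::Boolean change would serialize as true/false.
 {code}
 #include <stout/json.hpp>

 JSON::Object stats(bool elected)
 {
   JSON::Object object;
   object.values["elected"] = elected ? 1 : 0;  // Serializes as 0 or 1.
   return object;
 }
 {code}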



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (MESOS-1219) Master should disallow frameworks that reconnect after failover timeout.

2014-07-24 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reopened MESOS-1219:



This was not committed after all in light of: MESOS-1630

 Master should disallow frameworks that reconnect after failover timeout.
 

 Key: MESOS-1219
 URL: https://issues.apache.org/jira/browse/MESOS-1219
 Project: Mesos
  Issue Type: Bug
  Components: master, webui
Reporter: Robert Lacroix
Assignee: Vinod Kone
 Fix For: 0.20.0


 When a scheduler reconnects after the failover timeout has been exceeded, the 
 framework id is usually reused because the scheduler doesn't know that the 
 timeout was exceeded, and it is actually handled as a new framework.
 The /framework/:framework_id route of the Web UI doesn't handle those cases 
 very well because its key is reused. It only shows the terminated one.
 Would it make sense to ignore the provided framework id when a scheduler 
 reconnects to a terminated framework and generate a new id to make sure it's 
 unique?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1635) zk flag fails when specifying a file and the

2014-07-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1635:
---

Shepherd: Benjamin Mahler

 zk flag fails when specifying a file and the 
 -

 Key: MESOS-1635
 URL: https://issues.apache.org/jira/browse/MESOS-1635
 Project: Mesos
  Issue Type: Bug
  Components: cli
Affects Versions: 0.19.1
 Environment: Linux ubuntu 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 
 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Ken Sipe

 The zk flag supports referencing a file. It works when the registry is 
 in_memory; however, in a real environment it fails.
 The following starts up just fine:
 /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --registry=in_memory
 However, when the following is executed it fails:
  /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --quorum=1 
 --work_dir=/tmp/mesos
 It uses the same working format for the zk flag, but now we are using the 
 replicated log. It fails with:
 I0723 19:24:34.755506 39856 main.cpp:150] Build: 2014-07-18 18:50:58 by root
 I0723 19:24:34.755580 39856 main.cpp:152] Version: 0.19.1
 I0723 19:24:34.755591 39856 main.cpp:155] Git tag: 0.19.1
 I0723 19:24:34.755601 39856 main.cpp:159] Git SHA: 
 dc0b7bf2a1a7981079b33a16b689892f9cda0d8d
 Error parsing ZooKeeper URL: Expecting 'zk://' at the beginning of the URL



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MESOS-1635) zk flag fails when specifying a file and the replicated logs

2014-07-29 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-1635.


   Resolution: Fixed
Fix Version/s: 0.20.0

{noformat}
commit cd61a228ecb3bf0d40fe3658a9ec58f645a9ecd2
Author: Ken Sipe kens...@gmail.com
Date:   Tue Jul 29 12:24:38 2014 -0700

Fixed the master to accept a file:// based zk flag.

Review: https://reviews.apache.org/r/23997
{noformat}

 zk flag fails when specifying a file and the replicated logs
 

 Key: MESOS-1635
 URL: https://issues.apache.org/jira/browse/MESOS-1635
 Project: Mesos
  Issue Type: Bug
  Components: cli
Affects Versions: 0.19.1
 Environment: Linux ubuntu 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 
 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Ken Sipe
 Fix For: 0.20.0


 The zk flag supports referencing a file. It works when the registry is 
 in_memory; however, in a real environment it fails.
 The following starts up just fine:
 /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --registry=in_memory
 However, when the following is executed it fails:
  /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --quorum=1 
 --work_dir=/tmp/mesos
 It uses the same working format for the zk flag, but now we are using the 
 replicated log. It fails with:
 I0723 19:24:34.755506 39856 main.cpp:150] Build: 2014-07-18 18:50:58 by root
 I0723 19:24:34.755580 39856 main.cpp:152] Version: 0.19.1
 I0723 19:24:34.755591 39856 main.cpp:155] Git tag: 0.19.1
 I0723 19:24:34.755601 39856 main.cpp:159] Git SHA: 
 dc0b7bf2a1a7981079b33a16b689892f9cda0d8d
 Error parsing ZooKeeper URL: Expecting 'zk://' at the beginning of the URL



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1619) OsTest.User test is flaky

2014-07-29 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078585#comment-14078585
 ] 

Benjamin Mahler commented on MESOS-1619:


This thread 
(http://www.redhat.com/archives/libvir-list/2012-December/msg00208.html) seems 
relevant:

{quote}
virGetUserIDByName returns an error when the return value of getpwnam_r
is non-0. However on my RHEL system, getpwnam_r returns ENOENT when the
requested user cannot be found, which then causes virGetUserID not
to behave as documented (it returns an error instead of falling back
to parsing the passed-in value as an uid).
{quote}

Let's fix up os.hpp based on this knowledge?
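
An illustrative sketch of the suggested handling (not the actual os.hpp code; the function name is hypothetical): getpwnam_r may report a missing user either as 0 with a NULL result, as documented, or on some platforms via an ENOENT/ESRCH/EBADF/EPERM return value. Both should be treated as "user not found" rather than as a hard error.
{code}
#include <pwd.h>

#include <cerrno>
#include <optional>
#include <string>
#include <vector>

std::optional<uid_t> getuidByName(const std::string& name)
{
  std::vector<char> buffer(16384);
  struct passwd pwd;
  struct passwd* result = nullptr;

  int error =
    getpwnam_r(name.c_str(), &pwd, buffer.data(), buffer.size(), &result);

  if (error == 0 && result != nullptr) {
    return result->pw_uid;  // Found.
  }

  if (error == 0 || error == ENOENT || error == ESRCH ||
      error == EBADF || error == EPERM) {
    return std::nullopt;    // Not found (platform-dependent signaling).
  }

  // Genuine error (e.g. ERANGE: buffer too small). A real implementation
  // would surface this as an error rather than "not found".
  return std::nullopt;
}
{code}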

 OsTest.User test is flaky
 -

 Key: MESOS-1619
 URL: https://issues.apache.org/jira/browse/MESOS-1619
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: centos7 w/ gcc
Reporter: Vinod Kone

 [ RUN  ] OsTest.user
 stout/tests/os_tests.cpp:720: Failure
 Value of: os::getuid(UUID::random().toString()).isNone()
   Actual: false
 Expected: true
 stout/tests/os_tests.cpp:721: Failure
 Value of: os::getgid(UUID::random().toString()).isNone()
   Actual: false
 Expected: true
 WARNING: Logging before InitGoogleLogging() is written to STDERR
 E0721 06:15:58.656255 13440 os.hpp:731] Failed to set gid: Failed to get 
 username information: No such file or directory
 [  FAILED  ] OsTest.user (20 ms)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2014-07-29 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1653:
--

 Summary: HealthCheckTest.GracePeriod is flaky.
 Key: MESOS-1653
 URL: https://issues.apache.org/jira/browse/MESOS-1653
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler


{noformat}
[--] 3 tests from HealthCheckTest
[ RUN  ] HealthCheckTest.GracePeriod
Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
I0729 17:10:10.499487  1176 leveldb.cpp:183] Compacted db in 13.674118ms
I0729 17:10:10.500200  1176 leveldb.cpp:198] Created db iterator in 7394ns
I0729 17:10:10.500692  1176 leveldb.cpp:204] Seeked to beginning of db in 2317ns
I0729 17:10:10.501113  1176 leveldb.cpp:273] Iterated through 0 keys in the db 
in 1367ns
I0729 17:10:10.501535  1176 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0729 17:10:10.502233  1212 recover.cpp:425] Starting replica recovery
I0729 17:10:10.502295  1212 recover.cpp:451] Replica is in EMPTY status
I0729 17:10:10.502825  1212 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0729 17:10:10.502877  1212 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0729 17:10:10.502980  1212 recover.cpp:542] Updating replica status to STARTING
I0729 17:10:10.508482  1213 master.cpp:289] Master 
20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701
I0729 17:10:10.508607  1213 master.cpp:326] Master only allowing authenticated 
frameworks to register
I0729 17:10:10.508632  1213 master.cpp:331] Master only allowing authenticated 
slaves to register
I0729 17:10:10.508656  1213 credentials.hpp:36] Loading credentials for 
authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials'
I0729 17:10:10.509407  1213 master.cpp:360] Authorization enabled
I0729 17:10:10.510030  1207 hierarchical_allocator_process.hpp:301] 
Initializing hierarchical allocator process with master : master@127.0.1.1:54701
I0729 17:10:10.510113  1207 master.cpp:123] No whitelist given. Advertising 
offers for all slaves
I0729 17:10:10.511699  1213 master.cpp:1129] The newly elected leader is 
master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176
I0729 17:10:10.512230  1213 master.cpp:1142] Elected as the leading master!
I0729 17:10:10.512692  1213 master.cpp:960] Recovering from registrar
I0729 17:10:10.513226  1210 registrar.cpp:313] Recovering registrar
I0729 17:10:10.516006  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 12.946461ms
I0729 17:10:10.516047  1212 replica.cpp:320] Persisted replica status to 
STARTING
I0729 17:10:10.516129  1212 recover.cpp:451] Replica is in STARTING status
I0729 17:10:10.516520  1212 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0729 17:10:10.516592  1212 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0729 17:10:10.516767  1212 recover.cpp:542] Updating replica status to VOTING
I0729 17:10:10.528376  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 11.537102ms
I0729 17:10:10.528430  1212 replica.cpp:320] Persisted replica status to VOTING
I0729 17:10:10.528501  1212 recover.cpp:556] Successfully joined the Paxos group
I0729 17:10:10.528565  1212 recover.cpp:440] Recover process terminated
I0729 17:10:10.528700  1212 log.cpp:656] Attempting to start the writer
I0729 17:10:10.528960  1212 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0729 17:10:10.537821  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 8.830863ms
I0729 17:10:10.537869  1212 replica.cpp:342] Persisted promised to 1
I0729 17:10:10.540550  1209 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0729 17:10:10.540856  1209 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0729 17:10:10.547430  1209 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 6.548344ms
I0729 17:10:10.547471  1209 replica.cpp:676] Persisted action at 0
I0729 17:10:10.547732  1209 replica.cpp:508] Replica received write request for 
position 0
I0729 17:10:10.547765  1209 leveldb.cpp:438] Reading position from leveldb took 
15676ns
I0729 17:10:10.557169  1209 leveldb.cpp:343] Persisting action (14 bytes) to 
leveldb took 9.373798ms
I0729 17:10:10.557241  1209 replica.cpp:676] Persisted action at 0
I0729 17:10:10.560642  1210 replica.cpp:655] Replica received learned notice 
for position 0
I0729 17:10:10.570312  1210 leveldb.cpp:343] Persisting action (16 bytes) to 
leveldb took 9.61503ms
I0729 17:10:10.570380  1210 replica.cpp:676] Persisted action at 0
I0729 17:10:10.570406  1210 replica.cpp:661] Replica learned NOP action at 
position 0
I0729 17:10:10.570746  1210 log.cpp:672] Writer started 

[jira] [Created] (MESOS-1668) Handle a temporary one-way master --> slave socket closure.

2014-08-04 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1668:
--

 Summary: Handle a temporary one-way master --> slave socket 
closure.
 Key: MESOS-1668
 URL: https://issues.apache.org/jira/browse/MESOS-1668
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler
Priority: Minor


In MESOS-1529, we realized that it's possible for a slave to remain 
disconnected in the master if the following occurs:

→ Master and Slave connected operating normally.
→ Temporary one-way network failure, master→slave link breaks.
→ Master marks slave as disconnected.
→ Network restored and health checking continues normally, slave is not removed 
as a result. Slave does not attempt to re-register since it is receiving pings 
once again.
→ Slave remains disconnected according to the master, and the slave does not 
try to re-register. Bad!

We were originally thinking of using a failover timeout in the master to remove 
these slaves that don't re-register. However, it can be dangerous when 
ZooKeeper issues are preventing the slave from re-registering with the master; 
we do not want to remove a ton of slaves in this situation.

Rather, when the slave is health checking correctly but does not re-register 
within a timeout, we could send a registration request from the master to the 
slave, telling the slave that it must re-register. This message could also be 
used when receiving status updates (or other messages) from slaves that are 
disconnected in the master.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1470) Add cluster maintenance documentation.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1470:
---

Target Version/s:   (was: 0.20.0)

 Add cluster maintenance documentation.
 --

 Key: MESOS-1470
 URL: https://issues.apache.org/jira/browse/MESOS-1470
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Affects Versions: 0.19.0
Reporter: Benjamin Mahler

 Now that the master has replicated state on the disk, we should add 
 documentation that guides operators for common maintenance work:
 * Swapping a master in the ensemble.
 * Growing the master ensemble.
 * Shrinking the master ensemble.
 This would help craft similar documentation for users of the replicated log.
 We should also add documentation for existing slave maintenance documentation:
 * Best practices for rolling upgrades.
 * How to shut down a slave.
 This latter category will be incorporated with [~alexandra.sava]'s 
 maintenance work!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1461) Add task reconciliation to the Python API.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1461:
---

Target Version/s: 0.21.0  (was: 0.20.0)

 Add task reconciliation to the Python API.
 --

 Key: MESOS-1461
 URL: https://issues.apache.org/jira/browse/MESOS-1461
 Project: Mesos
  Issue Type: Task
  Components: python api
Affects Versions: 0.19.0
Reporter: Benjamin Mahler

 Looks like the 'reconcileTasks' call was added to the C++ and Java APIs but 
 was never added to the Python API.
 This may be obviated by the lower level API.
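
 For reference, a sketch of the reconciliation call as exposed by the C++ driver (the call the Python bindings lack). The wrapper function name is illustrative; per the discussion in MESOS-1525/MESOS-1620, an empty statuses vector is the "implicit" form that asks the master about all of the framework's tasks.
 {code}
 #include <vector>

 #include <mesos/mesos.hpp>
 #include <mesos/scheduler.hpp>

 void reconcile(mesos::MesosSchedulerDriver& driver,
                const std::vector<mesos::TaskStatus>& statuses)
 {
   // Explicit reconciliation for the given statuses; an empty vector requests
   // reconciliation of all known tasks.
   driver.reconcileTasks(statuses);
 }
 {code}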



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1517) Maintain a queue of messages that arrive before the master recovers.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1517:
---

Target Version/s:   (was: 0.20.0)

 Maintain a queue of messages that arrive before the master recovers.
 

 Key: MESOS-1517
 URL: https://issues.apache.org/jira/browse/MESOS-1517
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Benjamin Mahler
  Labels: reliability

 Currently when the master is recovering, we drop all incoming messages. If 
 slaves and frameworks knew about the leading master only once it has 
 recovered, then we would only expect to see messages after we've recovered.
 We previously considered enqueuing all messages through the recovery future, 
 but this has the downside of forcing all messages to go through the master's 
 queue twice:
 {code}
   // TODO(bmahler): Consider instead re-enqueing *all* messages
   // through recover(). What are the performance implications of
   // the additional queueing delay and the accumulated backlog
   // of messages post-recovery?
   if (!recovered.get().isReady()) {
     VLOG(1) << "Dropping '" << event.message->name << "' message since "
             << "not recovered yet";
     ++metrics.dropped_messages;
     return;
   }
 {code}
 However, an easy solution to this problem is to maintain an explicit queue of 
 incoming messages that gets flushed once we finish recovery. This ensures 
 that all messages post-recovery are processed normally.
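 For illustration, a minimal sketch of that approach (the class and member 
 names below are hypothetical, not the actual Master code): messages are 
 buffered while recovery is in progress and flushed in arrival order once it 
 completes.
 {code}
 #include <deque>
 
 struct Message { /* payload elided */ };
 
 class RecoveringReceiver
 {
 public:
   RecoveringReceiver() : recovered(false) {}
 
   void receive(const Message& message)
   {
     if (!recovered) {
       // Queue instead of dropping, so senders need not retry.
       pending.push_back(message);
       return;
     }
     handle(message);
   }
 
   void onRecovered()
   {
     recovered = true;
     // Flush in arrival order so post-recovery processing is normal.
     while (!pending.empty()) {
       handle(pending.front());
       pending.pop_front();
     }
   }
 
 private:
   void handle(const Message&) { /* normal dispatch */ }
 
   bool recovered;
   std::deque<Message> pending;
 };
 {code}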



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-1653.


   Resolution: Fixed
Fix Version/s: 0.20.0
 Assignee: Timothy Chen

Fix was included here:

{noformat}
commit 656b0e075c79e03cf6937bbe7302424768729aa2
Author: Timothy Chen tnac...@apache.org
Date:   Wed Aug 6 11:34:03 2014 -0700

Re-enabled HealthCheckTest.ConsecutiveFailures test.

The test originally was flaky because the time to process the number
of consecutive checks configured exceeds the task itself, so the task
finished but the number of expected task health check didn't match.

Review: https://reviews.apache.org/r/23772
{noformat}

 HealthCheckTest.GracePeriod is flaky.
 -

 Key: MESOS-1653
 URL: https://issues.apache.org/jira/browse/MESOS-1653
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Timothy Chen
 Fix For: 0.20.0


 {noformat}
 [--] 3 tests from HealthCheckTest
 [ RUN  ] HealthCheckTest.GracePeriod
 Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
 I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
 I0729 17:10:10.499487  1176 leveldb.cpp:183] Compacted db in 13.674118ms
 I0729 17:10:10.500200  1176 leveldb.cpp:198] Created db iterator in 7394ns
 I0729 17:10:10.500692  1176 leveldb.cpp:204] Seeked to beginning of db in 
 2317ns
 I0729 17:10:10.501113  1176 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 1367ns
 I0729 17:10:10.501535  1176 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0729 17:10:10.502233  1212 recover.cpp:425] Starting replica recovery
 I0729 17:10:10.502295  1212 recover.cpp:451] Replica is in EMPTY status
 I0729 17:10:10.502825  1212 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0729 17:10:10.502877  1212 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0729 17:10:10.502980  1212 recover.cpp:542] Updating replica status to 
 STARTING
 I0729 17:10:10.508482  1213 master.cpp:289] Master 
 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701
 I0729 17:10:10.508607  1213 master.cpp:326] Master only allowing 
 authenticated frameworks to register
 I0729 17:10:10.508632  1213 master.cpp:331] Master only allowing 
 authenticated slaves to register
 I0729 17:10:10.508656  1213 credentials.hpp:36] Loading credentials for 
 authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials'
 I0729 17:10:10.509407  1213 master.cpp:360] Authorization enabled
 I0729 17:10:10.510030  1207 hierarchical_allocator_process.hpp:301] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:54701
 I0729 17:10:10.510113  1207 master.cpp:123] No whitelist given. Advertising 
 offers for all slaves
 I0729 17:10:10.511699  1213 master.cpp:1129] The newly elected leader is 
 master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176
 I0729 17:10:10.512230  1213 master.cpp:1142] Elected as the leading master!
 I0729 17:10:10.512692  1213 master.cpp:960] Recovering from registrar
 I0729 17:10:10.513226  1210 registrar.cpp:313] Recovering registrar
 I0729 17:10:10.516006  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 12.946461ms
 I0729 17:10:10.516047  1212 replica.cpp:320] Persisted replica status to 
 STARTING
 I0729 17:10:10.516129  1212 recover.cpp:451] Replica is in STARTING status
 I0729 17:10:10.516520  1212 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0729 17:10:10.516592  1212 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0729 17:10:10.516767  1212 recover.cpp:542] Updating replica status to VOTING
 I0729 17:10:10.528376  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 11.537102ms
 I0729 17:10:10.528430  1212 replica.cpp:320] Persisted replica status to 
 VOTING
 I0729 17:10:10.528501  1212 recover.cpp:556] Successfully joined the Paxos 
 group
 I0729 17:10:10.528565  1212 recover.cpp:440] Recover process terminated
 I0729 17:10:10.528700  1212 log.cpp:656] Attempting to start the writer
 I0729 17:10:10.528960  1212 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I0729 17:10:10.537821  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 8.830863ms
 I0729 17:10:10.537869  1212 replica.cpp:342] Persisted promised to 1
 I0729 17:10:10.540550  1209 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0729 17:10:10.540856  1209 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I0729 17:10:10.547430  1209 leveldb.cpp:343] Persisting 

[jira] [Reopened] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reopened MESOS-1613:



Looks like it's still flaky:

{noformat}
Changes

Summary

Made ephemeral ports a resource and killed private resources. (details)
Do not send ephemeral_ports resource to frameworks. (details)
Create static mesos library. (details)
Re-enabled HealthCheckTest.ConsecutiveFailures test. (details)
Merge resourcesRecovered and resourcesUnused. (details)
Added executor metrics for slave. (details)

[ RUN  ] HealthCheckTest.ConsecutiveFailures
Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_fBrAEu'
I0806 15:06:59.268267  9210 leveldb.cpp:176] Opened db in 29.926087ms
I0806 15:06:59.275971  9210 leveldb.cpp:183] Compacted db in 7.40006ms
I0806 15:06:59.276254  9210 leveldb.cpp:198] Created db iterator in 7678ns
I0806 15:06:59.276741  9210 leveldb.cpp:204] Seeked to beginning of db in 2076ns
I0806 15:06:59.277034  9210 leveldb.cpp:273] Iterated through 0 keys in the db 
in 1908ns
I0806 15:06:59.277307  9210 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0806 15:06:59.277868  9233 recover.cpp:425] Starting replica recovery
I0806 15:06:59.277946  9233 recover.cpp:451] Replica is in EMPTY status
I0806 15:06:59.278240  9233 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0806 15:06:59.278296  9233 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0806 15:06:59.278391  9233 recover.cpp:542] Updating replica status to STARTING
I0806 15:06:59.282282  9234 master.cpp:287] Master 
20140806-150659-16842879-60888-9210 (lucid) started on 127.0.1.1:60888
I0806 15:06:59.282316  9234 master.cpp:324] Master only allowing authenticated 
frameworks to register
I0806 15:06:59.282322  9234 master.cpp:329] Master only allowing authenticated 
slaves to register
I0806 15:06:59.282330  9234 credentials.hpp:36] Loading credentials for 
authentication from 
'/tmp/HealthCheckTest_ConsecutiveFailures_fBrAEu/credentials'
I0806 15:06:59.282508  9234 master.cpp:358] Authorization enabled
I0806 15:06:59.283121  9234 hierarchical_allocator_process.hpp:296] 
Initializing hierarchical allocator process with master : master@127.0.1.1:60888
I0806 15:06:59.283174  9234 master.cpp:121] No whitelist given. Advertising 
offers for all slaves
I0806 15:06:59.283413  9234 master.cpp:1127] The newly elected leader is 
master@127.0.1.1:60888 with id 20140806-150659-16842879-60888-9210
I0806 15:06:59.283429  9234 master.cpp:1140] Elected as the leading master!
I0806 15:06:59.283435  9234 master.cpp:958] Recovering from registrar
I0806 15:06:59.283491  9234 registrar.cpp:313] Recovering registrar
I0806 15:06:59.284046  9233 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 5.600113ms
I0806 15:06:59.284080  9233 replica.cpp:320] Persisted replica status to 
STARTING
I0806 15:06:59.284226  9233 recover.cpp:451] Replica is in STARTING status
I0806 15:06:59.284580  9233 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0806 15:06:59.284643  9233 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0806 15:06:59.284747  9233 recover.cpp:542] Updating replica status to VOTING
I0806 15:06:59.289934  9233 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 5.119357ms
I0806 15:06:59.290256  9233 replica.cpp:320] Persisted replica status to VOTING
I0806 15:06:59.290876  9237 recover.cpp:556] Successfully joined the Paxos group
I0806 15:06:59.291131  9232 recover.cpp:440] Recover process terminated
I0806 15:06:59.300732  9236 log.cpp:656] Attempting to start the writer
I0806 15:06:59.301061  9236 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0806 15:06:59.306172  9236 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 5.070193ms
I0806 15:06:59.306229  9236 replica.cpp:342] Persisted promised to 1
I0806 15:06:59.306747  9236 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0806 15:06:59.307143  9236 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0806 15:06:59.309715  9236 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 2.521311ms
I0806 15:06:59.310199  9236 replica.cpp:676] Persisted action at 0
I0806 15:06:59.320276  9234 replica.cpp:508] Replica received write request for 
position 0
I0806 15:06:59.320335  9234 leveldb.cpp:438] Reading position from leveldb took 
26656ns
I0806 15:06:59.325726  9234 leveldb.cpp:343] Persisting action (14 bytes) to 
leveldb took 5.358479ms
I0806 15:06:59.325781  9234 replica.cpp:676] Persisted action at 0
I0806 15:06:59.325999  9234 replica.cpp:655] Replica received learned notice 
for position 0
I0806 15:06:59.328487  9234 leveldb.cpp:343] Persisting action (16 bytes) to 
leveldb took 2.458504ms
I0806 

[jira] [Updated] (MESOS-1668) Handle a temporary one-way master --> slave socket closure.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1668:
---


Placing this under reconciliation because, although extremely rare, it can lead 
to some inconsistent state between the master and slave for an arbitrary amount 
of time. For example, if the launchTask message is dropped as a result of the 
socket closure between Master → Slave in the scenario above.

 Handle a temporary one-way master --> slave socket closure.
 ---

 Key: MESOS-1668
 URL: https://issues.apache.org/jira/browse/MESOS-1668
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler
Priority: Minor
  Labels: reliability

 In MESOS-1529, we realized that it's possible for a slave to remain 
 disconnected in the master if the following occurs:
 → Master and Slave connected operating normally.
 → Temporary one-way network failure, master→slave link breaks.
 → Master marks slave as disconnected.
 → Network restored and health checking continues normally, slave is not 
 removed as a result. Slave does not attempt to re-register since it is 
 receiving pings once again.
 → Slave remains disconnected according to the master, and the slave does not 
 try to re-register. Bad!
 We were originally thinking of using a failover timeout in the master to 
 remove these slaves that don't re-register. However, it can be dangerous when 
 ZooKeeper issues are preventing the slave from re-registering with the 
 master; we do not want to remove a ton of slaves in this situation.
 Rather, when the slave is health checking correctly but does not re-register 
 within a timeout, we could send a registration request from the master to the 
 slave, telling the slave that it must re-register. This message could also be 
 used when receiving status updates (or other messages) from slaves that are 
 disconnected in the master.
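 As a rough sketch of that idea (the struct, flags, and callback below are 
 hypothetical, not existing Mesos types): when the re-registration timeout 
 fires for a slave that is still health checking, the master asks it to 
 re-register instead of removing it.
 {code}
 #include <string>
 
 // Hypothetical bookkeeping for a slave, as seen by the master.
 struct SlaveEntry
 {
   std::string pid;
   bool disconnected;  // Marked disconnected after the one-way failure.
   bool healthy;       // Pings are still being acknowledged.
 };
 
 // Invoked when a disconnected slave has not re-registered within a timeout.
 // 'requestReregister' stands in for sending a "please re-register" message;
 // the slave would respond with its normal re-registration message.
 void onReregistrationTimeout(
     const SlaveEntry& slave,
     void (*requestReregister)(const std::string& pid))
 {
   if (slave.disconnected && slave.healthy) {
     requestReregister(slave.pid);
   }
 }
 {code}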



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky

2014-08-06 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088425#comment-14088425
 ] 

Benjamin Mahler commented on MESOS-1613:


So far only Twitter CI is exposing this flakiness. I've pasted the full logs in 
the comment above; are you looking for logging from the command executor? We 
may want to investigate wiring up the tests to expose the executor logs in the 
output to make this easier for you to debug.

 HealthCheckTest.ConsecutiveFailures is flaky
 

 Key: MESOS-1613
 URL: https://issues.apache.org/jira/browse/MESOS-1613
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.20.0
 Environment: Ubuntu 10.04 GCC
Reporter: Vinod Kone
Assignee: Timothy Chen

 {code}
 [ RUN  ] HealthCheckTest.ConsecutiveFailures
 Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV'
 I0717 04:39:59.288471  5009 leveldb.cpp:176] Opened db in 21.575631ms
 I0717 04:39:59.295274  5009 leveldb.cpp:183] Compacted db in 6.471982ms
 I0717 04:39:59.295552  5009 leveldb.cpp:198] Created db iterator in 16783ns
 I0717 04:39:59.296026  5009 leveldb.cpp:204] Seeked to beginning of db in 
 2125ns
 I0717 04:39:59.296257  5009 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 10747ns
 I0717 04:39:59.296584  5009 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0717 04:39:59.297322  5033 recover.cpp:425] Starting replica recovery
 I0717 04:39:59.297413  5033 recover.cpp:451] Replica is in EMPTY status
 I0717 04:39:59.297824  5033 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0717 04:39:59.297899  5033 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0717 04:39:59.297997  5033 recover.cpp:542] Updating replica status to 
 STARTING
 I0717 04:39:59.301985  5031 master.cpp:288] Master 
 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280
 I0717 04:39:59.302026  5031 master.cpp:325] Master only allowing 
 authenticated frameworks to register
 I0717 04:39:59.302032  5031 master.cpp:330] Master only allowing 
 authenticated slaves to register
 I0717 04:39:59.302039  5031 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials'
 I0717 04:39:59.302283  5031 master.cpp:359] Authorization enabled
 I0717 04:39:59.302971  5031 hierarchical_allocator_process.hpp:301] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:40280
 I0717 04:39:59.303022  5031 master.cpp:122] No whitelist given. Advertising 
 offers for all slaves
 I0717 04:39:59.303390  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 5.325097ms
 I0717 04:39:59.303419  5033 replica.cpp:320] Persisted replica status to 
 STARTING
 I0717 04:39:59.304076  5030 master.cpp:1128] The newly elected leader is 
 master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009
 I0717 04:39:59.304095  5030 master.cpp:1141] Elected as the leading master!
 I0717 04:39:59.304102  5030 master.cpp:959] Recovering from registrar
 I0717 04:39:59.304182  5030 registrar.cpp:313] Recovering registrar
 I0717 04:39:59.304635  5033 recover.cpp:451] Replica is in STARTING status
 I0717 04:39:59.304962  5033 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0717 04:39:59.305026  5033 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0717 04:39:59.305130  5033 recover.cpp:542] Updating replica status to VOTING
 I0717 04:39:59.310416  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 5.204157ms
 I0717 04:39:59.310459  5033 replica.cpp:320] Persisted replica status to 
 VOTING
 I0717 04:39:59.310534  5033 recover.cpp:556] Successfully joined the Paxos 
 group
 I0717 04:39:59.310607  5033 recover.cpp:440] Recover process terminated
 I0717 04:39:59.310773  5033 log.cpp:656] Attempting to start the writer
 I0717 04:39:59.311157  5033 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I0717 04:39:59.313451  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 2.271822ms
 I0717 04:39:59.313627  5033 replica.cpp:342] Persisted promised to 1
 I0717 04:39:59.318038  5031 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0717 04:39:59.318430  5031 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I0717 04:39:59.323459  5031 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 5.004323ms
 I0717 04:39:59.323493  5031 replica.cpp:676] Persisted action at 0
 I0717 04:39:59.323799  5031 replica.cpp:508] Replica received write request 
 for position 0
 I0717 04:39:59.323837  5031 

[jira] [Updated] (MESOS-887) Scheduler Driver should use exited() to detect disconnection with Master.

2014-08-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-887:
--

Labels: framework reliability  (was: )

 Scheduler Driver should use exited() to detect disconnection with Master.
 -

 Key: MESOS-887
 URL: https://issues.apache.org/jira/browse/MESOS-887
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.13.0, 0.14.0, 0.14.1, 0.14.2, 0.16.0, 0.15.0
Reporter: Benjamin Mahler
  Labels: framework, reliability

 The Scheduler Driver already links with the master, but it does not use the 
 built in exited() notification from libprocess to detect socket closure.
 This would fast-track the delay from zookeeper detecting a leadership change, 
 and would minimize the number of dropped messages leaving the driver.
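 For illustration, a minimal libprocess sketch (the process name is 
 hypothetical; this is not the actual SchedulerProcess code) of pairing link() 
 with an exited() override to notice the socket closing:
 {code}
 #include <process/process.hpp>
 
 using process::UPID;
 
 class MasterWatcher : public process::Process<MasterWatcher>
 {
 public:
   explicit MasterWatcher(const UPID& _master) : master(_master) {}
 
 protected:
   virtual void initialize()
   {
     // link() keeps a persistent socket to the master; libprocess will
     // invoke exited() if that socket closes.
     link(master);
   }
 
   virtual void exited(const UPID& pid)
   {
     if (pid == master) {
       // Treat this as a disconnection right away, instead of waiting
       // for the ZooKeeper-based detector to notice a leader change.
     }
   }
 
 private:
   const UPID master;
 };
 {code}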



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky

2014-08-07 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090065#comment-14090065
 ] 

Benjamin Mahler commented on MESOS-1613:


[~tnachen] it's failing on ASF CI as well, can you triage or disable in the 
interim?

 HealthCheckTest.ConsecutiveFailures is flaky
 

 Key: MESOS-1613
 URL: https://issues.apache.org/jira/browse/MESOS-1613
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.20.0
 Environment: Ubuntu 10.04 GCC
Reporter: Vinod Kone
Assignee: Timothy Chen

 {code}
 [ RUN  ] HealthCheckTest.ConsecutiveFailures
 Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV'
 I0717 04:39:59.288471  5009 leveldb.cpp:176] Opened db in 21.575631ms
 I0717 04:39:59.295274  5009 leveldb.cpp:183] Compacted db in 6.471982ms
 I0717 04:39:59.295552  5009 leveldb.cpp:198] Created db iterator in 16783ns
 I0717 04:39:59.296026  5009 leveldb.cpp:204] Seeked to beginning of db in 
 2125ns
 I0717 04:39:59.296257  5009 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 10747ns
 I0717 04:39:59.296584  5009 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0717 04:39:59.297322  5033 recover.cpp:425] Starting replica recovery
 I0717 04:39:59.297413  5033 recover.cpp:451] Replica is in EMPTY status
 I0717 04:39:59.297824  5033 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0717 04:39:59.297899  5033 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0717 04:39:59.297997  5033 recover.cpp:542] Updating replica status to 
 STARTING
 I0717 04:39:59.301985  5031 master.cpp:288] Master 
 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280
 I0717 04:39:59.302026  5031 master.cpp:325] Master only allowing 
 authenticated frameworks to register
 I0717 04:39:59.302032  5031 master.cpp:330] Master only allowing 
 authenticated slaves to register
 I0717 04:39:59.302039  5031 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials'
 I0717 04:39:59.302283  5031 master.cpp:359] Authorization enabled
 I0717 04:39:59.302971  5031 hierarchical_allocator_process.hpp:301] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:40280
 I0717 04:39:59.303022  5031 master.cpp:122] No whitelist given. Advertising 
 offers for all slaves
 I0717 04:39:59.303390  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 5.325097ms
 I0717 04:39:59.303419  5033 replica.cpp:320] Persisted replica status to 
 STARTING
 I0717 04:39:59.304076  5030 master.cpp:1128] The newly elected leader is 
 master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009
 I0717 04:39:59.304095  5030 master.cpp:1141] Elected as the leading master!
 I0717 04:39:59.304102  5030 master.cpp:959] Recovering from registrar
 I0717 04:39:59.304182  5030 registrar.cpp:313] Recovering registrar
 I0717 04:39:59.304635  5033 recover.cpp:451] Replica is in STARTING status
 I0717 04:39:59.304962  5033 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0717 04:39:59.305026  5033 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0717 04:39:59.305130  5033 recover.cpp:542] Updating replica status to VOTING
 I0717 04:39:59.310416  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 5.204157ms
 I0717 04:39:59.310459  5033 replica.cpp:320] Persisted replica status to 
 VOTING
 I0717 04:39:59.310534  5033 recover.cpp:556] Successfully joined the Paxos 
 group
 I0717 04:39:59.310607  5033 recover.cpp:440] Recover process terminated
 I0717 04:39:59.310773  5033 log.cpp:656] Attempting to start the writer
 I0717 04:39:59.311157  5033 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I0717 04:39:59.313451  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 2.271822ms
 I0717 04:39:59.313627  5033 replica.cpp:342] Persisted promised to 1
 I0717 04:39:59.318038  5031 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0717 04:39:59.318430  5031 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I0717 04:39:59.323459  5031 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 5.004323ms
 I0717 04:39:59.323493  5031 replica.cpp:676] Persisted action at 0
 I0717 04:39:59.323799  5031 replica.cpp:508] Replica received write request 
 for position 0
 I0717 04:39:59.323837  5031 leveldb.cpp:438] Reading position from leveldb 
 took 21901ns
 I0717 04:39:59.329038  5031 leveldb.cpp:343] Persisting action (14 bytes) to 
 leveldb took 5.175998ms
 I0717 

[jira] [Comment Edited] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky

2014-08-07 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090065#comment-14090065
 ] 

Benjamin Mahler edited comment on MESOS-1613 at 8/7/14 11:56 PM:
-

[~tnachen] it's failing on ASF CI as well, can you triage or disable in the 
interim?

E.g.
https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2299/consoleFull


was (Author: bmahler):
[~tnachen] it's failing on ASF CI as well, can you triage or disable in the 
interim?

 HealthCheckTest.ConsecutiveFailures is flaky
 

 Key: MESOS-1613
 URL: https://issues.apache.org/jira/browse/MESOS-1613
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.20.0
 Environment: Ubuntu 10.04 GCC
Reporter: Vinod Kone
Assignee: Timothy Chen

 {code}
 [ RUN  ] HealthCheckTest.ConsecutiveFailures
 Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV'
 I0717 04:39:59.288471  5009 leveldb.cpp:176] Opened db in 21.575631ms
 I0717 04:39:59.295274  5009 leveldb.cpp:183] Compacted db in 6.471982ms
 I0717 04:39:59.295552  5009 leveldb.cpp:198] Created db iterator in 16783ns
 I0717 04:39:59.296026  5009 leveldb.cpp:204] Seeked to beginning of db in 
 2125ns
 I0717 04:39:59.296257  5009 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 10747ns
 I0717 04:39:59.296584  5009 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0717 04:39:59.297322  5033 recover.cpp:425] Starting replica recovery
 I0717 04:39:59.297413  5033 recover.cpp:451] Replica is in EMPTY status
 I0717 04:39:59.297824  5033 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0717 04:39:59.297899  5033 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0717 04:39:59.297997  5033 recover.cpp:542] Updating replica status to 
 STARTING
 I0717 04:39:59.301985  5031 master.cpp:288] Master 
 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280
 I0717 04:39:59.302026  5031 master.cpp:325] Master only allowing 
 authenticated frameworks to register
 I0717 04:39:59.302032  5031 master.cpp:330] Master only allowing 
 authenticated slaves to register
 I0717 04:39:59.302039  5031 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials'
 I0717 04:39:59.302283  5031 master.cpp:359] Authorization enabled
 I0717 04:39:59.302971  5031 hierarchical_allocator_process.hpp:301] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:40280
 I0717 04:39:59.303022  5031 master.cpp:122] No whitelist given. Advertising 
 offers for all slaves
 I0717 04:39:59.303390  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 5.325097ms
 I0717 04:39:59.303419  5033 replica.cpp:320] Persisted replica status to 
 STARTING
 I0717 04:39:59.304076  5030 master.cpp:1128] The newly elected leader is 
 master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009
 I0717 04:39:59.304095  5030 master.cpp:1141] Elected as the leading master!
 I0717 04:39:59.304102  5030 master.cpp:959] Recovering from registrar
 I0717 04:39:59.304182  5030 registrar.cpp:313] Recovering registrar
 I0717 04:39:59.304635  5033 recover.cpp:451] Replica is in STARTING status
 I0717 04:39:59.304962  5033 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0717 04:39:59.305026  5033 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0717 04:39:59.305130  5033 recover.cpp:542] Updating replica status to VOTING
 I0717 04:39:59.310416  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 5.204157ms
 I0717 04:39:59.310459  5033 replica.cpp:320] Persisted replica status to 
 VOTING
 I0717 04:39:59.310534  5033 recover.cpp:556] Successfully joined the Paxos 
 group
 I0717 04:39:59.310607  5033 recover.cpp:440] Recover process terminated
 I0717 04:39:59.310773  5033 log.cpp:656] Attempting to start the writer
 I0717 04:39:59.311157  5033 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I0717 04:39:59.313451  5033 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 2.271822ms
 I0717 04:39:59.313627  5033 replica.cpp:342] Persisted promised to 1
 I0717 04:39:59.318038  5031 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0717 04:39:59.318430  5031 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I0717 04:39:59.323459  5031 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 5.004323ms
 I0717 04:39:59.323493  5031 replica.cpp:676] Persisted action at 0
 I0717 

[jira] [Commented] (MESOS-1620) Reconciliation does not send back tasks pending validation / authorization.

2014-08-11 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093505#comment-14093505
 ] 

Benjamin Mahler commented on MESOS-1620:


Review chain for this one, did some cleanups along the way:

https://reviews.apache.org/r/24582/
https://reviews.apache.org/r/24583/
https://reviews.apache.org/r/24576/
https://reviews.apache.org/r/24515/
https://reviews.apache.org/r/24516/

 Reconciliation does not send back tasks pending validation / authorization.
 ---

 Key: MESOS-1620
 URL: https://issues.apache.org/jira/browse/MESOS-1620
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 Per Vinod's feedback on https://reviews.apache.org/r/23542/, we do not send 
 back TASK_STAGING for those tasks that are pending in the Master (validation 
 / authorization still in progress).
 For both implicit and explicit task reconciliation, the master could reply 
 with TASK_STAGING for these tasks, as this provides additional information to 
 the framework.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1620) Reconciliation does not send back tasks pending validation / authorization.

2014-08-11 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1620:
---

Shepherd: Vinod Kone  (was: Dominic Hamon)

 Reconciliation does not send back tasks pending validation / authorization.
 ---

 Key: MESOS-1620
 URL: https://issues.apache.org/jira/browse/MESOS-1620
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 Per Vinod's feedback on https://reviews.apache.org/r/23542/, we do not send 
 back TASK_STAGING for those tasks that are pending in the Master (validation 
 / authorization still in progress).
 For both implicit and explicit task reconciliation, the master could reply 
 with TASK_STAGING for these tasks, as this provides additional information to 
 the framework.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1700) ThreadLocal does not release pthread keys or log properly.

2014-08-13 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1700:
---

Sprint: Q3 Sprint 3

 ThreadLocal does not release pthread keys or log properly.
 --

 Key: MESOS-1700
 URL: https://issues.apache.org/jira/browse/MESOS-1700
 Project: Mesos
  Issue Type: Bug
  Components: stout
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 The ThreadLocal<T> abstraction in stout does not release the allocated 
 pthread keys upon destruction:
 https://github.com/apache/mesos/blob/0.19.1/3rdparty/libprocess/3rdparty/stout/include/stout/thread.hpp#L22
 It also does not log the errors correctly. Fortunately this does not impact 
 mesos at the current time.
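 A simplified sketch of both fixes (not the actual stout interface, just an 
 illustration): release the key in the destructor and log the pthread return 
 codes rather than errno.
 {code}
 #include <cstring>
 #include <pthread.h>
 
 #include <glog/logging.h>
 
 template <typename T>
 class ThreadLocal
 {
 public:
   ThreadLocal()
   {
     int result = pthread_key_create(&key, NULL);
     CHECK_EQ(0, result) << "pthread_key_create: " << strerror(result);
   }
 
   ~ThreadLocal()
   {
     // Release the key so repeated construction does not exhaust the
     // per-process limit (PTHREAD_KEYS_MAX).
     int result = pthread_key_delete(key);
     LOG_IF(ERROR, result != 0)
       << "pthread_key_delete: " << strerror(result);
   }
 
   ThreadLocal<T>& operator=(T* t)
   {
     int result = pthread_setspecific(key, t);
     CHECK_EQ(0, result) << "pthread_setspecific: " << strerror(result);
     return *this;
   }
 
   operator T*() const
   {
     return reinterpret_cast<T*>(pthread_getspecific(key));
   }
 
 private:
   pthread_key_t key;
 };
 {code}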



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1700) ThreadLocal does not release pthread keys or log properly.

2014-08-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096175#comment-14096175
 ] 

Benjamin Mahler commented on MESOS-1700:


https://reviews.apache.org/r/24669/

 ThreadLocal does not release pthread keys or log properly.
 --

 Key: MESOS-1700
 URL: https://issues.apache.org/jira/browse/MESOS-1700
 Project: Mesos
  Issue Type: Bug
  Components: stout
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 The ThreadLocal<T> abstraction in stout does not release the allocated 
 pthread keys upon destruction:
 https://github.com/apache/mesos/blob/0.19.1/3rdparty/libprocess/3rdparty/stout/include/stout/thread.hpp#L22
 It also does not log the errors correctly. Fortunately this does not impact 
 mesos at the current time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1714) The C++ 'Resources' abstraction should keep the underlying resources flattened.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1714:
--

 Summary: The C++ 'Resources' abstraction should keep the 
underlying resources flattened.
 Key: MESOS-1714
 URL: https://issues.apache.org/jira/browse/MESOS-1714
 Project: Mesos
  Issue Type: Bug
  Components: c++ api
Reporter: Benjamin Mahler


Currently, the C++ Resources class does not ensure that the underlying 
Resources protobufs are kept flat.

This is an issue because some of the methods, e.g. 
[Resources::get|https://github.com/apache/mesos/blob/0.19.1/src/common/resources.cpp#L269],
 assume the resources are flat.

There is code that constructs unflattened resources, e.g. 
[Slave::launchExecutor|https://github.com/apache/mesos/blob/0.19.1/src/slave/slave.cpp#L3353].
 We could prevent this type of construction; however, it is perfectly fine if we 
ensure the C++ 'Resources' class performs flattening.
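 To illustrate the intent (simplified scalar-only types, not the actual 
 Resources implementation): keep the collection flat on every insertion by 
 merging a new resource into an existing matching one instead of appending a 
 duplicate entry.
 {code}
 #include <string>
 #include <vector>
 
 struct Resource
 {
   std::string name;
   std::string role;
   double value;
 };
 
 class FlatResources
 {
 public:
   FlatResources& operator+=(const Resource& that)
   {
     for (size_t i = 0; i < resources.size(); i++) {
       if (resources[i].name == that.name && resources[i].role == that.role) {
         // Merge rather than append, so lookups that assume a single
         // entry per (name, role) remain correct.
         resources[i].value += that.value;
         return *this;
       }
     }
     resources.push_back(that);
     return *this;
   }
 
 private:
   std::vector<Resource> resources;
 };
 {code}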



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1715) The slave does not send pending tasks during re-registration.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1715:
--

 Summary: The slave does not send pending tasks during 
re-registration.
 Key: MESOS-1715
 URL: https://issues.apache.org/jira/browse/MESOS-1715
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler


In what looks like an oversight, the pending tasks in the slave 
(Framework::pending) are not sent in the re-registration message.

This can lead to spurious TASK_LOST notifications being generated by the master 
when it falsely thinks the tasks are not present on the slave.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1717:
--

 Summary: The slave does not show pending tasks in the JSON 
endpoints.
 Key: MESOS-1717
 URL: https://issues.apache.org/jira/browse/MESOS-1717
 Project: Mesos
  Issue Type: Bug
  Components: json api, slave
Reporter: Benjamin Mahler


The slave does not show pending tasks in the /state.json endpoint.

This is a bit tricky to add since we rely on knowing the executor directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1718) Command executor can overcommit the slave.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1718:
--

 Summary: Command executor can overcommit the slave.
 Key: MESOS-1718
 URL: https://issues.apache.org/jira/browse/MESOS-1718
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler


Currently we give a small amount of resources to the command executor, in 
addition to resources used by the command task:

https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
{code: title=}
ExecutorInfo Slave::getExecutorInfo(
    const FrameworkID& frameworkId,
    const TaskInfo& task)
{
  ...
    // Add an allowance for the command executor. This does lead to a
    // small overcommit of resources.
    executor.mutable_resources()->MergeFrom(
        Resources::parse(
          "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
          "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
  ...
}
{code}

This leads to an overcommit of the slave. Ideally, for command tasks we can 
transfer all of the task resources to the executor at the slave / isolation 
level.
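One possible shape of that fix, sketched below (a hypothetical helper, not the 
actual Slave code), is to charge the task's resources to the executor itself 
rather than adding a separate allowance on top:

{code}
#include <mesos/mesos.hpp>

using mesos::ExecutorInfo;
using mesos::TaskInfo;

ExecutorInfo getCommandExecutorInfo(const TaskInfo& task)
{
  ExecutorInfo executor;
  // ... executor id, command, etc. elided ...

  // Account the task's resources against the executor container directly,
  // instead of adding DEFAULT_EXECUTOR_CPUS / DEFAULT_EXECUTOR_MEM as an
  // extra allowance (which overcommits the slave).
  executor.mutable_resources()->CopyFrom(task.resources());

  return executor;
}
{code}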



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MESOS-1720) Slave should send exited executor message when the executor is never launched.

2014-08-18 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1720:
--

 Summary: Slave should send exited executor message when the 
executor is never launched.
 Key: MESOS-1720
 URL: https://issues.apache.org/jira/browse/MESOS-1720
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler


When the slave sends TASK_LOST before launching an executor for a task, the 
slave does not send an exited executor message to the master.

Since the master receives no exited executor message, it still thinks the 
executor's resources are consumed on the slave.

One possible fix for this would be to send the exited executor message to the 
master in these cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2014-08-18 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101757#comment-14101757
 ] 

Benjamin Mahler commented on MESOS-1466:


We're going to proceed with a mitigation of this by rejecting tasks once the 
slave is overcommitted:
https://issues.apache.org/jira/browse/MESOS-1721

However, we would also like to ensure that this kind of race is not possible. 
One solution is to use master acknowledgments for executor exits:

(1) When an executor terminates (or the executor could not be launched: 
MESOS-1720), we send an exited executor message.
(2) The master acknowledges these messages.
(3) The slave will not accept tasks for unacknowledged terminal executors (this 
must include those executors that could not be launched, per MESOS-1720).

The result of this is that a new executor cannot be launched until the master 
is aware of the old executor exiting.
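A small self-contained sketch of steps (1)-(3) (a hypothetical tracker, not the 
actual slave code):

{code}
#include <set>
#include <string>

class ExecutorAckTracker
{
public:
  // (1) Executor terminated (or could not be launched): remember it until
  // the master acknowledges the exited executor message.
  void onExecutorTerminal(const std::string& executorId)
  {
    unacknowledged.insert(executorId);
  }

  // (2) Master acknowledged the exited executor message.
  void onAcknowledged(const std::string& executorId)
  {
    unacknowledged.erase(executorId);
  }

  // (3) A new task for this executor is only accepted once the master
  // knows the old executor is gone.
  bool canLaunch(const std::string& executorId) const
  {
    return unacknowledged.count(executorId) == 0;
  }

private:
  std::set<std::string> unacknowledged;
};
{code}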

 Race between executor exited event and launch task can cause overcommit of 
 resources
 

 Key: MESOS-1466
 URL: https://issues.apache.org/jira/browse/MESOS-1466
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Vinod Kone
Assignee: Benjamin Mahler
  Labels: reliability

 The following sequence of events can cause an overcommit
 -- Launch task is called for a task whose executor is already running
 -- Executor's resources are not accounted for on the master
 -- Executor exits and the event is enqueued behind launch tasks on the master
 -- Master sends the task to the slave, which needs to commit resources 
 for the task and the (new) executor.
 -- Master processes the executor exited event and re-offers the executor's 
 resources causing an overcommit of resources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.

2014-08-19 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1715:
---

Summary: The slave does not send pending tasks / executors during 
re-registration.  (was: The slave does not send pending tasks during 
re-registration.)

 The slave does not send pending tasks / executors during re-registration.
 -

 Key: MESOS-1715
 URL: https://issues.apache.org/jira/browse/MESOS-1715
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 In what looks like an oversight, the pending tasks in the slave 
 (Framework::pending) are not sent in the re-registration message.
 This can lead to spurious TASK_LOST notifications being generated by the 
 master when it falsely thinks the tasks are not present on the slave.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.

2014-08-19 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1715:
---

Description: 
In what looks like an oversight, the pending tasks and executors in the slave 
(Framework::pending) are not sent in the re-registration message.

For tasks, this can lead to spurious TASK_LOST notifications being generated by 
the master when it falsely thinks the tasks are not present on the slave.

For executors, this can lead to under-accounting in the master, causing an 
overcommit on the slave.

  was:
In what looks like an oversight, the pending tasks in the slave 
(Framework::pending) are not sent in the re-registration message.

This can lead to spurious TASK_LOST notifications being generated by the master 
when it falsely thinks the tasks are not present on the slave.


 The slave does not send pending tasks / executors during re-registration.
 -

 Key: MESOS-1715
 URL: https://issues.apache.org/jira/browse/MESOS-1715
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 In what looks like an oversight, the pending tasks and executors in the slave 
 (Framework::pending) are not sent in the re-registration message.
 For tasks, this can lead to spurious TASK_LOST notifications being generated 
 by the master when it falsely thinks the tasks are not present on the slave.
 For executors, this can lead to under-accounting in the master, causing an 
 overcommit on the slave.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1734) Reduce compile time replacing macro expansions with variadic templates

2014-08-26 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111420#comment-14111420
 ] 

Benjamin Mahler commented on MESOS-1734:


Hi [~preillyme], we can't yet assume C++11 support:
https://issues.apache.org/jira/browse/MESOS-750

[~dhamon] would have a better idea of when we'll move to C++11 as a requirement.

 Reduce compile time replacing macro expansions with variadic templates
 --

 Key: MESOS-1734
 URL: https://issues.apache.org/jira/browse/MESOS-1734
 Project: Mesos
  Issue Type: Improvement
Reporter: Patrick Reilly
Assignee: Patrick Reilly
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MESOS-1735) Better Startup Failure For Duplicate Master

2014-09-02 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118544#comment-14118544
 ] 

Benjamin Mahler commented on MESOS-1735:


We could use the EXIT approach from stout/exit.hpp here to avoid the abort / 
stacktrace and to include a helpful message.
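A rough sketch of what that could look like (assuming stout's EXIT macro 
streams a message and exits with the given status):

{code}
#include <cstdlib>
#include <string>

#include <stout/exit.hpp>

// Sketch: on a bind failure, exit with a friendly message instead of the
// CHECK-style abort and stack trace.
void failToBind(const std::string& address)
{
  EXIT(EXIT_FAILURE)
    << "Failed to bind to " << address << ": address already in use. "
    << "Another mesos-master (or other process) is likely already "
    << "listening on this address.";
}
{code}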

 Better Startup Failure For Duplicate Master
 ---

 Key: MESOS-1735
 URL: https://issues.apache.org/jira/browse/MESOS-1735
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.20.0
 Environment: Ubuntu 12.04
Reporter: Ken Sipe

 The error message is cryptic when starting a mesos-master while another 
 mesos-master is already running. The error message is:
 mesos-master --ip=192.168.74.174 --work_dir=~/mesos
 WARNING: Logging before InitGoogleLogging() is written to STDERR
 F0826 20:24:56.940961  3057 process.cpp:1632] Failed to initialize, bind: 
 Address already in use [98]
 *** Check failure stack trace: ***
 Aborted (core dumped)
 This can be a new person's first experience.  It isn't clear to them that the 
 process is already running.  And they are lost as to what to do next.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1714) The C++ 'Resources' abstraction should keep the underlying resources flattened.

2014-09-03 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120138#comment-14120138
 ] 

Benjamin Mahler commented on MESOS-1714:


For now, this review avoids constructing an unflattened Resources object:
https://reviews.apache.org/r/25306/

 The C++ 'Resources' abstraction should keep the underlying resources 
 flattened.
 ---

 Key: MESOS-1714
 URL: https://issues.apache.org/jira/browse/MESOS-1714
 Project: Mesos
  Issue Type: Bug
  Components: c++ api
Reporter: Benjamin Mahler

 Currently, the C++ Resources class does not ensure that the underlying 
 Resources protobufs are kept flat.
 This is an issue because some of the methods, e.g. 
 [Resources::get|https://github.com/apache/mesos/blob/0.19.1/src/common/resources.cpp#L269],
  assume the resources are flat.
 There is code that constructs unflattened resources, e.g. 
 [Slave::launchExecutor|https://github.com/apache/mesos/blob/0.19.1/src/slave/slave.cpp#L3353].
  We could prevent this type of construction; however, it is perfectly fine if 
 we ensure the C++ 'Resources' class performs flattening.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (MESOS-733) Speedup slave recovery tests

2014-09-03 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler closed MESOS-733.
-
Resolution: Incomplete

Closing this in favor of an epic to track testing speedups.

 Speedup slave recovery tests
 

 Key: MESOS-733
 URL: https://issues.apache.org/jira/browse/MESOS-733
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
  Labels: twitter

 Several of the tests are slow now that they do offer checking. I suspect this 
 is due to the Clock semantics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1757) Speed up the tests.

2014-09-03 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1757:
--

 Summary: Speed up the tests.
 Key: MESOS-1757
 URL: https://issues.apache.org/jira/browse/MESOS-1757
 Project: Mesos
  Issue Type: Epic
  Components: technical debt, test
Reporter: Benjamin Mahler


The full test suite is exceeding the 7 minute mark (440 seconds on my machine); 
this epic tracks techniques to improve this:

# The reaper takes a full second to reap an exited process (MESOS-1199), this 
adds a second to each slave recovery test, and possibly more for things that 
rely on Subprocess.
# The command executor sleeps for a second when shutting down (MESOS-442), this 
adds a second to every test that uses the command executor.
# Now that the master and the slave have to perform sync'ed disk writes, 
consider using a ramdisk to speed up the disk writes.

Additional options that hopefully will not be necessary:

# Use automake's [parallel test 
harness|http://www.gnu.org/software/automake/manual/html_node/Parallel-Test-Harness.html]
 to compile tests separately and run tests in parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1758) Freezer failure leads to lost task during container destruction.

2014-09-03 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1758:
--

 Summary: Freezer failure leads to lost task during container 
destruction.
 Key: MESOS-1758
 URL: https://issues.apache.org/jira/browse/MESOS-1758
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Benjamin Mahler


In the past we've seen numerous issues around the freezer. Lately, on the 
2.6.44 kernel, we've seen issues where we're unable to freeze the cgroup:

(1) An oom occurs.
(2) No indication of oom in the kernel logs.
(3) The slave is unable to freeze the cgroup.
(4) The task is marked as lost.

{noformat}
I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 
15488MB Maximum Used: 15488MB

MEMORY STATISTICS:
cache 7958691840
rss 8281653248
mapped_file 9474048
pgpgin 4487861
pgpgout 522933
pgfault 2533780
pgmajfault 11
inactive_anon 0
active_anon 8281653248
inactive_file 7631708160
active_file 326852608
unevictable 0
hierarchical_memory_limit 16240345088
total_cache 7958691840
total_rss 8281653248
total_mapped_file 9474048
total_pgpgin 4487861
total_pgpgout 522933
total_pgfault 2533780
total_pgmajfault 11
total_inactive_anon 0
total_active_anon 8281653248
total_inactive_file 7631728640
total_active_file 326852608
total_unevictable 0
I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container 
bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource 
mem(*):1.62403e+10 and will be terminated
I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 
'bbb9732a-d600-4c1b-b326-846338c608c3'
I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.710848ms
I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.588224ms
I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
2.15296ms
I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.643008ms
I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. Max allowed 
age: 5.630238827780799days
I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
1.511168ms
I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for 
'/slave(1)/stats.json'
E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of 
framework '201104070004-002563-' failed: Failed to destroy container: 
discarded future
I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST 
(UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 
201104070004-002563- from @0.0.0.0:0
I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' 
to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:24.963541 25471 cpushare.cpp:338] Updated 'cpu.shares' to 256 (cpus 
0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:24.964756 25471 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 
100ms and 'cpu.cfs_quota_us' to 25ms (cpus 0.25) for container 
bbb9732a-d600-4c1b-b326-846338c608c3
I0903 16:47:43.406610 25476 status_update_manager.cpp:320] Received status 
update TASK_LOST (UUID: 

[jira] [Resolved] (MESOS-186) Resource offers should be rescinded after some configurable timeout

2014-09-05 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-186.
---
   Resolution: Fixed
Fix Version/s: 0.21.0

{noformat}
commit 707bf3b1d6f042ee92e7a291d3f74a20ae2d494b
Author: Kapil Arya ka...@mesosphere.io
Date:   Fri Sep 5 11:15:15 2014 -0700

Added optional --offer_timeout to rescind unused offers.

The ability to set an offer timeout helps prevent unfair resource
allocations in the face of frameworks that hoard offers, or that
accidentally drop offers.

When optimistic offers are added, hoarding will not affect the
fairness for other frameworks.

Review: https://reviews.apache.org/r/22066
{noformat}

 Resource offers should be rescinded after some configurable timeout
 ---

 Key: MESOS-186
 URL: https://issues.apache.org/jira/browse/MESOS-186
 Project: Mesos
  Issue Type: Improvement
  Components: framework
Reporter: Benjamin Hindman
Assignee: Timothy Chen
 Fix For: 0.21.0


 Problem: a framework has a bug and holds on to resource offers by accident 
 for 24 hours.
 One suggestion: resource offers should be rescinded after some configurable 
 timeout.
 Possible issue: this might interfere with frameworks that are hoarding. But 
 one possible solution here is to add another API call which checks the status 
 of resource offers (i.e., remindAboutOffer).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1476) Provide endpoints for deactivating / activating slaves.

2014-09-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1476:
---
Sprint:   (was: Mesos Q3 Sprint 5)

 Provide endpoints for deactivating / activating slaves.
 ---

 Key: MESOS-1476
 URL: https://issues.apache.org/jira/browse/MESOS-1476
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Benjamin Mahler
  Labels: gsoc2014

 When performing maintenance operations on slaves, it is important to allow 
 these slaves to be drained of their tasks.
 The first essential primitive of draining slaves is to prevent them from 
 running more tasks. This can be achieved by deactivating them: stop sending 
 their resource offers to frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-1476) Provide endpoints for deactivating / activating slaves.

2014-09-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-1476:
--

Assignee: (was: Alexandra Sava)

Un-assigning for now since there is no longer a need for this with the updated 
maintenance design in MESOS-1474.

 Provide endpoints for deactivating / activating slaves.
 ---

 Key: MESOS-1476
 URL: https://issues.apache.org/jira/browse/MESOS-1476
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Benjamin Mahler
  Labels: gsoc2014

 When performing maintenance operations on slaves, it is important to allow 
 these slaves to be drained of their tasks.
 The first essential primitive of draining slaves is to prevent them from 
 running more tasks. This can be achieved by deactivating them: stop sending 
 their resource offers to frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1592) Design inverse resource offer support

2014-09-08 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126421#comment-14126421
 ] 

Benjamin Mahler commented on MESOS-1592:


Moving this to reviewable as inverse offers were designed as part of the 
maintenance work: MESOS-1474.

We are currently considering how persistent resources will interact with 
inverse offers and the other maintenance primitives.

 Design inverse resource offer support
 -

 Key: MESOS-1592
 URL: https://issues.apache.org/jira/browse/MESOS-1592
 Project: Mesos
  Issue Type: Task
  Components: allocation
Reporter: Benjamin Mahler
Assignee: Alexandra Sava

 An inverse resource offer means that Mesos is requesting resources back 
 from the framework, possibly within some time interval.
 This can be leveraged initially to provide more automated cluster 
 maintenance, by offering schedulers the opportunity to move tasks to 
 compensate for planned maintenance. Operators can set a time limit on how 
 long to wait for schedulers to relocate tasks before the tasks are forcibly 
 terminated.
 Inverse resource offers have many other potential uses, as it opens the 
 opportunity for the allocator to attempt to move tasks in the cluster through 
 the co-operation of the framework, possibly providing better 
 over-subscription, fairness, etc.
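 As a rough illustration of the concept (hypothetical types, not the eventual 
 protobuf API), an inverse offer is essentially a request for resources back, 
 optionally bounded by the operator's deadline:
 {code}
 // Sketch only: an inverse offer asks a framework to return resources on a
 // slave, optionally before a deadline after which the tasks may be
 // forcibly terminated.
 #include <chrono>
 #include <optional>
 #include <string>
 #include <vector>

 struct Resource {
   std::string name;   // e.g. "cpus", "mem".
   double amount;
 };

 struct InverseOffer {
   std::string slaveId;                           // Where resources are wanted back.
   std::vector<Resource> resources;               // What is being reclaimed.
   std::optional<std::chrono::seconds> deadline;  // Time allowed to relocate tasks.
 };
 {code}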



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.

2014-09-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1717:
---
Sprint: Q3 Sprint 4  (was: Q3 Sprint 4, Mesos Q3 Sprint 5)

 The slave does not show pending tasks in the JSON endpoints.
 

 Key: MESOS-1717
 URL: https://issues.apache.org/jira/browse/MESOS-1717
 Project: Mesos
  Issue Type: Bug
  Components: json api, slave
Reporter: Benjamin Mahler

 The slave does not show pending tasks in the /state.json endpoint.
 This is a bit tricky to add since we rely on knowing the executor directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1786) FaultToleranceTest.ReconcilePendingTasks is flaky.

2014-09-10 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1786:
--

 Summary: FaultToleranceTest.ReconcilePendingTasks is flaky.
 Key: MESOS-1786
 URL: https://issues.apache.org/jira/browse/MESOS-1786
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


{noformat}
[ RUN  ] FaultToleranceTest.ReconcilePendingTasks
Using temporary directory '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm'
I0910 20:18:02.308562 21634 leveldb.cpp:176] Opened db in 28.520372ms
I0910 20:18:02.315268 21634 leveldb.cpp:183] Compacted db in 6.37495ms
I0910 20:18:02.315588 21634 leveldb.cpp:198] Created db iterator in 6338ns
I0910 20:18:02.315745 21634 leveldb.cpp:204] Seeked to beginning of db in 1781ns
I0910 20:18:02.315901 21634 leveldb.cpp:273] Iterated through 0 keys in the db 
in 537ns
I0910 20:18:02.316076 21634 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0910 20:18:02.316524 21654 recover.cpp:425] Starting replica recovery
I0910 20:18:02.316800 21654 recover.cpp:451] Replica is in EMPTY status
I0910 20:18:02.317245 21654 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0910 20:18:02.317445 21654 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0910 20:18:02.317672 21654 recover.cpp:542] Updating replica status to STARTING
I0910 20:18:02.321723 21652 master.cpp:286] Master 
20140910-201802-16842879-60361-21634 (precise) started on 127.0.1.1:60361
I0910 20:18:02.322041 21652 master.cpp:332] Master only allowing authenticated 
frameworks to register
I0910 20:18:02.322320 21652 master.cpp:337] Master only allowing authenticated 
slaves to register
I0910 20:18:02.322568 21652 credentials.hpp:36] Loading credentials for 
authentication from 
'/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm/credentials'
I0910 20:18:02.323031 21652 master.cpp:366] Authorization enabled
I0910 20:18:02.323663 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 5.781277ms
I0910 20:18:02.324074 21654 replica.cpp:320] Persisted replica status to 
STARTING
I0910 20:18:02.324443 21654 recover.cpp:451] Replica is in STARTING status
I0910 20:18:02.325106 21654 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0910 20:18:02.325454 21654 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0910 20:18:02.326408 21654 recover.cpp:542] Updating replica status to VOTING
I0910 20:18:02.323892 21649 hierarchical_allocator_process.hpp:299] 
Initializing hierarchical allocator process with master : master@127.0.1.1:60361
I0910 20:18:02.326120 21652 master.cpp:1212] The newly elected leader is 
master@127.0.1.1:60361 with id 20140910-201802-16842879-60361-21634
I0910 20:18:02.323938 21651 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I0910 20:18:04.209081 21655 hierarchical_allocator_process.hpp:697] No 
resources available to allocate!
I0910 20:18:04.209183 21655 hierarchical_allocator_process.hpp:659] Performed 
allocation for 0 slaves in 118308ns
I0910 20:18:04.209230 21652 master.cpp:1225] Elected as the leading master!
I0910 20:18:04.209246 21652 master.cpp:1043] Recovering from registrar
I0910 20:18:04.209360 21650 registrar.cpp:313] Recovering registrar
I0910 20:18:04.214040 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 1.887284299secs
I0910 20:18:04.214094 21654 replica.cpp:320] Persisted replica status to VOTING
I0910 20:18:04.214190 21654 recover.cpp:556] Successfully joined the Paxos group
I0910 20:18:04.214258 21654 recover.cpp:440] Recover process terminated
I0910 20:18:04.214437 21654 log.cpp:656] Attempting to start the writer
I0910 20:18:04.214756 21654 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0910 20:18:04.223865 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 9.044596ms
I0910 20:18:04.223944 21654 replica.cpp:342] Persisted promised to 1
I0910 20:18:04.229053 21652 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0910 20:18:04.229552 21652 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0910 20:18:04.248437 21652 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 18.839475ms
I0910 20:18:04.248525 21652 replica.cpp:676] Persisted action at 0
I0910 20:18:04.251194 21650 replica.cpp:508] Replica received write request for 
position 0
I0910 20:18:04.251260 21650 leveldb.cpp:438] Reading position from leveldb took 
43213ns
I0910 20:18:04.262251 21650 leveldb.cpp:343] Persisting action (14 bytes) to 
leveldb took 10.949353ms
I0910 20:18:04.262346 21650 replica.cpp:676] Persisted action at 0
I0910 20:18:04.262717 21650 replica.cpp:655] Replica received learned notice 
for position 0
I0910 20:18:04.271878 21650 

[jira] [Updated] (MESOS-1786) FaultToleranceTest.ReconcilePendingTasks is flaky.

2014-09-10 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1786:
---
Sprint: Mesos Q3 Sprint 5

 FaultToleranceTest.ReconcilePendingTasks is flaky.
 --

 Key: MESOS-1786
 URL: https://issues.apache.org/jira/browse/MESOS-1786
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 {noformat}
 [ RUN  ] FaultToleranceTest.ReconcilePendingTasks
 Using temporary directory 
 '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm'
 I0910 20:18:02.308562 21634 leveldb.cpp:176] Opened db in 28.520372ms
 I0910 20:18:02.315268 21634 leveldb.cpp:183] Compacted db in 6.37495ms
 I0910 20:18:02.315588 21634 leveldb.cpp:198] Created db iterator in 6338ns
 I0910 20:18:02.315745 21634 leveldb.cpp:204] Seeked to beginning of db in 
 1781ns
 I0910 20:18:02.315901 21634 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 537ns
 I0910 20:18:02.316076 21634 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0910 20:18:02.316524 21654 recover.cpp:425] Starting replica recovery
 I0910 20:18:02.316800 21654 recover.cpp:451] Replica is in EMPTY status
 I0910 20:18:02.317245 21654 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0910 20:18:02.317445 21654 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0910 20:18:02.317672 21654 recover.cpp:542] Updating replica status to 
 STARTING
 I0910 20:18:02.321723 21652 master.cpp:286] Master 
 20140910-201802-16842879-60361-21634 (precise) started on 127.0.1.1:60361
 I0910 20:18:02.322041 21652 master.cpp:332] Master only allowing 
 authenticated frameworks to register
 I0910 20:18:02.322320 21652 master.cpp:337] Master only allowing 
 authenticated slaves to register
 I0910 20:18:02.322568 21652 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm/credentials'
 I0910 20:18:02.323031 21652 master.cpp:366] Authorization enabled
 I0910 20:18:02.323663 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 5.781277ms
 I0910 20:18:02.324074 21654 replica.cpp:320] Persisted replica status to 
 STARTING
 I0910 20:18:02.324443 21654 recover.cpp:451] Replica is in STARTING status
 I0910 20:18:02.325106 21654 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0910 20:18:02.325454 21654 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0910 20:18:02.326408 21654 recover.cpp:542] Updating replica status to VOTING
 I0910 20:18:02.323892 21649 hierarchical_allocator_process.hpp:299] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:60361
 I0910 20:18:02.326120 21652 master.cpp:1212] The newly elected leader is 
 master@127.0.1.1:60361 with id 20140910-201802-16842879-60361-21634
 I0910 20:18:02.323938 21651 master.cpp:120] No whitelist given. Advertising 
 offers for all slaves
 I0910 20:18:04.209081 21655 hierarchical_allocator_process.hpp:697] No 
 resources available to allocate!
 I0910 20:18:04.209183 21655 hierarchical_allocator_process.hpp:659] Performed 
 allocation for 0 slaves in 118308ns
 I0910 20:18:04.209230 21652 master.cpp:1225] Elected as the leading master!
 I0910 20:18:04.209246 21652 master.cpp:1043] Recovering from registrar
 I0910 20:18:04.209360 21650 registrar.cpp:313] Recovering registrar
 I0910 20:18:04.214040 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 1.887284299secs
 I0910 20:18:04.214094 21654 replica.cpp:320] Persisted replica status to 
 VOTING
 I0910 20:18:04.214190 21654 recover.cpp:556] Successfully joined the Paxos 
 group
 I0910 20:18:04.214258 21654 recover.cpp:440] Recover process terminated
 I0910 20:18:04.214437 21654 log.cpp:656] Attempting to start the writer
 I0910 20:18:04.214756 21654 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I0910 20:18:04.223865 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 9.044596ms
 I0910 20:18:04.223944 21654 replica.cpp:342] Persisted promised to 1
 I0910 20:18:04.229053 21652 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0910 20:18:04.229552 21652 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I0910 20:18:04.248437 21652 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 18.839475ms
 I0910 20:18:04.248525 21652 replica.cpp:676] Persisted action at 0
 I0910 20:18:04.251194 21650 replica.cpp:508] Replica received write request 
 for position 0
 I0910 20:18:04.251260 21650 leveldb.cpp:438] Reading position from leveldb 
 took 43213ns
 I0910 20:18:04.262251 21650 

[jira] [Updated] (MESOS-1696) Improve reconciliation between master and slave.

2014-09-11 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1696:
---
Description: 
As we update the Master to keep tasks in memory until they are both terminal 
and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as 
follows:

{code}
Master   Slave
 {}   {}
{Tn}  {}  // Master receives Task T, non-terminal. Forwards to slave.
{Tn} {Tn} // Slave receives Task T, non-terminal.
{Tn} {Tt} // Task becomes terminal on slave. Update forwarded.
{Tt} {Tt} // Master receives update, forwards to framework.
 {}  {Tt} // Master receives ack, forwards to slave.
 {}   {}  // Slave receives ack.
{code}

In the current form of reconciliation, the slave sends to the master all tasks 
that are not both terminal and acknowledged. At any point in the above 
lifecycle, the slave's re-registration message can reach the master.

Note the following properties:

*(1)* The master may have a non-terminal task, not present in the slave's 
re-registration message.
*(2)* The master may have a non-terminal task, present in the slave's 
re-registration message but in a different state.
*(3)* The slave's re-registration message may contain a terminal unacknowledged 
task unknown to the master.

In the current master / slave 
[reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146]
 code, the master assumes that case (1) is because a launch task message was 
dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when 
the task reaches the slave correctly, so this can lead to inconsistency!

After chatting with [~vinodkone], we're considering updating the reconciliation 
to occur as follows:


→ Slave sends all tasks that are not both terminal and acknowledged, during 
re-registration. This is the same as before.

→ If the master sees tasks that are missing on the slave, the master sends the 
tasks that need to be reconciled to the slave. This can be piggy-backed on the 
re-registration message.

→ The slave will send TASK_LOST if the task is not known to it, preferably in 
a retried manner, unless we update socket closure on the slave to force a 
re-registration.
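
A minimal sketch of the master-side check in the flow above (hypothetical 
types, not the actual master code): tasks the master knows about but the 
slave did not report during re-registration are sent back to the slave to be 
reconciled, rather than being declared TASK_LOST outright.

{code}
// Sketch only: compute the tasks the master should ask the slave to
// reconcile after re-registration. TaskID is a stand-in type.
#include <string>
#include <unordered_set>
#include <vector>

using TaskID = std::string;

std::vector<TaskID> tasksToReconcile(
    const std::unordered_set<TaskID>& knownToMaster,
    const std::unordered_set<TaskID>& reportedBySlave)
{
  std::vector<TaskID> missing;
  for (const TaskID& task : knownToMaster) {
    if (reportedBySlave.count(task) == 0) {
      // Not reported by the slave: ask the slave about it instead of
      // assuming the launch message was dropped.
      missing.push_back(task);
    }
  }
  return missing;
}
{code}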

  was:
As we update the Master to keep tasks in memory until they are both terminal 
and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as 
follows:

{code}
Master   Slave
 {}   {}
{Tn}  {}  // Master receives Task T, non-terminal. Forwards to slave.
{Tn} {Tn} // Slave receives Task T, non-terminal.
{Tn} {Tt} // Task becomes terminal on slave. Update forwarded.
{Tt} {Tt} // Master receives update, forwards to framework.
 {}  {Tt} // Master receives ack, forwards to slave.
 {}   {}  // Slave receives ack.
{code}

In the current form of reconciliation, the slave sends to the master all tasks 
that are not both terminal and acknowledged. At any point in the above 
lifecycle, the slave's re-registration message can reach the master.

Note the following properties:

*(1)* The master may have a non-terminal task, not present in the slave's 
re-registration message.
*(2)* The master may have a non-terminal task, present in the slave's 
re-registration message but in a different state.
*(3)* The slave's re-registration message may contain a terminal unacknowledged 
task unknown to the master.

In the current master / slave 
[reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146]
 code, the master assumes that case (1) is because a launch task message was 
dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when 
the task reaches the slave correctly, so this can lead to inconsistency!

After chatting with [~vinodkone], we're considering updating the reconciliation 
to occur as follows:


→ Slave sends all tasks that are not both terminal and acknowledged, during 
re-registration. This is the same as before.

→ If the master sees tasks that are missing in the slave, the master sends a 
reconcile message to the slave for the tasks.

→ The slave will reply to reconcile messages with the latest state, or 
TASK_LOST if the task is not known to it. Preferably in a retried manner, 
unless we update socket closure on the slave to force a re-registration.


 Improve reconciliation between master and slave.
 

 Key: MESOS-1696
 URL: https://issues.apache.org/jira/browse/MESOS-1696
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler
Assignee: Vinod Kone

 As we update the Master to keep tasks in memory until they are both terminal 
 and acknowledged 

[jira] [Commented] (MESOS-1410) Keep terminal unacknowledged tasks in the master's state.

2014-09-11 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131014#comment-14131014
 ] 

Benjamin Mahler commented on MESOS-1410:


https://reviews.apache.org/r/25565/
https://reviews.apache.org/r/25566/
https://reviews.apache.org/r/25567/
https://reviews.apache.org/r/25568/

 Keep terminal unacknowledged tasks in the master's state.
 -

 Key: MESOS-1410
 URL: https://issues.apache.org/jira/browse/MESOS-1410
 Project: Mesos
  Issue Type: Task
Affects Versions: 0.19.0
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
 Fix For: 0.21.0


 Once we are sending acknowledgments through the master as per MESOS-1409, we 
 need to keep terminal tasks that are *unacknowledged* in the Master's memory.
 This will allow us to identify these tasks to frameworks when we haven't yet 
 forwarded them an update. Without this, we're susceptible to MESOS-1389.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1783) MasterTest.LaunchDuplicateOfferTest is flaky

2014-09-12 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132235#comment-14132235
 ] 

Benjamin Mahler commented on MESOS-1783:


{noformat}
commit d6c1ef6842b70af068ba14896693266ed6067724
Author: Niklas Nielsen n...@qni.dk
Date:   Fri Sep 12 14:40:54 2014 -0700

Fixed flaky MasterTest.LaunchDuplicateOfferTest.

A couple of races could occur in the launch tasks on multiple offers
tests where recovered resources from purposely-failed invocations turned
into a subsequent resource offer and oversaturated the expect's.

Review: https://reviews.apache.org/r/25588
{noformat}

 MasterTest.LaunchDuplicateOfferTest is flaky
 

 Key: MESOS-1783
 URL: https://issues.apache.org/jira/browse/MESOS-1783
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.20.0
 Environment: ubuntu-14.04-gcc Jenkins VM
Reporter: Yan Xu
Assignee: Niklas Quarfot Nielsen
 Fix For: 0.21.0


 {noformat:title=}
 [ RUN  ] MasterTest.LaunchDuplicateOfferTest
 Using temporary directory '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg'
 I0909 22:46:59.212977 21883 leveldb.cpp:176] Opened db in 20.307533ms
 I0909 22:46:59.219717 21883 leveldb.cpp:183] Compacted db in 6.470397ms
 I0909 22:46:59.219925 21883 leveldb.cpp:198] Created db iterator in 5571ns
 I0909 22:46:59.220100 21883 leveldb.cpp:204] Seeked to beginning of db in 
 1365ns
 I0909 22:46:59.220268 21883 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 658ns
 I0909 22:46:59.220448 21883 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0909 22:46:59.220855 21903 recover.cpp:425] Starting replica recovery
 I0909 22:46:59.221103 21903 recover.cpp:451] Replica is in EMPTY status
 I0909 22:46:59.221626 21903 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0909 22:46:59.221914 21903 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0909 22:46:59.04 21903 recover.cpp:542] Updating replica status to 
 STARTING
 I0909 22:46:59.232590 21900 master.cpp:286] Master 
 20140909-224659-16842879-44263-21883 (trusty) started on 127.0.1.1:44263
 I0909 22:46:59.233278 21900 master.cpp:332] Master only allowing 
 authenticated frameworks to register
 I0909 22:46:59.233543 21900 master.cpp:337] Master only allowing 
 authenticated slaves to register
 I0909 22:46:59.233934 21900 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg/credentials'
 I0909 22:46:59.236431 21900 master.cpp:366] Authorization enabled
 I0909 22:46:59.237522 21898 hierarchical_allocator_process.hpp:299] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:44263
 I0909 22:46:59.237877 21904 master.cpp:120] No whitelist given. Advertising 
 offers for all slaves
 I0909 22:46:59.238723 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 16.245391ms
 I0909 22:46:59.238916 21903 replica.cpp:320] Persisted replica status to 
 STARTING
 I0909 22:46:59.239203 21903 recover.cpp:451] Replica is in STARTING status
 I0909 22:46:59.239724 21903 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0909 22:46:59.239967 21903 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0909 22:46:59.240304 21903 recover.cpp:542] Updating replica status to VOTING
 I0909 22:46:59.240684 21900 master.cpp:1212] The newly elected leader is 
 master@127.0.1.1:44263 with id 20140909-224659-16842879-44263-21883
 I0909 22:46:59.240846 21900 master.cpp:1225] Elected as the leading master!
 I0909 22:46:59.241149 21900 master.cpp:1043] Recovering from registrar
 I0909 22:46:59.241509 21898 registrar.cpp:313] Recovering registrar
 I0909 22:46:59.248440 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 7.864221ms
 I0909 22:46:59.248644 21903 replica.cpp:320] Persisted replica status to 
 VOTING
 I0909 22:46:59.248846 21903 recover.cpp:556] Successfully joined the Paxos 
 group
 I0909 22:46:59.249330 21897 log.cpp:656] Attempting to start the writer
 I0909 22:46:59.249809 21897 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I0909 22:46:59.250075 21903 recover.cpp:440] Recover process terminated
 I0909 22:46:59.258286 21897 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 8.292514ms
 I0909 22:46:59.258489 21897 replica.cpp:342] Persisted promised to 1
 I0909 22:46:59.258848 21897 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0909 22:46:59.259454 21897 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I0909 22:46:59.267755 21897 

[jira] [Created] (MESOS-1789) MasterTest.RecoveredSlaveReregisters is flaky.

2014-09-12 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1789:
--

 Summary: MasterTest.RecoveredSlaveReregisters is flaky.
 Key: MESOS-1789
 URL: https://issues.apache.org/jira/browse/MESOS-1789
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Priority: Minor


Seen flaky on a fedora 19 VM w/ clang.

{noformat}
[ RUN  ] MasterTest.RecoveredSlaveReregisters
Using temporary directory '/tmp/MasterTest_RecoveredSlaveReregisters_CHREru'
I0910 23:37:24.522372 22914 leveldb.cpp:176] Opened db in 978us
I0910 23:37:24.522948 22914 leveldb.cpp:183] Compacted db in 554320ns
I0910 23:37:24.522981 22914 leveldb.cpp:198] Created db iterator in 15459ns
I0910 23:37:24.523000 22914 leveldb.cpp:204] Seeked to beginning of db in 9593ns
I0910 23:37:24.523020 22914 leveldb.cpp:273] Iterated through 0 keys in the db 
in 9137ns
I0910 23:37:24.523043 22914 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0910 23:37:24.525143 22935 recover.cpp:425] Starting replica recovery
I0910 23:37:24.525266 22935 recover.cpp:451] Replica is in EMPTY status
I0910 23:37:24.525774 22935 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0910 23:37:24.525871 22935 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0910 23:37:24.526028 22935 recover.cpp:542] Updating replica status to STARTING
I0910 23:37:24.526180 22935 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 83617ns
I0910 23:37:24.526211 22935 replica.cpp:320] Persisted replica status to 
STARTING
I0910 23:37:24.526283 22935 recover.cpp:451] Replica is in STARTING status
I0910 23:37:24.526725 22935 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0910 23:37:24.526813 22935 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0910 23:37:24.526964 22935 recover.cpp:542] Updating replica status to VOTING
I0910 23:37:24.527061 22935 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 44802ns
I0910 23:37:24.527091 22935 replica.cpp:320] Persisted replica status to VOTING
I0910 23:37:24.527139 22935 recover.cpp:556] Successfully joined the Paxos group
I0910 23:37:24.527230 22935 recover.cpp:440] Recover process terminated
I0910 23:37:24.527748 22928 master.cpp:286] Master 
20140910-233724-2272962752-36006-22914 (fedora-19) started on 
192.168.122.135:36006
I0910 23:37:24.527807 22928 master.cpp:332] Master only allowing authenticated 
frameworks to register
I0910 23:37:24.527827 22928 master.cpp:337] Master only allowing authenticated 
slaves to register
I0910 23:37:24.527849 22928 credentials.hpp:36] Loading credentials for 
authentication from 
'/tmp/MasterTest_RecoveredSlaveReregisters_CHREru/credentials'
I0910 23:37:24.528890 22928 master.cpp:366] Authorization enabled
I0910 23:37:24.529822 22928 hierarchical_allocator_process.hpp:299] 
Initializing hierarchical allocator process with master : 
master@192.168.122.135:36006
I0910 23:37:24.529903 22928 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I0910 23:37:24.530275 22928 master.cpp:1212] The newly elected leader is 
master@192.168.122.135:36006 with id 20140910-233724-2272962752-36006-22914
I0910 23:37:24.530311 22928 master.cpp:1225] Elected as the leading master!
I0910 23:37:24.530328 22928 master.cpp:1043] Recovering from registrar
I0910 23:37:24.530426 22928 registrar.cpp:313] Recovering registrar
I0910 23:37:24.530993 22928 log.cpp:656] Attempting to start the writer
I0910 23:37:24.531601 22928 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0910 23:37:24.531677 22928 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 60319ns
I0910 23:37:24.531707 22928 replica.cpp:342] Persisted promised to 1
I0910 23:37:24.532016 22928 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0910 23:37:24.532691 22928 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0910 23:37:24.532752 22928 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 45735ns
I0910 23:37:24.532783 22928 replica.cpp:676] Persisted action at 0
I0910 23:37:24.533252 22928 replica.cpp:508] Replica received write request for 
position 0
I0910 23:37:24.533299 22928 leveldb.cpp:438] Reading position from leveldb took 
34066ns
I0910 23:37:24.533354 22928 leveldb.cpp:343] Persisting action (14 bytes) to 
leveldb took 37637ns
I0910 23:37:24.533381 22928 replica.cpp:676] Persisted action at 0
I0910 23:37:24.533701 22928 replica.cpp:655] Replica received learned notice 
for position 0
I0910 23:37:24.533757 22928 leveldb.cpp:343] Persisting action (16 bytes) to 
leveldb took 42842ns
I0910 23:37:24.533785 22928 replica.cpp:676] Persisted action at 0
I0910 23:37:24.533804 22928 replica.cpp:661] Replica learned NOP action at 

[jira] [Updated] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.

2014-09-12 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1653:
---
Fix Version/s: (was: 0.20.0)

 HealthCheckTest.GracePeriod is flaky.
 -

 Key: MESOS-1653
 URL: https://issues.apache.org/jira/browse/MESOS-1653
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Timothy Chen

 {noformat}
 [--] 3 tests from HealthCheckTest
 [ RUN  ] HealthCheckTest.GracePeriod
 Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr'
 I0729 17:10:10.484951  1176 leveldb.cpp:176] Opened db in 28.883552ms
 I0729 17:10:10.499487  1176 leveldb.cpp:183] Compacted db in 13.674118ms
 I0729 17:10:10.500200  1176 leveldb.cpp:198] Created db iterator in 7394ns
 I0729 17:10:10.500692  1176 leveldb.cpp:204] Seeked to beginning of db in 
 2317ns
 I0729 17:10:10.501113  1176 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 1367ns
 I0729 17:10:10.501535  1176 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0729 17:10:10.502233  1212 recover.cpp:425] Starting replica recovery
 I0729 17:10:10.502295  1212 recover.cpp:451] Replica is in EMPTY status
 I0729 17:10:10.502825  1212 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0729 17:10:10.502877  1212 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0729 17:10:10.502980  1212 recover.cpp:542] Updating replica status to 
 STARTING
 I0729 17:10:10.508482  1213 master.cpp:289] Master 
 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701
 I0729 17:10:10.508607  1213 master.cpp:326] Master only allowing 
 authenticated frameworks to register
 I0729 17:10:10.508632  1213 master.cpp:331] Master only allowing 
 authenticated slaves to register
 I0729 17:10:10.508656  1213 credentials.hpp:36] Loading credentials for 
 authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials'
 I0729 17:10:10.509407  1213 master.cpp:360] Authorization enabled
 I0729 17:10:10.510030  1207 hierarchical_allocator_process.hpp:301] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:54701
 I0729 17:10:10.510113  1207 master.cpp:123] No whitelist given. Advertising 
 offers for all slaves
 I0729 17:10:10.511699  1213 master.cpp:1129] The newly elected leader is 
 master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176
 I0729 17:10:10.512230  1213 master.cpp:1142] Elected as the leading master!
 I0729 17:10:10.512692  1213 master.cpp:960] Recovering from registrar
 I0729 17:10:10.513226  1210 registrar.cpp:313] Recovering registrar
 I0729 17:10:10.516006  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 12.946461ms
 I0729 17:10:10.516047  1212 replica.cpp:320] Persisted replica status to 
 STARTING
 I0729 17:10:10.516129  1212 recover.cpp:451] Replica is in STARTING status
 I0729 17:10:10.516520  1212 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0729 17:10:10.516592  1212 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0729 17:10:10.516767  1212 recover.cpp:542] Updating replica status to VOTING
 I0729 17:10:10.528376  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 11.537102ms
 I0729 17:10:10.528430  1212 replica.cpp:320] Persisted replica status to 
 VOTING
 I0729 17:10:10.528501  1212 recover.cpp:556] Successfully joined the Paxos 
 group
 I0729 17:10:10.528565  1212 recover.cpp:440] Recover process terminated
 I0729 17:10:10.528700  1212 log.cpp:656] Attempting to start the writer
 I0729 17:10:10.528960  1212 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I0729 17:10:10.537821  1212 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 8.830863ms
 I0729 17:10:10.537869  1212 replica.cpp:342] Persisted promised to 1
 I0729 17:10:10.540550  1209 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0729 17:10:10.540856  1209 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I0729 17:10:10.547430  1209 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 6.548344ms
 I0729 17:10:10.547471  1209 replica.cpp:676] Persisted action at 0
 I0729 17:10:10.547732  1209 replica.cpp:508] Replica received write request 
 for position 0
 I0729 17:10:10.547765  1209 leveldb.cpp:438] Reading position from leveldb 
 took 15676ns
 I0729 17:10:10.557169  1209 leveldb.cpp:343] Persisting action (14 bytes) to 
 leveldb took 9.373798ms
 I0729 17:10:10.557241  1209 replica.cpp:676] Persisted action at 0
 I0729 17:10:10.560642  1210 replica.cpp:655] Replica received learned notice 
 for position 0
 I0729 

[jira] [Updated] (MESOS-1791) Introduce Master / Offer Resource Reservations

2014-09-14 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1791:
---
Affects Version/s: (was: 0.20.0)

 Introduce Master / Offer Resource Reservations
 --

 Key: MESOS-1791
 URL: https://issues.apache.org/jira/browse/MESOS-1791
 Project: Mesos
  Issue Type: Epic
Reporter: Tom Arnfeld

 Currently Mesos supports the ability to reserve resources (for a given role) 
 on a per-slave basis, as introduced in MESOS-505. This allows you to almost 
 statically partition off a set of resources on a set of machines, to 
 guarantee certain types of frameworks get some resources.
 This is very useful, though it is also very useful to be able to control 
 these reservations through the master (instead of per-slave) for when I don't 
 care which nodes I get on, as long as I get X cpu and Y RAM, or Z sets of 
 (X,Y).
 I'm not sure what structure this could take, but apparently it has already 
 been discussed. Would this be a CLI flag? Could there be a (authenticated) 
 web interface to control these reservations?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1791) Introduce Master / Offer Resource Reservations

2014-09-14 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1791:
---
Epic Name: Offer Reservations

 Introduce Master / Offer Resource Reservations
 --

 Key: MESOS-1791
 URL: https://issues.apache.org/jira/browse/MESOS-1791
 Project: Mesos
  Issue Type: Epic
Reporter: Tom Arnfeld

 Currently Mesos supports the ability to reserve resources (for a given role) 
 on a per-slave basis, as introduced in MESOS-505. This allows you to almost 
 statically partition off a set of resources on a set of machines, to 
 guarantee certain types of frameworks get some resources.
 This is very useful, though it is also very useful to be able to control 
 these reservations through the master (instead of per-slave) for when I don't 
 care which nodes I get on, as long as I get X cpu and Y RAM, or Z sets of 
 (X,Y).
 I'm not sure what structure this could take, but apparently it has already 
 been discussed. Would this be a CLI flag? Could there be a (authenticated) 
 web interface to control these reservations?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1795) Assertion failure in state abstraction crashes JVM

2014-09-15 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134472#comment-14134472
 ] 

Benjamin Mahler commented on MESOS-1795:


Do you understand what transpired?

 Assertion failure in state abstraction crashes JVM
 --

 Key: MESOS-1795
 URL: https://issues.apache.org/jira/browse/MESOS-1795
 Project: Mesos
  Issue Type: Bug
  Components: java api
Affects Versions: 0.20.0
Reporter: Connor Doyle
Assignee: Connor Doyle

 Observed the following log output prior to a crash of the Marathon scheduler:
 Sep 12 23:46:01 highly-available-457-540 marathon[11494]: F0912 
 23:46:01.771927 11532 org_apache_mesos_state_AbstractState.cpp:145] 
 CHECK_READY(*future): is PENDING 
 Sep 12 23:46:01 highly-available-457-540 marathon[11494]: *** Check failure 
 stack trace: ***
 Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 
 0x7febc2663a2d  google::LogMessage::Fail()
 Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 
 0x7febc26657e3  google::LogMessage::SendToLog()
 Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 
 0x7febc2663648  google::LogMessage::Flush()
 Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 
 0x7febc266603e  google::LogMessageFatal::~LogMessageFatal()
 Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 
 0x7febc26588a3  Java_org_apache_mesos_state_AbstractState__1_1fetch_1get
 Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 
 0x7febcd107d98  (unknown)
 Listing 1: Crash log output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1797) Packaged Zookeeper does not compile on OSX Yosemite

2014-09-16 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135714#comment-14135714
 ] 

Benjamin Mahler commented on MESOS-1797:


Is there a ZooKeeper ticket related to this?

 Packaged Zookeeper does not compile on OSX Yosemite
 ---

 Key: MESOS-1797
 URL: https://issues.apache.org/jira/browse/MESOS-1797
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.20.0, 0.21.0, 0.19.1
Reporter: Dario Rexin
Priority: Minor

 I have been struggling with this for some time (due to my lack of knowledge 
 about C compiler error messages) and finally found a way to make it compile. 
 The problem is that Zookeeper defines a function `htonll` that is a builtin 
 function in Yosemite. For me it worked to just remove this function, but as 
 it needs to keep working on other systems as well, we would need some check 
 for the OS version, or for whether the function is already defined.
 Here are the links to the source:
 https://github.com/apache/zookeeper/blob/trunk/src/c/include/recordio.h#L73
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/recordio.c#L83-L97
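 One way to express the suggested check is a preprocessor guard around the 
 definition, assuming the platform that already has htonll provides it as a 
 macro (a sketch, not the actual ZooKeeper patch):
 {code}
 // Sketch only: compile the fallback definition only where the platform does
 // not already provide htonll (OS X 10.10 pulls in a macro definition via
 // <arpa/inet.h>).
 #include <arpa/inet.h>
 #include <cstdint>

 #if !defined(htonll)
 static uint64_t htonll(uint64_t v)
 {
   const int one = 1;
   if (*reinterpret_cast<const char*>(&one) == 1) {
     // Little-endian host: swap the two 32-bit halves, byte-swapping each.
     return (static_cast<uint64_t>(htonl(static_cast<uint32_t>(v))) << 32)
            | htonl(static_cast<uint32_t>(v >> 32));
   }
   return v;  // Big-endian host: already in network byte order.
 }
 #endif
 {code}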



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1799) Reconciliation can send out-of-order updates.

2014-09-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1799:
--

 Summary: Reconciliation can send out-of-order updates.
 Key: MESOS-1799
 URL: https://issues.apache.org/jira/browse/MESOS-1799
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler


When a slave re-registers with the master, it currently sends the latest task 
state for all tasks that are not both terminal and acknowledged.

However, reconciliation assumes that we always have the latest unacknowledged 
state of the task represented in the master.

As a result, out-of-order updates are possible, e.g.

(1) Slave has task T in TASK_FINISHED, with unacknowledged updates: 
[TASK_RUNNING, TASK_FINISHED].
(2) Master fails over.
(3) New master re-registers the slave with T in TASK_FINISHED.
(4) Reconciliation request arrives, master sends TASK_FINISHED.
(5) Slave sends TASK_RUNNING to master, master sends TASK_RUNNING.

I think the fix here is to preserve the task state invariants in the master, 
namely, that the master has the latest unacknowledged state of the task. This 
means when the slave re-registers, it should instead send the latest 
unacknowledged state of each task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1800) The slave does not send pending executors during re-registration.

2014-09-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1800:
--

 Summary: The slave does not send pending executors during 
re-registration.
 Key: MESOS-1800
 URL: https://issues.apache.org/jira/browse/MESOS-1800
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler


In what looks like an oversight, the pending executors in the slave are not 
sent in the re-registration message.

This can lead to under-accounting in the master, causing an overcommit on the 
slave.
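
A small sketch of the accounting implication (hypothetical types): the 
resources the master charges against a re-registered slave should include 
executors that are still pending, not only those already running.

{code}
// Sketch only: total executor resources the master should account for on a
// slave, counting both running and still-pending executors.
#include <vector>

struct ExecutorInfo {
  double cpus;
  double memMB;
};

struct Usage {
  double cpus = 0.0;
  double memMB = 0.0;
};

Usage accountedUsage(const std::vector<ExecutorInfo>& running,
                     const std::vector<ExecutorInfo>& pending)
{
  Usage total;
  for (const ExecutorInfo& executor : running) {
    total.cpus += executor.cpus;
    total.memMB += executor.memMB;
  }
  // Omitting this loop is exactly the under-accounting described above.
  for (const ExecutorInfo& executor : pending) {
    total.cpus += executor.cpus;
    total.memMB += executor.memMB;
  }
  return total;
}
{code}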



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2014-09-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-1466:
--

Assignee: (was: Benjamin Mahler)

 Race between executor exited event and launch task can cause overcommit of 
 resources
 

 Key: MESOS-1466
 URL: https://issues.apache.org/jira/browse/MESOS-1466
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Vinod Kone
  Labels: reliability

 The following sequence of events can cause an overcommit:
 -- Launch task is called for a task whose executor is already running
 -- Executor's resources are not accounted for on the master
 -- Executor exits and the event is enqueued behind launch tasks on the master
 -- Master sends the task to the slave, which needs to commit resources for 
 the task and the (new) executor.
 -- Master processes the executor exited event and re-offers the executor's 
 resources, causing an overcommit of resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.

2014-09-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1802:
--

 Summary: HealthCheckTest.HealthStatusChange is flaky on jenkins.
 Key: MESOS-1802
 URL: https://issues.apache.org/jira/browse/MESOS-1802
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Timothy Chen


https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull

{noformat}
[ RUN  ] HealthCheckTest.HealthStatusChange
Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2'
I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms
I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns
I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns
I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in 642ns
I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the db 
in 343ns
I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery
I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status
I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to STARTING
I0916 22:56:14.036603 21046 master.cpp:286] Master 
20140916-225614-3125920579-47865-21026 (penates.apache.org) started on 
67.195.81.186:47865
I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing authenticated 
frameworks to register
I0916 22:56:14.036648 21046 master.cpp:337] Master only allowing authenticated 
slaves to register
I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for 
authentication from '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials'
I0916 22:56:14.036686 21045 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 480322ns
I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to 
STARTING
I0916 22:56:14.036769 21046 master.cpp:366] Authorization enabled
I0916 22:56:14.036826 21045 recover.cpp:451] Replica is in STARTING status
I0916 22:56:14.036944 21052 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I0916 22:56:14.036968 21049 hierarchical_allocator_process.hpp:299] 
Initializing hierarchical allocator process with master : 
master@67.195.81.186:47865
I0916 22:56:14.037284 21054 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0916 22:56:14.037312 21046 master.cpp:1212] The newly elected leader is 
master@67.195.81.186:47865 with id 20140916-225614-3125920579-47865-21026
I0916 22:56:14.037333 21046 master.cpp:1225] Elected as the leading master!
I0916 22:56:14.037345 21046 master.cpp:1043] Recovering from registrar
I0916 22:56:14.037504 21040 registrar.cpp:313] Recovering registrar
I0916 22:56:14.037505 21053 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0916 22:56:14.037681 21047 recover.cpp:542] Updating replica status to VOTING
I0916 22:56:14.038072 21052 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 330251ns
I0916 22:56:14.038087 21052 replica.cpp:320] Persisted replica status to VOTING
I0916 22:56:14.038127 21053 recover.cpp:556] Successfully joined the Paxos group
I0916 22:56:14.038202 21053 recover.cpp:440] Recover process terminated
I0916 22:56:14.038364 21048 log.cpp:656] Attempting to start the writer
I0916 22:56:14.038812 21053 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0916 22:56:14.038925 21053 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 92623ns
I0916 22:56:14.038944 21053 replica.cpp:342] Persisted promised to 1
I0916 22:56:14.039201 21052 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0916 22:56:14.039676 21047 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0916 22:56:14.039836 21047 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 144215ns
I0916 22:56:14.039850 21047 replica.cpp:676] Persisted action at 0
I0916 22:56:14.040243 21047 replica.cpp:508] Replica received write request for 
position 0
I0916 22:56:14.040267 21047 leveldb.cpp:438] Reading position from leveldb took 
10323ns
I0916 22:56:14.040362 21047 leveldb.cpp:343] Persisting action (14 bytes) to 
leveldb took 79471ns
I0916 22:56:14.040375 21047 replica.cpp:676] Persisted action at 0
I0916 22:56:14.040556 21054 replica.cpp:655] Replica received learned notice 
for position 0
I0916 22:56:14.040658 21054 leveldb.cpp:343] Persisting action (16 bytes) to 
leveldb took 83975ns
I0916 22:56:14.040676 21054 replica.cpp:676] 

[jira] [Created] (MESOS-1803) Strict/RegistrarTest.remove test is flaky on jenkins.

2014-09-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1803:
--

 Summary: Strict/RegistrarTest.remove test is flaky on jenkins.
 Key: MESOS-1803
 URL: https://issues.apache.org/jira/browse/MESOS-1803
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull

{noformat}
[ RUN  ] Strict/RegistrarTest.remove/1
Using temporary directory '/tmp/Strict_RegistrarTest_remove_1_3QvnOW'
I0916 22:59:02.112568 21026 leveldb.cpp:176] Opened db in 1.779835ms
I0916 22:59:02.112896 21026 leveldb.cpp:183] Compacted db in 301862ns
I0916 22:59:02.112916 21026 leveldb.cpp:198] Created db iterator in 3065ns
I0916 22:59:02.112926 21026 leveldb.cpp:204] Seeked to beginning of db in 475ns
I0916 22:59:02.112936 21026 leveldb.cpp:273] Iterated through 0 keys in the db 
in 330ns
I0916 22:59:02.112951 21026 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0916 22:59:02.113654 21054 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 421460ns
I0916 22:59:02.113674 21054 replica.cpp:320] Persisted replica status to VOTING
I0916 22:59:02.115900 21026 leveldb.cpp:176] Opened db in 1.947919ms
I0916 22:59:02.116263 21026 leveldb.cpp:183] Compacted db in 338043ns
I0916 22:59:02.116283 21026 leveldb.cpp:198] Created db iterator in 2809ns
I0916 22:59:02.116293 21026 leveldb.cpp:204] Seeked to beginning of db in 468ns
I0916 22:59:02.116302 21026 leveldb.cpp:273] Iterated through 0 keys in the db 
in 195ns
I0916 22:59:02.116317 21026 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0916 22:59:02.117013 21043 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 472891ns
I0916 22:59:02.117034 21043 replica.cpp:320] Persisted replica status to VOTING
I0916 22:59:02.119240 21026 leveldb.cpp:176] Opened db in 1.950367ms
I0916 22:59:02.120455 21026 leveldb.cpp:183] Compacted db in 1.188056ms
I0916 22:59:02.120481 21026 leveldb.cpp:198] Created db iterator in 4370ns
I0916 22:59:02.120499 21026 leveldb.cpp:204] Seeked to beginning of db in 7977ns
I0916 22:59:02.120517 21026 leveldb.cpp:273] Iterated through 1 keys in the db 
in 8479ns
I0916 22:59:02.120533 21026 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0916 22:59:02.122890 21026 leveldb.cpp:176] Opened db in 2.301327ms
I0916 22:59:02.124325 21026 leveldb.cpp:183] Compacted db in 1.406223ms
I0916 22:59:02.124351 21026 leveldb.cpp:198] Created db iterator in 4185ns
I0916 22:59:02.124368 21026 leveldb.cpp:204] Seeked to beginning of db in 7167ns
I0916 22:59:02.124387 21026 leveldb.cpp:273] Iterated through 1 keys in the db 
in 8182ns
I0916 22:59:02.124403 21026 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0916 22:59:02.124579 21047 recover.cpp:425] Starting replica recovery
I0916 22:59:02.124651 21047 recover.cpp:451] Replica is in VOTING status
I0916 22:59:02.124793 21047 recover.cpp:440] Recover process terminated
I0916 22:59:02.126404 21046 registrar.cpp:313] Recovering registrar
I0916 22:59:02.126597 21050 log.cpp:656] Attempting to start the writer
I0916 22:59:02.127259 21041 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0916 22:59:02.127321 21050 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0916 22:59:02.127835 21041 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 547018ns
I0916 22:59:02.127858 21041 replica.cpp:342] Persisted promised to 1
I0916 22:59:02.127835 21050 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 487588ns
I0916 22:59:02.127887 21050 replica.cpp:342] Persisted promised to 1
I0916 22:59:02.128387 21055 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I0916 22:59:02.129546 21042 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0916 22:59:02.129600 21053 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0916 22:59:02.129982 21042 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 406954ns
I0916 22:59:02.129982 21053 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 357253ns
I0916 22:59:02.130009 21042 replica.cpp:676] Persisted action at 0
I0916 22:59:02.130029 21053 replica.cpp:676] Persisted action at 0
I0916 22:59:02.130543 21041 replica.cpp:508] Replica received write request for 
position 0
I0916 22:59:02.130585 21041 leveldb.cpp:438] Reading position from leveldb took 
17424ns
I0916 22:59:02.130599 21046 replica.cpp:508] Replica received write request for 
position 0
I0916 22:59:02.130635 21046 leveldb.cpp:438] Reading position from leveldb took 
12702ns
I0916 22:59:02.130728 

[jira] [Resolved] (MESOS-1803) Strict/RegistrarTest.remove test is flaky on jenkins.

2014-09-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-1803.

Resolution: Cannot Reproduce

The log timings here look as if the threads were starved of CPU:

{noformat}
I0916 22:59:02.136256 21049 leveldb.cpp:343] Persisting action (165 bytes) to 
leveldb took 141908ns
I0916 22:59:02.136267 21047 leveldb.cpp:343] Persisting action (165 bytes) to 
leveldb took 111061ns
I../../src/tests/registrar_tests.cpp:257: Failure
0916 22:59:02.136276 21049 replica.cpp:676] Persisted action at 1
Failed to wait 10secs for registrar.recover(master)
I0916 22:59:14.265326 21049 replica.cpp:661] Replica learned APPEND action at 
position 1
I0916 22:59:02.136291 21047 replica.cpp:676] Persisted action at 1
E0916 22:59:07.135143 21046 registrar.cpp:500] Registrar aborting: Failed to 
update 'registry': Failed to perform store within 5secs
I0916 22:59:14.265393 21047 replica.cpp:661] Replica learned APPEND action at 
position 1
{noformat}

The logging time stamp is determined at the beginning of the LOG(INFO) 
expression, when the initial LogMessage object is created. The interleaving of 
times looks to be a stall of the VM or thread starvation:

{noformat}
22:59:02.136267 21047 // Thread 1, 1st LogMessage flushed.
22:59:02.136276 21049 // Thread 2, 2nd LogMessage flushed.
22:59:14.265326 21049 // Thread 2, 5th LogMessage flushed.
22:59:02.136291 21047 // Thread 1, 3rd LogMessage flushed.
22:59:07.135143 21046 // Thread 3, 4th LogMessage flushed.
22:59:14.265393 21047 // Thread 1, 6th LogMessage flushed.
{noformat}

 Strict/RegistrarTest.remove test is flaky on jenkins.
 -

 Key: MESOS-1803
 URL: https://issues.apache.org/jira/browse/MESOS-1803
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull
 {noformat}
 [ RUN  ] Strict/RegistrarTest.remove/1
 Using temporary directory '/tmp/Strict_RegistrarTest_remove_1_3QvnOW'
 I0916 22:59:02.112568 21026 leveldb.cpp:176] Opened db in 1.779835ms
 I0916 22:59:02.112896 21026 leveldb.cpp:183] Compacted db in 301862ns
 I0916 22:59:02.112916 21026 leveldb.cpp:198] Created db iterator in 3065ns
 I0916 22:59:02.112926 21026 leveldb.cpp:204] Seeked to beginning of db in 
 475ns
 I0916 22:59:02.112936 21026 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 330ns
 I0916 22:59:02.112951 21026 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0916 22:59:02.113654 21054 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 421460ns
 I0916 22:59:02.113674 21054 replica.cpp:320] Persisted replica status to 
 VOTING
 I0916 22:59:02.115900 21026 leveldb.cpp:176] Opened db in 1.947919ms
 I0916 22:59:02.116263 21026 leveldb.cpp:183] Compacted db in 338043ns
 I0916 22:59:02.116283 21026 leveldb.cpp:198] Created db iterator in 2809ns
 I0916 22:59:02.116293 21026 leveldb.cpp:204] Seeked to beginning of db in 
 468ns
 I0916 22:59:02.116302 21026 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 195ns
 I0916 22:59:02.116317 21026 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0916 22:59:02.117013 21043 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 472891ns
 I0916 22:59:02.117034 21043 replica.cpp:320] Persisted replica status to 
 VOTING
 I0916 22:59:02.119240 21026 leveldb.cpp:176] Opened db in 1.950367ms
 I0916 22:59:02.120455 21026 leveldb.cpp:183] Compacted db in 1.188056ms
 I0916 22:59:02.120481 21026 leveldb.cpp:198] Created db iterator in 4370ns
 I0916 22:59:02.120499 21026 leveldb.cpp:204] Seeked to beginning of db in 
 7977ns
 I0916 22:59:02.120517 21026 leveldb.cpp:273] Iterated through 1 keys in the 
 db in 8479ns
 I0916 22:59:02.120533 21026 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0916 22:59:02.122890 21026 leveldb.cpp:176] Opened db in 2.301327ms
 I0916 22:59:02.124325 21026 leveldb.cpp:183] Compacted db in 1.406223ms
 I0916 22:59:02.124351 21026 leveldb.cpp:198] Created db iterator in 4185ns
 I0916 22:59:02.124368 21026 leveldb.cpp:204] Seeked to beginning of db in 
 7167ns
 I0916 22:59:02.124387 21026 leveldb.cpp:273] Iterated through 1 keys in the 
 db in 8182ns
 I0916 22:59:02.124403 21026 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0916 22:59:02.124579 21047 recover.cpp:425] Starting replica recovery
 I0916 22:59:02.124651 21047 recover.cpp:451] Replica is in VOTING status
 I0916 22:59:02.124793 21047 recover.cpp:440] Recover process terminated
 I0916 22:59:02.126404 21046 registrar.cpp:313] Recovering 

[jira] [Updated] (MESOS-1818) AllocatorTest/0.ResourcesUnused sometimes segfaults

2014-09-18 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1818:
---
Sprint: Mesos Q3 Sprint 5

 AllocatorTest/0.ResourcesUnused sometimes segfaults
 ---

 Key: MESOS-1818
 URL: https://issues.apache.org/jira/browse/MESOS-1818
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.21.0
Reporter: Vinod Kone
Assignee: Benjamin Mahler
Priority: Critical

 {code}
 [ RUN  ] AllocatorTest/0.ResourcesUnused
 *** Aborted at 1411088950 (unix time) try date -d @1411088950 if you are 
 using GNU date ***
 PC: @   0x8649a4 mesos::SlaveID::value()
 *** SIGSEGV (@0x2de9) received by PID 20876 (TID 0x7fb63a1c0940) from PID 
 11753; stack trace: ***
 @ 0x7fb643ec4ca0 (unknown)
 @   0x8649a4 mesos::SlaveID::value()
 @   0x8741c7 mesos::hash_value()
 @   0x8f7448 boost::hash::operator()()
 @   0x8e0bed 
 boost::unordered::detail::mix64_policy::apply_hash()
 @ 0x7fb64694c1cf boost::unordered::detail::table::hash()
 @ 0x7fb646973615 boost::unordered::detail::table::find_node()
 @ 0x7fb64694c191 boost::unordered::detail::table_impl::count()
 @ 0x7fb64691f3c1 boost::unordered::unordered_map::count()
 @ 0x7fb6468f4373 hashmap::contains()
 @ 0x7fb6468c5eda mesos::internal::master::Master::getSlave()
 @ 0x7fb6468c0fc3 mesos::internal::master::Master::removeFramework()
 @ 0x7fb6468afa9f 
 mesos::internal::master::Master::unregisterFramework()
 @ 0x7fb646904ab9 ProtobufProcess::handler1()
 @ 0x7fb6469a1e81 
 _ZNSt5_BindIFPFvPN5mesos8internal6master6MasterEMS3_FvRKN7process4UPIDERKNS0_11FrameworkIDEEMNS1_26UnregisterFrameworkMessageEKFSB_vES8_RKSsES4_SD_SG_St12_PlaceholderILi1EESL_ILi26__callIvJS8_SI_EJLm0ELm1ELm2ELm3ELm4T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
 @ 0x7fb646983afe std::_Bind::operator()()
 @ 0x7fb64695f83c std::_Function_handler::_M_invoke()
 @   0xc4e17f std::function::operator()()
 @ 0x7fb6468ebd10 ProtobufProcess::visit()
 @ 0x7fb6468a9892 mesos::internal::master::Master::_visit()
 @ 0x7fb6468a8f46 mesos::internal::master::Master::visit()
 @ 0x7fb6468ce670 process::MessageEvent::visit()
 @   0x86ad54 process::ProcessBase::serve()
 @ 0x7fb6470e9738 process::ProcessManager::resume()
 @ 0x7fb6470dff3f process::schedule()
 @ 0x7fb643ebc83d start_thread
 @ 0x7fb642c2426d clone
 make[3]: *** [check-local] Segmentation fault
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1821) CHECK failure in master.

2014-09-19 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1821:
---
Priority: Blocker  (was: Major)

 CHECK failure in master.
 

 Key: MESOS-1821
 URL: https://issues.apache.org/jira/browse/MESOS-1821
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.21.0
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
Priority: Blocker

 Looks like the recent CHECKs I've added exposed a bug in the framework 
 re-registration logic by which we didn't keep the executors consistent 
 between the Slave and Framework structs:
 {noformat: title=Master Log}
 I0919 18:05:06.915204 28914 master.cpp:3291] Executor aurora.gc of framework 
 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 
 at slave(1)@IP:5051 (HOSTNAME) exited with status 0
 I0919 18:05:06.915271 28914 master.cpp:4430] Removing executor 'aurora.gc' 
 with resources cpus(*):0.19; disk(*):15; mem(*):127 of framework 
 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 
 at slave(1)@IP:5051 (HOSTNAME)
 F0919 18:05:06.915375 28914 master.hpp:1061] Check failed: 
 hasExecutor(slaveId, executorId) Unknown executor aurora.gc of framework 
 201103282247-19- of slave 20140905-173231-1890854154-5050-31333-0
 *** Check failure stack trace: ***
 @ 0x7fd16c81737d  google::LogMessage::Fail()
 @ 0x7fd16c8191c4  google::LogMessage::SendToLog()
 @ 0x7fd16c816f6c  google::LogMessage::Flush()
 @ 0x7fd16c819ab9  google::LogMessageFatal::~LogMessageFatal()
 @ 0x7fd16c34e09b  mesos::internal::master::Framework::removeExecutor()
 @ 0x7fd16c2da2e4  mesos::internal::master::Master::removeExecutor()
 @ 0x7fd16c2e6255  mesos::internal::master::Master::exitedExecutor()
 @ 0x7fd16c348269  ProtobufProcess::handler4()
 @ 0x7fd16c2fc18e  std::_Function_handler::_M_invoke()
 @ 0x7fd16c322132  ProtobufProcess::visit()
 @ 0x7fd16c2cef7a  mesos::internal::master::Master::_visit()
 @ 0x7fd16c2dc3d8  mesos::internal::master::Master::visit()
 @ 0x7fd16c7c2502  process::ProcessManager::resume()
 @ 0x7fd16c7c280c  process::schedule()
 @ 0x7fd16b9c683d  start_thread
 @ 0x7fd16a2b626d  clone
 {noformat}
 This occurs sometime after a failover and indicates that the Slave and 
 Framework structs are not kept in sync.
 Problem seems to be here, when re-registering a framework on a failed over 
 master, we only consider executors for which there are tasks stored in the 
 master:
 {code}
 void Master::_reregisterFramework(
     const UPID& from,
     const FrameworkInfo& frameworkInfo,
     bool failover,
     const Future<Option<Error>>& validationError)
 {
   ...
   if (frameworks.registered.count(frameworkInfo.id()) > 0) {
     ...
   } else {
     // We don't have a framework with this ID, so we must be a newly
     // elected Mesos master to which either an existing scheduler or a
     // failed-over one is connecting. Create a Framework object and add
     // any tasks it has that have been reported by reconnecting slaves.
     Framework* framework =
       new Framework(frameworkInfo, frameworkInfo.id(), from, Clock::now());
     framework->reregisteredTime = Clock::now();

     // TODO(benh): Check for root submissions like above!

     // Add any running tasks reported by slaves for this framework.
     foreachvalue (Slave* slave, slaves.registered) {
       foreachkey (const FrameworkID& frameworkId, slave->tasks) {
         foreachvalue (Task* task, slave->tasks[frameworkId]) {
           if (framework->id == task->framework_id()) {
             framework->addTask(task);

             // Also add the task's executor for resource accounting
             // if it's still alive on the slave and we've not yet
             // added it to the framework.
             if (task->has_executor_id() &&
                 slave->hasExecutor(framework->id, task->executor_id()) &&
                 !framework->hasExecutor(slave->id, task->executor_id())) {
               // XXX: If an executor has no tasks, the executor will not
               // XXX: be added to the Framework struct!
               const ExecutorInfo& executorInfo =
                 slave->executors[framework->id][task->executor_id()];
               framework->addExecutor(slave->id, executorInfo);
             }
           }
         }
       }
     }

     // N.B. Need to add the framework _after_ we add its tasks
     // (above) so that we can properly determine the resources it's
     // currently using!
     addFramework(framework);
   }
 }
 {code}
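
 A hedged sketch of one way to close that gap (illustrative only, not necessarily the patch posted for this issue): after re-adding the framework's tasks, also walk the slave's executor map directly, so executors that currently have no tasks are re-added as well.
 {code}
 // Illustrative sketch only: re-add executors from the slave's executor map
 // so that task-less executors are also accounted for after re-registration.
 foreachvalue (Slave* slave, slaves.registered) {
   if (slave->executors.contains(framework->id)) {
     foreachvalue (const ExecutorInfo& executorInfo,
                   slave->executors[framework->id]) {
       if (!framework->hasExecutor(slave->id, executorInfo.executor_id())) {
         framework->addExecutor(slave->id, executorInfo);
       }
     }
   }
 }
 {code}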



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1821) CHECK failure in master.

2014-09-19 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141186#comment-14141186
 ] 

Benjamin Mahler commented on MESOS-1821:


https://reviews.apache.org/r/25843/

 CHECK failure in master.
 

 Key: MESOS-1821
 URL: https://issues.apache.org/jira/browse/MESOS-1821
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.21.0
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
Priority: Blocker

 Looks like the recent CHECKs I've added exposed a bug in the framework 
 re-registration logic by which we didn't keep the executors consistent 
 between the Slave and Framework structs:
 {noformat: title=Master Log}
 I0919 18:05:06.915204 28914 master.cpp:3291] Executor aurora.gc of framework 
 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 
 at slave(1)@IP:5051 (HOSTNAME) exited with status 0
 I0919 18:05:06.915271 28914 master.cpp:4430] Removing executor 'aurora.gc' 
 with resources cpus(*):0.19; disk(*):15; mem(*):127 of framework 
 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 
 at slave(1)@IP:5051 (HOSTNAME)
 F0919 18:05:06.915375 28914 master.hpp:1061] Check failed: 
 hasExecutor(slaveId, executorId) Unknown executor aurora.gc of framework 
 201103282247-19- of slave 20140905-173231-1890854154-5050-31333-0
 *** Check failure stack trace: ***
 @ 0x7fd16c81737d  google::LogMessage::Fail()
 @ 0x7fd16c8191c4  google::LogMessage::SendToLog()
 @ 0x7fd16c816f6c  google::LogMessage::Flush()
 @ 0x7fd16c819ab9  google::LogMessageFatal::~LogMessageFatal()
 @ 0x7fd16c34e09b  mesos::internal::master::Framework::removeExecutor()
 @ 0x7fd16c2da2e4  mesos::internal::master::Master::removeExecutor()
 @ 0x7fd16c2e6255  mesos::internal::master::Master::exitedExecutor()
 @ 0x7fd16c348269  ProtobufProcess::handler4()
 @ 0x7fd16c2fc18e  std::_Function_handler::_M_invoke()
 @ 0x7fd16c322132  ProtobufProcess::visit()
 @ 0x7fd16c2cef7a  mesos::internal::master::Master::_visit()
 @ 0x7fd16c2dc3d8  mesos::internal::master::Master::visit()
 @ 0x7fd16c7c2502  process::ProcessManager::resume()
 @ 0x7fd16c7c280c  process::schedule()
 @ 0x7fd16b9c683d  start_thread
 @ 0x7fd16a2b626d  clone
 {noformat}
 This occurs sometime after a failover and indicates that the Slave and 
 Framework structs are not kept in sync.
 Problem seems to be here, when re-registering a framework on a failed over 
 master, we only consider executors for which there are tasks stored in the 
 master:
 {code}
 void Master::_reregisterFramework(
     const UPID& from,
     const FrameworkInfo& frameworkInfo,
     bool failover,
     const Future<Option<Error>>& validationError)
 {
   ...
   if (frameworks.registered.count(frameworkInfo.id()) > 0) {
     ...
   } else {
     // We don't have a framework with this ID, so we must be a newly
     // elected Mesos master to which either an existing scheduler or a
     // failed-over one is connecting. Create a Framework object and add
     // any tasks it has that have been reported by reconnecting slaves.
     Framework* framework =
       new Framework(frameworkInfo, frameworkInfo.id(), from, Clock::now());
     framework->reregisteredTime = Clock::now();

     // TODO(benh): Check for root submissions like above!

     // Add any running tasks reported by slaves for this framework.
     foreachvalue (Slave* slave, slaves.registered) {
       foreachkey (const FrameworkID& frameworkId, slave->tasks) {
         foreachvalue (Task* task, slave->tasks[frameworkId]) {
           if (framework->id == task->framework_id()) {
             framework->addTask(task);

             // Also add the task's executor for resource accounting
             // if it's still alive on the slave and we've not yet
             // added it to the framework.
             if (task->has_executor_id() &&
                 slave->hasExecutor(framework->id, task->executor_id()) &&
                 !framework->hasExecutor(slave->id, task->executor_id())) {
               // XXX: If an executor has no tasks, the executor will not
               // XXX: be added to the Framework struct!
               const ExecutorInfo& executorInfo =
                 slave->executors[framework->id][task->executor_id()];
               framework->addExecutor(slave->id, executorInfo);
             }
           }
         }
       }
     }

     // N.B. Need to add the framework _after_ we add its tasks
     // (above) so that we can properly determine the resources it's
     // currently using!
     addFramework(framework);
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1821) CHECK failure in master.

2014-09-19 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1821:
---
Sprint: Mesos Q3 Sprint 5

 CHECK failure in master.
 

 Key: MESOS-1821
 URL: https://issues.apache.org/jira/browse/MESOS-1821
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.21.0
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
Priority: Blocker

 Looks like the recent CHECKs I've added exposed a bug in the framework 
 re-registration logic by which we didn't keep the executors consistent 
 between the Slave and Framework structs:
 {noformat: title=Master Log}
 I0919 18:05:06.915204 28914 master.cpp:3291] Executor aurora.gc of framework 
 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 
 at slave(1)@IP:5051 (HOSTNAME) exited with status 0
 I0919 18:05:06.915271 28914 master.cpp:4430] Removing executor 'aurora.gc' 
 with resources cpus(*):0.19; disk(*):15; mem(*):127 of framework 
 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 
 at slave(1)@IP:5051 (HOSTNAME)
 F0919 18:05:06.915375 28914 master.hpp:1061] Check failed: 
 hasExecutor(slaveId, executorId) Unknown executor aurora.gc of framework 
 201103282247-19- of slave 20140905-173231-1890854154-5050-31333-0
 *** Check failure stack trace: ***
 @ 0x7fd16c81737d  google::LogMessage::Fail()
 @ 0x7fd16c8191c4  google::LogMessage::SendToLog()
 @ 0x7fd16c816f6c  google::LogMessage::Flush()
 @ 0x7fd16c819ab9  google::LogMessageFatal::~LogMessageFatal()
 @ 0x7fd16c34e09b  mesos::internal::master::Framework::removeExecutor()
 @ 0x7fd16c2da2e4  mesos::internal::master::Master::removeExecutor()
 @ 0x7fd16c2e6255  mesos::internal::master::Master::exitedExecutor()
 @ 0x7fd16c348269  ProtobufProcess::handler4()
 @ 0x7fd16c2fc18e  std::_Function_handler::_M_invoke()
 @ 0x7fd16c322132  ProtobufProcess::visit()
 @ 0x7fd16c2cef7a  mesos::internal::master::Master::_visit()
 @ 0x7fd16c2dc3d8  mesos::internal::master::Master::visit()
 @ 0x7fd16c7c2502  process::ProcessManager::resume()
 @ 0x7fd16c7c280c  process::schedule()
 @ 0x7fd16b9c683d  start_thread
 @ 0x7fd16a2b626d  clone
 {noformat}
 This occurs sometime after a failover and indicates that the Slave and 
 Framework structs are not kept in sync.
 Problem seems to be here, when re-registering a framework on a failed over 
 master, we only consider executors for which there are tasks stored in the 
 master:
 {code}
 void Master::_reregisterFramework(
     const UPID& from,
     const FrameworkInfo& frameworkInfo,
     bool failover,
     const Future<Option<Error>>& validationError)
 {
   ...
   if (frameworks.registered.count(frameworkInfo.id()) > 0) {
     ...
   } else {
     // We don't have a framework with this ID, so we must be a newly
     // elected Mesos master to which either an existing scheduler or a
     // failed-over one is connecting. Create a Framework object and add
     // any tasks it has that have been reported by reconnecting slaves.
     Framework* framework =
       new Framework(frameworkInfo, frameworkInfo.id(), from, Clock::now());
     framework->reregisteredTime = Clock::now();

     // TODO(benh): Check for root submissions like above!

     // Add any running tasks reported by slaves for this framework.
     foreachvalue (Slave* slave, slaves.registered) {
       foreachkey (const FrameworkID& frameworkId, slave->tasks) {
         foreachvalue (Task* task, slave->tasks[frameworkId]) {
           if (framework->id == task->framework_id()) {
             framework->addTask(task);

             // Also add the task's executor for resource accounting
             // if it's still alive on the slave and we've not yet
             // added it to the framework.
             if (task->has_executor_id() &&
                 slave->hasExecutor(framework->id, task->executor_id()) &&
                 !framework->hasExecutor(slave->id, task->executor_id())) {
               // XXX: If an executor has no tasks, the executor will not
               // XXX: be added to the Framework struct!
               const ExecutorInfo& executorInfo =
                 slave->executors[framework->id][task->executor_id()];
               framework->addExecutor(slave->id, executorInfo);
             }
           }
         }
       }
     }

     // N.B. Need to add the framework _after_ we add its tasks
     // (above) so that we can properly determine the resources it's
     // currently using!
     addFramework(framework);
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-1461) Add task reconciliation to the Python API.

2014-09-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-1461:
--

Assignee: Niklas Quarfot Nielsen

 Add task reconciliation to the Python API.
 --

 Key: MESOS-1461
 URL: https://issues.apache.org/jira/browse/MESOS-1461
 Project: Mesos
  Issue Type: Task
  Components: python api
Affects Versions: 0.19.0
Reporter: Benjamin Mahler
Assignee: Niklas Quarfot Nielsen

 Looks like the 'reconcileTasks' call was added to the C++ and Java APIs but 
 was never added to the Python API.
 This may be obviated by the lower level API.
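
 For reference, a minimal usage sketch against the existing C++ API that the Python binding would mirror; the task ID and last-known state below are placeholders:
 {code}
 // Sketch of explicit reconciliation via the C++ SchedulerDriver; the task ID
 // and state below are placeholders, and 'driver' is a MesosSchedulerDriver*.
 std::vector<mesos::TaskStatus> statuses;

 mesos::TaskStatus status;
 status.mutable_task_id()->set_value("task-123");  // placeholder task ID
 status.set_state(mesos::TASK_RUNNING);            // last known state
 statuses.push_back(status);

 driver->reconcileTasks(statuses);
 {code}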



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1461) Add task reconciliation to the Python API.

2014-09-25 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148364#comment-14148364
 ] 

Benjamin Mahler commented on MESOS-1461:


{noformat}
commit f1da58fb22b28afd65313f6801b35bce436199ab
Author: Niklas Nielsen n...@qni.dk
Date:   Thu Sep 25 14:12:12 2014 -0700

Added reconcileTasks to python scheduler.

The last step of wiring up reconcileTasks in the python bindings.

Review: https://reviews.apache.org/r/25986
{noformat}

 Add task reconciliation to the Python API.
 --

 Key: MESOS-1461
 URL: https://issues.apache.org/jira/browse/MESOS-1461
 Project: Mesos
  Issue Type: Task
  Components: python api
Affects Versions: 0.19.0
Reporter: Benjamin Mahler
Assignee: Niklas Quarfot Nielsen
 Fix For: 0.21.0


 Looks like the 'reconcileTasks' call was added to the C++ and Java APIs but 
 was never added to the Python API.
 This may be obviated by the lower level API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MESOS-1389) Reconciliation can send TASK_LOST before a terminal update reaches the framework.

2014-09-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler resolved MESOS-1389.

Resolution: Fixed
  Assignee: Benjamin Mahler

Resolved via MESOS-1410.

 Reconciliation can send TASK_LOST before a terminal update reaches the 
 framework.
 -

 Key: MESOS-1389
 URL: https://issues.apache.org/jira/browse/MESOS-1389
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.19.0
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
 Fix For: 0.21.0


 There's an unfortunate case with reconciliation, where we end up sending 
 TASK_LOST first and then the slave sends the valid terminal status update.
 When the slave re-registers with terminal tasks that have un-acked updates, 
 the master does not store these tasks. So while the slave still needs to send 
 the terminal status updates, the master will reply with TASK_LOST for 
 reconciliation.
 We may need to ensure that all status update acknowledgements go through the 
 master to fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-986) Add versioning to messages.

2014-09-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-986:
--
Description: 
Message versioning in Mesos provides a number of benefits:

# Prevent incompatible version combinations. For example, we want to reject 
slaves that are > 1 version behind the master.

# The biggest win is providing the ability to determine behavior based on the 
cross-component versioning. For example, in MESOS-1668 we wanted the master to 
send a different ping message based on the version of the slave. In MESOS-1696, 
we wanted to perform reconciliation in the master based on the version of the 
slave. In both cases, when we don't have the version, we have to either rely on 
hacks/tricks, or add additional phases.

  was:
We currently do not prevent rogue versions of components from communicating 
within the cluster. Adding versioning to our messages will allow us to enforce 
version compatibility.

We would need to figure out the right semantics for each pair of components 
that communicate with each other.

For example, if an old incompatible Slave attempts to register with a Master, 
the Master can either ignore it, or shut it down.

Summary: Add versioning to messages.  (was: Add versioning to messages 
to prevent communication across incompatible versions of components.)

cc [~benjaminhindman]

[~vinodkone] and I chatted about an approach to add versioning at the 
libprocess layer.

We could add the ability for an application to initialize libprocess with the 
application version. This gets encoded via a special libprocess HTTP header. 
{{install}} handlers can optionally receive a {{Version}}, akin to how they can 
optionally receive the sender UPID currently.

 Add versioning to messages.
 ---

 Key: MESOS-986
 URL: https://issues.apache.org/jira/browse/MESOS-986
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler

 Message versioning in Mesos provides a number of benefits:
 # Prevent incompatible version combinations. For example, we want to reject 
 slaves that are > 1 version behind the master.
 # The biggest win is providing the ability to determine behavior based on the 
 cross-component versioning. For example, in MESOS-1668 we wanted the master 
 to send a different ping message based on the version of the slave. In 
 MESOS-1696, we wanted to perform reconciliation in the master based on the 
 version of the slave. In both cases, when we don't have the version, we have 
 to either rely on hacks/tricks, or add additional phases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-986) Add versioning to messages.

2014-09-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-986:
--
Description: 
Message versioning in Mesos provides a number of benefits:

(1) Prevent incompatible version combinations. For example, we want to reject 
slaves that are > 1 version behind the master.

(2) The biggest win is providing the ability to determine behavior based on the 
cross-component versioning. For example, in MESOS-1668 we wanted the master to 
send a different ping message based on the version of the slave. In MESOS-1696, 
we wanted to perform reconciliation in the master based on the version of the 
slave. In both cases, when we don't have the version, we have to either rely on 
hacks/tricks, or add additional phases.

  was:
Message versioning in Mesos provides a number of benefits:

# Prevent incompatible version combinations. For example, we want to reject 
slaves that are > 1 version behind the master.

# The biggest win is providing the ability to determine behavior based on the 
cross-component versioning. For example, in MESOS-1668 we wanted the master to 
send a different ping message based on the version of the slave. In MESOS-1696, 
we wanted to perform reconciliation in the master based on the version of the 
slave. In both cases, when we don't have the version, we have to either rely on 
hacks/tricks, or add additional phases.


 Add versioning to messages.
 ---

 Key: MESOS-986
 URL: https://issues.apache.org/jira/browse/MESOS-986
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler

 Message versioning in Mesos provides a number of benefits:
 (1) Prevent incompatible version combinations. For example, we want to reject 
 slaves that are > 1 version behind the master.
 (2) The biggest win is providing the ability to determine behavior based on 
 the cross-component versioning. For example, in MESOS-1668 we wanted the 
 master to send a different ping message based on the version of the slave. In 
 MESOS-1696, we wanted to perform reconciliation in the master based on the 
 version of the slave. In both cases, when we don't have the version, we have 
 to either rely on hacks/tricks, or add additional phases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1797) Packaged Zookeeper does not compile on OSX Yosemite

2014-09-29 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152601#comment-14152601
 ] 

Benjamin Mahler commented on MESOS-1797:


Awesome job following up with a patch [~tillt]!

I would try to find out when your patch would be part of the next release.
Likely we'll need a .patch in the interim, but let's wait to see your patch 
land in ZooKeeper first so that we know it's the right approach?

 Packaged Zookeeper does not compile on OSX Yosemite
 ---

 Key: MESOS-1797
 URL: https://issues.apache.org/jira/browse/MESOS-1797
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 0.20.0, 0.21.0, 0.19.1
Reporter: Dario Rexin
Priority: Minor

 I have been struggling with this for some time (due to my lack of knowledge 
 about C compiler error messages) and finally found a way to make it compile. 
 The problem is that Zookeeper defines a function `htonll` that is a builtin 
 function in Yosemite. For me it worked to just remove this function, but as 
 it needs to keep working on other systems as well, we would need some check 
 for the OS version or if the function is already defined.
 Here are the links to the source:
 https://github.com/apache/zookeeper/blob/trunk/src/c/include/recordio.h#L73
 https://github.com/apache/zookeeper/blob/trunk/src/c/src/recordio.c#L83-L97
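
 A hedged sketch of the kind of guard that could work (assumptions: a configure-time check defining HAVE_HTONLL when the platform already ships the function, and a generic htonl-based fallback rather than ZooKeeper's exact implementation):
 {code}
 #include <stdint.h>
 #include <arpa/inet.h>

 /* Sketch only: define the fallback htonll only when the platform does not
  * already provide one (OS X Yosemite does). HAVE_HTONLL is assumed to come
  * from a configure-time check, not from the existing ZooKeeper build. */
 #ifndef HAVE_HTONLL
 static int64_t htonll(int64_t v)
 {
     if (htonl(1) == 1) {
         return v;  /* host is already big-endian */
     }
     return ((int64_t) htonl((uint32_t) v) << 32) | htonl((uint32_t) (v >> 32));
 }
 #endif
 {code}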



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1696) Improve reconciliation between master and slave.

2014-09-30 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154012#comment-14154012
 ] 

Benjamin Mahler commented on MESOS-1696:


https://reviews.apache.org/r/26202/
https://reviews.apache.org/r/26206/
https://reviews.apache.org/r/26207/
https://reviews.apache.org/r/26208/

 Improve reconciliation between master and slave.
 

 Key: MESOS-1696
 URL: https://issues.apache.org/jira/browse/MESOS-1696
 Project: Mesos
  Issue Type: Bug
  Components: master, slave
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 As we update the Master to keep tasks in memory until they are both terminal 
 and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as 
 follows:
 {code}
 Master   Slave
  {}   {}
 {Tn}  {}  // Master receives Task T, non-terminal. Forwards to 
 slave.
 {Tn} {Tn} // Slave receives Task T, non-terminal.
 {Tn} {Tt} // Task becomes terminal on slave. Update forwarded.
 {Tt} {Tt} // Master receives update, forwards to framework.
  {}  {Tt} // Master receives ack, forwards to slave.
  {}   {}  // Slave receives ack.
 {code}
 In the current form of reconciliation, the slave sends to the master all 
 tasks that are not both terminal and acknowledged. At any point in the above 
 lifecycle, the slave's re-registration message can reach the master.
 Note the following properties:
 *(1)* The master may have a non-terminal task, not present in the slave's 
 re-registration message.
 *(2)* The master may have a non-terminal task, present in the slave's 
 re-registration message but in a different state.
 *(3)* The slave's re-registration message may contain a terminal 
 unacknowledged task unknown to the master.
 In the current master / slave 
 [reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146]
  code, the master assumes that case (1) is because a launch task message was 
 dropped, and it sends TASK_LOST. We've seen above that (1) can happen even 
 when the task reaches the slave correctly, so this can lead to inconsistency!
 After chatting with [~vinodkone], we're considering updating the 
 reconciliation to occur as follows:
 → Slave sends all tasks that are not both terminal and acknowledged, during 
 re-registration. This is the same as before.
 → If the master sees tasks that are missing in the slave, the master sends 
 the tasks that need to be reconciled to the slave for the tasks. This can be 
 piggy-backed on the re-registration message.
 → The slave will send TASK_LOST if the task is not known to it. Preferably in 
 a retried manner, unless we update socket closure on the slave to force a 
 re-registration.
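
 A hedged slave-side sketch of that last step (the helper names are hypothetical, not the committed API):
 {code}
 // Illustrative sketch only: reply TASK_LOST for every task the master still
 // believes in but the slave does not know about. 'tasksFromMaster' would be
 // the tasks piggy-backed on the re-registration reply; knowsTask() and
 // sendTaskLost() are hypothetical helpers.
 foreach (const Task& task, tasksFromMaster) {
   if (!knowsTask(task.framework_id(), task.task_id())) {
     sendTaskLost(task);  // sent as a retried status update, per the proposal
   }
 }
 {code}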



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (MESOS-1347) GarbageCollectorIntegrationTest.DiskUsage is flaky.

2014-10-01 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reopened MESOS-1347:


Re-opening as it appears to still be flaky:

https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-In-Src-Set-JAVA_HOME/2137/consoleFull
{noformat}
[ RUN  ] GarbageCollectorIntegrationTest.DiskUsage
Using temporary directory 
'/tmp/GarbageCollectorIntegrationTest_DiskUsage_tjpfEc'
I1001 03:47:36.653859  9413 leveldb.cpp:176] Opened db in 2.065433ms
I1001 03:47:36.654671  9413 leveldb.cpp:183] Compacted db in 784728ns
I1001 03:47:36.654695  9413 leveldb.cpp:198] Created db iterator in 3540ns
I1001 03:47:36.654711  9413 leveldb.cpp:204] Seeked to beginning of db in 683ns
I1001 03:47:36.654722  9413 leveldb.cpp:273] Iterated through 0 keys in the db 
in 208ns
I1001 03:47:36.654742  9413 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I1001 03:47:36.654906  9433 recover.cpp:425] Starting replica recovery
I1001 03:47:36.654992  9433 recover.cpp:451] Replica is in EMPTY status
I1001 03:47:36.655396  9429 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1001 03:47:36.655482  9437 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I1001 03:47:36.655678  9428 recover.cpp:542] Updating replica status to STARTING
I1001 03:47:36.656245  9434 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 494048ns
I1001 03:47:36.656272  9434 replica.cpp:320] Persisted replica status to 
STARTING
I1001 03:47:36.656352  9429 master.cpp:312] Master 
20141001-034736-3176252227-46678-9413 (proserpina.apache.org) started on 
67.195.81.189:46678
I1001 03:47:36.656388  9429 master.cpp:358] Master only allowing authenticated 
frameworks to register
I1001 03:47:36.656404  9429 master.cpp:363] Master only allowing authenticated 
slaves to register
I1001 03:47:36.656421  9429 credentials.hpp:36] Loading credentials for 
authentication from 
'/tmp/GarbageCollectorIntegrationTest_DiskUsage_tjpfEc/credentials'
I1001 03:47:36.656436  9442 recover.cpp:451] Replica is in STARTING status
I1001 03:47:36.656574  9429 master.cpp:392] Authorization enabled
I1001 03:47:36.656782  9431 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I1001 03:47:36.656842  9442 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I1001 03:47:36.656867  9438 hierarchical_allocator_process.hpp:299] 
Initializing hierarchical allocator process with master : 
master@67.195.81.189:46678
I1001 03:47:36.657053  9437 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I1001 03:47:36.657254  9441 master.cpp:1241] The newly elected leader is 
master@67.195.81.189:46678 with id 20141001-034736-3176252227-46678-9413
I1001 03:47:36.657279  9441 master.cpp:1254] Elected as the leading master!
I1001 03:47:36.657292  9441 master.cpp:1072] Recovering from registrar
I1001 03:47:36.657311  9440 recover.cpp:542] Updating replica status to VOTING
I1001 03:47:36.657403  9436 registrar.cpp:312] Recovering registrar
I1001 03:47:36.657766  9437 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 295743ns
I1001 03:47:36.657793  9437 replica.cpp:320] Persisted replica status to VOTING
I1001 03:47:36.657863  9433 recover.cpp:556] Successfully joined the Paxos group
I1001 03:47:36.657943  9433 recover.cpp:440] Recover process terminated
I1001 03:47:36.658114  9432 log.cpp:656] Attempting to start the writer
I1001 03:47:36.658612  9438 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I1001 03:47:36.658779  9438 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 141800ns
I1001 03:47:36.658797  9438 replica.cpp:342] Persisted promised to 1
I1001 03:47:36.659145  9432 coordinator.cpp:230] Coordinator attemping to fill 
missing position
I1001 03:47:36.659880  9437 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I1001 03:47:36.769940  9437 leveldb.cpp:343] Persisting action (8 bytes) to 
leveldb took 516875ns
I1001 03:47:36.769964  9437 replica.cpp:676] Persisted action at 0
I1001 03:47:36.770449  9437 replica.cpp:508] Replica received write request for 
position 0
I1001 03:47:36.770480  9437 leveldb.cpp:438] Reading position from leveldb took 
12227ns
I1001 03:47:36.770740  9437 leveldb.cpp:343] Persisting action (14 bytes) to 
leveldb took 237752ns
I1001 03:47:36.770764  9437 replica.cpp:676] Persisted action at 0
I1001 03:47:36.771070  9435 replica.cpp:655] Replica received learned notice 
for position 0
I1001 03:47:36.771237  9435 leveldb.cpp:343] Persisting action (16 bytes) to 
leveldb took 145713ns
I1001 03:47:36.771257  9435 replica.cpp:676] Persisted action at 0
I1001 03:47:36.771268  9435 replica.cpp:661] Replica learned NOP action at 
position 0
I1001 03:47:36.771442  9442 log.cpp:672] Writer started with 

[jira] [Updated] (MESOS-1854) SlaveRecoveryTest.MultipleSlaves is flaky.

2014-10-01 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1854:
---
Sprint: Mesos Q3 Sprint 6

 SlaveRecoveryTest.MultipleSlaves is flaky.
 --

 Key: MESOS-1854
 URL: https://issues.apache.org/jira/browse/MESOS-1854
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler

 https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2408/consoleFull
 {noformat}
 [ RUN  ] SlaveRecoveryTest/0.MultipleSlaves
 Using temporary directory '/tmp/SlaveRecoveryTest_0_MultipleSlaves_yOuJDJ'
 I1001 01:25:43.585139 23806 leveldb.cpp:176] Opened db in 2.028599ms
 I1001 01:25:43.585713 23806 leveldb.cpp:183] Compacted db in 552764ns
 I1001 01:25:43.585731 23806 leveldb.cpp:198] Created db iterator in 3825ns
 I1001 01:25:43.585738 23806 leveldb.cpp:204] Seeked to beginning of db in 
 700ns
 I1001 01:25:43.585744 23806 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 370ns
 I1001 01:25:43.585759 23806 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I1001 01:25:43.586006 23828 recover.cpp:425] Starting replica recovery
 I1001 01:25:43.586093 23828 recover.cpp:451] Replica is in EMPTY status
 I1001 01:25:43.586524 23828 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I1001 01:25:43.586606 23824 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I1001 01:25:43.586741 23831 recover.cpp:542] Updating replica status to 
 STARTING
 I1001 01:25:43.586899 23825 master.cpp:312] Master 
 20141001-012543-3176252227-55929-23806 (proserpina.apache.org) started on 
 67.195.81.189:55929
 I1001 01:25:43.586930 23825 master.cpp:358] Master only allowing 
 authenticated frameworks to register
 I1001 01:25:43.586942 23825 master.cpp:363] Master only allowing 
 authenticated slaves to register
 I1001 01:25:43.586953 23825 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/SlaveRecoveryTest_0_MultipleSlaves_yOuJDJ/credentials'
 I1001 01:25:43.587057 23825 master.cpp:392] Authorization enabled
 I1001 01:25:43.587241 23829 master.cpp:120] No whitelist given. Advertising 
 offers for all slaves
 I1001 01:25:43.587270 23828 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 484210ns
 I1001 01:25:43.587278 23823 hierarchical_allocator_process.hpp:299] 
 Initializing hierarchical allocator process with master : 
 master@67.195.81.189:55929
 I1001 01:25:43.587288 23828 replica.cpp:320] Persisted replica status to 
 STARTING
 I1001 01:25:43.587393 23828 recover.cpp:451] Replica is in STARTING status
 I1001 01:25:43.587611 23825 master.cpp:1241] The newly elected leader is 
 master@67.195.81.189:55929 with id 20141001-012543-3176252227-55929-23806
 I1001 01:25:43.587631 23825 master.cpp:1254] Elected as the leading master!
 I1001 01:25:43.587643 23825 master.cpp:1072] Recovering from registrar
 I1001 01:25:43.587704 23824 registrar.cpp:312] Recovering registrar
 I1001 01:25:43.587731 23827 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I1001 01:25:43.587937 23821 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I1001 01:25:43.588060 23827 recover.cpp:542] Updating replica status to VOTING
 I1001 01:25:43.588371 23830 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 143615ns
 I1001 01:25:43.588392 23830 replica.cpp:320] Persisted replica status to 
 VOTING
 I1001 01:25:43.588433 23821 recover.cpp:556] Successfully joined the Paxos 
 group
 I1001 01:25:43.588496 23821 recover.cpp:440] Recover process terminated
 I1001 01:25:43.588632 23820 log.cpp:656] Attempting to start the writer
 I1001 01:25:43.589174 23832 replica.cpp:474] Replica received implicit 
 promise request with proposal 1
 I1001 01:25:43.589617 23832 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 427035ns
 I1001 01:25:43.589630 23832 replica.cpp:342] Persisted promised to 1
 I1001 01:25:43.589833 23821 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I1001 01:25:43.590340 23821 replica.cpp:375] Replica received explicit 
 promise request for position 0 with proposal 2
 I1001 01:25:43.590499 23821 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 142051ns
 I1001 01:25:43.590512 23821 replica.cpp:676] Persisted action at 0
 I1001 01:25:43.590903 23833 replica.cpp:508] Replica received write request 
 for position 0
 I1001 01:25:43.590932 23833 leveldb.cpp:438] Reading position from leveldb 
 took 14221ns
 I1001 01:25:43.591089 23833 leveldb.cpp:343] Persisting action (14 bytes) to 
 leveldb took 140263ns
 I1001 01:25:43.591101 23833 replica.cpp:676] 

[jira] [Updated] (MESOS-920) Set GLOG_drop_log_memory=false in environment prior to logging initialization.

2014-10-01 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-920:
--
Component/s: technical debt

 Set GLOG_drop_log_memory=false in environment prior to logging initialization.
 --

 Key: MESOS-920
 URL: https://issues.apache.org/jira/browse/MESOS-920
 Project: Mesos
  Issue Type: Improvement
  Components: technical debt
Affects Versions: 0.16.0, 0.15.0
Reporter: Benjamin Mahler

 We've observed performance scaling issues attributed to the posix_fadvise 
 calls made by glog. This can currently only be disabled via the environment:
 GLOG_DEFINE_bool(drop_log_memory, true, "Drop in-memory buffers of log contents. "
                  "Logs can grow very quickly and they are rarely read before they "
                  "need to be evicted from memory. Instead, drop them from memory "
                  "as soon as they are flushed to disk.");

 if (FLAGS_drop_log_memory) {
   if (file_length_ >= logging::kPageSize) {
     // don't evict the most recent page
     uint32 len = file_length_ & ~(logging::kPageSize - 1);
     posix_fadvise(fileno(file_), 0, len, POSIX_FADV_DONTNEED);
   }
 }
 We should set GLOG_drop_log_memory=false prior to making our call to 
 google::InitGoogleLogging.
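
 A minimal sketch of that change, assuming the variable is set via setenv(3) immediately before glog initialization (the exact call site in Mesos may differ):
 {code}
 #include <stdlib.h>

 #include <glog/logging.h>

 int main(int argc, char** argv)
 {
   // Sketch only: disable glog's posix_fadvise-based page dropping before the
   // flag is read during initialization; overwrite = 0 keeps any explicit
   // user-provided setting.
   setenv("GLOG_drop_log_memory", "false", 0);

   google::InitGoogleLogging(argv[0]);

   // ... rest of the program ...
   return 0;
 }
 {code}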



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1858) Leaked file descriptors in StatusUpdateStream.

2014-10-02 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156982#comment-14156982
 ] 

Benjamin Mahler commented on MESOS-1858:


Linking in MESOS-1432.

 Leaked file descriptors in StatusUpdateStream.
 --

 Key: MESOS-1858
 URL: https://issues.apache.org/jira/browse/MESOS-1858
 Project: Mesos
  Issue Type: Bug
Reporter: Jie Yu

 https://github.com/apache/mesos/blob/master/src/slave/status_update_manager.hpp#L180
 We should set cloexec for 'fd'.
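
 A minimal sketch of the fix using the standard close-on-exec flag via fcntl(2); stout's os::cloexec helper, if preferred, wraps the same call:
 {code}
 #include <fcntl.h>

 // Sketch only: mark the status update stream's fd close-on-exec so that
 // forked children (e.g. executors) do not inherit it.
 int flags = fcntl(fd, F_GETFD);
 if (flags == -1 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1) {
   // propagate the error (e.g. as an ErrnoError) instead of leaking the fd
 }
 {code}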



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1862) Performance regression in the Master's http metrics.

2014-10-03 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1862:
--

 Summary: Performance regression in the Master's http metrics.
 Key: MESOS-1862
 URL: https://issues.apache.org/jira/browse/MESOS-1862
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.21.0
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
Priority: Blocker


As part of the change to hold on to terminal unacknowledged tasks in the 
master, we introduced a performance regression during the following patch:

https://github.com/apache/mesos/commit/0760b007ad65bc91e8cea377339978c78d36d247
{noformat}
commit 0760b007ad65bc91e8cea377339978c78d36d247
Author: Benjamin Mahler bmah...@twitter.com
Date:   Thu Sep 11 10:48:20 2014 -0700

Minor cleanups to the Master code.

Review: https://reviews.apache.org/r/25566
{noformat}

Rather than keeping a running count of allocated resources, we now compute 
resources on-demand. This was done in order to ignore terminal tasks' resources.

As a result of this change, the /stats.json and /metrics/snapshot endpoints on 
the master have slowed down substantially on large clusters.

{noformat}
$ time curl localhost:5050/health
real    0m0.004s
user    0m0.001s
sys     0m0.002s

$ time curl localhost:5050/stats.json > /dev/null
real    0m15.402s
user    0m0.001s
sys     0m0.003s

$ time curl localhost:5050/metrics/snapshot > /dev/null
real    0m6.059s
user    0m0.002s
sys     0m0.002s
{noformat}

{{perf top}} reveals some of the resource computation during a request to 
stats.json:
{noformat: perf top}
Events: 36K cycles
 10.53%  libc-2.5.so         [.] _int_free
  9.90%  libc-2.5.so         [.] malloc
  8.56%  libmesos-0.21.0.so  [.] std::_Rb_tree<process::ProcessBase*,
          process::ProcessBase*, std::_Identity<process::ProcessBase*>,
          std::less<process::ProcessBase*>, std::allocator<process::ProcessBase*> >::
  8.23%  libc-2.5.so         [.] _int_malloc
  5.80%  libstdc++.so.6.0.8  [.] std::_Rb_tree_increment(std::_Rb_tree_node_base*)
  5.33%  [kernel]            [k] _raw_spin_lock
  3.13%  libstdc++.so.6.0.8  [.] std::string::assign(std::string const&)
  2.95%  libmesos-0.21.0.so  [.] process::SocketManager::exited(process::ProcessBase*)
  2.43%  libmesos-0.21.0.so  [.] mesos::Resource::MergeFrom(mesos::Resource const&)
  1.88%  libmesos-0.21.0.so  [.] mesos::internal::master::Slave::used() const
  1.48%  libstdc++.so.6.0.8  [.] __gnu_cxx::__atomic_add(int volatile*, int)
  1.45%  [kernel]            [k] find_busiest_group
  1.41%  libc-2.5.so         [.] free
  1.38%  libmesos-0.21.0.so  [.] mesos::Value_Range::MergeFrom(mesos::Value_Range const&)
  1.13%  libmesos-0.21.0.so  [.] mesos::Value_Scalar::MergeFrom(mesos::Value_Scalar const&)
  1.12%  libmesos-0.21.0.so  [.] mesos::Resource::SharedDtor()
  1.07%  libstdc++.so.6.0.8  [.] __gnu_cxx::__exchange_and_add(int volatile*, int)
  0.94%  libmesos-0.21.0.so  [.] google::protobuf::UnknownFieldSet::MergeFrom(google::protobuf::UnknownFieldSet const&)
  0.92%  libstdc++.so.6.0.8  [.] operator new(unsigned long)
  0.88%  libmesos-0.21.0.so  [.] mesos::Value_Ranges::MergeFrom(mesos::Value_Ranges const&)
  0.75%  libmesos-0.21.0.so  [.] mesos::matches(mesos::Resource const&, mesos::Resource const&)
{noformat}
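
A hedged sketch of one way to restore the old cost profile (illustrative only, not necessarily the eventual patch): keep the allocation as an incrementally maintained total instead of rescanning every task inside Slave::used() on each metrics request.

{code}
// Illustrative sketch only: maintain the allocation incrementally so that
// /stats.json and /metrics/snapshot read a cached value.
void Slave::addTask(Task* task)
{
  // ... existing bookkeeping ...
  if (!protobuf::isTerminalState(task->state())) {
    usedResources += Resources(task->resources());  // cached running total
  }
}

const Resources& Slave::used() const
{
  return usedResources;  // O(1), instead of recomputing over all tasks
}
{code}

(removeTask and status-update handling would subtract symmetrically when tasks become terminal.)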



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1842) Create benchmark

2014-10-06 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160979#comment-14160979
 ] 

Benjamin Mahler commented on MESOS-1842:


Looks like this is a duplicate of MESOS-1018?

 Create benchmark
 

 Key: MESOS-1842
 URL: https://issues.apache.org/jira/browse/MESOS-1842
 Project: Mesos
  Issue Type: Technical task
  Components: libprocess
Reporter: Joris Van Remoortere
  Labels: performance, test





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-681) Document the reconciliation API.

2014-10-06 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-681:
--
 Description: 
Now that we have a reconciliation mechanism, we should document why it exists 
and how to use it going forward.

As we add the lower level API, reconciliation may be done slightly differently. 
Having documentation that reflects the changes would be great.

It might also be helpful to upload my slides from the May 19th meetup.

  was:
Now that we have a reconciliation mechanism, we should document why it exists 
and how to use it in 0.19.0 and going forward.

As we add the lower level API, reconciliation may be done slightly differently. 
Having documentation that reflects the changes would be great.

It might also be helpful to upload my slides from the May 19th meetup.

Story Points:   (was: 2)
 Summary: Document the reconciliation API.  (was: Document the 0.21.0 
reconciliation API.)

 Document the reconciliation API.
 

 Key: MESOS-681
 URL: https://issues.apache.org/jira/browse/MESOS-681
 Project: Mesos
  Issue Type: Task
  Components: documentation
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
 Attachments: 0.19.0.key, 0.19.0.pdf


 Now that we have a reconciliation mechanism, we should document why it exists 
 and how to use it going forward.
 As we add the lower level API, reconciliation may be done slightly 
 differently. Having documentation that reflects the changes would be great.
 It might also be helpful to upload my slides from the May 19th meetup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1862) Performance regression in the Master's http metrics.

2014-10-06 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161297#comment-14161297
 ] 

Benjamin Mahler commented on MESOS-1862:


https://reviews.apache.org/r/26392/

 Performance regression in the Master's http metrics.
 

 Key: MESOS-1862
 URL: https://issues.apache.org/jira/browse/MESOS-1862
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.21.0
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
Priority: Blocker

 As part of the change to hold on to terminal unacknowledged tasks in the 
 master, we introduced a performance regression during the following patch:
 https://github.com/apache/mesos/commit/0760b007ad65bc91e8cea377339978c78d36d247
 {noformat}
 commit 0760b007ad65bc91e8cea377339978c78d36d247
 Author: Benjamin Mahler bmah...@twitter.com
 Date:   Thu Sep 11 10:48:20 2014 -0700
 Minor cleanups to the Master code.
 Review: https://reviews.apache.org/r/25566
 {noformat}
 Rather than keeping a running count of allocated resources, we now compute 
 resources on-demand. This was done in order to ignore terminal tasks' 
 resources.
 As a result of this change, the /stats.json and /metrics/snapshot endpoints 
 on the master have slowed down substantially on large clusters.
 {noformat}
 $ time curl localhost:5050/health
 real  0m0.004s
 user  0m0.001s
 sys   0m0.002s
 $ time curl localhost:5050/stats.json > /dev/null
 real  0m15.402s
 user  0m0.001s
 sys   0m0.003s
 $ time curl localhost:5050/metrics/snapshot > /dev/null
 real  0m6.059s
 user  0m0.002s
 sys   0m0.002s
 {noformat}
 {{perf top}} reveals some of the resource computation during a request to 
 stats.json:
 {noformat: perf top}
 Events: 36K cycles
  10.53%  libc-2.5.so         [.] _int_free
   9.90%  libc-2.5.so         [.] malloc
   8.56%  libmesos-0.21.0.so  [.] std::_Rb_tree<process::ProcessBase*,
           process::ProcessBase*, std::_Identity<process::ProcessBase*>,
           std::less<process::ProcessBase*>, std::allocator<process::ProcessBase*> >::
   8.23%  libc-2.5.so         [.] _int_malloc
   5.80%  libstdc++.so.6.0.8  [.] std::_Rb_tree_increment(std::_Rb_tree_node_base*)
   5.33%  [kernel]            [k] _raw_spin_lock
   3.13%  libstdc++.so.6.0.8  [.] std::string::assign(std::string const&)
   2.95%  libmesos-0.21.0.so  [.] process::SocketManager::exited(process::ProcessBase*)
   2.43%  libmesos-0.21.0.so  [.] mesos::Resource::MergeFrom(mesos::Resource const&)
   1.88%  libmesos-0.21.0.so  [.] mesos::internal::master::Slave::used() const
   1.48%  libstdc++.so.6.0.8  [.] __gnu_cxx::__atomic_add(int volatile*, int)
   1.45%  [kernel]            [k] find_busiest_group
   1.41%  libc-2.5.so         [.] free
   1.38%  libmesos-0.21.0.so  [.] mesos::Value_Range::MergeFrom(mesos::Value_Range const&)
   1.13%  libmesos-0.21.0.so  [.] mesos::Value_Scalar::MergeFrom(mesos::Value_Scalar const&)
   1.12%  libmesos-0.21.0.so  [.] mesos::Resource::SharedDtor()
   1.07%  libstdc++.so.6.0.8  [.] __gnu_cxx::__exchange_and_add(int volatile*, int)
   0.94%  libmesos-0.21.0.so  [.] google::protobuf::UnknownFieldSet::MergeFrom(google::protobuf::UnknownFieldSet const&)
   0.92%  libstdc++.so.6.0.8  [.] operator new(unsigned long)
   0.88%  libmesos-0.21.0.so  [.] mesos::Value_Ranges::MergeFrom(mesos::Value_Ranges const&)
   0.75%  libmesos-0.21.0.so  [.] mesos::matches(mesos::Resource const&, mesos::Resource const&)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1876) Remove deprecated 'slave_id' field in ReregisterSlaveMessage.

2014-10-08 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1876:
--

 Summary: Remove deprecated 'slave_id' field in 
ReregisterSlaveMessage.
 Key: MESOS-1876
 URL: https://issues.apache.org/jira/browse/MESOS-1876
 Project: Mesos
  Issue Type: Task
  Components: technical debt
Reporter: Benjamin Mahler


This is to follow through on removing the deprecated field that we've been 
phasing out. In 0.21.0, this field will no longer be read:

{code}
message ReregisterSlaveMessage {
  // TODO(bmahler): slave_id is deprecated.
  // 0.21.0: Now an optional field. Always written, never read.
  // 0.22.0: Remove this field.
  optional SlaveID slave_id = 1;
  required SlaveInfo slave = 2;
  repeated ExecutorInfo executor_infos = 4;
  repeated Task tasks = 3;
  repeated Archive.Framework completed_frameworks = 5;
}
{code}
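
For context, a hypothetical reader-side sketch of what 0.22.0 code would rely on once the field is gone ('message' is assumed to be the received ReregisterSlaveMessage): the SlaveInfo's own id becomes the only source of the SlaveID.

{code}
// Hypothetical sketch only: with the deprecated field removed, the slave's
// ID comes solely from the SlaveInfo carried in the re-registration message.
const SlaveInfo& slaveInfo = message.slave();

CHECK(slaveInfo.has_id()) << "Expecting a re-registering slave to have an ID";

const SlaveID& slaveId = slaveInfo.id();
// ... look up / re-admit the slave using slaveId ...
{code}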



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1876) Remove deprecated 'slave_id' field in ReregisterSlaveMessage.

2014-10-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1876:
---
Priority: Trivial  (was: Major)

 Remove deprecated 'slave_id' field in ReregisterSlaveMessage.
 -

 Key: MESOS-1876
 URL: https://issues.apache.org/jira/browse/MESOS-1876
 Project: Mesos
  Issue Type: Task
  Components: technical debt
Reporter: Benjamin Mahler
Priority: Trivial

 This is to follow through on removing the deprecated field that we've been 
 phasing out. In 0.21.0, this field will no longer be read:
 {code}
 message ReregisterSlaveMessage {
   // TODO(bmahler): slave_id is deprecated.
   // 0.21.0: Now an optional field. Always written, never read.
   // 0.22.0: Remove this field.
   optional SlaveID slave_id = 1;
   required SlaveInfo slave = 2;
   repeated ExecutorInfo executor_infos = 4;
   repeated Task tasks = 3;
   repeated Archive.Framework completed_frameworks = 5;
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-1879) Handle a temporary one-way slave --> master socket closure.

2014-10-08 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-1879:
--

 Summary: Handle a temporary one-way slave --> master socket 
closure.
 Key: MESOS-1879
 URL: https://issues.apache.org/jira/browse/MESOS-1879
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Mahler
Priority: Minor


In the same spirit as MESOS-1668, we want to correctly handle a scenario where 
the slave --> master socket closes, and a new socket can be immediately 
re-established.

If this occurs, the ping / pongs will resume but there may be dropped messages 
sent by the slave, and so a re-registration would be a good safety net.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1878) Access to sandbox on slave from master UI does not show the sandbox contents

2014-10-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1878:
---
 Target Version/s: 0.21.0
Affects Version/s: 0.21.0
 Assignee: Cody Maloney

[~cmaloney] can you take a look at this?

 Access to sandbox on slave from master UI does not show the sandbox contents
 

 Key: MESOS-1878
 URL: https://issues.apache.org/jira/browse/MESOS-1878
 Project: Mesos
  Issue Type: Bug
  Components: webui
Affects Versions: 0.21.0
Reporter: Anindya Sinha
Assignee: Cody Maloney
Priority: Minor

 From the master UI, clicking Sandbox to go to the slave sandbox does not list 
 the sandbox contents. The directory path of the sandbox shows up fine, but the 
 actual contents of the sandbox displayed below it are missing.
 Looks like the issue is that the following GET to the corresponding slave fails:
 http://slave-host:4891/files/browse.json?jsonp=angular.callbacks._9&path=sandbox-path
 Looking at the commits, I could confirm that the issue is not seen with 
 commit 'babb1c06ecf3077f292a19cfcbf1f1a4ed0e07b1'. Rolling back to a mesos 
 build with this commit being the last commit on mesos slave does not show 
 this behavior.
 Update: The issue has been introduced by the following 2 commits:
 ca2e8ef MESOS-1857 Fixed path::join() on older libstdc++ which lack back().
 b08fccf Switched path::join() to be variadic
 Note that the commit ca2e8ef fixes a build issue (on older libstd++) on top 
 of the commit b08fccf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1869) UpdateFramework message might reach the slave before Reregistered message and get dropped

2014-10-08 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164526#comment-14164526
 ] 

Benjamin Mahler commented on MESOS-1869:


Fixed as part of MESOS-1696:
https://reviews.apache.org/r/26206/

 UpdateFramework message might reach the slave before Reregistered message and 
 get dropped
 -

 Key: MESOS-1869
 URL: https://issues.apache.org/jira/browse/MESOS-1869
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone
Assignee: Benjamin Mahler

 In reregisterSlave() we send 'SlaveReregisteredMessage' before we link the 
 slave pid, which means a temporary socket will be created and used.
 Subsequently, after linking, we send the UpdateFrameworkMessage, which 
 creates and uses a persistent socket.
 This might lead to out-of-order delivery, resulting in UpdateFrameworkMessage 
 reaching the slave before the SlaveReregisteredMessage and getting dropped 
 because the slave is not yet (re-)registered.
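
 A hedged sketch of the ordering change (illustrative only; the review linked above contains the actual fix, and the message variables below are assumed to be already constructed): link before sending anything, so both messages travel over the same persistent socket.
 {code}
 // Illustrative sketch only: linking first means SlaveReregisteredMessage and
 // UpdateFrameworkMessage share one persistent socket and cannot be reordered
 // by a short-lived temporary connection.
 link(slave->pid);                          // 1. establish the persistent socket

 send(slave->pid, reregisteredMessage);     // 2. SlaveReregisteredMessage
 send(slave->pid, updateFrameworkMessage);  // 3. UpdateFrameworkMessage(s)
 {code}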



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1469) No output from review bot on timeout

2014-10-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1469:
---
Component/s: reviewbot

 No output from review bot on timeout
 

 Key: MESOS-1469
 URL: https://issues.apache.org/jira/browse/MESOS-1469
 Project: Mesos
  Issue Type: Bug
  Components: build, reviewbot
Reporter: Dominic Hamon
Assignee: Dominic Hamon
Priority: Minor

 When the mesos review build times out, likely due to a long-running failing 
 test, we have no output to debug. We should find a way to stream the output 
 from the build instead of waiting for the build to finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1234) Mesos ReviewBot should look at old reviews first

2014-10-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1234:
---
Component/s: reviewbot

 Mesos ReviewBot should look at old reviews first
 

 Key: MESOS-1234
 URL: https://issues.apache.org/jira/browse/MESOS-1234
 Project: Mesos
  Issue Type: Improvement
  Components: reviewbot
Reporter: Vinod Kone
Assignee: Vinod Kone
 Fix For: 0.19.0


 Currently the ReviewBot looks at newest reviews first, starving out old 
 reviews if there are enough new/updated reviews.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1712) Automate disallowing of commits mixing mesos/libprocess/stout

2014-10-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1712:
---
Component/s: reviewbot

 Automate disallowing of commits mixing mesos/libprocess/stout
 -

 Key: MESOS-1712
 URL: https://issues.apache.org/jira/browse/MESOS-1712
 Project: Mesos
  Issue Type: Bug
  Components: reviewbot
Reporter: Vinod Kone

 For various reasons, we don't want to mix mesos/libprocess/stout changes into 
 a single commit. Typically, it is up to the reviewee/reviewer to catch this. 
 It would be nice to automate this via the pre-commit hook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

