[jira] [Updated] (MESOS-1550) MesosSchedulerDriver should never, ever, call 'stop'.
[ https://issues.apache.org/jira/browse/MESOS-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1550: --- Fix Version/s: 0.19.1 MesosSchedulerDriver should never, ever, call 'stop'. - Key: MESOS-1550 URL: https://issues.apache.org/jira/browse/MESOS-1550 Project: Mesos Issue Type: Bug Components: framework, java api, python api Affects Versions: 0.14.0, 0.14.1, 0.14.2, 0.17.0, 0.16.0, 0.15.0, 0.18.0, 0.19.0 Reporter: Benjamin Hindman Priority: Critical Fix For: 0.19.1 Using MesosSchedulerDriver.stop causes the master to unregister the framework. The library should never make this decision for a framework; it should defer to the framework itself. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1539) No longer able to spin up Mesos master in local mode
[ https://issues.apache.org/jira/browse/MESOS-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1539: --- Target Version/s: 0.19.1 Fix Version/s: (was: 0.19.1) No longer able to spin up Mesos master in local mode Key: MESOS-1539 URL: https://issues.apache.org/jira/browse/MESOS-1539 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.19.0 Environment: Ubuntu 14.04 / Mac OS X against Mesos 0.19.0 Reporter: Sunil Shah Assignee: Benjamin Mahler Fix For: 0.20.0 JVM frameworks such as Marathon use the local master mode for testing purposes (passed through as the `--master local` parameter). This does not work in Mesos 0.19.0 because of the new mandatory registry and quorum parameters. There is no way to set these for local masters - it emits the following message before terminating the framework: `--work_dir needed for replicated log based registry`. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1517) Maintain a queue of messages that arrive before the master recovers.
[ https://issues.apache.org/jira/browse/MESOS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046537#comment-14046537 ] Benjamin Mahler commented on MESOS-1517: There are only a few types of messages involved here. When a master fails over, the slaves and frameworks will try to re-register before doing anything else. This means that if we queue up the messages we'll only be reducing the need for frameworks and slaves to retry registration, which is already something that is required of them. So I think this change would mostly be beneficial for our integration tests where the retries are not desirable. :) Maintain a queue of messages that arrive before the master recovers. Key: MESOS-1517 URL: https://issues.apache.org/jira/browse/MESOS-1517 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Labels: reliability Currently when the master is recovering, we drop all incoming messages. If slaves and frameworks knew about the leading master only once it has recovered, then we would only expect to see messages after we've recovered. We previously considered enqueuing all messages through the recovery future, but this has the downside of forcing all messages to go through the master's queue twice:
{code}
// TODO(bmahler): Consider instead re-enqueing *all* messages
// through recover(). What are the performance implications of
// the additional queueing delay and the accumulated backlog
// of messages post-recovery?
if (!recovered.get().isReady()) {
  VLOG(1) << "Dropping '" << event.message->name << "' message since "
          << "not recovered yet";
  ++metrics.dropped_messages;
  return;
}
{code}
However, an easy solution to this problem is to maintain an explicit queue of incoming messages that gets flushed once we finish recovery. This ensures that all messages post-recovery are processed normally. -- This message was sent by Atlassian JIRA (v6.2#6252)
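The explicit-queue approach proposed in MESOS-1517 can be sketched as follows. This is a minimal, self-contained Python toy model, not the actual C++ master code; the names `Master`, `receive`, and `finish_recovery` are hypothetical:

```python
from collections import deque

class Master:
    """Toy model: hold messages that arrive before recovery completes."""

    def __init__(self):
        self.recovered = False
        self.pending = deque()   # messages that arrived pre-recovery
        self.processed = []

    def receive(self, message):
        if not self.recovered:
            # Instead of dropping (or re-enqueueing everything through
            # recover()), hold the message in an explicit queue.
            self.pending.append(message)
            return
        self.processed.append(message)

    def finish_recovery(self):
        # Flush the queue exactly once; messages arriving afterwards
        # are processed normally, preserving arrival order.
        self.recovered = True
        while self.pending:
            self.processed.append(self.pending.popleft())

master = Master()
master.receive("reregister-slave")       # arrives during recovery
master.receive("reregister-framework")   # arrives during recovery
master.finish_recovery()
master.receive("status-update")          # arrives post-recovery
print(master.processed)
```

Each message transits the queue at most once, avoiding the double-queueing cost the TODO in the quoted code worries about.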
[jira] [Updated] (MESOS-1566) Support private docker registry.
[ https://issues.apache.org/jira/browse/MESOS-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1566: --- Summary: Support private docker registry. (was: Support private registry) Support private docker registry. Key: MESOS-1566 URL: https://issues.apache.org/jira/browse/MESOS-1566 Project: Mesos Issue Type: Technical task Reporter: Timothy Chen Need to support launching Docker images hosted in a private registry service, which requires a docker login. We can consider utilizing the .dockercfg file for providing credentials. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1574) what to do when a rogue process binds to a port mesos didn't allocate to it?
[ https://issues.apache.org/jira/browse/MESOS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1574: --- Component/s: isolation what to do when a rogue process binds to a port mesos didn't allocate to it? Key: MESOS-1574 URL: https://issues.apache.org/jira/browse/MESOS-1574 Project: Mesos Issue Type: Improvement Components: allocation, isolation Reporter: Jay Buffington Priority: Minor I recently had an issue where a slave had a process, whose parent was init, bound to a port in the range that mesos thought was a free resource. I'm not sure if this is due to a bug in mesos (did it lose track of this process during an upgrade?) or if a bad user started a process on the host manually outside of mesos. The process is over a month old and I have no history in mesos to ask it if/when it launched the task :( If a rogue process binds to a port that mesos-slave has offered to the master as an available resource, there should be some sort of reckoning. Mesos could:
* kill the rogue process
* rescind the offer for that port
* have an API that can be plugged into a monitoring system to alert humans of this inconsistency
-- This message was sent by Atlassian JIRA (v6.2#6252)
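One cheap way to detect the inconsistency described above is to attempt to bind each offered port: a bind failure means something is already holding it. This is a hedged, self-contained Python heuristic for a monitoring hook, not a Mesos API; `ports_in_use` is a hypothetical helper name:

```python
import socket

def ports_in_use(offered_ports, host="127.0.0.1"):
    """Return the subset of 'offered' ports that something is already
    bound to. A non-empty result would indicate a rogue process holding
    a port the slave advertised as free."""
    conflicts = []
    for port in offered_ports:
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind((host, port))   # raises OSError (EADDRINUSE) on conflict
        except OSError:
            conflicts.append(port)
    return conflicts

# Demo: bind one port ourselves to simulate the rogue process.
rogue = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
rogue.bind(("127.0.0.1", 0))           # let the OS pick a free port
rogue.listen(1)
taken = rogue.getsockname()[1]

print(ports_in_use([taken]))           # the rogue-held port is reported
rogue.close()
```

A real check would also have to distinguish ports used by legitimately launched tasks, which requires consulting the slave's own allocation state.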
[jira] [Updated] (MESOS-1538) A container destruction in the middle of a launch leads to CHECK failure.
[ https://issues.apache.org/jira/browse/MESOS-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1538: --- Summary: A container destruction in the middle of a launch leads to CHECK failure. (was: A container destruction in the middle of a launch leads to CHECK failure) A container destruction in the middle of a launch leads to CHECK failure. - Key: MESOS-1538 URL: https://issues.apache.org/jira/browse/MESOS-1538 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Assignee: Ian Downes Fix For: 0.19.1 There is a race between destroy() and exec() in the containerizer process when destroy is called in the middle of a launch. In particular, if the destroy completes and the container is removed from the 'promises' map before 'exec()' is called, a CHECK failure occurs. The fix is to return a Failure instead of doing a CHECK in 'exec()'. -- This message was sent by Atlassian JIRA (v6.2#6252)
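The shape of the fix (return a Failure rather than crashing on a CHECK when the container has already been destroyed) can be rendered in a few lines. This is a hypothetical Python rendering, not the containerizer's actual C++; `Failure` and `exec_in_container` are illustrative names:

```python
class Failure(Exception):
    """Stands in for process::Failure: an error returned to the caller
    instead of aborting the whole process."""

def exec_in_container(promises, container_id, command):
    """Race-tolerant exec(): destroy() may have removed the container
    from the 'promises' map before we run."""
    if container_id not in promises:
        # Previously this was a CHECK, which aborted the slave.
        return Failure("container '%s' destroyed during launch" % container_id)
    # Otherwise proceed with the launch against the container's promise.
    return (promises[container_id], command)

print(exec_in_container({}, "c1", "ls"))            # Failure, not a crash
print(exec_in_container({"c1": "promise"}, "c1", "ls"))
```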
[jira] [Updated] (MESOS-1567) Add logging of the user uid when receiving SIGTERM.
[ https://issues.apache.org/jira/browse/MESOS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1567: --- Description: We currently do not log the user id when receiving a SIGTERM; this makes debugging a bit difficult. It's easy to get this information through sigaction. (was: We currently do not log the user pid when receiving a SIGTERM, this makes debugging a bit difficult. It's easy to get this information through sigaction.) Add logging of the user uid when receiving SIGTERM. --- Key: MESOS-1567 URL: https://issues.apache.org/jira/browse/MESOS-1567 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Benjamin Mahler Assignee: Alexandra Sava We currently do not log the user id when receiving a SIGTERM; this makes debugging a bit difficult. It's easy to get this information through sigaction. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources
[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1466: --- Labels: reliability (was: ) Race between executor exited event and launch task can cause overcommit of resources Key: MESOS-1466 URL: https://issues.apache.org/jira/browse/MESOS-1466 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Vinod Kone Labels: reliability The following sequence of events can cause an overcommit:
- Launch task is called for a task whose executor is already running.
- The executor's resources are not accounted for on the master.
- The executor exits and the event is enqueued behind the launch task on the master.
- The master sends the task to the slave, which needs to commit resources for the task and the (new) executor.
- The master processes the executor exited event and re-offers the executor's resources, causing an overcommit of resources.
-- This message was sent by Atlassian JIRA (v6.2#6252)
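The interleaving above can be reproduced with a toy accounting model. This is a hedged Python illustration (the real master is C++); the numbers and variable names are invented purely to show how the master's view and the slave's ground truth diverge:

```python
capacity = 8

# Slave's ground truth: the old executor is running and using 2 units.
slave_used = 2

# Master's view: executor resources are not accounted for (the first
# condition in the ticket), so the master believes the slave is empty.
master_used = 0

# Framework launches a 4-unit task on the "running" executor. The old
# executor has actually exited, so the slave commits the task (4) plus
# a fresh executor (2), while the old executor's 2 units are released.
slave_used += 4 + 2
slave_used -= 2
master_used += 4              # master accounts for the task only

# The stale 'executor exited' event, queued behind the launch, is now
# processed: the master re-offers the executor's 2 units even though
# the new executor on the slave is using them.
master_used -= 2

master_free = capacity - master_used   # what the master will offer: 6
slave_free = capacity - slave_used     # what is actually available: 2

print(master_free, slave_free)         # the gap is the overcommit
```

The master ends up offering four more units than the slave can actually provide: two from the never-accounted old executor and two from the re-offered units the new executor is consuming.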
[jira] [Updated] (MESOS-1603) SlaveTest.TerminatingSlaveDoesNotReregister is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1603: --- Sprint: Q3 Sprint 1 SlaveTest.TerminatingSlaveDoesNotReregister is flaky. - Key: MESOS-1603 URL: https://issues.apache.org/jira/browse/MESOS-1603 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler {noformat} [ RUN ] SlaveTest.TerminatingSlaveDoesNotReregister Using temporary directory '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_6OCQiU' I0715 18:16:03.231495 5857 leveldb.cpp:176] Opened db in 27.552259ms I0715 18:16:03.240953 5857 leveldb.cpp:183] Compacted db in 8.801497ms I0715 18:16:03.241580 5857 leveldb.cpp:198] Created db iterator in 39823ns I0715 18:16:03.241945 5857 leveldb.cpp:204] Seeked to beginning of db in 15498ns I0715 18:16:03.242385 5857 leveldb.cpp:273] Iterated through 0 keys in the db in 15153ns I0715 18:16:03.242780 5857 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0715 18:16:03.243475 5882 recover.cpp:425] Starting replica recovery I0715 18:16:03.243540 5882 recover.cpp:451] Replica is in EMPTY status I0715 18:16:03.243862 5882 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0715 18:16:03.243919 5882 recover.cpp:188] Received a recover response from a replica in EMPTY status I0715 18:16:03.244112 5875 recover.cpp:542] Updating replica status to STARTING I0715 18:16:03.249405 5880 master.cpp:288] Master 20140715-181603-16842879-36514-5857 (trusty) started on 127.0.1.1:36514 I0715 18:16:03.249445 5880 master.cpp:325] Master only allowing authenticated frameworks to register I0715 18:16:03.249454 5880 master.cpp:330] Master only allowing authenticated slaves to register I0715 18:16:03.249480 5880 credentials.hpp:36] Loading credentials for authentication from '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_6OCQiU/credentials' I0715 18:16:03.250130 5880 
master.cpp:359] Authorization enabled I0715 18:16:03.250900 5880 hierarchical_allocator_process.hpp:301] Initializing hierarchical allocator process with master : master@127.0.1.1:36514 I0715 18:16:03.250951 5880 master.cpp:122] No whitelist given. Advertising offers for all slaves I0715 18:16:03.251145 5880 master.cpp:1128] The newly elected leader is master@127.0.1.1:36514 with id 20140715-181603-16842879-36514-5857 I0715 18:16:03.251164 5880 master.cpp:1141] Elected as the leading master! I0715 18:16:03.251173 5880 master.cpp:959] Recovering from registrar I0715 18:16:03.251225 5880 registrar.cpp:313] Recovering registrar I0715 18:16:03.254640 5875 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 10.421369ms I0715 18:16:03.254683 5875 replica.cpp:320] Persisted replica status to STARTING I0715 18:16:03.254770 5875 recover.cpp:451] Replica is in STARTING status I0715 18:16:03.255097 5875 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0715 18:16:03.255166 5875 recover.cpp:188] Received a recover response from a replica in STARTING status I0715 18:16:03.255280 5875 recover.cpp:542] Updating replica status to VOTING I0715 18:16:03.263897 5875 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 8.581313ms I0715 18:16:03.263944 5875 replica.cpp:320] Persisted replica status to VOTING I0715 18:16:03.264010 5875 recover.cpp:556] Successfully joined the Paxos group I0715 18:16:03.264085 5875 recover.cpp:440] Recover process terminated I0715 18:16:03.264227 5875 log.cpp:656] Attempting to start the writer I0715 18:16:03.264570 5875 replica.cpp:474] Replica received implicit promise request with proposal 1 I0715 18:16:03.322881 5875 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 58.31469ms I0715 18:16:03.323349 5875 replica.cpp:342] Persisted promised to 1 I0715 18:16:03.328495 5876 coordinator.cpp:230] Coordinator attemping to fill missing position I0715 18:16:03.328910 5876 replica.cpp:375] 
Replica received explicit promise request for position 0 with proposal 2 I0715 18:16:03.338655 5876 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 9.73834ms I0715 18:16:03.338693 5876 replica.cpp:676] Persisted action at 0 I0715 18:16:03.338964 5876 replica.cpp:508] Replica received write request for position 0 I0715 18:16:03.338997 5876 leveldb.cpp:438] Reading position from leveldb took 21691ns I0715 18:16:03.349257 5876 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 10.25515ms I0715 18:16:03.349551 5876 replica.cpp:676] Persisted action at 0 I0715 18:16:03.354379 5877 replica.cpp:655] Replica received learned notice
[jira] [Resolved] (MESOS-1460) SlaveTest.TerminatingSlaveDoesNotRegister is flaky
[ https://issues.apache.org/jira/browse/MESOS-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler resolved MESOS-1460. Resolution: Fixed {noformat} commit ebee9afee7f6e5f04a5f259642c12eb0b99c35e0 Author: Yifan Gu yi...@mesosphere.io Date: Thu Jun 12 12:24:46 2014 -0700 Fixed a flaky test: SlaveTest.TerminatingSlaveDoesNotReregister. Review: https://reviews.apache.org/r/22472 {noformat} SlaveTest.TerminatingSlaveDoesNotRegister is flaky -- Key: MESOS-1460 URL: https://issues.apache.org/jira/browse/MESOS-1460 Project: Mesos Issue Type: Bug Reporter: Dominic Hamon Assignee: Yifan Gu [ RUN ] SlaveTest.TerminatingSlaveDoesNotReregister Using temporary directory '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_U2FkN5' I0605 11:04:21.890828 32082 leveldb.cpp:176] Opened db in 49.661187ms I0605 11:04:21.908869 32082 leveldb.cpp:183] Compacted db in 17.671793ms I0605 11:04:21.909230 32082 leveldb.cpp:198] Created db iterator in 26848ns I0605 11:04:21.909484 32082 leveldb.cpp:204] Seeked to beginning of db in 1705ns I0605 11:04:21.909740 32082 leveldb.cpp:273] Iterated through 0 keys in the db in 815ns I0605 11:04:21.910032 32082 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0605 11:04:21.910549 32105 recover.cpp:425] Starting replica recovery I0605 11:04:21.910626 32105 recover.cpp:451] Replica is in EMPTY status I0605 11:04:21.910951 32105 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0605 11:04:21.911013 32105 recover.cpp:188] Received a recover response from a replica in EMPTY status I0605 11:04:21.93 32105 recover.cpp:542] Updating replica status to STARTING I0605 11:04:21.914664 32109 master.cpp:272] Master 20140605-110421-16842879-56385-32082 (precise) started on 127.0.1.1:56385 I0605 11:04:21.914690 32109 master.cpp:309] Master only allowing authenticated frameworks to register I0605 11:04:21.914695 32109 master.cpp:314] Master only allowing 
authenticated slaves to register I0605 11:04:21.914702 32109 credentials.hpp:35] Loading credentials for authentication from '/tmp/SlaveTest_TerminatingSlaveDoesNotReregister_U2FkN5/credentials' I0605 11:04:21.914765 32109 master.cpp:340] Master enabling authorization I0605 11:04:21.915194 32109 hierarchical_allocator_process.hpp:301] Initializing hierarchical allocator process with master : master@127.0.1.1:56385 I0605 11:04:21.915230 32109 master.cpp:108] No whitelist given. Advertising offers for all slaves I0605 11:04:21.915393 32109 master.cpp:957] The newly elected leader is master@127.0.1.1:56385 with id 20140605-110421-16842879-56385-32082 I0605 11:04:21.915405 32109 master.cpp:970] Elected as the leading master! I0605 11:04:21.915410 32109 master.cpp:788] Recovering from registrar I0605 11:04:21.915458 32109 registrar.cpp:313] Recovering registrar I0605 11:04:21.931046 32105 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 19.869329ms I0605 11:04:21.931084 32105 replica.cpp:320] Persisted replica status to STARTING I0605 11:04:21.931169 32105 recover.cpp:451] Replica is in STARTING status I0605 11:04:21.931500 32105 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0605 11:04:21.931560 32105 recover.cpp:188] Received a recover response from a replica in STARTING status I0605 11:04:21.931656 32105 recover.cpp:542] Updating replica status to VOTING I0605 11:04:21.945734 32105 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 14.013731ms I0605 11:04:21.945791 32105 replica.cpp:320] Persisted replica status to VOTING I0605 11:04:21.945868 32105 recover.cpp:556] Successfully joined the Paxos group I0605 11:04:21.945930 32105 recover.cpp:440] Recover process terminated I0605 11:04:21.946071 32105 log.cpp:656] Attempting to start the writer I0605 11:04:21.946374 32105 replica.cpp:474] Replica received implicit promise request with proposal 1 I0605 11:04:21.960847 32105 leveldb.cpp:306] Persisting 
metadata (8 bytes) to leveldb took 14.444258ms I0605 11:04:21.961493 32105 replica.cpp:342] Persisted promised to 1 I0605 11:04:21.965292 32107 coordinator.cpp:230] Coordinator attemping to fill missing position I0605 11:04:21.965626 32107 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0605 11:04:21.982533 32107 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 16.8754ms I0605 11:04:21.982589 32107 replica.cpp:676] Persisted action at 0 I0605 11:04:21.982921 32107 replica.cpp:508] Replica received write request for position 0 I0605 11:04:21.982952 32107 leveldb.cpp:438] Reading position from leveldb took 16276ns I0605 11:04:21.999135 32107
[jira] [Created] (MESOS-1603) SlaveTest.TerminatingSlaveDoesNotReregister is flaky.
Benjamin Mahler created MESOS-1603: -- Summary: SlaveTest.TerminatingSlaveDoesNotReregister is flaky. Key: MESOS-1603 URL: https://issues.apache.org/jira/browse/MESOS-1603 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1525) Don't require slave id for reconciliation requests.
[ https://issues.apache.org/jira/browse/MESOS-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063057#comment-14063057 ] Benjamin Mahler commented on MESOS-1525: Review: https://reviews.apache.org/r/23542/ Don't require slave id for reconciliation requests. --- Key: MESOS-1525 URL: https://issues.apache.org/jira/browse/MESOS-1525 Project: Mesos Issue Type: Improvement Affects Versions: 0.19.0 Reporter: Benjamin Mahler Assignee: Benjamin Mahler Reconciliation requests currently specify a list of TaskStatuses. SlaveID is optional inside TaskStatus, but reconciliation requests are dropped when the SlaveID is not specified. We can answer reconciliation requests for a task so long as there are no transient slaves; this is what we should do when the slave id is not specified. There's an open question around whether we want the Reconcile Event to specify TaskID/SlaveID instead of TaskStatus, but I'll save that for later. -- This message was sent by Atlassian JIRA (v6.2#6252)
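The proposed rule can be sketched as a small decision function. This is a hypothetical Python sketch of the policy described above, not the master's actual API; `reconcile`, `tasks`, and `transitioning_slaves` are invented names:

```python
def reconcile(task_id, tasks, transitioning_slaves):
    """Answer a reconciliation request that carries no SlaveID.

    tasks: mapping of known task id -> latest task state.
    transitioning_slaves: slaves in a transient state (e.g. still
    re-registering after a master failover) that could yet report tasks.
    """
    if task_id in tasks:
        return tasks[task_id]    # reply with the latest known state
    if transitioning_slaves:
        return None              # a transient slave may still report
                                 # this task, so stay silent for now
    return "TASK_LOST"           # safe to declare the unknown task lost

print(reconcile("t1", {"t1": "TASK_RUNNING"}, set()))
print(reconcile("t2", {}, {"slave-1"}))
print(reconcile("t2", {}, set()))
```

The key point is the middle branch: without a SlaveID, the master can only answer authoritatively once no slave is in a transient state.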
[jira] [Commented] (MESOS-1603) SlaveTest.TerminatingSlaveDoesNotReregister is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063058#comment-14063058 ] Benjamin Mahler commented on MESOS-1603: Review: https://reviews.apache.org/r/23543/ SlaveTest.TerminatingSlaveDoesNotReregister is flaky. - Key: MESOS-1603 URL: https://issues.apache.org/jira/browse/MESOS-1603 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1620) Reconciliation does not send back tasks pending validation / authorization.
Benjamin Mahler created MESOS-1620: -- Summary: Reconciliation does not send back tasks pending validation / authorization. Key: MESOS-1620 URL: https://issues.apache.org/jira/browse/MESOS-1620 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Benjamin Mahler Per Vinod's feedback on https://reviews.apache.org/r/23542/, we do not send back TASK_STAGING for those tasks that are pending in the Master (validation / authorization still in progress). For both implicit and explicit task reconciliation, the master could reply with TASK_STAGING for these tasks, as this provides additional information to the framework. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MESOS-1529) Handle a network partition between Master and Slave
[ https://issues.apache.org/jira/browse/MESOS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072537#comment-14072537 ] Benjamin Mahler edited comment on MESOS-1529 at 7/24/14 2:55 AM: - For now we will proceed by adding a ping timeout on the slave to ensure that the slave re-registers when the master is no longer pinging it. This will resolve the case that motivated this ticket: https://reviews.apache.org/r/23874/ https://reviews.apache.org/r/23875/ https://reviews.apache.org/r/23866/ https://reviews.apache.org/r/23867/ https://reviews.apache.org/r/23868/ I decided to punt on the failover timeout in the master in the first pass because it can be dangerous when ZooKeeper issues are preventing the slave from re-registering with the master; we do not want to remove a ton of slaves in this situation. Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master.
Handle a network partition between Master and Slave --- Key: MESOS-1529 URL: https://issues.apache.org/jira/browse/MESOS-1529 Project: Mesos Issue Type: Bug Reporter: Dominic Hamon Assignee: Benjamin Mahler If a network partition occurs between a Master and Slave, the Master will remove the Slave (as it fails health checks) and mark the tasks being run there as LOST. However, the Slave is not aware that it has been removed, so the tasks will continue to run. (To clarify: neither the master nor the slave receives an 'exited' event, indicating that the connection between the master and slave is not closed.) There are at least two possible approaches to solving this issue: 1. Introduce a health check from Slave to Master so they have a consistent view of a network partition. We may still see this issue should a one-way connection error occur. 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the Slave reappears and reconcile then. We'd still need to mark Slaves and tasks as potentially lost (zombie state) but maybe the Scheduler can make a more intelligent decision. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1635) zk flag fails when specifying a file and the replicated logs
[ https://issues.apache.org/jira/browse/MESOS-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073707#comment-14073707 ] Benjamin Mahler commented on MESOS-1635: Looks like there was a TODO left for this: https://github.com/apache/mesos/blob/0.19.1/src/master/main.cpp#L197 I think we should improve URL::parse per the TODO and update these: https://github.com/apache/mesos/blob/0.19.1/src/master/detector.cpp#L107 https://github.com/apache/mesos/blob/0.19.1/src/master/contender.cpp#L73 Should be a simple fix; I'd be happy to shepherd this if someone wants to pick it up! zk flag fails when specifying a file and the replicated logs - Key: MESOS-1635 URL: https://issues.apache.org/jira/browse/MESOS-1635 Project: Mesos Issue Type: Bug Components: cli Affects Versions: 0.19.1 Environment: Linux ubuntu 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Reporter: Ken Sipe The zk flag supports referencing a file. It works when the registry is in_memory; however, in a real environment it fails. The following starts up just fine: /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --registry=in_memory However, when the following is executed it fails: /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --quorum=1 --work_dir=/tmp/mesos It uses the same format for the zk flag, but now we are using the replicated log. It fails with: I0723 19:24:34.755506 39856 main.cpp:150] Build: 2014-07-18 18:50:58 by root I0723 19:24:34.755580 39856 main.cpp:152] Version: 0.19.1 I0723 19:24:34.755591 39856 main.cpp:155] Git tag: 0.19.1 I0723 19:24:34.755601 39856 main.cpp:159] Git SHA: dc0b7bf2a1a7981079b33a16b689892f9cda0d8d Error parsing ZooKeeper URL: Expecting 'zk://' at the beginning of the URL -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1406) Master stats.json using boolean instead of integral value for 'elected'.
[ https://issues.apache.org/jira/browse/MESOS-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1406: --- Description: All stats.json values should be numeric, but it looks like a regression was introduced here: {noformat} commit dee9bd96e88053ab96c84253578ed332d343fe41 Author: Charlie Carson charliecar...@gmail.com Date: Thu Feb 20 16:24:09 2014 -0800 Add JSON::Boolean to stout/json.hpp. If you assign a JSON::Object a bool then it will get coerced into a JSON::Number w/ value of 0.0 or 1.0. This is because JSON::True and JSON::False do not have constructors from bool. The fix is to introduce a common super class, JSON::Boolean, which both JSON::True and JSON::False inherit from. JSON::Boolean has the necessary constructor which takes a bool. However, this leads to ambiguity when assigning a cstring to a JSON::Value, since JSON::String already takes a const char * and a const char * is implicitly convertible to a bool. The solution for that is to rename the variant from JSON::Value to JSON::inner::Variant and to create a new class JSON::Value which inherits from JSON::inner::Variant. The new JSON::Value can have all the conversion constructors in a single place, so there is no ambiguity, and delegate everything else to the Variant. Also added a bunch of unit tests. SEE: https://issues.apache.org/jira/browse/MESOS-939 Review: https://reviews.apache.org/r/17520 {noformat} This caused all JSON values constructed from booleans to implicitly change from 0/1 to true/false. was: All stats.json values should be numeric, but it looks like a regression was introduced here: {noformat} commit f9d1dd819b6cc3843e4d1287ac10276d62cbfed4 Author: Vinod Kone vi...@twitter.com Date: Tue Nov 19 10:39:27 2013 -0800 Replaced usage of old detector with new Master contender and detector abstractions. From: Jiang Yan Xu y...@jxu.me Review: https://reviews.apache.org/r/15510 {noformat} Which appears to have been included since 0.16.0. 
Old stats.json: {code} { ... elected: 0, ... } {code} New stats.json: {code} { ... elected: false, ... } {code} Affects Version/s: (was: 0.18.0) (was: 0.17.0) (was: 0.16.0) Fix Version/s: (was: 0.19.0) Master stats.json using boolean instead of integral value for 'elected'. Key: MESOS-1406 URL: https://issues.apache.org/jira/browse/MESOS-1406 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Benjamin Mahler All stats.json values should be numeric, but it looks like a regression was introduced here: {noformat} commit dee9bd96e88053ab96c84253578ed332d343fe41 Author: Charlie Carson charliecar...@gmail.com Date: Thu Feb 20 16:24:09 2014 -0800 Add JSON::Boolean to stout/json.hpp. If you assign a JSON::Object a bool then it will get coerced into a JSON::Number w/ value of 0.0 or 1.0. This is because JSON::True and JSON::False do not have constructors from bool. The fix is to introduce a common super class, JSON::Boolean, which both JSON::True and JSON::False inherit from. JSON::Boolean has the necessary constructor which takes a bool. However, this leads to ambiguity when assigning a cstring to a JSON::Value, since JSON::String already takes a const char * and a const char * is implicitly convertible to a bool. The solution for that is to rename the variant from JSON::Value to JSON::inner::Variant and to create a new class JSON::Value which inherits from JSON::inner::Variant. The new JSON::Value can have all the conversion constructors in a single place, so there is no ambiguity, and delegate everything else to the Variant. Also added a bunch of unit tests. SEE: https://issues.apache.org/jira/browse/MESOS-939 Review: https://reviews.apache.org/r/17520 {noformat} This caused all JSON values constructed from booleans to implicitly change from 0/1 to true/false. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (MESOS-1219) Master should disallow frameworks that reconnect after failover timeout.
[ https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reopened MESOS-1219: This was not committed after all in light of: MESOS-1630 Master should disallow frameworks that reconnect after failover timeout. Key: MESOS-1219 URL: https://issues.apache.org/jira/browse/MESOS-1219 Project: Mesos Issue Type: Bug Components: master, webui Reporter: Robert Lacroix Assignee: Vinod Kone Fix For: 0.20.0 When a scheduler reconnects after the failover timeout has exceeded, the framework id is usually reused because the scheduler doesn't know that the timeout exceeded and it is actually handled as a new framework. The /framework/:framework_id route of the Web UI doesn't handle those cases very well because its key is reused. It only shows the terminated one. Would it make sense to ignore the provided framework id when a scheduler reconnects to a terminated framework and generate a new id to make sure it's unique? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1635) zk flag fails when specifying a file and the replicated logs
[ https://issues.apache.org/jira/browse/MESOS-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1635: --- Shepherd: Benjamin Mahler zk flag fails when specifying a file and the replicated logs - Key: MESOS-1635 URL: https://issues.apache.org/jira/browse/MESOS-1635 Project: Mesos Issue Type: Bug Components: cli Affects Versions: 0.19.1 Environment: Linux ubuntu 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Reporter: Ken Sipe The zk flag supports referencing a file. It works when the registry is in_memory; however, in a real environment it fails. The following starts up just fine: /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --registry=in_memory However, when the following is executed it fails: /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --quorum=1 --work_dir=/tmp/mesos It uses the same format for the zk flag, but now we are using the replicated log. It fails with: I0723 19:24:34.755506 39856 main.cpp:150] Build: 2014-07-18 18:50:58 by root I0723 19:24:34.755580 39856 main.cpp:152] Version: 0.19.1 I0723 19:24:34.755591 39856 main.cpp:155] Git tag: 0.19.1 I0723 19:24:34.755601 39856 main.cpp:159] Git SHA: dc0b7bf2a1a7981079b33a16b689892f9cda0d8d Error parsing ZooKeeper URL: Expecting 'zk://' at the beginning of the URL -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MESOS-1635) zk flag fails when specifying a file and the replicated logs
[ https://issues.apache.org/jira/browse/MESOS-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler resolved MESOS-1635. Resolution: Fixed Fix Version/s: 0.20.0 {noformat} commit cd61a228ecb3bf0d40fe3658a9ec58f645a9ecd2 Author: Ken Sipe kens...@gmail.com Date: Tue Jul 29 12:24:38 2014 -0700 Fixed the master to accept a file:// based zk flag. Review: https://reviews.apache.org/r/23997 {noformat} zk flag fails when specifying a file and the replicated logs Key: MESOS-1635 URL: https://issues.apache.org/jira/browse/MESOS-1635 Project: Mesos Issue Type: Bug Components: cli Affects Versions: 0.19.1 Environment: Linux ubuntu 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Reporter: Ken Sipe Fix For: 0.20.0 The zk flag supports referencing a file. It works when the registry is in_memory; however, in a real environment it fails. The following starts up just fine: /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --registry=in_memory However, when the following is executed it fails: /usr/local/sbin/mesos-master --zk=file:///etc/mesos/zk --quorum=1 --work_dir=/tmp/mesos It uses the same format for the zk flag, but now we are using the replicated log. It fails with: I0723 19:24:34.755506 39856 main.cpp:150] Build: 2014-07-18 18:50:58 by root I0723 19:24:34.755580 39856 main.cpp:152] Version: 0.19.1 I0723 19:24:34.755591 39856 main.cpp:155] Git tag: 0.19.1 I0723 19:24:34.755601 39856 main.cpp:159] Git SHA: dc0b7bf2a1a7981079b33a16b689892f9cda0d8d Error parsing ZooKeeper URL: Expecting 'zk://' at the beginning of the URL -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1619) OsTest.User test is flaky
[ https://issues.apache.org/jira/browse/MESOS-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078585#comment-14078585 ] Benjamin Mahler commented on MESOS-1619: This thread (http://www.redhat.com/archives/libvir-list/2012-December/msg00208.html) seems relevant: {quote} virGetUserIDByName returns an error when the return value of getpwnam_r is non-0. However on my RHEL system, getpwnam_r returns ENOENT when the requested user cannot be found, which then causes virGetUserID not to behave as documented (it returns an error instead of falling back to parsing the passed-in value as an uid). {quote} Let's fix up os.hpp based on this knowledge? OsTest.User test is flaky - Key: MESOS-1619 URL: https://issues.apache.org/jira/browse/MESOS-1619 Project: Mesos Issue Type: Bug Components: test Environment: centos7 w/ gcc Reporter: Vinod Kone [ RUN ] OsTest.user stout/tests/os_tests.cpp:720: Failure Value of: os::getuid(UUID::random().toString()).isNone() Actual: false Expected: true stout/tests/os_tests.cpp:721: Failure Value of: os::getgid(UUID::random().toString()).isNone() Actual: false Expected: true WARNING: Logging before InitGoogleLogging() is written to STDERR E0721 06:15:58.656255 13440 os.hpp:731] Failed to set gid: Failed to get username information: No such file or directory [ FAILED ] OsTest.user (20 ms) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.
Benjamin Mahler created MESOS-1653: -- Summary: HealthCheckTest.GracePeriod is flaky. Key: MESOS-1653 URL: https://issues.apache.org/jira/browse/MESOS-1653 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler {noformat} [--] 3 tests from HealthCheckTest [ RUN ] HealthCheckTest.GracePeriod Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr' I0729 17:10:10.484951 1176 leveldb.cpp:176] Opened db in 28.883552ms I0729 17:10:10.499487 1176 leveldb.cpp:183] Compacted db in 13.674118ms I0729 17:10:10.500200 1176 leveldb.cpp:198] Created db iterator in 7394ns I0729 17:10:10.500692 1176 leveldb.cpp:204] Seeked to beginning of db in 2317ns I0729 17:10:10.501113 1176 leveldb.cpp:273] Iterated through 0 keys in the db in 1367ns I0729 17:10:10.501535 1176 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0729 17:10:10.502233 1212 recover.cpp:425] Starting replica recovery I0729 17:10:10.502295 1212 recover.cpp:451] Replica is in EMPTY status I0729 17:10:10.502825 1212 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0729 17:10:10.502877 1212 recover.cpp:188] Received a recover response from a replica in EMPTY status I0729 17:10:10.502980 1212 recover.cpp:542] Updating replica status to STARTING I0729 17:10:10.508482 1213 master.cpp:289] Master 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701 I0729 17:10:10.508607 1213 master.cpp:326] Master only allowing authenticated frameworks to register I0729 17:10:10.508632 1213 master.cpp:331] Master only allowing authenticated slaves to register I0729 17:10:10.508656 1213 credentials.hpp:36] Loading credentials for authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials' I0729 17:10:10.509407 1213 master.cpp:360] Authorization enabled I0729 17:10:10.510030 1207 hierarchical_allocator_process.hpp:301] Initializing hierarchical allocator process with master : master@127.0.1.1:54701 I0729 
17:10:10.510113 1207 master.cpp:123] No whitelist given. Advertising offers for all slaves I0729 17:10:10.511699 1213 master.cpp:1129] The newly elected leader is master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176 I0729 17:10:10.512230 1213 master.cpp:1142] Elected as the leading master! I0729 17:10:10.512692 1213 master.cpp:960] Recovering from registrar I0729 17:10:10.513226 1210 registrar.cpp:313] Recovering registrar I0729 17:10:10.516006 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 12.946461ms I0729 17:10:10.516047 1212 replica.cpp:320] Persisted replica status to STARTING I0729 17:10:10.516129 1212 recover.cpp:451] Replica is in STARTING status I0729 17:10:10.516520 1212 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0729 17:10:10.516592 1212 recover.cpp:188] Received a recover response from a replica in STARTING status I0729 17:10:10.516767 1212 recover.cpp:542] Updating replica status to VOTING I0729 17:10:10.528376 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 11.537102ms I0729 17:10:10.528430 1212 replica.cpp:320] Persisted replica status to VOTING I0729 17:10:10.528501 1212 recover.cpp:556] Successfully joined the Paxos group I0729 17:10:10.528565 1212 recover.cpp:440] Recover process terminated I0729 17:10:10.528700 1212 log.cpp:656] Attempting to start the writer I0729 17:10:10.528960 1212 replica.cpp:474] Replica received implicit promise request with proposal 1 I0729 17:10:10.537821 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 8.830863ms I0729 17:10:10.537869 1212 replica.cpp:342] Persisted promised to 1 I0729 17:10:10.540550 1209 coordinator.cpp:230] Coordinator attemping to fill missing position I0729 17:10:10.540856 1209 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0729 17:10:10.547430 1209 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 6.548344ms I0729 17:10:10.547471 
1209 replica.cpp:676] Persisted action at 0 I0729 17:10:10.547732 1209 replica.cpp:508] Replica received write request for position 0 I0729 17:10:10.547765 1209 leveldb.cpp:438] Reading position from leveldb took 15676ns I0729 17:10:10.557169 1209 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 9.373798ms I0729 17:10:10.557241 1209 replica.cpp:676] Persisted action at 0 I0729 17:10:10.560642 1210 replica.cpp:655] Replica received learned notice for position 0 I0729 17:10:10.570312 1210 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 9.61503ms I0729 17:10:10.570380 1210 replica.cpp:676] Persisted action at 0 I0729 17:10:10.570406 1210 replica.cpp:661] Replica learned NOP action at position 0 I0729 17:10:10.570746 1210 log.cpp:672] Writer started
[jira] [Created] (MESOS-1668) Handle a temporary one-way master → slave socket closure.
Benjamin Mahler created MESOS-1668: -- Summary: Handle a temporary one-way master -- slave socket closure. Key: MESOS-1668 URL: https://issues.apache.org/jira/browse/MESOS-1668 Project: Mesos Issue Type: Bug Components: master, slave Reporter: Benjamin Mahler Priority: Minor In MESOS-1529, we realized that it's possible for a slave to remain disconnected in the master if the following occurs: → Master and Slave connected operating normally. → Temporary one-way network failure, master→slave link breaks. → Master marks slave as disconnected. → Network restored and health checking continues normally, slave is not removed as a result. Slave does not attempt to re-register since it is receiving pings once again. → Slave remains disconnected according to the master, and the slave does not try to re-register. Bad! We were originally thinking of using a failover timeout in the master to remove these slaves that don't re-register. However, it can be dangerous when ZooKeeper issues are preventing the slave from re-registering with the master; we do not want to remove a ton of slaves in this situation. Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1470) Add cluster maintenance documentation.
[ https://issues.apache.org/jira/browse/MESOS-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1470: --- Target Version/s: (was: 0.20.0) Add cluster maintenance documentation. -- Key: MESOS-1470 URL: https://issues.apache.org/jira/browse/MESOS-1470 Project: Mesos Issue Type: Documentation Components: documentation Affects Versions: 0.19.0 Reporter: Benjamin Mahler Now that the master has replicated state on disk, we should add documentation that guides operators through common maintenance work: * Swapping a master in the ensemble. * Growing the master ensemble. * Shrinking the master ensemble. This would help craft similar documentation for users of the replicated log. We should also add documentation for slave maintenance: * Best practices for rolling upgrades. * How to shut down a slave. This latter category will be incorporated with [~alexandra.sava]'s maintenance work! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1461) Add task reconciliation to the Python API.
[ https://issues.apache.org/jira/browse/MESOS-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1461: --- Target Version/s: 0.21.0 (was: 0.20.0) Add task reconciliation to the Python API. -- Key: MESOS-1461 URL: https://issues.apache.org/jira/browse/MESOS-1461 Project: Mesos Issue Type: Task Components: python api Affects Versions: 0.19.0 Reporter: Benjamin Mahler Looks like the 'reconcileTasks' call was added to the C++ and Java APIs but was never added to the Python API. This may be obviated by the lower level API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1517) Maintain a queue of messages that arrive before the master recovers.
[ https://issues.apache.org/jira/browse/MESOS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1517: --- Target Version/s: (was: 0.20.0) Maintain a queue of messages that arrive before the master recovers. Key: MESOS-1517 URL: https://issues.apache.org/jira/browse/MESOS-1517 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Labels: reliability Currently when the master is recovering, we drop all incoming messages. If slaves and frameworks knew about the leading master only once it has recovered, then we would only expect to see messages after we've recovered. We previously considered enqueuing all messages through the recovery future, but this has the downside of forcing all messages to go through the master's queue twice: {code} // TODO(bmahler): Consider instead re-enqueing *all* messages // through recover(). What are the performance implications of // the additional queueing delay and the accumulated backlog // of messages post-recovery? if (!recovered.get().isReady()) { VLOG(1) << "Dropping '" << event.message->name << "' message since not recovered yet"; ++metrics.dropped_messages; return; } {code} However, an easy solution to this problem is to maintain an explicit queue of incoming messages that gets flushed once we finish recovery. This ensures that all messages post-recovery are processed normally. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler resolved MESOS-1653. Resolution: Fixed Fix Version/s: 0.20.0 Assignee: Timothy Chen Fix was included here: {noformat} commit 656b0e075c79e03cf6937bbe7302424768729aa2 Author: Timothy Chen tnac...@apache.org Date: Wed Aug 6 11:34:03 2014 -0700 Re-enabled HealthCheckTest.ConsecutiveFailures test. The test originally was flaky because the time to process the number of consecutive checks configured exceeds the task itself, so the task finished but the number of expected task health check didn't match. Review: https://reviews.apache.org/r/23772 {noformat} HealthCheckTest.GracePeriod is flaky. - Key: MESOS-1653 URL: https://issues.apache.org/jira/browse/MESOS-1653 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Timothy Chen Fix For: 0.20.0 {noformat} [--] 3 tests from HealthCheckTest [ RUN ] HealthCheckTest.GracePeriod Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr' I0729 17:10:10.484951 1176 leveldb.cpp:176] Opened db in 28.883552ms I0729 17:10:10.499487 1176 leveldb.cpp:183] Compacted db in 13.674118ms I0729 17:10:10.500200 1176 leveldb.cpp:198] Created db iterator in 7394ns I0729 17:10:10.500692 1176 leveldb.cpp:204] Seeked to beginning of db in 2317ns I0729 17:10:10.501113 1176 leveldb.cpp:273] Iterated through 0 keys in the db in 1367ns I0729 17:10:10.501535 1176 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0729 17:10:10.502233 1212 recover.cpp:425] Starting replica recovery I0729 17:10:10.502295 1212 recover.cpp:451] Replica is in EMPTY status I0729 17:10:10.502825 1212 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0729 17:10:10.502877 1212 recover.cpp:188] Received a recover response from a replica in EMPTY status I0729 17:10:10.502980 1212 recover.cpp:542] Updating replica status to STARTING I0729 
17:10:10.508482 1213 master.cpp:289] Master 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701 I0729 17:10:10.508607 1213 master.cpp:326] Master only allowing authenticated frameworks to register I0729 17:10:10.508632 1213 master.cpp:331] Master only allowing authenticated slaves to register I0729 17:10:10.508656 1213 credentials.hpp:36] Loading credentials for authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials' I0729 17:10:10.509407 1213 master.cpp:360] Authorization enabled I0729 17:10:10.510030 1207 hierarchical_allocator_process.hpp:301] Initializing hierarchical allocator process with master : master@127.0.1.1:54701 I0729 17:10:10.510113 1207 master.cpp:123] No whitelist given. Advertising offers for all slaves I0729 17:10:10.511699 1213 master.cpp:1129] The newly elected leader is master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176 I0729 17:10:10.512230 1213 master.cpp:1142] Elected as the leading master! I0729 17:10:10.512692 1213 master.cpp:960] Recovering from registrar I0729 17:10:10.513226 1210 registrar.cpp:313] Recovering registrar I0729 17:10:10.516006 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 12.946461ms I0729 17:10:10.516047 1212 replica.cpp:320] Persisted replica status to STARTING I0729 17:10:10.516129 1212 recover.cpp:451] Replica is in STARTING status I0729 17:10:10.516520 1212 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0729 17:10:10.516592 1212 recover.cpp:188] Received a recover response from a replica in STARTING status I0729 17:10:10.516767 1212 recover.cpp:542] Updating replica status to VOTING I0729 17:10:10.528376 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 11.537102ms I0729 17:10:10.528430 1212 replica.cpp:320] Persisted replica status to VOTING I0729 17:10:10.528501 1212 recover.cpp:556] Successfully joined the Paxos group I0729 17:10:10.528565 1212 recover.cpp:440] Recover process 
terminated I0729 17:10:10.528700 1212 log.cpp:656] Attempting to start the writer I0729 17:10:10.528960 1212 replica.cpp:474] Replica received implicit promise request with proposal 1 I0729 17:10:10.537821 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 8.830863ms I0729 17:10:10.537869 1212 replica.cpp:342] Persisted promised to 1 I0729 17:10:10.540550 1209 coordinator.cpp:230] Coordinator attemping to fill missing position I0729 17:10:10.540856 1209 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0729 17:10:10.547430 1209 leveldb.cpp:343] Persisting
[jira] [Reopened] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reopened MESOS-1613: Looks like it's still flaky: {noformat} Changes Summary Made ephemeral ports a resource and killed private resources. (details) Do not send ephemeral_ports resource to frameworks. (details) Create static mesos library. (details) Re-enabled HealthCheckTest.ConsecutiveFailures test. (details) Merge resourcesRecovered and resourcesUnused. (details) Added executor metrics for slave. (details) [ RUN ] HealthCheckTest.ConsecutiveFailures Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_fBrAEu' I0806 15:06:59.268267 9210 leveldb.cpp:176] Opened db in 29.926087ms I0806 15:06:59.275971 9210 leveldb.cpp:183] Compacted db in 7.40006ms I0806 15:06:59.276254 9210 leveldb.cpp:198] Created db iterator in 7678ns I0806 15:06:59.276741 9210 leveldb.cpp:204] Seeked to beginning of db in 2076ns I0806 15:06:59.277034 9210 leveldb.cpp:273] Iterated through 0 keys in the db in 1908ns I0806 15:06:59.277307 9210 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0806 15:06:59.277868 9233 recover.cpp:425] Starting replica recovery I0806 15:06:59.277946 9233 recover.cpp:451] Replica is in EMPTY status I0806 15:06:59.278240 9233 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0806 15:06:59.278296 9233 recover.cpp:188] Received a recover response from a replica in EMPTY status I0806 15:06:59.278391 9233 recover.cpp:542] Updating replica status to STARTING I0806 15:06:59.282282 9234 master.cpp:287] Master 20140806-150659-16842879-60888-9210 (lucid) started on 127.0.1.1:60888 I0806 15:06:59.282316 9234 master.cpp:324] Master only allowing authenticated frameworks to register I0806 15:06:59.282322 9234 master.cpp:329] Master only allowing authenticated slaves to register I0806 15:06:59.282330 9234 credentials.hpp:36] Loading credentials for 
authentication from '/tmp/HealthCheckTest_ConsecutiveFailures_fBrAEu/credentials' I0806 15:06:59.282508 9234 master.cpp:358] Authorization enabled I0806 15:06:59.283121 9234 hierarchical_allocator_process.hpp:296] Initializing hierarchical allocator process with master : master@127.0.1.1:60888 I0806 15:06:59.283174 9234 master.cpp:121] No whitelist given. Advertising offers for all slaves I0806 15:06:59.283413 9234 master.cpp:1127] The newly elected leader is master@127.0.1.1:60888 with id 20140806-150659-16842879-60888-9210 I0806 15:06:59.283429 9234 master.cpp:1140] Elected as the leading master! I0806 15:06:59.283435 9234 master.cpp:958] Recovering from registrar I0806 15:06:59.283491 9234 registrar.cpp:313] Recovering registrar I0806 15:06:59.284046 9233 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 5.600113ms I0806 15:06:59.284080 9233 replica.cpp:320] Persisted replica status to STARTING I0806 15:06:59.284226 9233 recover.cpp:451] Replica is in STARTING status I0806 15:06:59.284580 9233 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0806 15:06:59.284643 9233 recover.cpp:188] Received a recover response from a replica in STARTING status I0806 15:06:59.284747 9233 recover.cpp:542] Updating replica status to VOTING I0806 15:06:59.289934 9233 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 5.119357ms I0806 15:06:59.290256 9233 replica.cpp:320] Persisted replica status to VOTING I0806 15:06:59.290876 9237 recover.cpp:556] Successfully joined the Paxos group I0806 15:06:59.291131 9232 recover.cpp:440] Recover process terminated I0806 15:06:59.300732 9236 log.cpp:656] Attempting to start the writer I0806 15:06:59.301061 9236 replica.cpp:474] Replica received implicit promise request with proposal 1 I0806 15:06:59.306172 9236 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 5.070193ms I0806 15:06:59.306229 9236 replica.cpp:342] Persisted promised to 1 I0806 15:06:59.306747 9236 
coordinator.cpp:230] Coordinator attemping to fill missing position I0806 15:06:59.307143 9236 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0806 15:06:59.309715 9236 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 2.521311ms I0806 15:06:59.310199 9236 replica.cpp:676] Persisted action at 0 I0806 15:06:59.320276 9234 replica.cpp:508] Replica received write request for position 0 I0806 15:06:59.320335 9234 leveldb.cpp:438] Reading position from leveldb took 26656ns I0806 15:06:59.325726 9234 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 5.358479ms I0806 15:06:59.325781 9234 replica.cpp:676] Persisted action at 0 I0806 15:06:59.325999 9234 replica.cpp:655] Replica received learned notice for position 0 I0806 15:06:59.328487 9234 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 2.458504ms I0806
[jira] [Updated] (MESOS-1668) Handle a temporary one-way master -- slave socket closure.
[ https://issues.apache.org/jira/browse/MESOS-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1668: --- Placing this under reconciliation because, although extremely rare, it can lead to inconsistent state between the master and slave for an arbitrary amount of time. For example, the launchTask message may be dropped as a result of the master → slave socket closure in the scenario above. Handle a temporary one-way master -- slave socket closure. --- Key: MESOS-1668 URL: https://issues.apache.org/jira/browse/MESOS-1668 Project: Mesos Issue Type: Bug Components: master, slave Reporter: Benjamin Mahler Priority: Minor Labels: reliability In MESOS-1529, we realized that it's possible for a slave to remain disconnected in the master if the following occurs:
→ Master and slave are connected and operating normally.
→ Temporary one-way network failure; the master→slave link breaks.
→ Master marks the slave as disconnected.
→ Network is restored and health checking continues normally, so the slave is not removed as a result. The slave does not attempt to re-register since it is receiving pings once again.
→ Slave remains disconnected according to the master, and the slave does not try to re-register. Bad!
We were originally thinking of using a failover timeout in the master to remove these slaves that don't re-register. However, it can be dangerous when ZooKeeper issues are preventing the slave from re-registering with the master; we do not want to remove a ton of slaves in this situation. Rather, when the slave is health checking correctly but does not re-register within a timeout, we could send a registration request from the master to the slave, telling the slave that it must re-register. This message could also be used when receiving status updates (or other messages) from slaves that are disconnected in the master. -- This message was sent by Atlassian JIRA (v6.2#6252)
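The stuck state described in MESOS-1668, and the proposed re-registration request, can be sketched as a tiny state-machine simulation. This is illustrative Python, not Mesos code; all class and method names are hypothetical:

```python
class Master:
    """Minimal model of the master's view of one slave."""
    def __init__(self):
        self.slave_connected = True

    def socket_to_slave_breaks(self):
        # One-way failure: the master -> slave link drops, so the
        # master marks the slave as disconnected.
        self.slave_connected = False

    def receive_reregistration(self):
        self.slave_connected = True


class Slave:
    """Minimal model of the slave: it only re-registers when pings stop."""
    def __init__(self):
        self.receiving_pings = True

    def maybe_reregister(self, master):
        # The slave only re-registers if health-check pings have stopped.
        if not self.receiving_pings:
            master.receive_reregistration()


master, slave = Master(), Slave()

# Temporary one-way failure, then the network is restored.
master.socket_to_slave_breaks()
slave.receiving_pings = True  # pings resume, so the slave sees no problem

slave.maybe_reregister(master)
assert not master.slave_connected  # stuck in the bad state from the ticket

# Proposed fix: the master explicitly asks the slave to re-register,
# which the slave treats like a missed ping.
def master_requests_reregistration(master, slave):
    slave.receiving_pings = False
    slave.maybe_reregister(master)

master_requests_reregistration(master, slave)
assert master.slave_connected
```

The key point the sketch captures: because the slave keys re-registration off ping loss alone, a one-way failure needs an explicit nudge from the master to converge.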
[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14088425#comment-14088425 ] Benjamin Mahler commented on MESOS-1613: So far only Twitter CI is exposing this flakiness. I've pasted the full logs in the comment above, are you looking for logging from the command executor? We may want to investigate wiring up the tests to expose them in the output to make this easier for you to debug. HealthCheckTest.ConsecutiveFailures is flaky Key: MESOS-1613 URL: https://issues.apache.org/jira/browse/MESOS-1613 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.20.0 Environment: Ubuntu 10.04 GCC Reporter: Vinod Kone Assignee: Timothy Chen {code} [ RUN ] HealthCheckTest.ConsecutiveFailures Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV' I0717 04:39:59.288471 5009 leveldb.cpp:176] Opened db in 21.575631ms I0717 04:39:59.295274 5009 leveldb.cpp:183] Compacted db in 6.471982ms I0717 04:39:59.295552 5009 leveldb.cpp:198] Created db iterator in 16783ns I0717 04:39:59.296026 5009 leveldb.cpp:204] Seeked to beginning of db in 2125ns I0717 04:39:59.296257 5009 leveldb.cpp:273] Iterated through 0 keys in the db in 10747ns I0717 04:39:59.296584 5009 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0717 04:39:59.297322 5033 recover.cpp:425] Starting replica recovery I0717 04:39:59.297413 5033 recover.cpp:451] Replica is in EMPTY status I0717 04:39:59.297824 5033 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0717 04:39:59.297899 5033 recover.cpp:188] Received a recover response from a replica in EMPTY status I0717 04:39:59.297997 5033 recover.cpp:542] Updating replica status to STARTING I0717 04:39:59.301985 5031 master.cpp:288] Master 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280 I0717 04:39:59.302026 5031 master.cpp:325] Master only allowing authenticated 
frameworks to register I0717 04:39:59.302032 5031 master.cpp:330] Master only allowing authenticated slaves to register I0717 04:39:59.302039 5031 credentials.hpp:36] Loading credentials for authentication from '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials' I0717 04:39:59.302283 5031 master.cpp:359] Authorization enabled I0717 04:39:59.302971 5031 hierarchical_allocator_process.hpp:301] Initializing hierarchical allocator process with master : master@127.0.1.1:40280 I0717 04:39:59.303022 5031 master.cpp:122] No whitelist given. Advertising offers for all slaves I0717 04:39:59.303390 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 5.325097ms I0717 04:39:59.303419 5033 replica.cpp:320] Persisted replica status to STARTING I0717 04:39:59.304076 5030 master.cpp:1128] The newly elected leader is master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009 I0717 04:39:59.304095 5030 master.cpp:1141] Elected as the leading master! I0717 04:39:59.304102 5030 master.cpp:959] Recovering from registrar I0717 04:39:59.304182 5030 registrar.cpp:313] Recovering registrar I0717 04:39:59.304635 5033 recover.cpp:451] Replica is in STARTING status I0717 04:39:59.304962 5033 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0717 04:39:59.305026 5033 recover.cpp:188] Received a recover response from a replica in STARTING status I0717 04:39:59.305130 5033 recover.cpp:542] Updating replica status to VOTING I0717 04:39:59.310416 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 5.204157ms I0717 04:39:59.310459 5033 replica.cpp:320] Persisted replica status to VOTING I0717 04:39:59.310534 5033 recover.cpp:556] Successfully joined the Paxos group I0717 04:39:59.310607 5033 recover.cpp:440] Recover process terminated I0717 04:39:59.310773 5033 log.cpp:656] Attempting to start the writer I0717 04:39:59.311157 5033 replica.cpp:474] Replica received implicit promise request with proposal 1 I0717 
04:39:59.313451 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 2.271822ms I0717 04:39:59.313627 5033 replica.cpp:342] Persisted promised to 1 I0717 04:39:59.318038 5031 coordinator.cpp:230] Coordinator attemping to fill missing position I0717 04:39:59.318430 5031 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0717 04:39:59.323459 5031 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 5.004323ms I0717 04:39:59.323493 5031 replica.cpp:676] Persisted action at 0 I0717 04:39:59.323799 5031 replica.cpp:508] Replica received write request for position 0 I0717 04:39:59.323837 5031
[jira] [Updated] (MESOS-887) Scheduler Driver should use exited() to detect disconnection with Master.
[ https://issues.apache.org/jira/browse/MESOS-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-887: -- Labels: framework reliability (was: ) Scheduler Driver should use exited() to detect disconnection with Master. - Key: MESOS-887 URL: https://issues.apache.org/jira/browse/MESOS-887 Project: Mesos Issue Type: Improvement Affects Versions: 0.13.0, 0.14.0, 0.14.1, 0.14.2, 0.16.0, 0.15.0 Reporter: Benjamin Mahler Labels: framework, reliability The Scheduler Driver already links with the master, but it does not use the built in exited() notification from libprocess to detect socket closure. This would fast-track the delay from zookeeper detecting a leadership change, and would minimize the number of dropped messages leaving the driver. -- This message was sent by Atlassian JIRA (v6.2#6252)
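The advantage of an exited()-style notification in MESOS-887 is that a closed socket is observable locally and immediately, rather than after a ZooKeeper leadership-change delay. A minimal sketch of that signal, using a plain socket pair (illustrative; libprocess's exited() is the C++ mechanism, this only shows the underlying EOF semantics):

```python
import socket

# A connected pair stands in for the driver <-> master link.
driver_end, master_end = socket.socketpair()

# The master side goes away (process exit / socket closure).
master_end.close()

# The driver end observes EOF immediately: recv() returns b''.
# This is the kind of local signal exited() surfaces, versus waiting
# for ZooKeeper to announce a leadership change.
data = driver_end.recv(1024)
assert data == b''
driver_end.close()
```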
[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090065#comment-14090065 ] Benjamin Mahler commented on MESOS-1613: [~tnachen] it's failing on ASF CI as well, can you triage or disable in the interim? HealthCheckTest.ConsecutiveFailures is flaky Key: MESOS-1613 URL: https://issues.apache.org/jira/browse/MESOS-1613 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.20.0 Environment: Ubuntu 10.04 GCC Reporter: Vinod Kone Assignee: Timothy Chen
[jira] [Comment Edited] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090065#comment-14090065 ] Benjamin Mahler edited comment on MESOS-1613 at 8/7/14 11:56 PM: - [~tnachen] it's failing on ASF CI as well, can you triage or disable in the interim? E.g. https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2299/consoleFull was (Author: bmahler): [~tnachen] it's failing on ASF CI as well, can you triage or disable in the interim? HealthCheckTest.ConsecutiveFailures is flaky Key: MESOS-1613 URL: https://issues.apache.org/jira/browse/MESOS-1613 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.20.0 Environment: Ubuntu 10.04 GCC Reporter: Vinod Kone Assignee: Timothy Chen
[jira] [Commented] (MESOS-1620) Reconciliation does not send back tasks pending validation / authorization.
[ https://issues.apache.org/jira/browse/MESOS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093505#comment-14093505 ] Benjamin Mahler commented on MESOS-1620: Review chain for this one, did some cleanups along the way: https://reviews.apache.org/r/24582/ https://reviews.apache.org/r/24583/ https://reviews.apache.org/r/24576/ https://reviews.apache.org/r/24515/ https://reviews.apache.org/r/24516/ Reconciliation does not send back tasks pending validation / authorization. --- Key: MESOS-1620 URL: https://issues.apache.org/jira/browse/MESOS-1620 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler Assignee: Benjamin Mahler Per Vinod's feedback on https://reviews.apache.org/r/23542/, we do not send back TASK_STAGING for those tasks that are pending in the Master (validation / authorization still in progress). For both implicit and explicit task reconciliation, the master could reply with TASK_STAGING for these tasks, as this provides additional information to the framework. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1620) Reconciliation does not send back tasks pending validation / authorization.
[ https://issues.apache.org/jira/browse/MESOS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1620: --- Shepherd: Vinod Kone (was: Dominic Hamon) Reconciliation does not send back tasks pending validation / authorization. --- Key: MESOS-1620 URL: https://issues.apache.org/jira/browse/MESOS-1620 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler Assignee: Benjamin Mahler Per Vinod's feedback on https://reviews.apache.org/r/23542/, we do not send back TASK_STAGING for those tasks that are pending in the Master (validation / authorization still in progress). For both implicit and explicit task reconciliation, the master could reply with TASK_STAGING for these tasks, as this provides additional information to the framework. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1700) ThreadLocal does not release pthread keys or log properly.
[ https://issues.apache.org/jira/browse/MESOS-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1700: --- Sprint: Q3 Sprint 3 ThreadLocal does not release pthread keys or log properly. -- Key: MESOS-1700 URL: https://issues.apache.org/jira/browse/MESOS-1700 Project: Mesos Issue Type: Bug Components: stout Reporter: Benjamin Mahler Assignee: Benjamin Mahler The ThreadLocal<T> abstraction in stout does not release the allocated pthread keys upon destruction: https://github.com/apache/mesos/blob/0.19.1/3rdparty/libprocess/3rdparty/stout/include/stout/thread.hpp#L22 It also does not log the errors correctly. Fortunately this does not impact Mesos at the current time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1700) ThreadLocal does not release pthread keys or log properly.
[ https://issues.apache.org/jira/browse/MESOS-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096175#comment-14096175 ] Benjamin Mahler commented on MESOS-1700: https://reviews.apache.org/r/24669/ ThreadLocal does not release pthread keys or log properly. -- Key: MESOS-1700 URL: https://issues.apache.org/jira/browse/MESOS-1700 Project: Mesos Issue Type: Bug Components: stout Reporter: Benjamin Mahler Assignee: Benjamin Mahler The ThreadLocal<T> abstraction in stout does not release the allocated pthread keys upon destruction: https://github.com/apache/mesos/blob/0.19.1/3rdparty/libprocess/3rdparty/stout/include/stout/thread.hpp#L22 It also does not log the errors correctly. Fortunately this does not impact Mesos at the current time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1714) The C++ 'Resources' abstraction should keep the underlying resources flattened.
Benjamin Mahler created MESOS-1714: -- Summary: The C++ 'Resources' abstraction should keep the underlying resources flattened. Key: MESOS-1714 URL: https://issues.apache.org/jira/browse/MESOS-1714 Project: Mesos Issue Type: Bug Components: c++ api Reporter: Benjamin Mahler Currently, the C++ Resources class does not ensure that the underlying Resources protobufs are kept flat. This is an issue because some of the methods, e.g. [Resources::get|https://github.com/apache/mesos/blob/0.19.1/src/common/resources.cpp#L269], assume the resources are flat. There is code that constructs unflattened resources, e.g. [Slave::launchExecutor|https://github.com/apache/mesos/blob/0.19.1/src/slave/slave.cpp#L3353]. We could prevent this type of construction; however, it is perfectly fine if we ensure the C++ 'Resources' class performs flattening. -- This message was sent by Atlassian JIRA (v6.2#6252)
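What "keeping resources flat" means in MESOS-1714 can be modeled with a simplified representation: entries that share the same name and role are merged by summing their scalar values, so that a get()-style lookup sees the full amount. This is an illustrative Python model, not the protobuf-backed C++ Resources API:

```python
from collections import defaultdict

def flatten(resources):
    """Merge scalar resource entries that share (name, role), summing
    their values -- a simplified model of keeping a Resources
    collection flat."""
    totals = defaultdict(float)
    for name, role, value in resources:
        totals[(name, role)] += value
    return sorted((n, r, v) for (n, r), v in totals.items())

# Unflattened input: two separate 'cpus' entries for the same role.
unflat = [("cpus", "*", 1.0), ("mem", "*", 512.0), ("cpus", "*", 0.5)]

flat = flatten(unflat)
assert flat == [("cpus", "*", 1.5), ("mem", "*", 512.0)]

# A lookup that assumes flatness (like Resources::get) now sees the
# full amount instead of only the first matching entry.
cpus = {(n, r): v for n, r, v in flat}[("cpus", "*")]
assert cpus == 1.5
```

Without flattening, a flat-assuming lookup against `unflat` would report 1.0 cpus and silently under-count, which is exactly the hazard the ticket describes.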
[jira] [Created] (MESOS-1715) The slave does not send pending tasks during re-registration.
Benjamin Mahler created MESOS-1715: -- Summary: The slave does not send pending tasks during re-registration. Key: MESOS-1715 URL: https://issues.apache.org/jira/browse/MESOS-1715 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler In what looks like an oversight, the pending tasks in the slave (Framework::pending) are not sent in the re-registration message. This can lead to spurious TASK_LOST notifications being generated by the master when it falsely thinks the tasks are not present on the slave. -- This message was sent by Atlassian JIRA (v6.2#6252)
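The spurious TASK_LOST in MESOS-1715 follows from how the master reconciles a re-registration: tasks it knows about but does not see in the message are declared lost. A minimal sketch (illustrative Python; the reconcile model is a simplification of the master's logic):

```python
def reconcile(master_known_tasks, reregistration_tasks):
    """Model of the master reconciling a slave re-registration: any
    task the master knows about that is absent from the
    re-registration message is declared TASK_LOST."""
    return {t for t in master_known_tasks if t not in reregistration_tasks}

master_known = {"task-1", "task-2"}   # task-2 is still pending on the slave
launched, pending = {"task-1"}, {"task-2"}

# Bug: only launched tasks are included in the re-registration message,
# so the pending task is falsely declared lost.
assert reconcile(master_known, launched) == {"task-2"}

# Fix: include Framework::pending tasks as well.
assert reconcile(master_known, launched | pending) == set()
```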
[jira] [Created] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.
Benjamin Mahler created MESOS-1717: -- Summary: The slave does not show pending tasks in the JSON endpoints. Key: MESOS-1717 URL: https://issues.apache.org/jira/browse/MESOS-1717 Project: Mesos Issue Type: Bug Components: json api, slave Reporter: Benjamin Mahler The slave does not show pending tasks in the /state.json endpoint. This is a bit tricky to add since we rely on knowing the executor directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (MESOS-1718) Command executor can overcommit the slave.
Benjamin Mahler created MESOS-1718: -- Summary: Command executor can overcommit the slave. Key: MESOS-1718 URL: https://issues.apache.org/jira/browse/MESOS-1718 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler Currently we give a small amount of resources to the command executor, in addition to resources used by the command task: https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448
{code}
ExecutorInfo Slave::getExecutorInfo(
    const FrameworkID& frameworkId,
    const TaskInfo& task)
{
  ...
  // Add an allowance for the command executor. This does lead to a
  // small overcommit of resources.
  executor.mutable_resources()->MergeFrom(
      Resources::parse(
        "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" +
        "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get());
  ...
}
{code}
This leads to an overcommit of the slave. Ideally, for command tasks we can transfer all of the task resources to the executor at the slave / isolation level. -- This message was sent by Atlassian JIRA (v6.2#6252)
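The overcommit in MESOS-1718 is simple arithmetic: the command executor's allowance is added on top of the task's own resources, so a task sized to consume an entire offer pushes the slave past its advertised capacity. A sketch, assuming the defaults are 0.1 cpus and 32 MB (the DEFAULT_EXECUTOR_CPUS / DEFAULT_EXECUTOR_MEM values referenced above; treat the exact numbers as assumptions):

```python
# Assumed executor-allowance defaults (see the DEFAULT_EXECUTOR_CPUS /
# DEFAULT_EXECUTOR_MEM constants referenced in the ticket).
DEFAULT_EXECUTOR_CPUS = 0.1   # assumption
DEFAULT_EXECUTOR_MEM_MB = 32  # assumption

# A command task that consumes an entire 8-CPU / 16 GB offer:
task_cpus, task_mem_mb = 8.0, 16384

# The command executor's allowance is added on top of the task:
total_cpus = task_cpus + DEFAULT_EXECUTOR_CPUS
total_mem_mb = task_mem_mb + DEFAULT_EXECUTOR_MEM_MB

# The slave only has the offered capacity, so this overcommits it.
slave_cpus, slave_mem_mb = 8.0, 16384
assert total_cpus > slave_cpus and total_mem_mb > slave_mem_mb
```

The fix direction suggested in the ticket avoids the addition entirely: for command tasks, transfer the task's resources to the executor rather than granting a separate allowance.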
[jira] [Created] (MESOS-1720) Slave should send exited executor message when the executor is never launched.
Benjamin Mahler created MESOS-1720: -- Summary: Slave should send exited executor message when the executor is never launched. Key: MESOS-1720 URL: https://issues.apache.org/jira/browse/MESOS-1720 Project: Mesos Issue Type: Bug Components: master, slave Reporter: Benjamin Mahler When the slave sends TASK_LOST before launching an executor for a task, the slave does not send an exited executor message to the master. Since the master receives no exited executor message, it still thinks the executor's resources are consumed on the slave. One possible fix for this would be to send the exited executor message to the master in these cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources
[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101757#comment-14101757 ] Benjamin Mahler commented on MESOS-1466: We're going to proceed with a mitigation of this by rejecting tasks once the slave is overcommitted: https://issues.apache.org/jira/browse/MESOS-1721 However, we would also like to ensure that this kind of race is not possible. One solution is to use master acknowledgments for executor exits: (1) When an executor terminates (or the executor could not be launched: MESOS-1720), we send an exited executor message. (2) The master acknowledges these messages. (3) The slave will not accept tasks for unacknowledged terminal executors (this must include those executors that could not be launched, per MESOS-1720). The result of this is that a new executor cannot be launched until the master is aware of the old executor exiting. Race between executor exited event and launch task can cause overcommit of resources Key: MESOS-1466 URL: https://issues.apache.org/jira/browse/MESOS-1466 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Vinod Kone Assignee: Benjamin Mahler Labels: reliability The following sequence of events can cause an overcommit:
-- Launch task is called for a task whose executor is already running.
-- The executor's resources are not accounted for on the master.
-- The executor exits and the event is enqueued behind launch tasks on the master.
-- The master sends the task to the slave, which needs to commit resources for the task and the (new) executor.
-- The master processes the executor exited event and re-offers the executor's resources, causing an overcommit of resources.
-- This message was sent by Atlassian JIRA (v6.2#6252)
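The three-step acknowledgment scheme proposed in the MESOS-1466 comment can be sketched as a small model (illustrative Python; class and message names are hypothetical, not Mesos identifiers):

```python
class SlaveModel:
    """Models the proposed rule: refuse tasks while a terminal
    executor's exit has not been acknowledged by the master."""
    def __init__(self):
        self.unacknowledged_executors = set()

    def executor_terminated(self, executor_id):
        # (1) Send an exited executor message; until the ack arrives,
        # remember the executor as unacknowledged.
        self.unacknowledged_executors.add(executor_id)

    def acknowledge_exit(self, executor_id):
        # (2) The master acknowledged; it now knows the executor's
        # resources are released.
        self.unacknowledged_executors.discard(executor_id)

    def accept_task(self, executor_id):
        # (3) Reject tasks for unacknowledged terminal executors.
        return executor_id not in self.unacknowledged_executors


slave = SlaveModel()
slave.executor_terminated("exec-1")
assert not slave.accept_task("exec-1")  # would race with the exit event
slave.acknowledge_exit("exec-1")
assert slave.accept_task("exec-1")      # safe: master saw the exit first
```

The invariant the model enforces is exactly the one stated in the comment: a new executor cannot be launched until the master is aware of the old executor exiting.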
[jira] [Updated] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.
[ https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1715: --- Summary: The slave does not send pending tasks / executors during re-registration. (was: The slave does not send pending tasks during re-registration.) The slave does not send pending tasks / executors during re-registration. - Key: MESOS-1715 URL: https://issues.apache.org/jira/browse/MESOS-1715 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler Assignee: Benjamin Mahler In what looks like an oversight, the pending tasks in the slave (Framework::pending) are not sent in the re-registration message. This can lead to spurious TASK_LOST notifications being generated by the master when it falsely thinks the tasks are not present on the slave. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-1715) The slave does not send pending tasks / executors during re-registration.
[ https://issues.apache.org/jira/browse/MESOS-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1715: --- Description: In what looks like an oversight, the pending tasks and executors in the slave (Framework::pending) are not sent in the re-registration message. For tasks, this can lead to spurious TASK_LOST notifications being generated by the master when it falsely thinks the tasks are not present on the slave. For executors, this can lead to under-accounting in the master, causing an overcommit on the slave. was: In what looks like an oversight, the pending tasks in the slave (Framework::pending) are not sent in the re-registration message. This can lead to spurious TASK_LOST notifications being generated by the master when it falsely thinks the tasks are not present on the slave. The slave does not send pending tasks / executors during re-registration. - Key: MESOS-1715 URL: https://issues.apache.org/jira/browse/MESOS-1715 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler Assignee: Benjamin Mahler In what looks like an oversight, the pending tasks and executors in the slave (Framework::pending) are not sent in the re-registration message. For tasks, this can lead to spurious TASK_LOST notifications being generated by the master when it falsely thinks the tasks are not present on the slave. For executors, this can lead to under-accounting in the master, causing an overcommit on the slave. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1734) Reduce compile time replacing macro expansions with variadic templates
[ https://issues.apache.org/jira/browse/MESOS-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111420#comment-14111420 ] Benjamin Mahler commented on MESOS-1734: Hi [~preillyme], we can't yet assume C++11 support: https://issues.apache.org/jira/browse/MESOS-750 [~dhamon] would have a better idea of when we'll move to C++11 as a requirement. Reduce compile time replacing macro expansions with variadic templates -- Key: MESOS-1734 URL: https://issues.apache.org/jira/browse/MESOS-1734 Project: Mesos Issue Type: Improvement Reporter: Patrick Reilly Assignee: Patrick Reilly Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MESOS-1735) Better Startup Failure For Duplicate Master
[ https://issues.apache.org/jira/browse/MESOS-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118544#comment-14118544 ] Benjamin Mahler commented on MESOS-1735: We could use the EXIT approach from stout/exit.hpp here to avoid the abort / stacktrace and to include a helpful message. Better Startup Failure For Duplicate Master --- Key: MESOS-1735 URL: https://issues.apache.org/jira/browse/MESOS-1735 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.20.0 Environment: Ubuntu 12.04 Reporter: Ken Sipe The error message is cryptic when starting a mesos-master when a mesos-master is already running. The error message is: mesos-master --ip=192.168.74.174 --work_dir=~/mesos WARNING: Logging before InitGoogleLogging() is written to STDERR F0826 20:24:56.940961 3057 process.cpp:1632] Failed to initialize, bind: Address already in use [98] *** Check failure stack trace: *** Aborted (core dumped) This can be a new person's first experience with Mesos: it isn't clear to them that another master is already running, and they are left unsure what to do next. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
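What the suggested EXIT-style behavior might look like, sketched with a plain socket (illustrative Python; the actual fix would use stout's EXIT macro in C++, and `bind_or_exit` is a hypothetical helper): on EADDRINUSE, print one clear message and exit instead of aborting with a stack trace.

```python
import errno
import socket
import sys

def bind_or_exit(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
    except OSError as e:
        if e.errno == errno.EADDRINUSE:
            # EXIT-style behavior: a clear message, no stack trace or
            # core dump.
            sys.exit("Port %d is already in use: is another master "
                     "already running on this host?" % port)
        raise
    return s

first = bind_or_exit(0)            # bind an ephemeral port
port = first.getsockname()[1]
try:
    bind_or_exit(port)             # second bind must fail with the message
except SystemExit as e:
    assert "already in use" in str(e)
first.close()
```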
[jira] [Commented] (MESOS-1714) The C++ 'Resources' abstraction should keep the underlying resources flattened.
[ https://issues.apache.org/jira/browse/MESOS-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120138#comment-14120138 ] Benjamin Mahler commented on MESOS-1714: For now, this review avoids constructing an unflattened Resources object: https://reviews.apache.org/r/25306/ The C++ 'Resources' abstraction should keep the underlying resources flattened. --- Key: MESOS-1714 URL: https://issues.apache.org/jira/browse/MESOS-1714 Project: Mesos Issue Type: Bug Components: c++ api Reporter: Benjamin Mahler Currently, the C++ Resources class does not ensure that the underlying Resources protobufs are kept flat. This is an issue because some of the methods, e.g. [Resources::get|https://github.com/apache/mesos/blob/0.19.1/src/common/resources.cpp#L269], assume the resources are flat. There is code that constructs unflattened resources, e.g. [Slave::launchExecutor|https://github.com/apache/mesos/blob/0.19.1/src/slave/slave.cpp#L3353]. We could prevent this type of construction; however, it is perfectly fine if we ensure the C++ 'Resources' class performs flattening. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (MESOS-733) Speedup slave recovery tests
[ https://issues.apache.org/jira/browse/MESOS-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler closed MESOS-733. - Resolution: Incomplete Closing this in favor of an epic to track testing speedups. Speedup slave recovery tests Key: MESOS-733 URL: https://issues.apache.org/jira/browse/MESOS-733 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Labels: twitter Several of the tests are slow now that they do offer checking. I suspect this is due to the Clock semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1757) Speed up the tests.
Benjamin Mahler created MESOS-1757: -- Summary: Speed up the tests. Key: MESOS-1757 URL: https://issues.apache.org/jira/browse/MESOS-1757 Project: Mesos Issue Type: Epic Components: technical debt, test Reporter: Benjamin Mahler The full test suite is exceeding the 7 minute mark (440 seconds on my machine); this epic tracks techniques to improve this: # The reaper takes a full second to reap an exited process (MESOS-1199), which adds a second to each slave recovery test, and possibly more for anything that relies on Subprocess. # The command executor sleeps for a second when shutting down (MESOS-442), which adds a second to every test that uses the command executor. # Now that the master and the slave have to perform synced disk writes, consider using a ramdisk to speed up the disk writes. Additional options that hopefully will not be necessary: # Use automake's [parallel test harness|http://www.gnu.org/software/automake/manual/html_node/Parallel-Test-Harness.html] to compile tests separately and run tests in parallel. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1758) Freezer failure leads to lost task during container destruction.
Benjamin Mahler created MESOS-1758: -- Summary: Freezer failure leads to lost task during container destruction. Key: MESOS-1758 URL: https://issues.apache.org/jira/browse/MESOS-1758 Project: Mesos Issue Type: Bug Components: containerization Reporter: Benjamin Mahler In the past we've seen numerous issues around the freezer. Lately, on the 2.6.44 kernel, we've seen issues where we're unable to freeze the cgroup: (1) An oom occurs. (2) No indication of oom in the kernel logs. (3) The slave is unable to freeze the cgroup. (4) The task is marked as lost. {noformat} I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 15488MB Maximum Used: 15488MB MEMORY STATISTICS: cache 7958691840 rss 8281653248 mapped_file 9474048 pgpgin 4487861 pgpgout 522933 pgfault 2533780 pgmajfault 11 inactive_anon 0 active_anon 8281653248 inactive_file 7631708160 active_file 326852608 unevictable 0 hierarchical_memory_limit 16240345088 total_cache 7958691840 total_rss 8281653248 total_mapped_file 9474048 total_pgpgin 4487861 total_pgpgout 522933 total_pgfault 2533780 total_pgmajfault 11 total_inactive_anon 0 total_active_anon 8281653248 total_inactive_file 7631728640 total_active_file 326852608 total_unevictable 0 I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource mem(*):1.62403e+10 and will be terminated I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 'bbb9732a-d600-4c1b-b326-846338c608c3' I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.710848ms I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.588224ms I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 2.15296ms I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.643008ms I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. 
Max allowed age: 5.630238827780799days I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 1.511168ms I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for '/slave(1)/stats.json' E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of framework '201104070004-002563-' failed: Failed to destroy container: discarded future I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 201104070004-002563- from @0.0.0.0:0 I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:47:24.963541 25471 cpushare.cpp:338] Updated 'cpu.shares' to 256 (cpus 0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:47:24.964756 25471 cpushare.cpp:359] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 25ms (cpus 0.25) for container bbb9732a-d600-4c1b-b326-846338c608c3 I0903 16:47:43.406610 25476 status_update_manager.cpp:320] Received status update TASK_LOST (UUID:
[jira] [Resolved] (MESOS-186) Resource offers should be rescinded after some configurable timeout
[ https://issues.apache.org/jira/browse/MESOS-186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler resolved MESOS-186. --- Resolution: Fixed Fix Version/s: 0.21.0 {noformat} commit 707bf3b1d6f042ee92e7a291d3f74a20ae2d494b Author: Kapil Arya ka...@mesosphere.io Date: Fri Sep 5 11:15:15 2014 -0700 Added optional --offer_timeout to rescind unused offers. The ability to set an offer timeout helps prevent unfair resource allocations in the face of frameworks that hoard offers, or that accidentally drop offers. When optimistic offers are added, hoarding will not affect the fairness for other frameworks. Review: https://reviews.apache.org/r/22066 {noformat} Resource offers should be rescinded after some configurable timeout --- Key: MESOS-186 URL: https://issues.apache.org/jira/browse/MESOS-186 Project: Mesos Issue Type: Improvement Components: framework Reporter: Benjamin Hindman Assignee: Timothy Chen Fix For: 0.21.0 Problem: a framework has a bug and holds on to resource offers by accident for 24 hours. One suggestion: resource offers should be rescinded after some configurable timeout. Possible issue: this might interfere with frameworks that are hoarding. But one possible solution here is to add another API call which checks the status of resource offers (i.e., remindAboutOffer). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1476) Provide endpoints for deactivating / activating slaves.
[ https://issues.apache.org/jira/browse/MESOS-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1476: --- Sprint: (was: Mesos Q3 Sprint 5) Provide endpoints for deactivating / activating slaves. --- Key: MESOS-1476 URL: https://issues.apache.org/jira/browse/MESOS-1476 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Labels: gsoc2014 When performing maintenance operations on slaves, it is important to allow these slaves to be drained of their tasks. The first essential primitive of draining slaves is to prevent them from running more tasks. This can be achieved by deactivating them: stop sending their resource offers to frameworks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1476) Provide endpoints for deactivating / activating slaves.
[ https://issues.apache.org/jira/browse/MESOS-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-1476: -- Assignee: (was: Alexandra Sava) Un-assigning for now since there is no longer a need for this with the updated maintenance design in MESOS-1474. Provide endpoints for deactivating / activating slaves. --- Key: MESOS-1476 URL: https://issues.apache.org/jira/browse/MESOS-1476 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Labels: gsoc2014 When performing maintenance operations on slaves, it is important to allow these slaves to be drained of their tasks. The first essential primitive of draining slaves is to prevent them from running more tasks. This can be achieved by deactivating them: stop sending their resource offers to frameworks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1592) Design inverse resource offer support
[ https://issues.apache.org/jira/browse/MESOS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14126421#comment-14126421 ] Benjamin Mahler commented on MESOS-1592: Moving this to reviewable as inverse offers were designed as part of the maintenance work: MESOS-1474. We are currently considering how persistent resources will interact with inverse offers and the other maintenance primitives. Design inverse resource offer support - Key: MESOS-1592 URL: https://issues.apache.org/jira/browse/MESOS-1592 Project: Mesos Issue Type: Task Components: allocation Reporter: Benjamin Mahler Assignee: Alexandra Sava An inverse resource offer means that Mesos is requesting resources back from the framework, possibly within some time interval. This can be leveraged initially to provide more automated cluster maintenance, by offering schedulers the opportunity to move tasks to compensate for planned maintenance. Operators can set a time limit on how long to wait for schedulers to relocate tasks before the tasks are forcibly terminated. Inverse resource offers have many other potential uses, as it opens the opportunity for the allocator to attempt to move tasks in the cluster through the co-operation of the framework, possibly providing better over-subscription, fairness, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1717) The slave does not show pending tasks in the JSON endpoints.
[ https://issues.apache.org/jira/browse/MESOS-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1717: --- Sprint: Q3 Sprint 4 (was: Q3 Sprint 4, Mesos Q3 Sprint 5) The slave does not show pending tasks in the JSON endpoints. Key: MESOS-1717 URL: https://issues.apache.org/jira/browse/MESOS-1717 Project: Mesos Issue Type: Bug Components: json api, slave Reporter: Benjamin Mahler The slave does not show pending tasks in the /state.json endpoint. This is a bit tricky to add since we rely on knowing the executor directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1786) FaultToleranceTest.ReconcilePendingTasks is flaky.
Benjamin Mahler created MESOS-1786: -- Summary: FaultToleranceTest.ReconcilePendingTasks is flaky. Key: MESOS-1786 URL: https://issues.apache.org/jira/browse/MESOS-1786 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler {noformat} [ RUN ] FaultToleranceTest.ReconcilePendingTasks Using temporary directory '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm' I0910 20:18:02.308562 21634 leveldb.cpp:176] Opened db in 28.520372ms I0910 20:18:02.315268 21634 leveldb.cpp:183] Compacted db in 6.37495ms I0910 20:18:02.315588 21634 leveldb.cpp:198] Created db iterator in 6338ns I0910 20:18:02.315745 21634 leveldb.cpp:204] Seeked to beginning of db in 1781ns I0910 20:18:02.315901 21634 leveldb.cpp:273] Iterated through 0 keys in the db in 537ns I0910 20:18:02.316076 21634 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0910 20:18:02.316524 21654 recover.cpp:425] Starting replica recovery I0910 20:18:02.316800 21654 recover.cpp:451] Replica is in EMPTY status I0910 20:18:02.317245 21654 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0910 20:18:02.317445 21654 recover.cpp:188] Received a recover response from a replica in EMPTY status I0910 20:18:02.317672 21654 recover.cpp:542] Updating replica status to STARTING I0910 20:18:02.321723 21652 master.cpp:286] Master 20140910-201802-16842879-60361-21634 (precise) started on 127.0.1.1:60361 I0910 20:18:02.322041 21652 master.cpp:332] Master only allowing authenticated frameworks to register I0910 20:18:02.322320 21652 master.cpp:337] Master only allowing authenticated slaves to register I0910 20:18:02.322568 21652 credentials.hpp:36] Loading credentials for authentication from '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm/credentials' I0910 20:18:02.323031 21652 master.cpp:366] Authorization enabled I0910 20:18:02.323663 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 5.781277ms 
I0910 20:18:02.324074 21654 replica.cpp:320] Persisted replica status to STARTING I0910 20:18:02.324443 21654 recover.cpp:451] Replica is in STARTING status I0910 20:18:02.325106 21654 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0910 20:18:02.325454 21654 recover.cpp:188] Received a recover response from a replica in STARTING status I0910 20:18:02.326408 21654 recover.cpp:542] Updating replica status to VOTING I0910 20:18:02.323892 21649 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@127.0.1.1:60361 I0910 20:18:02.326120 21652 master.cpp:1212] The newly elected leader is master@127.0.1.1:60361 with id 20140910-201802-16842879-60361-21634 I0910 20:18:02.323938 21651 master.cpp:120] No whitelist given. Advertising offers for all slaves I0910 20:18:04.209081 21655 hierarchical_allocator_process.hpp:697] No resources available to allocate! I0910 20:18:04.209183 21655 hierarchical_allocator_process.hpp:659] Performed allocation for 0 slaves in 118308ns I0910 20:18:04.209230 21652 master.cpp:1225] Elected as the leading master! 
I0910 20:18:04.209246 21652 master.cpp:1043] Recovering from registrar I0910 20:18:04.209360 21650 registrar.cpp:313] Recovering registrar I0910 20:18:04.214040 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 1.887284299secs I0910 20:18:04.214094 21654 replica.cpp:320] Persisted replica status to VOTING I0910 20:18:04.214190 21654 recover.cpp:556] Successfully joined the Paxos group I0910 20:18:04.214258 21654 recover.cpp:440] Recover process terminated I0910 20:18:04.214437 21654 log.cpp:656] Attempting to start the writer I0910 20:18:04.214756 21654 replica.cpp:474] Replica received implicit promise request with proposal 1 I0910 20:18:04.223865 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 9.044596ms I0910 20:18:04.223944 21654 replica.cpp:342] Persisted promised to 1 I0910 20:18:04.229053 21652 coordinator.cpp:230] Coordinator attemping to fill missing position I0910 20:18:04.229552 21652 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0910 20:18:04.248437 21652 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 18.839475ms I0910 20:18:04.248525 21652 replica.cpp:676] Persisted action at 0 I0910 20:18:04.251194 21650 replica.cpp:508] Replica received write request for position 0 I0910 20:18:04.251260 21650 leveldb.cpp:438] Reading position from leveldb took 43213ns I0910 20:18:04.262251 21650 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 10.949353ms I0910 20:18:04.262346 21650 replica.cpp:676] Persisted action at 0 I0910 20:18:04.262717 21650 replica.cpp:655] Replica received learned notice for position 0 I0910 20:18:04.271878 21650
[jira] [Updated] (MESOS-1786) FaultToleranceTest.ReconcilePendingTasks is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1786: --- Sprint: Mesos Q3 Sprint 5 FaultToleranceTest.ReconcilePendingTasks is flaky. -- Key: MESOS-1786 URL: https://issues.apache.org/jira/browse/MESOS-1786 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler {noformat} [ RUN ] FaultToleranceTest.ReconcilePendingTasks Using temporary directory '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm' I0910 20:18:02.308562 21634 leveldb.cpp:176] Opened db in 28.520372ms I0910 20:18:02.315268 21634 leveldb.cpp:183] Compacted db in 6.37495ms I0910 20:18:02.315588 21634 leveldb.cpp:198] Created db iterator in 6338ns I0910 20:18:02.315745 21634 leveldb.cpp:204] Seeked to beginning of db in 1781ns I0910 20:18:02.315901 21634 leveldb.cpp:273] Iterated through 0 keys in the db in 537ns I0910 20:18:02.316076 21634 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0910 20:18:02.316524 21654 recover.cpp:425] Starting replica recovery I0910 20:18:02.316800 21654 recover.cpp:451] Replica is in EMPTY status I0910 20:18:02.317245 21654 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0910 20:18:02.317445 21654 recover.cpp:188] Received a recover response from a replica in EMPTY status I0910 20:18:02.317672 21654 recover.cpp:542] Updating replica status to STARTING I0910 20:18:02.321723 21652 master.cpp:286] Master 20140910-201802-16842879-60361-21634 (precise) started on 127.0.1.1:60361 I0910 20:18:02.322041 21652 master.cpp:332] Master only allowing authenticated frameworks to register I0910 20:18:02.322320 21652 master.cpp:337] Master only allowing authenticated slaves to register I0910 20:18:02.322568 21652 credentials.hpp:36] Loading credentials for authentication from '/tmp/FaultToleranceTest_ReconcilePendingTasks_TwmFlm/credentials' I0910 20:18:02.323031 21652 
master.cpp:366] Authorization enabled I0910 20:18:02.323663 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 5.781277ms I0910 20:18:02.324074 21654 replica.cpp:320] Persisted replica status to STARTING I0910 20:18:02.324443 21654 recover.cpp:451] Replica is in STARTING status I0910 20:18:02.325106 21654 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0910 20:18:02.325454 21654 recover.cpp:188] Received a recover response from a replica in STARTING status I0910 20:18:02.326408 21654 recover.cpp:542] Updating replica status to VOTING I0910 20:18:02.323892 21649 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@127.0.1.1:60361 I0910 20:18:02.326120 21652 master.cpp:1212] The newly elected leader is master@127.0.1.1:60361 with id 20140910-201802-16842879-60361-21634 I0910 20:18:02.323938 21651 master.cpp:120] No whitelist given. Advertising offers for all slaves I0910 20:18:04.209081 21655 hierarchical_allocator_process.hpp:697] No resources available to allocate! I0910 20:18:04.209183 21655 hierarchical_allocator_process.hpp:659] Performed allocation for 0 slaves in 118308ns I0910 20:18:04.209230 21652 master.cpp:1225] Elected as the leading master! 
I0910 20:18:04.209246 21652 master.cpp:1043] Recovering from registrar I0910 20:18:04.209360 21650 registrar.cpp:313] Recovering registrar I0910 20:18:04.214040 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 1.887284299secs I0910 20:18:04.214094 21654 replica.cpp:320] Persisted replica status to VOTING I0910 20:18:04.214190 21654 recover.cpp:556] Successfully joined the Paxos group I0910 20:18:04.214258 21654 recover.cpp:440] Recover process terminated I0910 20:18:04.214437 21654 log.cpp:656] Attempting to start the writer I0910 20:18:04.214756 21654 replica.cpp:474] Replica received implicit promise request with proposal 1 I0910 20:18:04.223865 21654 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 9.044596ms I0910 20:18:04.223944 21654 replica.cpp:342] Persisted promised to 1 I0910 20:18:04.229053 21652 coordinator.cpp:230] Coordinator attemping to fill missing position I0910 20:18:04.229552 21652 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0910 20:18:04.248437 21652 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 18.839475ms I0910 20:18:04.248525 21652 replica.cpp:676] Persisted action at 0 I0910 20:18:04.251194 21650 replica.cpp:508] Replica received write request for position 0 I0910 20:18:04.251260 21650 leveldb.cpp:438] Reading position from leveldb took 43213ns I0910 20:18:04.262251 21650
[jira] [Updated] (MESOS-1696) Improve reconciliation between master and slave.
[ https://issues.apache.org/jira/browse/MESOS-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1696: --- Description: As we update the Master to keep tasks in memory until they are both terminal and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as follows: {code} Master Slave {} {} {Tn} {} // Master receives Task T, non-terminal. Forwards to slave. {Tn} {Tn} // Slave receives Task T, non-terminal. {Tn} {Tt} // Task becomes terminal on slave. Update forwarded. {Tt} {Tt} // Master receives update, forwards to framework. {} {Tt} // Master receives ack, forwards to slave. {} {} // Slave receives ack. {code} In the current form of reconciliation, the slave sends to the master all tasks that are not both terminal and acknowledged. At any point in the above lifecycle, the slave's re-registration message can reach the master. Note the following properties: *(1)* The master may have a non-terminal task, not present in the slave's re-registration message. *(2)* The master may have a non-terminal task, present in the slave's re-registration message but in a different state. *(3)* The slave's re-registration message may contain a terminal unacknowledged task unknown to the master. In the current master / slave [reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146] code, the master assumes that case (1) is because a launch task message was dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when the task reaches the slave correctly, so this can lead to inconsistency! After chatting with [~vinodkone], we're considering updating the reconciliation to occur as follows: → Slave sends all tasks that are not both terminal and acknowledged, during re-registration. This is the same as before. → If the master sees tasks that are missing in the slave, the master sends the tasks that need to be reconciled to the slave. 
This can be piggy-backed on the re-registration message. → The slave will send TASK_LOST if the task is not known to it. Preferably in a retried manner, unless we update socket closure on the slave to force a re-registration. was: As we update the Master to keep tasks in memory until they are both terminal and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as follows: {code} Master Slave {} {} {Tn} {} // Master receives Task T, non-terminal. Forwards to slave. {Tn} {Tn} // Slave receives Task T, non-terminal. {Tn} {Tt} // Task becomes terminal on slave. Update forwarded. {Tt} {Tt} // Master receives update, forwards to framework. {} {Tt} // Master receives ack, forwards to slave. {} {} // Slave receives ack. {code} In the current form of reconciliation, the slave sends to the master all tasks that are not both terminal and acknowledged. At any point in the above lifecycle, the slave's re-registration message can reach the master. Note the following properties: *(1)* The master may have a non-terminal task, not present in the slave's re-registration message. *(2)* The master may have a non-terminal task, present in the slave's re-registration message but in a different state. *(3)* The slave's re-registration message may contain a terminal unacknowledged task unknown to the master. In the current master / slave [reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146] code, the master assumes that case (1) is because a launch task message was dropped, and it sends TASK_LOST. We've seen above that (1) can happen even when the task reaches the slave correctly, so this can lead to inconsistency! After chatting with [~vinodkone], we're considering updating the reconciliation to occur as follows: → Slave sends all tasks that are not both terminal and acknowledged, during re-registration. This is the same as before. 
→ If the master sees tasks that are missing in the slave, the master sends a reconcile message to the slave for the tasks. → The slave will reply to reconcile messages with the latest state, or TASK_LOST if the task is not known to it. Preferably in a retried manner, unless we update socket closure on the slave to force a re-registration. Improve reconciliation between master and slave. Key: MESOS-1696 URL: https://issues.apache.org/jira/browse/MESOS-1696 Project: Mesos Issue Type: Bug Components: master, slave Reporter: Benjamin Mahler Assignee: Vinod Kone As we update the Master to keep tasks in memory until they are both terminal and acknowledged
[jira] [Commented] (MESOS-1410) Keep terminal unacknowledged tasks in the master's state.
[ https://issues.apache.org/jira/browse/MESOS-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14131014#comment-14131014 ] Benjamin Mahler commented on MESOS-1410: https://reviews.apache.org/r/25565/ https://reviews.apache.org/r/25566/ https://reviews.apache.org/r/25567/ https://reviews.apache.org/r/25568/ Keep terminal unacknowledged tasks in the master's state. - Key: MESOS-1410 URL: https://issues.apache.org/jira/browse/MESOS-1410 Project: Mesos Issue Type: Task Affects Versions: 0.19.0 Reporter: Benjamin Mahler Assignee: Benjamin Mahler Fix For: 0.21.0 Once we are sending acknowledgments through the master as per MESOS-1409, we need to keep terminal tasks that are *unacknowledged* in the Master's memory. This will allow us to identify these tasks to frameworks when we haven't yet forwarded them an update. Without this, we're susceptible to MESOS-1389. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1783) MasterTest.LaunchDuplicateOfferTest is flaky
[ https://issues.apache.org/jira/browse/MESOS-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132235#comment-14132235 ] Benjamin Mahler commented on MESOS-1783: {noformat} commit d6c1ef6842b70af068ba14896693266ed6067724 Author: Niklas Nielsen n...@qni.dk Date: Fri Sep 12 14:40:54 2014 -0700 Fixed flaky MasterTest.LaunchDuplicateOfferTest. A couple of races could occur in the launch tasks on multiple offers tests where recovered resources from purposely-failed invocations turned into a subsequent resource offer and oversaturated the expect's. Review: https://reviews.apache.org/r/25588 {noformat} MasterTest.LaunchDuplicateOfferTest is flaky Key: MESOS-1783 URL: https://issues.apache.org/jira/browse/MESOS-1783 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.20.0 Environment: ubuntu-14.04-gcc Jenkins VM Reporter: Yan Xu Assignee: Niklas Quarfot Nielsen Fix For: 0.21.0 {noformat:title=} [ RUN ] MasterTest.LaunchDuplicateOfferTest Using temporary directory '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg' I0909 22:46:59.212977 21883 leveldb.cpp:176] Opened db in 20.307533ms I0909 22:46:59.219717 21883 leveldb.cpp:183] Compacted db in 6.470397ms I0909 22:46:59.219925 21883 leveldb.cpp:198] Created db iterator in 5571ns I0909 22:46:59.220100 21883 leveldb.cpp:204] Seeked to beginning of db in 1365ns I0909 22:46:59.220268 21883 leveldb.cpp:273] Iterated through 0 keys in the db in 658ns I0909 22:46:59.220448 21883 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0909 22:46:59.220855 21903 recover.cpp:425] Starting replica recovery I0909 22:46:59.221103 21903 recover.cpp:451] Replica is in EMPTY status I0909 22:46:59.221626 21903 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0909 22:46:59.221914 21903 recover.cpp:188] Received a recover response from a replica in EMPTY status I0909 22:46:59.04 21903 recover.cpp:542] Updating replica 
status to STARTING I0909 22:46:59.232590 21900 master.cpp:286] Master 20140909-224659-16842879-44263-21883 (trusty) started on 127.0.1.1:44263 I0909 22:46:59.233278 21900 master.cpp:332] Master only allowing authenticated frameworks to register I0909 22:46:59.233543 21900 master.cpp:337] Master only allowing authenticated slaves to register I0909 22:46:59.233934 21900 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterTest_LaunchDuplicateOfferTest_3ifzmg/credentials' I0909 22:46:59.236431 21900 master.cpp:366] Authorization enabled I0909 22:46:59.237522 21898 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@127.0.1.1:44263 I0909 22:46:59.237877 21904 master.cpp:120] No whitelist given. Advertising offers for all slaves I0909 22:46:59.238723 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 16.245391ms I0909 22:46:59.238916 21903 replica.cpp:320] Persisted replica status to STARTING I0909 22:46:59.239203 21903 recover.cpp:451] Replica is in STARTING status I0909 22:46:59.239724 21903 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0909 22:46:59.239967 21903 recover.cpp:188] Received a recover response from a replica in STARTING status I0909 22:46:59.240304 21903 recover.cpp:542] Updating replica status to VOTING I0909 22:46:59.240684 21900 master.cpp:1212] The newly elected leader is master@127.0.1.1:44263 with id 20140909-224659-16842879-44263-21883 I0909 22:46:59.240846 21900 master.cpp:1225] Elected as the leading master! 
I0909 22:46:59.241149 21900 master.cpp:1043] Recovering from registrar I0909 22:46:59.241509 21898 registrar.cpp:313] Recovering registrar I0909 22:46:59.248440 21903 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 7.864221ms I0909 22:46:59.248644 21903 replica.cpp:320] Persisted replica status to VOTING I0909 22:46:59.248846 21903 recover.cpp:556] Successfully joined the Paxos group I0909 22:46:59.249330 21897 log.cpp:656] Attempting to start the writer I0909 22:46:59.249809 21897 replica.cpp:474] Replica received implicit promise request with proposal 1 I0909 22:46:59.250075 21903 recover.cpp:440] Recover process terminated I0909 22:46:59.258286 21897 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 8.292514ms I0909 22:46:59.258489 21897 replica.cpp:342] Persisted promised to 1 I0909 22:46:59.258848 21897 coordinator.cpp:230] Coordinator attemping to fill missing position I0909 22:46:59.259454 21897 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0909 22:46:59.267755 21897
[jira] [Created] (MESOS-1789) MasterTest.RecoveredSlaveReregisters is flaky.
Benjamin Mahler created MESOS-1789: -- Summary: MasterTest.RecoveredSlaveReregisters is flaky. Key: MESOS-1789 URL: https://issues.apache.org/jira/browse/MESOS-1789 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Priority: Minor Seen flaky on a fedora 19 VM w/ clang. {noformat} [ RUN ] MasterTest.RecoveredSlaveReregisters Using temporary directory '/tmp/MasterTest_RecoveredSlaveReregisters_CHREru' I0910 23:37:24.522372 22914 leveldb.cpp:176] Opened db in 978us I0910 23:37:24.522948 22914 leveldb.cpp:183] Compacted db in 554320ns I0910 23:37:24.522981 22914 leveldb.cpp:198] Created db iterator in 15459ns I0910 23:37:24.523000 22914 leveldb.cpp:204] Seeked to beginning of db in 9593ns I0910 23:37:24.523020 22914 leveldb.cpp:273] Iterated through 0 keys in the db in 9137ns I0910 23:37:24.523043 22914 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0910 23:37:24.525143 22935 recover.cpp:425] Starting replica recovery I0910 23:37:24.525266 22935 recover.cpp:451] Replica is in EMPTY status I0910 23:37:24.525774 22935 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0910 23:37:24.525871 22935 recover.cpp:188] Received a recover response from a replica in EMPTY status I0910 23:37:24.526028 22935 recover.cpp:542] Updating replica status to STARTING I0910 23:37:24.526180 22935 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 83617ns I0910 23:37:24.526211 22935 replica.cpp:320] Persisted replica status to STARTING I0910 23:37:24.526283 22935 recover.cpp:451] Replica is in STARTING status I0910 23:37:24.526725 22935 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0910 23:37:24.526813 22935 recover.cpp:188] Received a recover response from a replica in STARTING status I0910 23:37:24.526964 22935 recover.cpp:542] Updating replica status to VOTING I0910 23:37:24.527061 22935 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb 
took 44802ns I0910 23:37:24.527091 22935 replica.cpp:320] Persisted replica status to VOTING I0910 23:37:24.527139 22935 recover.cpp:556] Successfully joined the Paxos group I0910 23:37:24.527230 22935 recover.cpp:440] Recover process terminated I0910 23:37:24.527748 22928 master.cpp:286] Master 20140910-233724-2272962752-36006-22914 (fedora-19) started on 192.168.122.135:36006 I0910 23:37:24.527807 22928 master.cpp:332] Master only allowing authenticated frameworks to register I0910 23:37:24.527827 22928 master.cpp:337] Master only allowing authenticated slaves to register I0910 23:37:24.527849 22928 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterTest_RecoveredSlaveReregisters_CHREru/credentials' I0910 23:37:24.528890 22928 master.cpp:366] Authorization enabled I0910 23:37:24.529822 22928 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@192.168.122.135:36006 I0910 23:37:24.529903 22928 master.cpp:120] No whitelist given. Advertising offers for all slaves I0910 23:37:24.530275 22928 master.cpp:1212] The newly elected leader is master@192.168.122.135:36006 with id 20140910-233724-2272962752-36006-22914 I0910 23:37:24.530311 22928 master.cpp:1225] Elected as the leading master! 
I0910 23:37:24.530328 22928 master.cpp:1043] Recovering from registrar I0910 23:37:24.530426 22928 registrar.cpp:313] Recovering registrar I0910 23:37:24.530993 22928 log.cpp:656] Attempting to start the writer I0910 23:37:24.531601 22928 replica.cpp:474] Replica received implicit promise request with proposal 1 I0910 23:37:24.531677 22928 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 60319ns I0910 23:37:24.531707 22928 replica.cpp:342] Persisted promised to 1 I0910 23:37:24.532016 22928 coordinator.cpp:230] Coordinator attemping to fill missing position I0910 23:37:24.532691 22928 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0910 23:37:24.532752 22928 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 45735ns I0910 23:37:24.532783 22928 replica.cpp:676] Persisted action at 0 I0910 23:37:24.533252 22928 replica.cpp:508] Replica received write request for position 0 I0910 23:37:24.533299 22928 leveldb.cpp:438] Reading position from leveldb took 34066ns I0910 23:37:24.533354 22928 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 37637ns I0910 23:37:24.533381 22928 replica.cpp:676] Persisted action at 0 I0910 23:37:24.533701 22928 replica.cpp:655] Replica received learned notice for position 0 I0910 23:37:24.533757 22928 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 42842ns I0910 23:37:24.533785 22928 replica.cpp:676] Persisted action at 0 I0910 23:37:24.533804 22928 replica.cpp:661] Replica learned NOP action at
[jira] [Updated] (MESOS-1653) HealthCheckTest.GracePeriod is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1653: --- Fix Version/s: (was: 0.20.0) HealthCheckTest.GracePeriod is flaky. - Key: MESOS-1653 URL: https://issues.apache.org/jira/browse/MESOS-1653 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Timothy Chen {noformat} [--] 3 tests from HealthCheckTest [ RUN ] HealthCheckTest.GracePeriod Using temporary directory '/tmp/HealthCheckTest_GracePeriod_d7zCPr' I0729 17:10:10.484951 1176 leveldb.cpp:176] Opened db in 28.883552ms I0729 17:10:10.499487 1176 leveldb.cpp:183] Compacted db in 13.674118ms I0729 17:10:10.500200 1176 leveldb.cpp:198] Created db iterator in 7394ns I0729 17:10:10.500692 1176 leveldb.cpp:204] Seeked to beginning of db in 2317ns I0729 17:10:10.501113 1176 leveldb.cpp:273] Iterated through 0 keys in the db in 1367ns I0729 17:10:10.501535 1176 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0729 17:10:10.502233 1212 recover.cpp:425] Starting replica recovery I0729 17:10:10.502295 1212 recover.cpp:451] Replica is in EMPTY status I0729 17:10:10.502825 1212 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0729 17:10:10.502877 1212 recover.cpp:188] Received a recover response from a replica in EMPTY status I0729 17:10:10.502980 1212 recover.cpp:542] Updating replica status to STARTING I0729 17:10:10.508482 1213 master.cpp:289] Master 20140729-171010-16842879-54701-1176 (trusty) started on 127.0.1.1:54701 I0729 17:10:10.508607 1213 master.cpp:326] Master only allowing authenticated frameworks to register I0729 17:10:10.508632 1213 master.cpp:331] Master only allowing authenticated slaves to register I0729 17:10:10.508656 1213 credentials.hpp:36] Loading credentials for authentication from '/tmp/HealthCheckTest_GracePeriod_d7zCPr/credentials' I0729 17:10:10.509407 1213 master.cpp:360] Authorization 
enabled I0729 17:10:10.510030 1207 hierarchical_allocator_process.hpp:301] Initializing hierarchical allocator process with master : master@127.0.1.1:54701 I0729 17:10:10.510113 1207 master.cpp:123] No whitelist given. Advertising offers for all slaves I0729 17:10:10.511699 1213 master.cpp:1129] The newly elected leader is master@127.0.1.1:54701 with id 20140729-171010-16842879-54701-1176 I0729 17:10:10.512230 1213 master.cpp:1142] Elected as the leading master! I0729 17:10:10.512692 1213 master.cpp:960] Recovering from registrar I0729 17:10:10.513226 1210 registrar.cpp:313] Recovering registrar I0729 17:10:10.516006 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 12.946461ms I0729 17:10:10.516047 1212 replica.cpp:320] Persisted replica status to STARTING I0729 17:10:10.516129 1212 recover.cpp:451] Replica is in STARTING status I0729 17:10:10.516520 1212 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0729 17:10:10.516592 1212 recover.cpp:188] Received a recover response from a replica in STARTING status I0729 17:10:10.516767 1212 recover.cpp:542] Updating replica status to VOTING I0729 17:10:10.528376 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 11.537102ms I0729 17:10:10.528430 1212 replica.cpp:320] Persisted replica status to VOTING I0729 17:10:10.528501 1212 recover.cpp:556] Successfully joined the Paxos group I0729 17:10:10.528565 1212 recover.cpp:440] Recover process terminated I0729 17:10:10.528700 1212 log.cpp:656] Attempting to start the writer I0729 17:10:10.528960 1212 replica.cpp:474] Replica received implicit promise request with proposal 1 I0729 17:10:10.537821 1212 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 8.830863ms I0729 17:10:10.537869 1212 replica.cpp:342] Persisted promised to 1 I0729 17:10:10.540550 1209 coordinator.cpp:230] Coordinator attemping to fill missing position I0729 17:10:10.540856 1209 replica.cpp:375] Replica received explicit 
promise request for position 0 with proposal 2 I0729 17:10:10.547430 1209 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 6.548344ms I0729 17:10:10.547471 1209 replica.cpp:676] Persisted action at 0 I0729 17:10:10.547732 1209 replica.cpp:508] Replica received write request for position 0 I0729 17:10:10.547765 1209 leveldb.cpp:438] Reading position from leveldb took 15676ns I0729 17:10:10.557169 1209 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 9.373798ms I0729 17:10:10.557241 1209 replica.cpp:676] Persisted action at 0 I0729 17:10:10.560642 1210 replica.cpp:655] Replica received learned notice for position 0 I0729
[jira] [Updated] (MESOS-1791) Introduce Master / Offer Resource Reservations
[ https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1791: --- Affects Version/s: (was: 0.20.0) Introduce Master / Offer Resource Reservations -- Key: MESOS-1791 URL: https://issues.apache.org/jira/browse/MESOS-1791 Project: Mesos Issue Type: Epic Reporter: Tom Arnfeld Currently Mesos supports the ability to reserve resources (for a given role) on a per-slave basis, as introduced in MESOS-505. This allows you to almost statically partition off a set of resources on a set of machines, to guarantee certain types of frameworks get some resources. This is very useful, though it would also be very useful to be able to control these reservations through the master (instead of per-slave) for when I don't care which nodes I get on, as long as I get X cpu and Y RAM, or Z sets of (X,Y). I'm not sure what structure this could take, but apparently it has already been discussed. Would this be a CLI flag? Could there be an (authenticated) web interface to control these reservations? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1791) Introduce Master / Offer Resource Reservations
[ https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1791: --- Epic Name: Offer Reservations Introduce Master / Offer Resource Reservations -- Key: MESOS-1791 URL: https://issues.apache.org/jira/browse/MESOS-1791 Project: Mesos Issue Type: Epic Reporter: Tom Arnfeld -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1795) Assertion failure in state abstraction crashes JVM
[ https://issues.apache.org/jira/browse/MESOS-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134472#comment-14134472 ] Benjamin Mahler commented on MESOS-1795: Do you understand what transpired? Assertion failure in state abstraction crashes JVM -- Key: MESOS-1795 URL: https://issues.apache.org/jira/browse/MESOS-1795 Project: Mesos Issue Type: Bug Components: java api Affects Versions: 0.20.0 Reporter: Connor Doyle Assignee: Connor Doyle Observed the following log output prior to a crash of the Marathon scheduler: Sep 12 23:46:01 highly-available-457-540 marathon[11494]: F0912 23:46:01.771927 11532 org_apache_mesos_state_AbstractState.cpp:145] CHECK_READY(*future): is PENDING Sep 12 23:46:01 highly-available-457-540 marathon[11494]: *** Check failure stack trace: *** Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc2663a2d google::LogMessage::Fail() Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc26657e3 google::LogMessage::SendToLog() Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc2663648 google::LogMessage::Flush() Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc266603e google::LogMessageFatal::~LogMessageFatal() Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc26588a3 Java_org_apache_mesos_state_AbstractState__1_1fetch_1get Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febcd107d98 (unknown) Listing 1: Crash log output. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1797) Packaged Zookeeper does not compile on OSX Yosemite
[ https://issues.apache.org/jira/browse/MESOS-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14135714#comment-14135714 ] Benjamin Mahler commented on MESOS-1797: Is there a ZooKeeper ticket related to this? Packaged Zookeeper does not compile on OSX Yosemite --- Key: MESOS-1797 URL: https://issues.apache.org/jira/browse/MESOS-1797 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.20.0, 0.21.0, 0.19.1 Reporter: Dario Rexin Priority: Minor I have been struggling with this for some time (due to my lack of knowledge about C compiler error messages) and finally found a way to make it compile. The problem is that Zookeeper defines a function `htonll` that is a builtin function in Yosemite. For me it worked to just remove this function, but as it needs to keep working on other systems as well, we would need some check for the OS version or if the function is already defined. Here are the links to the source: https://github.com/apache/zookeeper/blob/trunk/src/c/include/recordio.h#L73 https://github.com/apache/zookeeper/blob/trunk/src/c/src/recordio.c#L83-L97 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
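The check the reporter asks for could be a guard around the definition, since OS X Yosemite already ships `htonll` as a system macro. A minimal sketch of that idea follows; the endianness probe and the function body are illustrative assumptions, not ZooKeeper's actual patch:

```cpp
#include <arpa/inet.h>  // htonl

#include <cstdint>

// On OS X Yosemite 'htonll' is already provided as a system macro, so we
// only define a fallback when the platform has not defined one.
// (Assumption: the macro check is a sufficient feature test for the
// platforms the build targets.)
#ifndef htonll
static uint64_t htonll(uint64_t v) {
  if (htonl(1) == 1) {
    return v;  // Big-endian host: already in network byte order.
  }
  // Little-endian host: swap the two 32-bit halves, byte-swapping each
  // half via htonl.
  return (static_cast<uint64_t>(htonl(static_cast<uint32_t>(v))) << 32) |
         htonl(static_cast<uint32_t>(v >> 32));
}
#endif
```

Either branch (system macro or fallback) is an involution, which keeps the round-trip property portable across hosts.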
[jira] [Created] (MESOS-1799) Reconciliation can send out-of-order updates.
Benjamin Mahler created MESOS-1799: -- Summary: Reconciliation can send out-of-order updates. Key: MESOS-1799 URL: https://issues.apache.org/jira/browse/MESOS-1799 Project: Mesos Issue Type: Bug Components: master, slave Reporter: Benjamin Mahler When a slave re-registers with the master, it currently sends the latest task state for all tasks that are not both terminal and acknowledged. However, reconciliation assumes that we always have the latest unacknowledged state of the task represented in the master. As a result, out-of-order updates are possible, e.g. (1) Slave has task T in TASK_FINISHED, with unacknowledged updates: [TASK_RUNNING, TASK_FINISHED]. (2) Master fails over. (3) New master re-registers the slave with T in TASK_FINISHED. (4) Reconciliation request arrives, master sends TASK_FINISHED. (5) Slave sends TASK_RUNNING to master, master sends TASK_RUNNING. I think the fix here is to preserve the task state invariants in the master, namely, that the master has the latest unacknowledged state of the task. This means when the slave re-registers, it should instead send the latest unacknowledged state of each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
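The five-step sequence above can be sketched as a toy model of the master's task-state view. This is a simulation with made-up names, not Mesos code: the re-registered master only knows TASK_FINISHED, so the reconciliation reply and the slave's later forwarding of the older TASK_RUNNING update reach the framework out of order.

```cpp
#include <string>
#include <vector>

// Toy model of the out-of-order sequence described above. The slave
// holds unacknowledged updates [TASK_RUNNING, TASK_FINISHED] when the
// master fails over.
std::vector<std::string> simulateReconciliation() {
  const std::vector<std::string> unacked = {"TASK_RUNNING", "TASK_FINISHED"};

  // (3) On re-registration the slave sends only the *latest* task state,
  // so the new master's view is TASK_FINISHED.
  const std::string masterView = unacked.back();

  std::vector<std::string> framework;  // Updates as the framework sees them.

  // (4) A reconciliation request arrives; the master answers from its view.
  framework.push_back(masterView);

  // (5) The slave then forwards its oldest unacknowledged update, which
  // the master relays: TASK_RUNNING arrives after TASK_FINISHED.
  framework.push_back(unacked.front());
  return framework;
}
```

Under the proposed fix, the slave would re-register with the front of the unacknowledged stream (TASK_RUNNING here), restoring the master's invariant and the correct delivery order.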
[jira] [Created] (MESOS-1800) The slave does not send pending executors during re-registration.
Benjamin Mahler created MESOS-1800: -- Summary: The slave does not send pending executors during re-registration. Key: MESOS-1800 URL: https://issues.apache.org/jira/browse/MESOS-1800 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler In what looks like an oversight, the pending executors in the slave are not sent in the re-registration message. This can lead to under-accounting in the master, causing an overcommit on the slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources
[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-1466: -- Assignee: (was: Benjamin Mahler) Race between executor exited event and launch task can cause overcommit of resources Key: MESOS-1466 URL: https://issues.apache.org/jira/browse/MESOS-1466 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Vinod Kone Labels: reliability The following sequence of events can cause an overcommit -- Launch task is called for a task whose executor is already running -- Executor's resources are not accounted for on the master -- Executor exits and the event is enqueued behind launch tasks on the master -- Master sends the task to the slave, which needs to commit resources for the task and the (new) executor. -- Master processes the executor exited event and re-offers the executor's resources, causing an overcommit of resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.
Benjamin Mahler created MESOS-1802: -- Summary: HealthCheckTest.HealthStatusChange is flaky on jenkins. Key: MESOS-1802 URL: https://issues.apache.org/jira/browse/MESOS-1802 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Timothy Chen https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull {noformat} [ RUN ] HealthCheckTest.HealthStatusChange Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2' I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in 642ns I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the db in 343ns I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from a replica in EMPTY status I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to STARTING I0916 22:56:14.036603 21046 master.cpp:286] Master 20140916-225614-3125920579-47865-21026 (penates.apache.org) started on 67.195.81.186:47865 I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing authenticated frameworks to register I0916 22:56:14.036648 21046 master.cpp:337] Master only allowing authenticated slaves to register I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for authentication from '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials' I0916 22:56:14.036686 21045 leveldb.cpp:306] 
Persisting metadata (8 bytes) to leveldb took 480322ns I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to STARTING I0916 22:56:14.036769 21046 master.cpp:366] Authorization enabled I0916 22:56:14.036826 21045 recover.cpp:451] Replica is in STARTING status I0916 22:56:14.036944 21052 master.cpp:120] No whitelist given. Advertising offers for all slaves I0916 22:56:14.036968 21049 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@67.195.81.186:47865 I0916 22:56:14.037284 21054 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I0916 22:56:14.037312 21046 master.cpp:1212] The newly elected leader is master@67.195.81.186:47865 with id 20140916-225614-3125920579-47865-21026 I0916 22:56:14.037333 21046 master.cpp:1225] Elected as the leading master! I0916 22:56:14.037345 21046 master.cpp:1043] Recovering from registrar I0916 22:56:14.037504 21040 registrar.cpp:313] Recovering registrar I0916 22:56:14.037505 21053 recover.cpp:188] Received a recover response from a replica in STARTING status I0916 22:56:14.037681 21047 recover.cpp:542] Updating replica status to VOTING I0916 22:56:14.038072 21052 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 330251ns I0916 22:56:14.038087 21052 replica.cpp:320] Persisted replica status to VOTING I0916 22:56:14.038127 21053 recover.cpp:556] Successfully joined the Paxos group I0916 22:56:14.038202 21053 recover.cpp:440] Recover process terminated I0916 22:56:14.038364 21048 log.cpp:656] Attempting to start the writer I0916 22:56:14.038812 21053 replica.cpp:474] Replica received implicit promise request with proposal 1 I0916 22:56:14.038925 21053 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 92623ns I0916 22:56:14.038944 21053 replica.cpp:342] Persisted promised to 1 I0916 22:56:14.039201 21052 coordinator.cpp:230] Coordinator attemping to fill missing position I0916 22:56:14.039676 21047 
replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0916 22:56:14.039836 21047 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 144215ns I0916 22:56:14.039850 21047 replica.cpp:676] Persisted action at 0 I0916 22:56:14.040243 21047 replica.cpp:508] Replica received write request for position 0 I0916 22:56:14.040267 21047 leveldb.cpp:438] Reading position from leveldb took 10323ns I0916 22:56:14.040362 21047 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 79471ns I0916 22:56:14.040375 21047 replica.cpp:676] Persisted action at 0 I0916 22:56:14.040556 21054 replica.cpp:655] Replica received learned notice for position 0 I0916 22:56:14.040658 21054 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 83975ns I0916 22:56:14.040676 21054 replica.cpp:676]
[jira] [Created] (MESOS-1803) Strict/RegistrarTest.remove test is flaky on jenkins.
Benjamin Mahler created MESOS-1803: -- Summary: Strict/RegistrarTest.remove test is flaky on jenkins. Key: MESOS-1803 URL: https://issues.apache.org/jira/browse/MESOS-1803 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull {noformat} [ RUN ] Strict/RegistrarTest.remove/1 Using temporary directory '/tmp/Strict_RegistrarTest_remove_1_3QvnOW' I0916 22:59:02.112568 21026 leveldb.cpp:176] Opened db in 1.779835ms I0916 22:59:02.112896 21026 leveldb.cpp:183] Compacted db in 301862ns I0916 22:59:02.112916 21026 leveldb.cpp:198] Created db iterator in 3065ns I0916 22:59:02.112926 21026 leveldb.cpp:204] Seeked to beginning of db in 475ns I0916 22:59:02.112936 21026 leveldb.cpp:273] Iterated through 0 keys in the db in 330ns I0916 22:59:02.112951 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.113654 21054 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 421460ns I0916 22:59:02.113674 21054 replica.cpp:320] Persisted replica status to VOTING I0916 22:59:02.115900 21026 leveldb.cpp:176] Opened db in 1.947919ms I0916 22:59:02.116263 21026 leveldb.cpp:183] Compacted db in 338043ns I0916 22:59:02.116283 21026 leveldb.cpp:198] Created db iterator in 2809ns I0916 22:59:02.116293 21026 leveldb.cpp:204] Seeked to beginning of db in 468ns I0916 22:59:02.116302 21026 leveldb.cpp:273] Iterated through 0 keys in the db in 195ns I0916 22:59:02.116317 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.117013 21043 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 472891ns I0916 22:59:02.117034 21043 replica.cpp:320] Persisted replica status to VOTING I0916 22:59:02.119240 21026 leveldb.cpp:176] Opened db in 1.950367ms I0916 22:59:02.120455 21026 leveldb.cpp:183] Compacted 
db in 1.188056ms I0916 22:59:02.120481 21026 leveldb.cpp:198] Created db iterator in 4370ns I0916 22:59:02.120499 21026 leveldb.cpp:204] Seeked to beginning of db in 7977ns I0916 22:59:02.120517 21026 leveldb.cpp:273] Iterated through 1 keys in the db in 8479ns I0916 22:59:02.120533 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.122890 21026 leveldb.cpp:176] Opened db in 2.301327ms I0916 22:59:02.124325 21026 leveldb.cpp:183] Compacted db in 1.406223ms I0916 22:59:02.124351 21026 leveldb.cpp:198] Created db iterator in 4185ns I0916 22:59:02.124368 21026 leveldb.cpp:204] Seeked to beginning of db in 7167ns I0916 22:59:02.124387 21026 leveldb.cpp:273] Iterated through 1 keys in the db in 8182ns I0916 22:59:02.124403 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.124579 21047 recover.cpp:425] Starting replica recovery I0916 22:59:02.124651 21047 recover.cpp:451] Replica is in VOTING status I0916 22:59:02.124793 21047 recover.cpp:440] Recover process terminated I0916 22:59:02.126404 21046 registrar.cpp:313] Recovering registrar I0916 22:59:02.126597 21050 log.cpp:656] Attempting to start the writer I0916 22:59:02.127259 21041 replica.cpp:474] Replica received implicit promise request with proposal 1 I0916 22:59:02.127321 21050 replica.cpp:474] Replica received implicit promise request with proposal 1 I0916 22:59:02.127835 21041 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 547018ns I0916 22:59:02.127858 21041 replica.cpp:342] Persisted promised to 1 I0916 22:59:02.127835 21050 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 487588ns I0916 22:59:02.127887 21050 replica.cpp:342] Persisted promised to 1 I0916 22:59:02.128387 21055 coordinator.cpp:230] Coordinator attemping to fill missing position I0916 22:59:02.129546 21042 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 
I0916 22:59:02.129600 21053 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I0916 22:59:02.129982 21042 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 406954ns I0916 22:59:02.129982 21053 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 357253ns I0916 22:59:02.130009 21042 replica.cpp:676] Persisted action at 0 I0916 22:59:02.130029 21053 replica.cpp:676] Persisted action at 0 I0916 22:59:02.130543 21041 replica.cpp:508] Replica received write request for position 0 I0916 22:59:02.130585 21041 leveldb.cpp:438] Reading position from leveldb took 17424ns I0916 22:59:02.130599 21046 replica.cpp:508] Replica received write request for position 0 I0916 22:59:02.130635 21046 leveldb.cpp:438] Reading position from leveldb took 12702ns I0916 22:59:02.130728
[jira] [Resolved] (MESOS-1803) Strict/RegistrarTest.remove test is flaky on jenkins.
[ https://issues.apache.org/jira/browse/MESOS-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler resolved MESOS-1803. Resolution: Cannot Reproduce The log timings here look as if the threads were starved of CPU: {noformat} I0916 22:59:02.136256 21049 leveldb.cpp:343] Persisting action (165 bytes) to leveldb took 141908ns I0916 22:59:02.136267 21047 leveldb.cpp:343] Persisting action (165 bytes) to leveldb took 111061ns I../../src/tests/registrar_tests.cpp:257: Failure 0916 22:59:02.136276 21049 replica.cpp:676] Persisted action at 1 Failed to wait 10secs for registrar.recover(master) I0916 22:59:14.265326 21049 replica.cpp:661] Replica learned APPEND action at position 1 I0916 22:59:02.136291 21047 replica.cpp:676] Persisted action at 1 E0916 22:59:07.135143 21046 registrar.cpp:500] Registrar aborting: Failed to update 'registry': Failed to perform store within 5secs I0916 22:59:14.265393 21047 replica.cpp:661] Replica learned APPEND action at position 1 {noformat} The logging time stamp is determined at the beginning of the LOG(INFO) expression, when the initial LogMessage object is created. The interleaving of times looks to be a stall of the VM or thread starvation: {noformat} 22:59:02.136267 21047 // Thread 1, 1st LogMessage flushed. 22:59:02.136276 21049 // Thread 2, 2nd LogMessage flushed. 22:59:14.265326 21049 // Thread 2, 5th LogMessage flushed. 22:59:02.136291 21047 // Thread 1, 3rd LogMessage flushed. 22:59:07.135143 21046 // Thread 3, 4th LogMessage flushed. 22:59:14.265393 21047 // Thread 1, 6th LogMessage flushed. {noformat} Strict/RegistrarTest.remove test is flaky on jenkins. 
- Key: MESOS-1803 URL: https://issues.apache.org/jira/browse/MESOS-1803 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull {noformat} [ RUN ] Strict/RegistrarTest.remove/1 Using temporary directory '/tmp/Strict_RegistrarTest_remove_1_3QvnOW' I0916 22:59:02.112568 21026 leveldb.cpp:176] Opened db in 1.779835ms I0916 22:59:02.112896 21026 leveldb.cpp:183] Compacted db in 301862ns I0916 22:59:02.112916 21026 leveldb.cpp:198] Created db iterator in 3065ns I0916 22:59:02.112926 21026 leveldb.cpp:204] Seeked to beginning of db in 475ns I0916 22:59:02.112936 21026 leveldb.cpp:273] Iterated through 0 keys in the db in 330ns I0916 22:59:02.112951 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.113654 21054 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 421460ns I0916 22:59:02.113674 21054 replica.cpp:320] Persisted replica status to VOTING I0916 22:59:02.115900 21026 leveldb.cpp:176] Opened db in 1.947919ms I0916 22:59:02.116263 21026 leveldb.cpp:183] Compacted db in 338043ns I0916 22:59:02.116283 21026 leveldb.cpp:198] Created db iterator in 2809ns I0916 22:59:02.116293 21026 leveldb.cpp:204] Seeked to beginning of db in 468ns I0916 22:59:02.116302 21026 leveldb.cpp:273] Iterated through 0 keys in the db in 195ns I0916 22:59:02.116317 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.117013 21043 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 472891ns I0916 22:59:02.117034 21043 replica.cpp:320] Persisted replica status to VOTING I0916 22:59:02.119240 21026 leveldb.cpp:176] Opened db in 1.950367ms I0916 22:59:02.120455 21026 leveldb.cpp:183] Compacted db in 1.188056ms I0916 22:59:02.120481 21026 leveldb.cpp:198] Created db iterator in 4370ns I0916 
22:59:02.120499 21026 leveldb.cpp:204] Seeked to beginning of db in 7977ns I0916 22:59:02.120517 21026 leveldb.cpp:273] Iterated through 1 keys in the db in 8479ns I0916 22:59:02.120533 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.122890 21026 leveldb.cpp:176] Opened db in 2.301327ms I0916 22:59:02.124325 21026 leveldb.cpp:183] Compacted db in 1.406223ms I0916 22:59:02.124351 21026 leveldb.cpp:198] Created db iterator in 4185ns I0916 22:59:02.124368 21026 leveldb.cpp:204] Seeked to beginning of db in 7167ns I0916 22:59:02.124387 21026 leveldb.cpp:273] Iterated through 1 keys in the db in 8182ns I0916 22:59:02.124403 21026 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0916 22:59:02.124579 21047 recover.cpp:425] Starting replica recovery I0916 22:59:02.124651 21047 recover.cpp:451] Replica is in VOTING status I0916 22:59:02.124793 21047 recover.cpp:440] Recover process terminated I0916 22:59:02.126404 21046 registrar.cpp:313] Recovering
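The timestamp interleaving analyzed in the MESOS-1803 resolution above can be sketched with a toy log-message type. This is an illustrative model of glog-style behavior, not glog itself: the timestamp is fixed when the message object is constructed, so if a thread stalls before flushing, later-constructed messages can appear earlier in the log.

```cpp
#include <vector>

// Toy sketch of a glog-style log message: the timestamp is captured at
// construction, not at flush time.
struct ToyLogMessage {
  explicit ToyLogMessage(int now) : timestamp(now) {}  // Time fixed here.
  int timestamp;
};

// Construct two messages on (conceptually) different threads, then flush
// them in reverse order after a stall, as in the starved-VM scenario.
std::vector<int> flushOutOfOrder() {
  ToyLogMessage first(1);   // Constructed at t=1 on thread A.
  ToyLogMessage second(2);  // Constructed at t=2 on thread B.
  // ... VM stall / thread starvation delays thread A's flush ...
  std::vector<int> logOrder;
  logOrder.push_back(second.timestamp);  // Thread B flushes first: t=2.
  logOrder.push_back(first.timestamp);   // Thread A flushes later: t=1.
  return logOrder;
}
```

The flushed sequence shows a later timestamp before an earlier one, matching the non-monotonic interleaving in the quoted log.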
[jira] [Updated] (MESOS-1818) AllocatorTest/0.ResourcesUnused sometimes segfaults
[ https://issues.apache.org/jira/browse/MESOS-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1818: --- Sprint: Mesos Q3 Sprint 5 AllocatorTest/0.ResourcesUnused sometimes segfaults --- Key: MESOS-1818 URL: https://issues.apache.org/jira/browse/MESOS-1818 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.21.0 Reporter: Vinod Kone Assignee: Benjamin Mahler Priority: Critical {code} [ RUN ] AllocatorTest/0.ResourcesUnused *** Aborted at 1411088950 (unix time) try date -d @1411088950 if you are using GNU date *** PC: @ 0x8649a4 mesos::SlaveID::value() *** SIGSEGV (@0x2de9) received by PID 20876 (TID 0x7fb63a1c0940) from PID 11753; stack trace: *** @ 0x7fb643ec4ca0 (unknown) @ 0x8649a4 mesos::SlaveID::value() @ 0x8741c7 mesos::hash_value() @ 0x8f7448 boost::hash::operator()() @ 0x8e0bed boost::unordered::detail::mix64_policy::apply_hash() @ 0x7fb64694c1cf boost::unordered::detail::table::hash() @ 0x7fb646973615 boost::unordered::detail::table::find_node() @ 0x7fb64694c191 boost::unordered::detail::table_impl::count() @ 0x7fb64691f3c1 boost::unordered::unordered_map::count() @ 0x7fb6468f4373 hashmap::contains() @ 0x7fb6468c5eda mesos::internal::master::Master::getSlave() @ 0x7fb6468c0fc3 mesos::internal::master::Master::removeFramework() @ 0x7fb6468afa9f mesos::internal::master::Master::unregisterFramework() @ 0x7fb646904ab9 ProtobufProcess::handler1() @ 0x7fb6469a1e81 _ZNSt5_BindIFPFvPN5mesos8internal6master6MasterEMS3_FvRKN7process4UPIDERKNS0_11FrameworkIDEEMNS1_26UnregisterFrameworkMessageEKFSB_vES8_RKSsES4_SD_SG_St12_PlaceholderILi1EESL_ILi26__callIvJS8_SI_EJLm0ELm1ELm2ELm3ELm4T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE @ 0x7fb646983afe std::_Bind::operator()() @ 0x7fb64695f83c std::_Function_handler::_M_invoke() @ 0xc4e17f std::function::operator()() @ 0x7fb6468ebd10 ProtobufProcess::visit() @ 0x7fb6468a9892 mesos::internal::master::Master::_visit() @ 0x7fb6468a8f46 
mesos::internal::master::Master::visit() @ 0x7fb6468ce670 process::MessageEvent::visit() @ 0x86ad54 process::ProcessBase::serve() @ 0x7fb6470e9738 process::ProcessManager::resume() @ 0x7fb6470dff3f process::schedule() @ 0x7fb643ebc83d start_thread @ 0x7fb642c2426d clone make[3]: *** [check-local] Segmentation fault {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1821) CHECK failure in master.
[ https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1821: --- Priority: Blocker (was: Major) CHECK failure in master. Key: MESOS-1821 URL: https://issues.apache.org/jira/browse/MESOS-1821 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.21.0 Reporter: Benjamin Mahler Assignee: Benjamin Mahler Priority: Blocker Looks like the recent CHECKs I've added exposed a bug in the framework re-registration logic by which we didn't keep the executors consistent between the Slave and Framework structs: {noformat: title=Master Log} I0919 18:05:06.915204 28914 master.cpp:3291] Executor aurora.gc of framework 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 at slave(1)@IP:5051 (HOSTNAME) exited with status 0 I0919 18:05:06.915271 28914 master.cpp:4430] Removing executor 'aurora.gc' with resources cpus(*):0.19; disk(*):15; mem(*):127 of framework 201103282247-19- on slave 20140905-173231-1890854154-5050-31333-0 at slave(1)@IP:5051 (HOSTNAME) F0919 18:05:06.915375 28914 master.hpp:1061] Check failed: hasExecutor(slaveId, executorId) Unknown executor aurora.gc of framework 201103282247-19- of slave 20140905-173231-1890854154-5050-31333-0 *** Check failure stack trace: *** @ 0x7fd16c81737d google::LogMessage::Fail() @ 0x7fd16c8191c4 google::LogMessage::SendToLog() @ 0x7fd16c816f6c google::LogMessage::Flush() @ 0x7fd16c819ab9 google::LogMessageFatal::~LogMessageFatal() @ 0x7fd16c34e09b mesos::internal::master::Framework::removeExecutor() @ 0x7fd16c2da2e4 mesos::internal::master::Master::removeExecutor() @ 0x7fd16c2e6255 mesos::internal::master::Master::exitedExecutor() @ 0x7fd16c348269 ProtobufProcess::handler4() @ 0x7fd16c2fc18e std::_Function_handler::_M_invoke() @ 0x7fd16c322132 ProtobufProcess::visit() @ 0x7fd16c2cef7a mesos::internal::master::Master::_visit() @ 0x7fd16c2dc3d8 mesos::internal::master::Master::visit() @ 0x7fd16c7c2502 
process::ProcessManager::resume() @ 0x7fd16c7c280c process::schedule() @ 0x7fd16b9c683d start_thread @ 0x7fd16a2b626d clone {noformat} This occurs sometime after a failover and indicates that the Slave and Framework structs are not kept in sync. The problem seems to be here: when re-registering a framework on a failed-over master, we only consider executors for which there are tasks stored in the master: {code}
void Master::_reregisterFramework(
    const UPID& from,
    const FrameworkInfo& frameworkInfo,
    bool failover,
    const Future<Option<Error>>& validationError)
{
  ...
  if (frameworks.registered.count(frameworkInfo.id()) > 0) {
    ...
  } else {
    // We don't have a framework with this ID, so we must be a newly
    // elected Mesos master to which either an existing scheduler or a
    // failed-over one is connecting. Create a Framework object and add
    // any tasks it has that have been reported by reconnecting slaves.
    Framework* framework =
      new Framework(frameworkInfo, frameworkInfo.id(), from, Clock::now());
    framework->reregisteredTime = Clock::now();

    // TODO(benh): Check for root submissions like above!

    // Add any running tasks reported by slaves for this framework.
    foreachvalue (Slave* slave, slaves.registered) {
      foreachkey (const FrameworkID& frameworkId, slave->tasks) {
        foreachvalue (Task* task, slave->tasks[frameworkId]) {
          if (framework->id == task->framework_id()) {
            framework->addTask(task);

            // Also add the task's executor for resource accounting
            // if it's still alive on the slave and we've not yet
            // added it to the framework.
            if (task->has_executor_id() &&
                slave->hasExecutor(framework->id, task->executor_id()) &&
                !framework->hasExecutor(slave->id, task->executor_id())) {
              // XXX: If an executor has no tasks, the executor will not
              // XXX: be added to the Framework struct!
              const ExecutorInfo& executorInfo =
                slave->executors[framework->id][task->executor_id()];
              framework->addExecutor(slave->id, executorInfo);
            }
          }
        }
      }
    }

    // N.B. Need to add the framework _after_ we add its tasks
    // (above) so that we can properly determine the resources it's
    // currently using!
    addFramework(framework);
  }
}
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1821) CHECK failure in master.
[ https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141186#comment-14141186 ] Benjamin Mahler commented on MESOS-1821: https://reviews.apache.org/r/25843/ CHECK failure in master. Key: MESOS-1821 URL: https://issues.apache.org/jira/browse/MESOS-1821 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.21.0 Reporter: Benjamin Mahler Assignee: Benjamin Mahler Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1821) CHECK failure in master.
[ https://issues.apache.org/jira/browse/MESOS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1821: --- Sprint: Mesos Q3 Sprint 5 CHECK failure in master. Key: MESOS-1821 URL: https://issues.apache.org/jira/browse/MESOS-1821 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.21.0 Reporter: Benjamin Mahler Assignee: Benjamin Mahler Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1461) Add task reconciliation to the Python API.
[ https://issues.apache.org/jira/browse/MESOS-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-1461: -- Assignee: Niklas Quarfot Nielsen Add task reconciliation to the Python API. -- Key: MESOS-1461 URL: https://issues.apache.org/jira/browse/MESOS-1461 Project: Mesos Issue Type: Task Components: python api Affects Versions: 0.19.0 Reporter: Benjamin Mahler Assignee: Niklas Quarfot Nielsen Looks like the 'reconcileTasks' call was added to the C++ and Java APIs but was never added to the Python API. This may be obviated by the lower level API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1461) Add task reconciliation to the Python API.
[ https://issues.apache.org/jira/browse/MESOS-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148364#comment-14148364 ] Benjamin Mahler commented on MESOS-1461: {noformat} commit f1da58fb22b28afd65313f6801b35bce436199ab Author: Niklas Nielsen n...@qni.dk Date: Thu Sep 25 14:12:12 2014 -0700 Added reconcileTasks to python scheduler. The last step of wiring up reconcileTasks in the python bindings. Review: https://reviews.apache.org/r/25986 {noformat} Add task reconciliation to the Python API. -- Key: MESOS-1461 URL: https://issues.apache.org/jira/browse/MESOS-1461 Project: Mesos Issue Type: Task Components: python api Affects Versions: 0.19.0 Reporter: Benjamin Mahler Assignee: Niklas Quarfot Nielsen Fix For: 0.21.0 Looks like the 'reconcileTasks' call was added to the C++ and Java APIs but was never added to the Python API. This may be obviated by the lower level API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-1389) Reconciliation can send TASK_LOST before a terminal update reaches the framework.
[ https://issues.apache.org/jira/browse/MESOS-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler resolved MESOS-1389. Resolution: Fixed Assignee: Benjamin Mahler Resolved via MESOS-1410. Reconciliation can send TASK_LOST before a terminal update reaches the framework. - Key: MESOS-1389 URL: https://issues.apache.org/jira/browse/MESOS-1389 Project: Mesos Issue Type: Bug Affects Versions: 0.19.0 Reporter: Benjamin Mahler Assignee: Benjamin Mahler Fix For: 0.21.0 There's an unfortunate case with reconciliation, where we end up sending TASK_LOST first and then the slave sends the valid terminal status update. When the slave re-registers with terminal tasks that have un-acked updates, the master does not store these tasks. So while the slave still needs to send the terminal status updates, the master will reply with TASK_LOST for reconciliation. We may need to ensure that all status update acknowledgements go through the master to fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-986) Add versioning to messages.
[ https://issues.apache.org/jira/browse/MESOS-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-986: -- Description: Message versioning in Mesos provides a number of benefits: # Prevent incompatible version combinations. For example, we want to reject slaves that are > 1 version behind the master. # The biggest win is providing the ability to determine behavior based on the cross-component versioning. For example, in MESOS-1668 we wanted the master to send a different ping message based on the version of the slave. In MESOS-1696, we wanted to perform reconciliation in the master based on the version of the slave. In both cases, when we don't have the version, we have to either rely on hacks/tricks, or add additional phases. was: We currently do not prevent rogue versions of components from communicating within the cluster. Adding versioning to our messages will allow us to enforce version compatibility. We would need to figure out the right semantics for each pair of components that communicate with each other. For example, if an old incompatible Slave attempts to register with a Master, the Master can either ignore it, or shut it down. Summary: Add versioning to messages. (was: Add versioning to messages to prevent communication across incompatible versions of components.) cc [~benjaminhindman] [~vinodkone] and I chatted about an approach to add versioning at the libprocess layer. We could add the ability for an application to initialize libprocess with the application version. This gets encoded via a special libprocess HTTP header. {{install}} handlers can optionally receive a {{Version}}, akin to how they can optionally receive the sender UPID currently. Add versioning to messages. 
--- Key: MESOS-986 URL: https://issues.apache.org/jira/browse/MESOS-986 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler Message versioning in Mesos provides a number of benefits: # Prevent incompatible version combinations. For example, we want to reject slaves that are > 1 version behind the master. # The biggest win is providing the ability to determine behavior based on the cross-component versioning. For example, in MESOS-1668 we wanted the master to send a different ping message based on the version of the slave. In MESOS-1696, we wanted to perform reconciliation in the master based on the version of the slave. In both cases, when we don't have the version, we have to either rely on hacks/tricks, or add additional phases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (MESOS-986) Add versioning to messages.
[ https://issues.apache.org/jira/browse/MESOS-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-986: -- Description: Message versioning in Mesos provides a number of benefits: (1) Prevent incompatible version combinations. For example, we want to reject slaves that are > 1 version behind the master. (2) The biggest win is providing the ability to determine behavior based on the cross-component versioning. For example, in MESOS-1668 we wanted the master to send a different ping message based on the version of the slave. In MESOS-1696, we wanted to perform reconciliation in the master based on the version of the slave. In both cases, when we don't have the version, we have to either rely on hacks/tricks, or add additional phases. was: Message versioning in Mesos provides a number of benefits: # Prevent incompatible version combinations. For example, we want to reject slaves that are > 1 version behind the master. # The biggest win is providing the ability to determine behavior based on the cross-component versioning. For example, in MESOS-1668 we wanted the master to send a different ping message based on the version of the slave. In MESOS-1696, we wanted to perform reconciliation in the master based on the version of the slave. In both cases, when we don't have the version, we have to either rely on hacks/tricks, or add additional phases. Add versioning to messages. --- Key: MESOS-986 URL: https://issues.apache.org/jira/browse/MESOS-986 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler Message versioning in Mesos provides a number of benefits: (1) Prevent incompatible version combinations. For example, we want to reject slaves that are > 1 version behind the master. (2) The biggest win is providing the ability to determine behavior based on the cross-component versioning. For example, in MESOS-1668 we wanted the master to send a different ping message based on the version of the slave. 
In MESOS-1696, we wanted to perform reconciliation in the master based on the version of the slave. In both cases, when we don't have the version, we have to either rely on hacks/tricks, or add additional phases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1797) Packaged Zookeeper does not compile on OSX Yosemite
[ https://issues.apache.org/jira/browse/MESOS-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152601#comment-14152601 ] Benjamin Mahler commented on MESOS-1797: Awesome job following up with a patch [~tillt]! I would try to find out when your patch would be part of the next release. Likely we'll need a .patch in the interim, but let's wait to see your patch land in ZooKeeper first so that we know it's the right approach? Packaged Zookeeper does not compile on OSX Yosemite --- Key: MESOS-1797 URL: https://issues.apache.org/jira/browse/MESOS-1797 Project: Mesos Issue Type: Improvement Components: build Affects Versions: 0.20.0, 0.21.0, 0.19.1 Reporter: Dario Rexin Priority: Minor I have been struggling with this for some time (due to my lack of knowledge about C compiler error messages) and finally found a way to make it compile. The problem is that Zookeeper defines a function `htonll` that is a builtin function in Yosemite. For me it worked to just remove this function, but as it needs to keep working on other systems as well, we would need some check for the OS version or if the function is already defined. Here are the links to the source: https://github.com/apache/zookeeper/blob/trunk/src/c/include/recordio.h#L73 https://github.com/apache/zookeeper/blob/trunk/src/c/src/recordio.c#L83-L97 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1696) Improve reconciliation between master and slave.
[ https://issues.apache.org/jira/browse/MESOS-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154012#comment-14154012 ] Benjamin Mahler commented on MESOS-1696: https://reviews.apache.org/r/26202/ https://reviews.apache.org/r/26206/ https://reviews.apache.org/r/26207/ https://reviews.apache.org/r/26208/ Improve reconciliation between master and slave. Key: MESOS-1696 URL: https://issues.apache.org/jira/browse/MESOS-1696 Project: Mesos Issue Type: Bug Components: master, slave Reporter: Benjamin Mahler Assignee: Benjamin Mahler As we update the Master to keep tasks in memory until they are both terminal and acknowledged (MESOS-1410), the lifetime of tasks in Mesos will look as follows: {code}
Master  Slave
{}      {}
{Tn}    {}      // Master receives Task T, non-terminal. Forwards to slave.
{Tn}    {Tn}    // Slave receives Task T, non-terminal.
{Tn}    {Tt}    // Task becomes terminal on slave. Update forwarded.
{Tt}    {Tt}    // Master receives update, forwards to framework.
{}      {Tt}    // Master receives ack, forwards to slave.
{}      {}      // Slave receives ack.
{code} In the current form of reconciliation, the slave sends to the master all tasks that are not both terminal and acknowledged. At any point in the above lifecycle, the slave's re-registration message can reach the master. Note the following properties: *(1)* The master may have a non-terminal task, not present in the slave's re-registration message. *(2)* The master may have a non-terminal task, present in the slave's re-registration message but in a different state. *(3)* The slave's re-registration message may contain a terminal unacknowledged task unknown to the master. In the current master / slave [reconciliation|https://github.com/apache/mesos/blob/0.19.1/src/master/master.cpp#L3146] code, the master assumes that case (1) is because a launch task message was dropped, and it sends TASK_LOST. 
We've seen above that (1) can happen even when the task reaches the slave correctly, so this can lead to inconsistency! After chatting with [~vinodkone], we're considering updating the reconciliation to occur as follows:
→ The slave sends all tasks that are not both terminal and acknowledged, during re-registration. This is the same as before.
→ If the master sees tasks that are missing on the slave, the master sends those tasks to the slave to be reconciled. This can be piggy-backed on the re-registration message.
→ The slave will send TASK_LOST if the task is not known to it. Preferably in a retried manner, unless we update socket closure on the slave to force a re-registration.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (MESOS-1347) GarbageCollectorIntegrationTest.DiskUsage is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reopened MESOS-1347: Re-opening as it appears to still be flaky: https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-In-Src-Set-JAVA_HOME/2137/consoleFull {noformat} [ RUN ] GarbageCollectorIntegrationTest.DiskUsage Using temporary directory '/tmp/GarbageCollectorIntegrationTest_DiskUsage_tjpfEc' I1001 03:47:36.653859 9413 leveldb.cpp:176] Opened db in 2.065433ms I1001 03:47:36.654671 9413 leveldb.cpp:183] Compacted db in 784728ns I1001 03:47:36.654695 9413 leveldb.cpp:198] Created db iterator in 3540ns I1001 03:47:36.654711 9413 leveldb.cpp:204] Seeked to beginning of db in 683ns I1001 03:47:36.654722 9413 leveldb.cpp:273] Iterated through 0 keys in the db in 208ns I1001 03:47:36.654742 9413 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I1001 03:47:36.654906 9433 recover.cpp:425] Starting replica recovery I1001 03:47:36.654992 9433 recover.cpp:451] Replica is in EMPTY status I1001 03:47:36.655396 9429 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I1001 03:47:36.655482 9437 recover.cpp:188] Received a recover response from a replica in EMPTY status I1001 03:47:36.655678 9428 recover.cpp:542] Updating replica status to STARTING I1001 03:47:36.656245 9434 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 494048ns I1001 03:47:36.656272 9434 replica.cpp:320] Persisted replica status to STARTING I1001 03:47:36.656352 9429 master.cpp:312] Master 20141001-034736-3176252227-46678-9413 (proserpina.apache.org) started on 67.195.81.189:46678 I1001 03:47:36.656388 9429 master.cpp:358] Master only allowing authenticated frameworks to register I1001 03:47:36.656404 9429 master.cpp:363] Master only allowing authenticated slaves to register I1001 03:47:36.656421 9429 credentials.hpp:36] Loading credentials for authentication from 
'/tmp/GarbageCollectorIntegrationTest_DiskUsage_tjpfEc/credentials' I1001 03:47:36.656436 9442 recover.cpp:451] Replica is in STARTING status I1001 03:47:36.656574 9429 master.cpp:392] Authorization enabled I1001 03:47:36.656782 9431 master.cpp:120] No whitelist given. Advertising offers for all slaves I1001 03:47:36.656842 9442 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I1001 03:47:36.656867 9438 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@67.195.81.189:46678 I1001 03:47:36.657053 9437 recover.cpp:188] Received a recover response from a replica in STARTING status I1001 03:47:36.657254 9441 master.cpp:1241] The newly elected leader is master@67.195.81.189:46678 with id 20141001-034736-3176252227-46678-9413 I1001 03:47:36.657279 9441 master.cpp:1254] Elected as the leading master! I1001 03:47:36.657292 9441 master.cpp:1072] Recovering from registrar I1001 03:47:36.657311 9440 recover.cpp:542] Updating replica status to VOTING I1001 03:47:36.657403 9436 registrar.cpp:312] Recovering registrar I1001 03:47:36.657766 9437 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 295743ns I1001 03:47:36.657793 9437 replica.cpp:320] Persisted replica status to VOTING I1001 03:47:36.657863 9433 recover.cpp:556] Successfully joined the Paxos group I1001 03:47:36.657943 9433 recover.cpp:440] Recover process terminated I1001 03:47:36.658114 9432 log.cpp:656] Attempting to start the writer I1001 03:47:36.658612 9438 replica.cpp:474] Replica received implicit promise request with proposal 1 I1001 03:47:36.658779 9438 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 141800ns I1001 03:47:36.658797 9438 replica.cpp:342] Persisted promised to 1 I1001 03:47:36.659145 9432 coordinator.cpp:230] Coordinator attemping to fill missing position I1001 03:47:36.659880 9437 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I1001 
03:47:36.769940 9437 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 516875ns I1001 03:47:36.769964 9437 replica.cpp:676] Persisted action at 0 I1001 03:47:36.770449 9437 replica.cpp:508] Replica received write request for position 0 I1001 03:47:36.770480 9437 leveldb.cpp:438] Reading position from leveldb took 12227ns I1001 03:47:36.770740 9437 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 237752ns I1001 03:47:36.770764 9437 replica.cpp:676] Persisted action at 0 I1001 03:47:36.771070 9435 replica.cpp:655] Replica received learned notice for position 0 I1001 03:47:36.771237 9435 leveldb.cpp:343] Persisting action (16 bytes) to leveldb took 145713ns I1001 03:47:36.771257 9435 replica.cpp:676] Persisted action at 0 I1001 03:47:36.771268 9435 replica.cpp:661] Replica learned NOP action at position 0 I1001 03:47:36.771442 9442 log.cpp:672] Writer started with
[jira] [Updated] (MESOS-1854) SlaveRecoveryTest.MultipleSlaves is flaky.
[ https://issues.apache.org/jira/browse/MESOS-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1854: --- Sprint: Mesos Q3 Sprint 6 SlaveRecoveryTest.MultipleSlaves is flaky. -- Key: MESOS-1854 URL: https://issues.apache.org/jira/browse/MESOS-1854 Project: Mesos Issue Type: Bug Components: test Reporter: Benjamin Mahler Assignee: Benjamin Mahler https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2408/consoleFull {noformat} [ RUN ] SlaveRecoveryTest/0.MultipleSlaves Using temporary directory '/tmp/SlaveRecoveryTest_0_MultipleSlaves_yOuJDJ' I1001 01:25:43.585139 23806 leveldb.cpp:176] Opened db in 2.028599ms I1001 01:25:43.585713 23806 leveldb.cpp:183] Compacted db in 552764ns I1001 01:25:43.585731 23806 leveldb.cpp:198] Created db iterator in 3825ns I1001 01:25:43.585738 23806 leveldb.cpp:204] Seeked to beginning of db in 700ns I1001 01:25:43.585744 23806 leveldb.cpp:273] Iterated through 0 keys in the db in 370ns I1001 01:25:43.585759 23806 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I1001 01:25:43.586006 23828 recover.cpp:425] Starting replica recovery I1001 01:25:43.586093 23828 recover.cpp:451] Replica is in EMPTY status I1001 01:25:43.586524 23828 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I1001 01:25:43.586606 23824 recover.cpp:188] Received a recover response from a replica in EMPTY status I1001 01:25:43.586741 23831 recover.cpp:542] Updating replica status to STARTING I1001 01:25:43.586899 23825 master.cpp:312] Master 20141001-012543-3176252227-55929-23806 (proserpina.apache.org) started on 67.195.81.189:55929 I1001 01:25:43.586930 23825 master.cpp:358] Master only allowing authenticated frameworks to register I1001 01:25:43.586942 23825 master.cpp:363] Master only allowing authenticated slaves to register I1001 01:25:43.586953 23825 credentials.hpp:36] Loading 
credentials for authentication from '/tmp/SlaveRecoveryTest_0_MultipleSlaves_yOuJDJ/credentials' I1001 01:25:43.587057 23825 master.cpp:392] Authorization enabled I1001 01:25:43.587241 23829 master.cpp:120] No whitelist given. Advertising offers for all slaves I1001 01:25:43.587270 23828 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 484210ns I1001 01:25:43.587278 23823 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@67.195.81.189:55929 I1001 01:25:43.587288 23828 replica.cpp:320] Persisted replica status to STARTING I1001 01:25:43.587393 23828 recover.cpp:451] Replica is in STARTING status I1001 01:25:43.587611 23825 master.cpp:1241] The newly elected leader is master@67.195.81.189:55929 with id 20141001-012543-3176252227-55929-23806 I1001 01:25:43.587631 23825 master.cpp:1254] Elected as the leading master! I1001 01:25:43.587643 23825 master.cpp:1072] Recovering from registrar I1001 01:25:43.587704 23824 registrar.cpp:312] Recovering registrar I1001 01:25:43.587731 23827 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I1001 01:25:43.587937 23821 recover.cpp:188] Received a recover response from a replica in STARTING status I1001 01:25:43.588060 23827 recover.cpp:542] Updating replica status to VOTING I1001 01:25:43.588371 23830 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 143615ns I1001 01:25:43.588392 23830 replica.cpp:320] Persisted replica status to VOTING I1001 01:25:43.588433 23821 recover.cpp:556] Successfully joined the Paxos group I1001 01:25:43.588496 23821 recover.cpp:440] Recover process terminated I1001 01:25:43.588632 23820 log.cpp:656] Attempting to start the writer I1001 01:25:43.589174 23832 replica.cpp:474] Replica received implicit promise request with proposal 1 I1001 01:25:43.589617 23832 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 427035ns I1001 01:25:43.589630 23832 replica.cpp:342] Persisted 
promised to 1 I1001 01:25:43.589833 23821 coordinator.cpp:230] Coordinator attemping to fill missing position I1001 01:25:43.590340 23821 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2 I1001 01:25:43.590499 23821 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 142051ns I1001 01:25:43.590512 23821 replica.cpp:676] Persisted action at 0 I1001 01:25:43.590903 23833 replica.cpp:508] Replica received write request for position 0 I1001 01:25:43.590932 23833 leveldb.cpp:438] Reading position from leveldb took 14221ns I1001 01:25:43.591089 23833 leveldb.cpp:343] Persisting action (14 bytes) to leveldb took 140263ns I1001 01:25:43.591101 23833 replica.cpp:676]
[jira] [Updated] (MESOS-920) Set GLOG_drop_log_memory=false in environment prior to logging initialization.
[ https://issues.apache.org/jira/browse/MESOS-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-920: -- Component/s: technical debt Set GLOG_drop_log_memory=false in environment prior to logging initialization. -- Key: MESOS-920 URL: https://issues.apache.org/jira/browse/MESOS-920 Project: Mesos Issue Type: Improvement Components: technical debt Affects Versions: 0.16.0, 0.15.0 Reporter: Benjamin Mahler We've observed performance scaling issues attributed to the posix_fadvise calls made by glog. This can currently only be disabled via the environment:
GLOG_DEFINE_bool(drop_log_memory, true, "Drop in-memory buffers of log contents. Logs can grow very quickly and they are rarely read before they need to be evicted from memory. Instead, drop them from memory as soon as they are flushed to disk.");
if (FLAGS_drop_log_memory) {
  if (file_length_ >= logging::kPageSize) {
    // don't evict the most recent page
    uint32 len = file_length_ & ~(logging::kPageSize - 1);
    posix_fadvise(fileno(file_), 0, len, POSIX_FADV_DONTNEED);
  }
}
We should set GLOG_drop_log_memory=false prior to making our call to google::InitGoogleLogging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1858) Leaked file descriptors in StatusUpdateStream.
[ https://issues.apache.org/jira/browse/MESOS-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156982#comment-14156982 ] Benjamin Mahler commented on MESOS-1858: Linking in MESOS-1432. Leaked file descriptors in StatusUpdateStream. -- Key: MESOS-1858 URL: https://issues.apache.org/jira/browse/MESOS-1858 Project: Mesos Issue Type: Bug Reporter: Jie Yu https://github.com/apache/mesos/blob/master/src/slave/status_update_manager.hpp#L180 We should set cloexec for 'fd'. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1862) Performance regression in the Master's http metrics.
Benjamin Mahler created MESOS-1862: -- Summary: Performance regression in the Master's http metrics. Key: MESOS-1862 URL: https://issues.apache.org/jira/browse/MESOS-1862 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.21.0 Reporter: Benjamin Mahler Assignee: Benjamin Mahler Priority: Blocker As part of the change to hold on to terminal unacknowledged tasks in the master, we introduced a performance regression during the following patch: https://github.com/apache/mesos/commit/0760b007ad65bc91e8cea377339978c78d36d247 {noformat} commit 0760b007ad65bc91e8cea377339978c78d36d247 Author: Benjamin Mahler bmah...@twitter.com Date: Thu Sep 11 10:48:20 2014 -0700 Minor cleanups to the Master code. Review: https://reviews.apache.org/r/25566 {noformat} Rather than keeping a running count of allocated resources, we now compute resources on-demand. This was done in order to ignore terminal tasks' resources. As a result of this change, the /stats.json and /metrics/snapshot endpoints on the master have slowed down substantially on large clusters.
{noformat}
$ time curl localhost:5050/health
real 0m0.004s
user 0m0.001s
sys  0m0.002s
$ time curl localhost:5050/stats.json > /dev/null
real 0m15.402s
user 0m0.001s
sys  0m0.003s
$ time curl localhost:5050/metrics/snapshot > /dev/null
real 0m6.059s
user 0m0.002s
sys  0m0.002s
{noformat}
{{perf top}} reveals some of the resource computation during a request to stats.json:
{noformat: perf top}
Events: 36K cycles
 10.53% libc-2.5.so [.] _int_free
  9.90% libc-2.5.so [.] malloc
  8.56% libmesos-0.21.0.so [.] std::_Rb_tree<process::ProcessBase*, process::ProcessBase*, std::_Identity<process::ProcessBase*>, std::less<process::ProcessBase*>, std::allocator<process::ProcessBase*> >::
  8.23% libc-2.5.so [.] _int_malloc
  5.80% libstdc++.so.6.0.8 [.] std::_Rb_tree_increment(std::_Rb_tree_node_base*)
  5.33% [kernel] [k] _raw_spin_lock
  3.13% libstdc++.so.6.0.8 [.] std::string::assign(std::string const&)
  2.95% libmesos-0.21.0.so [.] process::SocketManager::exited(process::ProcessBase*)
  2.43% libmesos-0.21.0.so [.] mesos::Resource::MergeFrom(mesos::Resource const&)
  1.88% libmesos-0.21.0.so [.] mesos::internal::master::Slave::used() const
  1.48% libstdc++.so.6.0.8 [.] __gnu_cxx::__atomic_add(int volatile*, int)
  1.45% [kernel] [k] find_busiest_group
  1.41% libc-2.5.so [.] free
  1.38% libmesos-0.21.0.so [.] mesos::Value_Range::MergeFrom(mesos::Value_Range const&)
  1.13% libmesos-0.21.0.so [.] mesos::Value_Scalar::MergeFrom(mesos::Value_Scalar const&)
  1.12% libmesos-0.21.0.so [.] mesos::Resource::SharedDtor()
  1.07% libstdc++.so.6.0.8 [.] __gnu_cxx::__exchange_and_add(int volatile*, int)
  0.94% libmesos-0.21.0.so [.] google::protobuf::UnknownFieldSet::MergeFrom(google::protobuf::UnknownFieldSet const&)
  0.92% libstdc++.so.6.0.8 [.] operator new(unsigned long)
  0.88% libmesos-0.21.0.so [.] mesos::Value_Ranges::MergeFrom(mesos::Value_Ranges const&)
  0.75% libmesos-0.21.0.so [.] mesos::matches(mesos::Resource const&, mesos::Resource const&)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1842) Create benchmark
[ https://issues.apache.org/jira/browse/MESOS-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160979#comment-14160979 ] Benjamin Mahler commented on MESOS-1842: Looks like this is a duplicate of MESOS-1018? Create benchmark Key: MESOS-1842 URL: https://issues.apache.org/jira/browse/MESOS-1842 Project: Mesos Issue Type: Technical task Components: libprocess Reporter: Joris Van Remoortere Labels: performance, test -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-681) Document the reconciliation API.
[ https://issues.apache.org/jira/browse/MESOS-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-681: -- Description: Now that we have a reconciliation mechanism, we should document why it exists and how to use it going forward. As we add the lower level API, reconciliation may be done slightly differently. Having documentation that reflects the changes would be great. It might also be helpful to upload my slides from the May 19th meetup. was: Now that we have a reconciliation mechanism, we should document why it exists and how to use it in 0.19.0 and going forward. As we add the lower level API, reconciliation may be done slightly differently. Having documentation that reflects the changes would be great. It might also be helpful to upload my slides from the May 19th meetup. Story Points: (was: 2) Summary: Document the reconciliation API. (was: Document the 0.21.0 reconciliation API.) Document the reconciliation API. Key: MESOS-681 URL: https://issues.apache.org/jira/browse/MESOS-681 Project: Mesos Issue Type: Task Components: documentation Reporter: Benjamin Mahler Assignee: Benjamin Mahler Attachments: 0.19.0.key, 0.19.0.pdf Now that we have a reconciliation mechanism, we should document why it exists and how to use it going forward. As we add the lower level API, reconciliation may be done slightly differently. Having documentation that reflects the changes would be great. It might also be helpful to upload my slides from the May 19th meetup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1862) Performance regression in the Master's http metrics.
[ https://issues.apache.org/jira/browse/MESOS-1862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161297#comment-14161297 ] Benjamin Mahler commented on MESOS-1862: https://reviews.apache.org/r/26392/ Performance regression in the Master's http metrics. Key: MESOS-1862 URL: https://issues.apache.org/jira/browse/MESOS-1862 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.21.0 Reporter: Benjamin Mahler Assignee: Benjamin Mahler Priority: Blocker As part of the change to hold on to terminal unacknowledged tasks in the master, we introduced a performance regression during the following patch: https://github.com/apache/mesos/commit/0760b007ad65bc91e8cea377339978c78d36d247 {noformat} commit 0760b007ad65bc91e8cea377339978c78d36d247 Author: Benjamin Mahler bmah...@twitter.com Date: Thu Sep 11 10:48:20 2014 -0700 Minor cleanups to the Master code. Review: https://reviews.apache.org/r/25566 {noformat} Rather than keeping a running count of allocated resources, we now compute resources on-demand. This was done in order to ignore terminal tasks' resources. As a result of this change, the /stats.json and /metrics/snapshot endpoints on the master have slowed down substantially on large clusters.
{noformat}
$ time curl localhost:5050/health
real 0m0.004s
user 0m0.001s
sys  0m0.002s
$ time curl localhost:5050/stats.json > /dev/null
real 0m15.402s
user 0m0.001s
sys  0m0.003s
$ time curl localhost:5050/metrics/snapshot > /dev/null
real 0m6.059s
user 0m0.002s
sys  0m0.002s
{noformat}
{{perf top}} reveals some of the resource computation during a request to stats.json:
{noformat: perf top}
Events: 36K cycles
 10.53% libc-2.5.so [.] _int_free
  9.90% libc-2.5.so [.] malloc
  8.56% libmesos-0.21.0.so [.] std::_Rb_tree<process::ProcessBase*, process::ProcessBase*, std::_Identity<process::ProcessBase*>, std::less<process::ProcessBase*>, std::allocator<process::ProcessBase*> >::
  8.23% libc-2.5.so [.] _int_malloc
  5.80% libstdc++.so.6.0.8 [.] std::_Rb_tree_increment(std::_Rb_tree_node_base*)
  5.33% [kernel] [k] _raw_spin_lock
  3.13% libstdc++.so.6.0.8 [.] std::string::assign(std::string const&)
  2.95% libmesos-0.21.0.so [.] process::SocketManager::exited(process::ProcessBase*)
  2.43% libmesos-0.21.0.so [.] mesos::Resource::MergeFrom(mesos::Resource const&)
  1.88% libmesos-0.21.0.so [.] mesos::internal::master::Slave::used() const
  1.48% libstdc++.so.6.0.8 [.] __gnu_cxx::__atomic_add(int volatile*, int)
  1.45% [kernel] [k] find_busiest_group
  1.41% libc-2.5.so [.] free
  1.38% libmesos-0.21.0.so [.] mesos::Value_Range::MergeFrom(mesos::Value_Range const&)
  1.13% libmesos-0.21.0.so [.] mesos::Value_Scalar::MergeFrom(mesos::Value_Scalar const&)
  1.12% libmesos-0.21.0.so [.] mesos::Resource::SharedDtor()
  1.07% libstdc++.so.6.0.8 [.] __gnu_cxx::__exchange_and_add(int volatile*, int)
  0.94% libmesos-0.21.0.so [.] google::protobuf::UnknownFieldSet::MergeFrom(google::protobuf::UnknownFieldSet const&)
  0.92% libstdc++.so.6.0.8 [.] operator new(unsigned long)
  0.88% libmesos-0.21.0.so [.] mesos::Value_Ranges::MergeFrom(mesos::Value_Ranges const&)
  0.75% libmesos-0.21.0.so [.] mesos::matches(mesos::Resource const&, mesos::Resource const&)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1876) Remove deprecated 'slave_id' field in ReregisterSlaveMessage.
Benjamin Mahler created MESOS-1876: -- Summary: Remove deprecated 'slave_id' field in ReregisterSlaveMessage. Key: MESOS-1876 URL: https://issues.apache.org/jira/browse/MESOS-1876 Project: Mesos Issue Type: Task Components: technical debt Reporter: Benjamin Mahler This is to follow through on removing the deprecated field that we've been phasing out. In 0.21.0, this field will no longer be read: {code} message ReregisterSlaveMessage { // TODO(bmahler): slave_id is deprecated. // 0.21.0: Now an optional field. Always written, never read. // 0.22.0: Remove this field. optional SlaveID slave_id = 1; required SlaveInfo slave = 2; repeated ExecutorInfo executor_infos = 4; repeated Task tasks = 3; repeated Archive.Framework completed_frameworks = 5; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1876) Remove deprecated 'slave_id' field in ReregisterSlaveMessage.
[ https://issues.apache.org/jira/browse/MESOS-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1876: --- Priority: Trivial (was: Major) Remove deprecated 'slave_id' field in ReregisterSlaveMessage. - Key: MESOS-1876 URL: https://issues.apache.org/jira/browse/MESOS-1876 Project: Mesos Issue Type: Task Components: technical debt Reporter: Benjamin Mahler Priority: Trivial This is to follow through on removing the deprecated field that we've been phasing out. In 0.21.0, this field will no longer be read: {code} message ReregisterSlaveMessage { // TODO(bmahler): slave_id is deprecated. // 0.21.0: Now an optional field. Always written, never read. // 0.22.0: Remove this field. optional SlaveID slave_id = 1; required SlaveInfo slave = 2; repeated ExecutorInfo executor_infos = 4; repeated Task tasks = 3; repeated Archive.Framework completed_frameworks = 5; } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-1879) Handle a temporary one-way slave -> master socket closure.
Benjamin Mahler created MESOS-1879: -- Summary: Handle a temporary one-way slave -> master socket closure. Key: MESOS-1879 URL: https://issues.apache.org/jira/browse/MESOS-1879 Project: Mesos Issue Type: Bug Components: slave Reporter: Benjamin Mahler Priority: Minor In the same spirit as MESOS-1668, we want to correctly handle a scenario where the slave -> master socket closes, and a new socket can be immediately re-established. If this occurs, the ping / pongs will resume, but there may be dropped messages sent by the slave, and so a re-registration would be a good safety net. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1878) Access to sandbox on slave from master UI does not show the sandbox contents
[ https://issues.apache.org/jira/browse/MESOS-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1878: --- Target Version/s: 0.21.0 Affects Version/s: 0.21.0 Assignee: Cody Maloney [~cmaloney] can you take a look at this? Access to sandbox on slave from master UI does not show the sandbox contents Key: MESOS-1878 URL: https://issues.apache.org/jira/browse/MESOS-1878 Project: Mesos Issue Type: Bug Components: webui Affects Versions: 0.21.0 Reporter: Anindya Sinha Assignee: Cody Maloney Priority: Minor From the master UI, clicking Sandbox to go to the slave sandbox does not list the sandbox contents. The directory path of the sandbox shows up fine, but the actual contents of the sandbox are not displayed below it. Looks like the issue is that the following GET to the corresponding slave fails: http://slave-host:4891/files/browse.json?jsonp=angular.callbacks._9&path=sandbox-path Looking at the commits, I could confirm that the issue is not seen with commit 'babb1c06ecf3077f292a19cfcbf1f1a4ed0e07b1'. Rolling back to a mesos build with this commit being the last commit on the mesos slave does not show this behavior. Update: The issue has been introduced by the following 2 commits: ca2e8ef MESOS-1857 Fixed path::join() on older libstdc++ which lack back(). b08fccf Switched path::join() to be variadic Note that the commit ca2e8ef fixes a build issue (on older libstdc++) on top of the commit b08fccf. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1869) UpdateFramework message might reach the slave before Reregistered message and get dropped
[ https://issues.apache.org/jira/browse/MESOS-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164526#comment-14164526 ] Benjamin Mahler commented on MESOS-1869: Fixed as part of MESOS-1696: https://reviews.apache.org/r/26206/ UpdateFramework message might reach the slave before Reregistered message and get dropped - Key: MESOS-1869 URL: https://issues.apache.org/jira/browse/MESOS-1869 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Assignee: Benjamin Mahler In reregisterSlave() we send 'SlaveReregisteredMessage' before we link the slave pid, which means a temporary socket will be created and used. Subsequently, after linking, we send the UpdateFrameworkMessage, which creates and uses a persistent socket. This might lead to out-of-order delivery, resulting in UpdateFrameworkMessage reaching the slave before the SlaveReregisteredMessage and getting dropped because the slave is not yet (re-)registered. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1469) No output from review bot on timeout
[ https://issues.apache.org/jira/browse/MESOS-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1469: --- Component/s: reviewbot No output from review bot on timeout Key: MESOS-1469 URL: https://issues.apache.org/jira/browse/MESOS-1469 Project: Mesos Issue Type: Bug Components: build, reviewbot Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor When the mesos review build times out, likely due to a long-running failing test, we have no output to debug. We should find a way to stream the output from the build instead of waiting for the build to finish. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1234) Mesos ReviewBot should look at old reviews first
[ https://issues.apache.org/jira/browse/MESOS-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1234: --- Component/s: reviewbot Mesos ReviewBot should look at old reviews first Key: MESOS-1234 URL: https://issues.apache.org/jira/browse/MESOS-1234 Project: Mesos Issue Type: Improvement Components: reviewbot Reporter: Vinod Kone Assignee: Vinod Kone Fix For: 0.19.0 Currently the ReviewBot looks at newest reviews first starving out old reviews if there are enough new/updated reviews. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1712) Automate disallowing of commits mixing mesos/libprocess/stout
[ https://issues.apache.org/jira/browse/MESOS-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1712: --- Component/s: reviewbot Automate disallowing of commits mixing mesos/libprocess/stout - Key: MESOS-1712 URL: https://issues.apache.org/jira/browse/MESOS-1712 Project: Mesos Issue Type: Bug Components: reviewbot Reporter: Vinod Kone For various reasons, we don't want to mix mesos/libprocess/stout changes into a single commit. Typically, it is up to the reviewee/reviewer to catch this. It would be nice to automate this via the pre-commit hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332)