Re: Re: Cluster history wiped after master leader reelection

2016-03-15 Thread Geoffroy Jabouley
Hi

thanks for your answer.

Too bad the cluster history is wiped out. Is this behavior by design
(history is stored on the current leader and cannot be copied to a new leader)?

Any suggestions for a way of persisting it?
Maybe outside of Mesos, using some external data collection?
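
Something like the sketch below is what I have in mind: a small external poller
that appends the master's task counters to a local file, so the figures survive
a leader re-election. Only a rough sketch (Python, with the master address,
output path and 60s interval as placeholders; /metrics/snapshot has to be
queried on the current leader):

import json
import time

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

MASTER = "http://10.195.30.19:5050"   # placeholder: current master leader
COUNTERS = ["master/tasks_finished", "master/tasks_failed",
            "master/tasks_killed", "master/tasks_lost"]

while True:
    # /metrics/snapshot returns a flat JSON object of counters and gauges
    body = urlopen(MASTER + "/metrics/snapshot").read()
    snapshot = json.loads(body.decode("utf-8"))
    record = {"time": int(time.time())}
    for key in COUNTERS:
        record[key] = snapshot.get(key)
    # append one JSON line per sample; this file outlives master failovers
    with open("/var/tmp/mesos-task-history.jsonl", "a") as out:
        out.write(json.dumps(record) + "\n")
    time.sleep(60)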



--


Yes this is the intended behavior.

-- 

*Rodrick Brown* / Systems Engineer

+1 917 445 6839 / rodr...@orchardplatform.com

*Orchard Platform*

101 5th Avenue, 4th Floor, New York, NY 10003

http://www.orchardplatform.com

Orchard Blog <http://www.orchardplatform.com/blog/> | Marketplace Lending
Meetup <http://www.meetup.com/Peer-to-Peer-Lending-P2P/>

> On Mar 10 2016, at 11:47 am, Geoffroy Jabouley <
> geoffroy.jabou...@gmail.com> wrote:
> Hello
>
> a leader re-election just occurred on our cluster (0.25.0).
>
> It went fine, except that the entire cluster history has been lost.
>
> All task counters have been reset to 0, and the Completed Tasks and Terminated
> Frameworks lists are empty.
>
> Has anybody experienced this?
>
> Regards
>
>
> PS: this is not a blocking problem, but it is important in our job to
> sometimes show figures to our management, and such counters always make a
> good impression ;)
>


Re: Re: Mesos 0.25 not increasing Staged/Started counters in the UI

2016-02-24 Thread Geoffroy Jabouley
Thanks for the clarification. Does Staged mean "currently in staging state"?

In previous versions of Mesos (at least 0.22.1), the Staged value was
increased for each staged task, so you could tell "X tasks have been
executed on the cluster".

My point is that there is no longer a straightforward way of telling how many
tasks have been run on the cluster since it came up. Or am I missing
something?
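
One rough workaround I am considering in the meantime: derive the figure myself
from the per-state task counters in /metrics/snapshot (they appear in the dump
below). A small sketch, assuming Python and the current master leader answering
on port 5050; note these counters are reset whenever the leader changes:

import json

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

# Placeholder address: point this at the current master leader.
body = urlopen("http://10.195.30.19:5050/metrics/snapshot").read()
snapshot = json.loads(body.decode("utf-8"))

# The terminal states are cumulative counters; staging/starting/running are
# gauges of current tasks, so the sum is a rough "tasks seen so far" figure.
states = ["staging", "starting", "running",
          "finished", "failed", "killed", "lost", "error"]
total = sum(snapshot.get("master/tasks_" + s, 0) for s in states)
print("tasks seen by this master since election: %d" % total)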




If you have some tasks whose state is not equal to TASK_STAGING, it would
become non-zero.

On Wed, Feb 24, 2016 at 8:23 PM, Geoffroy Jabouley
<geoffroy.jabou...@gmail.com> wrote:

> Hi again
>
> just checked the /metrics/snapshot endpoint. Staged value is zero. Is this
> normal?
>
> {
>"allocator\/event_queue_dispatches":0.0,
>"frameworks\/jenkins\/messages_processed":9.0,
>"frameworks\/jenkins\/messages_received":9.0,
>"master\/cpus_percent":0.0958,
>"master\/cpus_revocable_percent":0.0,
>"master\/cpus_revocable_total":0.0,
>"master\/cpus_revocable_used":0.0,
>"master\/cpus_total":24.0,
>"master\/cpus_used":2.3,
>"master\/disk_percent":0.0,
>"master\/disk_revocable_percent":0.0,
>"master\/disk_revocable_total":0.0,
>"master\/disk_revocable_used":0.0,
>"master\/disk_total":138161.0,
>"master\/disk_used":0.0,
>"master\/dropped_messages":6.0,
>"master\/elected":1.0,
>"master\/event_queue_dispatches":26.0,
>"master\/event_queue_http_requests":0.0,
>"master\/event_queue_messages":0.0,
>"master\/frameworks_active":2.0,
>"master\/frameworks_connected":2.0,
>"master\/frameworks_disconnected":0.0,
>"master\/frameworks_inactive":0.0,
>"master\/invalid_executor_to_framework_messages":0.0,
>"master\/invalid_framework_to_executor_messages":0.0,
>"master\/invalid_status_update_acknowledgements":0.0,
>"master\/invalid_status_updates":256.0,
>"master\/mem_percent":0.268649930174402,
>"master\/mem_revocable_percent":0.0,
>"master\/mem_revocable_total":0.0,
>"master\/mem_revocable_used":0.0,
>"master\/mem_total":92373.0,
>"master\/mem_used":24816.0,
>"master\/messages_authenticate":0.0,
>"master\/messages_deactivate_framework":0.0,
>"master\/messages_decline_offers":45642.0,
>"master\/messages_executor_to_framework":0.0,
>"master\/messages_exited_executor":0.0,
>"master\/messages_framework_to_executor":0.0,
>"master\/messages_kill_task":1401.0,
>"master\/messages_launch_tasks":1525.0,
>"master\/messages_reconcile_tasks":5100.0,
>"master\/messages_register_framework":0.0,
>"master\/messages_register_slave":3.0,
>"master\/messages_reregister_framework":0.0,
>"master\/messages_reregister_slave":3.0,
>"master\/messages_resource_request":0.0,
>"master\/messages_revive_offers":78.0,
>"master\/messages_status_update":3252.0,
>"master\/messages_status_update_acknowledgement":1964.0,
>"master\/messages_suppress_offers":0.0,
>"master\/messages_unregister_framework":183.0,
>"master\/messages_unregister_slave":0.0,
>"master\/messages_update_slave":6.0,
>"master\/outstanding_offers":0.0,
>"master\/recovery_slave_removals":0.0,
>"master\/slave_registrations":3.0,
>"master\/slave_removals":0.0,
>"master\/slave_removals\/reason_registered":0.0,
>"master\/slave_removals\/reason_unhealthy":0.0,
>"master\/slave_removals\/reason_unregistered":0.0,
>"master\/slave_reregistrations":0.0,
>"master\/slave_shutdowns_canceled":0.0,
>"master\/slave_shutdowns_completed":0.0,
>"master\/slave_shutdowns_scheduled":0.0,
>"master\/slaves_active":3.0,
>"master\/slaves_connected":3.0,
>"master\/slaves_disconnected":0.0,
>"master\/slaves_inactive":0.0,
>"master\/task_killed\/source_master\/reason_framework_removed":1065.0,
>"master\/tasks_error":0.0,
>"master\/tasks_failed":4.0,
>"master\/tasks_finished":4.0,
>"master\/tasks_killed":150

Re: Mesos 0.25 not increasing Staged/Started counters in the UI

2016-02-24 Thread Geoffroy Jabouley
Hi again

just checked the /metrics/snapshot endpoint. Staged value is zero. Is this
normal?

{
   "allocator\/event_queue_dispatches":0.0,
   "frameworks\/jenkins\/messages_processed":9.0,
   "frameworks\/jenkins\/messages_received":9.0,
   "master\/cpus_percent":0.0958,
   "master\/cpus_revocable_percent":0.0,
   "master\/cpus_revocable_total":0.0,
   "master\/cpus_revocable_used":0.0,
   "master\/cpus_total":24.0,
   "master\/cpus_used":2.3,
   "master\/disk_percent":0.0,
   "master\/disk_revocable_percent":0.0,
   "master\/disk_revocable_total":0.0,
   "master\/disk_revocable_used":0.0,
   "master\/disk_total":138161.0,
   "master\/disk_used":0.0,
   "master\/dropped_messages":6.0,
   "master\/elected":1.0,
   "master\/event_queue_dispatches":26.0,
   "master\/event_queue_http_requests":0.0,
   "master\/event_queue_messages":0.0,
   "master\/frameworks_active":2.0,
   "master\/frameworks_connected":2.0,
   "master\/frameworks_disconnected":0.0,
   "master\/frameworks_inactive":0.0,
   "master\/invalid_executor_to_framework_messages":0.0,
   "master\/invalid_framework_to_executor_messages":0.0,
   "master\/invalid_status_update_acknowledgements":0.0,
   "master\/invalid_status_updates":256.0,
   "master\/mem_percent":0.268649930174402,
   "master\/mem_revocable_percent":0.0,
   "master\/mem_revocable_total":0.0,
   "master\/mem_revocable_used":0.0,
   "master\/mem_total":92373.0,
   "master\/mem_used":24816.0,
   "master\/messages_authenticate":0.0,
   "master\/messages_deactivate_framework":0.0,
   "master\/messages_decline_offers":45642.0,
   "master\/messages_executor_to_framework":0.0,
   "master\/messages_exited_executor":0.0,
   "master\/messages_framework_to_executor":0.0,
   "master\/messages_kill_task":1401.0,
   "master\/messages_launch_tasks":1525.0,
   "master\/messages_reconcile_tasks":5100.0,
   "master\/messages_register_framework":0.0,
   "master\/messages_register_slave":3.0,
   "master\/messages_reregister_framework":0.0,
   "master\/messages_reregister_slave":3.0,
   "master\/messages_resource_request":0.0,
   "master\/messages_revive_offers":78.0,
   "master\/messages_status_update":3252.0,
   "master\/messages_status_update_acknowledgement":1964.0,
   "master\/messages_suppress_offers":0.0,
   "master\/messages_unregister_framework":183.0,
   "master\/messages_unregister_slave":0.0,
   "master\/messages_update_slave":6.0,
   "master\/outstanding_offers":0.0,
   "master\/recovery_slave_removals":0.0,
   "master\/slave_registrations":3.0,
   "master\/slave_removals":0.0,
   "master\/slave_removals\/reason_registered":0.0,
   "master\/slave_removals\/reason_unhealthy":0.0,
   "master\/slave_removals\/reason_unregistered":0.0,
   "master\/slave_reregistrations":0.0,
   "master\/slave_shutdowns_canceled":0.0,
   "master\/slave_shutdowns_completed":0.0,
   "master\/slave_shutdowns_scheduled":0.0,
   "master\/slaves_active":3.0,
   "master\/slaves_connected":3.0,
   "master\/slaves_disconnected":0.0,
   "master\/slaves_inactive":0.0,
   "master\/task_killed\/source_master\/reason_framework_removed":1065.0,
   "master\/tasks_error":0.0,
   "master\/tasks_failed":4.0,
   "master\/tasks_finished":4.0,
   "master\/tasks_killed":1506.0,
   "master\/tasks_lost":0.0,
   "master\/tasks_running":12.0,
   "master\/tasks_staging":0.0,
   "master\/tasks_starting":0.0,
   "master\/uptime_secs":767193.85357312,
   "master\/valid_executor_to_framework_messages":0.0,
   "master\/valid_framework_to_executor_messages":0.0,
   "master\/valid_status_update_acknowledgements":1964.0,
   "master\/valid_status_updates":2996.0,
   "registrar\/queued_operations":0.0,
   "registrar\/registry_size_bytes":681.0,
   "registrar\/state_fetch_ms":1693.7408,
   "registrar\/state_store_ms":2.151936,
   "registrar\/state_store_ms\/count":4,
   "registrar\/state_store_ms\/max":7.021056,
   "registrar\/state_store_ms\/min":2.151936,
   "registrar\/state_store_ms\/p50":2.361856,
   "registrar\/state_store_ms\/p90":5.6530176,
   "registrar\/state_store_ms\/p95":6.3370368,
   "registrar\/state_store_ms\/p99":6.88425216,
   "registrar\/state_store_ms\/p999":7.007375616,
   "registrar\/state_store_ms\/p":7.0196879616,
   "system\/cpus_total":8.0,
   "system\/load_15min":0.06,
   "system\/load_1min":0.08,
   "system\/load_5min":0.09,
   "system\/mem_free_bytes":281022464.0,
   "system\/mem_total_bytes":33360670720.0
}


Mesos 0.25 not increasing Staged/Started counters in the UI

2016-02-23 Thread Geoffroy Jabouley
Hello

since we moved to Mesos 0.25, we noticed that in the left column of the UI,
in the TASKS part, the counters for Staged and Started tasks are always equal
to 0.


[image: inline image 1]

Is this normal? Or maybe a known issue?

With 0.22.1, the Started counter was always zero, but at least the Staged counter
showed the number of tasks executed on the cluster since it had started.

Regards


Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-17 Thread Geoffroy Jabouley
Thanks a lot Dario for the workaround! It works fine and can be scripted
with ansible.

For the record, the github issue is available here:
https://github.com/mesosphere/marathon/issues/1292

2015-03-12 17:27 GMT+01:00 Dario Rexin da...@mesosphere.io:

 Hi Geoffrey,

 we identified the issue and will fix it in Marathon 0.8.2. To prevent this
 behaviour for now, you just have to make sure that in a fresh setup
 (Marathon was never connected to Mesos) you first start up a single
 Marathon and let it register with Mesos and then start the other Marathon
 instances. The problem is a race in first registration with Mesos and
 fetching the FrameworkID from Zookeeper. Please let me know if the
 workaround does not help you.

 Cheers,
 Dario

 On 12 Mar 2015, at 09:20, Alex Rukletsov a...@mesosphere.io wrote:

 Geoffroy,

 yes, it looks like a marathon issue, so feel free to post it there as well.

 On Thu, Mar 12, 2015 at 1:34 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:

 Thanks Alex for your answer. I will have a look.

 Would it be better to (cross-)post this discussion on the marathon
 mailing list?

 Anyway, the issue is fixed for 0.8.0, which is the version i'm using.

 2015-03-11 22:18 GMT+01:00 Alex Rukletsov a...@mesosphere.io:

 Geoffroy,

 most probably you're hitting this bug:
 https://github.com/mesosphere/marathon/issues/1063. The problem is that
 Marathon can register instead of re-registering when a master fails
 over. From master point of view, it's a new framework, that's why the
 previous task is gone and a new one (that technically belongs to a new
 framework) is started. You can see that frameworks have two different IDs
 (check lines 11:31:40.055496 and 11:31:40.785038) in your example.

 Hope that helps,
 Alex

 On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:

 Hello

 thanks for your interest. Following are the requested logs, which will
 result in a pretty big mail.

 Mesos/Marathon are *NOT running inside docker*, we only use Docker as
 our mesos containerizer.

 As a reminder, here is the use case performed to get the log files:

 

 Our cluster: 3 identical mesos nodes with:
 + zookeeper
 + docker 1.5
 + mesos master 0.21.1 configured in HA mode
 + mesos slave 0.21.1 configured with checkpointing, strict and
 reconnect
 + marathon 0.8.0 configured in HA mode with checkpointing

 

 *Begin State: *
 + the mesos cluster is up (3 machines)
 + mesos master leader is 10.195.30.19
 + marathon leader is 10.195.30.21
 + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21

 *Action*: stop the mesos master leader process (sudo stop mesos-master)

 *Expected*: mesos master leader has changed, active tasks / frameworks
 remain unchanged

 *End state: *
 + mesos master leader *has changed, now 10.195.30.21*
 + previously running APPTASK on the slave 10.195.30.21 has disappeared
 (not showing anymore in the mesos UI), but the *docker container is still
 running*
 + a *new APPTASK is now running on slave 10.195.30.19*
 + marathon framework registration time in the mesos UI shows "Just now"
 + marathon leader *has changed, now 10.195.30.20*


 

 Now come the 6 requested logs, which might contain
 interesting/relevant information, but as a newcomer to mesos I find them hard
 to read...


 *from previous MESOS master leader 10.195.30.19:*
 W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal
 SIGTERM from process 1 of user 0; exiting


 *from new MESOS master leader 10.195.30.21:*
 I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
 (id='2')
 I0310 11:31:40.011823   922 group.cpp:659] Trying to get
 '/mesos/info_02' in ZooKeeper
 I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group
 memberships changed
 I0310 11:31:40.015847   915 group.cpp:659] Trying to get
 '/mesos/log_replicas/00' in ZooKeeper
 I0310 11:31:40.016047   922 detector.cpp:433] A new leading master
 (UPID=master@10.195.30.21:5050) is detected
 I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader
 is master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
 I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading
 master!
 I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
 I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
 I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
 I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
 promise request with proposal 2
 I0310 11:31:40.017503   915 group.cpp:659] Trying to get
 '/mesos/log_replicas/03' in ZooKeeper
 I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8
 bytes) to leveldb took 893672ns
 I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
 I0310 11:31:40.018817   915

Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-12 Thread Geoffroy Jabouley
Thanks Alex for your answer. I will have a look.

Would it be better to (cross-)post this discussion on the marathon mailing
list?

Anyway, the issue is fixed for 0.8.0, which is the version i'm using.

2015-03-11 22:18 GMT+01:00 Alex Rukletsov a...@mesosphere.io:

 Geoffroy,

 most probably you're hitting this bug:
 https://github.com/mesosphere/marathon/issues/1063. The problem is that
 Marathon can register instead of re-registering when a master fails
 over. From master point of view, it's a new framework, that's why the
 previous task is gone and a new one (that technically belongs to a new
 framework) is started. You can see that frameworks have two different IDs
 (check lines 11:31:40.055496 and 11:31:40.785038) in your example.

 Hope that helps,
 Alex

 On Tue, Mar 10, 2015 at 4:04 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:

 Hello

 thanks for your interest. Following are the requested logs, which will
 result in a pretty big mail.

 Mesos/Marathon are *NOT running inside docker*, we only use Docker as
 our mesos containerizer.

 As a reminder, here is the use case performed to get the log files:

 

 Our cluster: 3 identical mesos nodes with:
 + zookeeper
 + docker 1.5
 + mesos master 0.21.1 configured in HA mode
 + mesos slave 0.21.1 configured with checkpointing, strict and
 reconnect
 + marathon 0.8.0 configured in HA mode with checkpointing

 

 *Begin State: *
 + the mesos cluster is up (3 machines)
 + mesos master leader is 10.195.30.19
 + marathon leader is 10.195.30.21
 + 1 docker task (let's call it APPTASK) is running on slave 10.195.30.21

 *Action*: stop the mesos master leader process (sudo stop mesos-master)

 *Expected*: mesos master leader has changed, active tasks / frameworks
 remain unchanged

 *End state: *
 + mesos master leader *has changed, now 10.195.30.21*
 + previously running APPTASK on the slave 10.195.30.21 has disappeared
 (not showing anymore in the mesos UI), but the *docker container is still
 running*
 + a *new APPTASK is now running on slave 10.195.30.19*
 + marathon framework registration time in the mesos UI shows "Just now"
 + marathon leader *has changed, now 10.195.30.20*


 

 Now come the 6 requested logs, which might contain interesting/relevant
 information, but as a newcomer to mesos I find them hard to read...


 *from previous MESOS master leader 10.195.30.19:*
 W0310 11:31:28.310518 24289 logging.cpp:81] RAW: Received signal SIGTERM
 from process 1 of user 0; exiting


 *from new MESOS master leader 10.195.30.21:*
 I0310 11:31:40.011545   922 detector.cpp:138] Detected a new leader:
 (id='2')
 I0310 11:31:40.011823   922 group.cpp:659] Trying to get
 '/mesos/info_02' in ZooKeeper
 I0310 11:31:40.015496   915 network.hpp:424] ZooKeeper group memberships
 changed
 I0310 11:31:40.015847   915 group.cpp:659] Trying to get
 '/mesos/log_replicas/00' in ZooKeeper
 I0310 11:31:40.016047   922 detector.cpp:433] A new leading master (UPID=
 master@10.195.30.21:5050) is detected
 I0310 11:31:40.016074   922 master.cpp:1263] The newly elected leader is
 master@10.195.30.21:5050 with id 20150310-112310-354337546-5050-895
 I0310 11:31:40.016089   922 master.cpp:1276] Elected as the leading
 master!
 I0310 11:31:40.016108   922 master.cpp:1094] Recovering from registrar
 I0310 11:31:40.016188   918 registrar.cpp:313] Recovering registrar
 I0310 11:31:40.016542   918 log.cpp:656] Attempting to start the writer
 I0310 11:31:40.016918   918 replica.cpp:474] Replica received implicit
 promise request with proposal 2
 I0310 11:31:40.017503   915 group.cpp:659] Trying to get
 '/mesos/log_replicas/03' in ZooKeeper
 I0310 11:31:40.017832   918 leveldb.cpp:306] Persisting metadata (8
 bytes) to leveldb took 893672ns
 I0310 11:31:40.017848   918 replica.cpp:342] Persisted promised to 2
 I0310 11:31:40.018817   915 network.hpp:466] ZooKeeper group PIDs: {
 log-replica(1)@10.195.30.20:5050, log-replica(1)@10.195.30.21:5050 }
 I0310 11:31:40.023022   923 coordinator.cpp:230] Coordinator attemping to
 fill missing position
 I0310 11:31:40.023110   923 log.cpp:672] Writer started with ending
 position 8
 I0310 11:31:40.023293   923 leveldb.cpp:438] Reading position from
 leveldb took 13195ns
 I0310 11:31:40.023309   923 leveldb.cpp:438] Reading position from
 leveldb took 3120ns
 I0310 11:31:40.023619   922 registrar.cpp:346] Successfully fetched the
 registry (610B) in 7.385856ms
 I0310 11:31:40.023679   922 registrar.cpp:445] Applied 1 operations in
 9263ns; attempting to update the 'registry'
 I0310 11:31:40.024238   922 log.cpp:680] Attempting to append 647 bytes
 to the log
 I0310 11:31:40.024279   923 coordinator.cpp:340] Coordinator attempting
 to write APPEND action at position 9
 I0310 11:31:40.024435   923 replica.cpp:508] Replica received write
 request for position 9
 I0310 11:31

Re: CPU resource allocation: ignore?

2015-03-11 Thread Geoffroy Jabouley
OK, so it seems better to keep the cpu isolator and use a small cpu share.


BTW, when trying to create a mesos task using Marathon with cpu=0.0, I get
the following errors:

[2015-03-11 17:05:48,395] INFO Received status update for task
test-app.7b6ad5d9-c808-11e4-946b-56847afe9799: *TASK_LOST (Task uses
invalid resources: cpus(*):0)* (mesosphere.marathon.MarathonScheduler:148)
[2015-03-11 17:05:48,402] INFO Task
test-app.7b6ad5d9-c808-11e4-946b-56847afe9799 expunged and removed from
TaskTracker (mesosphere.marathon.tasks.TaskTracker:107)

So I guess this is not possible.
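
For the record, what does seem to work is keeping a tiny but non-zero cpu share
(e.g. 0.01, which matches the MIN_CPUS constant Ian quotes below). A rough
sketch of posting such an app to Marathon's /v2/apps endpoint; the app id,
command and memory values are placeholders, and I assume Marathon answers on
localhost:8080:

import json

try:
    from urllib.request import urlopen, Request  # Python 3
except ImportError:
    from urllib2 import urlopen, Request          # Python 2

# Hypothetical app definition: smallest cpu share the allocator still accepts.
app = {
    "id": "/cpu-light-app",
    "cmd": "sleep 3600",
    "cpus": 0.01,   # tiny but non-zero; cpus=0.0 is rejected as invalid resources
    "mem": 128,
    "instances": 1,
}

request = Request("http://localhost:8080/v2/apps",
                  data=json.dumps(app).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
print(urlopen(request).read())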

2015-03-11 17:05 GMT+01:00 Ian Downes idow...@twitter.com:

 Sorry, I meant that no cpu isolator only means no isolation.

 *The allocator* *does enforce a non-zero cpu allocation*, specifically
 see MIN_CPUS defined in src/master/constants.cpp to be 0.01 and used by the
 allocator:

 HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocatable(
 const Resources& resources)
 {
   Option<double> cpus = resources.cpus();
   Option<Bytes> mem = resources.mem();

   return (cpus.isSome() && cpus.get() >= MIN_CPUS) ||
  (mem.isSome() && mem.get() >= MIN_MEM);
 }

 On Wed, Mar 11, 2015 at 8:54 AM, Connor Doyle con...@mesosphere.io
 wrote:

 If you don't care at all about accounting usage of that resource then you
 should be able to set it to 0.0.  As Ian mentioned, this won't be enforced
 with the cpu isolator disabled.
 --
 Connor

 On Mar 11, 2015, at 08:43, Ian Downes idow...@twitter.com wrote:

 The --isolation flag for the slave determines how resources are
 *isolated*, i.e., by not specifying any cpu isolator there will be no
 isolation between executors for cpu usage; the Linux scheduler will try to
 balance their execution.

 Cpu and memory are considered required resources for executors and I
 believe the master enforces this.

 What behavior are you trying to achieve? If your jobs don't require
 much cpu then can you not just set a small value, like 0.25 cpu?

 On Wed, Mar 11, 2015 at 7:20 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:

 Hello

 As cpu relative shares are *not very* relevant in our heterogeneous
 cluster, we would like to get rid of CPU resource management and only use
 MEM resources for our cluster and task allocation.

 Even when modifying the isolation flag of our slave to
 --isolation=cgroups/mem, we see these in the logs:

 *from the slave, at startup:*
 I0311 15:09:55.006750 50906 slave.cpp:289] Slave resources:
 ports(*):[31000-32000, 80-443]; *cpus(*):2*; mem(*):1979; disk(*):22974

 *from the master:*
 I0311 15:15:16.764714 50884 hierarchical_allocator_process.hpp:563]
 Recovered ports(*):[31000-32000, 80-443]; *cpus(*):2*; mem(*):1979;
 disk(*):22974 (total allocatable: ports(*):[31000-32000, 80-443];
 *cpus(*):2*; mem(*):1979; disk(*):22974) on slave
 20150311-150951-3982541578-5050-50860-S0 from framework
 20150311-150951-3982541578-5050-50860-

 And mesos master UI is showing both CPU and MEM resources status.



 BTW, we are using the Marathon and Jenkins frameworks to start our mesos
 tasks, and the cpus field seems mandatory (set to 1.0 by default). So I
 guess you cannot easily bypass cpu resource allocation...


 Any idea?
 Regards

 2015-02-19 15:15 GMT+01:00 Ryan Thomas r.n.tho...@gmail.com:

 Hey Don,

 Have you tried only setting the 'cgroups/mem' isolation flag on the
 slave and not the cpu one?

 http://mesosphere.com/docs/reference/mesos-slave/


 ryan

 On 19 February 2015 at 14:13, Donald Laidlaw donlaid...@me.com wrote:

 I am using Mesos 0.21.1 with Marathon 0.8.0 and running everything in
 docker containers.

 Is there a way to have mesos ignore the cpu relative shares? That is,
 not limit the docker container CPU at all when it runs. I would still want
 to have the Memory resource limitation, but would rather just let the linux
 system under the containers schedule all the CPU.

 This would allow us to just allocate tasks to mesos slaves based on
 available memory only, and to let those tasks get whatever CPU they could
 when they needed it. This is desirable where there can be lots of relatively
 high-memory tasks that have very low CPU requirements. Especially if we do
 not know the capabilities of the slave machines with regards to CPU. Some
 of them may have fast CPU's, some slow, so it is hard to pick a relative
 number for that slave.

 Thanks,

 Don Laidlaw








Re: CPU resource allocation: ignore?

2015-03-11 Thread Geoffroy Jabouley
Hello

As cpu relative shares are *not very* relevant in our heterogeneous
cluster, we would like to get rid of CPU resource management and only use
MEM resources for our cluster and task allocation.

Even when modifying the isolation flag of our slave to
--isolation=cgroups/mem, we see these in the logs:

*from the slave, at startup:*
I0311 15:09:55.006750 50906 slave.cpp:289] Slave resources:
ports(*):[31000-32000, 80-443]; *cpus(*):2*; mem(*):1979; disk(*):22974

*from the master:*
I0311 15:15:16.764714 50884 hierarchical_allocator_process.hpp:563]
Recovered ports(*):[31000-32000, 80-443]; *cpus(*):2*; mem(*):1979;
disk(*):22974 (total allocatable: ports(*):[31000-32000, 80-443];
*cpus(*):2*; mem(*):1979; disk(*):22974) on slave
20150311-150951-3982541578-5050-50860-S0 from framework
20150311-150951-3982541578-5050-50860-

And mesos master UI is showing both CPU and MEM resources status.



BTW, we are using the Marathon and Jenkins frameworks to start our mesos tasks,
and the cpus field seems mandatory (set to 1.0 by default). So I guess
you cannot easily bypass cpu resource allocation...


Any idea?
Regards

2015-02-19 15:15 GMT+01:00 Ryan Thomas r.n.tho...@gmail.com:

 Hey Don,

 Have you tried only setting the 'cgroups/mem' isolation flag on the slave
 and not the cpu one?

 http://mesosphere.com/docs/reference/mesos-slave/


 ryan

 On 19 February 2015 at 14:13, Donald Laidlaw donlaid...@me.com wrote:

 I am using Mesos 0.21.1 with Marathon 0.8.0 and running everything in
 docker containers.

 Is there a way to have mesos ignore the cpu relative shares? That is, not
 limit the docker container CPU at all when it runs. I would still want to
 have the Memory resource limitation, but would rather just let the linux
 system under the containers schedule all the CPU.

 This would allow us to just allocate tasks to mesos slaves based on
 available memory only, and to let those tasks get whatever CPU they could
 when they needed it. This is desirable where there can be lots of relatively
 high-memory tasks that have very low CPU requirements. Especially if we do
 not know the capabilities of the slave machines with regards to CPU. Some
 of them may have fast CPU's, some slow, so it is hard to pick a relative
 number for that slave.

 Thanks,

 Don Laidlaw





Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-10 Thread Geoffroy Jabouley
/2015 11:31:55.053]
[marathon-akka.actor.default-dispatcher-10] [akka://marathon/deadLetters]
Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters
encountered. This logging can be turned off or adjusted with configuration
settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
[2015-03-10 11:31:55,054] INFO Requesting task reconciliation with the
Mesos master (mesosphere.marathon.SchedulerActions:430)
[2015-03-10 11:31:55,064] INFO Received status update for task
ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799:
TASK_LOST (Reconciliation: Task is unknown to the slave)
(mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:31:55,069] INFO Need to scale
/ffaas-backoffice-app-nopersist from 0 up to 1 instances
(mesosphere.marathon.SchedulerActions:488)
[2015-03-10 11:31:55,069] INFO Queueing 1 new tasks for
/ffaas-backoffice-app-nopersist (0 queued)
(mesosphere.marathon.SchedulerActions:494)
[2015-03-10 11:31:55,069] INFO Task
ffaas-backoffice-app-nopersist.cc399489-c70f-11e4-ab88-56847afe9799
expunged and removed from TaskTracker
(mesosphere.marathon.tasks.TaskTracker:107)
[2015-03-10 11:31:55,070] INFO Sending event notification.
(mesosphere.marathon.MarathonScheduler:262)
[INFO] [03/10/2015 11:31:55.072] [marathon-akka.actor.default-dispatcher-7]
[akka://marathon/user/$b] POSTing to all endpoints.
[2015-03-10 11:31:55,073] INFO Need to scale
/ffaas-backoffice-app-nopersist from 0 up to 1 instances
(mesosphere.marathon.SchedulerActions:488)
[2015-03-10 11:31:55,074] INFO Already queued 1 tasks for
/ffaas-backoffice-app-nopersist. Not scaling.
(mesosphere.marathon.SchedulerActions:498)
...
...
...
[2015-03-10 11:31:57,682] INFO Received status update for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:31:57,694] INFO Sending event notification.
(mesosphere.marathon.MarathonScheduler:262)
[INFO] [03/10/2015 11:31:57.694]
[marathon-akka.actor.default-dispatcher-11] [akka://marathon/user/$b]
POSTing to all endpoints.
...
...
...
[2015-03-10 11:36:55,047] INFO Expunging orphaned tasks from store
(mesosphere.marathon.tasks.TaskTracker:170)
[INFO] [03/10/2015 11:36:55.050] [marathon-akka.actor.default-dispatcher-2]
[akka://marathon/deadLetters] Message
[mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from
Actor[akka://marathon/user/MarathonScheduler/$a#1562989663] to
Actor[akka://marathon/deadLetters] was not delivered. [2] dead letters
encountered. This logging can be turned off or adjusted with configuration
settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
[2015-03-10 11:36:55,057] INFO Syncing tasks for all apps
(mesosphere.marathon.SchedulerActions:403)
[2015-03-10 11:36:55,058] INFO Requesting task reconciliation with the
Mesos master (mesosphere.marathon.SchedulerActions:430)
[2015-03-10 11:36:55,063] INFO Received status update for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
TASK_RUNNING (Reconciliation: Latest task state)
(mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:36:55,065] INFO Received status update for task
ffaas-backoffice-app-nopersist.ad9b55a7-c710-11e4-83a7-56847afe9799:
TASK_RUNNING (Reconciliation: Latest task state)
(mesosphere.marathon.MarathonScheduler:148)
[2015-03-10 11:36:55,066] INFO Already running 1 instances of
/ffaas-backoffice-app-nopersist. Not scaling.
(mesosphere.marathon.SchedulerActions:512)



-- End of logs



2015-03-10 10:25 GMT+01:00 Adam Bordelon a...@mesosphere.io:

 This is certainly not the expected/desired behavior when failing over a
 mesos master in HA mode. In addition to the master logs Alex requested, can
 you also provide relevant portions of the slave logs for these tasks? If
 the slave processes themselves never failed over, checkpointing and slave
 recovery should be irrelevant. Are you running the mesos-slave itself
 inside a Docker, or any other non-traditional setup?

 FYI, --checkpoint defaults to true (and is removed in 0.22), --recover
 defaults to reconnect, and --strict defaults to true, so none of those
 are necessary.

 On Fri, Mar 6, 2015 at 10:09 AM, Alex Rukletsov a...@mesosphere.io
 wrote:

 Geoffroy,

 could you please provide master logs (both from killed and taking over
 masters)?

 On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:

 Hello

 we are facing some unexpected issues when testing the high availability
 behaviors of our mesos cluster.

 *Our use case:*

 *State*: the mesos cluster is up (3 machines), 1 docker task is running
 on each slave (started from marathon)

 *Action*: stop the mesos master leader process

 *Expected*: mesos master leader has changed, *active tasks remain
 unchanged*

 *Seen*: mesos master leader

Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-06 Thread Geoffroy Jabouley
Hello

we are facing some unexpected issues when testing the high availability
behaviors of our mesos cluster.

*Our use case:*

*State*: the mesos cluster is up (3 machines), 1 docker task is running on
each slave (started from marathon)

*Action*: stop the mesos master leader process

*Expected*: mesos master leader has changed, *active tasks remain unchanged*

*Seen*: mesos master leader has changed, *all active tasks are now FAILED
but docker containers are still running*, marathon detects FAILED tasks and
starts new tasks. We end up with 2 docker containers running on each machine,
but only one is linked to a RUNNING mesos task.


Is the observed behavior correct?

Have we misunderstood the high availability concept? We thought that doing
this use case would not have any impact on the current cluster state
(except leader re-election)

Thanks in advance for your help
Regards

---

our setup is the following:
3 identical mesos nodes with:
+ zookeeper
+ docker 1.5
+ mesos master 0.21.1 configured in HA mode
+ mesos slave 0.21.1 configured with checkpointing, strict and reconnect
+ marathon 0.8.0 configured in HA mode with checkpointing

---

Command lines:


*mesos-master*usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
--cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
--quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos

*mesos-slave*
/usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos
--executor_registration_timeout=5mins --hostname=10.195.30.19
--ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
--recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]

*marathon*
java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
-Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
/usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
--local_port_min 31000 --task_launch_timeout 30 --http_port 8080
--hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,10.195.30.20:2181
,10.195.30.21:2181/mesos


Re: Is mesos spamming me?

2015-02-01 Thread Geoffroy Jabouley
Hello

let's have a look at the message displayed in Jenkins log:

INFO: Offer not sufficient for slave request:
[name: cpus
type: SCALAR
scalar {
  value: 1.6
}
role: *
*== The Mesos slave is currently offering 1.6 CPU resources*

name: mem
type: SCALAR
scalar {
  value: 455.0
}
role: *
*== The Mesos slave is currently offering 455MB of RAM resources*


 name: disk
type: SCALAR
scalar {
  value: 32833.0
}
role: *
== The Mesos slave is currently offering 32GB of Disk resources

, name: ports
type: RANGES
ranges {
  range {
begin: 31000
end: 32000
  }
}
role: *
== The Mesos slave is currently offering ports between 31000 and 32000
(default)

]
[]



*Requested for Jenkins slave:  cpus: 0.2  mem:  704.0*

*== Your Jenkins slave is requesting 0.2 CPU and 704 MB of RAM*


So for me it is normal that your Jenkins slave request cannot be fulfilled
by Mesos, at least by *this mesos slave, as it only has 455MB of RAM to
offer and you need 704MB*.

FYI, the requested memory for a Jenkins slave is derived from the following
calculation: *Jenkins Slave Memory in MB + (Maximum number of Executors per
Slave * Jenkins Executor Memory in MB)*.
Maybe that's why you are seeing 704MB here and not 512MB as expected.
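
(As a purely illustrative example: 512 MB of Jenkins slave memory plus 2
executors at 96 MB each would give exactly 512 + 2*96 = 704 MB; the actual
numbers depend on your plugin configuration.)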


But if you have several other Mesos slaves each offering 2CPU/2GB RAM, then
this should not be a problem and the Jenkins slave should be created on
another Mesos slave (log message is something like offers match)

Are there any other apps running on your Mesos slave (another jenkins
slave, a jenkins master, ...) that would consume missing resources?



2015-02-02 6:11 GMT+01:00 Hepple, Robert rhep...@tnsi.com:

 On Sun, 2015-02-01 at 21:02 -0800, Vinod Kone wrote:
 
 
 
  On Sun, Feb 1, 2015 at 8:58 PM, Vinod Kone vinodk...@gmail.com
  wrote:
  By default mesos slave leaves some RAM and CPU for system
  processes. You can override this behavior by --resources flag.
 

 Yeah but ... the slave is reporting 1863Mb RAM and 2 CPUS - so how come
 that is rejected by jenkins which is asking for the default 0.1 cpu and
 512Mb RAM???


 Thanks


 Bob

  On Sun, Feb 1, 2015 at 6:05 PM, Hepple, Robert
  rhep...@tnsi.com wrote:
  On Fri, 2015-01-30 at 10:00 +0100, Geoffroy Jabouley
  wrote:
   Hello
  
  
   The message means that the received resource offer
  from Mesos cluster
   does not meet your jenkins slave requirements
  (memory or cpu). This is
   normal message.
  
 
  ... and here's another thing - the mesos master
  registers the slave as
  having 2*cpus and 1.8Gb RAM:
 
  I0202 11:43:47.623059 25809
  hierarchical_allocator_process.hpp:442] Added slave
  20150129-120204-1408111020-5050-10811-S18
  (ci00bldslv02
  v.ss.corp.cnp.tnsi.com) with cpus(*):2; mem(*):1863;
  disk(*):32961; ports(*):[31000-32000] (and cpus(*):2;
  mem(*):1863; disk(*):32961;
  ports(*):[31000-32000] available)
 
 
  --
  Senior Software Engineer
  T. 07 3224 9778
  M. 04 1177 6888
  Level 20, 300 Adelaide Street, Brisbane QLD 4000,
  Australia.
 

 --
 Senior Software Engineer
 T. 07 3224 9778
 M. 04 1177 6888
 Level 20, 300 Adelaide Street, Brisbane QLD 4000, Australia.


Re: Is mesos spamming me?

2015-01-30 Thread Geoffroy Jabouley
Hello

The message means that the received resource offer from the Mesos cluster does
not meet your Jenkins slave requirements (memory or cpu). This is a normal
message.


You can filter logs from specific classes in Jenkins:

   1. from the webUI, in the jenkins_url/log/levels panel, set the
   logging level for org.jenkinsci.plugins.mesos.JenkinsScheduler to
   *WARNING*
   2. use a logging.properties file


We use the second solution.


Content of the logging.properties file is:

--

# Global logging handlers
handlers=java.util.logging.ConsoleHandler

# Define custom logger for Jenkins mesos plugin (too verbose!)
org.jenkinsci.plugins.mesos.JenkinsScheduler=java.util.logging.ConsoleHandler
org.jenkinsci.plugins.mesos.JenkinsScheduler.useParentHandlers=FALSE
org.jenkinsci.plugins.mesos.JenkinsScheduler.level=WARNING

# Define common logging configuration
java.util.logging.ConsoleHandler.level=INFO
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter

--

Jenkins instance is then started using: *java
-Djava.util.logging.config.file=/path/to/logging.properties -jar
$HOME/jenkins.war*


One drawback of this solution is that it also filters other interesting
logs from the mesos JenkinsScheduler class...

Hope this helps
Regards

2015-01-30 6:06 GMT+01:00 Hepple, Robert rhep...@tnsi.com:

 I have a single mesos master and 19 slaves. I have several jenkins
 servers making on-demand requests using the jenkins-mesos plugin - it
 all seems to be working correctly, mesos slaves are assigned to the
 jenkins servers, they execute jobs and eventually they detach.

 Except.

 Except the jenkins servers are getting spammed about every 1 or 2
 seconds with this in /var/log/jenkins/jenkins.log:

 Jan 30, 2015 2:59:15 PM org.jenkinsci.plugins.mesos.JenkinsScheduler
 matches
 WARNING: Ignoring disk resources from offer
 Jan 30, 2015 2:59:15 PM org.jenkinsci.plugins.mesos.JenkinsScheduler
 matches
 INFO: Ignoring ports resources from offer
 Jan 30, 2015 2:59:15 PM org.jenkinsci.plugins.mesos.JenkinsScheduler
 matches
 INFO: Offer not sufficient for slave request:
 [name: cpus
 type: SCALAR
 scalar {
   value: 1.6
 }
 role: *
 , name: mem
 type: SCALAR
 scalar {
   value: 455.0
 }
 role: *
 , name: disk
 type: SCALAR
 scalar {
   value: 32833.0
 }
 role: *
 , name: ports
 type: RANGES
 ranges {
   range {
 begin: 31000
 end: 32000
   }
 }
 role: *
 ]
 []
 Requested for Jenkins slave:
   cpus: 0.2
   mem:  704.0
   attributes:


 The mesos master side is also hitting the logs with eg:

 I0130 14:59:43.789172 10828 master.cpp:2344] Processing reply for offers:
 [ 20150129-120204-1408111020-5050-10811-O665754 ] on slave
 20150129-120204-1408111020-5050-10811-S2 at slave(1)@172.17.238.75:5051 (
 ci00bldslv15v.ss.corp.cnp.tnsi.com) for framework
 20150129-120204-1408111020-5050-10811-0001 (Jenkins Scheduler) at
 scheduler-1aab9acc-fba9-4123-b1ac-56ce74c0365b@172.17.152.201:54503
 I0130 14:59:43.789654 10828 master.cpp:2344] Processing reply for offers:
 [ 20150129-120204-1408111020-5050-10811-O665755 ] on slave
 20150129-120204-1408111020-5050-10811-S13 at slave(1)@172.17.238.98:5051 (
 ci00bldslv12v.ss.corp.cnp.tnsi.com) for framework
 20150129-120204-1408111020-5050-10811-0001 (Jenkins Scheduler) at
 scheduler-1aab9acc-fba9-4123-b1ac-56ce74c0365b@172.17.152.201:54503
 I0130 14:59:43.790004 10828 master.cpp:2344] Processing reply for offers:
 [ 20150129-120204-1408111020-5050-10811-O665756 ] on slave
 20150129-120204-1408111020-5050-10811-S11 at slave(1)@172.17.238.95:5051 (
 ci00bldslv11v.ss.corp.cnp.tnsi.com) for framework
 20150129-120204-1408111020-5050-10811-0001 (Jenkins Scheduler) at
 scheduler-1aab9acc-fba9-4123-b1ac-56ce74c0365b@172.17.152.201:54503
 I0130 14:59:43.790349 10828 master.cpp:2344] Processing reply for offers:
 [ 20150129-120204-1408111020-5050-10811-O665757 ] on slave
 20150129-120204-1408111020-5050-10811-S7 at slave(1)@172.17.238.108:5051 (
 ci00bldslv19v.ss.corp.cnp.tnsi.com) for framework
 20150129-120204-1408111020-5050-10811-0001 (Jenkins Scheduler) at
 scheduler-1aab9acc-fba9-4123-b1ac-56ce74c0365b@172.17.152.201:54503
 I0130 14:59:43.790670 10828 master.cpp:2344] Processing reply for offers:
 [ 20150129-120204-1408111020-5050-10811-O665758 ] on slave
 20150129-120204-1408111020-5050-10811-S14 at slave(1)@172.17.238.78:5051 (
 ci00bldslv06v.ss.corp.cnp.tnsi.com) for framework
 20150129-120204-1408111020-5050-10811-0001 (Jenkins Scheduler) at
 scheduler-1aab9acc-fba9-4123-b1ac-56ce74c0365b@172.17.152.201:54503
 I0130 14:59:43.791192 10828 hierarchical_allocator_process.hpp:563]
 Recovered cpus(*):1.6; mem(*):453; disk(*):32961; ports(*):[31000-32000]
 (total allocatable: cpus(*):1.6; mem(*):453; disk(*):32961;
 ports(*):[31000-32000]) on slave 20150129-120204-1408111020-5050-10811-S2
 from framework 20150129-120204-1408111020-5050-10811-0001
 I0130 14:59:43.791507 10828 hierarchical_allocator_process.hpp:563]
 Recovered 

Re: Unable to follow Sandbox links from Mesos UI.

2015-01-26 Thread Geoffroy Jabouley
Hello

just in case, which internet browser are you using?

Have you installed any extensions (NoScript, Ghostery, ...) that could
prevent the /static/pailer display?

I personally use NoScript with Firefox, and I have to turn it off for all
the IP addresses of our cluster to correctly access slave information from the Mesos UI.

My 2 cents
Regards

2015-01-26 21:08 GMT+01:00 Suijian Zhou suijian.z...@ige-project.eu:

 Hi, Alex,
   Yes, I can see the link points to the slave machine when I hover on the
 Download button and stdout/stderr can be downloaded. So do you mean it is
 expected/designed that clicking on 'stdout/stderr' themselves will not show
 you anything? Thanks!

 Cheers,
 Dan


 2015-01-26 7:44 GMT-06:00 Alex Rukletsov a...@mesosphere.io:

 Dan,

 that's correct. The 'static/pailer.html' is a page that lives on the
 master and it gets a url to the actual slave as a parameter. The url
 is computed in 'controllers.js' based on where the associated executor
 lives. You should see this 'actual' url if you hover over the Download
 button. Please check this url for correctness and that you can access
 it from your browser.

 On Fri, Jan 23, 2015 at 9:24 PM, Dan Dong dongda...@gmail.com wrote:
  I see the problem: when I move the cursor onto the link, e.g. stderr, it
  actually points to the IP address of the master machine, so it tries to
  follow links of Master_IP:/tmp/mesos/slaves/...
  which is not there. So why does the link not point to the IP address of the
  slaves (config problems somewhere?)?
 
  Cheers,
  Dan
 
 
  2015-01-23 11:25 GMT-06:00 Dick Davies d...@hellooperator.net:
 
  Start with 'inspect element' in the browser and see if that gives any
  clues.
  Sounds like your network is a little strict so it may be something
  else needs opening up.
 
  On 23 January 2015 at 16:56, Dan Dong dongda...@gmail.com wrote:
   Hi, Alex,
 That is what expected, but when I click on it, it pops a new blank
   window(pailer.html) without the content of the file(9KB size). Any
   hints?
  
   Cheers,
   Dan
  
  
   2015-01-23 4:37 GMT-06:00 Alex Rukletsov a...@mesosphere.io:
  
   Dan,
  
   you should be able to view file contents just by clicking on the
 link.
  
   On Thu, Jan 22, 2015 at 9:57 PM, Dan Dong dongda...@gmail.com
 wrote:
  
   Yes, --hostname solves the problem. Now I can see all files there
 like
   stdout, stderr etc, but when I click on e.g stdout, it pops a new
   blank
   window(pailer.html) without the content of the file(9KB size).
   Although it
   provides a Download link beside, it would be much more
 convenient if
   one
   can view the stdout and stderr directly. Is this normal or there is
   still
   problem on my envs? Thanks!
  
   Cheers,
   Dan
  
  
   2015-01-22 11:33 GMT-06:00 Adam Bordelon a...@mesosphere.io:
  
   Try the --hostname parameters for master/slave. If you want to be
   extra
   explicit about the IP (e.g. publish the public IP instead of the
   private one
   in a cloud environment), you can also set the --ip parameter on
   master/slave.
  
   On Thu, Jan 22, 2015 at 8:43 AM, Dan Dong dongda...@gmail.com
   wrote:
  
   Thanks Ryan, yes, from the machine where the browser is on slave
   hostnames could not be resolved, so that's why failure, but it
 can
   reach
   them by IP address( I don't think sys admin would like to add
 those
   VMs
   entries to /etc/hosts on the server).  I tried to change masters
 and
   slaves
   of mesos to IP addresses instead of hostname but UI still points
 to
   hostnames of slaves. Is threre a way to let mesos only use IP
   address of
   master and slaves?
  
   Cheers,
   Dan
  
  
   2015-01-22 9:48 GMT-06:00 Ryan Thomas r.n.tho...@gmail.com:
  
   It is a request from your browser session, not from the master
 that
   is
   going to the slaves - so in order to view the sandbox you need
 to
   ensure
   that the machine your browser is on can resolve and route to the
   masters
   _and_ the slaves.
  
   The master doesn't proxy the sandbox requests through itself
 (yet)
   -
   they are made directly from your browser instance to the slaves.
  
   Make sure you can resolve the slaves from the machine you're
   browsing
   the UI on.
  
   Cheers,
  
   ryan
  
   On 22 January 2015 at 15:42, Dan Dong dongda...@gmail.com
 wrote:
  
   Thank you all, the master and slaves can resolve each others'
   hostname and ssh login without password, firewalls have been
   switched off on
   all the machines too.
   So I'm confused what will block such a pull of info of slaves
 from
   UI?
  
   Cheers,
   Dan
  
  
   2015-01-21 16:35 GMT-06:00 Cody Maloney c...@mesosphere.io:
  
   Also see https://issues.apache.org/jira/browse/MESOS-2129 if
 you
   want to track progress on changing this.
  
   Unfortunately it is on hold for me at the moment to fix.
  
   Cody
  
   On Wed, Jan 21, 2015 at 2:07 PM, Ryan Thomas
   r.n.tho...@gmail.com
   wrote:
  
   Hey Dan,
  
   The UI will attempt to pull that info directly from the
 

Re: Task Checkpointing with Mesos, Marathon and Docker containers

2014-12-01 Thread Geoffroy Jabouley
Hello

the idea is to be able to tune the mesos slave configuration (attributes,
resource offers, general options, ... upgrades?) without altering the
tasks currently running on this mesos slave (a dockerized jenkins instance +
docker jenkins slaves, for example).

I am setting up a test cluster with the latest mesos/marathon releases to
check whether the behaviors are identical.


2014-12-01 19:28 GMT+01:00 Benjamin Mahler benjamin.mah...@gmail.com:

  I would like to be able to shutdown a mesos-slave for maintenance
 without altering the current tasks.

 What are you trying to do? If your maintenance operation does not affect
 the tasks, why do you need to stop the slave in the first place?

 On Wed, Nov 26, 2014 at 1:36 AM, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:

 Hello all

 thanks for your answers.

 Is there a way of configuring this 75s timeout for slave reconnection?

 I think that my problem is that as the task status is lost:
 - the marathon framework detects the loss and starts another instance
 - the mesos-slave, when restarting, detects the lost task and restarts a new
 one

 == 2 tasks on mesos cluster, 2 running docker containers, 1 app instance
 in marathon


 So a solution would be to extend the 75s timeout. I thought that my
 command lines for starting the cluster were fine, but it seems incomplete...

 I would like to be able to shutdown a mesos-slave for maintenance without
 altering the current tasks.

 2014-11-25 18:30 GMT+01:00 Connor Doyle con...@mesosphere.io:

 Hi Geoffroy,

 For the Marathon instances, in all released version of Marathon you must
 supply the --checkpoint flag to turn on task checkpointing for the
 framework.  We've changed the default to true starting with the next
 release.

 There is a bug in Mesos where the FrameworkInfo does not get updated
 when a framework re-registers.  This means that if you shut down Marathon
 and restart it with --checkpoint, the Mesos master (with the same
 FrameworkId, which Marathon picks up from ZK) will ignore the new setting.
 For reference, here is the design doc to address that:
 https://cwiki.apache.org/confluence/display/MESOS/Design+doc%3A+Updating+Framework+Info

 Fortunately, there is an easy workaround.

 1) Shut down Marathon (tasks keep running)
 2) Restart the leading Mesos master (tasks keep running)
 3) Start Marathon with --checkpoint enabled

 This works by clearing the Mesos master's in-memory state.  It is
 rebuilt as the slave nodes and frameworks re-register.

 Please report back if this doesn't solve the issue for you.
 --
 Connor


  On Nov 25, 2014, at 07:43, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:
 
  Hello
 
  i am currently trying to activate checkpointing for my Mesos cloud.
 
  Starting from an application running in a docker container on the
 cluster, launched from marathon, my use cases are the followings:
 
  UC1: kill the marathon service, then restart after 2 minutes.
  Expected: the mesos task is still active, the docker container is
 running. When the marathon service restarts, it get backs its tasks.
 
  Result: OK
 
 
  UC2: kill the mesos slave, then restart after 2 minutes.
  Expected: the mesos task remains active, the docker container is
 running. When the mesos slave service restarts, it get backs its tasks.
 Marathon does not show error.
 
  Results: the task gets status LOST when the slave is killed. The Docker container
  is still running.  Marathon detects the application went down and spawns a new
  one on another available mesos slave. When the slave restarts, it kills the
  previously running container and starts a new one. So I end up with 2
  applications on my cluster, one spawned by Marathon, and another orphan one.
 
 
  Is this behavior normal? Can you please explain what i am doing wrong?
 
 
 ---
 
  Here is the configuration i have come so far:
  Mesos 0.19.1 (not dockerized)
  Marathon 0.6.1 (not dockerized)
  Docker 1.3 + Deimos 0.4.2
 
  Mesos master is started:
  /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050
 --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=...
 --quorum=1 --work_dir=/var/lib/mesos
 
  Mesos slave is started:
  /usr/local/sbin/mesos-slave --master=zk://...:2181/mesos
 --log_dir=/var/log/mesos --checkpoint=true
 --containerizer_path=/usr/local/bin/deimos
 --executor_registration_timeout=5mins --hostname=... --ip=...
 --isolation=external --recover=reconnect --recovery_timeout=120mins
 --strict=true
 
  Marathon is started:
  java -Xmx512m -Djava.library.path=/usr/local/lib
 -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp
 /usr/local/bin/marathon mesosphere.marathon.Main --zk
 zk://...:2181/marathon --master zk://...:2181/mesos --local_port_min 3
 --hostname ... --event_subscriber http_callback --http_port 8080
 --task_launch_timeout 30 --local_port_max 4 --ha --checkpoint
 
 
 
 






Re: Task Checkpointing with Mesos, Marathon and Docker containers

2014-11-26 Thread Geoffroy Jabouley
Hello all

thanks for your answers.

Is there a way of configuring this 75s timeout for slave reconnection?

I think that my problem is that as the task status is lost:
- the marathon framework detects the loss and starts another instance
- the mesos-slave, when restarting, detects the lost task and restarts a new one

== 2 tasks on mesos cluster, 2 running docker containers, 1 app instance
in marathon


So a solution would be to extend the 75s timeout. I thought that my command
lines for starting the cluster were fine, but it seems incomplete...

I would like to be able to shutdown a mesos-slave for maintenance without
altering the current tasks.

2014-11-25 18:30 GMT+01:00 Connor Doyle con...@mesosphere.io:

 Hi Geoffroy,

 For the Marathon instances, in all released version of Marathon you must
 supply the --checkpoint flag to turn on task checkpointing for the
 framework.  We've changed the default to true starting with the next
 release.

 There is a bug in Mesos where the FrameworkInfo does not get updated when
 a framework re-registers.  This means that if you shut down Marathon and
 restart it with --checkpoint, the Mesos master (with the same FrameworkId,
 which Marathon picks up from ZK) will ignore the new setting.  For
 reference, here is the design doc to address that:
 https://cwiki.apache.org/confluence/display/MESOS/Design+doc%3A+Updating+Framework+Info

 Fortunately, there is an easy workaround.

 1) Shut down Marathon (tasks keep running)
 2) Restart the leading Mesos master (tasks keep running)
 3) Start Marathon with --checkpoint enabled

 This works by clearing the Mesos master's in-memory state.  It is rebuilt
 as the slave nodes and frameworks re-register.

 Please report back if this doesn't solve the issue for you.
 --
 Connor


  On Nov 25, 2014, at 07:43, Geoffroy Jabouley 
 geoffroy.jabou...@gmail.com wrote:
 
  Hello
 
  i am currently trying to activate checkpointing for my Mesos cloud.
 
  Starting from an application running in a docker container on the
 cluster, launched from marathon, my use cases are the followings:
 
  UC1: kill the marathon service, then restart after 2 minutes.
  Expected: the mesos task is still active, the docker container is
 running. When the marathon service restarts, it get backs its tasks.
 
  Result: OK
 
 
  UC2: kill the mesos slave, then restart after 2 minutes.
  Expected: the mesos task remains active, the docker container is
 running. When the mesos slave service restarts, it get backs its tasks.
 Marathon does not show error.
 
  Results: the task gets status LOST when the slave is killed. The Docker container
  is still running.  Marathon detects the application went down and spawns a new
  one on another available mesos slave. When the slave restarts, it kills the
  previously running container and starts a new one. So I end up with 2
  applications on my cluster, one spawned by Marathon, and another orphan one.
 
 
  Is this behavior normal? Can you please explain what i am doing wrong?
 
 
 ---
 
  Here is the configuration i have come so far:
  Mesos 0.19.1 (not dockerized)
  Marathon 0.6.1 (not dockerized)
  Docker 1.3 + Deimos 0.4.2
 
  Mesos master is started:
  /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050
 --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=...
 --quorum=1 --work_dir=/var/lib/mesos
 
  Mesos slave is started:
  /usr/local/sbin/mesos-slave --master=zk://...:2181/mesos
 --log_dir=/var/log/mesos --checkpoint=true
 --containerizer_path=/usr/local/bin/deimos
 --executor_registration_timeout=5mins --hostname=... --ip=...
 --isolation=external --recover=reconnect --recovery_timeout=120mins
 --strict=true
 
  Marathon is started:
  java -Xmx512m -Djava.library.path=/usr/local/lib
 -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp
 /usr/local/bin/marathon mesosphere.marathon.Main --zk
 zk://...:2181/marathon --master zk://...:2181/mesos --local_port_min 3
 --hostname ... --event_subscriber http_callback --http_port 8080
 --task_launch_timeout 30 --local_port_max 4 --ha --checkpoint
 
 
 
 




Task Checkpointing with Mesos, Marathon and Docker containers

2014-11-25 Thread Geoffroy Jabouley
Hello

I am currently trying to activate checkpointing for my Mesos cloud.

Starting from an application running in a docker container on the cluster,
launched from marathon, my use cases are the following:

*UC1: kill the marathon service, then restart after 2 minutes.*
*Expected*: the mesos task is still active, the docker container is
running. When the marathon service restarts, it gets back its tasks.

*Result*: OK


*UC2: kill the mesos slave, then restart after 2 minutes.*
*Expected*: the mesos task remains active, the docker container is running.
When the mesos slave service restarts, it gets back its tasks. Marathon
does not show an error.

*Results*: the task gets status LOST when the slave is killed. The Docker container
is still running.  Marathon detects the application went down and spawns a new
one on another available mesos slave. When the slave restarts, it kills the
previously running container and starts a new one. So I end up with 2
applications on my cluster, one spawned by Marathon, and another orphan one.


Is this behavior normal? Can you please explain what I am doing wrong?

---

Here is the configuration I have arrived at so far:
Mesos 0.19.1 (not dockerized)
Marathon 0.6.1 (not dockerized)
Docker 1.3 + Deimos 0.4.2

Mesos master is started:
*/usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050
--log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=...
--quorum=1 --work_dir=/var/lib/mesos*

Mesos slave is started:
*/usr/local/sbin/mesos-slave --master=zk://...:2181/mesos
--log_dir=/var/log/mesos --checkpoint=true
--containerizer_path=/usr/local/bin/deimos
--executor_registration_timeout=5mins --hostname=... --ip=...
--isolation=external --recover=reconnect --recovery_timeout=120mins
--strict=true*

Marathon is started:
*java -Xmx512m -Djava.library.path=/usr/local/lib
-Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp
/usr/local/bin/marathon mesosphere.marathon.Main --zk
zk://...:2181/marathon --master zk://...:2181/mesos --local_port_min 3
--hostname ... --event_subscriber http_callback --http_port 8080
--task_launch_timeout 30 --local_port_max 4 --ha --checkpoint*