reserving resources for host/mesos

2018-06-12 Thread Paul Mackles
Hi - Basic question that I couldn’t find an answer to in existing docs…
when configuring the available resources on a slave, is it appropriate to
leave some resources over for the mesos-agent itself (and the host OS)? Any
pointers on existing configs folks are using would be appreciated.
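
(One common pattern, offered here only as a sketch: advertise less than the
machine actually has via the agent's --resources flag, leaving the remainder
for the OS and the agent itself. The /etc/mesos-slave flag-file path is an
assumption based on the Mesosphere packages; adjust for how your agent is
launched.)

  # host has 8 CPUs / 16 GB RAM; offer 7 CPUs / 14 GB to frameworks
  echo 'cpus:7;mem:14336' > /etc/mesos-slave/resources
  service mesos-slave restart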

-- 
Thanks,
Paul


Re: Mesos loses track of Docker containers

2016-08-14 Thread Paul
Thank you, Sivaram.

That would seem to be 2 "votes" for upgrading.

-Paul


> On Aug 13, 2016, at 11:47 PM, Sivaram Kannan <sivara...@gmail.com> wrote:
> 
> 
> I don't remember the condition exactly, but I have faced a similar issue in my 
> deployments and it was fixed when I moved to 0.26.0. Upgrade Marathon to a 
> compatible version as well.
> 
>> On Wed, Aug 10, 2016 at 9:30 AM, Paul Bell <arach...@gmail.com> wrote:
>> Hi Jeff,
>> 
>> Thanks for your reply.
>> 
>> Yeah, that thought occurred to me late last night. But the customer is 
>> sensitive to too much churn, so it wouldn't be my first choice. If I knew 
>> with certainty that such a problem existed in the versions they are running 
>> AND that more recent versions fixed it, then I'd do my best to compel the 
>> upgrade. 
>> 
>> Docker version is also old, 1.6.2.
>> 
>> -Paul
>> 
>>> On Wed, Aug 10, 2016 at 9:18 AM, Jeff Schroeder 
>>> <jeffschroe...@computer.org> wrote:
>>> Have you considered upgrading Mesos and Marathon? Those are quite old 
>>> versions of both with some fairly glaring problems with the docker 
>>> containerizer if memory serves. Also what version of docker?
>>> 
>>> 
>>>> On Wednesday, August 10, 2016, Paul Bell <arach...@gmail.com> wrote:
>>>> Hello,
>>>> 
>>>> One of our customers has twice encountered a problem wherein Mesos & 
>>>> Marathon appear to lose track of the application containers that they 
>>>> started. 
>>>> 
>>>> Platform & version info:
>>>> 
>>>> Ubuntu 14.04 (running under VMware)
>>>> Mesos (master & agent): 0.23.0
>>>> ZK: 3.4.5--1
>>>> Marathon: 0.10.0
>>>> 
>>>> The phenomena:
>>>> 
>>>> When I log into either the Mesos or Marathon UIs I see no evidence of 
>>>> *any* tasks, active or completed. Yet, in the Linux shell, a "docker ps" 
>>>> command shows the containers up & running. 
>>>> 
>>>> I've seen some confusing appearances before, but never this. For example, 
>>>> I've seen what might be described as the reverse of the above phenomena. I 
>>>> mean the case where a customer power cycles the VM. In such a case you 
>>>> typically see in Marathon's UI the (mere) appearance of the containers up 
>>>> & running, but a "docker ps" command shows no containers running. As folks 
>>>> on this list have explained to me, this is the result of "stale state" and 
>>>> after 10 minutes (by default), Mesos figures out that the supposedly 
>>>> active tasks aren't there and restarts them.
>>>> 
>>>> But that's not the case here. I am hard-pressed to understand what 
>>>> conditions/causes might lead to Mesos & Marathon becoming unaware of 
>>>> containers that they started.
>>>> 
>>>> I would be very grateful if someone could help me understand what's going 
>>>> on here (so would our customer!).
>>>> 
>>>> Thanks.
>>>> 
>>>> -Paul
>>> 
>>> 
>>> -- 
>>> Text by Jeff, typos by iPhone
> 
> 
> 
> -- 
> ever tried. ever failed. no matter.
> try again. fail again. fail better.
> -- Samuel Beckett


Re: Mesos loses track of Docker containers

2016-08-10 Thread Paul Bell
Hi Jeff,

Thanks for your reply.

Yeah, that thought occurred to me late last night. But the customer is
sensitive to too much churn, so it wouldn't be my first choice. If I knew
with certainty that such a problem existed in the versions they are running
AND that more recent versions fixed it, then I'd do my best to compel the
upgrade.

Docker version is also old, 1.6.2.

-Paul

On Wed, Aug 10, 2016 at 9:18 AM, Jeff Schroeder <jeffschroe...@computer.org>
wrote:

> Have you considered upgrading Mesos and Marathon? Those are quite old
> versions of both with some fairly glaring problems with the docker
> containerizer if memory serves. Also what version of docker?
>
>
> On Wednesday, August 10, 2016, Paul Bell <arach...@gmail.com> wrote:
>
>> Hello,
>>
>> One of our customers has twice encountered a problem wherein Mesos &
>> Marathon appear to lose track of the application containers that they
>> started.
>>
>> Platform & version info:
>>
>> Ubuntu 14.04 (running under VMware)
>> Mesos (master & agent): 0.23.0
>> ZK: 3.4.5--1
>> Marathon: 0.10.0
>>
>> The phenomena:
>>
>> When I log into either the Mesos or Marathon UIs I see no evidence of
>> *any* tasks, active or completed. Yet, in the Linux shell, a "docker ps"
>> command shows the containers up & running.
>>
>> I've seen some confusing appearances before, but never this. For example,
>> I've seen what might be described as the *reverse* of the above
>> phenomena. I mean the case where a customer power cycles the VM. In such a
>> case you typically see in Marathon's UI the (mere) appearance of the
>> containers up & running, but a "docker ps" command shows no containers
>> running. As folks on this list have explained to me, this is the result of
>> "stale state" and after 10 minutes (by default), Mesos figures out that the
>> supposedly active tasks aren't there and restarts them.
>>
>> But that's not the case here. I am hard-pressed to understand what
>> conditions/causes might lead to Mesos & Marathon becoming unaware of
>> containers that they started.
>>
>> I would be very grateful if someone could help me understand what's going
>> on here (so would our customer!).
>>
>> Thanks.
>>
>> -Paul
>>
>>
>>
>
> --
> Text by Jeff, typos by iPhone
>


Mesos loses track of Docker containers

2016-08-10 Thread Paul Bell
Hello,

One of our customers has twice encountered a problem wherein Mesos &
Marathon appear to lose track of the application containers that they
started.

Platform & version info:

Ubuntu 14.04 (running under VMware)
Mesos (master & agent): 0.23.0
ZK: 3.4.5--1
Marathon: 0.10.0

The phenomena:

When I log into either the Mesos or Marathon UIs I see no evidence of *any*
tasks, active or completed. Yet, in the Linux shell, a "docker ps" command
shows the containers up & running.

I've seen some confusing appearances before, but never this. For example,
I've seen what might be described as the *reverse* of the above phenomena.
I mean the case where a customer power cycles the VM. In such a case you
typically see in Marathon's UI the (mere) appearance of the containers up &
running, but a "docker ps" command shows no containers running. As folks on
this list have explained to me, this is the result of "stale state" and
after 10 minutes (by default), Mesos figures out that the supposedly active
tasks aren't there and restarts them.

But that's not the case here. I am hard-pressed to understand what
conditions/causes might lead to Mesos & Marathon becoming unaware of
containers that they started.

I would be very grateful if someone could help me understand what's going
on here (so would our customer!).

Thanks.

-Paul
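
A small diagnostic sketch for comparing the two views side by side (host names
are placeholders; endpoints assumed from Mesos 0.23 / Marathon 0.10):

  # what Docker thinks is running on the agent
  docker ps
  # what the Mesos master thinks is running (check the frameworks' "tasks" arrays)
  curl -s http://<master-host>:5050/state.json
  # what Marathon thinks is deployed
  curl -s http://<marathon-host>:8080/v2/apps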


Re: What's the official pronounce of mesos?

2016-07-13 Thread Paul
Sadly, I don't understand a whole lot about Mesos, but I did learn Ancient 
Greek in college, taught it for a couple of years, and have even translated 
parts of Homer's Iliad.

 μέσος

The 'e' (epsilon) in 'Mesos' would be pronounced like the 'e' in the English 
word 'pet'. The 'o' (omicron) as in 'hot'.

But, at least to English ears, that pronunciation feels a bit stilted. So I 
think Rodrick's right to sound the 'o' as long, as in 'tone'.

-Paul

> On Jul 13, 2016, at 9:12 PM, Rodrick Brown <rodr...@orchard-app.com> wrote:
> 
> Mess-O's 
> 
> Get Outlook for iOS
> 
> 
> 
> 
> On Wed, Jul 13, 2016 at 7:56 PM -0400, "zhiwei" <zhiw...@gmail.com> wrote:
> 
>> Hi,
>> 
>> I saw in some videos, different people pronounce 'mesos' differently.
>> 
>> Can someone add the official pronounce of mesos to wikipedia?
> 


Re: Consequences of health-check timeouts?

2016-05-18 Thread Paul Bell
Hi Hasodent,

Thanks for your reply.

In re executor_shutdown_grace_period: how would this enable the task
(MongoDB) to terminate gracefully? (BTW: I am fairly certain that the mongo
STDOUT as captured by Mesos shows that it received signal 15 just before it
said good-bye). My naive understanding of this grace period is that it
simply delays the termination of the executor.

The following snippet is from /var/log/syslog. I believe it shows the stack
trace (largely in the kernel) that led to mesos-master being blocked for
more than 120 seconds. Please note that immediately above (before) the
blocked mesos-master is a blocked jbd2/dm. Immediately below (after) the
blocked mesos-master is a blocked java task. I'm not sure what the java
task is. This took place on the mesos-master node and none of our
applications runs there. It runs master, Marathon, and ZK. Maybe the java
task is Marathon or ZK?

Thanks again.

-Paul

May 16 20:06:53 71 kernel: [193339.890848] INFO: task mesos-master:4013
blocked for more than 120 seconds.

May 16 20:06:53 71 kernel: [193339.890873]   Not tainted
3.13.0-32-generic #57-Ubuntu

May 16 20:06:53 71 kernel: [193339.890889] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.

May 16 20:06:53 71 kernel: [193339.890912] mesos-masterD
88013fd94440 0  4013  1 0x

May 16 20:06:53 71 kernel: [193339.890914]  880137429a28
0002 880135778000 880137429fd8

May 16 20:06:53 71 kernel: [193339.890916]  00014440
00014440 880135778000 88013fd94cd8

May 16 20:06:53 71 kernel: [193339.890918]  88013ffd34b0
0002 81284630 880137429aa0

May 16 20:06:53 71 kernel: [193339.890919] Call Trace:

May 16 20:06:53 71 kernel: [193339.890922]  [] ?
start_this_handle+0x590/0x590

May 16 20:06:53 71 kernel: [193339.890924]  []
io_schedule+0x9d/0x140

May 16 20:06:53 71 kernel: [193339.890925]  []
sleep_on_shadow_bh+0xe/0x20

May 16 20:06:53 71 kernel: [193339.890927]  []
__wait_on_bit+0x62/0x90

May 16 20:06:53 71 kernel: [193339.890929]  [] ?
start_this_handle+0x590/0x590

May 16 20:06:53 71 kernel: [193339.890930]  []
out_of_line_wait_on_bit+0x77/0x90

May 16 20:06:53 71 kernel: [193339.890932]  [] ?
autoremove_wake_function+0x40/0x40

May 16 20:06:53 71 kernel: [193339.890934]  [] ?
wake_up_bit+0x25/0x30

May 16 20:06:53 71 kernel: [193339.890936]  []
do_get_write_access+0x2ad/0x4f0

May 16 20:06:53 71 kernel: [193339.890938]  [] ?
__getblk+0x2d/0x2e0

May 16 20:06:53 71 kernel: [193339.890939]  []
jbd2_journal_get_write_access+0x27/0x40

May 16 20:06:53 71 kernel: [193339.890942]  []
__ext4_journal_get_write_access+0x3b/0x80

May 16 20:06:53 71 kernel: [193339.890946]  []
ext4_reserve_inode_write+0x70/0xa0

May 16 20:06:53 71 kernel: [193339.890948]  [] ?
ext4_dirty_inode+0x40/0x60

May 16 20:06:53 71 kernel: [193339.890949]  []
ext4_mark_inode_dirty+0x44/0x1f0

May 16 20:06:53 71 kernel: [193339.890951]  []
ext4_dirty_inode+0x40/0x60

May 16 20:06:53 71 kernel: [193339.890953]  []
__mark_inode_dirty+0x10a/0x2d0

May 16 20:06:53 71 kernel: [193339.890956]  []
update_time+0x81/0xd0

May 16 20:06:53 71 kernel: [193339.890957]  []
file_update_time+0x80/0xd0

May 16 20:06:53 71 kernel: [193339.890961]  []
__generic_file_aio_write+0x180/0x3d0

May 16 20:06:53 71 kernel: [193339.890963]  []
generic_file_aio_write+0x58/0xa0

May 16 20:06:53 71 kernel: [193339.890965]  []
ext4_file_write+0x99/0x400

May 16 20:06:53 71 kernel: [193339.890967]  [] ?
wake_up_state+0x10/0x20

May 16 20:06:53 71 kernel: [193339.890970]  [] ?
wake_futex+0x66/0x90

May 16 20:06:53 71 kernel: [193339.890972]  [] ?
futex_wake+0x1b1/0x1d0

May 16 20:06:53 71 kernel: [193339.890974]  []
do_sync_write+0x5a/0x90

May 16 20:06:53 71 kernel: [193339.890976]  []
vfs_write+0xb4/0x1f0

May 16 20:06:53 71 kernel: [193339.890978]  []
SyS_write+0x49/0xa0

May 16 20:06:53 71 kernel: [193339.890980]  []
tracesys+0xe1/0xe6



On Wed, May 18, 2016 at 2:33 AM, haosdent <haosd...@gmail.com> wrote:

> >Is there some way to be given control (a callback, or an "exit" routine)
> so that the container about to be nuked can be given a chance to exit
> gracefully?
> The default value of executor_shutdown_grace_period is 5 seconds; you
> can change it by specifying the `--executor_shutdown_grace_period` flag when
> launching the mesos agent.
>
> >Are there other steps I can take to avoid this mildly calamitous
> occurrence?
> >mesos-slaves get shutdown
> Do you know where your mesos-master got stuck when it happens? Any error log
> or related log about this? In addition, is there any log when mesos-slave
> shut down?
>
> On Wed, May 18, 2016 at 6:12 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi All,
>>
>> I probably have the following account partly wrong, but let me present it
>> just the same and those who know better can co

Consequences of health-check timeouts?

2016-05-17 Thread Paul Bell
Hi All,

I probably have the following account partly wrong, but let me present it
just the same and those who know better can correct me as needed.

I've an application that runs several MongoDB shards, each a Dockerized
container, each on a distinct node (VM); in fact, some of the VMs are on
separate ESXi hosts.

I've lately seen situations where, because of very slow disks for the
database, the following sequence occurs (I think):

   1. Linux (Ubuntu 14.04 LTS) virtual memory manager hits thresholds
   defined by vm.dirty_background_ratio and/or vm.dirty_ratio (probably both)
   2. Synchronous flushing of many, many pages occurs, writing to a slow
   disk
   3. (Around this time one might see in /var/log/syslog "task X blocked
   for more than 120 seconds" for all kinds of tasks, including mesos-master)
   4. mesos-slaves get shutdown (this is the part I'm unclear about; but I
   am quite certain that on 2 nodes the executors and their in-flight MongoDB
   tasks got zapped because I can see that Marathon restarted them).

The consequences of this are a corrupt MongoDB database. In the case at
hand, the job had run for over 50 hours, processing close to 120 million
files.

Steps I've taken so far to remedy include:

   - tune vm.dirty_background_ratio and vm.dirty_ratio down, respectively,
   to 5 and 10 (from 10 and 20). The intent here is to tolerate more frequent,
   smaller flushes and thus avoid less frequent massive flushes that suspend
   threads for very long periods.
   - increase agent ping timeout to 10 minutes (every 30 seconds, 20 times);
   a sketch of both changes follows this list
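
A sketch of how those two remedies might be applied (the sysctl values mirror
the ones above; the master flag names, and the idea that they control the ping
interval and count, are assumptions worth checking against the 0.23 docs):

  # /etc/sysctl.d/60-vm-dirty.conf -- flush dirty pages earlier, in smaller batches
  vm.dirty_background_ratio = 5
  vm.dirty_ratio = 10

  # apply without a reboot:
  sysctl -p /etc/sysctl.d/60-vm-dirty.conf

  # mesos-master flags -- tolerate ~10 minutes of missed agent pings (30s x 20):
  --slave_ping_timeout=30secs --max_slave_ping_timeouts=20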

So the questions are:

   - Is there some way to be given control (a callback, or an "exit"
   routine) so that the container about to be nuked can be given a chance to
   exit gracefully? (See the sketch of the relevant agent flags after this list.)
   - Are there other steps I can take to avoid this mildly calamitous
   occurrence?
   - (Also, I'd be grateful for more clarity on anything in steps 1-4 above
   that is a bit hand-wavy!)

As always, thanks.

-Paul


Status of Mesos-3821

2016-04-19 Thread Paul Bell
Hi,

I think I encountered the problem described by
https://issues.apache.org/jira/browse/MESOS-3821 and wanted to ask if this
fix is in Mesos 0.28.

But perhaps I misunderstand what's being said; so by way of background our
case is Mesos on CentOS 7.2. When we try to set --docker_socket to
tcp://: the mesos-slave service refuses to start. It seems to
require a Unix socket.
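
For reference, a sketch of the two flag forms at issue (host and port are
placeholders, not values from our setup):

  # default form, accepted by mesos-slave:
  --docker_socket=/var/run/docker.sock
  # the form we are attempting, which is refused at startup on our version:
  --docker_socket=tcp://<docker-host>:<port>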

There is one comment in the ticket that expresses the hope of being able to
use URLs of the tcp:// form.

Am I misunderstanding this fix and if not, what release of Mesos
incorporates it?

Thanks for your help.

-Paul


Re: Mesos Master and Slave on same server?

2016-04-13 Thread Paul Bell
Hi June,

In addition to doing what Pradeep suggests, I also now & then run a single
node "cluster" that houses mesos-master, mesos-slave, and Marathon.

Works fine.

Cordially,

Paul

On Wed, Apr 13, 2016 at 12:36 PM, Pradeep Chhetri <
pradeep.chhetr...@gmail.com> wrote:

> I would suggest you to run mesos-master and zookeeper and marathon on same
> set of hosts (maybe call them as coordinator nodes) and use completely
> different set of nodes for mesos slaves. This way you can do the
> maintenance of such hosts in a very planned fashion.
>
> On Wed, Apr 13, 2016 at 4:22 PM, Stefano Bianchi <jazzist...@gmail.com>
> wrote:
>
>> For sure it is possible.
>> Simply put, the Mesos master will see the resources offered by the machine on
>> which the mesos-slave is also running, transparently.
>>
>> 2016-04-13 16:34 GMT+02:00 June Taylor <j...@umn.edu>:
>>
>>> All of our node servers are identical hardware. Is it reasonable for me
>>> to install the Mesos-Master and Mesos-Slave on the same physical hardware?
>>>
>>> Thanks,
>>> June Taylor
>>> System Administrator, Minnesota Population Center
>>> University of Minnesota
>>>
>>
>>
>
>
> --
> Regards,
> Pradeep Chhetri
>
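
A minimal sketch of the single-node layout Paul describes above, assuming the
Mesosphere .deb packages on Ubuntu (service names and init mechanism vary by
distro and install method):

  # everything on one box: coordination, master, agent, and the framework
  service zookeeper start
  service mesos-master start
  service mesos-slave start
  service marathon start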


Re: Mesos 0.28 SSL in official packages

2016-04-12 Thread Paul Bell
FWIW, I quite agree with Zameer's point.

That said, I want to make abundantly clear that in my experience the folks
at Mesosphere are wonderfully helpful.

But what happens if down the road Mesosphere is acquired or there occurs
some other event that could represent, if not a conflict of interest, then
simply a different strategic direction?

My 2 cents.

-Paul

On Mon, Apr 11, 2016 at 5:19 PM, Zameer Manji <zma...@apache.org> wrote:

> I have suggested this before and I will suggest it again here.
>
> I think the Apache Mesos project should build and distribute packages
> instead of relying on the generosity of a commercial vendor. The Apache
> Aurora project does this already with good success. As a user of Apache
> Mesos I don't care about Mesosphere Inc and I feel uncomfortable that the
> project is so dependent on its employees.
>
> Doing this would allow users to contribute packaging fixes directly to the
> project, such as enabling SSL.
>
> On Mon, Apr 11, 2016 at 3:02 AM, Adam Bordelon <a...@mesosphere.io> wrote:
>
>> Hi Kamil,
>>
>> Technically, there are no "official" Apache-built packages for Apache
>> Mesos.
>>
>> At least one company (Mesosphere) chooses to build and distribute
>> Mesos packages, but does not currently offer SSL builds. It wouldn't
>> be hard to add an SSL build to our regular builds, but it hasn't been
>> requested enough to prioritize it.
>>
>> cc: Joris, Kapil
>>
>> On Thu, Apr 7, 2016 at 7:42 AM, haosdent <haosd...@gmail.com> wrote:
>> > Hi, SSL isn't enabled by default. You need to compile it by following this doc
>> > http://mesos.apache.org/documentation/latest/ssl/
>> >
>> > On Thu, Apr 7, 2016 at 10:04 PM, Kamil Wokitajtis <wokitaj...@gmail.com
>> >
>> > wrote:
>> >>
>> >> This is my first post, so Hi everyone!
>> >>
>> >> Is SSL enabled in official packages (CentOS in my case)?
>> >> I can see libssl in ldd output, but I cannot see libevent.
>> >> I had to compile mesos from sources to run it over ssl.
>> >> I would prefer to install it from packages.
>> >>
>> >> Regards,
>> >> Kamil
>> >
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> > Haosdent Huang
>>
>> --
>> Zameer Manji
>>
>>
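
For anyone who needs SSL before packaged builds offer it, a sketch of the
source build the linked SSL documentation describes (configure flags taken
from that doc; verify them against your Mesos version):

  ./bootstrap
  mkdir build && cd build
  ../configure --enable-libevent --enable-ssl
  make && make install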


Re: Backup a Mesos Cluster

2016-04-11 Thread Paul Bell
Piotr,

Thank you for this link. I am looking at it now, and right away I notice
that Exhibitor is designed to monitor (and back up) Zookeeper, but not
anything related to Mesos itself. Don't the Mesos master & agent nodes
keep at least some state outside of the ZK znodes, e.g., under the default
workdir?

Shuai,

Thank you for this observation. Happily (I think), we do not have a custom
framework. Presently, Marathon is the only framework that we use.

-Paul

On Mon, Apr 11, 2016 at 8:12 AM, Shuai Lin <linshuai2...@gmail.com> wrote:

> If your product contains a custom framework, you should at least
> implement some kind of high availability for your scheduler (like
> marathon/chronos does), or let it be launched by marathon so it can be
> restarted when it fails.
>
> On Mon, Apr 11, 2016 at 7:27 PM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi All,
>>
>> As we get closer to shipping a Mesos-based version of our product, we've
>> turned our attention to "protecting" (supporting backup & recovery) of not
>> only our application databases, but the cluster as well.
>>
>> I'm not quite sure how to begin thinking about this, but I suppose the
>> usual dimensions of B/R would come into play, e.g., hot/cold, application
>> consistent/crash consistent, etc.
>>
>> Has anyone grappled with this issue and, if so, would you be so kind as
>> to share your experience and solutions?
>>
>> Thank you.
>>
>> -Paul
>>
>>
>


Backup a Mesos Cluster

2016-04-11 Thread Paul Bell
Hi All,

As we get closer to shipping a Mesos-based version of our product, we've
turned our attention to "protecting" (supporting backup & recovery) of not
only our application databases, but the cluster as well.

I'm not quite sure how to begin thinking about this, but I suppose the
usual dimensions of B/R would come into play, e.g., hot/cold, application
consistent/crash consistent, etc.

Has anyone grappled with this issue and, if so, would you be so kind as to
share your experience and solutions?

Thank you.

-Paul


Re: Agent won't start

2016-03-30 Thread Paul Bell
Greg, thanks again - I am planning on moving my work_dir.



Pradeep, thanks again. In a slightly different scenario, namely,

service mesos-slave stop
edit /etc/default/mesos-slave   (add a port resource)
service mesos-slave start


I noticed that slave did not start and - again - the log shows the same
phenomena as in my original post. Per your suggestion, I did a

rm -Rf /tmp/mesos

and the slave service started correctly.

Questions:


   1. Did editing /etc/default/mesos-slave cause the failure of the service
   to start?
   2. Given that starting/stopping the entire cluster (stopping all
   services on all nodes) is a standard feature in our product, should I
   routinely run the above "rm" command when the mesos services are stopped?


Thanks for your help.

Cordially,

Paul

On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann <g...@mesosphere.io> wrote:

> Check out this link for info on /tmp cleanup in Ubuntu:
> http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up
>
> And check out this link for information on some of the work_dir's contents
> on a Mesos agent: http://mesos.apache.org/documentation/latest/sandbox/
>
> The work_dir contains important application state for the Mesos agent, so
> it should not be placed in a location that will be automatically
> garbage-collected by the OS. The choice of /tmp/mesos as a default location
> is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to
> change it. Ideally you should be able to leave the work_dir alone and let
> the Mesos agent manage it for you.
>
> In any case, I would recommend that you set the work_dir to something
> outside of /tmp; /var/lib/mesos is a commonly-used location.
>
> Cheers,
> Greg
>
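
A sketch of Greg's /var/lib/mesos suggestion on an Ubuntu 14.04 install (the
/etc/mesos-slave flag-file mechanism is an assumption based on the Mesosphere
packages; adjust for how your agent is launched):

  mkdir -p /var/lib/mesos
  echo '/var/lib/mesos' > /etc/mesos-slave/work_dir
  service mesos-slave restart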


Re: Agent won't start

2016-03-29 Thread Paul Bell
Hi Pradeep,

And thank you for your reply!

That, too, is very interesting. I think I need to synthesize what you and
Greg are telling me and come up with a clean solution. Agent nodes can
crash. Moreover, I can stop the mesos-slave service, and start it later
with a reboot in between.

So I am interested in fully understanding the causal chain here before I
try to fix anything.

-Paul



On Tue, Mar 29, 2016 at 5:51 PM, Paul Bell <arach...@gmail.com> wrote:

> Whoa...interessant!
>
> The node *may* have been rebooted. Uptime says 2 days. I'll need to check
> my notes.
>
> Can you point me to reference re Ubuntu behavior?
>
> Based on what you've told me so far, it sounds as if the sequence:
>
> stop service
> reboot agent node
> start service
>
>
> could lead to trouble - or do I misunderstand?
>
>
> Thank you again for your help.
>
> -Paul
>
> On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann <g...@mesosphere.io> wrote:
>
>> Paul,
>> This would be relevant for any system which is automatically deleting
>> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to
>> be completely nuked at boot time. Was the agent node rebooted prior to this
>> problem?
>>
>> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell <arach...@gmail.com> wrote:
>>
>>> Hi Greg,
>>>
>>> Thanks very much for your quick reply.
>>>
>>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
>>> systemd. I will look at the link you provide.
>>>
>>> Is there any chance that it might apply to non-systemd platforms?
>>>
>>> Cordially,
>>>
>>> Paul
>>>
>>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote:
>>>
>>>> Hi Paul,
>>>> Noticing the logging output, "Failed to find resources file
>>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>>>> may be related to the location of your agent's work_dir. See this ticket:
>>>> https://issues.apache.org/jira/browse/MESOS-4541
>>>>
>>>> Some users have reported issues resulting from the systemd-tmpfiles
>>>> service garbage collecting files in /tmp, perhaps this is related? What
>>>> platform is your agent running on?
>>>>
>>>> You could try specifying a different agent work directory outside of
>>>> /tmp/ via the `--work_dir` command-line flag.
>>>>
>>>> Cheers,
>>>> Greg
>>>>
>>>>
>>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am hoping someone can shed some light on this.
>>>>>
>>>>> An agent node failed to start, that is, when I did "service
>>>>> mesos-slave start" the service came up briefly & then stopped. Before
>>>>> stopping it produced the log shown below. The last thing it wrote is
>>>>> "Trying to create path '/mesos' in Zookeeper".
>>>>>
>>>>> This mention of the mesos znode prompted me to go for a clean slate by
>>>>> removing the mesos znode from Zookeeper.
>>>>>
>>>>> After doing this, the mesos-slave service started perfectly.
>>>>>
>>>>> What might be happening here, and also what's the right way to
>>>>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>> -Paul
>>>>>
>>>>>
>>>>> Log file created at: 2016/03/29 14:19:39
>>>>> Running on machine: 71.100.202.193
>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>>>>> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging
>>>>> started!
>>>>> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39
>>>>> by root
>>>>> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
>>>>> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
>>>>> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
>>>>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
>>>>> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
>>>>> posix/cpu,posix/mem
>>>>> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
>>>>> I0329 14:19:39.616267  58

Re: Agent won't start

2016-03-29 Thread Paul Bell
Whoa...interessant!

The node *may* have been rebooted. Uptime says 2 days. I'll need to check
my notes.

Can you point me to reference re Ubuntu behavior?

Based on what you've told me so far, it sounds as if the sequence:

stop service
reboot agent node
start service


could lead to trouble - or do I misunderstand?


Thank you again for your help.

-Paul

On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann <g...@mesosphere.io> wrote:

> Paul,
> This would be relevant for any system which is automatically deleting
> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to
> be completely nuked at boot time. Was the agent node rebooted prior to this
> problem?
>
> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi Greg,
>>
>> Thanks very much for your quick reply.
>>
>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
>> systemd. I will look at the link you provide.
>>
>> Is there any chance that it might apply to non-systemd platforms?
>>
>> Cordially,
>>
>> Paul
>>
>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote:
>>
>>> Hi Paul,
>>> Noticing the logging output, "Failed to find resources file
>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>>> may be related to the location of your agent's work_dir. See this ticket:
>>> https://issues.apache.org/jira/browse/MESOS-4541
>>>
>>> Some users have reported issues resulting from the systemd-tmpfiles
>>> service garbage collecting files in /tmp, perhaps this is related? What
>>> platform is your agent running on?
>>>
>>> You could try specifying a different agent work directory outside of
>>> /tmp/ via the `--work_dir` command-line flag.
>>>
>>> Cheers,
>>> Greg
>>>
>>>
>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am hoping someone can shed some light on this.
>>>>
>>>> An agent node failed to start, that is, when I did "service mesos-slave
>>>> start" the service came up briefly & then stopped. Before stopping it
>>>> produced the log shown below. The last thing it wrote is "Trying to create
>>>> path '/mesos' in Zookeeper".
>>>>
>>>> This mention of the mesos znode prompted me to go for a clean slate by
>>>> removing the mesos znode from Zookeeper.
>>>>
>>>> After doing this, the mesos-slave service started perfectly.
>>>>
>>>> What might be happening here, and also what's the right way to
>>>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>>>
>>>> Thanks for your help.
>>>>
>>>> -Paul
>>>>
>>>>
>>>> Log file created at: 2016/03/29 14:19:39
>>>> Running on machine: 71.100.202.193
>>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>>>> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
>>>> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by
>>>> root
>>>> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
>>>> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
>>>> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
>>>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
>>>> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
>>>> posix/cpu,posix/mem
>>>> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
>>>> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
>>>> 71.100.202.193:5051
>>>> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
>>>> --attributes="hostType:shard1" --authenticatee="crammd5"
>>>> --cgroups_cpu_enable_pids_and_tids_count="false"
>>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>>>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>>>> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
>>>> --default_role="*" --disk_watch_interval="1mins"
>>>> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
>>>> --docker_remove_delay="6hrs"
>>>> --docker_sandbox_dir

Re: Agent won't start

2016-03-29 Thread Paul Bell
Hi Greg,

Thanks very much for your quick reply.

I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
systemd. I will look at the link you provide.

Is there any chance that it might apply to non-systemd platforms?

Cordially,

Paul

On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann <g...@mesosphere.io> wrote:

> Hi Paul,
> Noticing the logging output, "Failed to find resources file
> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may
> be related to the location of your agent's work_dir. See this ticket:
> https://issues.apache.org/jira/browse/MESOS-4541
>
> Some users have reported issues resulting from the systemd-tmpfiles
> service garbage collecting files in /tmp, perhaps this is related? What
> platform is your agent running on?
>
> You could try specifying a different agent work directory outside of /tmp/
> via the `--work_dir` command-line flag.
>
> Cheers,
> Greg
>
>
> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi,
>>
>> I am hoping someone can shed some light on this.
>>
>> An agent node failed to start, that is, when I did "service mesos-slave
>> start" the service came up briefly & then stopped. Before stopping it
>> produced the log shown below. The last thing it wrote is "Trying to create
>> path '/mesos' in Zookeeper".
>>
>> This mention of the mesos znode prompted me to go for a clean slate by
>> removing the mesos znode from Zookeeper.
>>
>> After doing this, the mesos-slave service started perfectly.
>>
>> What might be happening here, and also what's the right way to
>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>
>> Thanks for your help.
>>
>> -Paul
>>
>>
>> Log file created at: 2016/03/29 14:19:39
>> Running on machine: 71.100.202.193
>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
>> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by
>> root
>> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
>> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
>> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
>> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
>> posix/cpu,posix/mem
>> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
>> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
>> 71.100.202.193:5051
>> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
>> --attributes="hostType:shard1" --authenticatee="crammd5"
>> --cgroups_cpu_enable_pids_and_tids_count="false"
>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
>> --default_role="*" --disk_watch_interval="1mins"
>> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
>> --docker_remove_delay="6hrs"
>> --docker_sandbox_directory="/mnt/mesos/sandbox"
>> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
>> --enforce_container_disk_quota="false"
>> --executor_registration_timeout="5mins"
>> --executor_shutdown_grace_period="5secs"
>> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
>> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
>> --hadoop_home="" --help="false" --hostname="71.100.202.193"
>> --initialize_driver_logging="true" --ip="71.100.202.193"
>> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
>> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
>> --master="zk://71.100.202.191:2181/mesos"
>> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
>> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
>> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
>> --registration_backoff_factor="1secs"
>> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
>> --strict="true" --switch_user="true" --version="false"
>> --work_dir="/tmp/mesos"
>> I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
>> mem(*):23089; disk(*):122517; ports(*):[31000-32000]
>> I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
>> I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
>> I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
>> '/tmp/mesos/meta'
>> I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
>> '/tmp/mesos/meta/resources/resources.info'
>> I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
>> 71.100.202.193:5051) connected to ZooKeeper
>> I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations:
>> queue size (joins, cancels, datas) = (0, 0, 0)
>> I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path '/mesos'
>> in ZooKeeper
>>
>>
>


Agent won't start

2016-03-29 Thread Paul Bell
Hi,

I am hoping someone can shed some light on this.

An agent node failed to start, that is, when I did "service mesos-slave
start" the service came up briefly & then stopped. Before stopping it
produced the log shown below. The last thing it wrote is "Trying to create
path '/mesos' in Zookeeper".

This mention of the mesos znode prompted me to go for a clean slate by
removing the mesos znode from Zookeeper.

After doing this, the mesos-slave service started perfectly.

What might be happening here, and also what's the right way to
trouble-shoot such a problem? Mesos is version 0.23.0.

Thanks for your help.

-Paul


Log file created at: 2016/03/29 14:19:39
Running on machine: 71.100.202.193
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by root
I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
posix/cpu,posix/mem
I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
71.100.202.193:5051
I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
--attributes="hostType:shard1" --authenticatee="crammd5"
--cgroups_cpu_enable_pids_and_tids_count="false"
--cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
--cgroups_limit_swap="false" --cgroups_root="mesos"
--container_disk_watch_interval="15secs" --containerizers="docker,mesos"
--default_role="*" --disk_watch_interval="1mins"
--docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
--docker_remove_delay="6hrs"
--docker_sandbox_directory="/mnt/mesos/sandbox"
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
--enforce_container_disk_quota="false"
--executor_registration_timeout="5mins"
--executor_shutdown_grace_period="5secs"
--fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
--frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
--hadoop_home="" --help="false" --hostname="71.100.202.193"
--initialize_driver_logging="true" --ip="71.100.202.193"
--isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
--log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
--master="zk://71.100.202.191:2181/mesos"
--oversubscribed_resources_interval="15secs" --perf_duration="10secs"
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
--quiet="false" --recover="reconnect" --recovery_timeout="15mins"
--registration_backoff_factor="1secs"
--resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
--strict="true" --switch_user="true" --version="false"
--work_dir="/tmp/mesos"
I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
mem(*):23089; disk(*):122517; ports(*):[31000-32000]
I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
'/tmp/mesos/meta'
I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
'/tmp/mesos/meta/resources/resources.info'
I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
71.100.202.193:5051) connected to ZooKeeper
I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path '/mesos'
in ZooKeeper


Re: Recent UI enhancements & Managed Service Providers

2016-02-25 Thread Paul Bell
Hi Vinod,

Thank you for your reply.

I'm not sure that I can be more specific. MSPs are interested in a "view by
tenant", e.g., "show me all applications that are allotted to Tenant X".  I
suppose that the standard Mesos UI could, with properly named task IDs and
the UI's "Find" filter, accomplish part of "view by tenant". But in order
to see the resources consumed by Tenant X's tasks, you have to visit each
task individually and look at their "Resources" table (add them all up).

It'd be cool if when a filter is in effect, the Resources table was updated
to reflect only the resources consumed by the filter-selected tasks.

There's also the question of the units/meaning of Resources. Through
Marathon I give each of my Dockerized tasks .1 CPU. As I understand it,
Docker multiplies this value times 1024 which is Docker's representation of
all the cores on a host. So when I do "docker inspect " I will see
CpuShares of 102. But in the Mesos UI each of my 6 tasks shows .2 CPUs
allocated. I'm simply not sure what this means or how it's arrived at. I
suspect that an MSP will ask the same questions.
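
A small worked example of the arithmetic in question (the executor-overhead
figure is an assumption on my part, not something confirmed in this thread):

  # Marathon app definition: "cpus": 0.1
  #   Docker CpuShares = 0.1 * 1024 = 102.4 -> 102   (matches "docker inspect")
  # Mesos UI: 0.2 CPUs allocated per task, which would line up with
  #   0.1 (task) + 0.1 (default executor overhead) = 0.2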

I will think about it some more, but I'd be interested to hear feedback on
these few points that I've raised.

Thanks again.

-Paul

On Thu, Feb 25, 2016 at 11:55 AM, Vinod Kone <vinodk...@gmail.com> wrote:

>
> > But an important MSP requirement is a unified view of their many
> tenants. So I am really trying to get a sense for how well the recent
> Mesos/Marathon releases address this requirement.
>
> Can you be more specific about what you mean by unified view and tenants?
> What's lacking currently?


Recent UI enhancements & Managed Service Providers

2016-02-25 Thread Paul Bell
Hi All,

I am running older versions of Mesos & Marathon (0.23.0 and 0.10.0).

Over the course of the last several months I think I've seen several items
on this list about UI enhancements. Perhaps they were enhancements to the
data consumed by the Mesos & Marathon UIs. I've had very little time to dig
deeply into it.

So...I am wondering if someone can either point me to any discussions of
such enhancements or summarize them here.

There is a specific use case behind this request. The Mesos architecture
seems to be a real sweet spot for an MSP. But an important MSP requirement
is a unified view of their many tenants. So I am really trying to get a
sense for how well the recent Mesos/Marathon releases address this
requirement.

Thank you.

-Paul


Feature request: move in-flight containers w/o stopping them

2016-02-18 Thread Paul Bell
Hello All,

Has there ever been any consideration of the ability to move in-flight
containers from one Mesos host node to another?

I see this as analogous to VMware's "vMotion" facility wherein VMs can be
moved from one ESXi host to another.

I suppose something like this could be useful from a load-balancing
perspective.

Just curious if it's ever been considered and if so - and rejected - why
rejected?

Thanks.

-Paul


Re: Help needed (alas, urgently)

2016-01-15 Thread Paul Bell
In chasing down this problem, I stumbled upon something of moment: the
problem does NOT seem to happen with kernel 3.13.

Some weeks back, in the hope of getting past another problem wherein the
root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04
LTS). The kernel upgrade was done as shown here (there's some extra stuff
to get rid of Ubuntu desktop and liberate some disk space):

  apt-get update
  apt-get -y remove ubuntu-desktop
  apt-get -y purge lightdm
  rm -Rf /var/lib/lightdm-data
  apt-get -y remove --purge libreoffice-core
  apt-get -y remove --purge libreoffice-common

  echo "  Installing new kernel"

  apt-get -y install linux-generic-lts-vivid
  apt-get -y autoremove linux-image-3.13.0-32-generic
  apt-get -y autoremove linux-image-3.13.0-71-generic
  update-grub
  reboot

After the reboot, a "uname -r" shows kernel 3.19.0-42-generic.

Under this kernel I can now reliably reproduce the failure to stop a
MongoDB container. Specifically, any & all attempts to kill the container,
e.g., via

Marathon HTTP Delete (which leads to mesos-docker-executor issuing the
"docker stop" command; a sketch of this call appears below)
Getting inside the running container shell and issuing "kill" or
db.shutDown()

causes the mongod container

   - to show in its log that it's shutting down normally
   - to enter a 100% CPU loop
   - to become unkillable (only reboot "fixes" things)
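
A sketch of the first of those kill paths, with placeholders for host names
(the app ID comes from this thread; the docker behaviour described is an
assumption based on the --docker_stop_timeout=15secs visible in the ps output
further down in this archive):

  # ask Marathon to destroy the app; Marathon then asks Mesos to kill its tasks
  curl -X DELETE http://<marathon-host>:8080/v2/apps/ecxconfigdb
  # on the agent, mesos-docker-executor then runs the equivalent of:
  #   docker stop -t 15 <container-id>   # SIGTERM, then SIGKILL after 15 seconds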

Note finally that my conclusion about kernel 3.13 "working" is at present a
weak induction. But I do know that when I reverted to that kernel I could,
at least once, stop the containers w/o any problems; whereas at 3.19 I can
reliably reproduce the problem. I will try to make this induction stronger
as the day wears on.

Did I do something "wrong" in my kernel upgrade steps?

Is anyone aware of such an issue in 3.19 or of work done post-3.13 in the
area of task termination & signal handling?

Thanks for your help.

-Paul


On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell <arach...@gmail.com> wrote:

> I spoke to soon, I'm afraid.
>
> Next time I did the stop (with zero timeout), I see the same phenomenon: a
> mongo container showing repeated:
>
> killing docker task
> shutting down
>
>
> What else can I try?
>
> Thank you.
>
> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi Tim,
>>
>> I set docker_stop_timeout to zero as you asked. I am pleased to report
>> (though a bit fearful about being pleased) that this change seems to have
>> shut everyone down pretty much instantly.
>>
>> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
>> the immediate use of "kill -9" as opposed to "kill -2"?
>>
>> I will keep testing the behavior.
>>
>> Thank you.
>>
>> -Paul
>>
>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell <arach...@gmail.com> wrote:
>>
>>> Hi Tim,
>>>
>>> Things have gotten slightly odder (if that's possible). When I now start
>>> the application's 5 or so containers, only one ("ecxconfigdb") gets started -
>>> and even he took a few tries. That is, I see him failing, moving to
>>> deploying, then starting again. But I've no evidence (no STDOUT, and no
>>> docker ctr logs) that show why.
>>>
>>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>>> application I am seeing the phenomena I posted before: killing docker task,
>>> shutting down repeated many times. The UN-stopped container is now running
>>> at 100% CPU.
>>>
>>> I will try modifying docker_stop_timeout. Back shortly
>>>
>>> Thanks again.
>>>
>>> -Paul
>>>
>>> PS: what do you make of the "broken pipe" error in the docker.log?
>>>
>>> *from /var/log/upstart/docker.log*
>>>
>>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>>> INFO[3054] GET
>>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> ERRO[3054] Handler for GET
>>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> returned error: No such image:
>>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> ERRO[3054] HTTP Error
>>> err=No such image:
>>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> statusCode=404
>>> INFO[3054] GET /v1.15/containers/weave/json
>>> INFO[3054] POST
>>> /v1.21/containers/create?name=m

Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
urnal
dir=/data/db/config/journal
2016-01-14T19:01:11.000+ [initandlisten] recover : no journal files
present, no recovery needed
2016-01-14T19:01:11.429+ [initandlisten] warning:
ClientCursor::staticYield can't unlock b/c of recursive lock ns:  top: {
opid: 11, active: true, secs_running: 0, microsecs_running: 36, op:
"query", ns: "local.oplog.$main", query: { query: {}, orderby: { $natural:
-1 } }, client: "0.0.0.0:0", desc: "initandlisten", threadId:
"0x7f8f73075b40", locks: { ^: "W" }, waitingForLock: false, numYields: 0,
lockStats: { timeLockedMicros: {}, timeAcquiringMicros: {} } }
2016-01-14T19:01:11.429+ [initandlisten] waiting for connections on
port 27019
2016-01-14T19:01:17.405+ [initandlisten] connection accepted from
10.2.0.3:51189 #1 (1 connection now open)
2016-01-14T19:01:17.413+ [initandlisten] connection accepted from
10.2.0.3:51190 #2 (2 connections now open)
2016-01-14T19:01:17.413+ [initandlisten] connection accepted from
10.2.0.3:51191 #3 (3 connections now open)
2016-01-14T19:01:17.414+ [conn3] first cluster operation detected,
adding sharding hook to enable versioning and authentication to remote
servers
2016-01-14T19:01:17.414+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.415+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.415+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.415+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.416+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.416+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.416+ [conn3] CMD fsync: sync:1 lock:0
2016-01-14T19:01:17.419+ [initandlisten] connection accepted from
10.2.0.3:51193 #4 (4 connections now open)
2016-01-14T19:01:17.420+ [initandlisten] connection accepted from
10.2.0.3:51194 #5 (5 connections now open)
2016-01-14T19:01:17.442+ [conn1] end connection 10.2.0.3:51189 (4
connections now open)
2016-01-14T19:02:11.285+ [clientcursormon] mem (MB) res:59 virt:385
2016-01-14T19:02:11.285+ [clientcursormon]  mapped (incl journal
view):192
2016-01-14T19:02:11.285+ [clientcursormon]  connections:4
2016-01-14T19:03:11.293+ [clientcursormon] mem (MB) res:72 virt:385
2016-01-14T19:03:11.294+ [clientcursormon]  mapped (incl journal
view):192
2016-01-14T19:03:11.294+ [clientcursormon]  connections:4
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task
Shutting down
Killing docker task

Most disturbing in all of this is that while I can stop the deployments in
Marathon (which properly ends the "docker stop" commands visible in ps
output), I cannot bounce docker, neither via Upstart nor via a kill command.
Ultimately, I have to reboot the VM.

FWIW, the 3 mongod containers (apparently stuck in their "Killing docker
task / Shutting down" loop) are running at 100% CPU as evinced by both "docker
stats" and "top".

I would truly be grateful for some guidance on this - even a mere
work-around would be appreciated.

Thank you.

-Paul


Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hi Tim,

Things have gotten slightly odder (if that's possible). When I now start
the application's 5 or so containers, only one ("ecxconfigdb") gets started -
and even he took a few tries. That is, I see him failing, moving to
deploying, then starting again. But I've no evidence (no STDOUT, and no
docker ctr logs) that show why.

In any event, ecxconfigdb does start. Happily, when I try to stop the
application I am seeing the phenomena I posted before: killing docker task,
shutting down repeated many times. The UN-stopped container is now running
at 100% CPU.

I will try modifying docker_stop_timeout. Back shortly

Thanks again.

-Paul

PS: what do you make of the "broken pipe" error in the docker.log?

*from /var/log/upstart/docker.log*

INFO[3054] GET /v1.15/images/mongo:2.6.8/json
INFO[3054] GET
/v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
ERRO[3054] Handler for GET
/v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
returned error: No such image:
mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
ERRO[3054] HTTP Error
 err=No such image:
mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
statusCode=404
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] POST
/v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
INFO[3054] POST
/v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1
INFO[3054] POST
/v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
INFO[3111] GET /v1.21/containers/json
INFO[3120] GET /v1.21/containers/cf7/json
INFO[3120] GET
/v1.21/containers/cf7/logs?stderr=1=1=all
INFO[3153] GET /containers/json
INFO[3153] GET
/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3153] GET
/containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
INFO[3153] GET
/containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
INFO[3175] GET /containers/json
INFO[3175] GET
/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3175] GET
/containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
INFO[3175] GET
/containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
*INFO[3175] POST
/v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/stop*
?t=15
*ERRO[3175] attach: stdout: write unix @: broken pipe*
*INFO[3190] Container
cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to
exit within 15 seconds of SIGTERM - using the force *
*INFO[3200] Container cf7fc7c48324 failed to exit within 10
seconds of kill - trying direct SIGKILL *

*STDOUT from Mesos:*

*--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
*--docker="/usr/local/ecxmcc/weaveShim" --help="false"
--initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
--mapped_directory="/mnt/mesos/sandbox" --quiet="false"
--sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b"
--stop_timeout="15secs"
--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
--docker="/usr/local/ecxmcc/weaveShim" --help="false"
--initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
--mapped_directory="/mnt/mesos/sandbox" --quiet="false"
--sandbox_dire

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hey Tim,

Thank you very much for your reply.

Yes, I am in the midst of trying to reproduce the problem. If successful
(so to speak), I will do as you ask.

Cordially,

Paul

On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <t...@mesosphere.io> wrote:

> Hi Paul,
>
> Looks like we've already issued the docker stop as you seen in the ps
> output, but the containers are still running. Can you look at the Docker
> daemon logs and see what's going on there?
>
> And also can you also try to modify docker_stop_timeout to 0 so that we
> SIGKILL the containers right away, and see if this still happens?
>
> Tim
>
>
>
> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi All,
>>
>> It's been quite some time since I've posted here and that's chiefly
>> because up until a day or two ago, things were working really well.
>>
>> I actually may have posted about this some time back. But then the
>> problem seemed more intermittent.
>>
>> In summa, several "docker stops" don't work, i.e., the containers are not
>> stopped.
>>
>> Deployment:
>>
>> one Ubuntu VM (vmWare) LTS 14.04 with kernel 3.19
>> Zookeeper
>> Mesos-master (0.23.0)
>> Mesos-slave (0.23.0)
>> Marathon (0.10.0)
>> Docker 1.9.1
>> Weave 1.1.0
>> Our application containers, which include
>> MongoDB (4)
>> PostGres
>> ECX (our product)
>>
>> The only thing that's changed at all in the config above is the version
>> of Docker. Used to be 1.6.2 but I today upgraded it hoping to solve the
>> problem.
>>
>>
>> My automator program stops the application by sending Marathon an "http
>> delete" for each running app. Every now & then (reliably reproducible today)
>> not all containers get stopped. Most recently, 3 containers failed to stop.
>>
>> Here are the attendant phenomena:
>>
>> Marathon shows the 3 applications in deployment mode (presumably
>> "deployment" in the sense of "stopping")
>>
>> *ps output:*
>>
>> root@71:~# ps -ef | grep docker
>> root  3823 1  0 13:55 ?00:00:02 /usr/bin/docker daemon -H
>> unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
>> root  4967 1  0 13:57 ?00:00:01 /usr/sbin/mesos-slave
>> --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
>> --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
>> --docker_stop_timeout=15secs --executor_registration_timeout=5mins
>> --hostname=71.100.202.99 --ip=71.100.202.99
>> --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
>> root  5263  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
>> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
>> 6783
>> root  5271  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
>> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
>> 6783
>> root  5279  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
>> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
>> 53
>> root  5287  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
>> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
>> 53
>> root  7119  4967  0 14:00 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/mesos/sandbox
>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
>> --stop_timeout=15secs
>> root  7378  4967  0 14:00 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/mesos/sandbox
>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>> --stop_timeout=15secs
>> root  7640  4967  0 14:01 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/m

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
I spoke too soon, I'm afraid.

Next time I did the stop (with zero timeout), I see the same phenomenon: a
mongo container showing repeated:

killing docker task
shutting down


What else can I try?

Thank you.

On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell <arach...@gmail.com> wrote:

> Hi Tim,
>
> I set docker_stop_timeout to zero as you asked. I am pleased to report
> (though a bit fearful about being pleased) that this change seems to have
> shut everyone down pretty much instantly.
>
> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
> the immediate use of "kill -9" as opposed to "kill -2"?
>
> I will keep testing the behavior.
>
> Thank you.
>
> -Paul
>
> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi Tim,
>>
>> Things have gotten slightly odder (if that's possible). When I now start
>> the application's 5 or so containers, only one "ecxconfigdb" gets started -
>> and even he took a few tries. That is, I see him failing, moving to
>> deploying, then starting again. But I've no evidence (no STDOUT, and no
>> docker ctr logs) that show why.
>>
>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>> application I am seeing the phenomena I posted before: killing docker task,
>> shutting down repeated many times. The UN-stopped container is now running
>> at 100% CPU.
>>
>> I will try modifying docker_stop_timeout. Back shortly
>>
>> Thanks again.
>>
>> -Paul
>>
>> PS: what do you make of the "broken pipe" error in the docker.log?
>>
>> *from /var/log/upstart/docker.log*
>>
>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>> INFO[3054] GET
>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> ERRO[3054] Handler for GET
>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> returned error: No such image:
>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> ERRO[3054] HTTP Error err=No such image:
>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> statusCode=404
>> INFO[3054] GET /v1.15/containers/weave/json
>> INFO[3054] POST
>> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> INFO[3054] POST
>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1
>> INFO[3054] POST
>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET /v1.15/containers/weave/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET /v1.15/containers/weave/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET
>> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> INFO[3111] GET /v1.21/containers/json
>> INFO[3120] GET /v1.21/containers/cf7/json
>> INFO[3120] GET
>> /v1.21/containers/cf7/logs?stderr=1=1=all
>> INFO[3153] GET /containers/json
>> INFO[3153] GET
>> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3153] GET
>> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>> INFO[3153] GET
>> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>> INFO[3175] GET /containers/json
>> INFO[3175] GET
>> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3175] GET
>> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>> INFO[3175] GET
>> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>> *I

Re: Anyone try Weave in Mesos env ?

2015-11-26 Thread Paul
Gladly, Weitao. It'd be my pleasure.

But give me a few hours to find some free time. 

I am today tasked with cooking a Thanksgiving turkey.

But I will try to find the time before noon today (I'm on the right coast in 
the USA).

-Paul

> On Nov 25, 2015, at 11:26 PM, Weitao <zhouwtl...@gmail.com> wrote:
> 
> Hi, Paul. Can you share the total experience about the arch with us? I am
> trying to do a similar thing
> 
> 
>> On Nov 26, 2015, at 09:47, Paul <arach...@gmail.com> wrote:
>> 
>> experience


Re: Anyone try Weave in Mesos env ?

2015-11-26 Thread Paul Bell
Hi Weitao,

I came up with this architecture as a way of distributing our application
across multiple nodes. Pre-Mesos, our application, delivered as a single
VMware VM, was not easily scalable. By breaking out the several application
components as Docker containers, we are now able (within limits imposed
chiefly by the application itself) to distribute & run those containers
across the several nodes in the Mesos cluster. Application containers that
need to talk to each other are connected via Weave's "overlay" (veth)
network.

Not surprisingly, this architecture has some of the benefits that you'd
expect from Mesos, chief among them being high-availability (more on this
below), scalability, and hybrid Cloud deployment.

The core unit of deployment is an Ubuntu image (14.04 LTS) that I've
configured with the appropriate components:

Zookeeper
Mesos-master
Mesos-slave
Marathon
Docker
Weave

SSH (including RSA keys)

Our application


This image is presently downloaded by a customer as a VMware .ova file. We
typically ask the customer to convert the resulting VM to a so-called
VMware template from which she can easily deploy multiple VMs as needed.
Please note that although we've started with VMware as our virtualization
platform, I've successfully run cluster nodes on both EC2 and Azure.

I tend to describe the Ubuntu image as "polymorphic", i.e., it can be told
to assume one of two roles, either a "master" role or a "slave" role. A
master runs ZK, mesos-master, and Marathon. A slave runs mesos-slave,
Docker, Weave, and the application.

We presently offer 3 canned deployment options:

   1. single-host, no HA
   2. multi-host, no HA (1 master, 3 slaves)
   3. multi-host, HA (3 masters, 3 slaves)

The single-host, no HA option exists chiefly to mimic the original
pre-Mesos deployment. But it has the added virtue, thanks to Mesos, of
allowing us to dynamically "grow" from a single-host to multiple hosts.

The multi-host, no HA option is presently geared toward a sharded MongoDB
backend where each slave runs a mongod container that is a single partition
(shard) of the larger database. This deployment option also lends itself
very nicely to adding a new slave node at the cluster level, and a new
mongod container at the application level - all without any downtime
whatsoever.

The multi-host, HA option offers the probably familiar *cluster-level* high
availability. I stress "cluster-level" because I think we have to
distinguish between HA at that level & HA at the application level. The
former is realized by the 3 master hosts, i.e., you can lose a master and
new one will self-elect thereby keeping the cluster up & running. But, to
my mind, at least, application level HA requires some co-operation on the
part of the application itself (e.g., checkpoint/restart). That said, it
*is* almost magical to watch Mesos re-launch an application container that
has crashed. But whether or not that re-launch results in coherent
application behavior is another matter.

An important home-grown component here is a Java program that automates
these functions:

create cluster - configures a host for a given role and starts Mesos
services. This is done via SSH
start application - distributes application containers across slave hosts.
This is done by talking to the Marathon REST API
stop application - again, via the Marathon REST API
stop cluster - stops Mesos services. Again, via SSH
destroy cluster - deconfigures the host (after which it has no defined
role); again, SSH
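
(A rough sketch of the Marathon REST calls that the start/stop functions wrap; the
Marathon host, port 8080, and the app id/JSON file are illustrative:

    curl -X POST -H "Content-Type: application/json" \
         http://marathon-host:8080/v2/apps -d @ecxconfigdb.json      # start an application
    curl -X DELETE http://marathon-host:8080/v2/apps/ecxconfigdb     # stop (destroy) an application
)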


As I write, I see Ajay's e-mail arrive about Calico. I am aware of this
project and it seems quite solid. But I've never understood the need to
"worry about networking containers in multihost setup". Weave runs as a
Docker container and It Just Works. I've "woven" together slaves nodes in a
cluster that spanned 3 different datacenters, one of them in EC2, without
any difficulty. Yes, I do have to assign Weave IP addresses to the several
containers, but this is hardly onerous. In fact, I've found it "liberating"
to select such addresses from a CIDR/8 address space, assigning them to
containers based on the container's purpose (e.g., MongoDB shard containers
might live at 10.4.0.X, etc.). Ultimately, this assignment boils down to
setting an environment variable that Marathon (or the mesos-slave executor)
will use when creating the container via "docker run".
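
(Concretely, a sketch of what such a launch boils down to, assuming a Weave proxy or
shim that honours the WEAVE_CIDR environment variable - our shim is a custom wrapper,
and the address and container name below are only illustrations of the scheme
described above:

    docker run -e WEAVE_CIDR=10.4.0.1/8 --name mongo-shard1 mongo:2.6.8

In the Marathon app definition the same thing is simply an entry in the "env" map.)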

There is a whole lot more that I could say about the internals of this
architecture. But, if you're still interested, I'll await further questions
from you.

HTH.

Cordially,

Paul


On Thu, Nov 26, 2015 at 7:16 AM, Paul <arach...@gmail.com> wrote:

> Gladly, Weitao. It'd be my pleasure.
>
> But give me a few hours to find some free time.
>
> I am today tasked with cooking a Thanksgiving turkey.
>
> But I will try to find the time before noon today (I'm on the rig

Re: Anyone try Weave in Mesos env ?

2015-11-25 Thread Paul
Hi Sam, 

Yeah, I have significant experience in this regard.

We run Docker containers spread across several Mesos slave nodes. The 
containers are all connected via Weave. It works very well.

Can you describe what you have in mind?

Cordially,

Paul

> On Nov 25, 2015, at 8:03 PM, Sam <usultra...@gmail.com> wrote:
> 
> Guys,
> We are trying to use Weave in hybrid cloud Mesos env , anyone got experience 
> on it ? Appreciated 
> Regards,
> Sam
> 
> Sent from my iPhone


Re: Anyone try Weave in Mesos env ?

2015-11-25 Thread Paul
Happy Thanksgiving to you, too.

I tend to deploy the several Mesos nodes as VMware VMs.

However, I've also run a cluster with master on ESXi, slaves on ESXi, slave on 
bare metal, and an EC2 slave.

But in my case all applications are Docker containers connected via Weave.

Does your present deployment involve Docker and Weave? 

-paul

> On Nov 25, 2015, at 8:55 PM, Sam <usultra...@gmail.com> wrote:
> 
> Paul,
> Happy thanksgiving first. We are using Aws, Rackspace as hybrid cloud env , 
> and we deployed Mesos master in AWS , part of Slaves in AWS , part of Slaves 
> in Rackspace .  I am thinking whether it works ? And since it got low latency 
> in networking , can we deploy two masters in both AWS and Rackspace ? And 
> federation ?Appreciated for your reply .
> 
> Regards ,
> Sam
> 
> Sent from my iPhone
> 
>> On Nov 26, 2015, at 9:47 AM, Paul <arach...@gmail.com> wrote:
>> 
>> Hi Sam, 
>> 
>> Yeah, I have significant experience in this regard.
>> 
>> We run a Docker containers spread across several Mesos slave nodes. The 
>> containers are all connected via Weave. It works very well.
>> 
>> Can you describe what you have in mind?
>> 
>> Cordially,
>> 
>> Paul
>> 
>>> On Nov 25, 2015, at 8:03 PM, Sam <usultra...@gmail.com> wrote:
>>> 
>>> Guys,
>>> We are trying to use Weave in hybrid cloud Mesos env , anyone got 
>>> experience on it ? Appreciated 
>>> Regards,
>>> Sam
>>> 
>>> Sent from my iPhone


Re: Anyone try Weave in Mesos env ?

2015-11-25 Thread Paul Bell
Hmm...I'm not sure there's really a "fix" for that (BTW: I assume you mean
to fix high (or long) latency, i.e., to make it lower, faster). A network
link is a network link, right? Like all hardware, it has its own physical
characteristics which determine its latency's lower bound, below which it
is physically impossible to go.

Sounds to me as if you've got the whole Mesos + Docker + Weave thing
figured out, at least as far as the basic connectivity and addressing is
concerned. So there's not much more that I can tell you in that regard.

Are you running Weave 1.2 (or above)? It incorporates their "fast path"
technology based on the Linux kernel's Open vSwitch (*vide*:
http://blog.weave.works/2015/11/13/weave-docker-networking-performance-fast-data-path/).
But, remember, there's still the link in between endpoints. One can
optimize the packet handling within an endpoint, but this could boil down
to a case of "hurry up and wait".

I would urge you to take this question up with the friendly, knowledgeable,
and very helpful folks at Weave:
https://groups.google.com/a/weave.works/forum/#!forum/weave-users .

Cordially,

Paul

On Wed, Nov 25, 2015 at 9:31 PM, Sam <usultra...@gmail.com> wrote:

> Paul,
> Yup, Weave and Docker.  May I know how did you fix low latency issue over
> Internet ? By tunnel or ?
>
> Regards,
> Sam
>
> Sent from my iPhone
>
> > On Nov 26, 2015, at 10:23 AM, Paul <arach...@gmail.com> wrote:
> >
> > Happy Thanksgiving to you, too.
> >
> > I tend to deploy the several Mesos nodes as VMware VMs.
> >
> > However, I've also run a cluster with master on ESXi, slaves on ESXi,
> slave on bare metal, and an EC2 slave.
> >
> > But in my case all applications are Docker containers connected via
> Weave.
> >
> > Does your present deployment involve Docker and Weave?
> >
> > -paul
> >
> >> On Nov 25, 2015, at 8:55 PM, Sam <usultra...@gmail.com> wrote:
> >>
> >> Paul,
> >> Happy thanksgiving first. We are using Aws, Rackspace as hybrid cloud
> env , and we deployed Mesos master in AWS , part of Slaves in AWS , part of
> Slaves in Rackspace .  I am thinking whether it works ? And since it got
> low latency in networking , can we deploy two masters in both AWS and
> Rackspace ? And federation ?Appreciated for your reply .
> >>
> >> Regards ,
> >> Sam
> >>
> >> Sent from my iPhone
> >>
> >>> On Nov 26, 2015, at 9:47 AM, Paul <arach...@gmail.com> wrote:
> >>>
> >>> Hi Sam,
> >>>
> >>> Yeah, I have significant experience in this regard.
> >>>
> >>> We run a Docker containers spread across several Mesos slave nodes.
> The containers are all connected via Weave. It works very well.
> >>>
> >>> Can you describe what you have in mind?
> >>>
> >>> Cordially,
> >>>
> >>> Paul
> >>>
> >>>> On Nov 25, 2015, at 8:03 PM, Sam <usultra...@gmail.com> wrote:
> >>>>
> >>>> Guys,
> >>>> We are trying to use Weave in hybrid cloud Mesos env , anyone got
> experience on it ? Appreciated
> >>>> Regards,
> >>>> Sam
> >>>>
> >>>> Sent from my iPhone
>


Re: Fate of slave node after timeout

2015-11-13 Thread Paul
Jie,

Thank you.

That's odd behavior, no? That would seem to mean that the slave can never again 
join the cluster, at least not from its original IP@.

What if the master bounces? Will it then tolerate the slave?

-Paul

On Nov 13, 2015, at 4:46 PM, Jie Yu <yujie@gmail.com> wrote:

>> Can that slave never again be added into the cluster, i.e., what happens if 
>> it comes up 1 second after exceeding the timeout product?
> 
> It'll not be added to the cluster. The master will send a Shutdown message to 
> the slave if it comes up after the timeout.
> 
> - Jie 
> 
>> On Fri, Nov 13, 2015 at 1:44 PM, Paul Bell <arach...@gmail.com> wrote:
>> Hi All,
>> 
>> IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded 
>> without a response from a mesos-slave, the master will remove the slave. In 
>> the Mesos UI I can see slave state transition from 1 deactivated to 0.
>> 
>> Can that slave never again be added into the cluster, i.e., what happens if 
>> it comes up 1 second after exceeding the timeout product?
>> 
>> (I'm dusting off some old notes and trying to refresh my memory about 
>> problems I haven't seen in quite some time).
>> 
>> Thank you.
>> 
>> -Paul
> 


Re: Fate of slave node after timeout

2015-11-13 Thread Paul
Ah, now I get it.

And this comports with the behavior I am observing right now.

Thanks again, Jie.

-Paul

> On Nov 13, 2015, at 5:55 PM, Jie Yu <yujie@gmail.com> wrote:
> 
> Paul, the slave will terminate after receiving a Shutdown message. The slave 
> will be restarted (e.g., by monit or systemd) and register with the master as 
> a new slave (a different slaveId).
> 
> - Jie
> 
>> On Fri, Nov 13, 2015 at 2:53 PM, Paul <arach...@gmail.com> wrote:
>> Jie,
>> 
>> Thank you.
>> 
>> That's odd behavior, no? That would seem to mean that the slave can never 
>> again join the cluster, at least not from it's original IP@.
>> 
>> What if the master bounces? Will it then tolerate the slave?
>> 
>> -Paul
>> 
>> On Nov 13, 2015, at 4:46 PM, Jie Yu <yujie@gmail.com> wrote:
>> 
>>>> Can that slave never again be added into the cluster, i.e., what happens 
>>>> if it comes up 1 second after exceeding the timeout product?
>>> 
>>> It'll not be added to the cluster. The master will send a Shutdown message 
>>> to the slave if it comes up after the timeout.
>>> 
>>> - Jie 
>>> 
>>>> On Fri, Nov 13, 2015 at 1:44 PM, Paul Bell <arach...@gmail.com> wrote:
>>>> Hi All,
>>>> 
>>>> IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded 
>>>> without a response from a mesos-slave, the master will remove the slave. 
>>>> In the Mesos UI I can see slave state transition from 1 deactivated to 0.
>>>> 
>>>> Can that slave never again be added into the cluster, i.e., what happens 
>>>> if it comes up 1 second after exceeding the timeout product?
>>>> 
>>>> (I'm dusting off some old notes and trying to refresh my memory about 
>>>> problems I haven't seen in quite some time).
>>>> 
>>>> Thank you.
>>>> 
>>>> -Paul
> 


Fate of slave node after timeout

2015-11-13 Thread Paul Bell
Hi All,

IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded
without a response from a mesos-slave, the master will remove the slave. In
the Mesos UI I can see slave state transition from 1 deactivated to 0.
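
(For concreteness, and assuming I am remembering the defaults correctly:
slave_ping_timeout defaults to 15secs and max_slave_ping_timeouts to 5, so the
product works out to roughly 75 seconds of silence before the master removes the
slave.)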

Can that slave never again be added into the cluster, i.e., what happens if
it comes up 1 second after exceeding the timeout product?

(I'm dusting off some old notes and trying to refresh my memory about
problems I haven't seen in quite some time).

Thank you.

-Paul


Re: manage jobs log files in sandboxes

2015-11-06 Thread Paul Bell
I've done a little reconnoitering, and the terrain looks to me as follows:

   1. Docker maintains container log files at
   /var/lib/docker/containers//-json.log
   2. Mesos maintains container STDOUT files at a
   slave/framework/application specific location, e.g.,
   
/tmp/mesos/slaves/20151102-082316-370041927-5050-32381-S1/frameworks/20151102-082316-370041927-5050-32381-/executors/ecxprimary1.80750071-81a0-11e5-8596-82d195a34239/runs/5c767378-9599-40af-8010-a31f4c55f9dc
   3. The latter is mapped to the container's /mnt/mesos/sandbox
   4. These two files (-json.log and the STDOUT file) are different, *each*
   consumes disk space.

I think that the answer to (1) is Docker's logrotate.
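
(For (1), a sketch of the per-container rotation options that Docker 1.8 added for
the json-file log driver; the sizes are illustrative, and with Marathon they would
presumably be passed through the "parameters" list of the container spec:

    docker run --log-opt max-size=50m --log-opt max-file=3 ...
)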

As to (2), I am considering a cron job at host (not container) level that
drives truncate cmd (GNU coreutils) to prune these files at a certain size.
Obviously requires knowing the fully-qualified path under
/tmp/mesos/slaves, but this is readily available via "docker inspect".
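
A rough sketch of what I have in mind - the container name, size thresholds, and the
grep are illustrative (the grep is just a crude way to pull the sandbox path out of
"docker inspect"); note that truncate keeps the head of the file and drops the tail:

    #!/bin/sh
    SANDBOX=$(docker inspect mongo-shard1 | grep -o '/tmp/mesos/slaves/[^":]*' | head -1)
    [ -n "$SANDBOX" ] && find "$SANDBOX" -maxdepth 1 -name stdout -size +1G \
        -exec truncate -s 100M {} \;

run from root's crontab every few hours.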

-Paul


On Fri, Nov 6, 2015 at 7:17 AM, Paul Bell <arach...@gmail.com> wrote:

> Hi Mauricio,
>
> Yeah...I see your point; thank you.
>
> My approach would be akin to closing the barn door after the horse got
> out. Both Mesos & Docker are doing their own writing of STDOUT. Docker's
> rotation won't address Mesos's behavior.
>
> I need to find a solution here.
>
> -Paul
>
>
> On Thu, Nov 5, 2015 at 10:46 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> I don't think that's going to help :(
>> Even if you configure a different docker log driver, Docker still send
>> things to stdout, which is catched by mesos and dumped in the .logs
>> directory in the job sandbox. For example, by default docker logs into a
>> json file in /var/lib/docker but mesos still writes to the sandbox.
>> Hi Mauricio,
>>
>> I'm grappling with the same issue.
>>
>> I'm not yet sure if it represents a viable solution, but I plan to look
>> at Docker's log rotation facility. It was introduced in Docker 1.8.
>>
>> If you beat me to it & it looks like a solution, please let us know!
>>
>> Thanks.
>>
>> Cordially,
>>
>> Paul
>>
>> > On Nov 5, 2015, at 9:40 PM, Mauricio Garavaglia <
>> mauriciogaravag...@gmail.com> wrote:
>> >
>> > Hi guys,
>> >
>> > How can I manage the stdout/err log files generated by jobs in mesos?
>> for long running docker apps launched using marathon the log files can
>> deplete the disk of an agent, and using quotas makes the jobs to be killed
>> which is also not ideal. I'd like to have a way to rotate them.
>> >
>> > Is it correct to just go to the mesos agent workdir and go through each
>> sandbox stdout/err and rotate them? I know that could break the log UI but
>> it doesn't scale very well having logs of several of GB.
>> >
>> > Thanks!
>>
>
>


Re: manage jobs log files in sandboxes

2015-11-06 Thread Paul Bell
Hi Mauricio,

Yeah...I see your point; thank you.

My approach would be akin to closing the barn door after the horse got out.
Both Mesos & Docker are doing their own writing of STDOUT. Docker's
rotation won't address Mesos's behavior.

I need to find a solution here.

-Paul


On Thu, Nov 5, 2015 at 10:46 PM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> Hi Paul,
>
> I don't think that's going to help :(
> Even if you configure a different docker log driver, Docker still sends
> things to stdout, which is caught by mesos and dumped in the .logs
> directory in the job sandbox. For example, by default docker logs into a
> json file in /var/lib/docker but mesos still writes to the sandbox.
> Hi Mauricio,
>
> I'm grappling with the same issue.
>
> I'm not yet sure if it represents a viable solution, but I plan to look at
> Docker's log rotation facility. It was introduced in Docker 1.8.
>
> If you beat me to it & it looks like a solution, please let us know!
>
> Thanks.
>
> Cordially,
>
> Paul
>
> > On Nov 5, 2015, at 9:40 PM, Mauricio Garavaglia <
> mauriciogaravag...@gmail.com> wrote:
> >
> > Hi guys,
> >
> > How can I manage the stdout/err log files generated by jobs in mesos?
> for long running docker apps launched using marathon the log files can
> deplete the disk of an agent, and using quotas makes the jobs to be killed
> which is also not ideal. I'd like to have a way to rotate them.
> >
> > Is it correct to just go to the mesos agent workdir and go through each
> sandbox stdout/err and rotate them? I know that could break the log UI but
> it doesn't scale very well having logs of several of GB.
> >
> > Thanks!
>


Re: manage jobs log files in sandboxes

2015-11-05 Thread Paul
Hi Mauricio,

I'm grappling with the same issue.

I'm not yet sure if it represents a viable solution, but I plan to look at 
Docker's log rotation facility. It was introduced in Docker 1.8.

If you beat me to it & it looks like a solution, please let us know!

Thanks.

Cordially,

Paul

> On Nov 5, 2015, at 9:40 PM, Mauricio Garavaglia 
> <mauriciogaravag...@gmail.com> wrote:
> 
> Hi guys,
> 
> How can I manage the stdout/err log files generated by jobs in mesos? for 
> long running docker apps launched using marathon the log files can deplete 
> the disk of an agent, and using quotas makes the jobs to be killed which is 
> also not ideal. I'd like to have a way to rotate them. 
> 
> Is it correct to just go to the mesos agent workdir and go through each 
> sandbox stdout/err and rotate them? I know that could break the log UI but it 
> doesn't scale very well having logs of several of GB.
> 
> Thanks!


Old docker version deployed

2015-10-06 Thread Paul Wolfe
Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe






RE: Old docker version deployed

2015-10-06 Thread Paul Wolfe
No, different tags.

From: Rad Gruchalski [mailto:ra...@gruchalski.com]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed

Paul,

Are you using the same tag every time?

Kind regards,

Radek Gruchalski

ra...@gruchalski.com<mailto:ra...@gruchalski.com>
<mailto:ra...@gruchalski.com>
de.linkedin.com/in/radgruchalski/<http://de.linkedin.com/in/radgruchalski/>


On Tuesday, 6 October 2015 at 11:37, haosdent wrote:
You could see the stdout/stderr of your container from mesos webui.

On Tue, Oct 6, 2015 at 5:30 PM, Paul Wolfe 
<paul.wo...@imc.nl<mailto:paul.wo...@imc.nl>> wrote:


Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe








--
Best Regards,
Haosdent Huang






RE: Old docker version deployed

2015-10-06 Thread Paul Wolfe
I do see the stdout in the webgui, which is how I can confirm the old version 
is deployed.

What I need is some information about what version/tag of the image mesos is 
using.

From: haosdent [mailto:haosd...@gmail.com]
Sent: Tuesday, October 06, 2015 11:37 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed

You could see the stdout/stderr of your container from mesos webui.

On Tue, Oct 6, 2015 at 5:30 PM, Paul Wolfe 
<paul.wo...@imc.nl<mailto:paul.wo...@imc.nl>> wrote:

Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe







--
Best Regards,
Haosdent Huang





RE: Old docker version deployed

2015-10-06 Thread Paul Wolfe
My marathon deploy json:

{
 "type": "DOCKER",
  "volumes": [
{
  "containerPath": "/home/myapp /log",
  "hostPath": "/home",
  "mode": "RW"
}
  ],
  "docker": {
"image": "docker-registry:8080/myapp:86",
"network": "BRIDGE",
"portMappings": [
  {
"containerPort": 80,
"hostPort": 0,
"servicePort": 80,
"protocol": "tcp"
  }
],
"privileged": false,
"parameters": [],
"forcePullImage": false
  }
}


From: Paul Wolfe [mailto:paul.wo...@imc.nl]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org
Subject: RE: Old docker version deployed

No different tags.

From: Rad Gruchalski [mailto:ra...@gruchalski.com]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org<mailto:user@mesos.apache.org>
Subject: Re: Old docker version deployed

Paul,

Are you using the same tag every time?

Kind regards,

Radek Gruchalski

ra...@gruchalski.com<mailto:ra...@gruchalski.com>
<mailto:ra...@gruchalski.com>
de.linkedin.com/in/radgruchalski/<http://de.linkedin.com/in/radgruchalski/>


On Tuesday, 6 October 2015 at 11:37, haosdent wrote:
You could see the stdout/stderr of your container from mesos webui.

On Tue, Oct 6, 2015 at 5:30 PM, Paul Wolfe 
<paul.wo...@imc.nl<mailto:paul.wo...@imc.nl>> wrote:

Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe








--
Best Regards,
Haosdent Huang





RE: Old docker version deployed

2015-10-06 Thread Paul Wolfe
Fair enough, although if that were the case I would expect it to fail hard, not
randomly run an old image.

One thing I did notice was that on the master box, docker images misses the 
version that should have been deployed (ie has version 77 and 79, but no 78)

From: haosdent [mailto:haosd...@gmail.com]
Sent: Tuesday, October 06, 2015 11:52 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed

I don't think mesos logs the "version/tag of the image". When mesos starts a docker
container, it always uses your image name "docker-registry:8080/myapp:86" as the pull
and run parameter. I think maybe some machines have problems connecting to your
image registry.

On Tue, Oct 6, 2015 at 5:40 PM, Paul Wolfe 
<paul.wo...@imc.nl<mailto:paul.wo...@imc.nl>> wrote:
My marathon deploy json:

{
 "type": "DOCKER",
  "volumes": [
{
  "containerPath": "/home/myapp /log",
  "hostPath": "/home",
  "mode": "RW"
}
  ],
  "docker": {
"image": "docker-registry:8080/myapp:86",
"network": "BRIDGE",
"portMappings": [
  {
"containerPort": 80,
"hostPort": 0,
"servicePort": 80,
"protocol": "tcp"
  }
],
"privileged": false,
"parameters": [],
"forcePullImage": false
  }
}


From: Paul Wolfe [mailto:paul.wo...@imc.nl<mailto:paul.wo...@imc.nl>]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org<mailto:user@mesos.apache.org>
Subject: RE: Old docker version deployed

No different tags.

From: Rad Gruchalski [mailto:ra...@gruchalski.com]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org<mailto:user@mesos.apache.org>
Subject: Re: Old docker version deployed

Paul,

Are you using the same tag every time?

Kind regards,

Radek Gruchalski

ra...@gruchalski.com<mailto:ra...@gruchalski.com>
<mailto:ra...@gruchalski.com>
de.linkedin.com/in/radgruchalski/<http://de.linkedin.com/in/radgruchalski/>


On Tuesday, 6 October 2015 at 11:37, haosdent wrote:
You could see the stdout/stderr of your container from mesos webui.

On Tue, Oct 6, 2015 at 5:30 PM, Paul Wolfe 
<paul.wo...@imc.nl<mailto:paul.wo...@imc.nl>> wrote:

Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe








--
Best Regards,
Haosdent Huang





RE: Old docker version deployed

2015-10-06 Thread Paul Wolfe
Turns out it was a “bug” in docker. We found that running by hand the same tag 
(78) would randomly run version 18. It wouldn’t pull, even though the image
wasn’t in the cache.

Upgrading from docker 1.7.1 to 1.8.2 seems to solve it, dangerous problem 
though…
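
(One knob worth noting from the app definition earlier in the thread: with
"forcePullImage": false the agent trusts whatever the local daemon has cached for a
tag, whereas setting it to true forces a docker pull on every launch. That may or may
not have helped with this particular daemon bug, but it removes the local tag cache as
a variable.)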

From: Rad Gruchalski [mailto:ra...@gruchalski.com]
Sent: Tuesday, October 06, 2015 11:54 AM
To: user@mesos.apache.org
Subject: Re: Old docker version deployed

But if the image version is changed, this would fail, because the image is
neither available locally nor available from the registry.

Kind regards,

Radek Gruchalski

ra...@gruchalski.com<mailto:ra...@gruchalski.com>
<mailto:ra...@gruchalski.com>
de.linkedin.com/in/radgruchalski/<http://de.linkedin.com/in/radgruchalski/>


On Tuesday, 6 October 2015 at 11:51, haosdent wrote:
I don't think mesos logs the "version/tag of the image". When mesos starts a docker
container, it always uses your image name "docker-registry:8080/myapp:86" as the pull
and run parameter. I think maybe some machines have problems connecting to your
image registry.

On Tue, Oct 6, 2015 at 5:40 PM, Paul Wolfe 
<paul.wo...@imc.nl<mailto:paul.wo...@imc.nl>> wrote:


My marathon deploy json:



{

 "type": "DOCKER",

  "volumes": [

{

  "containerPath": "/home/myapp /log",

  "hostPath": "/home",

  "mode": "RW"

}

  ],

  "docker": {

"image": "docker-registry:8080/myapp:86",

"network": "BRIDGE",

"portMappings": [

  {

"containerPort": 80,

"hostPort": 0,

"servicePort": 80,

"protocol": "tcp"

  }

],

"privileged": false,

"parameters": [],

"forcePullImage": false

  }

}





From: Paul Wolfe [mailto:paul.wo...@imc.nl<mailto:paul.wo...@imc.nl>]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org<mailto:user@mesos.apache.org>
Subject: RE: Old docker version deployed



No different tags.



From: Rad Gruchalski [mailto:ra...@gruchalski.com]
Sent: Tuesday, October 06, 2015 11:39 AM
To: user@mesos.apache.org<mailto:user@mesos.apache.org>
Subject: Re: Old docker version deployed



Paul,



Are you using the same tag every time?

Kind regards,

Radek Gruchalski

ra...@gruchalski.com<mailto:ra...@gruchalski.com>
<mailto:ra...@gruchalski.com>
de.linkedin.com/in/radgruchalski/<http://de.linkedin.com/in/radgruchalski/>


On Tuesday, 6 October 2015 at 11:37, haosdent wrote:

You could see the stdout/stderr of your container from mesos webui.



On Tue, Oct 6, 2015 at 5:30 PM, Paul Wolfe 
<paul.wo...@imc.nl<mailto:paul.wo...@imc.nl>> wrote:

Hello all,



I'm new to this list, so please let me know if there is a better/more 
appropriate forum for this question.



We are currently experimenting with marathon and mesos for deploying a simple 
webapp.  We ship the app as a docker container.



Sporadically (ie 1 out of 100) we find an old version of the app is deployed.  
It is obvious from the logs and the appearance of the GUI that the version is 
old.  If I download and run the docker container locally, I see it is indeed 
the latest version of the code.  That leads me to believe that somewhere in the 
marathon deploy or the mesos running of the image, versions are getting 
confused.



I guess my first question is what additional information can I get from 
marathon or mesos logs to help diagnose? I've checked the mesos-SLAVE.* but 
haven't been able to garner anything interesting there.



Thanks for any help!

Paul Wolfe








Re: Securing executors

2015-10-06 Thread Paul Bell
Thanks, Alexander; I will check out the vid.

I kind of assumed that this port was used for exactly the purpose you
mention.

Is TLS a possibility here?
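
(For the record, a minimal host-firewall sketch along the lines suggested below; the
trusted subnet, SSH, and the 8443 service port are illustrative and need adapting to
the actual topology:

    iptables -A INPUT -i lo -j ACCEPT
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -s 10.4.0.0/16 -j ACCEPT       # other cluster hosts (masters/agents)
    iptables -A INPUT -p tcp --dport 22 -j ACCEPT    # admin SSH
    iptables -A INPUT -p tcp --dport 8443 -j ACCEPT  # externally published application port
    iptables -A INPUT -j DROP                        # everything else, incl. the executors' ephemeral ports
)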

-Paul

On Tue, Oct 6, 2015 at 8:15 AM, Alexander Rojas <alexan...@mesosphere.io>
wrote:

> Hi Paul,
>
> I can refer you to the talk given by Adam Bordelon at MesosCon
> https://www.youtube.com/watch?v=G3sn1OLYDOE
>
> If you want the short answer, the solution is to put a firewall around
> your cluster.
>
> On closer look at the port, it is the one used for message passing
> between the mesos-docker-executor and other mesos components.
>
>
> On 05 Oct 2015, at 19:04, Paul Bell <arach...@gmail.com> wrote:
>
> Hi All,
>
> I am running an nmap port scan on a Mesos agent node and noticed nmap
> reporting an open TCP port at 50577.
>
> Poking around some, I discovered exactly 5 mesos-docker-executor
> processes, one for each of my 5 Docker containers, and each with an open
> listen port:
>
> root 14131  3617  0 10:39 ?00:00:17 mesos-docker-executor
> --container=mesos-20151002-172703-2450482247-5050-3014-S0.5563c65a-e33e-4287-8ce4-b2aa8116aa95
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20151002-172703-2450482247-5050-3014-S0/frameworks/20151002-172703-2450482247-5050-3014-/executors/postgres.ea2954fd-6b6e-11e5-8bef-56847afe9799/runs/5563c65a-e33e-4287-8ce4-b2aa8116aa95
> --stop_timeout=15secs
>
> I suppose that all of this is unsurprising. But I know of at least one big
> customer who will without delay run Nmap or Nessus against my clustered
> deployment.
>
> So I am wondering what the best practices approach is to securing these
> open ports.
>
> Thanks for your help.
>
> -Paul
>
>
>
>
>


Securing executors

2015-10-05 Thread Paul Bell
Hi All,

I am running an nmap port scan on a Mesos agent node and noticed nmap
reporting an open TCP port at 50577.

Poking around some, I discovered exactly 5 mesos-docker-executor processes,
one for each of my 5 Docker containers, and each with an open listen port:

root 14131  3617  0 10:39 ?00:00:17 mesos-docker-executor
--container=mesos-20151002-172703-2450482247-5050-3014-S0.5563c65a-e33e-4287-8ce4-b2aa8116aa95
--docker=/usr/local/ecxmcc/weaveShim --help=false
--mapped_directory=/mnt/mesos/sandbox
--sandbox_directory=/tmp/mesos/slaves/20151002-172703-2450482247-5050-3014-S0/frameworks/20151002-172703-2450482247-5050-3014-/executors/postgres.ea2954fd-6b6e-11e5-8bef-56847afe9799/runs/5563c65a-e33e-4287-8ce4-b2aa8116aa95
--stop_timeout=15secs

I suppose that all of this is unsurprising. But I know of at least one big
customer who will without delay run Nmap or Nessus against my clustered
deployment.

So I am wondering what the best practices approach is to securing these
open ports.

Thanks for your help.

-Paul


Re: Changing mesos slave configuration

2015-09-23 Thread Paul Bell
Hi Pradeep,

Perhaps I am speaking to a slightly different point, but when I change
/etc/default/mesos-slave to add a new attribute, I have to remove the file
/tmp/mesos/meta/slaves/latest.

IIRC, mesos-slave itself, in failing to start after such a change, tells me
to do this:

rm -f /tmp/mesos/meta/slaves/latest


But I know of no way to make such configuration changes without downtime.
And I'd very much like it if Mesos supported such dynamic changes. I
suppose this would require that the agent consult its default file on
demand, rather than once at start-up.
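
(Concretely, the sequence I go through looks roughly like this - a sketch that assumes
the stock Ubuntu packaging, which exports MESOS_* variables from /etc/default/mesos-slave,
and the default /tmp/mesos work_dir; the attribute string is the one from my own setup:

    service mesos-slave stop
    # edit /etc/default/mesos-slave, e.g. MESOS_ATTRIBUTES="hostType:ecx,shard1"
    rm -f /tmp/mesos/meta/slaves/latest   # discard old agent info so the changed config is accepted
    service mesos-slave start
)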

Cordially,

Paul

On Wed, Sep 23, 2015 at 4:41 AM, Pradeep Chhetri <
pradeep.chhetr...@gmail.com> wrote:

> Hello all,
>
> I have often faced this problem that whenever i try to add some
> configuration parameter to mesos-slave or change any configuration (eg. add
> a new attribute in mesos-slave), the mesos slave doesn't come up on restart.
> I have to delete the slave.info file and then restart the slave but it
> ends up killing all the docker containers started using mesos.
>
> I was trying to figure out the best way to make such changes without
> making any downtime.
>
> Thank you.
>
> --
> Pradeep Chhetri
>


Re: Detecting slave crashes event

2015-09-16 Thread Paul Bell
Thank you, Benjamin.

So, I could periodically request the metrics endpoint, or stream the logs
(maybe via mesos.cli; or SSH)? What, roughly, does the "agent removed"
message look like in the logs?
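
(A minimal polling sketch of the metrics approach - /metrics/snapshot is the standard
master endpoint; the host and the metric-name filter are illustrative:

    curl -s http://master-host:5050/metrics/snapshot | python -m json.tool | grep slave_removals
)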

Are there plans to offer a mechanism for event subscription?

Cordially,

Paul



On Wed, Sep 16, 2015 at 1:30 PM, Benjamin Mahler <benjamin.mah...@gmail.com>
wrote:

> You can detect when we remove an agent due to health check failures via
> the metrics endpoint, but these are counters that are better used for
> alerting / dashboards for visibility. If you need to know which agents, you
> can also consume the logs as a stop-gap solution, until we offer a
> mechanism for subscribing to cluster events.
>
> On Wed, Sep 16, 2015 at 10:11 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am led to believe that, unlike Marathon, Mesos doesn't (yet?) offer a
>> subscribable event bus.
>>
>> So I am wondering if there's a best practices way of determining if a
>> slave node has crashed. By "crashed" I mean something like the power plug
>> got yanked, or anything that would cause Mesos to stop talking to the slave
>> node.
>>
>> I suppose such information would be recorded in /var/log/mesos.
>>
>> Interested to learn how best to detect this.
>>
>> Thank you.
>>
>> -Paul
>>
>
>


Re: Use "docker start" rather than "docker run"?

2015-09-01 Thread Paul Bell
Alex and Marco,

Thanks very much for your really helpful explanations.

For better or worse, neither C++ nor Python is my thing; Java's the go-to
language for me.

Cordially,

Paul

On Sat, Aug 29, 2015 at 5:23 AM, Marco Massenzio <ma...@mesosphere.io>
wrote:

> Hi Paul,
>
> +1 to what Alex/Tim say.
>
> Maybe a (simple) example will help: a very basic "framework" I created
> recently, does away with the "Executor" and only uses the "Scheduler",
> sending a CommandInfo structure to Mesos' Agent node to execute.
>
> See:
> https://github.com/massenz/mongo_fw/blob/develop/src/mongo_scheduler.cpp#L124
>
> If Python is more your thing, there are examples in the Mesos repository,
> or you can take a look at something I started recently to use the new
> (0.24) HTTP API (NOTE - this is still very much still WIP):
> https://github.com/massenz/zk-mesos/blob/develop/notebooks/HTTP%20API%20Tests.ipynb
>
> *Marco Massenzio*
>
> *Distributed Systems Engineerhttp://codetrips.com <http://codetrips.com>*
>
> On Fri, Aug 28, 2015 at 8:44 AM, Paul Bell <arach...@gmail.com> wrote:
>
>> Alex & Tim,
>>
>> Thank you both; most helpful.
>>
>> Alex, can you dispel my confusion on this point: I keep reading that a
>> "framework" in Mesos (e.g., Marathon) consists of a scheduler and an
>> executor. This reference to "executor" made me think that Marathon must
>> have *some* kind of presence on the slave node. But the more familiar I
>> become with Mesos the less likely this seems to me. So, what does it mean
>> to talk about the Marathon framework "executor"?
>>
>> Tim, I did come up with a simple work-around that involves re-copying the
>> needed file into the container each time the application is started. For
>> reasons unknown, this file is not kept in a location that would readily
>> lend itself to my use of persistent storage (Docker -v). That said, I am
>> keenly interested in learning how to write both custom executors &
>> schedulers. Any sense for what release of Mesos will see "persistent
>> volumes"?
>>
>> Thanks again, gents.
>>
>> -Paul
>>
>>
>>
>> On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen <t...@mesosphere.io> wrote:
>>
>>> Hi Paul,
>>>
>>> We don't [re]start a container since we assume once the task terminated
>>> the container is no longer reused. In Mesos to allow tasks to reuse the
>>> same executor and handle task logic accordingly people will opt to choose
>>> the custom executor route.
>>>
>>> We're working on a way to keep your sandbox data beyond a container
>>> lifecycle, which is called persistent volumes. We haven't integrated that
>>> with Docker containerizer yet, so you'll have to wait to use that feature.
>>>
>>> You could also choose to implement a custom executor for now if you like.
>>>
>>> Tim
>>>
>>> On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov <a...@mesosphere.com>
>>> wrote:
>>>
>>>> Paul,
>>>>
>>>> that component is called DockerContainerizer and it's part of Mesos
>>>> Agent (check
>>>> "/Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp"). @Tim,
>>>> could you answer the "docker start" vs. "docker run" question?
>>>>
>>>> On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell <arach...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I first posted this to the Marathon list, but someone suggested I try
>>>>> it here.
>>>>>
>>>>> I'm still not sure what component (mesos-master, mesos-slave,
>>>>> marathon) generates the "docker run" command that launches containers on a
>>>>> slave node. I suppose that it's the framework executor (Marathon) on the
>>>>> slave that actually executes the "docker run", but I'm not sure.
>>>>>
>>>>> What I'm really after is whether or not we can cause the use of
>>>>> "docker start" rather than "docker run".
>>>>>
>>>>> At issue here is some persistent data inside
>>>>> /var/lib/docker/aufs/mnt/<CTR_ID>. "docker run" will by design (re)launch
>>>>> my application with a different CTR_ID effectively rendering that data
>>>>> inaccessible. But "docker start" will restart the container and its "old"
>>>>> data will still be there.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -Paul
>>>>>
>>>>
>>>>
>>>
>>
>


Use docker start rather than docker run?

2015-08-28 Thread Paul Bell
Hi All,

I first posted this to the Marathon list, but someone suggested I try it
here.

I'm still not sure what component (mesos-master, mesos-slave, marathon)
generates the "docker run" command that launches containers on a slave
node. I suppose that it's the framework executor (Marathon) on the slave
that actually executes the "docker run", but I'm not sure.

What I'm really after is whether or not we can cause the use of "docker
start" rather than "docker run".

At issue here is some persistent data inside
/var/lib/docker/aufs/mnt/<CTR_ID>. "docker run" will by design (re)launch
my application with a different CTR_ID, effectively rendering that data
inaccessible. But "docker start" will restart the container and its "old"
data will still be there.

Thanks.

-Paul


Re: Use docker start rather than docker run?

2015-08-28 Thread Paul Bell
Alex & Tim,

Thank you both; most helpful.

Alex, can you dispel my confusion on this point: I keep reading that a
"framework" in Mesos (e.g., Marathon) consists of a scheduler and an
executor. This reference to "executor" made me think that Marathon must
have *some* kind of presence on the slave node. But the more familiar I
become with Mesos the less likely this seems to me. So, what does it mean
to talk about the Marathon framework "executor"?

Tim, I did come up with a simple work-around that involves re-copying the
needed file into the container each time the application is started. For
reasons unknown, this file is not kept in a location that would readily
lend itself to my use of persistent storage (Docker -v). That said, I am
keenly interested in learning how to write both custom executors &
schedulers. Any sense for what release of Mesos will see persistent
volumes?
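
For the record, the -v style work-around can also be expressed directly in the
Marathon app definition; a rough sketch (app id, image, and paths are made up;
field names per the Marathon REST API of roughly this vintage):

    # Sketch only: bind-mount a host path so container restarts see the same data.
    curl -X POST http://localhost:8080/v2/apps -H 'Content-Type: application/json' -d '{
      "id": "/my-app",
      "cpus": 0.5, "mem": 512, "instances": 1,
      "container": {
        "type": "DOCKER",
        "docker": { "image": "my-registry/my-app:latest", "network": "BRIDGE" },
        "volumes": [
          { "hostPath": "/var/data/my-app", "containerPath": "/data", "mode": "RW" }
        ]
      }
    }'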

Thanks again, gents.

-Paul



On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen t...@mesosphere.io wrote:

 Hi Paul,

 We don't [re]start a container since we assume once the task terminated
 the container is no longer reused. In Mesos to allow tasks to reuse the
 same executor and handle task logic accordingly people will opt to choose
 the custom executor route.

 We're working on a way to keep your sandbox data beyond a container
 lifecycle, which is called persistent volumes. We haven't integrated that
 with Docker containerizer yet, so you'll have to wait to use that feature.

 You could also choose to implement a custom executor for now if you like.

 Tim

 On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov a...@mesosphere.com
 wrote:

 Paul,

 that component is called DockerContainerizer and it's part of Mesos Agent
 (check /Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp).
 @Tim, could you answer the docker start vs. docker run question?

 On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell arach...@gmail.com wrote:

 Hi All,

 I first posted this to the Marathon list, but someone suggested I try it
 here.

 I'm still not sure what component (mesos-master, mesos-slave, marathon)
 generates the docker run command that launches containers on a slave
 node. I suppose that it's the framework executor (Marathon) on the slave
 that actually executes the docker run, but I'm not sure.

 What I'm really after is whether or not we can cause the use of docker
 start rather than docker run.

 At issue here is some persistent data inside
 /var/lib/docker/aufs/mnt/CTR_ID. docker run will by design (re)launch
 my application with a different CTR_ID effectively rendering that data
 inaccessible. But docker start will restart the container and its old
 data will still be there.

 Thanks.

 -Paul






Re: Can't start master properly (stale state issue?); help!

2015-08-14 Thread Paul Bell
All,

By way of some background: I'm not running a data center (or centers).
Rather, I work on a distributed application whose trajectory is taking it
into a realm of many Docker containers distributed across many hosts
(mostly virtual hosts at the outset). An environment that supports
isolation, multi-tenancy, scalability, and some fault tolerance is
desirable for this application. Also, the mere ability to simplify - at
least somewhat - the management of multiple hosts is of great importance.
So, that's more or less how I got to Mesos and to here...

I ended up writing a Java program that configures a collection of host VMs
as a Mesos cluster and then, via Marathon, distributes the application
containers across the cluster. Configuring & building the cluster is
largely a lot of SSH work. Doing the same for the application is part
Marathon, part Docker remote API. The containers that need to talk to each
other via TCP are connected with Weave's (http://weave.works) overlay
network. So the main infrastructure consists of Mesos, Docker, and Weave.
The whole thing is pretty amazing - for which I take very little credit.
Rather, these are some wonderful technologies, and the folks who write &
support them are very helpful. That said, I sometimes feel like I'm
juggling chain saws!

*In re* the issues raised on this thread:

All Mesos components were installed via the Mesosphere packages. The 4 VMs
in the cluster are all running Ubuntu 14.04 LTS.

My suspicions about the IP@ 127.0.1.1 were raised a few months ago when,
after seeing this IP in a mesos-master log when things weren't working, I
discovered these articles:


https://groups.google.com/forum/#!topic/marathon-framework/1qboeZTOLU4

*http://frankhinek.com/build-mesos-multi-node-ha-cluster/* (see note 2)


So, to the point raised just now by Klaus (and earlier in the thread), the
aforementioned configuration program does change /etc/hosts (and
/etc/hostname) in the way Klaus suggested. But, as I mentioned to Marco &
haosdent, I might have encountered a race condition wherein ZK &
mesos-master saw the unchanged /etc/hosts before I altered it. I believe
that I yesterday fixed that issue.
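
(A belt-and-braces option, offered only as a sketch: rather than relying on
/etc/hosts resolution, the master's advertised address can be pinned with the
--ip/--hostname flags; the flag-file layout below is the one the Mesosphere
packages support, and 71.100.14.9 is the master's address from the logs above.)

    # Sketch: pin the master's IP so registration never falls back to 127.0.1.1.
    echo '71.100.14.9' > /etc/mesos-master/ip
    echo '71.100.14.9' > /etc/mesos-master/hostname
    service mesos-master restart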

Also, as part of the cluster create step, I get a bit aggressive (perhaps
unwisely) with what I believe are some state repositories. Specifically, I

rm /var/lib/zookeeper/version-2/*
rm -Rf /var/lib/mesos/replicated_log

Should I NOT be doing this? I know from experience that zapping the
version-2 directory (ZK's dataDir, IIRC) can solve occasional
weirdness. Marco, is /var/lib/mesos/replicated_log what you are referring
to when you say "some issue with the log-replica"?

Just a day or two ago I first heard the term znode & learned a little
about zkCli.sh. I will experiment with it more in the coming days.
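
A minimal zkCli.sh session looks roughly like this (the /mesos path assumes the
default zk://.../mesos URL from the packaged configs; the zkCli.sh location is
the Ubuntu zookeeperd layout):

    # Inspect the znodes Mesos keeps in ZooKeeper.
    /usr/share/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181 ls /mesos
    /usr/share/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181 ls /mesos/log_replicas
    # Wipe Mesos' ZK state entirely -- only on a cluster being rebuilt from scratch.
    /usr/share/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181 rmr /mesos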

As matters now stand, I have the cluster up and running. But before I again
deploy the application, I am trying to put the cluster through its paces by
periodically cycling it through the states my program can bring about,
e.g.,

--cluster create (takes a clean VM and configures it to act as one
or more Mesos components: ZK, master, slave)
--cluster stop      (stops the Mesos services on each node)
--cluster destroy   (configures the VM back to its original clean state)
--cluster create
--cluster stop
--cluster start


et cetera.

*The only way I got rid of the "no leading master" issue that started this
thread was by wiping out the master VM and starting over with a clean VM.
That is, stopping/destroying/creating (even rebooting) the cluster had no
effect.*

I suspect that, sooner or later, I will again hit this problem (probably
sooner!). And I want to understand how best to handle it. Such an
occurrence could be pretty awkward at a customer site.

Thanks for all your help.

Cordially,

Paul


On Thu, Aug 13, 2015 at 9:41 PM, Klaus Ma kl...@cguru.net wrote:

 I used to meet a similar issue with Zookeeper + Messo; I resolved it by
 remove 127.0.1.1 from /etc/hosts; here is an example:
 klaus@klaus-OptiPlex-780:~/Workspace/mesos$ cat /etc/hosts
 127.0.0.1   localhost
 127.0.1.1   klaus-OptiPlex-780   *<= remove this line, and add a new
 line mapping the IP (e.g. 192.168.1.100) to the hostname*
 ...

 BTW, please also clear up the log directory and restart ZK & Mesos.

 If any more comments, please let me know.

 Regards,
 
 Klaus Ma (马达), PMP® | http://www.cguru.net

 --
 Date: Thu, 13 Aug 2015 12:20:34 -0700
 Subject: Re: Can't start master properly (stale state issue?); help!
 From: ma...@mesosphere.io
 To: user@mesos.apache.org



 On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell arach...@gmail.com

Can't start master properly (stale state issue?); help!

2015-08-13 Thread Paul Bell
Hi All,

I hope someone can shed some light on this because I'm getting desperate!

I try to start components zk, mesos-master, and marathon in that order.
They are started via a program that SSHs to the sole host and does service
xxx start. Everyone starts happily enough. But the Mesos UI shows me:

*This master is not the leader, redirecting in 0 seconds ... go now*

The pattern seen in all of the mesos-master.INFO logs (one of which is shown
below) is that the mesos-master with the correct IP@ starts. But then a
new leader is detected and becomes leading master. This new leader shows
UPID *master@127.0.1.1:5050*.

I've tried clearing what ZK and mesos-master state I can find, but this
problem will not go away.

Would someone be so kind as to a) explain what is happening here and b)
suggest remedies?

Thanks very much.

-Paul


Log file created at: 2015/08/13 10:19:43
Running on machine: 71.100.14.9
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0813 10:19:43.225636  2542 logging.cpp:172] INFO level logging started!
I0813 10:19:43.235213  2542 main.cpp:181] Build: 2015-05-05 06:15:50 by root
I0813 10:19:43.235244  2542 main.cpp:183] Version: 0.22.1
I0813 10:19:43.235257  2542 main.cpp:186] Git tag: 0.22.1
I0813 10:19:43.235268  2542 main.cpp:190] Git SHA:
d6309f92a7f9af3ab61a878403e3d9c284ea87e0
I0813 10:19:43.245098  2542 leveldb.cpp:176] Opened db in 9.386828ms
I0813 10:19:43.247138  2542 leveldb.cpp:183] Compacted db in 1.956669ms
I0813 10:19:43.247194  2542 leveldb.cpp:198] Created db iterator in 13961ns
I0813 10:19:43.247206  2542 leveldb.cpp:204] Seeked to beginning of db in
677ns
I0813 10:19:43.247215  2542 leveldb.cpp:273] Iterated through 0 keys in the
db in 243ns
I0813 10:19:43.247252  2542 replica.cpp:744] Replica recovered with log
positions 0 - 0 with 1 holes and 0 unlearned
I0813 10:19:43.248755  2611 log.cpp:238] Attempting to join replica to
ZooKeeper group
I0813 10:19:43.248924  2542 main.cpp:306] Starting Mesos master
I0813 10:19:43.249244  2612 recover.cpp:449] Starting replica recovery
I0813 10:19:43.250239  2612 recover.cpp:475] Replica is in EMPTY status
I0813 10:19:43.250819  2612 replica.cpp:641] Replica in EMPTY status
received a broadcasted recover request
I0813 10:19:43.251014  2607 recover.cpp:195] Received a recover response
from a replica in EMPTY status
*I0813 10:19:43.249503  2542 master.cpp:349] Master
20150813-101943-151938119-5050-2542 (71.100.14.9) started on
71.100.14.9:5050*
I0813 10:19:43.252053  2610 recover.cpp:566] Updating replica status to
STARTING
I0813 10:19:43.252571  2542 master.cpp:397] Master allowing unauthenticated
frameworks to register
I0813 10:19:43.253159  2542 master.cpp:402] Master allowing unauthenticated
slaves to register
I0813 10:19:43.254276  2612 leveldb.cpp:306] Persisting metadata (8 bytes)
to leveldb took 1.816161ms
I0813 10:19:43.254323  2612 replica.cpp:323] Persisted replica status to
STARTING
I0813 10:19:43.254905  2612 recover.cpp:475] Replica is in STARTING status
I0813 10:19:43.255203  2612 replica.cpp:641] Replica in STARTING status
received a broadcasted recover request
I0813 10:19:43.255265  2612 recover.cpp:195] Received a recover response
from a replica in STARTING status
I0813 10:19:43.255343  2612 recover.cpp:566] Updating replica status to
VOTING
I0813 10:19:43.258730  2611 master.cpp:1295] Successfully attached file
'/var/log/mesos/mesos-master.INFO'
I0813 10:19:43.258760  2609 contender.cpp:131] Joining the ZK group
I0813 10:19:43.258862  2612 leveldb.cpp:306] Persisting metadata (8 bytes)
to leveldb took 3.477458ms
I0813 10:19:43.258894  2612 replica.cpp:323] Persisted replica status to
VOTING
I0813 10:19:43.258934  2612 recover.cpp:580] Successfully joined the Paxos
group
I0813 10:19:43.258987  2612 recover.cpp:464] Recover process terminated
I0813 10:19:46.590340  2606 group.cpp:313] Group process (group(1)@
71.100.14.9:5050) connected to ZooKeeper
I0813 10:19:46.590373  2606 group.cpp:790] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I0813 10:19:46.590386  2606 group.cpp:385] Trying to create path
'/mesos/log_replicas' in ZooKeeper
I0813 10:19:46.591442  2606 network.hpp:424] ZooKeeper group memberships
changed
I0813 10:19:46.591514  2606 group.cpp:659] Trying to get
'/mesos/log_replicas/00' in ZooKeeper
I0813 10:19:46.592146  2606 group.cpp:659] Trying to get
'/mesos/log_replicas/01' in ZooKeeper
I0813 10:19:46.593128  2608 network.hpp:466] ZooKeeper group PIDs: {
log-replica(1)@127.0.1.1:5050 }
I0813 10:19:46.593955  2608 group.cpp:313] Group process (group(2)@
71.100.14.9:5050) connected to ZooKeeper
I0813 10:19:46.593977  2608 group.cpp:790] Syncing group operations: queue
size (joins, cancels, datas) = (1, 0, 0)
I0813 10:19:46.593986  2608 group.cpp:385] Trying to create path
'/mesos/log_replicas' in ZooKeeper
I0813 10:19:46.594894  2605 group.cpp:313] Group process (group(3)@
71.100.14.9:5050

Re: Can't start master properly (stale state issue?); help!

2015-08-13 Thread Paul Bell
Marco & haosdent,

This is just a quick note to say thank you for your replies.

I will answer you much more fully tomorrow, but for now can only manage a
few quick observations & questions:

1. Having some months ago encountered a known problem with the IP@
127.0.1.1 (I'll provide references tomorrow), I early on configured
/etc/hosts, replacing myHostName 127.0.1.1 with myHostName Real_IP.
That said, I can't rule out a race condition whereby ZK | mesos-master saw
the original unchanged /etc/hosts before I zapped it.

2. What is a znode and how would I drop it?

I start the services (zk, master, marathon; all on same host) by SSHing
into the host & doing service start commands.
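
(Roughly, that amounts to the following; the host name is a placeholder and
the service names are those of the Mesosphere/zookeeperd Ubuntu packages:)

    # Start order used here: ZK first, then the master, then Marathon.
    ssh root@mesos-host 'service zookeeper start'
    ssh root@mesos-host 'service mesos-master start'
    ssh root@mesos-host 'service marathon start'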

Again, thanks very much; and more tomorrow.

Cordially,

Paul

On Thu, Aug 13, 2015 at 1:08 PM, haosdent haosd...@gmail.com wrote:

 Hello, how do you start the master? And could you try using netstat -antp|grep
 5050 to find whether there are multiple master processes running on the same
 machine?

 On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell arach...@gmail.com wrote:

 Hi All,

 I hope someone can shed some light on this because I'm getting desperate!

 I try to start components zk, mesos-master, and marathon in that order.
 They are started via a program that SSHs to the sole host and does service
 xxx start. Everyone starts happily enough. But the Mesos UI shows me:

 *This master is not the leader, redirecting in 0 seconds ... go now*

 The pattern seen in all of the mesos-master.INFO logs (one of which is shown
 below) is that the mesos-master with the correct IP@ starts. But then a
 new leader is detected and becomes leading master. This new leader shows
 UPID *master@127.0.1.1:5050*.

 I've tried clearing what ZK and mesos-master state I can find, but this
 problem will not go away.

 Would someone be so kind as to a) explain what is happening here and b)
 suggest remedies?

 Thanks very much.

 -Paul


 Log file created at: 2015/08/13 10:19:43
 Running on machine: 71.100.14.9
 Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
 I0813 10:19:43.225636  2542 logging.cpp:172] INFO level logging started!
 I0813 10:19:43.235213  2542 main.cpp:181] Build: 2015-05-05 06:15:50 by
 root
 I0813 10:19:43.235244  2542 main.cpp:183] Version: 0.22.1
 I0813 10:19:43.235257  2542 main.cpp:186] Git tag: 0.22.1
 I0813 10:19:43.235268  2542 main.cpp:190] Git SHA:
 d6309f92a7f9af3ab61a878403e3d9c284ea87e0
 I0813 10:19:43.245098  2542 leveldb.cpp:176] Opened db in 9.386828ms
 I0813 10:19:43.247138  2542 leveldb.cpp:183] Compacted db in 1.956669ms
 I0813 10:19:43.247194  2542 leveldb.cpp:198] Created db iterator in
 13961ns
 I0813 10:19:43.247206  2542 leveldb.cpp:204] Seeked to beginning of db in
 677ns
 I0813 10:19:43.247215  2542 leveldb.cpp:273] Iterated through 0 keys in
 the db in 243ns
 I0813 10:19:43.247252  2542 replica.cpp:744] Replica recovered with log
 positions 0 - 0 with 1 holes and 0 unlearned
 I0813 10:19:43.248755  2611 log.cpp:238] Attempting to join replica to
 ZooKeeper group
 I0813 10:19:43.248924  2542 main.cpp:306] Starting Mesos master
 I0813 10:19:43.249244  2612 recover.cpp:449] Starting replica recovery
 I0813 10:19:43.250239  2612 recover.cpp:475] Replica is in EMPTY status
 I0813 10:19:43.250819  2612 replica.cpp:641] Replica in EMPTY status
 received a broadcasted recover request
 I0813 10:19:43.251014  2607 recover.cpp:195] Received a recover response
 from a replica in EMPTY status
 *I0813 10:19:43.249503  2542 master.cpp:349] Master
 20150813-101943-151938119-5050-2542 (71.100.14.9) started on
 71.100.14.9:5050*
 I0813 10:19:43.252053  2610 recover.cpp:566] Updating replica status to
 STARTING
 I0813 10:19:43.252571  2542 master.cpp:397] Master allowing
 unauthenticated frameworks to register
 I0813 10:19:43.253159  2542 master.cpp:402] Master allowing
 unauthenticated slaves to register
 I0813 10:19:43.254276  2612 leveldb.cpp:306] Persisting metadata (8
 bytes) to leveldb took 1.816161ms
 I0813 10:19:43.254323  2612 replica.cpp:323] Persisted replica status to
 STARTING
 I0813 10:19:43.254905  2612 recover.cpp:475] Replica is in STARTING status
 I0813 10:19:43.255203  2612 replica.cpp:641] Replica in STARTING status
 received a broadcasted recover request
 I0813 10:19:43.255265  2612 recover.cpp:195] Received a recover response
 from a replica in STARTING status
 I0813 10:19:43.255343  2612 recover.cpp:566] Updating replica status to
 VOTING
 I0813 10:19:43.258730  2611 master.cpp:1295] Successfully attached file
 '/var/log/mesos/mesos-master.INFO'
 I0813 10:19:43.258760  2609 contender.cpp:131] Joining the ZK group
 I0813 10:19:43.258862  2612 leveldb.cpp:306] Persisting metadata (8
 bytes) to leveldb took 3.477458ms
 I0813 10:19:43.258894  2612 replica.cpp:323] Persisted replica status to
 VOTING
 I0813 10:19:43.258934  2612 recover.cpp:580] Successfully joined the
 Paxos group
 I0813 10:19:43.258987  2612 recover.cpp:464] Recover process

Re: Custom flags to docker run

2015-08-12 Thread Paul Bell
Hi Stephen,

Via Marathon I am deploying Docker containers across a Mesos cluster. The
containers have unique Weave IP@s allowing inter-container communication.
All things considered, getting to this point has been relatively
straight-forward, and Weave has been one of the IJW components.

I'd be curious to learn why you're finding Weave messy.

If you'd like to take it out-of-band (as it were), please feel free to
e-mail me directly.

Cordially,

Paul

On Wed, Aug 12, 2015 at 3:16 AM, Stephen Knight skni...@pivotal.io wrote:

 Hi,

 Is there a way to pass a custom flag to docker run through the Marathon
 API? I've not seen anything in the documentation; this could just be a
 basic reading fail on my part. What I want to do is use Calico (or similar)
 with Docker and provision containers via Marathon.

 Weave is messy for what I am trying to achieve and the integration isn't
 going as planned, is there a better option and how can you then integrate
 it? Does that flexibility exist in the Marathon API?

 Thx
 Stephen
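
(On the custom-flags part of the question: Marathon releases of roughly this
vintage accept, if memory serves, a "parameters" list in the docker section of
the app definition, which is passed through as extra "docker run" flags. A
rough sketch with made-up values:)

    # Sketch: extra "docker run" flags via docker.parameters.
    curl -X POST http://localhost:8080/v2/apps -H 'Content-Type: application/json' -d '{
      "id": "/custom-flags-demo",
      "cpus": 0.25, "mem": 128,
      "cmd": "sleep 3600",
      "container": {
        "type": "DOCKER",
        "docker": {
          "image": "busybox",
          "parameters": [
            { "key": "label", "value": "role=demo" },
            { "key": "dns",   "value": "8.8.8.8" }
          ]
        }
      }
    }'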



Re: [VOTE] Release Apache Mesos 0.23.0 (rc1)

2015-07-07 Thread Paul Brett
-1 (non-binding) Network isolator will not compile.
https://issues.apache.org/jira/browse/MESOS-3002


On Tue, Jul 7, 2015 at 11:38 AM, Alexander Rojas alexan...@mesosphere.io
wrote:

 +1 (non-binding)

 Ubuntu Server 15.04 gcc 4.9.2 and clang 3.6.0

 OS X Yosemite clang Apple LLVM based on 3.6.0


 On 06 Jul 2015, at 21:14, Jörg Schad jo...@mesosphere.io wrote:

 After more testing:
 -1 (non-binding)
 Docker tests failing on CentOS Linux release 7.1.1503 (Core) , Tim is
 already on the issue (see MESOS-2996)


 On Mon, Jul 6, 2015 at 8:59 PM, Kapil Arya ka...@mesosphere.io wrote:

 +1 (non-binding)

 OpenSUSE Tumbleweed, Linux 4.0.3 / gcc 4.8.3

 On Mon, Jul 6, 2015 at 2:33 PM, Ben Whitehead 
 ben.whiteh...@mesosphere.io wrote:

 +1 (non-binding)

 openSUSE 13.2 Linux 3.16.7 / gcc-4.8.3
 Tested running Marathon 0.9.0-RC3 and Cassandra on Mesos 0.1.1-SNAPSHOT.

 On Mon, Jul 6, 2015 at 6:57 AM, Till Toenshoff toensh...@me.com wrote:

 Even though Alex has IMHO already “busted” this vote ;) .. THANKS ALEX!
 … ,
 here are my results.

 +1

 OS 10.10.4 (14E46) + Apple LLVM version 6.1.0 (clang-602.0.53) (based
 on LLVM 3.6.0svn), make check - OK
 Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-32-generic x86_64) + gcc (Ubuntu
 4.8.2-19ubuntu1) 4.8.2, make check - OK




 On Jul 6, 2015, at 3:22 PM, Alex Rukletsov a...@mesosphere.com wrote:

 -1

 Compilation error on Mac OS 10.10.4 with clang 3.5, which is supported
 according to release notes.
 More details: https://issues.apache.org/jira/browse/MESOS-2991

 On Mon, Jul 6, 2015 at 11:55 AM, Jörg Schad jo...@mesosphere.io
 wrote:

 P.S. to my prior +1
 Tested on ubuntu-trusty-14.04 including docker.

 On Sun, Jul 5, 2015 at 6:44 PM, Jörg Schad jo...@mesosphere.io
 wrote:

 +1

 On Sun, Jul 5, 2015 at 4:36 PM, Nikolaos Ballas neXus 
 nikolaos.bal...@nexusgroup.com wrote:

  +1



  Sent from my Samsung device


  Original message 
 From: tommy xiao xia...@gmail.com
 Date: 05/07/2015 15:14 (GMT+01:00)
 To: user@mesos.apache.org
 Subject: Re: [VOTE] Release Apache Mesos 0.23.0 (rc1)

  +1

 2015-07-04 12:32 GMT+08:00 Weitao zhouwtl...@gmail.com:

  +1

 Sent from my iPhone

 On Jul 4, 2015, at 09:41, Marco Massenzio ma...@mesosphere.io wrote:

   +1

  *Marco Massenzio*
 *Distributed Systems Engineer*

 On Fri, Jul 3, 2015 at 12:25 PM, Adam Bordelon a...@mesosphere.io
 wrote:

 Hello Mesos community,

 Please vote on releasing the following candidate as Apache Mesos
 0.23.0.

 0.23.0 includes the following:

 
  - Per-container network isolation
 - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+.
 - Dockerized slaves will properly recover Docker containers upon
 failover.

 as well as experimental support for:
  - Fetcher Caching
  - Revocable Resources
  - SSL encryption
  - Persistent Volumes
  - Dynamic Reservations

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc1

 

 The candidate for Mesos 0.23.0 release is available at:

 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz

 The tag to be voted on is 0.23.0-rc1:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc1

 The MD5 checksum of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz.md5

 The signature of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz.asc

 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS

 The JAR is up in Maven in a staging repository here:

 https://repository.apache.org/content/repositories/orgapachemesos-1056

 Please vote on releasing this package as Apache Mesos 0.23.0!

 The vote is open until Fri July 10th, 12:00 PDT 2015 and passes if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Mesos 0.23.0
 [ ] -1 Do not release this package because ...

 Thanks,
  -Adam-





  --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com












-- 
-- Paul Brett


Re: [DISCUSS] Renaming Mesos Slave

2015-06-02 Thread Paul Brett
-1 for the name change.

The master/slave terms in Mesos accurately describe the relationship
between the components using common engineering terms that predate modern
computing.

Human slavery is an abomination, but then so is murder. Would you have us
eliminate all references to "kill" in the code?

-- Paul

On Tue, Jun 2, 2015 at 12:53 PM, haosdent haosd...@gmail.com wrote:

 Hi Adam,

 1. Mesos Worker
 2. Mesos Worker
 3. No
 4. Carefully. Should take care the compatible when upgrade.

 On Wed, Jun 3, 2015 at 2:50 AM, Dave Lester d...@davelester.org wrote:

  Hi Adam,

 I've been using Master/Worker in presentations for the past 9 months and
 it hasn't led to any confusion.

 1. Mesos worker
 2. Mesos worker
 3. No
 4. Documentation, then API with a full deprecation cycle

 Dave

 On Mon, Jun 1, 2015, at 02:18 PM, Adam Bordelon wrote:

 There has been much discussion about finding a less offensive name than
 Slave, and many of these thoughts have been captured in
 https://issues.apache.org/jira/browse/MESOS-1478

 I would like to open up the discussion on this topic for one week, and if
 we cannot arrive at a lazy consensus, I will draft a proposal from the
 discussion and call for a VOTE.
 Here are the questions I would like us to answer:
 1. What should we call the Mesos Slave node/host/machine?
 2. What should we call the mesos-slave process (could be the same)?
 3. Do we need to rename Mesos Master too?

 Another topic worth discussing is the deprecation process, but we don't
 necessarily need to decide on that at the same time as deciding the new
 name(s).
 4. How will we phase in the new name and phase out the old name?

 Please voice your thoughts and opinions below.

 Thanks!
 -Adam-

 P.S. My personal thoughts:
 1. Mesos Worker [Node]
 2. Mesos Worker or Agent
 3. No
 4. Carefully






 --
 Best Regards,
 Haosdent Huang




-- 
-- Paul Brett


Re: New cologne based user group - mesos-user-group-cologne

2015-03-30 Thread Paul Otto
Finally, a reason to travel to Deutschland! ;) Good luck with the new MUG!

Paul
On Mar 29, 2015 1:00 PM, Marc Zimmermann marc.zimmerm...@mmbash.de
wrote:

 We’ve started a new users group in Cologne called Mesos-User-Group-Cologne
 - please add us to your list!

 http://www.meetup.com/Mesos-User-Group-Cologne/

 Thanks, Marc

 --
 Dipl.-Inf.
 Marc Zimmermann

 
  mmbash UG (haftungsbeschränkt)
 Friedenstraße 22, 50676 Köln, Germany

 http://mmbash.de
 marc.zimmerm...@mmbash.de

 Geschäftsführer: Mike Michel, Marc Zimmermann
 Amtsgericht Köln, HRB 73562




Re: [RESULT][VOTE] Release Apache Mesos 0.22.0 (rc4)

2015-03-24 Thread Paul Otto
This is awesome! Thanks for all the hard work you all have put into this! I
am really excited to update to the latest stable version of Apache Mesos!

Regards,
Paul


Paul Otto
Principal DevOps Architect, Co-founder
Otto Ops LLC | *OttoOps.com http://OttoOps.com*
970.343.4561 office
720.381.2383 cell

On Tue, Mar 24, 2015 at 6:04 PM, Niklas Nielsen nik...@mesosphere.io
wrote:

 Hi all,

 The vote for Mesos 0.22.0 (rc4) has passed with the
 following votes.

 +1 (Binding)
 --
 Ben Mahler
 Tim St Clair
 Adam Bordelon
 Brenden Matthews

 +1 (Non-binding)
 --
 Alex Rukletsov
 Craig W
 Ben Whitehead
 Elizabeth Lingg
 Dario Rexin
 Jeff Schroeder
 Michael Park
 Alexander Rojas
 Andrew Langhorn

 There were no 0 or -1 votes.

 Please find the release at:
 https://dist.apache.org/repos/dist/release/mesos/0.22.0

 It is recommended to use a mirror to download the release:
 http://www.apache.org/dyn/closer.cgi

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.22.0

 The mesos-0.22.0.jar has been released to:
 https://repository.apache.org

 The website (http://mesos.apache.org) will be updated shortly to reflect
 this release.

 Thanks,
 Niklas



Re: Open call to be incl on Mesos support and services list

2014-07-25 Thread Paul Otto
Hi Dave,

I would be interested in having Otto Ops LLC be added to that list. We have
been building a Mesos + Marathon + Docker infrastructure for Time Warner
Cable, and would be very interested in doing more with the community.

Regards,
Paul Otto

-- 
Paul Otto
Principal DevOps Engineer, Owner
Otto Ops LLC | *OttoOps.com http://ottoops.com/*
970.343.4561 office
720.381.2383 cell

On Thu, Jul 24, 2014 at 4:07 PM, Dave Lester daveles...@gmail.com wrote:

 Hi All,

 I wanted to revisit a previous thread
 http://markmail.org/message/o3nlnihmwqtgsm7d where I suggested that we
 add a section to the Mesos website to list companies that provide Mesos
 services and development. At that time, we heard interest from:

 * Grand Logic
 * Mesosphere
 * Big Data Open Source Security LLC

 I've created a JIRA ticket (MESOS-1638
 https://issues.apache.org/jira/browse/MESOS-1638) to track this; feel
 free to comment there or use this thread for discussion should there be any
 questions or comments.

 Best,
 Dave




-- 
Paul Otto
Principal DevOps Engineer
Otto Ops LLC | *OttoOps.com http://OttoOps.com*
970.343.4561 office
720.381.2383 cell


process isolation

2013-10-21 Thread Paul Mackles
Hi - I just wanted to confirm my understanding of something... with process
isolation, Mesos will not do anything if a given executor exceeds its
resource allocation. In other words, if I accept a resource with 1GB of
memory and then my executor uses 3GB, Mesos won't detect that the process
exceeded its allocation and kill the process. For that, you need to enable
cgroups at which point allocation limits are enforced by the OS. Did I get
that right?
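
For what it's worth, a sketch of turning enforcement on (the exact flag value
varies by release -- around 0.14 it was --isolation=cgroups, later split into
cgroups/cpu,cgroups/mem -- and zk-host is a placeholder):

    # With --isolation=process the limits are advisory; with cgroups isolation
    # they are enforced by the kernel (the task gets OOM-killed on overrun).
    mesos-slave --master=zk://zk-host:2181/mesos --isolation=cgroups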

-- 
Thanks,
Paul


resource revocation and long-running task

2013-10-10 Thread Paul Mackles
Hi - I was re-reading the mesos technical paper. Particularly sections
3.3.1 and 4.3. I am currently running mesos-0.14.0.rc4 and I was wondering
how much of what is discussed in those sections is actually implemented?
Specifically, I don't see any way to allocate slots for long-running vs.
short-running tasks. I also haven't seen any configuration related to
resource revocation. Am I missing something?

-- 
Thanks,
Paul