Re: Feature request: move in-flight containers w/o stopping them

2016-02-18 Thread Avinash Sridharan
One problem with implementing something like vMotion for Mesos is addressing
the seamless movement of network connectivity as well. This effectively
requires moving the IP address of the container across hosts. If the
container shares the host's network stack, this won't be possible, since it
would imply moving the host IP address from one host to another. When a
container has its own network namespace attached to the host via a bridge,
moving across L2 segments might be possible. To move across L3 segments you
will need some form of overlay (VXLAN, maybe?).
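
As a very rough illustration of the overlay idea, stretching the container
bridge across two hosts with VXLAN boils down to something like the
following (interface names, VNI, and addresses are made up; this is a
sketch, not a recommendation):

    import subprocess

    def run(cmd):
        # Run an ip(8) command and fail loudly if it errors.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Hypothetical values; adjust to your environment.
    VNI = "42"                # VXLAN network identifier, same on both hosts
    PHYS_DEV = "eth0"         # physical NIC carrying the overlay traffic
    REMOTE = "10.0.0.2"       # the other host's underlay IP address
    BRIDGE = "mesos-br0"      # bridge the container veth pairs attach to
    VTEP = "vxlan" + VNI      # name of the tunnel endpoint interface

    # Create the VXLAN tunnel endpoint on this host.
    run(["ip", "link", "add", VTEP, "type", "vxlan",
         "id", VNI, "dev", PHYS_DEV, "remote", REMOTE, "dstport", "4789"])

    # Plug it into the bridge the containers already use, so the container
    # subnet is stretched across the two L3-separated hosts.
    run(["ip", "link", "set", VTEP, "master", BRIDGE])
    run(["ip", "link", "set", VTEP, "up"])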

On Thu, Feb 18, 2016 at 7:34 PM, Jay Taylor  wrote:

> Is this theoretically feasible with Linux checkpoint and restore, perhaps
> via CRIU? http://criu.org/Main_Page
>
> On Feb 18, 2016, at 4:35 AM, Paul Bell  wrote:
>
> Hello All,
>
> Has there ever been any consideration of the ability to move in-flight
> containers from one Mesos host node to another?
>
> I see this as analogous to VMware's "vMotion" facility wherein VMs can be
> moved from one ESXi host to another.
>
> I suppose something like this could be useful from a load-balancing
> perspective.
>
> Just curious whether this has ever been considered and, if it was
> considered but rejected, why it was rejected.
>
> Thanks.
>
> -Paul
>
>
>


-- 
Avinash Sridharan, Mesosphere
+1 (323) 702 5245


Re: Feature request: move in-flight containers w/o stopping them

2016-02-18 Thread Jay Taylor
Is this theoretically feasible with Linux checkpoint and restore, perhaps via 
CRIU? http://criu.org/Main_Page
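
For what it's worth, a bare-bones CRIU round trip looks roughly like the
commands below (PID, image directory, and flags are illustrative; wiring
this into the Mesos containerizer, and moving the filesystem and IP along
with it, is the hard part):

    import subprocess

    PID = "12345"            # PID of the containerized process tree (hypothetical)
    IMG_DIR = "/tmp/ckpt"    # directory that will hold the checkpoint images

    # Checkpoint: freeze the process tree and dump its state to disk.
    subprocess.run(["criu", "dump", "-t", PID, "-D", IMG_DIR,
                    "--tcp-established", "--shell-job"], check=True)

    # ... ship IMG_DIR (and the container filesystem) to the destination host ...

    # Restore: rebuild the process tree from the images on the new host.
    subprocess.run(["criu", "restore", "-D", IMG_DIR,
                    "--tcp-established", "--shell-job"], check=True)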

> On Feb 18, 2016, at 4:35 AM, Paul Bell  wrote:
> 
> Hello All,
> 
> Has there ever been any consideration of the ability to move in-flight 
> containers from one Mesos host node to another?
> 
> I see this as analogous to VMware's "vMotion" facility wherein VMs can be 
> moved from one ESXi host to another.
> 
> I suppose something like this could be useful from a load-balancing 
> perspective.
> 
> Just curious whether this has ever been considered and, if it was
> considered but rejected, why it was rejected.
> 
> Thanks.
> 
> -Paul
> 
> 


Re: Mesos sometimes not allocating the entire cluster

2016-02-18 Thread Guangya Liu
Hi Tom,

After applying the patch, you only need to restart the Mesos master; there
is no need to restart the frameworks.

One question: from your log it seems your cluster has at least 36 agents,
right? I ask because if there are more frameworks than agents, frameworks
with a low weight may sometimes not be able to get resources.
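
To make the weight point concrete: the DRF sorter essentially orders
clients by their weighted dominant share and offers resources to the lowest
first. A simplified sketch (illustrative numbers, not the actual Mesos
code):

    # Simplified illustration of DRF ordering; the real sorter in Mesos
    # handles reservations, revocable resources, and much more.
    TOTAL = {"cpus": 100.0, "mem": 400.0}   # hypothetical cluster totals

    def dominant_share(allocated, weight=1.0):
        # Dominant share = max over resources of allocated/total, scaled by weight.
        return max(allocated[r] / TOTAL[r] for r in TOTAL) / weight

    # (allocation, weight) per client; numbers are made up.
    clients = {
        "framework-1": ({"cpus": 13.0, "mem": 40.0}, 1.0),
        "framework-2": ({"cpus": 22.0, "mem": 90.0}, 1.0),
        "framework-3": ({"cpus": 4.0, "mem": 10.0}, 0.1),  # low weight inflates its share
    }

    # Offers go to the client with the smallest weighted dominant share first.
    order = sorted(clients, key=lambda c: dominant_share(*clients[c]))
    print(order)   # framework-3 sorts last despite its small allocation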

Could you please enable GLOG_v=2 on the Mesos master for a short while and
put the log somewhere for us to check? (Do not leave it enabled for long,
as the logs will be flooded.) These messages may help with diagnosing your
problem.
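
For reference, two ways to raise verbosity for a short window (treat the
endpoint and values below as a sketch and adjust to your setup):

    # Option 1: restart the master with verbose glog output enabled:
    #
    #     GLOG_v=2 mesos-master --work_dir=/var/lib/mesos ...
    #
    # Option 2 (no restart): libprocess exposes a /logging/toggle endpoint
    # that raises verbosity for a bounded duration. Assumption: it is
    # available on your build; the parameters shown are illustrative.
    import urllib.parse
    import urllib.request

    MASTER = "http://mesos-master.example.com:5050"   # hypothetical address
    query = urllib.parse.urlencode({"level": "2", "duration": "10mins"})
    urllib.request.urlopen(MASTER + "/logging/toggle?" + query)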

Separately, there is a patch in progress that addresses another allocator
performance issue; it may not help you much, but it is worth a look:
https://issues.apache.org/jira/browse/MESOS-4694

Thanks,

Guangya

On Fri, Feb 19, 2016 at 2:19 AM, Tom Arnfeld  wrote:

> Hi Ben,
>
> We've rolled that patch out (applied over 0.23.1) on our production
> cluster and have seen little change; the master is still not sending any
> offers to those frameworks. We did this upgrade online, so would there be
> any reason the fix wouldn't have helped (other than it not being the
> cause)? Would we need to restart the frameworks (so they get new IDs) to
> see the effect?
>
> It's not that the master never sends them offers; it's that it does so up
> to a certain point for different types of frameworks (all using libmesos)
> and then no more, regardless of how much free resource is available. The
> free resources are offered to some frameworks, but not all. Is there any
> way for us to do more introspection into the state of the master /
> allocator to try and debug this? Right now we're at a bit of a loss as to
> where to start diving in...
>
> Much appreciated as always,
>
> Tom.
>
> On 18 February 2016 at 10:21, Tom Arnfeld  wrote:
>
>> Hi Ben,
>>
>> I've only just seen your email! Really appreciate the reply, that's
>> certainly an interesting bug and we'll try that patch and see how we get on.
>>
>> Cheers,
>>
>> Tom.
>>
>> On 29 January 2016 at 19:54, Benjamin Mahler  wrote:
>>
>>> Hi Tom,
>>>
>>> I suspect you may be tripping the following issue:
>>> https://issues.apache.org/jira/browse/MESOS-4302
>>>
>>> Please have a read through this and see if it applies here. You may also
>>> be able to apply the fix to your cluster to see if that helps things.
>>>
>>> Ben
>>>
>>> On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld  wrote:
>>>
 Hey,

 I've noticed some interesting behaviour recently when we have lots of
 different frameworks connected to our Mesos cluster at once, all using a
 variety of different shares. Some of the frameworks don't get offered more
 resources (for long periods of time, hours even), leaving the cluster
 under-utilised.

 Here's an example state where we see this happen..

 Framework 1 - 13% (user A)
 Framework 2 - 22% (user B)
 Framework 3 - 4% (user C)
 Framework 4 - 0.5% (user C)
 Framework 5 - 1% (user C)
 Framework 6 - 1% (user C)
 Framework 7 - 1% (user C)
 Framework 8 - 0.8% (user C)
 Framework 9 - 11% (user D)
 Framework 10 - 7% (user C)
 Framework 11 - 1% (user C)
 Framework 12 - 1% (user C)
 Framework 13 - 6% (user E)

 In this example, there's another ~30% of the cluster that is
 unallocated, and it stays like this for a significant amount of time until
 something changes, perhaps another user joining and allocating the rest.
 Chunks of this spare resource are offered to some of the frameworks, but
 not all of them.

 I had always assumed that when lots of frameworks were involved, the
 frameworks that keep accepting resources indefinitely would eventually
 consume the remaining resources, since every other framework had rejected
 the offers.

 Could someone elaborate a little on how the DRF allocator / sorter handles
 this situation? Is it likely to be related to the different users being
 used? Is there a way to mitigate this?

 We're running version 0.23.1.

 Cheers,

 Tom.

>>>
>>>
>>
>


-- 
Guangya Liu (εˆ˜ε…‰δΊš)
Senior Software Engineer
DCOS and OpenStack Development
IBM Platform Computing
Systems and Technology Group


Re: [VOTE] Release Apache Mesos 0.27.1 (rc1)

2016-02-18 Thread Steven Schlansker
On Feb 18, 2016, at 2:23 PM, Michael Park  wrote:

> Hi Steven,
> 
> From the looks of it, this is something that has been broken since before
> 0.27.0.
> I would propose that this ticket be targeted for 0.28.0, and I can be the 
> shepherd for it.
> 
> How does this sound?
> 

Very reasonable, thanks! :)
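
For anyone following MESOS-4642, here is a quick way to check whether the
endpoint returns parseable JSON (the host, port, and sandbox path below are
placeholders, not taken from the ticket):

    import json
    import urllib.request

    # Hypothetical agent address and sandbox file; substitute your own.
    URL = ("http://mesos-agent.example.com:5051/files/read.json"
           "?path=/tmp/example-sandbox/stdout&offset=0&length=4096")

    body = urllib.request.urlopen(URL).read().decode("utf-8")
    try:
        json.loads(body)
        print("valid JSON")
    except ValueError as err:
        # The failure mode reported in MESOS-4642.
        print("invalid JSON from /files/read.json:", err)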

> MPark
> 
> On 16 February 2016 at 17:10, Steven Schlansker  
> wrote:
> On Feb 16, 2016, at 4:52 PM, Michael Park  wrote:
> 
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 0.27.1.
> >
> 
> I filed a bug against 0.27.0 where Mesos can emit totally invalid JSON
> in response to the /files/read.json endpoint:
> 
> https://issues.apache.org/jira/browse/MESOS-4642
> 
> I suppose it's too late at this point to get it considered for the 0.27.1 
> release?
> I would have pushed sooner but I didn't realize the next release would happen 
> so quickly :)
> 
> 
> 
> >
> > 0.27.1 includes the following:
> > 
> > * Improved `systemd` integration.
> > * Ability to disable `systemd` integration.
> >
> > * Additional performance improvements to /state endpoint.
> > * Removed duplicate "active" keys from the /state endpoint.
> >
> > The CHANGELOG for the release is available at:
> > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.1-rc1
> > 
> >
> > The candidate for Mesos 0.27.1 release is available at:
> > https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz
> >
> > The tag to be voted on is 0.27.1-rc1:
> > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.1-rc1
> >
> > The MD5 checksum of the tarball can be found at:
> > https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz.md5
> >
> > The signature of the tarball can be found at:
> > https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz.asc
> >
> > The PGP key used to sign the release is here:
> > https://dist.apache.org/repos/dist/release/mesos/KEYS
> >
> > The JAR is up in Maven in a staging repository here:
> > https://repository.apache.org/content/repositories/orgapachemesos-1102
> >
> > Please vote on releasing this package as Apache Mesos 0.27.1!
> >
> > The vote is open until Fri Feb 19 17:00:00 PST 2016 and passes if a 
> > majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Mesos 0.27.1
> > [ ] -1 Do not release this package because ...
> >
> > Thanks,
> >
> > Joris, MPark
> 
> 



Re: [VOTE] Release Apache Mesos 0.27.1 (rc1)

2016-02-18 Thread Michael Park
Hi Steven,

From the looks of it, this is something that has been broken since before
0.27.0.
I would propose that this ticket be targeted for 0.28.0, and I can be the
shepherd for it.

How does this sound?

MPark

On 16 February 2016 at 17:10, Steven Schlansker 
wrote:

> On Feb 16, 2016, at 4:52 PM, Michael Park  wrote:
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 0.27.1.
> >
>
> I filed a bug against 0.27.0 where Mesos can emit totally invalid JSON
> in response to the /files/read.json endpoint:
>
> https://issues.apache.org/jira/browse/MESOS-4642
>
> I suppose it's too late at this point to get it considered for the 0.27.1
> release?
> I would have pushed sooner but I didn't realize the next release would
> happen so quickly :)
>
>
>
> >
> > 0.27.1 includes the following:
> >
> 
> > * Improved `systemd` integration.
> > * Ability to disable `systemd` integration.
> >
> > * Additional performance improvements to /state endpoint.
> > * Removed duplicate "active" keys from the /state endpoint.
> >
> > The CHANGELOG for the release is available at:
> >
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.1-rc1
> >
> 
> >
> > The candidate for Mesos 0.27.1 release is available at:
> >
> https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz
> >
> > The tag to be voted on is 0.27.1-rc1:
> >
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.1-rc1
> >
> > The MD5 checksum of the tarball can be found at:
> >
> https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz.md5
> >
> > The signature of the tarball can be found at:
> >
> https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz.asc
> >
> > The PGP key used to sign the release is here:
> > https://dist.apache.org/repos/dist/release/mesos/KEYS
> >
> > The JAR is up in Maven in a staging repository here:
> > https://repository.apache.org/content/repositories/orgapachemesos-1102
> >
> > Please vote on releasing this package as Apache Mesos 0.27.1!
> >
> > The vote is open until Fri Feb 19 17:00:00 PST 2016 and passes if a
> majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Mesos 0.27.1
> > [ ] -1 Do not release this package because ...
> >
> > Thanks,
> >
> > Joris, MPark
>
>


Re: Mesos sometimes not allocating the entire cluster

2016-02-18 Thread Tom Arnfeld
Hi Ben,

We've rolled that patch out (applied over 0.23.1) on our production cluster
and have seen little change; the master is still not sending any offers to
those frameworks. We did this upgrade online, so would there be any reason
the fix wouldn't have helped (other than it not being the cause)? Would we
need to restart the frameworks (so they get new IDs) to see the effect?

It's not that the master never sends them offers; it's that it does so up to
a certain point for different types of frameworks (all using libmesos) and
then no more, regardless of how much free resource is available. The free
resources are offered to some frameworks, but not all. Is there any way for
us to do more introspection into the state of the master / allocator to try
and debug this? Right now we're at a bit of a loss as to where to start
diving in...

Much appreciated as always,

Tom.

On 18 February 2016 at 10:21, Tom Arnfeld  wrote:

> Hi Ben,
>
> I've only just seen your email! Really appreciate the reply, that's
> certainly an interesting bug and we'll try that patch and see how we get on.
>
> Cheers,
>
> Tom.
>
> On 29 January 2016 at 19:54, Benjamin Mahler  wrote:
>
>> Hi Tom,
>>
>> I suspect you may be tripping the following issue:
>> https://issues.apache.org/jira/browse/MESOS-4302
>>
>> Please have a read through this and see if it applies here. You may also
>> be able to apply the fix to your cluster to see if that helps things.
>>
>> Ben
>>
>> On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld  wrote:
>>
>>> Hey,
>>>
>>> I've noticed some interesting behaviour recently when we have lots of
>>> different frameworks connected to our Mesos cluster at once, all using a
>>> variety of different shares. Some of the frameworks don't get offered more
>>> resources (for long periods of time, hours even), leaving the cluster
>>> under-utilised.
>>>
>>> Here's an example state where we see this happen..
>>>
>>> Framework 1 - 13% (user A)
>>> Framework 2 - 22% (user B)
>>> Framework 3 - 4% (user C)
>>> Framework 4 - 0.5% (user C)
>>> Framework 5 - 1% (user C)
>>> Framework 6 - 1% (user C)
>>> Framework 7 - 1% (user C)
>>> Framework 8 - 0.8% (user C)
>>> Framework 9 - 11% (user D)
>>> Framework 10 - 7% (user C)
>>> Framework 11 - 1% (user C)
>>> Framework 12 - 1% (user C)
>>> Framework 13 - 6% (user E)
>>>
>>> In this example, there's another ~30% of the cluster that is
>>> unallocated, and it stays like this for a significant amount of time until
>>> something changes, perhaps another user joining and allocating the rest.
>>> Chunks of this spare resource are offered to some of the frameworks, but
>>> not all of them.
>>>
>>> I had always assumed that when lots of frameworks were involved, the
>>> frameworks that keep accepting resources indefinitely would eventually
>>> consume the remaining resources, since every other framework had rejected
>>> the offers.
>>>
>>> Could someone elaborate a little on how the DRF allocator / sorter handles
>>> this situation? Is it likely to be related to the different users being
>>> used? Is there a way to mitigate this?
>>>
>>> We're running version 0.23.1.
>>>
>>> Cheers,
>>>
>>> Tom.
>>>
>>
>>
>


Re: UnsatisfiedLinkError in mesos 0.27 build with unbundled dependencies

2016-02-18 Thread Andrii Biletskyi
Please disregard this thread; I think I was able to fix the problem.

As suggested in the Dockerfile, I had run ./configure with the
--disable-java flag. Removing it and rebuilding/reinstalling everything
fixed the link error.
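
In case it helps anyone else hitting a similar UnsatisfiedLinkError: one
way to confirm whether a given libmesos.so was built with the Java bindings
is to look for the MesosSchedulerDriver JNI symbols, e.g. (a small sketch;
the path is just the default install location):

    import subprocess

    LIB = "/usr/local/lib/libmesos.so"   # adjust if installed elsewhere

    # A libmesos.so built with Java support exports JNI symbols such as
    # Java_org_apache_mesos_MesosSchedulerDriver_initialize; a build
    # configured with --disable-java will not.
    out = subprocess.run(["nm", "-D", LIB],
                         capture_output=True, text=True).stdout
    jni = [line for line in out.splitlines() if "MesosSchedulerDriver" in line]

    if jni:
        print("Java bindings present, e.g.:", jni[0])
    else:
        print("No MesosSchedulerDriver symbols; rebuild without --disable-java")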

Thanks,
Andrii

On Thu, Feb 18, 2016 at 8:11 AM, Andrii Biletskyi <
andrii.bilets...@stealth.ly> wrote:

> Yes, it is set exactly as you pointed out: /usr/local/lib/libmesos.so
>
> Just in case adding jre info:
>
> vagrant@master:/vagrant$ echo $JAVA_HOME
> /usr/lib/jvm/java-7-openjdk-amd64/jre
>
> vagrant@master:/vagrant$ java -version
> java version "1.7.0_95"
> OpenJDK Runtime Environment (IcedTea 2.6.4) (7u95-2.6.4-0ubuntu0.14.04.1)
> OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
>
> Thanks,
> Andrii
>
> On Thu, Feb 18, 2016 at 2:46 AM, haosdent  wrote:
>
>> Hi, did you try setting
>> MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so?
>>
>> On Thu, Feb 18, 2016 at 6:34 AM, Andrii Biletskyi <
>> andrii.bilets...@stealth.ly> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to test the new networking module. In order to do that I
>>> built Mesos from tag 0.27.0
>>> with unbundled dependencies as suggested. I pretty much followed this
>>> Dockerfile
>>>
>>> https://github.com/mesosphere/docker-containers/blob/master/mesos-modules-dev/Dockerfile
>>> .
>>> I'm doing all the steps on a new vagrant ubuntu machine with nothing
>>> preinstalled.
>>>
>>> As far as I can tell, Mesos was built successfully: I didn't receive any
>>> errors, and libmesos.so was created under /usr/local/lib. I have set
>>> MESOS_NATIVE_JAVA_LIBRARY accordingly. But when I start my Java scheduler
>>> I see this error:
>>> Exception in thread "main" java.lang.UnsatisfiedLinkError:
>>> org.apache.mesos.MesosSchedulerDriver.initialize()V
>>> at org.apache.mesos.MesosSchedulerDriver.initialize(Native Method)
>>>
>>> Is this a Mesos build problem or some missing configuration?
>>>
>>> Thanks,
>>> Andrii Biletskyi
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>


Feature request: move in-flight containers w/o stopping them

2016-02-18 Thread Paul Bell
Hello All,

Has there ever been any consideration of the ability to move in-flight
containers from one Mesos host node to another?

I see this as analogous to VMware's "vMotion" facility wherein VMs can be
moved from one ESXi host to another.

I suppose something like this could be useful from a load-balancing
perspective.

Just curious whether this has ever been considered and, if it was
considered but rejected, why it was rejected.

Thanks.

-Paul


Re: Mesos sometimes not allocating the entire cluster

2016-02-18 Thread Tom Arnfeld
Hi Ben,

I've only just seen your email! Really appreciate the reply, that's
certainly an interesting bug and we'll try that patch and see how we get on.

Cheers,

Tom.

On 29 January 2016 at 19:54, Benjamin Mahler  wrote:

> Hi Tom,
>
> I suspect you may be tripping the following issue:
> https://issues.apache.org/jira/browse/MESOS-4302
>
> Please have a read through this and see if it applies here. You may also
> be able to apply the fix to your cluster to see if that helps things.
>
> Ben
>
> On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld  wrote:
>
>> Hey,
>>
>> I've noticed some interesting behaviour recently when we have lots of
>> different frameworks connected to our Mesos cluster at once, all using a
>> variety of different shares. Some of the frameworks don't get offered more
>> resources (for long periods of time, hours even), leaving the cluster
>> under-utilised.
>>
>> Here's an example state where we see this happen..
>>
>> Framework 1 - 13% (user A)
>> Framework 2 - 22% (user B)
>> Framework 3 - 4% (user C)
>> Framework 4 - 0.5% (user C)
>> Framework 5 - 1% (user C)
>> Framework 6 - 1% (user C)
>> Framework 7 - 1% (user C)
>> Framework 8 - 0.8% (user C)
>> Framework 9 - 11% (user D)
>> Framework 10 - 7% (user C)
>> Framework 11 - 1% (user C)
>> Framework 12 - 1% (user C)
>> Framework 13 - 6% (user E)
>>
>> In this example, there's another ~30% of the cluster that is unallocated,
>> and it stays like this for a significant amount of time until something
>> changes, perhaps another user joining and allocating the rest. Chunks of
>> this spare resource are offered to some of the frameworks, but not all of
>> them.
>>
>> I had always assumed that when lots of frameworks were involved, the
>> frameworks that keep accepting resources indefinitely would eventually
>> consume the remaining resources, since every other framework had rejected
>> the offers.
>>
>> Could someone elaborate a little on how the DRF allocator / sorter handles
>> this situation? Is it likely to be related to the different users being
>> used? Is there a way to mitigate this?
>>
>> We're running version 0.23.1.
>>
>> Cheers,
>>
>> Tom.
>>
>
>