Re: Feature request: move in-flight containers w/o stopping them
One problem with implementing something like vMotion for Mesos is addressing seamless movement of network connectivity as well. This effectively requires moving the container's IP address across hosts. If the container shares the host network stack, this won't be possible, since it would imply moving the host's IP address from one host to another. When a container has its own network namespace attached to the host through a bridge, moving across L2 segments might be a possibility. To move across L3 segments you will need some form of overlay (VxLAN, maybe?).

On Thu, Feb 18, 2016 at 7:34 PM, Jay Taylor wrote:

> Is this theoretically feasible with Linux checkpoint and restore, perhaps via CRIU? http://criu.org/Main_Page
>
> On Feb 18, 2016, at 4:35 AM, Paul Bell wrote:
>
>> Hello All,
>>
>> Has there ever been any consideration of the ability to move in-flight containers from one Mesos host node to another?
>>
>> I see this as analogous to VMware's "vMotion" facility wherein VMs can be moved from one ESXi host to another.
>>
>> I suppose something like this could be useful from a load-balancing perspective.
>>
>> Just curious if it's ever been considered and, if so, whether it was rejected and why.
>>
>> Thanks.
>>
>> -Paul

--
Avinash Sridharan, Mesosphere
+1 (323) 702 5245
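For readers curious what such an L3 overlay looks like in practice, a rough sketch using iproute2 follows. This is not a Mesos feature; the interface names (eth0, br0, vxlan42), the VNI, and the gratuitous-ARP step are illustrative assumptions only.

    # On each host: create a VXLAN device (VNI 42) over the physical NIC and
    # attach it to the bridge the container's veth is plugged into.
    ip link add vxlan42 type vxlan id 42 dev eth0 dstport 4789
    ip link set vxlan42 master br0
    ip link set vxlan42 up

    # With both hosts on the same overlay L2 segment, the container can keep
    # its namespaced IP after a move; re-announce it from the new host, e.g.:
    arping -U -I br0 <container-ip>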
Re: Feature request: move in-flight containers w/o stopping them
Is this theoretically feasible with Linux checkpoint and restore, perhaps via CRIU? http://criu.org/Main_Page

On Feb 18, 2016, at 4:35 AM, Paul Bell wrote:

> Hello All,
>
> Has there ever been any consideration of the ability to move in-flight containers from one Mesos host node to another?
>
> I see this as analogous to VMware's "vMotion" facility wherein VMs can be moved from one ESXi host to another.
>
> I suppose something like this could be useful from a load-balancing perspective.
>
> Just curious if it's ever been considered and, if so, whether it was rejected and why.
>
> Thanks.
>
> -Paul
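For context, CRIU's basic checkpoint/restore flow looks roughly like the sketch below. The PID and image directory are placeholders, and this only moves the process tree, not its network identity (see the network discussion elsewhere in this thread).

    # Checkpoint a running process tree into an image directory.
    criu dump -t <pid> -D /tmp/ckpt --shell-job

    # Copy /tmp/ckpt to the destination host, then restore it there.
    criu restore -D /tmp/ckpt --shell-job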
Re: Mesos sometimes not allocating the entire cluster
Hi Tom,

After the patch is applied there is no need to restart the frameworks, only the Mesos master.

One question: from your log it seems your cluster has at least 36 agents, right? I ask because if there are more frameworks than agents, frameworks with low weight may sometimes not be able to get resources.

Can you please enable GLOG_v=2 on the Mesos master for a while and put the log somewhere for us to check? (Do not leave it enabled for long, as the log gets flooded.) Those log messages may help with diagnosing your problem.

There is also another ticket that tries to fix a separate allocator performance issue; it may not help you much, but you can still take a look: https://issues.apache.org/jira/browse/MESOS-4694

Thanks,

Guangya

On Fri, Feb 19, 2016 at 2:19 AM, Tom Arnfeld wrote:

> Hi Ben,
>
> We've rolled that patch out (applied over 0.23.1) on our production cluster and have seen little change; the master is still not sending any offers to those frameworks. We did this upgrade online, so would there be any reason the fix wouldn't have helped (other than it not being the cause)? Would we need to restart the frameworks (so they get new IDs) to see the effect?
>
> It's not that the master is never sending them offers, it's that it does so up to a certain point for different types of frameworks (all using libmesos) but then no more, regardless of how much free resource is available. The free resources are offered to some frameworks, but not all. Is there any way for us to do more introspection into the state of the master / allocator to try and debug? Right now we're at a bit of a loss as to where to start diving in...
>
> Much appreciated as always,
>
> Tom.
>
> On 18 February 2016 at 10:21, Tom Arnfeld wrote:
>
>> Hi Ben,
>>
>> I've only just seen your email! Really appreciate the reply; that's certainly an interesting bug and we'll try that patch and see how we get on.
>>
>> Cheers,
>>
>> Tom.
>>
>> On 29 January 2016 at 19:54, Benjamin Mahler wrote:
>>
>>> Hi Tom,
>>>
>>> I suspect you may be tripping the following issue:
>>> https://issues.apache.org/jira/browse/MESOS-4302
>>>
>>> Please have a read through this and see if it applies here. You may also be able to apply the fix to your cluster to see if that helps things.
>>>
>>> Ben
>>>
>>> On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld wrote:
>>>
>>>> Hey,
>>>>
>>>> I've noticed some interesting behaviour recently when we have lots of different frameworks connected to our Mesos cluster at once, all using a variety of different shares. Some of the frameworks don't get offered more resources (for long periods of time, hours even), leaving the cluster under-utilised.
>>>>
>>>> Here's an example state where we see this happen:
>>>>
>>>> Framework 1 - 13% (user A)
>>>> Framework 2 - 22% (user B)
>>>> Framework 3 - 4% (user C)
>>>> Framework 4 - 0.5% (user C)
>>>> Framework 5 - 1% (user C)
>>>> Framework 6 - 1% (user C)
>>>> Framework 7 - 1% (user C)
>>>> Framework 8 - 0.8% (user C)
>>>> Framework 9 - 11% (user D)
>>>> Framework 10 - 7% (user C)
>>>> Framework 11 - 1% (user C)
>>>> Framework 12 - 1% (user C)
>>>> Framework 13 - 6% (user E)
>>>>
>>>> In this example, there's another ~30% of the cluster that is unallocated, and it stays like this for a significant amount of time until something changes (perhaps another user joins and allocates the rest). Chunks of this spare resource are offered to some of the frameworks, but not all of them.
>>>>
>>>> I had always assumed that when lots of frameworks were involved, the frameworks that keep accepting resources indefinitely would eventually consume the remaining resource, as every other framework had rejected the offers.
>>>>
>>>> Could someone elaborate a little on how the DRF allocator / sorter handles this situation? Is this likely to be related to the different users being used? Is there a way to mitigate this?
>>>>
>>>> We're running version 0.23.1.
>>>>
>>>> Cheers,
>>>>
>>>> Tom.

--
Guangya Liu
Senior Software Engineer
DCOS and OpenStack Development
IBM Platform Computing
Systems and Technology Group
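For reference, a sketch of what enabling verbose master logging for a short window might look like, as suggested above. The flag values and master address are placeholders, and the /logging/toggle alternative applies only if your build exposes that libprocess endpoint.

    # Restart the master with verbose GLOG output for a short window.
    # work_dir/zk/quorum values are placeholders for your own setup.
    GLOG_v=2 mesos-master --work_dir=/var/lib/mesos --zk=zk://<zk-hosts>/mesos --quorum=1

    # Alternatively, raise the level temporarily without a restart
    # (it reverts automatically after the given duration):
    curl 'http://<master>:5050/logging/toggle?level=2&duration=5mins'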
Re: [VOTE] Release Apache Mesos 0.27.1 (rc1)
On Feb 18, 2016, at 2:23 PM, Michael Park wrote:

> Hi Steven,
>
> From the looks of it, this was something that has been broken pre-0.27.0. I would propose that this ticket be targeted for 0.28.0, and I can be the shepherd for it.
>
> How does this sound?

Very reasonable, thanks! :)

> MPark
>
> On 16 February 2016 at 17:10, Steven Schlansker wrote:
>
>> On Feb 16, 2016, at 4:52 PM, Michael Park wrote:
>>
>>> Hi all,
>>>
>>> Please vote on releasing the following candidate as Apache Mesos 0.27.1.
>>
>> I filed a bug against 0.27.0 where Mesos can emit totally invalid JSON in response to the /files/read.json endpoint:
>> https://issues.apache.org/jira/browse/MESOS-4642
>>
>> I suppose it's too late at this point to get it considered for the 0.27.1 release? I would have pushed sooner but I didn't realize the next release would happen so quickly :)
>>
>>> 0.27.1 includes the following:
>>>
>>> * Improved `systemd` integration.
>>> * Ability to disable `systemd` integration.
>>>
>>> * Additional performance improvements to /state endpoint.
>>> * Removed duplicate "active" keys from the /state endpoint.
>>>
>>> The CHANGELOG for the release is available at:
>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.1-rc1
>>>
>>> The candidate for Mesos 0.27.1 release is available at:
>>> https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz
>>>
>>> The tag to be voted on is 0.27.1-rc1:
>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.1-rc1
>>>
>>> The MD5 checksum of the tarball can be found at:
>>> https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz.md5
>>>
>>> The signature of the tarball can be found at:
>>> https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz.asc
>>>
>>> The PGP key used to sign the release is here:
>>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>>
>>> The JAR is up in Maven in a staging repository here:
>>> https://repository.apache.org/content/repositories/orgapachemesos-1102
>>>
>>> Please vote on releasing this package as Apache Mesos 0.27.1!
>>>
>>> The vote is open until Fri Feb 19 17:00:00 PST 2016 and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Mesos 0.27.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> Thanks,
>>>
>>> Joris, MPark
Re: [VOTE] Release Apache Mesos 0.27.1 (rc1)
Hi Steven,

From the looks of it, this was something that has been broken pre-0.27.0. I would propose that this ticket be targeted for 0.28.0, and I can be the shepherd for it.

How does this sound?

MPark

On 16 February 2016 at 17:10, Steven Schlansker wrote:

> On Feb 16, 2016, at 4:52 PM, Michael Park wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 0.27.1.
>
> I filed a bug against 0.27.0 where Mesos can emit totally invalid JSON in response to the /files/read.json endpoint:
> https://issues.apache.org/jira/browse/MESOS-4642
>
> I suppose it's too late at this point to get it considered for the 0.27.1 release? I would have pushed sooner but I didn't realize the next release would happen so quickly :)
>
>> 0.27.1 includes the following:
>>
>> * Improved `systemd` integration.
>> * Ability to disable `systemd` integration.
>>
>> * Additional performance improvements to /state endpoint.
>> * Removed duplicate "active" keys from the /state endpoint.
>>
>> The CHANGELOG for the release is available at:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.1-rc1
>>
>> The candidate for Mesos 0.27.1 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz
>>
>> The tag to be voted on is 0.27.1-rc1:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.1-rc1
>>
>> The MD5 checksum of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz.md5
>>
>> The signature of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is up in Maven in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1102
>>
>> Please vote on releasing this package as Apache Mesos 0.27.1!
>>
>> The vote is open until Fri Feb 19 17:00:00 PST 2016 and passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 0.27.1
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>>
>> Joris, MPark
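For anyone checking the candidate before casting a vote, the verification steps usually look something like the sketch below, using the URLs listed in the vote email (gpg and standard coreutils assumed to be installed).

    # Fetch the tarball, checksum, and signature listed above.
    curl -O https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz
    curl -O https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz.md5
    curl -O https://dist.apache.org/repos/dist/dev/mesos/0.27.1-rc1/mesos-0.27.1.tar.gz.asc

    # Verify the checksum and the PGP signature against the KEYS file.
    md5sum mesos-0.27.1.tar.gz        # compare with the .md5 contents
    curl https://dist.apache.org/repos/dist/release/mesos/KEYS | gpg --import
    gpg --verify mesos-0.27.1.tar.gz.asc mesos-0.27.1.tar.gz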
Re: Mesos sometimes not allocating the entire cluster
Hi Ben,

We've rolled that patch out (applied over 0.23.1) on our production cluster and have seen little change; the master is still not sending any offers to those frameworks. We did this upgrade online, so would there be any reason the fix wouldn't have helped (other than it not being the cause)? Would we need to restart the frameworks (so they get new IDs) to see the effect?

It's not that the master is never sending them offers, it's that it does so up to a certain point for different types of frameworks (all using libmesos) but then no more, regardless of how much free resource is available. The free resources are offered to some frameworks, but not all. Is there any way for us to do more introspection into the state of the master / allocator to try and debug? Right now we're at a bit of a loss as to where to start diving in...

Much appreciated as always,

Tom.

On 18 February 2016 at 10:21, Tom Arnfeld wrote:

> Hi Ben,
>
> I've only just seen your email! Really appreciate the reply; that's certainly an interesting bug and we'll try that patch and see how we get on.
>
> Cheers,
>
> Tom.
>
> On 29 January 2016 at 19:54, Benjamin Mahler wrote:
>
>> Hi Tom,
>>
>> I suspect you may be tripping the following issue:
>> https://issues.apache.org/jira/browse/MESOS-4302
>>
>> Please have a read through this and see if it applies here. You may also be able to apply the fix to your cluster to see if that helps things.
>>
>> Ben
>>
>> On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld wrote:
>>
>>> Hey,
>>>
>>> I've noticed some interesting behaviour recently when we have lots of different frameworks connected to our Mesos cluster at once, all using a variety of different shares. Some of the frameworks don't get offered more resources (for long periods of time, hours even), leaving the cluster under-utilised.
>>>
>>> Here's an example state where we see this happen:
>>>
>>> Framework 1 - 13% (user A)
>>> Framework 2 - 22% (user B)
>>> Framework 3 - 4% (user C)
>>> Framework 4 - 0.5% (user C)
>>> Framework 5 - 1% (user C)
>>> Framework 6 - 1% (user C)
>>> Framework 7 - 1% (user C)
>>> Framework 8 - 0.8% (user C)
>>> Framework 9 - 11% (user D)
>>> Framework 10 - 7% (user C)
>>> Framework 11 - 1% (user C)
>>> Framework 12 - 1% (user C)
>>> Framework 13 - 6% (user E)
>>>
>>> In this example, there's another ~30% of the cluster that is unallocated, and it stays like this for a significant amount of time until something changes (perhaps another user joins and allocates the rest). Chunks of this spare resource are offered to some of the frameworks, but not all of them.
>>>
>>> I had always assumed that when lots of frameworks were involved, the frameworks that keep accepting resources indefinitely would eventually consume the remaining resource, as every other framework had rejected the offers.
>>>
>>> Could someone elaborate a little on how the DRF allocator / sorter handles this situation? Is this likely to be related to the different users being used? Is there a way to mitigate this?
>>>
>>> We're running version 0.23.1.
>>>
>>> Cheers,
>>>
>>> Tom.
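On the introspection question, one low-tech starting point is the master's HTTP endpoints; a sketch is below. The master address and the jq filter are illustrative, and on 0.23.x the endpoint is /state.json (newer releases also expose /state).

    # Per-framework allocation as the master sees it (id, name, resources in use).
    curl -s http://<master>:5050/state.json | jq '.frameworks[] | {id, name, used_resources}'

    # Counters from the allocator and other components, if exposed by your version.
    curl -s http://<master>:5050/metrics/snapshot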
Re: UnsatisfiedLinkError in mesos 0.27 build with unbundled dependencies
Please disregard this thread; I think I was able to fix the problem. As suggested in the Dockerfile, I had run ./configure with the --disable-java flag. Removing it and reinstalling everything fixed the link error.

Thanks,
Andrii

On Thu, Feb 18, 2016 at 8:11 AM, Andrii Biletskyi <andrii.bilets...@stealth.ly> wrote:

> Yes, it is set exactly as you pointed out - /usr/local/lib/libmesos.so
>
> Just in case, adding JRE info:
>
> vagrant@master:/vagrant$ echo $JAVA_HOME
> /usr/lib/jvm/java-7-openjdk-amd64/jre
>
> vagrant@master:/vagrant$ java -version
> java version "1.7.0_95"
> OpenJDK Runtime Environment (IcedTea 2.6.4) (7u95-2.6.4-0ubuntu0.14.04.1)
> OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
>
> Thanks,
> Andrii
>
> On Thu, Feb 18, 2016 at 2:46 AM, haosdent wrote:
>
>> Hi, do you try to set MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so ?
>>
>> On Thu, Feb 18, 2016 at 6:34 AM, Andrii Biletskyi <andrii.bilets...@stealth.ly> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to test the new networking module. In order to do that I built Mesos from tag 0.27.0 with unbundled dependencies, as suggested. I pretty much followed this Dockerfile:
>>> https://github.com/mesosphere/docker-containers/blob/master/mesos-modules-dev/Dockerfile
>>> I'm doing all the steps on a new vagrant Ubuntu machine with nothing preinstalled.
>>>
>>> As far as I can tell Mesos was built successfully - I didn't receive errors, and libmesos.so was created under /usr/local/lib. I have specified MESOS_NATIVE_JAVA_LIBRARY accordingly. But when I start my Java scheduler I see this error:
>>>
>>> Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.mesos.MesosSchedulerDriver.initialize()V
>>>     at org.apache.mesos.MesosSchedulerDriver.initialize(Native Method)
>>>
>>> Is it a Mesos build problem or some missing configuration?
>>>
>>> Thanks,
>>> Andrii Biletskyi
>>
>> --
>> Best Regards,
>> Haosdent Huang
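For anyone hitting a similar UnsatisfiedLinkError, a quick sanity check is to confirm the library actually contains the JNI bindings; the path below is the one from this thread, and the commands assume standard binutils.

    # The JNI symbols are only present when Java support is enabled at build
    # time (i.e. without --disable-java). If this prints nothing, the library
    # was built without the Java bindings.
    nm -D /usr/local/lib/libmesos.so | grep Java_org_apache_mesos

    # Point the JVM at the library before starting the scheduler.
    export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so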
Feature request: move in-flight containers w/o stopping them
Hello All,

Has there ever been any consideration of the ability to move in-flight containers from one Mesos host node to another?

I see this as analogous to VMware's "vMotion" facility wherein VMs can be moved from one ESXi host to another.

I suppose something like this could be useful from a load-balancing perspective.

Just curious if it's ever been considered and, if so, whether it was rejected and why.

Thanks.

-Paul
Re: Mesos sometimes not allocating the entire cluster
Hi Ben,

I've only just seen your email! Really appreciate the reply; that's certainly an interesting bug and we'll try that patch and see how we get on.

Cheers,

Tom.

On 29 January 2016 at 19:54, Benjamin Mahler wrote:

> Hi Tom,
>
> I suspect you may be tripping the following issue:
> https://issues.apache.org/jira/browse/MESOS-4302
>
> Please have a read through this and see if it applies here. You may also be able to apply the fix to your cluster to see if that helps things.
>
> Ben
>
> On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld wrote:
>
>> Hey,
>>
>> I've noticed some interesting behaviour recently when we have lots of different frameworks connected to our Mesos cluster at once, all using a variety of different shares. Some of the frameworks don't get offered more resources (for long periods of time, hours even), leaving the cluster under-utilised.
>>
>> Here's an example state where we see this happen:
>>
>> Framework 1 - 13% (user A)
>> Framework 2 - 22% (user B)
>> Framework 3 - 4% (user C)
>> Framework 4 - 0.5% (user C)
>> Framework 5 - 1% (user C)
>> Framework 6 - 1% (user C)
>> Framework 7 - 1% (user C)
>> Framework 8 - 0.8% (user C)
>> Framework 9 - 11% (user D)
>> Framework 10 - 7% (user C)
>> Framework 11 - 1% (user C)
>> Framework 12 - 1% (user C)
>> Framework 13 - 6% (user E)
>>
>> In this example, there's another ~30% of the cluster that is unallocated, and it stays like this for a significant amount of time until something changes (perhaps another user joins and allocates the rest). Chunks of this spare resource are offered to some of the frameworks, but not all of them.
>>
>> I had always assumed that when lots of frameworks were involved, the frameworks that keep accepting resources indefinitely would eventually consume the remaining resource, as every other framework had rejected the offers.
>>
>> Could someone elaborate a little on how the DRF allocator / sorter handles this situation? Is this likely to be related to the different users being used? Is there a way to mitigate this?
>>
>> We're running version 0.23.1.
>>
>> Cheers,
>>
>> Tom.