Re: Design doc: Agent draining and deprecation of maintenance primitives

2019-05-30 Thread Joseph Wu
As far as I can tell, the document is public.

On Thu, May 30, 2019 at 12:22 AM Marc Roos  wrote:

>
> Is the doc not public?
>
>
> -Original Message-
> From: Joseph Wu [mailto:jos...@mesosphere.io]
> Sent: donderdag 30 mei 2019 2:07
> To: dev; user
> Subject: Design doc: Agent draining and deprecation of maintenance
> primitives


Design doc: Agent draining and deprecation of maintenance primitives

2019-05-29 Thread Joseph Wu
Hi all,

A few years back, we added some constructs called maintenance primitives to
Mesos.  This feature was meant to allow operators and frameworks to
cooperate in draining tasks off nodes scheduled for maintenance.  As far as
we've observed since, this feature never achieved enough adoption to be
useful for operators.

As such, we are proposing a more opinionated approach for draining tasks.
The goal is to have Mesos perform draining in lieu of frameworks,
minimizing or eliminating the need to change frameworks to account for
draining.  We will also be simplifying the operator workflow, which would
require only a single call (holding an AgentID) to start draining, and a
single call to bring the agent back into the cluster.

Due to how closely this proposed feature overlaps with maintenance
primitives, we will be deprecating maintenance primitives upon
implementation of agent draining.

If interested, please take a look at the design document:
https://docs.google.com/document/d/1w3O80NFE6m52XNMv7EdXSO-1NebEs8opA8VZPG1tW0Y/
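For illustration, the proposed single-call workflow might look like the sketch below. This is an assumption based only on the summary above, not a finalized API: the call names (`DRAIN_AGENT`, `REACTIVATE_AGENT`) and payload shape are placeholders pending the design doc.

```python
def drain_agent_call(agent_id):
    """Build a hypothetical operator-API payload to start draining one agent.

    Per the proposal, starting a drain needs only the AgentID; Mesos then
    performs the draining in lieu of frameworks."""
    return {
        "type": "DRAIN_AGENT",
        "drain_agent": {"agent_id": {"value": agent_id}},
    }

def reactivate_agent_call(agent_id):
    """Build the matching payload to bring the agent back into the cluster."""
    return {
        "type": "REACTIVATE_AGENT",
        "reactivate_agent": {"agent_id": {"value": agent_id}},
    }

# Either payload would be POSTed as JSON to the master's /api/v1 endpoint.
```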


Re: [VOTE] Release Apache Mesos 1.8.0 (rc2)

2019-04-23 Thread Joseph Wu
-1 (binding)

We found a serious bug when upgrading from 1.7.x to 1.8.x, which prevents
agents from reregistering after upgrading the masters:
https://issues.apache.org/jira/browse/MESOS-9740

On Tue, Apr 23, 2019 at 8:27 AM Andrei Budnik  wrote:

> +1
>
> sudo make -j16 distcheck
> DISTCHECK_CONFIGURE_FLAGS='--disable-libtool-wrappers
> --disable-parallel-test-execution --enable-seccomp-isolator
> --enable-launcher-sealing'
> on Fedora 25
>
> I gave +1, but some of the recently added tests are failing:
> [  FAILED  ] VolumeGidManagerTest.ROOT_UNPRIVILEGED_USER_SlaveReboot
> [  FAILED  ] CniIsolatorTest.VETH_VerifyResourceStatistics
> [  FAILED  ] DockerVolumeIsolatorTest.ROOT_EmptyCheckpointFileSlaveRecovery
>
>
> On Thu, Apr 18, 2019 at 3:00 PM Benno Evers  wrote:
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.8.0.
> >
> >
> > 1.8.0 includes the following:
> >
> >
> 
> >  * Greatly reduced allocator cycle time.
> >  * Operation feedback for v1 schedulers.
> >  * Per-framework minimum allocatable resources.
> >  * New CLI subcommands `task attach` and `task exec`.
> >  * New `linux/seccomp` isolator.
> >  * Support for Docker v2 Schema2 manifest format.
> >  * XFS quota for persistent volumes.
> >  * **Experimental** Support for the new CSI v1 API.
> >
> > In addition, 1.8.0-rc2 includes the following changes:
> >
> >
> >  * Docker manifest v2s2 config with image GC.
> >  * Expanded `highlights` section in the CHANGELOG.
> >
> >
> > The CHANGELOG for the release is available at:
> >
> >
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.8.0-rc2
> >
> >
> 
> >
> > The candidate for Mesos 1.8.0 release is available at:
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc2/mesos-1.8.0.tar.gz
> >
> > The tag to be voted on is 1.8.0-rc2:
> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.8.0-rc2
> >
> > The SHA512 checksum of the tarball can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc2/mesos-1.8.0.tar.gz.sha512
> >
> > The signature of the tarball can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.8.0-rc2/mesos-1.8.0.tar.gz.asc
> >
> > The PGP key used to sign the release is here:
> > https://dist.apache.org/repos/dist/release/mesos/KEYS
> >
> > The JAR is in a staging repository here:
> > https://repository.apache.org/content/repositories/orgapachemesos-1252
> >
> > Please vote on releasing this package as Apache Mesos 1.8.0!
> >
> > The vote is open until Wednesday, April 24th and passes if a majority of
> > at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Mesos 1.8.0
> > [ ] -1 Do not release this package because ...
> >
> > Thanks,
> > Benno and Joseph
> >
>


Re: Failed to accept socket: Failed accept: connection error: error:1407609C:SSL routines:SSL23_GET_CLIENT_HELLO:http request

2019-02-20 Thread Joseph Wu
The "SSL routines:SSL23_GET_CLIENT_HELLO:http request" is OpenSSL's cryptic
way of saying the client is using HTTP to talk to an HTTPS server.  Since
you've disabled LIBPROCESS_SSL_SUPPORT_DOWNGRADE, the error should be
expected.
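For context, OpenSSL arrives at the "http request" diagnosis by sniffing the first bytes on the socket: a TLS handshake begins with a record-type byte, while plaintext HTTP begins with an ASCII method name. A rough Python sketch of that heuristic (illustrative only, not OpenSSL's actual code):

```python
def classify_first_bytes(data: bytes) -> str:
    """Guess what a client sent from its first bytes on the wire.

    Sketch of the idea behind OpenSSL's
    'SSL23_GET_CLIENT_HELLO:http request' error, not its real code."""
    if data[:1] in (b"\x16", b"\x80"):  # TLS record header / old SSLv2 header
        return "tls-handshake"
    if any(data.startswith(m) for m in (b"GET ", b"POST", b"HEAD", b"PUT ")):
        return "http-request"           # plaintext HTTP sent to a TLS port
    return "unknown"
```

A browser hitting http://m01.local:5050 would fall into the "http-request" branch, which is exactly the warning logged above when downgrade support is disabled.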

On Wed, Feb 20, 2019 at 2:06 PM Marc Roos  wrote:

>
>
> Why am I getting these when I connect with a browser to port 5050? How
> do I resolve this? It looks like a bug or so. 192.168.10.151 has the
> same cert as m1.local, and the cert even has common name m01.local. But
> the javascript is reporting "An error occurred: SEC_ERROR_UNKNOWN_ISSUER"
>
>
> W0220 22:42:23.678849 21804 process.cpp:902] Failed to accept socket:
> Failed accept: connection error: error:1407609C:SSL
> routines:SSL23_GET_CLIENT_HELLO:http request
> W0220 22:43:23.678017 21804 process.cpp:902] Failed to accept socket:
> Failed accept: connection error: error:1407609C:SSL
> routines:SSL23_GET_CLIENT_HELLO:http request
> W0220 22:44:23.678727 21804 process.cpp:902] Failed to accept socket:
> Failed accept: connection error: error:1407609C:SSL
> routines:SSL23_GET_CLIENT_HELLO:http request
> W0220 22:45:23.677820 21804 process.cpp:902] Failed to accept socket:
> Failed accept: connection error: error:1407609C:SSL
> routines:SSL23_GET_CLIENT_HELLO:http request
> W0220 22:46:23.677990 21804 process.cpp:902] Failed to accept socket:
> Failed accept: connection error: error:1407609C:SSL
> routines:SSL23_GET_CLIENT_HELLO:http request
> W0220 22:47:23.678474 21804 process.cpp:902] Failed to accept socket:
> Failed accept: connection error: error:1407609C:SSL
> routines:SSL23_GET_CLIENT_HELLO:http request
> W0220 22:48:23.677384 21804 process.cpp:902] Failed to accept socket:
> Failed accept: connection error: error:1407609C:SSL
> routines:SSL23_GET_CLIENT_HELLO:http request
>
> LIBPROCESS_SSL_ENABLED=1
> LIBPROCESS_SSL_SUPPORT_DOWNGRADE=0
> LIBPROCESS_SSL_KEY_FILE=/etc/pki/tls/private/m01.local.key
> LIBPROCESS_SSL_CERT_FILE=/etc/pki/tls/certs/m01.local.crt
> LIBPROCESS_SSL_VERIFY_CERT=0
> LIBPROCESS_SSL_CA_FILE="/etc/pki/ca-trust/source/ca-own.crt"
>
>
> mesosphere-zookeeper-3.4.6-0.1.20141204175332.centos7.x86_64
> mesos-1.7.1-2.0.1.x86_64
>


Re: How to parse -v docker flags

2019-02-13 Thread Joseph Wu
Since you are using the Mesos containerizer, docker will not be part of the
equation (even if you are using a docker image).  By the looks of it, you
are trying to mount a specific volume ("data") provided by the docker
volume driver.

In which case, you'll want to take a look at this documentation:
http://mesos.apache.org/documentation/latest/isolators/docker-volume/
That will tell you how to setup the appropriate isolator and the relevant
way to specify these volumes.
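As a rough illustration of the mapping (a sketch based on the rexray/dvdi config in the question, not official Marathon or Mesos behavior): a docker-style `-v <name>:<mount path>` pair does not belong in `args` at all; it corresponds to an entry in the app's `volumes` list, where `containerPath` is the mount path inside the container (not the volume name).

```python
def docker_volume_to_marathon(spec, provider="dvdi", driver="rexray"):
    """Translate a docker-style '-v <name>:<mount path>' spec into the
    external-volume stanza used by the Mesos docker/volume isolator.
    Illustrative only; provider/driver defaults mirror the question."""
    name, container_path = spec.split(":", 1)
    return {
        "containerPath": container_path,  # mount point inside the container
        "external": {
            "name": name,                 # volume name known to the driver
            "provider": provider,
            "options": {"dvdi/driver": driver},
        },
        "mode": "RW",
    }
```

Applied to `-v data:/var/lib/influxdb`, this suggests the app definition below should use `"containerPath": "/var/lib/influxdb"` rather than `"containerPath": "data"`.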

On Wed, Feb 13, 2019 at 12:49 PM Marc Roos  wrote:

>
> How should I pass the -v flags [0] for docker (using the Mesos
> containerizer) correctly? I tried several variants.
>
> "args": [ "-v data:/var/lib/influxdb" ],
> "argv": [ "-v data:/var/lib/influxdb" ],
> "argv": [ "data:/var/lib/influxdb" ],
> "argv": [ "influxdb:/data" ],
>
> But all result in:
> run: create server: mkdir all: mkdir /var/lib/influxdb/meta: permission
> denied
>
>
> [0] https://hub.docker.com/_/influxdb
>
>
> {
>   "id": "influxdb",
>   "user": "influxdb",
>   "cmd": null,
>   "cpus": 1,
>   "mem": 512,
>   "instances": 1,
>   "acceptedResourceRoles": ["*"],
>   "residency": { "taskLostBehavior": "WAIT_FOREVER" },
>   "upgradeStrategy": {"minimumHealthCapacity": 0, "maximumOverCapacity":
> 0 },
>   "container": {
> "type": "MESOS",
> "docker": {
>   "image": "influxdb",
>   "credential": null,
>   "forcePullImage": false
> },
> "volumes": [
>   {
> "containerPath": "data",
> "external": {
>   "name": "app-influxdb",
>   "provider": "dvdi",
>   "options": { "dvdi/driver": "rexray" }
> },
> "mode": "RW"
>   }
> ]
>   },
>   "argv": [ "-v data:/var/lib/influxdb" ],
>   "env": {
> "INFLUXDB_REPORTING_DISABLED": "true",
> "INFLUXDB_HTTP_AUTH_ENABLED": "true",
> "INFLUXDB_ADMIN_ENABLED": "true",
> "INFLUXDB_ADMIN_USER": "admin",
> "INFLUXDB_ADMIN_PASSWORD": "example"
>   }
> }
>


Re: Check failed: reservationScalarQuantities.contains(role)

2019-02-05 Thread Joseph Wu
From the stack, it looks like the master is attempting to remove an agent
from the master's in-memory state.  In the master's logs you should find a
line shortly before the exit, like:

 master.cpp:] Removed agent : 

The agent's ID should at least give you some pointer to which agent is
causing the problem.  Feel free to create a JIRA (
https://issues.apache.org/jira/) with any information you can glean.  This
particular type of failure, a CHECK-failure, means some invariant has been
violated and usually means we missed a corner case.

On Tue, Feb 5, 2019 at 12:04 PM Jeff Pollard  wrote:

> We recently upgraded our Mesos  cluster from version 1.3 to 1.5, and since
> then have been getting periodic master crashes due to this error:
>
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118
> 8434 hierarchical.cpp:2630] Check failed:
> reservationScalarQuantities.contains(role)
>
> Full stack trace is at the end of this email. When the master fails, we
> automatically restart it and it rejoins the cluster just fine. I did some
> initial searching and was unable to find any existing bug reports or other
> people experiencing this issue. We run a cluster of 3 masters, and see
> crashes on all 3 instances.
>
> Hope to get some guidance on what is going on and/or where to start
> looking for more information.
>
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e9170a7d  google::LogMessage::Fail()
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e9172830  google::LogMessage::SendToLog()
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e9170663  google::LogMessage::Flush()
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e9173259  google::LogMessageFatal::~LogMessageFatal()
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e8443cbd
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e8448fcd
> mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e90c4f11  process::ProcessBase::consume()
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e90dea4a  process::ProcessManager::resume()
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e90e25d6
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e6700c80  (unknown)
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e5f136ba  start_thread
> Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: @
>  0x7f87e5c4941d  (unknown)
>


Re: [API WG] Proposals for dealing with master subscriber leaks.

2018-11-14 Thread Joseph Wu
Heartbeats are currently the least-liked solution, for precisely the reason
BenM stated.  Clients of the API, such as the maintainers of the DC/OS UI,
would also like to avoid making more connections than necessary and/or
keeping additional state between connections.


Currently, I am leaning towards keeping subscribers in a circular buffer.
This solution has a minimal code footprint and requires no client-side
changes beyond heavily incentivizing retry logic (which we already expect
in most cases).
One potential downside is having more subscribers than the configured
maximum (a master flag).  In this case, each new client would kick out the
first few, which would then retry and kick out the next few, and so on.
Each retry is equivalent to a GET /master/state, and the extra calls would
essentially erase the performance gains we get from streaming the events.

Nevertheless, I think a reasonably high default would have minimal impact
on both master performance and client connectivity.  The code for this
proposal can be found here:
https://reviews.apache.org/r/69307/  (Just one review)
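The circular-buffer behavior can be sketched with a bounded buffer (illustrative only; the actual change is in the master's C++ code, linked above):

```python
from collections import deque

class SubscriberBuffer:
    """Sketch of the proposal: hold at most `capacity` live subscribers,
    and admitting one past capacity disconnects (evicts) the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.subscribers = deque()

    def subscribe(self, conn):
        """Admit a new subscriber; return the evicted one, if any."""
        evicted = None
        if len(self.subscribers) == self.capacity:
            evicted = self.subscribers.popleft()  # oldest gets disconnected
        self.subscribers.append(conn)
        return evicted
```

With capacity 2 and three clients that immediately retry after disconnection, each new subscription evicts the oldest, which illustrates the retry churn described above when the configured maximum is too low.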

On Sun, Nov 11, 2018 at 9:22 AM Benjamin Mahler  wrote:

> >    - We can add heartbeats to the SUBSCRIBE call.  This would need to be
> >    part of a separate operator Call, because one platform (browsers) that
> >    might subscribe to the master does not support two-way streaming.
>
> This doesn't make sense to me, the heartbeats should still be part of the
> same connection (request and response are infinite and heartbeating) by
> default. Splitting into a separate call is messy and shouldn't be what we
> force everyone to do, it should only be done in cases that it's impossible
> to use a single connection (e.g. browsers).
>
> On Sat, Nov 10, 2018 at 12:03 AM Joseph Wu  wrote:


[API WG] Proposals for dealing with master subscriber leaks.

2018-11-09 Thread Joseph Wu
Hi all,

During some internal scale testing, we noticed that, when Mesos streaming
endpoints are accessed via certain proxies (or load balancers), the proxies
might not close connections after they are complete.  For the Mesos master,
which only has the /api/v1 SUBSCRIBE streaming endpoint, this can generate
unnecessary authorization requests and affects performance.

We are considering a few potential solutions:

   - We can add heartbeats to the SUBSCRIBE call.  This would need to be
   part of a separate operator Call, because one platform (browsers) that
   might subscribe to the master does not support two-way streaming.
   - We can add (optional) arguments to the SUBSCRIBE call, which tells the
   master to disconnect it after a while.  And the client would have to remake
   the connection every so often.
   - We can change the master to hold subscribers in a circular buffer, and
   disconnect the oldest ones if there are too many connections.

We're tracking progress on this issue here:
https://issues.apache.org/jira/browse/MESOS-9258
Some prototypes of the code changes involved are also linked in the JIRA.

Please chime in if you have any suggestions or if any of these options
would be undesirable/bad,
~Joseph


Re: Install target missing for CMake builds

2018-09-06 Thread Joseph Wu
We have not (yet) implemented the install target for the CMake build.  The
target does exist if you use the automake build however.

On Thu, Sep 6, 2018 at 9:36 AM, Junker, Gregory 
wrote:

> Hi
>
> I am trying to build Mesos on Linux (Ubuntu 18.04) using CMake (Makefile
> generator) and following the instructions on the site (
> http://mesos.apache.org/documentation/latest/cmake/), but I am not seeing
> an "install" target generated. This is true for both master as well as
> 1.6.1. Additionally, and possibly related, building the "package" target
> creates a 54-byte .Z file (empty, in other words). The artifacts are
> compiled and linked and all exist in the "src" directory under my build
> root, but both "cmake --build . --target install" and "make install"
> complain that there is no "install" target. "cmake --build . --target help"
> verifies this.
>
> Am I missing something obvious? I can't be the only one having this
> problem?
>
> Greg
>


Re: Resource offers - DRF - Mesos

2018-05-22 Thread Joseph Wu
1) DRF is based on the _current_ allocation of resources (from the master's
perspective) rather than a historical allocation of resources.

2) So when a new cluster is started, all frameworks will have a current
allocation of 0.  And assuming all else (like quotas, roles, and weights)
are equivalent (or not set to anything), then your 2 frameworks would
receive roughly equal shares of offers.

3) As of right now, there is no way for a framework to directly influence
the number of offers received in a single call.  The best approach to
getting offers on multiple machines is to hold onto the offers (i.e.
neither accepting nor declining them) until your necessary conditions have
been met.
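To illustrate point 1, here is a minimal sketch of the DRF idea (not the allocator's actual code; the real allocator also accounts for roles, weights, quota, and much more):

```python
def dominant_share(allocated, total):
    """A framework's dominant share: the largest fraction it currently
    holds of any single resource (cpus, mem, ...)."""
    return max(allocated.get(r, 0.0) / total[r] for r in total)

def next_framework(allocations, total):
    """DRF offers resources to the framework with the lowest dominant
    share, computed from current (not historical) allocations."""
    return min(allocations, key=lambda f: dominant_share(allocations[f], total))
```

On a freshly started cluster (point 2), every framework's dominant share is 0, so the tie-broken order is effectively arbitrary and offers end up roughly evenly split.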

On Tue, May 22, 2018 at 2:56 AM, Thodoris Zois  wrote:

> Hello list,
>
> I have some questions about resource offers for Mesos and I am
> experiencing some problems that I hope somebody will be able to help.
>
> 1) The allocation module of Mesos master uses DRF (according to
> previous allocation history) and decides which framework will get an
> offer, and how many resources will be offered. Is this right?
>
> 2) Assume that a Mesos cluster starts for the very first time and 2
> frameworks join. None of the frameworks has a job to submit, they just
> wait and get offers. What is the policy to send resource offers since
> master does not know anything about previous allocations?
>
> 3) I got a Mesos cluster with 5 machines and 1 framework only. Is there
> any way to force Mesos to send all five resource offers to my framework
> every time? I have seen that when the framework registers for the first
> time, it gets a list of offers that includes all 5 machines. However, if
> it does not accept them, the next round of offers contains only 2 or even
> 1 machine, depending on when the framework declined.
>
> Thank you very much for your response,
> Any help is appreciated!
>
> - Thodoris
>


Re: java driver/shutdown call

2018-01-16 Thread Joseph Wu
If a framework launches tasks, then it will use an executor.  Mesos
provides a "default" executor if the framework doesn't explicitly specify
an executor.  (And the Shutdown call will work with that default executor.)

On Tue, Jan 16, 2018 at 4:49 PM, Mohit Jaggi  wrote:

> Gotcha. Another question: if a framework doesn't use executors, can it
> still use the SHUTDOWN call?
>
> On Fri, Jan 12, 2018 at 2:37 PM, Anand Mazumdar 
> wrote:
>
>> Yes; It's a newer interface that still allows you to switch between the
>> v1 (new) and the old API.
>>
>> -anand
>>
>> On Fri, Jan 12, 2018 at 3:28 PM, Mohit Jaggi 
>> wrote:
>>
>>> Are you suggesting
>>>
>>> *send(new Call(METHOD, Param1, ...)) *
>>>
>>> instead of
>>>
>>> *driver.method(Param1, )*
>>>
>>> *?*
>>>
>>> On Fri, Jan 12, 2018 at 10:59 AM, Anand Mazumdar <
>>> mazumdar.an...@gmail.com> wrote:
>>>
 Mohit,

 You can use the V1Mesos class that uses the v1 API internally allowing
 you to send the 'SHUTDOWN' call. We also have a V0Mesos class that uses the
 old scheduler driver internally.

 -anand

 On Wed, Jan 10, 2018 at 2:53 PM, Mohit Jaggi 
 wrote:

> Thanks Vinod. Is there a V1SchedulerDriver.java file? I see
> https://github.com/apache/mesos/tree/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/java/src/org/apache/mesos/v1/scheduler
> but it does not have a V1 driver.
>
> On Fri, Jan 5, 2018 at 3:59 PM, Vinod Kone 
> wrote:
>
>> That's right. It is only available for v1 schedulers.
>>
>> On Fri, Jan 5, 2018 at 3:38 PM, Mohit Jaggi 
>> wrote:
>>
>>> Folks,
>>> I am trying to change Apache Aurora's code to call SHUTDOWN instead
>>> of KILL. However, it seems that the SchedulerDriver class in Mesos does 
>>> not
>>> have a shutdownExecutor() call.
>>>
>>> https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/java/src/org/apache/mesos/SchedulerDriver.java
>>>
>>> Mohit.
>>>
>>
>>
>


 --
 Anand Mazumdar

>>>
>>>
>>
>>
>> --
>> Anand Mazumdar
>>
>
>


Welcome Andrew Schwartzmeyer as a new committer and PMC member!

2017-11-27 Thread Joseph Wu
Hi devs & users,

I'm happy to announce that Andrew Schwartzmeyer has become a new committer
and member of the PMC for the Apache Mesos project.  Please join me in
congratulating him!

Andrew has been an active contributor to Mesos for about a year.  He has
been the primary contributor behind our efforts to change our default build
system to CMake and to port Mesos onto Windows.

Here is his committer candidate checklist for your perusal:
https://docs.google.com/document/d/1MfJRYbxxoX2-A-g8NEeryUdUi7FvIoNcdUbDbGguH1c/

Congrats Andy!
~Joseph


Re: Mesos containerizer with marathon

2017-10-13 Thread Joseph Wu
A quick modification to try...

Replace the container type:

    "container": {
      "type": "DOCKER",

With this:

    "container": {
      "type": "MESOS",

That will tell Marathon to use the Mesos containerizer, rather than the
Docker containerizer.

On Fri, Oct 13, 2017 at 2:38 PM, Marc Roos  wrote:

>
> I was watching this video https://youtu.be/rHUngcGgzVM?t=1515 of using
> mesos for docker images. And it looks like I can run the influxdb docker
> image with
> mesos-execute --master=192.168.10.151:5050 --name=influxdb
> --docker_image=influxdb --shell=false
>
> However, I have problems launching the application via the Marathon web
> interface. Could this be related to Marathon looking for dockerd?
>
> Delayed(0 of 1 instances)
> State
> TASK_FAILED
> Message
> Abnormal executor termination: unknown container
> Without any stderr/stdout
>
> {
>   "id": "/influxdb",
>   "cmd": null,
>   "cpus": 1,
>   "mem": 128,
>   "disk": 200,
>   "instances": 1,
>   "acceptedResourceRoles": [],
>   "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
>   "image": "influxdb",
>   "network": "BRIDGE",
>   "portMappings": [
> {
>   "containerPort": 8086,
>   "hostPort": 0,
>   "servicePort": 10001,
>   "protocol": "tcp",
>   "name": "httpapi",
>   "labels": {}
> },
> {
>   "containerPort": 25829,
>   "hostPort": 0,
>   "servicePort": 10002,
>   "protocol": "tcp",
>   "name": "collectd",
>   "labels": {}
> }
>   ],
>   "privileged": false,
>   "parameters": [],
>   "forcePullImage": false
> }
>   },
>   "portDefinitions": [
> {
>   "port": 10001,
>   "protocol": "tcp",
>   "name": "default",
>   "labels": {}
> },
> {
>   "port": 10002,
>   "protocol": "tcp",
>   "labels": {}
> }
>   ]
> }
>
> centos7
> mesos-1.4.0-2.0.1.x86_64
> marathon-1.4.8-1.0.660.el7.x86_64
> mesosphere-zookeeper-3.4.6-0.1.20141204175332.centos7.x86_64
> Getting the images directly from /tmp
>
> PS. Just 'playing' 2 days with mesos test environment, so pardon if
> terminology is not correct.
>


[Design Doc] Standalone Container API

2017-08-07 Thread Joseph Wu
As part of work to improve storage support in Mesos [1], we will be adding
the ability to launch containers via the Mesos Containerizer, without going
through the traditional method (i.e. framework -> offer cycle -> launch
executor/task -> status updates -> etc).  Below I've linked a short design
document for interacting with these "standalone" containers:

https://docs.google.com/document/d/1DZVfZAOLtqd8kbiWHD4j29LzaYcNCh1k6QQnbggyTio/

Please feel free to comment on the doc (or on this thread) if you have any
comments or suggestions! Thanks!

[1]
https://lists.apache.org/thread.html/02871cb51ce6d0bec24770bcaaba07b52dcda0cdb87cbdd0871b82d1@%3Cdev.mesos.apache.org%3E


Re: Custom isolators - External container

2017-08-07 Thread Joseph Wu
First off, the external containerizer was officially removed in Mesos 1.1.0
(it had been deprecated long before that release):
https://issues.apache.org/jira/browse/MESOS-3370

---

If you want to develop/deploy a new isolation method for Mesos, you should
first consider writing isolator modules (Mesos modules):
https://github.com/apache/mesos/blob/master/include/mesos/slave/isolator.hpp

Isolator modules are only applicable for the Mesos containerizer, so if you
plan to run docker workloads, you can consider using built-in isolators
("docker/runtime") that support running docker images in the Mesos
containerizer.

If you plan to use the Docker containerizer, your only choice is to develop
a custom executor to isolate tasks only within the same executor (docker
will take over isolating executors from each other).

---

There are few benefits from running the Mesos agent inside a Docker
container and many pitfalls, so this practice is highly discouraged.
Instead, we recommend running the Mesos agent directly via a supervisor
(upstart, systemd, etc.).  The agent itself is not containerized when run
normally.

On Sun, Aug 6, 2017 at 4:32 PM, Thodoris Zois  wrote:

> Hello,
>
> Is support for the external containerizer removed from Mesos? Also, I
> have developed some isolators that I would like to use with Mesos. I
> found 3 ways to do that, but I don't know which is the proper way and
> what the advantages and disadvantages are in each case.
>
> The 1st one is as a Mesos module
>
> The 2nd one is a custom executor
>
> The 3rd one is the container image on agent.
>
> What I am trying to do is to isolate docker tasks (images - one task per
> docker container) that run under the same agent with my own isolators.
>
> What are the benefits of running the agent in a big docker container
> with small docker task containers inside? If you don't run the agent in
> a big docker container, is it then by default running in a Mesos
> container, with the small docker task containers inside? (Assume that we
> don't run tasks under a Mesos container)
>
>
> Thank you and sorry for the so many questions!
> Thodoris
>


Re: Mesos Executor Failing

2017-05-24 Thread Joseph Wu
There isn't a tool for this.  Can you check if the Mesos agent is being
restarted (or crashing) when you launch a task?  And perhaps upload some
logs around the time of the task launch.

There is a mismatch between the exit codes you've reported though.  When
you see that log line in the sandbox logs, the exit code will be "1"
(failure), rather than "0" (success).

On Mon, May 22, 2017 at 9:30 PM, Chawla,Sumit <sumitkcha...@gmail.com>
wrote:

> Hi Joseph
>
> I am using 0.27.0.  Is there any diagnostic tool or command line that I
> can run to ascertain why it's happening?
>
> Regards
> Sumit Chawla
>
>
> On Fri, May 19, 2017 at 2:31 PM, Joseph Wu <jos...@mesosphere.io> wrote:


Re: Mesos Executor Failing

2017-05-19 Thread Joseph Wu
What version of Mesos are you using?  (Just based on the word "slave" in
that error message, I'm guessing 0.28 or older.)

The "Failed to synchronize" error is something that can occur while the
agent is launching the executor.  During the launch, the agent will create
a pipe to the executor subprocess; and the executor makes a blocking read
on this pipe.  The agent will write a value to the pipe to signal the
executor to proceed.  If the agent restarts or the pipe breaks at this
point in the launch, then you'll see this error message.
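The handshake can be sketched in miniature (illustrative only; the real logic lives in the agent and the mesos-containerizer binary):

```python
import os

def agent_launch_handshake():
    """Sketch of the agent/executor pipe handshake described above: the
    child blocks on a pipe read until the parent signals it to proceed.
    If the parent dies or the pipe breaks first, the read returns EOF,
    which is the 'Failed to synchronize' case."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                     # "executor": block until signaled
        os.close(write_fd)
        data = os.read(read_fd, 1)  # blocking read; b"" (EOF) means failure
        os.close(read_fd)
        os._exit(0 if data else 1)
    os.close(read_fd)                # "agent": signal the executor to go
    os.write(write_fd, b"\x01")
    os.close(write_fd)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```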

On Thu, May 18, 2017 at 9:44 PM, Chawla,Sumit 
wrote:

> Hi
>
> I am facing a peculiar issue on one of the slave nodes of our cluster.  I
> have a spark cluster with 40+ nodes.  On one of the nodes, all tasks fail
> with exit code 0.
>
> ExecutorLostFailure (executor e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S76
> exited caused by one of the running tasks) Reason: Unknown executor exit
> code (0)
>
>
> I cannot seem to find anything in mesos-slave.logs, and there is nothing
> being written to stdout/stderr.  Are there any debugging utilities that I
> can use to debug what might be going wrong on that particular slave?
>
> I tried running following but got stuck at:
>
>
> /mesos-containerizer launch --command='{"environment":{},"shell":true,"value":"ls -ltr"}'
> --directory=/var/tmp/mesos/slaves/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/frameworks/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-0312/executors/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/runs/45aa784c-f485-46a6-aeb8-997e82b80c4f
> --help=false --pipe_read=0 --pipe_write=0 --user=smi
>
> Failed to synchronize with slave (it's probably exited)
>
>
> Would appreciate pointers to any debugging methods/documentation to
> diagnose these kinds of problems.
>
> Regards
> Sumit Chawla
>
>


Re: Mesos fetcher error when running as non-root user

2017-04-26 Thread Joseph Wu
There was a change in 1.2.0 which changed how the fetcher would chown the
sandbox:
https://issues.apache.org/jira/browse/MESOS-5218

Prior to 1.2, when the fetcher ran, it would recursively chown the entire
sandbox to the given user.  This was incorrect behavior, since the Mesos
agent will create the sandbox under the same user (but might put some root
files in the non-root sandbox).
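The before/after chown behavior can be modeled with a hypothetical sketch (the file names and helper functions below are made up for illustration; this is not the fetcher's actual code):

```python
import os
import tempfile

def paths_chowned_recursively(sandbox):
    """Pre-1.2 fetcher model: every entry under the sandbox would be
    chowned to the task user -- including any root-owned files the
    agent placed there."""
    chowned = [sandbox]
    for root, dirs, files in os.walk(sandbox):
        chowned.extend(os.path.join(root, name) for name in dirs + files)
    return chowned

def paths_chowned_after_fix(fetched_artifacts):
    """1.2+ fetcher model: only the artifacts the fetcher downloaded are
    touched; the sandbox itself is already owned by the task user
    because the agent created it as that user."""
    return list(fetched_artifacts)

# A sandbox with one agent-created file and one fetched artifact.
sandbox = tempfile.mkdtemp()
open(os.path.join(sandbox, "executor-config.json"), "w").close()
fetched = os.path.join(sandbox, "docker.tar.gz")
open(fetched, "w").close()

print(len(paths_chowned_recursively(sandbox)))          # -> 3 (dir + 2 files)
print(paths_chowned_after_fix([fetched]) == [fetched])  # -> True
```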

Can you check your agent logs and paste the fetcher's error here?

On Wed, Apr 26, 2017 at 9:06 AM, De Groot (CTR), Craig <
craig.degroot@usgs.gov> wrote:

> We recently upgraded from Mesos 1.1.0 to 1.2.0 and are encountering errors
> with code that previously worked in 1.1.0.  I believe that this is a bug in
> the new version.  If not, I would like to know the correct procedure for
> using the sandbox as a user other than root.
>
> Here is the scenario:
> 1) Setup a job in Marathon which specifies a URI to our private
> docker.tar.gz
>   - See: this for an example ... https://mesosphere.github.
> io/marathon/docs/native-docker-private-registry.html
>   - This is a local file on each node
>
> 2) Specify a User (other than root) in the Marathon UI
>
> 3) Mesos will try to fetch the file and fails during the copy because the
> ownership of the sandbox directory is not changed to the specified user.
>   - Note that 1.1.0 correctly set the sandbox directory to the specified
> user
>   - This behavior is documented in the Mesos Docs here (see "specifying a
> user name"):  http://mesos.apache.org/documentation/latest/fetcher/
>
> Thanks in advance for the help!
>
> __
> Craig De Groot
>
>
>


Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-12 Thread Joseph Wu
If Apache JIRA were up, I'd point you to a JIRA noting the problem with
naming docker containers `mesos-*`, as Mesos reserves that prefix (and
kills everything it considers "unknown").

As a quick workaround, try setting this flag to false:
https://github.com/apache/mesos/blob/1.1.x/src/slave/flags.cpp#L590-L596
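The "reserved prefix" behavior amounts to a name check during agent recovery: Docker containers named `mesos-*` that the agent cannot match to a container it knows about are treated as orphans and killed. A simplified model (hypothetical names; not the agent's actual code):

```python
MESOS_PREFIX = "mesos-"

def containers_to_kill(running_containers, known_ids, kill_orphans=True):
    """Return the 'mesos-*' containers the agent would consider unknown
    orphans on recovery.  Setting kill_orphans=False models disabling
    the orphan-killing flag linked above."""
    if not kill_orphans:
        return []
    return [name for name in running_containers
            if name.startswith(MESOS_PREFIX)
            and name[len(MESOS_PREFIX):] not in known_ids]

running = ["mesos-slave", "mesos-abc123", "redis"]
# "mesos-slave" matches the reserved prefix but no known container ID,
# so the agent would kill it -- even though it is the agent itself!
print(containers_to_kill(running, known_ids={"abc123"}))            # -> ['mesos-slave']
print(containers_to_kill(running, {"abc123"}, kill_orphans=False))  # -> []
```

This is why naming your own Docker containers with a `mesos-` prefix (as in the `mesos-slave` container above) gets them killed shortly after the agent starts.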

On Thu, Jan 12, 2017 at 4:41 PM, Giulio Eulisse 
wrote:

> MMm... it seems to die after a long sequence of forks, and mesos itself
> seems to be issuing the sigkill. I wonder if it's trying to do some cleanup
> and it does not realise one of the containers is the agent itself??? Notice
> I do have `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.
>
> On 13 Jan 2017, 01:23 +0100, Giulio Eulisse ,
> wrote:
>
> Ciao,
>
> the only thing I could find is by running a parallel `docker events`
>
> ```
> 2017-01-13T01:18:20.766593692+01:00 network connect
> 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267
> (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
> name=host, type=host)
> 2017-01-13T01:18:20.846137793+01:00 container start
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
> name=mesos-slave, vendor=CentOS)
> 2017-01-13T01:18:20.847965921+01:00 container resize
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
> (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1,
> license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
> 2017-01-13T01:18:21.610141857+01:00 container kill
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
> name=mesos-slave, signal=15, vendor=CentOS)
> 2017-01-13T01:18:21.610491564+01:00 container kill
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
> name=mesos-slave, signal=9, vendor=CentOS)
> 2017-01-13T01:18:21.646229213+01:00 container die
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
> (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1,
> license=GPLv2, name=mesos-slave, vendor=CentOS)
> 2017-01-13T01:18:21.652894124+01:00 network disconnect
> 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267
> (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
> name=host, type=host)
> 2017-01-13T01:18:21.705874041+01:00 container stop
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71
> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2,
> name=mesos-slave, vendor=CentOS)
> ```
>
> Ciao,
> Giulio
>
> On 13 Jan 2017, 01:06 +0100, haosdent , wrote:
>
> Hi, @Giuliio According to your log, it looks normal. Do you have any logs
> related to "SIGKILL"?
>
> On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse 
> wrote:
>
>> Hi,
>>
>> I’ve a setup where I run mesos in docker which works perfectly when I use
>> 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 1.0.0)
>> and it seems to receive a sigkill right after saying:
>>
>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>> I0112 23:22:09.889120  4934 main.cpp:243] Build: 2016-08-26 23:06:27 by 
>> centos
>> I0112 23:22:09.889181  4934 main.cpp:244] Version: 1.0.1
>> I0112 23:22:09.889184  4934 main.cpp:247] Git tag: 1.0.1
>> I0112 23:22:09.889188  4934 main.cpp:251] Git SHA: 
>> 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
>> W0112 23:22:09.890808  4934 openssl.cpp:398] Failed SSL connections will be 
>> downgraded to a non-SSL socket
>> W0112 23:22:09.891237  4934 process.cpp:881] Failed SSL connections will be 
>> downgraded to a non-SSL socket
>> E0112 23:22:10.129096  4934 shell.hpp:106] Command 'hadoop version 2>&1' 
>> failed; this is the output:
>> sh: hadoop: command not found
>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client 
>> environment:zookeeper.version=zookeeper C client 3.4.8
>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client 
>> environment:host.name=.XXX.ch
>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client 
>> environment:os.name=Linux
>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client 
>> environment:os.arch=3.10.0-229.14.1.el7.x86_64
>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client 
>> environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015
>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client 
>> environment:user.name=(null)
>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client 
>> environment:user.home=/root
>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client 
>> environment:user.dir=/
>> 2017-01-12 

Re: Mesos V1 Operator HTTP API - Java Proto Classes

2016-11-16 Thread Joseph Wu
Added.  Welcome to the contributors list :)

On Wed, Nov 16, 2016 at 9:49 AM, Vijay Srinivasaraghavan <
vijikar...@yahoo.com> wrote:

> I have created a JIRA and will submit a patch. Could someone please add me
> to the contributor list as I am not able to assign the JIRA to myself?
>
> https://issues.apache.org/jira/browse/MESOS-6597
>
>
>
>
> On Wednesday, November 16, 2016 9:00 AM, Anand Mazumdar 
> wrote:
>
>
> We wanted to move the project away from officially supporting anything
> other than C++ and discuss more on if we should be responsible for
> publishing to the various language specific channels. However, for the time
> being, we had decided to include the v1 protobufs in the mesos JAR itself.
> (it already contains the v1 Scheduler/Executor protos)
>
> Please file an issue as Zameer pointed out.
>
> -anand
>
> On Wed, Nov 16, 2016 at 8:34 AM, Zameer Manji  wrote:
>
> > I think this is a bug, I feel the jar should include all v1 protobuf
> files.
> >
> > Vijay, I encourage you to file a ticket.
> >
> > On Tue, Nov 15, 2016 at 8:04 PM, Vijay Srinivasaraghavan <
> > vijikar...@yahoo.com.invalid> wrote:
> >
> >> I believe the HTTP API will use the same underlying message format
> >> (proto def), and hence the request/response value objects (Java) need to
> >> be auto-generated from the proto files to be used in a Jersey-based Java
> >> REST client?
> >>
> >>On Tuesday, November 15, 2016 12:37 PM, Tomek Janiszewski <
> >> jani...@gmail.com> wrote:
> >>
> >>
> >>  I suspect jar is deprecated and includes only old API used by mesoslib.
> >> The
> >> goal is to create HTTP API and stop supporting native libs (jars, so,
> >> etc).
> >> I think you shouldn't use that jar in your project.
> >>
> >> wt., 15.11.2016, 20:38 użytkownik Vijay Srinivasaraghavan <
> >> vijikar...@yahoo.com> napisał:
> >>
> >> > Hello,
> >> >
> >> > I am writing a rest client for "operator APIs" and found that some of
> >> the
> >> > protobuf java classes (like "include/mesos/v1/quota/quota.proto",
> >> > "include/mesos/v1/master/master.proto") are not included in the mesos
> >> jar
> >> > file. While investigating, I have found that the "Make" file does not
> >> > include these proto definition files.
> >> >
> >> > I have updated the Make file and added the protos that I am interested
> >> in
> >> > and built a new jar file. Is there any reason why these proto
> >> definitions
> >> > are not included in the original build apart from the reason that the
> >> APIs
> >> > are still evolving?
> >> >
> >> > Regards
> >> > Vijay
> >> >
> >>
> >> --
> >> Zameer Manji
> >>
> >
>
>
>


Re: framework failover

2016-11-04 Thread Joseph Wu
A couple questions/notes:

What do you mean by:

> the system will deploy the framework on a new node within less than three
> minutes.

Are you running your frameworks via Marathon?

How are you terminating the Mesos Agent?  If you send a `kill -SIGUSR1`,
the agent will immediately kill all of its tasks and un-register with the
master.
If you kill the agent with some other signal, the agent will simply stop,
but tasks will continue to run.
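The SIGUSR1 distinction can be demonstrated with an ordinary signal handler; this is only a model of "graceful drain on SIGUSR1", not the agent's real C++ signal handling:

```python
import os
import signal

drain_requested = False

def on_sigusr1(signum, frame):
    # In the agent, SIGUSR1 means: kill all tasks and un-register with
    # the master.  Any other fatal signal just stops the agent process,
    # leaving its tasks running.
    global drain_requested
    drain_requested = True

signal.signal(signal.SIGUSR1, on_sigusr1)
os.kill(os.getpid(), signal.SIGUSR1)   # like `kill -SIGUSR1 <agent-pid>`
print(drain_requested)                 # -> True
```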

According to the mesos GUI page cassandra holds 99-100 % of the resources
> on the terminated slave during that 14 minutes.

^ Implies that the master does not remove the agent immediately, meaning
you killed the agent, but did not kill the tasks.
During this time, the master is waiting for the agent to come back online.
If the agent doesn't come back during some (configurable) timeout, it will
notify the frameworks about the loss of an agent.

Also, it's a little odd that your frameworks will disconnect upon the agent
process dying.  You may want to investigate your framework dependencies.  A
framework should definitely not depend on the agent process (frameworks
depend on the master though).



On Fri, Nov 4, 2016 at 10:32 AM, Jaana Miettinen  wrote:

> Hi, Would you help me to find out how the framework failover happens in
> mesos 0.28.0 ?
>
>
>
> In my mesos-environment I have the following  frameworks:
>
>
>
> etcd-mesos
>
> cassandra-mesos 0.2.0-1
>
> eremitic
>
> marathon 0.15.2
>
>
>
> If I shut down the agent (mesos-slave) on which my framework has been
> deployed, from the Linux command line via the ‘halt’ command, the system
> will deploy the framework on a new node within less than three minutes.
>
>
>
> But when I shut down the agent in which cassandra framework is running it
> takes 14 minutes before the system recovers.
>
>
>
> According to the mesos GUI page cassandra holds 99-100 % of the resources
> on the terminated slave during that 14 minutes.
>
>
>
> Seen from the mesos-log:
>
>
>
> Line 976: I1104 08:53:29.516564 15502 master.cpp:1173] Slave
> c002796f-a98d-4e55-bee3-f51b8d06323b-S8 at slave(1)@10.254.69.140:5050
> (mesos-slave-1) disconnected
>
>  Line 977: I1104 08:53:29.516644 15502
> master.cpp:2586] Disconnecting slave c002796f-a98d-4e55-bee3-f51b8d06323b-S8
> at slave(1)@10.254.69.140:5050 (mesos-slave-1)
>
>  Line 1020: I1104 08:53:39.872681 15501
> master.cpp:1212] Framework c002796f-a98d-4e55-bee3-f51b8d06323b-0007
> (Eremetic) at scheduler(1)@10.254.69.140:31570 disconnected
>
>  Line 1021: I1104 08:53:39.872707 15501
> master.cpp:2527] Disconnecting framework 
> c002796f-a98d-4e55-bee3-f51b8d06323b-0007
> (Eremetic) at scheduler(1)@10.254.69.140:31570
>
>  Line 1080: W1104 08:54:53.621151 15503
> master.hpp:1764] Master attempted to send message to disconnected framework
> c002796f-a98d-4e55-bee3-f51b8d06323b-0007 (Eremetic) at scheduler(1)@
> 10.254.69.140:31570
>
>  Line 1083: W1104 08:54:53.621279 15503
> master.hpp:1764] Master attempted to send message to disconnected framework
> c002796f-a98d-4e55-bee3-f51b8d06323b-0004 (Eremetic) at scheduler(1)@
> 10.254.74.77:31956
>
>  Line 1085: W1104 08:54:53.621354 15503
> master.hpp:1764] Master attempted to send message to disconnected framework
> c002796f-a98d-4e55-bee3-f51b8d06323b-0002 (Eremetic) at scheduler(1)@
> 10.254.77.2:31460
>
>  Line 1219: I1104 09:09:09.933365 15502
> master.cpp:1212] Framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005
> (cassandra.ava) at scheduler-6849089f-1a44-4101-
> b5b7-0960da81b910@10.254.69.140:36495 disconnected
>
>  Line 1220: I1104 09:09:09.933404 15502
> master.cpp:2527] Disconnecting framework 
> c002796f-a98d-4e55-bee3-f51b8d06323b-0005
> (cassandra.ava) at scheduler-6849089f-1a44-4101-
> b5b7-0960da81b910@10.254.69.140:36495
>
>  Line 1222: W1104 09:09:09.933518 15502
> master.hpp:1764] Master attempted to send message to disconnected framework
> c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at
> scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
>
>  Line 1223: W1104 09:09:09.933697 15502
> master.hpp:1764] Master attempted to send message to disconnected framework
> c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at
> scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
>
>  Line 1224: W1104 09:09:09.933768 15502
> master.hpp:1764] Master attempted to send message to disconnected framework
> c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at
> scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
>
>  Line 1225: W1104 09:09:09.933825 15502
> master.hpp:1764] Master attempted to send message to disconnected framework
> 

Re: what is the status on this?

2016-09-06 Thread Joseph Wu
And for discovery of other nodes in the Paxos group.

The work on modularizing/decoupling Zookeeper is a prerequisite for having
the replicated log perform leader election itself.  <- That would merely be
another implementation of the interface we will introduce in the process:

https://issues.apache.org/jira/browse/MESOS-3574

On Tue, Sep 6, 2016 at 11:31 AM, Avinash Sridharan <avin...@mesosphere.io>
wrote:

> Also, I think, the replicated log itself uses Zookeeper for leader
> election.
>
> On Tue, Sep 6, 2016 at 12:15 PM, Zameer Manji <zma...@apache.org> wrote:
>
>> If we use the replicated log for leader election, how will frameworks
>> detect the leading master? Right now the scheduler driver uses the
>> MasterInfo in ZK to discover the leader and detect leadership changes.
>>
>> On Mon, Sep 5, 2016 at 10:18 AM, Dario Rexin <dre...@apple.com> wrote:
>>
>>> If we go and change this, why not simply remove any dependencies to
>>> external systems and simply use the replicated log for leader election?
>>>
>>> On Sep 5, 2016, at 9:02 AM, Alex Rukletsov <a...@mesosphere.com> wrote:
>>>
>>> Kant—
>>>
>>> thanks a lot for the feedback! Are you interested in helping out with
>>> Consul module once Jay and Joseph are done with modularizing patches?
>>>
>>> On Mon, Sep 5, 2016 at 8:50 AM, Jay JN Guo <guojian...@cn.ibm.com>
>>> wrote:
>>>
>>>> Patches are currently under review by @Joseph and can be found at the
>>>> links provided by @haosdent.
>>>>
>>>> I took a quick look at Consul key/value HTTP APIs and they look very
>>>> similar to Etcd APIs. You could actually reuse our Etcd module
>>>> implementation once we manage to push the module into Mesos community.
>>>>
>>>> The only technical problem I could see for now is that Consul does not
>>>> support `POST` with incremental key index. We may need to leverage
>>>> `?cas=` operation in Consul to emulate the behaviour of joining a
>>>> key group.
>>>>
>>>> We could have a discussion on how to implement Consul HA module.
>>>>
>>>> cheers,
>>>> /J
>>>>
>>>>
>>>> - Original message -
>>>> From: haosdent <haosd...@gmail.com>
>>>> To: user <user@mesos.apache.org>
>>>> Cc: Jay JN Guo/China/IBM@IBMCN
>>>> Subject: Re: what is the status on this?
>>>> Date: Sun, Sep 4, 2016 6:10 PM
>>>>
>>>> Jay has some patches for de-couple Mesos with Zookeeper
>>>>
>>>> https://issues.apache.org/jira/browse/MESOS-5828
>>>> https://issues.apache.org/jira/browse/MESOS-5829
>>>>
>>>> I think it should be possible to support consul by custom modules after
>>>> jay's work done.
>>>>
>>>> On Sun, Sep 4, 2016 at 6:02 PM, kant kodali <kanth...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Alex,
>>>>
>>>> We have some experienced devops people here, and they all had one thing
>>>> in common: ZooKeeper is a pain to maintain. In fact, we refused to bring
>>>> in new tech stacks that require ZooKeeper, such as Kafka. So we are
>>>> desperately searching for an alternative, preferably using Consul; I hear
>>>> a lot of positive responses when it comes to Consul. It would be great to
>>>> see Mesos and Consul working together, in which case we would be ready to
>>>> jump at it and switch from YARN to Mesos.
>>>>
>>>> Thanks,
>>>> Kant
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Aug 31, 2016 1:03 AM, Alex Rukletsov a...@mesosphere.com
>>>> wrote:
>>>>
>>>> Kant—
>>>>
>>>> mind telling us what is your use case and why this ticket is important
>>>> for you? It will help us prioritize work.
>>>>
>>>> On Fri, Aug 26, 2016 at 2:46 AM, tommy xiao <xia...@gmail.com> wrote:
>>>>
>>>> Hi guys, I have always followed this case. The good news is that etcd
>>>> already has patches, so the coming Consul support is easy; it just needs
>>>> some time for coding. If you are interested, let us collaborate on it.
>>>>
>>>> 2016-08-26 8:11 GMT+08:00 Joseph Wu <jos...@mesosphere.io>:
>>>>
>>>> There is no timeline as no one has done any work on the issue.
>>>>
>>>>
>>>> On Thu, Aug 25, 2016 at 4:54 PM, kant kodali <kanth...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Guys,
>>>>
>>>> I see this ticket and other related tickets should be part of sprints
>>>> in 2015 and it is still not resolved yet. can we have a timeline on this?
>>>> This would be really helpful
>>>>
>>>> https://issues.apache.org/jira/browse/MESOS-3797
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>> --
>>>> Deshi Xiao
>>>> Twitter: xds2000
>>>> E-mail: xiaods(AT)gmail.com
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> Avinash Sridharan, Mesosphere
> +1 (323) 702 5245
>


Re: Failed to shutdown socket

2016-09-06 Thread Joseph Wu
You can easily trigger this log line by curling Mesos and interrupting the
curl.  For example, the python script in the description of
https://issues.apache.org/jira/browse/MESOS-6104 will almost always trigger
that log line.

The log line itself never indicates a leak.  What I meant by "insightful"
is something like this: https://issues.apache.org/jira/browse/MESOS-5576
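The message maps to shutdown(2) failing with ENOTCONN ("Transport endpoint is not connected"). You can reproduce the underlying errno, though not Mesos' exact code path, by shutting down a socket that has no live peer:

```python
import errno
import os
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # No peer is connected, so shutdown fails the same way the Mesos
    # process sees when a client hangs up before the response is done.
    sock.shutdown(socket.SHUT_RDWR)
    shutdown_errno = 0
except OSError as exc:
    shutdown_errno = exc.errno
finally:
    sock.close()

print(shutdown_errno == errno.ENOTCONN)  # -> True
print(os.strerror(shutdown_errno))       # the text Mesos logs after "fd N:"
```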

On Tue, Sep 6, 2016 at 11:26 AM, June Taylor <j...@umn.edu> wrote:

> I admit I don't understand this explanation, and our logs are also filled
> with this message. How would one tell whether it's safe to ignore it
> versus a sign of a leak?
>
>
> Thanks,
> June Taylor
> System Administrator, Minnesota Population Center
> University of Minnesota
>
> On Tue, Sep 6, 2016 at 1:07 PM, Joseph Wu <jos...@mesosphere.io> wrote:
>
>> You can ignore that log line.  It's something Mesos prints when the
>> client side of a socket closes the socket before Mesos does.
>>
>> We've kept the log line thus far because it can be surprisingly
>> insightful when tracking down things like FD leaks based on logs alone :)
>>
>> On Mon, Sep 5, 2016 at 4:26 AM, Gavin Baumanis <gavinbauma...@gmail.com>
>> wrote:
>>
>>> Hi Everyone,
>>>
>>> I am just starting to do some evaluation of Mesos.
>>>
>>> I managed to get it built / installed and running, successfully using
>>> the 1.0.1 release.
>>> I am running 5 nodes via VirtualBox VMs on my Mac.
>>> I restarted the VMs for a separate reason and have subsequently noticed
>>> the following error;
>>> E0905 21:19:32.483242  1002 process.cpp:2105] Failed to shutdown socket
>>> with fd 11: Transport endpoint is not connected
>>>
>>> I have searched here and on Stack Overflow - and while I can see a few
>>> threads discussing it - there doesn't seem to be a definitive answer as to
>>> what causes it, nor a definitive "try these steps, in this order" fix.
>>>
>>> So I thought I would ask about this issue again and hopefully create a
>>> discussion that could serve as a one-stop shop for future Googlers of the
>>> same issue.
>>>
>>> When visiting the web page of the master @ 5050, the whole cluster is
>>> shown correctly with all resources available. So I am not even sure if the
>>> error is "noise" - or if it is a serious "thing" to be actively addressed -
>>> but either way I would rather understand why it is happening - even if it
>>> is benign.
>>>
>>> if anyone has any ideas - I would love to hear them and happily accept
>>> the help.
>>> please let me know if there is anything specific, you'll need from me.
>>>
>>>
>>> As always thanks!
>>>
>>> Gavin Baumanis
>>> E: gavinbauma...@gmail.com
>>>
>>> I, for one, like Roman numerals.
>>>
>>
>>
>


Re: Mesos 1.0 WebUI does not display cluster name

2016-08-30 Thread Joseph Wu
Try "--cluster" instead of "—cluster".

On Tue, Aug 30, 2016 at 2:01 PM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:

> Hi all,
>
>
> Mesos Web UI does not display the name of the cluster.
>
> I have a config file named cluster under /etc/mesos-master/ along with
> other configuration files. It worked well with Mesos 0.28.0. I've upgraded
> the cluster to Mesos 1.0 and this doesn't seem to work.
>
> /usr/sbin/mesos-master --zk=zk://192.168.0.1:2181,192.168.0.17:2181,
> 192.168.0.33:2181/mesos --port=5050 --log_dir=/var/log/mesos
> --acls=/etc/mesos_acls.json --authenticate_frameworks=true
> —cluster=testcluster --credentials=/etc/mesos-auth/credentials --quorum=2
> --work_dir=/var/lib/mesos
>
>
> I greatly appreciate any help!
>
> --
> Regards,
> Haripriya Ayyalasomayajula
>
>


Re: Is this a CI system for is it a development system

2016-08-30 Thread Joseph Wu
The Windows CI can be found here:
https://builds.apache.org/job/Mesos-Windows/

On Tue, Aug 30, 2016 at 7:14 AM, Alexander Rojas 
wrote:

> Didn’t take it as such, I’m just trying to help you as well as I can :)
>
>
> On 30 Aug 2016, at 16:11, DiGiorgio, Mr. Rinaldo S. 
> wrote:
>
>
> On Aug 30, 2016, at 9:39 AM, Alexander Rojas 
> wrote:
>
> Hi,
>
> As far as I know, the machine that did the OS-X builds has been down for
> more than a year. Likewise, I don’t think there has ever been a windows job.
>
> So far Alex Clemmer has been directing the efforts of the windows build so
> you may want to contact him, it you are interested. I would also recommend
> to use the list d...@mesos.apache.org for development related questions.
>
> Thank you and that was by no means a critique.
>
>
> best,
>
> Alexander
>
> On 30 Aug 2016, at 11:50, DiGiorgio, Mr. Rinaldo S. 
> wrote:
>
> Hi,
>
> I don’t know what the expectation is when looking at mesos jobs here:
>
> https://builds.apache.org/view/Incubator%20Projects/job/
> Mesos-OSX/COMPILER=gcc,CONFIGURATION=--verbose%20--
> enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
> 20MESOS_VERBOSE=1,label_exp=mac/
>
> has been waiting for six days.
>
> I can’t find any windows build jobs.
>
>
> Rinaldo
>
>
>
>
>


Re: what is the status on this?

2016-08-25 Thread Joseph Wu
There is no timeline as no one has done any work on the issue.

On Thu, Aug 25, 2016 at 4:54 PM, kant kodali  wrote:

> Hi Guys,
>
> I see this ticket and other related tickets should be part of sprints in
> 2015 and it is still not resolved yet. can we have a timeline on this? This
> would be really helpful
>
> https://issues.apache.org/jira/browse/MESOS-3797
>
> Thanks!
>


Re: can we use mesos and spark with consul or etcd?

2016-08-25 Thread Joseph Wu
There's a bit of ongoing work on decoupling ZK from Mesos, but this is
still some way off.  See this epic:
https://issues.apache.org/jira/browse/MESOS-1806 and its children.

Most likely, you'll run into headaches regardless of ZK/Consul/Etcd.  All have
their own set of quirks.  (i.e. the grass is always greener on the other
side.)

On Thu, Aug 25, 2016 at 1:10 PM, kant kodali  wrote:

> I am trying to start up a new cluster, but we have 4 people here who have
> managed large clusters using ZooKeeper, and everyone seems to have come to
> the consensus that they want to avoid ZooKeeper.
>
>
>
> On Thu, Aug 25, 2016 12:46 PM, Charles Allen charles.al...@metamarkets.com
> wrote:
>
>> Out of curiosity, are you wanting to avoid ZK because you already have to
>> use etcd? or are you starting up a new cluster and just really don't want
>> to start off with ZK? (i've been wondering how severe the ZK integration
>> requirement is myself)
>>
>> On Thu, Aug 25, 2016 at 12:43 PM kant kodali  wrote:
>>
>> Hi,
>>
>> can we use mesos and spark with consul or etcd? If we can we would be
>> happy to avoid zookeeper.
>>
>> Thanks!
>>
>>


Re: Marathon constantly unregisters on particular slaves

2016-08-24 Thread Joseph Wu
>
> Scenario is:
> * Marathon registers on slave,
>
Why is Marathon registering on the agent?  This shouldn't even be possible,
as frameworks must talk to the master.

Marathon dies on two of them constantly.

How are you starting Marathon?  Via some init service?  And are you
starting Marathon on every node?


Re: Using mesos' cfs limits on a docker container?

2016-08-13 Thread Joseph Wu
If you're not against running Docker containers without the Docker daemon,
try using the Unified containerizer.
See the latter half of this document:
http://mesos.apache.org/documentation/latest/mesos-containerizer/

On Sat, Aug 13, 2016 at 7:02 PM, Mark Hammons  wrote:

> Hi All,
>
>
>
> I was having a lot of success having mesos force sandboxed programs to
> work within cpu and memory constraints, but when I added docker into the
> mix, the cpu limitations go out the window (not sure about the memory
> limitations. Is there any way to mix these two methods of isolation? I'd
> like my executor/algorithm to run inside a docker container, but have that
> container's memory and cpu usage controlled by systemd/mesos.
>
>
>
> Thanks,
>
> Mark
> --
>
> Mark Hammons - +33 06 03 69 56 56
>
> Research Engineer @ BioEmergences 
>
> Lab Phone: 01 69 82 34 19
>


Re: Attributes cause agent to fail

2016-07-29 Thread Joseph Wu
Works fine for me.  Make sure the agent isn't just complaining about
invalid flags.

i.e. This is invalid:
--attributes="something"

This is valid:
--attributes="something:foo"
--attributes="something:foo; nothing:bar"

And make sure your agent's work directory doesn't contain info from an
agent started with different attributes (or no attributes).
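The valid/invalid distinction above boils down to "every attribute needs a `key:value` pair". A small validator sketch that mirrors the documented `--attributes` syntax (simplified; not the agent's actual parser, which also handles ranges and other value types):

```python
def parse_attributes(spec):
    """Parse an --attributes string like "something:foo; nothing:bar"
    into a dict, rejecting entries without a ':' separator (the
    failure mode of --attributes="something")."""
    attributes = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        if ":" not in entry:
            raise ValueError(f"attribute {entry!r} has no value")
        key, _, value = entry.partition(":")
        attributes[key.strip()] = value.strip()
    return attributes

print(parse_attributes("something:foo; nothing:bar"))
# -> {'something': 'foo', 'nothing': 'bar'}
```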

On Fri, Jul 29, 2016 at 5:31 PM, Douglas Nelson  wrote:

> When I set any attributes for the agent node it fails to run. No
> mesos-slave.ERROR log is created. I am using mesos 1.0.0 from the
> mesosphere package, but I also tried building it and had the same issue.
>
> As soon as I remove the --attributes flag the agent runs normally and
> registers itself with the master node. Is attributes deprecated? Is anyone
> else running into this?
>


Re: What will happen in maintenance mode

2016-07-25 Thread Joseph Wu
There are some cluster environments where nodes do not have an IP or a
hostname.  That's why each MachineID must have one OR the other.  Not one
XOR the other.

There is a note further up the page that explains how Mesos matches
machines to agents:
https://github.com/apache/mesos/blame/3e115accca390663575753279f4400495625cb91/docs/maintenance.md#L135-L142
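A machine in a maintenance schedule is just this hostname/IP pair. A sketch of building the POST body for `/maintenance/schedule` (field names follow the maintenance documentation; treat this as illustrative and verify against your Mesos version):

```python
import time

def maintenance_window(machines, start_secs, duration_secs):
    """Build the JSON body for POST /maintenance/schedule.  Each machine
    should carry the hostname AND ip that the master associates with the
    agent (check /master/slaves); otherwise the window may not match
    any agent."""
    return {
        "windows": [{
            "machine_ids": [
                {"hostname": host, "ip": ip} for host, ip in machines
            ],
            "unavailability": {
                "start": {"nanoseconds": int(start_secs * 1e9)},
                "duration": {"nanoseconds": int(duration_secs * 1e9)},
            },
        }]
    }

body = maintenance_window([("foo-bar", "127.0.0.1")],
                          start_secs=time.time() + 3600,
                          duration_secs=7200)
print(body["windows"][0]["machine_ids"])
# -> [{'hostname': 'foo-bar', 'ip': '127.0.0.1'}]
```

The resulting dict can be serialized with `json.dumps` and POSTed to the master's `/maintenance/schedule` endpoint.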

On Fri, Jul 22, 2016 at 9:34 PM, tommy xiao <xia...@gmail.com> wrote:

> yes, in a recent Mesos deployment, when I ignored the hostname and just
> specified the IP, the Mesos cluster sometimes did not work, because the
> hostname was not correct. So I am also curious about the machine definition:
> "Each machine must have at least a hostname or IP included. The hostname
> is not case-sensitive."
>
> It should be defined as requiring both the hostname and the IP.
>
>
> 2016-07-19 11:38 GMT+08:00 Qiang Chen <qzsc...@gmail.com>:
>
>> Thanks Joseph.
>>
>> I saw this from mesos [doc site](
>> http://mesos.apache.org/documentation/latest/maintenance/):
>>
>> "Each machine must have at least a hostname or IP included. The hostname
>> is not case-sensitive."
>>
>> From my test, the statement above is not correct: if I specify only the
>> hostname or only the IP, it will NOT take effect for the maintenance
>> agents, but specifying both works.
>>
>> On 2016年07月19日 02:17, Joseph Wu wrote:
>>
>>
>>
>> My guess is that your agents don't match the machines you specified.
>> Note: The maintenance endpoints in Mesos allow you to specify maintenance
>> against non-existent machines, because the operator may add agents on those
>> machines in future.
>>
>> In Mesos' maintenance primitives, a "machine" is a hostname + IP.  (A
>> physical/virtual machine can hold multiple agents.)  The response in
>> /maintenance/status is in terms of machines, not agents.  If none of your
>> frameworks support inverse offers, then you won't get any useful
>> information from the /maintenance/status endpoint.
>>
>> You can figure out an agent's hostname/IP by hitting the /master/slaves
>> endpoint:
>>
>> {
>>   "slaves": [
>> {
>>   "pid":"slave(1)@127.0.0.1:5051",
>>   "hostname":"foo-bar",
>>   ...
>>
>> ^ The above translates to a machine = { "hostname": "foo-bar", "ip" : "
>> 127.0.0.1" }
>>
>> On Mon, Jul 18, 2016 at 2:08 AM, Qiang Chen <qzsc...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm puzzled in using maintenance mode.
>>>
>>> I see this from mesos [doc site](
>>> http://mesos.apache.org/documentation/latest/maintenance/):
>>>
>>> ```
>>> When maintenance is triggered by the operator, all agents on the machine
>>> are told to shutdown. These agents are removed from the master, which means
>>> that a TASK_LOST status update will be sent for every task running on
>>> each of those agents. The scheduler driver’s slaveLost callback will
>>> also be invoked for each of the removed agents. Any agents on machines in
>>> maintenance are also prevented from re-registering with the master in the
>>> future (until maintenance is completed and the machine is brought back up).
>>> ```
>>> But I didn't see the agent machine shut down or any tasks fail when I
>>> tested the maintenance HTTP endpoints.
>>>
>>> Will agents in that mode move their running tasks to other agents?  That
>>> is, will they evacuate all of their tasks and then shut down?
>>>
>>> When I POST to "/maintenance/schedule" and "/machine/down" with a proper
>>> maintenance time window, GET "/maintenance/status" shows the specified
>>> agents in the "draining_machines" and "down_machines" lists, but nothing
>>> shuts down and no tasks are evacuated.  Why?  Does that make sense?
>>>
>>> Thanks.
>>>
>>> --
>>> Best Regards,
>>> Chen, Qiang
>>>
>>>
>>
>> --
>> Best Regards,
>> Chen, Qiang
>>
>>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>


Re: mesos crash

2016-07-21 Thread Joseph Wu
5050
> F0721 19:30:21.685487 11586 master.cpp:1662] Recovery failed: Failed to
> recover registrar: Failed to perform fetch within 2mins
> *** Check failure stack trace: ***
> @ 0x7fb9b735237c  google::LogMessage::Fail()
> @ 0x7fb9b73522d8  google::LogMessage::SendToLog()
> @ 0x7fb9b7351cce  google::LogMessage::Flush()
> @ 0x7fb9b7354a88  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fb9b62b064c  mesos::internal::master::fail()
> @ 0x7fb9b6384ffb
>  
> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi16__callIvJS1_EJLm0ELm1T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> @ 0x7fb9b635f8df
>  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1clIJS1_EvEET0_DpOT_
> @ 0x7fb9b632c783
>  
> _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1vEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
> @ 0x7fb9b63850cd
>  
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1vEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> @   0x4a4833  std::function<>::operator()()
> @   0x49f0eb
>  
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> @   0x4997c2  process::Future<>::fail()
> @ 0x7fb9b5f75a22  process::Promise<>::fail()
> @ 0x7fb9b63824f0  process::internal::thenf<>()
> @ 0x7fb9b63c6bd9
>  
> _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi16__callIvISM_EILm0ELm1ELm2T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7fb9b63bd8cd  std::_Bind<>::operator()<>()
> @ 0x7fb9b63a4821  std::_Function_handler<>::_M_invoke()
> @ 0x7fb9b63bdaff  std::function<>::operator()()
> @ 0x7fb9b63a4955
>  
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
> @ 0x7fb9b63c6c85
>  
> _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
> @ 0x7fb9b63bdaff  std::function<>::operator()()
> @ 0x7fb9b64267c4  process::internal::run<>()
> @ 0x7fb9b641cef4  process::Future<>::fail()
> @ 0x7fb9b64572de  std::_Mem_fn<>::operator()<>()
> @ 0x7fb9b64526c7
>  
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi16__callIbIS8_EILm0ELm1T_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> @ 0x7fb9b644ad23
>  
> _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1clIJS8_EbEET0_DpOT_
> @ 0x7fb9b6440c63
>  
> _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1bEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
> @ 0x7fb9b6452752
>  
> _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1bEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> @   0x4a4833  std::function<>::operator()()
> @   0x49f0eb
>  
> _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> @ 0x7fb9b641cecc  process::Future<>::fail()
> @ 0x7fb9b6415eac  process::Promise<>::fail()
>
> -- Original Message --
> *From:* "Joseph Wu";<jos...@mesosphere.io>;
> *Sent:* Wednesday, July 20, 2016, 2:15 AM
> *To:* "user"<user@mesos.apache.org>;
> *Subject:* Re: mesos crash
>
> When you start a new group of masters, the masters will not initialize
> their replicated log (from the EMPTY state) until all masters are present.
> This means (quorum * 2 - 1) masters must be up and reachable.
>
> We enforce this behavior because the replicated log can get into an
> inconsistent state otherwise.  Consider a simple case where you have an
> existing group of 3 masters:
> 1) Each master's replicated log is up-to-date with the leader.
> 2) Two masters are completely destroyed (their disks blow up, or
> something).
> 3) You bring two new masters up.
>
> If we allow a quorum of new masters to rejoin an existing cluster, the old
> master's data becomes the source of truth because it has the highest log
> position.  This is not necessarily correct.
> By disallowing a quorum of new masters to rejoin an existing cluster, it
> becomes the operator's job 

Re: mesos crash

2016-07-19 Thread Joseph Wu
When you start a new group of masters, the masters will not initialize
their replicated log (from the EMPTY state) until all masters are present.
This means (quorum * 2 - 1) masters must be up and reachable.
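As a sketch of the arithmetic above (the formula is the one just stated; nothing else is assumed):

```python
def masters_needed_to_initialize(quorum):
    # A new replicated log stays in the EMPTY state until
    # (quorum * 2 - 1) masters are up and reachable.
    return quorum * 2 - 1

# With MESOS_quorum=2, three masters must be present; a group of
# only two will never finish initializing, and recovery times out.
assert masters_needed_to_initialize(2) == 3
```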

We enforce this behavior because the replicated log can get into an
inconsistent state otherwise.  Consider a simple case where you have an
existing group of 3 masters:
1) Each master's replicated log is up-to-date with the leader.
2) Two masters are completely destroyed (their disks blow up, or something).
3) You bring two new masters up.

If we allow a quorum of new masters to rejoin an existing cluster, the old
master's data becomes the source of truth because it has the highest log
position.  This is not necessarily correct.
By disallowing a quorum of new masters to rejoin an existing cluster, it
becomes the operator's job to recover after catastrophic failures.

On Tue, Jul 19, 2016 at 2:57 AM, 梦开始的地方 <382607...@qq.com> wrote:

>
> Yes, I started 1 master and it worked fine. I will try 3 masters later, thanks.
>
> -- Original Message --
> *From:* "haosdent";;
> *Sent:* Tuesday, July 19, 2016, 5:39 PM
> *To:* "user";
> *Subject:* Re: mesos crash
>
> I think it may be because you started 2 masters and set MESOS_quorum to 2,
> so the election could not finish successfully. Could you start 3 masters? Or
> remove ZooKeeper and just start 1 master.
>
> On Tue, Jul 19, 2016 at 5:26 PM, 梦开始的地方 <382607...@qq.com> wrote:
>
>> No, I deployed them on two different servers
>>
>>
>> -- Original Message --
>> *From:* "haosdent";;
>> *Sent:* Tuesday, July 19, 2016, 5:05 PM
>> *To:* "user";
>> *Subject:* Re: mesos crash
>>
>> Hi,
>> >I start two master node :
>> Did you start them on the same server with the same work dir?
>>
>> On Tue, Jul 19, 2016 at 12:18 PM, 梦开始的地方 <382607...@qq.com> wrote:
>>
>>>
>>>
>>> Hello, I deployed Mesos on CentOS, kernel 3.14.73, Mesos version 1.0.0.
>>> This is my master config:
>>> export MESOS_log_dir=/apps/mesos/logs/
>>> export MESOS_ip=0.0.0.0
>>> export MESOS_hostname=`hostname`
>>> export MESOS_logging_level=INFO
>>> export MESOS_quorum=2
>>> export MESOS_work_dir=/apps/mesos/master
>>> export MESOS_zk=zk://zk1:2181,zk2:2181,zk3:2181/oss-mesos
>>> export MESOS_allocator=HierarchicalDRF
>>> export MESOS_cluster=oss-mesos
>>> export MESOS_credentials=/apps/mesos/etc/mesos/credentials.txt
>>> export MESOS_registry=replicated_log
>>> export MESOS_webui_dir=/apps/mesos/share/mesos/webui
>>> export MESOS_zk_session_timeout=90secs
>>> export MESOS_max_executors_per_slave=10
>>> export MESOS_registry_fetch_timeout=2mins
>>>
>>> I started two master nodes,
>>> but the master nodes crash after a few minutes.
>>> The log message is:
>>>
>>> I0719 11:50:22.673280  5376 replica.cpp:673] Replica in EMPTY status
>>> received a broadcasted recover request from (287)@10.10.186.76:5050
>>> I0719 11:50:23.154119  5381 replica.cpp:673] Replica in EMPTY status
>>> received a broadcasted recover request from (504)@10.10.179.252:5050
>>> I0719 11:50:23.154749  5376 recover.cpp:197] Received a recover
>>> response from a replica in EMPTY status
>>> I0719 11:50:23.156838  5378 recover.cpp:197] Received a recover
>>> response from a replica in EMPTY status
>>> I0719 11:50:23.563072  5382 replica.cpp:673] Replica in EMPTY status
>>> received a broadcasted recover request from (289)@10.10.186.76:5050
>>> I0719 11:50:23.883855  5376 replica.cpp:673] Replica in EMPTY status
>>> received a broadcasted recover request from (507)@10.10.179.252:5050
>>> I0719 11:50:23.884414  5380 recover.cpp:197] Received a recover
>>> response from a replica in EMPTY status
>>> I0719 11:50:23.886569  5375 recover.cpp:197] Received a recover
>>> response from a replica in EMPTY status
>>> I0719 11:50:24.163056  5379 replica.cpp:673] Replica in EMPTY status
>>> received a broadcasted recover request from (291)@10.10.186.76:5050
>>> I0719 11:50:24.425379  5378 replica.cpp:673] Replica in EMPTY status
>>> received a broadcasted recover request from (510)@10.10.179.252:5050
>>> I0719 11:50:24.425864  5379 recover.cpp:197] Received a recover
>>> response from a replica in EMPTY status
>>> I0719 11:50:24.428951  5375 recover.cpp:197] Received a recover
>>> response from a replica in EMPTY status
>>> I0719 11:50:24.935673  5379 replica.cpp:673] Replica in EMPTY status
>>> received a broadcasted recover request from (293)@10.10.186.76:5050
>>> F0719 11:50:25.262277  5381 master.cpp:1662] Recovery failed: Failed to
>>> recover registrar: Failed to perform fetch within 2mins
>>> *** Check failure stack trace: ***
>>> @ 0x7fe6fa0ac37c  google::LogMessage::Fail()
>>> @ 0x7fe6fa0ac2d8  google::LogMessage::SendToLog()
>>> @ 0x7fe6fa0abcce  google::LogMessage::Flush()
>>> @ 0x7fe6fa0aea88  google::LogMessageFatal::~LogMessageFatal()
>>> @ 0x7fe6f900a64c  mesos::internal::master::fail()
>>> @ 0x7fe6f90deffb
>>>  
>>> 

Re: Does an executing task have an expiry time?

2016-07-18 Thread Joseph Wu
The behavior and lifetime of a task are up to the executor (which is, in
turn, controlled by the framework, which is decided by the operator).

The default command executor does not have any timeouts for running tasks.

On Mon, Jul 18, 2016 at 2:59 AM, Bryan Fok  wrote:

> Hi all
>
> Does an executing task have an expiry time? I don't see it in the
> configuration options, but just in case.
>
>
> BR
> Bryan
>
>


Re: What will happen in maintenance mode

2016-07-18 Thread Joseph Wu
My guess is that your agents don't match the machines you specified.  Note:
The maintenance endpoints in Mesos allow you to specify maintenance against
non-existent machines, because the operator may add agents on those
machines in the future.

In Mesos' maintenance primitives, a "machine" is a hostname + IP.  (A
physical/virtual machine can hold multiple agents.)  The response in
/maintenance/status is in terms of machines, not agents.  If none of your
frameworks support inverse offers, then you won't get any useful
information from the /maintenance/status endpoint.

You can figure out an agent's hostname/IP by hitting the /master/slaves
endpoint:

{
  "slaves": [
{
  "pid":"slave(1)@127.0.0.1:5051",
  "hostname":"foo-bar",
  ...

^ The above translates to a machine = { "hostname": "foo-bar", "ip" : "
127.0.0.1" }
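To illustrate, here's a small sketch deriving the maintenance "machine" object from a /master/slaves entry (the response shape is the one shown above; only the `pid` and `hostname` fields are assumed):

```python
import json

# Hypothetical /master/slaves response, shaped like the excerpt above.
response = json.loads("""
{
  "slaves": [
    {"pid": "slave(1)@127.0.0.1:5051", "hostname": "foo-bar"}
  ]
}
""")

def machine_for(slave):
    # A maintenance "machine" is the agent's hostname plus the IP
    # embedded in its pid, which looks like "slave(N)@ip:port".
    ip = slave["pid"].split("@", 1)[1].split(":", 1)[0]
    return {"hostname": slave["hostname"], "ip": ip}

machines = [machine_for(s) for s in response["slaves"]]
assert machines == [{"hostname": "foo-bar", "ip": "127.0.0.1"}]
```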

On Mon, Jul 18, 2016 at 2:08 AM, Qiang Chen  wrote:

> Hi all,
>
> I'm puzzled about using maintenance mode.
>
> I see this from mesos [doc site](
> http://mesos.apache.org/documentation/latest/maintenance/):
>
> ```
> When maintenance is triggered by the operator, all agents on the machine
> are told to shutdown. These agents are removed from the master, which means
> that a TASK_LOST status update will be sent for every task running on
> each of those agents. The scheduler driver’s slaveLost callback will also
> be invoked for each of the removed agents. Any agents on machines in
> maintenance are also prevented from re-registering with the master in the
> future (until maintenance is completed and the machine is brought back up).
> ```
> But I didn't see the agent machines shut down or any tasks fail when I
> tested the maintenance HTTP endpoints.
>
> If Mesos agents are in that mode, will the running tasks be moved to other
> agents? Namely, will it evacuate all the tasks on those agents and then
> shut down?
>
> When I POST to "/maintenance/schedule" and "/machine/down" with a proper
> maintenance time window, I get a response showing that the specified agents
> are in the "draining_machines" and "down_machines" lists via GET
> "/maintenance/status", but Mesos didn't shut down or evacuate any tasks.
> Why? Does that make sense?
>
> Thanks.
>
> --
> Best Regards,
> Chen, Qiang
>
>


Re: Windows Build on Jenkins almost working

2016-07-15 Thread Joseph Wu
A few notes:

* Lowering the number of warnings is on our TODO list.  Currently, seeing
thousands of warnings is fairly common :(
* The windows build does not work if your files have Unix-style line
endings.  If you use Git on Windows, you should run: git config
core.autocrlf true
* The CMake warnings are saying that the build will download some tarballs
because those tarballs aren't committed to the Mesos repo:
https://github.com/3rdparty/mesos-3rdparty/blob/master/libevent-release-2.1.5-beta.tar.gz
https://github.com/3rdparty/mesos-3rdparty/blob/master/zookeeper-06d3f3f.tar.gz

I've posted a review to update the getting-started page accordingly:
https://reviews.apache.org/r/50080/

On Thu, Jul 14, 2016 at 6:36 PM, Rinaldo Digiorgio 
wrote:

> Hi,
>
>  The build fails with the following.
>
>  http_parser.lib(http_parser.obj) : warning LNK4217: locally defined 
> symbol memchr imported in function http_parser_execute [C:\Program Files 
> (x86)\Jenkins\workspace\mesos-agent-windows\build\3rdparty\libprocess\src\tests\process_tests.vcxproj]
>
>
>"C:\Program Files 
> (x86)\Jenkins\workspace\mesos-agent-windows\build\Mesos.sln" 
> (stout_tests;Build target) (1) ->
>"C:\Program Files 
> (x86)\Jenkins\workspace\mesos-agent-windows\build\src\mesos-1.0.0.vcxproj.metaproj"
>  (default target) (28) ->
>"C:\Program Files 
> (x86)\Jenkins\workspace\mesos-agent-windows\build\3rdparty\zookeeper-06d3f3f.vcxproj.metaproj"
>  (default target) (31) ->
>"C:\Program Files 
> (x86)\Jenkins\workspace\mesos-agent-windows\build\3rdparty\zookeeper-06d3f3f.vcxproj"
>  (default target) (40) ->
>(CustomBuild target) ->
>  C:\Program Files 
> (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): 
> error MSB6006: "cmd.exe" exited with code 9009. [C:\Program Files 
> (x86)\Jenkins\workspace\mesos-agent-windows\build\3rdparty\zookeeper-06d3f3f.vcxproj]
>
> 1474 Warning(s)
> 1 Error(s)
>
> Time Elapsed 00:24:27.44
>
>
>
>   I see this.
>
> CMake Warning at CMakeLists.txt:52 (message):
>
>   Both `ENABLE_LIBEVENT` and `REBUNDLED` (set to TRUE by default) flags have
>   been set.  But, libevent does not come rebundled in Mesos, so it must be
>   downloaded.
>
>
> CMake Warning at CMakeLists.txt:61 (message):
>   The current supported version of ZK does not compile on Windows, and does
>   not come rebundled in the Mesos repository.  It must be downloaded from the
>   Internet, even though the `REBUNDLED` flag was set.
>
>
>
> If I need to install libevent, what version and where is a good place for
> it?
>
> Rinaldo
>
>


Re: Mesos fine-grained multi-user mode failed to allocate tasks

2016-07-13 Thread Joseph Wu
Looks like you're running Spark in "fine-grained" mode (deprecated).

(The Spark website appears to be down right now, so here's the doc on
Github:)
https://github.com/apache/spark/blob/master/docs/running-on-mesos.md#fine-grained-deprecated

Note that while Spark tasks in fine-grained will relinquish cores as they
> terminate, they will not relinquish memory, as the JVM does not give memory
> back to the Operating System. Neither will executors terminate when they're
> idle.


You can follow some of the recommendations Spark has in that document for
sharing resources, when using Mesos.
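For instance, here is a minimal spark-defaults.conf sketch of what that document suggests for sharing a cluster: switch from the deprecated fine-grained mode to coarse-grained and cap each framework's footprint. The property names are standard Spark-on-Mesos settings; the values are illustrative assumptions, not recommendations:

```
# Coarse-grained mode reserves resources up front and releases them on exit.
spark.mesos.coarse      true
# Cap this framework so one spark-shell cannot hold the whole cluster.
spark.cores.max         24
spark.executor.memory   16g
```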

On Wed, Jul 13, 2016 at 12:12 PM, Rahul Palamuttam 
wrote:

> Hi,
>
> Our team has been tackling multi-tenancy related issues with Mesos for
> quite some time.
>
> The problem is that tasks aren't being allocated properly when multiple
> applications are trying to launch a job. If we launch application A, and
> soon after application B, application B waits pretty much till the
> completion of application A for tasks to even be staged in Mesos. Right now
> these applications are the spark-shell or the zeppelin interpreter.
>
> Even a simple sc.parallelize(1 to 1000).reduce(_ + _) launched in two
> different spark-shells results in the issue we're observing. One of the
> counts waits (in fact we don't even see the tasks being staged in mesos)
> until the current one finishes. This is the biggest issue we have been
> experience and any help or advice would be greatly appreciated. We want to
> be able to launch multiple jobs concurrently on our cluster and share
> resources appropriately.
>
> Another issue we see is that the java heap-space on the mesos executor
> backend process is not being cleaned up once a job has finished in the
> spark shell.
> I've attached a png file of the jvisualvm output showing that the
> heapspace is still allocated on a worker node. If I force the GC from
> jvisualvm then nearly all of that memory gets cleaned up. This may be
> because the spark-shell is still active - but if we've waited long enough
> why doesn't GC just clean up the space? However, even after forcing GC the
> mesos UI shows us that these resources are still being used.
> There should be a way to bring down the memory utilization of the
> executors once a task is finished. It shouldn't continue to have that
> memory allocated, even if a spark-shell is active on the driver.
>
> We have mesos configured to use fine-grained mode.
> The following are parameters we have set in our spark-defaults.conf file.
>
>
> spark.eventLog.enabled   true
> spark.eventLog.dir   hdfs://frontend-system:8090/directory
> 
> spark.local.dir/data/cluster-local/SPARK_TMP
>
> spark.executor.memory50g
>
> spark.externalBlockStore.baseDir /data/cluster-local/SPARK_TMP
> spark.executor.extraJavaOptions  -XX:MaxTenuringThreshold=0
> spark.executor.uri  hdfs://frontend-system
> :8090/spark/spark-1.6.0-bin-hadoop2.4.tgz
> 
> spark.mesos.coarse  false
>
> Please let me know if there are any questions about our configuration.
> Any advice or experience the mesos community can share pertaining to
> issues with fine-grained mode would be greatly appreciated!
>
> I would also like to sincerely apologize for my previous test message on
> the mailing list.
> It was an ill-conceived idea since we are in a bit of a time crunch and I
> needed to get this message posted. I forgot I needed to send reply on to
> the user-subscribers email for me to be listed, resulting in message not
> sent emails. I will not do that again.
>
> Thanks,
>
> Rahul Palamuttam
>


Re: mesos/dcos user issue?

2016-07-13 Thread Joseph Wu
Looks like you solved your problem:

> either remove the "USER" statement or add the user locally on the mesos
agent machines

You can't run as a user that doesn't exist :)

On Wed, Jul 13, 2016 at 7:18 AM, Clarke, Trevor  wrote:

> I've got an image with a local user and a 'USER myuser' statement in the
> Dockerfile. When I try and run a container in mesos (we're using DC/OS but
> I think it's mesos related as we're not calling via marathon, etc. it's
> from a custom framework) I get "Failed to get user information for
> 'myuser'" unless I either remove the "USER" statement or add the user
> locally on the mesos agent machines. I still see a similar issue if I use
> "USER 4567" with a UID instead of username. Any idea what might be causing
> this?
>
> --
> Trevor R.H. Clarke
> Software Engineer, Ball Aerospace
> (937)320-7087
>
>
>
> This message and any enclosures are intended only for the addressee.
> Please
> notify the sender by email if you are not the intended recipient.  If you
> are
> not the intended recipient, you may not use, copy, disclose, or distribute
> this
> message or its contents or enclosures to any other person and any such
> actions
> may be unlawful.  Ball reserves the right to monitor and review all
> messages
> and enclosures sent to or from this email address.
>


Re: Windows Build

2016-07-11 Thread Joseph Wu
There are some instructions here:
https://github.com/apache/mesos/blob/master/docs/getting-started.md#building-mesos-windows

When the website's update is pushed, the instructions will show up here
too: http://mesos.apache.org/gettingstarted/

On Sat, Jul 9, 2016 at 3:43 PM, Artem Harutyunyan 
wrote:

> Hi Rinaldo,
>
> It is possible to build and run the Agent on Windows from the 1.0.0-rc2
> branch. The CMake files, as well as the ported Agent code were committed a
> couple of weeks ago. Joseph is currently in process of setting up a Windows
> build on our Jenkins. Folks in #windows channel in the community slack
> should be able to help you in case you encounter problems (register here
> https://mesos-slackin.herokuapp.com/).
>
> Please keep in mind that the Windows support is still in an early alpha
> phase.
>
> Artem.
>
> On Sat, Jul 9, 2016 at 6:19 AM, Rinaldo Digiorgio 
> wrote:
>
>> Hi,
>> Would someone be able to suggest how to get started with building
>> mesos on windows. I am under the assumption that the windows branch is not
>> integrated into the current 1.* RC.
>>
>> Rinaldo
>
>
>


Re: Setting up SSL for mesos

2016-07-07 Thread Joseph Wu
Probably not relevant.  (I ran ldd on CentOS 7.)

Which Ubuntu are you running?  And what shell?
Also, try running `make check` up until you see the libprocess tests.
There are a couple of SSL tests there.  (i.e. SSLTest.SSLSocket)
If, for some inexplicable reason, your build is linking but not using SSL,
those tests won't show up.

On Thu, Jul 7, 2016 at 12:40 PM, Douglas Nelson <itsbeh...@gmail.com> wrote:

> ldd src/.libs/mesos-master | grep ssl returns:
>
> libevent_openssl-2.0.so.5 =>
> /usr/lib/x86_64-linux-gnu/libevent_openssl-2.0.so.5
> libssl.so.1.0.0 => /lib/x86_64-linux-gnu/libssl.so.1.0.0
>
> So I am missing the libssl3.so line. Is that another package I need to
> install as a prerequisite? In case it's relevant, I'm running Ubuntu.
>
> On Thu, Jul 7, 2016 at 1:14 PM, Joseph Wu <jos...@mesosphere.io> wrote:
>
>> Can you double-check if your master is linking to openssl?
>>
>> From your build folder, you should get something like:
>> ldd src/.libs/mesos-master | grep ssl
>> libevent_openssl-2.0.so.5 => /lib64/libevent_openssl-2.0.so.5
>> libssl.so.10 => /lib64/libssl.so.10
>> libssl3.so => /lib64/libssl3.so
>>
>> There doesn't seem to be anything wrong with your configure/build steps.
>> And your environment variables setup should work on any sane Unix shell.
>> (Perhaps inline the environment variable?  SSL_ENABLED=true
>> ./mesos-master.sh ...)
>>
>> On Thu, Jul 7, 2016 at 11:53 AM, Douglas Nelson <itsbeh...@gmail.com>
>> wrote:
>>
>>> I rebuilt from scratch with SSL support and got no errors. I only set 
>>> *export
>>> SSL_ENABLED=true* and then I ran the mesos-master.
>>>
>>> No errors were thrown and I can see the web UI via HTTP. I double
>>> checked that I was running the .sh from the build folder I created. Is
>>> mesos not connecting with the environment variable I set for some reason?
>>>
>>>
>>>
>>> On Wed, Jul 6, 2016 at 2:20 PM, Joseph Wu <jos...@mesosphere.io> wrote:
>>>
>>>> If you can see the WebUI via HTTP, without downgrade support, you might
>>>> be inadvertently running a different version of Mesos than the one you
>>>> built.
>>>>
>>>> You can quickly sanity check this by removing either SSL_KEY_FILE or
>>>> SSL_CERT_FILE and starting your master.  If your build has SSL support, it
>>>> should immediately exit with an error message.
>>>>
>>>>
>>>> On Wed, Jul 6, 2016 at 12:33 PM, Douglas Nelson <itsbeh...@gmail.com>
>>>> wrote:
>>>> >
>>>> > I attempted to set up SSL following this guide:
>>>> http://mesos.apache.org/documentation/latest/ssl/
>>>> >
>>>> > I'm able to hit the WebUI with http but using https gives me nothing.
>>>> I must be missing something. Here are the steps I'm taking:
>>>> >
>>>> > I downloaded 0.28.2 from here:
>>>> https://github.com/apache/mesos/releases
>>>> > I ran ./configure --enable-libevent --enable-ssl
>>>> > Then I ran make and make install
>>>> > I set the following environment variables:
>>>> >
>>>> > export SSL_ENABLED=1
>>>> > export SSL_SUPPORT_DOWNGRADE=0
>>>> > export SSL_KEY_FILE=
>>>> > export SSL_CERT_FILE=
>>>> >
>>>> > Finally, I ran ./bin/mesos-master.sh --ip=127.0.0.1
>>>> --work_dir=/var/lib/mesos
>>>> >
>>>> > I can provide any additional information if needed. Thanks!
>>>> >
>>>> > Also, I read that SSL would be included in mesosphere's nightly
>>>> builds: https://open.mesosphere.com/downloads/mesos-nightly/
>>>> >
>>>> > How stable are those builds and has SSL already been included?
>>>> >
>>>> > -Doug Nelson
>>>>
>>>>
>>>
>>
>


Re: Setting up SSL for mesos

2016-07-07 Thread Joseph Wu
Can you double-check if your master is linking to openssl?

From your build folder, you should get something like:
ldd src/.libs/mesos-master | grep ssl
libevent_openssl-2.0.so.5 => /lib64/libevent_openssl-2.0.so.5
libssl.so.10 => /lib64/libssl.so.10
libssl3.so => /lib64/libssl3.so

There doesn't seem to be anything wrong with your configure/build steps.
And your environment variables setup should work on any sane Unix shell.
(Perhaps inline the environment variable?  SSL_ENABLED=true
./mesos-master.sh ...)

On Thu, Jul 7, 2016 at 11:53 AM, Douglas Nelson <itsbeh...@gmail.com> wrote:

> I rebuilt from scratch with SSL support and got no errors. I only set *export
> SSL_ENABLED=true* and then I ran the mesos-master.
>
> No errors were thrown and I can see the web UI via HTTP. I double checked
> that I was running the .sh from the build folder I created. Is mesos not
> connecting with the environment variable I set for some reason?
>
>
>
> On Wed, Jul 6, 2016 at 2:20 PM, Joseph Wu <jos...@mesosphere.io> wrote:
>
>> If you can see the WebUI via HTTP, without downgrade support, you might
>> be inadvertently running a different version of Mesos than the one you
>> built.
>>
>> You can quickly sanity check this by removing either SSL_KEY_FILE or
>> SSL_CERT_FILE and starting your master.  If your build has SSL support, it
>> should immediately exit with an error message.
>>
>>
>> On Wed, Jul 6, 2016 at 12:33 PM, Douglas Nelson <itsbeh...@gmail.com>
>> wrote:
>> >
>> > I attempted to set up SSL following this guide:
>> http://mesos.apache.org/documentation/latest/ssl/
>> >
>> > I'm able to hit the WebUI with http but using https gives me nothing. I
>> must be missing something. Here are the steps I'm taking:
>> >
>> > I downloaded 0.28.2 from here: https://github.com/apache/mesos/releases
>> > I ran ./configure --enable-libevent --enable-ssl
>> > Then I ran make and make install
>> > I set the following environment variables:
>> >
>> > export SSL_ENABLED=1
>> > export SSL_SUPPORT_DOWNGRADE=0
>> > export SSL_KEY_FILE=
>> > export SSL_CERT_FILE=
>> >
>> > Finally, I ran ./bin/mesos-master.sh --ip=127.0.0.1
>> --work_dir=/var/lib/mesos
>> >
>> > I can provide any additional information if needed. Thanks!
>> >
>> > Also, I read that SSL would be included in mesosphere's nightly builds:
>> https://open.mesosphere.com/downloads/mesos-nightly/
>> >
>> > How stable are those builds and has SSL already been included?
>> >
>> > -Doug Nelson
>>
>>
>


Re: Master slow to process status updates after massive killing of tasks?

2016-06-20 Thread Joseph Wu
Looks like the master's event queue is filling up, although it's difficult
to tell exactly what is causing this.  From the numbers in the gist, it's
evident that the master has seconds to minutes of backlog.

In general, there is very little processing cost associated per "accept".
The master does, however, break an "accept" into two chunk which are placed
into the master's event queue (FIFO).  The first chunk logs "Processing
ACCEPT call for offers..." and queues the second chunk.  The second chunk
logs "Launching task..." (assuming this is what the offer was accepted
for).  The greater the time gap between the two logs, the more backlogged
the master is.
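A rough way to quantify that backlog from the logs (a sketch: the glog timestamp parsing is an assumption, and the sample lines are made up):

```python
from datetime import datetime

def glog_time(line):
    # glog lines start like "I0619 15:49:03.123456 ...": a level letter,
    # MMDD, then HH:MM:SS.microseconds.
    parts = line.split()
    return datetime.strptime(parts[0][1:] + " " + parts[1], "%m%d %H:%M:%S.%f")

accept = "I0619 15:49:03.000000 1234 master.cpp] Processing ACCEPT call for offers"
launch = "I0619 15:49:45.500000 1234 master.cpp] Launching task foo"

# The gap between the two chunks approximates the master's queue backlog.
backlog_secs = (glog_time(launch) - glog_time(accept)).total_seconds()
assert backlog_secs == 42.5
```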

I don't think there's enough info to pinpoint the bottleneck.  If you ran
this test again, here are my recommendations:

   - Set up a monitor (i.e. script that polls) for
   /master/metrics/snapshot.  Look through this doc (
   http://mesos.apache.org/documentation/latest/monitoring/) to see what
   each value means.  The most interesting metrics would match the patterns
   "master/event_queue_*" and "master/messages_*".
   - Try to hit /__processes__ during your test, particularly when the
   master is backlogged.  This should show the state of the various event
   queues inside Mesos.  (Keep in mind that polling this endpoint *may*
   slow down Mesos.)
   - Check if Singularity is DOS-ing the master :)

>Singularity calls reconcileTasks() every 10 minutes. How often would
>you expect to see that log line? At the worst point, we saw it printed 637
>times in one minute in the master logs.
>
   ^ This is a framework-initiated action.  Unfortunately, there are a lot
   of framework calls in the old scheduler driver that *could* be batched
   but are not due to backwards compatibility.  If Singularity tries to
   reconcile 500 tasks in a single reconcileTasks() call, using the old
   scheduler driver, it will make 500 calls to Mesos :(
   We suspect the HTTP API will have much better scaling in situations like
   this.  And it will be worthwhile to start migrating over to the new API.
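The first recommendation can be sketched as follows (the metric-name patterns are the ones above; the sample snapshot values are made up, and a real poller would fetch /master/metrics/snapshot over HTTP in a loop):

```python
import fnmatch

def backlog_metrics(snapshot):
    # Keep only the counters that indicate a backlogged master.
    patterns = ("master/event_queue_*", "master/messages_*")
    return {key: value for key, value in snapshot.items()
            if any(fnmatch.fnmatch(key, p) for p in patterns)}

# Made-up snapshot, shaped like /master/metrics/snapshot output.
sample = {
    "master/event_queue_messages": 120.0,
    "master/messages_status_update": 5000.0,
    "master/uptime_secs": 3600.0,
}
assert backlog_metrics(sample) == {
    "master/event_queue_messages": 120.0,
    "master/messages_status_update": 5000.0,
}
```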


On Sun, Jun 19, 2016 at 6:57 PM, Thomas Petr <tp...@hubspot.com> wrote:

> Thanks for the quick response, Joseph! Here are some answers:
>
> The test:
> - The agents were gracefully terminated (kill -term) and were offline for
> about 10 minutes. We had plans to test other scenarios (i.e. network
> partition, kill -9, etc.) but didn't get to them yet.
> - The 1000 accidentally killed tasks were not including the tasks from the
> killed-off agents, but included replacement tasks that were started in
> response to the agent killings. I'd estimate about 400 tasks were lost from
> the killed-off agents.
> - We stopped the 5 agents at about 3:43pm, killed off the ~1000 tasks at
> 3:49pm, and then failed over the master at 4:25pm. Singularity caught wind
> of the failover at 4:27pm, reconnected, and then everything started to
> clear up after that.
> - Singularity currently does not log the Offer ID, so it's not easy for me
> to get the exact timing between Singularity accepting an offer and that
> master line you mentioned. However, I am able to get the time between
> accepting an offer and the "Launching task XXX" master line
> <https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L3589> --
> you can check out this info here:
> https://gist.github.com/tpetr/fe0fecbcfa0a2c8e5889b9e70c0296e7. I have a
> PR <https://github.com/HubSpot/Singularity/pull/1099> to log the Offer ID
> in Singularity, so I'll be able to give you the exact timing the next time
> we run the test.
>
> The setup:
> - We unfortunately weren't monitoring those metrics, but will keep a close
> eye on that when we run this test again.
> - CPU usage was nominal -- CloudWatch reports less than 5% CPU utilization
> throughout the day, jumping to 10% temporarily when we failed over the
> Mesos master.
> - We run Singularity and the Mesos master together on 3 c3.2xlarges in AWS
> so there shouldn't be any bottleneck there.
> - One interesting thing I just noticed in the master logs is that the last
> "Processing ACCEPT call for offers" occurred at 3:50pm, though that could
> just be because after that time, things were so lagged that all of our
> offers timed out.
>
> Singularity:
> - Singularity still uses the default implicit acknowledgement feature of
> the scheduler driver. I filed a TODO for looking into explicit acks, but we
> do very little in the thread calling statusUpdate(). The only thing that
> could really slow things down is communication with ZooKeeper, which is a
> possibility.
> - Singularity calls reconcileTasks() every 10 minutes. How often would you
> expect to see that log line? At the worst point, we saw it printed

Re: Mesos 0.28.2 does not start

2016-06-10 Thread Joseph Wu
valocal mesos-master[1300]: WARNING: Logging
> before InitGoogleLogging() is written to STDERR
>
> giu 09 23:26:15 master.novalocal mesos-master[1300]: F0609
> 23:26:15.898391  1285 process.cpp:892] Failed to initialize: Failed to bind
> on 10.250.0.12
>
> giu 09 23:26:15 master.novalocal mesos-master[1300]: *** Check failure
> stack trace: ***
>
> giu 09 23:26:15 master.novalocal systemd[1]: mesos-master.service: main
> process exited, code=killed, status=6/ABRT
>
> giu 09 23:26:15 master.novalocal systemd[1]: Unit mesos-master.service
> entered failed state.
>
> giu 09 23:26:15 master.novalocal systemd[1]: mesos-master.service
> failed.
>
> giu 09 23:26:35 master.novalocal systemd[1]: mesos-master.service holdoff
> time over, scheduling restart.
>
> giu 09 23:26:35 master.novalocal systemd[1]: Started Mesos Master.
>
> giu 09 23:26:35 master.novalocal systemd[1]: Starting Mesos Master...
>
> 2016-06-10 18:51 GMT+02:00 Joseph Wu <jos...@mesosphere.io>:
>
>> The log directory is based on your configuration.  See the master config
>> section here: http://mesos.apache.org/documentation/latest/configuration/
>>
>> If you've set the --log_dir flag, you'll find your logs there.
>> Otherwise, the logs will be in stderr.
>> If you launched the master via a systemd service, use: journalctl -u 
>> mesos-master
>>
>>
>> On Fri, Jun 10, 2016 at 9:45 AM, Stefano Bianchi <jazzist...@gmail.com>
>> wrote:
>>
>>> Actually, I don't have access to the Mesos UI, so I need to find the
>>> log within the CentOS VM.
>>> Can you please tell me where I can find the master log file?
>>>
>>> 2016-06-10 17:50 GMT+02:00 Jie Yu <yujie@gmail.com>:
>>>
>>>> Can u create a jira ticket and paste the master log? Thanks for
>>>> reporting!
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Jun 10, 2016, at 8:44 AM, Stefano Bianchi <jazzist...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi all
>>>> I'm reinstalling my platform on another OpenStack tenant.
>>>> I downloaded all the needed software: zookeeper-server, Mesos 0.28.2,
>>>> Marathon 1.1.1, and Chronos 2.4.0.
>>>> I configured everything correctly, then started zookeeper-server, and it
>>>> works fine.
>>>> When I type: service mesos-master start
>>>> it seems to start, but if I check the status with: service mesos-master
>>>> status
>>>> I obtain the following:
>>>>
>>>> [root@master ~]# service mesos-master status
>>>>
>>>> Redirecting to /bin/systemctl status  mesos-master.service
>>>>
>>>> ● mesos-master.service - Mesos Master
>>>>
>>>>Loaded: loaded (/usr/lib/systemd/system/mesos-master.service;
>>>> enabled; vendor preset: disabled)
>>>>
>>>>Active: activating (auto-restart) (Result: signal) since ven
>>>> 2016-06-10 15:39:36 UTC; 3s ago
>>>>
>>>>   Process: 12163 ExecStart=/usr/bin/mesos-init-wrapper master 
>>>> *(code=killed,
>>>> signal=ABRT)*
>>>>
>>>>  Main PID: 12163 (code=killed, signal=ABRT)
>>>>
>>>>
>>>> Does anyone know why I have this issue?
>>>>
>>>> Thanks in advance.
>>>>
>>>>
>>>
>>
>


Re: Mesos 0.28.2 does not start

2016-06-10 Thread Joseph Wu
The log directory is based on your configuration.  See the master config
section here: http://mesos.apache.org/documentation/latest/configuration/

If you've set the --log_dir flag, you'll find your logs there.  Otherwise,
the logs will be in stderr.
If you launched the master via a systemd service, use: journalctl -u
mesos-master


On Fri, Jun 10, 2016 at 9:45 AM, Stefano Bianchi 
wrote:

> Actually, I don't have access to the Mesos UI, so I need to find the log
> within the CentOS VM.
> Can you please tell me where I can find the master log file?
>
> 2016-06-10 17:50 GMT+02:00 Jie Yu :
>
>> Can u create a jira ticket and paste the master log? Thanks for reporting!
>>
>> Sent from my iPhone
>>
>> On Jun 10, 2016, at 8:44 AM, Stefano Bianchi 
>> wrote:
>>
>> Hi all
>> I'm reinstalling my platform on another OpenStack tenant.
>> I downloaded all the needed software: zookeeper-server, Mesos 0.28.2,
>> Marathon 1.1.1, and Chronos 2.4.0.
>> I configured everything correctly, then started zookeeper-server, and it
>> works fine.
>> When I type: service mesos-master start
>> it seems to start, but if I check the status with: service mesos-master
>> status
>> I obtain the following:
>>
>> [root@master ~]# service mesos-master status
>>
>> Redirecting to /bin/systemctl status  mesos-master.service
>>
>> ● mesos-master.service - Mesos Master
>>
>>Loaded: loaded (/usr/lib/systemd/system/mesos-master.service; enabled;
>> vendor preset: disabled)
>>
>>Active: activating (auto-restart) (Result: signal) since ven
>> 2016-06-10 15:39:36 UTC; 3s ago
>>
>>   Process: 12163 ExecStart=/usr/bin/mesos-init-wrapper master *(code=killed,
>> signal=ABRT)*
>>
>>  Main PID: 12163 (code=killed, signal=ABRT)
>>
>>
>> Does anyone know why I have this issue?
>>
>> Thanks in advance.
>>
>>
>


Re: Mesos 0.28.2 does not start

2016-06-10 Thread Joseph Wu
I'm guessing you mis-configured a master flag/environment-variable
somewhere.  Since it looks like you're using systemd, can you run this
command, "journalctl -xe", right after you try to start the service?  That
should show you more info on why your master aborted.



On Fri, Jun 10, 2016 at 8:50 AM, Jie Yu  wrote:

> Can u create a jira ticket and paste the master log? Thanks for reporting!
>
> Sent from my iPhone
>
> On Jun 10, 2016, at 8:44 AM, Stefano Bianchi  wrote:
>
> Hi all
> I'm reinstalling my platform on another OpenStack tenant.
> I downloaded all the needed software: zookeeper-server, Mesos 0.28.2,
> Marathon 1.1.1, and Chronos 2.4.0.
> I configured everything correctly, then started zookeeper-server, and it
> works fine.
> When I type: service mesos-master start
> it seems to start, but if I check the status with: service mesos-master
> status
> I obtain the following:
>
> [root@master ~]# service mesos-master status
>
> Redirecting to /bin/systemctl status  mesos-master.service
>
> ● mesos-master.service - Mesos Master
>
>Loaded: loaded (/usr/lib/systemd/system/mesos-master.service; enabled;
> vendor preset: disabled)
>
>Active: activating (auto-restart) (Result: signal) since ven 2016-06-10
> 15:39:36 UTC; 3s ago
>
>   Process: 12163 ExecStart=/usr/bin/mesos-init-wrapper master *(code=killed,
> signal=ABRT)*
>
>  Main PID: 12163 (code=killed, signal=ABRT)
>
>
> Does anyone know why I have this issue?
>
> Thanks in advance.
>
>


Re: Benign 'Shutdown failed on fd' error messages

2016-05-27 Thread Joseph Wu
This log line is part of some socket cleanup Mesos performs for all
sockets.  Mesos calls the "shutdown" syscall on the socket:
http://man7.org/linux/man-pages/man2/shutdown.2.html

This part of the log line:
> Transport endpoint is not connected
comes from the *ENOTCONN* error code.  We generally hit this error code in
one of two cases:
1) We created the socket, but never used it.  We call shutdown(s) to be
safe.
2) The socket was closed on the other side.  Again, we call shutdown(s) to
be safe.

I'd argue that this should not be logged at the ERROR level.  But as of the
current code, these log lines can't be silenced without losing all of your
logging verbosity :(

BTW, the log line has been moved since 0.26, but it will still show up in a
different form:
https://github.com/apache/mesos/commit/b06e932a036044c54cd72ddde1d26c5f9271ea51#diff-b13970db30a54291dc4a85c16491abfe
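The ENOTCONN behavior described above is easy to reproduce outside of Mesos. A minimal sketch in Python (the equivalent of case 1: calling shutdown on a socket that was created but never connected):

```python
import errno
import socket

# Case 1 above: a TCP socket that was created but never used/connected.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
caught = None
try:
    # shutdown(2) on an unconnected socket fails with ENOTCONN.
    s.shutdown(socket.SHUT_RDWR)
except OSError as e:
    caught = e.errno
finally:
    s.close()

# ENOTCONN's strerror is "Transport endpoint is not connected".
print(caught == errno.ENOTCONN)  # True
```

The same errno is returned when the peer has already closed its side, which is why the log line is harmless noise rather than a real error.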

On Fri, May 27, 2016 at 1:00 AM, haosdent  wrote:

> Do you use zookeeper? Looks similar to this one
> http://search-hadoop.com/m/0Vlr6fthtf1D5ssd1
>
> On Fri, May 27, 2016 at 3:44 PM, Christopher Ketchum 
> wrote:
>
>> Hi all,
>>
>> I'm running Mesos 0.25.0 and have been seeing these strange 'Shutdown
>> failed on fd' errors. I saw a couple other postings about similar error
>> messages but this case seems to be unique since Mesos seems to be working
>> fine, apart from the printed messages. Does anyone have any suggestions
>> about determining what these messages mean, or, alternatively,  how to
>> silence these errors if they aren't significant?
>>
>> Thanks!
>> Chris
>>
>> I0121 21:29:31.058202 11329 sched.cpp:164] Version: 0.25.0
>> I0121 21:29:31.106197 11355 sched.cpp:262] New master detected at 
>> master@xxx:xx:xx:xxx:5050
>> I0121 21:29:31.106302 11355 sched.cpp:272] No credentials provided. 
>> Attempting to register without authentication
>> E0121 21:29:31.106353 11368 socket.hpp:174] Shutdown failed on fd=11: 
>> Transport endpoint is not connected [107]
>> E0121 21:29:31.106487 11368 socket.hpp:174] Shutdown failed on fd=11: 
>> Transport endpoint is not connected [107]
>> E0121 21:29:31.113162 11368 socket.hpp:174] Shutdown failed on fd=11: 
>> Transport endpoint is not connected [107]
>> E0121 21:29:31.263561 11368 socket.hpp:174] Shutdown failed on fd=11: 
>> Transport endpoint is not connected [107]
>> E0121 21:29:31.286962 11368 socket.hpp:174] Shutdown failed on fd=11: 
>> Transport endpoint is not connected [107]
>> E0121 21:29:31.887789 11368 socket.hpp:174] Shutdown failed on fd=11: 
>> Transport endpoint is not connected [107]
>> E0121 21:29:31.978222 11368 socket.hpp:174] Shutdown failed on fd=11: 
>> Transport endpoint is not connected [107]
>> E0121 21:29:34.231999 11368 socket.hpp:174] Shutdown failed on fd=13: 
>> Transport endpoint is not connected [107]
>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>


Re: 1.0 Release Candidate

2016-05-25 Thread Joseph Wu
I'm guessing you mean the "medium term" bullet point on the Roadmap (
https://cwiki.apache.org/confluence/display/MESOS/Roadmap):

>
>- Deprecate Docker containerizer (in favor of Unified containerizer w/
>Docker support)
>
> This was never meant to be done as part of the 1.0 release.  I'm sure the
folks working on the unified containerizer can tell you their exact plans.


On Wed, May 25, 2016 at 12:10 PM, Jeff Schroeder  wrote:

> Does this mean the work to deprecate the docker containerizer will be
> post-1.0, or have those plans changed?
>
>
> On Wednesday, May 25, 2016, Vinod Kone  wrote:
>
>> Hi folks,
>>
>> As discussed in the previous community sync, we plan to cut a release
>> candidate for our next release (1.0) early next week.
>>
>> 1.0 is mainly centered around new APIs for Mesos. Please take a look at
>> MESOS-338  for blocking
>> issues. We got some great design and testing feedback for the v1 scheduler
>> and executor APIs. Please do the same for the in-progress v1 operator API
>> 
>> .
>>
>> Since this is a 1.0, we would like to do the release a little
>> differently.
>>
>> First, the voting period for vetting the release candidate would be a few
>> weeks (2-3 weeks) instead of the typical 3 days.
>>
>> Second, we are willing to make major changes (scalability fixes, API
>> fixes) if there are any issues reported by the community.
>>
>> We are doing these because we really want the community to thoroughly
>> test the 1.0 release and give feedback.
>>
>> Thanks,
>>
>
>
> --
> Text by Jeff, typos by iPhone
>


Re: Completed tasks logs missing from mesos UI

2016-05-25 Thread Joseph Wu
Can you check that the ExecutorID and AgentID (actually SlaveID) match the
path you found the logs on your box?

See the diagram here for what each part of the path corresponds to each ID:
http://mesos.apache.org/documentation/latest/sandbox/#where-is-it
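For reference, the sandbox path can be reconstructed from those IDs. A rough sketch, assuming the default layout described in the documentation linked above (the work_dir and IDs below are placeholders, not values from your cluster):

```python
import os

def sandbox_path(work_dir, agent_id, framework_id, executor_id, container_id):
    """Build the sandbox path for a single run of an executor,
    following the work_dir/slaves/.../frameworks/.../executors/.../runs/...
    layout from the Mesos sandbox documentation."""
    return os.path.join(
        work_dir, "slaves", agent_id,
        "frameworks", framework_id,
        "executors", executor_id,
        "runs", container_id)

# Example with placeholder IDs:
print(sandbox_path("/tmp/mesos", "AGENT-S1", "FW-0000", "exec-1", "run-1"))
# /tmp/mesos/slaves/AGENT-S1/frameworks/FW-0000/executors/exec-1/runs/run-1
```

If the IDs in the UI's request don't match the directories on disk, that mismatch is usually the cause of the "does not exist" error.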

On Wed, May 25, 2016 at 3:04 AM, shakeel 
wrote:

> Hi,
>
> I am getting an error message similar to the one below when checking the
> sandbox for a completed task.
>
> "Executor with ID  does not exist on slave with ID"
>
> However when I go on the slave, the logs are present.
>
> Has anyone come across this error before and how did you resolve it?
>
>
> Kind Regards
> Shakeel Suffee
>
> --
> The information contained in this message is for the intended addressee
> only and may contain confidential and/or privileged information. If you are
> not the intended addressee, please delete this message and notify the
> sender; do not copy or distribute this message or disclose its contents to
> anyone. Any views or opinions expressed in this message are those of the
> author and do not necessarily represent those of Motortrak Limited or of
> any of its associated companies. No reliance may be placed on this message
> without written confirmation from an authorised representative of the
> company.
>
> Registered in England 3098391 V.A.T. Registered No. 667463890
>


Re: Cannot pull from private docker v1 registry

2016-05-18 Thread Joseph Wu
The stderr you posted suggests that Mesos successfully fetched your
.dockercfg.  If the following docker pull fails, there should be additional
logs printed either in the Mesos agent logs, or in the task stderr.

Can you check those as well?  (And post them here.)

On Wed, May 18, 2016 at 2:29 PM, Scott Kinney  wrote:

> I have a valid .dockercfg credential file on the slave that I pass as a
> uri in the marathon app definition like...
>
>   "uris": [
>   "file:///root/.dockercfg"
>
>   ],
>
> it fails.
> Mesos sandbox stderr...
>
> I0517 21:45:04.104918  5512 fetcher.cpp:424] Fetcher Info:
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/cf607b5a-b629-46f1-a053-0659b78c4231-S454","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"executable":false,"extract":false,"value":"file:\/\/\/root\/.dockercfg"}}],"sandbox_directory":"\/tmp\/mesos\/slaves\/cf607b5a-b629-46f1-a053-0659b78c4231-S454\/frameworks\/cf607b5a-b629-46f1-a053-0659b78c4231-\/executors\/gridservice.9d35ca3e-1c78-11e6-8664-0242472674ba\/runs\/9c650d01-127c-416b-a00b-5ad09409c76e"}
> I0517 21:45:04.106462  5512 fetcher.cpp:379] Fetching URI
> 'file:///root/.dockercfg' I0517 21:45:04.106475  5512 fetcher.cpp:250]
> Fetching directly into the sandbox directory I0517 21:45:04.106487  5512
> fetcher.cpp:187] Fetching URI 'file:///root/.dockercfg' I0517
> 21:45:04.106499  5512 fetcher.cpp:167] Copying resource with command:cp
> '/root/.dockercfg'
> '/tmp/mesos/slaves/cf607b5a-b629-46f1-a053-0659b78c4231-S454/frameworks/cf607b5a-b629-46f1-a053-0659b78c4231-/executors/gridservice.9d35ca3e-1c78-11e6-8664-0242472674ba/runs/9c650d01-127c-416b-a00b-5ad09409c76e/.dockercfg'
> I0517 21:45:04.107993  5512 fetcher.cpp:456] Fetched
> 'file:///root/.dockercfg' to
> '/tmp/mesos/slaves/cf607b5a-b629-46f1-a053-0659b78c4231-S454/frameworks/cf607b5a-b629-46f1-a053-0659b78c4231-/executors/gridservice.9d35ca3e-1c78-11e6-8664-0242472674ba/runs/9c650d01-127c-416b-a00b-5ad09409c76e/.dockercfg
>
>
> Marathon debug says it can't authenticate. I can pull manually on the
> slave with this credential file.
> Any idea what I'm doing wrong?
>
>
> Scott Kinney | DevOps
> stem   |   m  510.282.1299
> 100 Rollins Road, Millbrae, California 94030
>
>  This e-mail and/or any attachments contain Stem, Inc. confidential and
> proprietary information and material for the sole use of the intended
> recipient(s). Any review, use or distribution that has not been expressly
> authorized by Stem, Inc. is strictly prohibited.  If you are not the
> intended recipient, please contact the sender and delete all copies. Thank
> you.


Re: Marathon MySQL and Wordpress Deployment

2016-05-12 Thread Joseph Wu
You may want to elaborate on what exactly you want to do.

But yes.  Command tasks just run commands.  As long as the syntax is correct
(i.e. can you run it locally?), it will run.

On Wed, May 11, 2016 at 10:13 PM,  wrote:

> Hi,
>
> I would like to know whether we can deploy MySQL and WordPress together
> through the Marathon UI (using the command option).
>
> Can we put the commands for both together, along with the environment
> variables, in the command field? Is it possible to run them that way?
>
>
>
>
> -Original Message-
> From: Stephen Gran [mailto:stephen.g...@piksel.com]
> Sent: 11 May 2016 15:26
> To: user@mesos.apache.org
> Subject: Re: Marathon scaling application
>
> Hi,
>
> The logs say that the only enabled containerizer is mesos.  Perhaps you
> need to set that to mesos,docker.
>
> Cheers,
>
> On 11/05/16 10:48, suruchi.kum...@accenture.com wrote:
> > Hi,
> >
> > 1. I did not launch the Marathon job with a JSON file.
> >
> > 2. The version of Mesos is 0.27.2 and Marathon is 0.15.3.
> >
> > 3. The OS on the nodes is Ubuntu 14.04 LTS.
> >
> > 4.Here are the slave logs :-
> >
> > E0511 00:41:43.982487  1460 slave.cpp:3800] Termination of executor
> > 'nginx.226620ca-1711-11e6-9f8a-fa163ecc33f1' of framework
> > a039103f-aab7-4f15-8578-0d52ac8f60e0- failed: Unknown container:
> > a0d72cc7-f02b-44d7-b93a-3b1df6e74414
> >
> > E0511 01:41:44.518671  1457 slave.cpp:3729] Container
> > '20095298-d0c5-4c23-ae0b-a0b9393ecfb4' for executor
> > 'nginx.847bdd1b-1719-11e6-9f8a-fa163ecc33f1' of framework
> > a039103f-aab7-4f15-8578-0d52ac8f60e0- failed to start: None of the
> > enabled containerizers (mesos) could create a container for the
> > provided TaskInfo/ExecutorInfo message
> >
> > E0511 01:41:44.518831  1457 slave.cpp:3800] Termination of executor
> > 'nginx.847bdd1b-1719-11e6-9f8a-fa163ecc33f1' of framework
> > a039103f-aab7-4f15-8578-0d52ac8f60e0- failed: Unknown container:
> > 20095298-d0c5-4c23-ae0b-a0b9393ecfb4
> >
> > E0511 02:41:44.632048  1462 slave.cpp:3729] Container
> > '944a6719-b942-4a06-8d4a-08e1f624f62e' for executor
> > 'nginx.e6557acc-1721-11e6-9f8a-fa163ecc33f1' of framework
> > a039103f-aab7-4f15-8578-0d52ac8f60e0- failed to start: None of the
> > enabled containerizers (mesos) could create a container for the
> > provided TaskInfo/ExecutorInfo message
> >
> > E0511 02:41:44.632735  1457 slave.cpp:3800] Termination of executor
> > 'nginx.e6557acc-1721-11e6-9f8a-fa163ecc33f1' of framework
> > a039103f-aab7-4f15-8578-0d52ac8f60e0- failed: Unknown container:
> > 944a6719-b942-4a06-8d4a-08e1f624f62e
> >
> > E0511 03:41:44.781136  1464 slave.cpp:3729] Container
> > '2677810f-0f42-45fd-87aa-329a9fbe5af0' for executor
> > 'nginx.482be42d-172a-11e6-9f8a-fa163ecc33f1' of framework
> > a039103f-aab7-4f15-8578-0d52ac8f60e0- failed to start: None of the
> > enabled containerizers (mesos) could create a container for the
> > provided TaskInfo/ExecutorInfo message
> >
> > E0511 03:41:44.782914  1460 slave.cpp:3800] Termination of executor
> > 'nginx.482be42d-172a-11e6-9f8a-fa163ecc33f1' of framework
> > a039103f-aab7-4f15-8578-0d52ac8f60e0- failed: Unknown container:
> > 2677810f-0f42-45fd-87aa-329a9fbe5af0
> >
> > E0511 04:41:44.891082  1463 slave.cpp:3729] Container
> > 'acefe126-7d69-4525-987c-bafbf1dd1d6f' for executor
> > 'nginx.aa066c3e-1732-11e6-9f8a-fa163ecc33f1' of framework
> > a039103f-aab7-4f15-8578-0d52ac8f60e0- failed to start: None of the
> > enabled containerizers (mesos) could create a container for the
> > provided TaskInfo/ExecutorInfo message
> >
> > E0511 04:41:44.891180  1463 slave.cpp:3800] Termination of executor
> > 'nginx.aa066c3e-1732-11e6-9f8a-fa163ecc33f1' of framework
> > a039103f-aab7-4f15-8578-0d52ac8f60e0- failed: Unknown container:
> > acefe126-7d69-4525-987c-bafbf1dd1d6f
> >
> > E0510 10:27:25.997802  1352 process.cpp:1958] Failed to shutdown
> > socket with fd 10: Transport endpoint is not connected
> >
> > E0511 05:39:43.651479  1351 slave.cpp:3252] Failed to update resources
> > for container 53bb3453-31b2-4cf7-a9e1-5f700510eeb4 of executor
> > 'nginx.38f28ab0-169b-11e6-9f8a-fa163ecc33f1' running task
> > nginx.38f28ab0-169b-11e6-9f8a-fa163ecc33f1 on status update for
> > terminal task, destroying container: Failed to 'docker -H
> > unix:///var/run/docker.sock inspect
> >
> mesos-f986e4ba-91ba-4624-b685-4c004407c6db-S1.53bb3453-31b2-4cf7-a9e1-5f700510eeb4':
> > exit status = exited with status 1 stderr = Cannot connect to the
> > Docker daemon. Is the docker daemon running on this host?
> >
> > E0511 05:39:43.651845  1351 slave.cpp:3252] Failed to update resources
> > for container ec4e97ad-2365-4c29-9ed7-64cd9261c666 of executor
> > 'nginx.38f48682-169b-11e6-9f8a-fa163ecc33f1' running task
> > nginx.38f48682-169b-11e6-9f8a-fa163ecc33f1 on status update for
> > terminal task, destroying container: Failed to 'docker -H
> > unix:///var/run/docker.sock inspect
> >
> 

Re: Enable s3a for fetcher

2016-05-10 Thread Joseph Wu
I can't speak to what DCOS does or will do (you can ask on the associated
mailing list: us...@dcos.io).

We will be maintaining existing functionality for the fetcher, which means
supporting the schemes:
* file
* http, https, ftp, ftps
* hdfs, hftp, s3, s3n  <--  These rely on hadoop.

And we will retain the --hadoop_home agent flag, which you can use to
specify the hadoop binary.

Other schemes might work right now, if you hack around with your node
setup.  But there's no guarantee that your hack will work between Mesos
versions.  In future, we will associate a fetcher plugin for each scheme.
And you will be able to load custom fetcher plugins for additional schemes.
TLDR: no "nerfing" and less hackiness :)
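As a rough illustration of the dispatch described above, the fetcher's choice of mechanism by URI scheme looks something like this (a sketch for explanation only, not the actual fetcher code):

```python
from urllib.parse import urlparse

# Schemes the fetcher hands off to a local hadoop client (--hadoop_home).
HADOOP_SCHEMES = {"hdfs", "hftp", "s3", "s3n"}
# Schemes the fetcher downloads itself.
NATIVE_SCHEMES = {"http", "https", "ftp", "ftps"}

def fetch_mechanism(uri):
    """Return which mechanism the fetcher would use for a given URI."""
    scheme = urlparse(uri).scheme
    if scheme in ("", "file"):
        return "local copy"
    if scheme in NATIVE_SCHEMES:
        return "built-in downloader"
    if scheme in HADOOP_SCHEMES:
        return "hadoop client"
    return "unsupported (until fetcher plugins land)"

print(fetch_mechanism("s3n://bucket/object"))      # hadoop client
print(fetch_mechanism("file:///root/.dockercfg"))  # local copy
```

Under this dispatch, s3a would currently fall into the "unsupported" bucket unless the node's hadoop setup is hacked to handle it.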

On Tue, May 10, 2016 at 12:58 PM, Briant, James <
james.bri...@thermofisher.com> wrote:

> This is the mesos latest documentation:
>
> If the requested URI is based on some other protocol, then *the fetcher
> tries to utilise a local Hadoop client* and *hence supports any protocol
> supported by the Hadoop client, e.g., HDFS, S3.* See the slave configuration
> documentation
> <http://mesos.apache.org/documentation/latest/configuration/> for how to
> configure the slave with a path to the Hadoop client. [emphasis added]
>
> What you are saying is that DC/OS simply won't install hadoop on agents?
>
> Next question then: will you be nerfing fetcher.cpp, or will I be able to
> install hadoop on the agents myself, such that mesos will recognize s3a?
>
>
> From: Joseph Wu <jos...@mesosphere.io>
> Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
> Date: Tuesday, May 10, 2016 at 12:20 PM
> To: user <user@mesos.apache.org>
>
> Subject: Re: Enable s3a for fetcher
>
> Mesos does not explicitly support HDFS and S3.  Rather, Mesos will assume
> you have a hadoop binary and use it (blindly) for certain types of URIs.
> If the hadoop binary is not present, the mesos-fetcher will fail to fetch
> your HDFS or S3 URIs.
>
> Mesos does not ship/package hadoop, so these URIs are not expected to work
> out of the box (for plain Mesos distributions).  In all cases, the operator
> must preconfigure hadoop on each node (similar to how Docker in Mesos
> works).
>
> Here's the epic tracking the modularization of the mesos-fetcher (I
> estimate it'll be done by 0.30):
> https://issues.apache.org/jira/browse/MESOS-3918
>
> ^ Once done, it should be easier to plug in more fetchers, such as one for
> your use-case.
>
> On Tue, May 10, 2016 at 11:21 AM, Briant, James <
> james.bri...@thermofisher.com> wrote:
>
>> I’m happy to have default IAM role on the box that can read-only fetch
>> from my s3 bucket. s3a gets the credentials from AWS instance metadata. It
>> works.
>>
>> If hadoop is gone, does that mean that hdfs: URIs don’t work either?
>>
>> Are you saying dcos and mesos are diverging? Mesos explicitly supports
>> hdfs and s3.
>>
>> In the absence of S3, how do you propose I make large binaries available
>> to my cluster, and only to my cluster, on AWS?
>>
>> Jamie
>>
>> From: Cody Maloney <c...@mesosphere.io>
>> Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
>> Date: Tuesday, May 10, 2016 at 10:58 AM
>> To: "user@mesos.apache.org" <user@mesos.apache.org>
>> Subject: Re: Enable s3a for fetcher
>>
>> The s3 fetcher stuff inside of DC/OS is not supported. The `hadoop`
>> binary has been entirely removed from DC/OS 1.8 already. There have been
>> various proposals to make it so the mesos fetcher is much more pluggable /
>> extensible (https://issues.apache.org/jira/browse/MESOS-2731 for
>> instance).
>>
>> Generally speaking people want a lot of different sorts of fetching, and
>> there are all sorts of questions of how to properly get auth to the various
>> chunks (if you're using s3a:// presumably you need to get credentials there
>> somehow. Otherwise you could just use http://). Need to design / build
>> that into Mesos and DC/OS to be able to use this stuff.
>>
>> Cody
>>
>> On Tue, May 10, 2016 at 9:55 AM Briant, James <
>> james.bri...@thermofisher.com> wrote:
>>
>>> I want to use s3a: urls in fetcher. I’m using dcos 1.7 which has hadoop
>>> 2.5 on its agents. This version has the necessary hadoop-aws and aws-sdk:
>>>
>>> hadoop--afadb46fe64d0ee7ce23dbe769e44bfb0767a8b9]$ ls
>>> usr/share/hadoop/tools/lib/ | grep aws
>>> aws-java-sdk-1.7.4.jar
>>> hadoop-aws-2.5.0-cdh5.3.3.jar
>>>
>>> What config/scripts do I need to hack to get these guys on the classpath
>>> so that "hadoop fs -copyToLocal” works?
>>>
>>> Thanks,
>>> Jamie
>>
>>
>


Re: Enable s3a for fetcher

2016-05-10 Thread Joseph Wu
Mesos does not explicitly support HDFS and S3.  Rather, Mesos will assume
you have a hadoop binary and use it (blindly) for certain types of URIs.
If the hadoop binary is not present, the mesos-fetcher will fail to fetch
your HDFS or S3 URIs.

Mesos does not ship/package hadoop, so these URIs are not expected to work
out of the box (for plain Mesos distributions).  In all cases, the operator
must preconfigure hadoop on each node (similar to how Docker in Mesos
works).

Here's the epic tracking the modularization of the mesos-fetcher (I
estimate it'll be done by 0.30):
https://issues.apache.org/jira/browse/MESOS-3918

^ Once done, it should be easier to plug in more fetchers, such as one for
your use-case.

On Tue, May 10, 2016 at 11:21 AM, Briant, James <
james.bri...@thermofisher.com> wrote:

> I’m happy to have default IAM role on the box that can read-only fetch
> from my s3 bucket. s3a gets the credentials from AWS instance metadata. It
> works.
>
> If hadoop is gone, does that mean that hdfs: URIs don’t work either?
>
> Are you saying dcos and mesos are diverging? Mesos explicitly supports
> hdfs and s3.
>
> In the absence of S3, how do you propose I make large binaries available
> to my cluster, and only to my cluster, on AWS?
>
> Jamie
>
> From: Cody Maloney 
> Reply-To: "user@mesos.apache.org" 
> Date: Tuesday, May 10, 2016 at 10:58 AM
> To: "user@mesos.apache.org" 
> Subject: Re: Enable s3a for fetcher
>
> The s3 fetcher stuff inside of DC/OS is not supported. The `hadoop` binary
> has been entirely removed from DC/OS 1.8 already. There have been various
> proposals to make it so the mesos fetcher is much more pluggable /
> extensible (https://issues.apache.org/jira/browse/MESOS-2731 for
> instance).
>
> Generally speaking people want a lot of different sorts of fetching, and
> there are all sorts of questions of how to properly get auth to the various
> chunks (if you're using s3a:// presumably you need to get credentials there
> somehow. Otherwise you could just use http://). Need to design / build
> that into Mesos and DC/OS to be able to use this stuff.
>
> Cody
>
> On Tue, May 10, 2016 at 9:55 AM Briant, James <
> james.bri...@thermofisher.com> wrote:
>
>> I want to use s3a: urls in fetcher. I’m using dcos 1.7 which has hadoop
>> 2.5 on its agents. This version has the necessary hadoop-aws and aws-sdk:
>>
>> hadoop--afadb46fe64d0ee7ce23dbe769e44bfb0767a8b9]$ ls
>> usr/share/hadoop/tools/lib/ | grep aws
>> aws-java-sdk-1.7.4.jar
>> hadoop-aws-2.5.0-cdh5.3.3.jar
>>
>> What config/scripts do I need to hack to get these guys on the classpath
>> so that "hadoop fs -copyToLocal” works?
>>
>> Thanks,
>> Jamie
>
>


Re: Marathon GUI -Scaling An Application Issue.

2016-05-09 Thread Joseph Wu
Can you check the following?


   - Are offers being sent to Marathon from both agents?  This will show up
   in the master logs, with at least INFO level logging (default).
   - Do the resources from your second agent actually satisfy your
   container's constraints?  It would help to see your Marathon app definition
   and the resource string of your agents.  You can find the latter printed at
   the beginning of the agent's logs.  i.e.
   Agent resources: cpus(*):1; mem(*):1024; disk(*):2048
   - Are your tasks not being started?  Or are they failing on your second
   node?
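When checking the agent's advertised resources against your app definition, it can help to pull the numbers out of that resource string. A minimal sketch for scalar resources only (range and set resources such as ports(*):[31000-32000] are deliberately skipped):

```python
def parse_scalar_resources(resource_string):
    """Parse an agent resource string like
    'cpus(*):1; mem(*):1024; disk(*):2048' into a dict of floats."""
    resources = {}
    for part in resource_string.split(";"):
        # 'cpus(*)' -> resource name 'cpus'; the (...) part is the role.
        name_role, _, value = part.strip().partition(":")
        name = name_role.split("(")[0]
        try:
            resources[name] = float(value)
        except ValueError:
            pass  # skip non-scalar values (ranges, sets)
    return resources

print(parse_scalar_resources("cpus(*):1; mem(*):1024; disk(*):2048"))
# {'cpus': 1.0, 'mem': 1024.0, 'disk': 2048.0}
```

Comparing these numbers against the cpus/mem/disk in your Marathon app definition quickly shows whether the second agent's offers can satisfy the task at all.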


On Mon, May 9, 2016 at 7:42 PM,  wrote:

> Hi,
>
>
>
> I have set up 3 mesos-masters and 2 mesos-slave nodes in my environment,
> and I’m trying to deploy Docker containers using the Marathon UI. I was
> able to run the applications and scale them.
>
>
>
> Since my first slave is short of resources, the application (more than 5
> instances) that I'm scaling should get distributed, but it is not getting
> distributed or moved to another slave node which has enough resources
> available.
>
>
>
> I don’t know why this is happening, so could you please help me with this
> issue? Does Mesos help applications get distributed when there is a
> shortage of resources on a particular slave node?
>
>
>
>
>
> Thank you,
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy.
>
> __
>
> www.accenture.com
>


Re: Dynamic scaling of DCOS slave

2016-05-03 Thread Joseph Wu
You should be able to add as many agents as you like via manual
installation.  The list of agents you supply in the genconf is used for the
automated install processes.  For manual installation, the list of agents
is inconsequential (since you, the operator, are SSH-ing into each box).
See: https://dcos.io/docs/1.7/administration/installing/custom/advanced/

Also, please direct DC/OS questions towards us...@dcos.io for
better/more-complete answers.

On Tue, May 3, 2016 at 2:53 AM, Dhiraj Thakur 
wrote:

> Hi,
> Is there any way to scale up DC/OS slaves without uninstalling the
> existing setup?
> I have done installation through
> https://dcos.io/docs/1.7/administration/installing/custom/
>
> -Dhiraj
>


Re: Offers to a framework

2016-05-02 Thread Joseph Wu
Both the Mesos master and Marathon have metrics that tell you how many
offers have been sent, but not the contents of said offers.  Marathon does
not keep offers long enough for them to show up as "outstanding offers" in
the Mesos UI.

As far as I know, one way to get the offer contents is by setting logging
verbosity to GLOG_v=2 (*Warning*: This prints a lot of stuff).  When the
master is started or toggled to this logging level, the allocator will
print each offer before it sends it:
https://github.com/apache/mesos/blob/2c6eeefe13c5706da70328af6cea14f802121bf2/src/master/allocator/mesos/hierarchical.cpp#L1462-L1463

For how to temporarily toggle the logging verbosity:
http://mesos.apache.org/documentation/latest/logging/
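For example, the /logging/toggle endpoint described in that documentation lets you raise verbosity temporarily without restarting the master. A sketch that just builds the request URL (the host/port below is a placeholder; see the documentation above for the exact parameters):

```python
from urllib.parse import urlencode

def toggle_url(host_port, level, duration):
    """Build a /logging/toggle request URL for a Mesos master or agent.

    `level` maps to the GLOG_v verbosity; `duration` bounds how long the
    change lasts before reverting (e.g. "5mins").
    """
    query = urlencode({"level": level, "duration": duration})
    return "http://%s/logging/toggle?%s" % (host_port, query)

# Raise verbosity to GLOG_v=2 for 5 minutes, then auto-revert:
print(toggle_url("master.example.com:5050", 2, "5mins"))
# http://master.example.com:5050/logging/toggle?level=2&duration=5mins
```

Hitting that URL (e.g. with curl) on a running master turns on the per-offer allocator logging mentioned above for the requested window only, which limits the log volume.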

On Mon, May 2, 2016 at 6:46 AM, Vaibhav Khanduja 
wrote:

> For a use case, we need to know the offers being sent to a registered
> framework. For example, if we are using Marathon as a framework, the master
> would send offers to it based on DRF as soon as a slave is available.
> Marathon would then accept an offer if it needs to start a job. Is there a
> way to know externally, using an API (on the master, Marathon, or the
> framework), which offers are being sent and which are being accepted?
>
> Thx
>


Re: HTTP API

2016-03-19 Thread Joseph Wu
Zameer,

In case you haven't seen this already, there is already a Java-based
scheduler driver for the HTTP API here:
https://github.com/mesosphere/mesos-rxjava


On Thu, Mar 17, 2016 at 5:26 PM, Zameer Manji  wrote:

>
> On Thu, Mar 17, 2016 at 10:03 AM, Vinod Kone  wrote:
>
>> Other than the issues listed above, we like frameworks to start testing
>> this API in their staging/testing clusters. This would give us the most
>> confidence to call it production ready. Can you help?
>>
>
> As a committer of Apache Aurora, I am interested in removing the
> dependency in libmesos and creating a Java Scheduler Driver that
> communicates with the HTTP API. However, it only seems worthwhile to do
> once the API has stabilized. I'll wait for the API to be finalized and then
> assess what work needs to be done for the framework.
>
> --
> Zameer Manji
>
>


Re: [RESULT][VOTE] Release Apache Mesos 0.27.2 (rc1)

2016-03-18 Thread Joseph Wu
Cong Wang,

The tags are sync'd.  See: https://github.com/apache/mesos/releases

You might not have done: git pull --tags
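
To see why that matters, here is a throwaway demonstration (temporary local repositories standing in for the Apache repo and your clone): a tag pushed to the remote after you cloned does not show up in `git tag -l` until you fetch tags.

```shell
# Demo: a tag pushed after you cloned is invisible until you fetch tags.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/upstream.git"

# Your clone, made before the release tag existed:
git clone -q "$tmp/upstream.git" "$tmp/clone" 2>/dev/null
(cd "$tmp/clone" \
  && git -c user.name=demo -c user.email=demo@example.com \
       commit -q --allow-empty -m "initial commit" \
  && git push -q origin HEAD)

# The release manager tags and pushes from another clone:
git clone -q "$tmp/upstream.git" "$tmp/other" 2>/dev/null
(cd "$tmp/other" && git tag 0.27.2 && git push -q origin 0.27.2)

# Your clone only sees the tag after fetching tags:
cd "$tmp/clone"
git tag -l | grep -q '0\.27\.2' || echo "tag not visible before fetch"
git fetch -q --tags origin
git tag -l | grep -q '0\.27\.2' && echo "tag visible after fetch"
```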

On Wed, Mar 16, 2016 at 11:49 AM, Cong Wang  wrote:

> On Mon, Mar 7, 2016 at 8:29 PM, Michael Park  wrote:
> > Please find the release at:
> > https://dist.apache.org/repos/dist/release/mesos/0.27.2
> >
> > It is recommended to use a mirror to download the release:
> > http://www.apache.org/dyn/closer.cgi
> >
> > The CHANGELOG for the release is available at:
> >
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.2
> >
> > The mesos-0.27.2.jar has been released to:
> > https://repository.apache.org
> >
>
> So why are the git tags not synced to the GitHub mirror?
>
> $ git tag -l | grep '0\.27\.2'
>


Re: How to manage maintenance windows?

2016-03-14 Thread Joseph Wu
Managing maintenance is currently up to the operator (you, presumably).  If
you have something to contribute (code, docs, or examples), that would be
greatly appreciated :)

We haven't prioritized other integration (like the CLI or web UI) since
maintenance primitives themselves need to be supported by frameworks using
the V1 HTTP API; and framework developers have not all started/finished
migrating to the new API yet.  (Without framework support, frameworks will
not react to your maintenance windows.)

Here's the tracking ticket for adding maintenance info to the web UI (no
work has been done on this yet):
https://issues.apache.org/jira/browse/MESOS-2082
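
For operators who want to drive the API from a script in the meantime, a sketch of scheduling a window follows. The JSON shape follows the maintenance documentation; the hostname, IP, times, and master address are all placeholders to adjust for your cluster:

```shell
# Sketch: schedule a one-hour maintenance window for one agent.
# All values below are placeholders for your own cluster.
PAYLOAD=$(cat <<'EOF'
{
  "windows": [
    {
      "machine_ids": [
        { "hostname": "agent1.example.com", "ip": "10.0.0.1" }
      ],
      "unavailability": {
        "start": { "nanoseconds": 1458928800000000000 },
        "duration": { "nanoseconds": 3600000000000 }
      }
    }
  ]
}
EOF
)

# The actual POST (commented out here; requires a live master):
# curl -X POST -H "Content-Type: application/json" \
#      -d "$PAYLOAD" http://master.example.com:5050/master/maintenance/schedule
echo "$PAYLOAD"
```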

On Mon, Mar 14, 2016 at 10:53 AM, Christoph Heer 
wrote:

> Hi,
>
> Mesos has provided nice support for managing maintenance for a few versions
> now. Thank you.
> The API provides all the required functionality, but I couldn't find a
> tool for operators to manage maintenance windows.
>
> How do you manage/plan such windows? Did you integrate the API in your
> existing tooling? Did you create a CLI for such tasks?
>
> Is it planned to extend the Mesos web interface with a maintenance section?
>
> Thank you and best regards
> Christoph


Re: How did the mesos master detect the disconnect of a framework (scheduler)

2016-02-26 Thread Joseph Wu
Here's a brief(?) run-down:

   1.
   
https://github.com/apache/mesos/blob/4376803007446b949840d53945547d8a61b91339/src/master/master.cpp#L5739-L5748
   

   When a new framework is added, the master opens a socket connection with
   the framework.
   - If this is a scheduler-driver-based framework, this is a plain socket
  connection.
  - If this is a new HTTP API framework, the master uses the streaming
  HTTP connection instead.
   2. The HTTP API framework's exit logic is simpler to explain.  When the
   streaming connection closes, the master considers the framework to have
   exited.  In the above code, see this chunk of code:
   http.closed()
     .onAny(defer(self(), &Self::exited, framework->id(), http));
   3. The scheduler-driver-based framework exit is a bit more involved:
  1.
  
https://github.com/apache/mesos/blob/4376803007446b949840d53945547d8a61b91339/3rdparty/libprocess/src/process.cpp#L1326
   Libprocess has a SocketManager which, as the name suggests, manages
  sockets.  Linking the master <-> framework spawns a socket here.
  2.
  
https://github.com/apache/mesos/blob/4376803007446b949840d53945547d8a61b91339/3rdparty/libprocess/src/process.cpp#L1394-L1400
  Linking will install a dispatch loop, which continually reads the
  data from the socket until the socket closes.
  3.
  
https://github.com/apache/mesos/blob/4376803007446b949840d53945547d8a61b91339/3rdparty/libprocess/src/process.cpp#L1300-L1312
  The dispatch loop calls "ignore_recv_data".  This detects when the
  socket closes and calls "SocketManager->close(s)".
  4.
  
https://github.com/apache/mesos/blob/4376803007446b949840d53945547d8a61b91339/3rdparty/libprocess/src/process.cpp#L1928
  "SocketManager->close" will generate a libprocess "ExitedEvent".
  5.
  
https://github.com/apache/mesos/blob/4376803007446b949840d53945547d8a61b91339/src/master/master.cpp#L1352
  Master has a listener for "ExitedEvent" which rate-limits these
  events.
  6.
  
https://github.com/apache/mesos/blob/4376803007446b949840d53945547d8a61b91339/src/master/master.cpp#L1161
  The "ExitedEvent" eventually gets propagated to that ^ method
  (through a libprocess event visitor).
  7.
  
https://github.com/apache/mesos/blob/4376803007446b949840d53945547d8a61b91339/src/master/master.cpp#L1165
  Finally, the framework gets removed.

Hope that helps,

~Joseph

On Fri, Feb 26, 2016 at 10:45 AM, Chong Chen  wrote:

> Hi,
>
> When a running framework is disconnected (e.g. manually terminated), the Mesos
> master will detect it immediately.  The Master::exited() function will be
> invoked and will log “framework disconnected”.
>
> I was just wondering how this disconnect detection is implemented in Mesos.
> I can’t find any place in the Mesos src directory where Master::exited()
> is called.
>
>
>
> Thanks!
>
>
>
> Best Regards,
>
> Chong
>


Re: Downloading s3 uris

2016-02-26 Thread Joseph Wu
The sandbox directory structure is a bit deep...  See the "Where is the
sandbox?" section here:
http://mesos.apache.org/documentation/latest/sandbox/


On Fri, Feb 26, 2016 at 10:15 AM, Aaron Carey  wrote:

> A second question for you all..
>
> I'm testing http uri downloads, and all the logs say that the file has
> downloaded (it even shows up in the mesos UI in the sandbox) but I can't
> find the file on disk anywhere. It doesn't appear in the docker container
> I'm running either (shouldn't it be in /mnt/mesos/sandbox?)
>
> Am I missing something here?
>
> Thanks for your help,
>
> Aaron
>
>
> --
> *From:* Radoslaw Gruchalski [ra...@gruchalski.com]
> *Sent:* 26 February 2016 17:41
>
> *To:* user@mesos.apache.org; user@mesos.apache.org
> *Subject:* Re: Downloading s3 uris
>
> Just keep in mind that every execution of such command starts a jvm and
> is, generally, heavyweight. Use WebHDFS if you can.
>
> Sent from Outlook Mobile 
>
>
>
>
> On Fri, Feb 26, 2016 at 9:13 AM -0800, "Shuai Lin" wrote:
>
> If you don't want to configure hadoop on your mesos slaves, the only
>> workaround I see is to write a "hadoop" script and put it in your PATH. It
>> need to support the following usage patterns:
>>
>> - hadoop version
>> - hadoop fs -copyToLocal s3n://path /target/directory/
>>
>> On Sat, Feb 27, 2016 at 12:31 AM, Aaron Carey  wrote:
>>
>>> I was trying to avoid generating urls for everything as this will
>>> complicate things a lot.
>>>
>>> Is there a straight forward way to get the fetcher to do it directly?
>>>
>>> --
>>> *From:* haosdent [haosd...@gmail.com]
>>> *Sent:* 26 February 2016 16:27
>>> *To:* user
>>> *Subject:* Re: Downloading s3 uris
>>>
> >>> I think you could still pass an AWSAccessKeyId if it is private?
>>> http://www.bucketexplorer.com/documentation/amazon-s3--how-to-generate-url-for-amazon-s3-files.html
>>>
>>> On Sat, Feb 27, 2016 at 12:25 AM, Abhishek Amralkar <
>>> abhishek.amral...@talentica.com> wrote:
>>>
 In that case do we need to keep bucket/files public?

 -Abhishek

 From: Zhitao Li 
 Reply-To: "user@mesos.apache.org" 
 Date: Friday, 26 February 2016 at 8:23 AM
 To: "user@mesos.apache.org" 
 Subject: Re: Downloading s3 uris

 Haven't directly used s3 download, but I think a workaround (if you
 don't care about the ACLs on the files) is to use an http url instead.

 On Feb 26, 2016, at 8:17 AM, Aaron Carey  wrote:

 I'm attempting to fetch files from s3 uris in mesos, but we're not
 using hdfs in our cluster... however I believe I need the client installed.

 Is it possible to just have the client running without a full hdfs
 setup?

 I haven't been able to find much information in the docs, could someone
 point me in the right direction?

 Thanks!

 Aaron



>>>
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>
>>
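
To make the wrapper script Shuai Lin describes above concrete, here is an illustrative sketch that delegates s3n:// fetches to the AWS CLI. It is written as a function for readability; in practice you would save the body as an executable `hadoop` script on the PATH. The s3n-to-s3 mapping and the fake version string are assumptions of this sketch, not anything shipped with Mesos, and `aws` must be installed and configured:

```shell
# Map s3n://bucket/key to the s3:// form the AWS CLI understands.
s3n_to_s3() {
  echo "$1" | sed 's|^s3n://|s3://|'
}

# Fake "hadoop" shim: supports only the two invocations the fetcher uses.
hadoop() {
  case "$1" in
    version)
      echo "Hadoop 2.0.0-fake" ;;  # the fetcher only checks this succeeds
    fs)
      # Expected form: hadoop fs -copyToLocal <uri> <target-dir>
      [ "$2" = "-copyToLocal" ] || { echo "unsupported: $*" >&2; return 1; }
      aws s3 cp "$(s3n_to_s3 "$3")" "$4" ;;
    *)
      echo "unsupported: $*" >&2; return 1 ;;
  esac
}

hadoop version
```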


[Proposal] Unified logging for containerizers & the external containerizer

2015-12-11 Thread Joseph Wu
Hello All,

As part of the work on managing the logs for executors and tasks, we're
introducing a "ContainerLogger" module.  This module will allow the
stdout/stderr of executors and tasks to be managed or redirected.
(Existing executor/task logs are written to plain files.)  For example:

   - The module would make it trivial to truncate logs to a maximum size.
   Or to rotate the logs.
   - A module could redirect logs into an aggregation service, like syslog
   or journald; or to external logging, like LogStash or Splunk.

See the epic for more details:
https://issues.apache.org/jira/browse/MESOS-4086

For the MVP, we will support the Mesos and Docker containerizers.  For the
external containerizer, we plan to exit if an agent is started with both
the external containerizer and the new ContainerLogger module.  i.e.

mesos-slave.sh --containerizers="mesos,external" \
  --container_logger="some_custom_logger"

Is there anyone using the external containerizer who would object to this
behavior?

Thanks,
~Joseph


Re: Viewing old versions of docs

2015-10-02 Thread Joseph Wu
Hi Alan,

I don't think it's recommended to refer to older versions of the docs.  But
if you absolutely need to, you can find those by browsing the source.

Take the version of Mesos you're looking for, and substitute it for
"<version>" below:
https://github.com/apache/mesos/blob/<version>/docs/

e.g. for the most recent release:
https://github.com/apache/mesos/blob/0.24.1/docs/

~Joseph

On Fri, Oct 2, 2015 at 11:02 AM, Alan Braithwaite 
wrote:

> Hey All,
>
> Trying to figure out how to view older versions of the docs on the web.
> Can't find an index or link to versioned docs from google.
>
> Can anyone point me in the right direction?
>
> Thanks,
> - Alan
>


Re: Reservations for multiple different agents

2015-09-29 Thread Joseph Wu
Rinaldo,

The principal is taken from authentication, rather than from the body of
the resources.  In this case, you'll be using Basic Authentication:
https://en.wikipedia.org/wiki/Basic_access_authentication#Client_side

With curl, you'd add something like: -H "Authorization: Basic
bWVzb3MtbWFjaDUtYmV0YTpwYXNzd29yZA=="
That base64 blurb is the encoded version of "mesos-mach5-beta:password".
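
You can reproduce that header value yourself with standard tools (use printf rather than echo so no trailing newline sneaks into the encoding):

```shell
# Encode "principal:password" for a Basic Authorization header.
printf '%s' "mesos-mach5-beta:password" | base64
# -> bWVzb3MtbWFjaDUtYmV0YTpwYXNzd29yZA==
```

Alternatively, curl's -u flag builds the header for you: curl -u mesos-mach5-beta:password ...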

~Joseph

On Mon, Sep 28, 2015 at 8:25 PM, DiGiorgio, Mr. Rinaldo S. <
rdigior...@pace.edu> wrote:

>
> On Sep 28, 2015, at 8:03 PM, Joseph Wu <jos...@mesosphere.io> wrote:
>
> Hi Rinaldo,
>
> I'd like to point out a small error in your ACLs.
>
> If you want to specify "ANY", you should set the "type" field.  i.e. For
> the RegisterFramework ACL:
> "register_frameworks": [
>   {
> "principals": { "values": "mesos-mach5-beta" },
> "roles": { "type": 1 }
>   }
> ]
>
>
> Thanks — can’t keep my eyes open any more.  This is the response I get to
> the following request.
>
> *Invalid RESERVE operation: Cannot reserve resources without a principal.
>  *
>
> The example shows -u principal:password in curl, which is an authentication
> string for the browser, so I am totally confused about how to provide a
> principal.   The documentation for the framework reserve
>
>
>
> curl -i -d slaveId="$SLAVE_ID" -d @- -X POST \
>     http://$MESOS_HOST/master/reserve <<EOF
> resources=[
>   {
>     "name": "cpus",
>     "type": "SCALAR",
>     "scalar": { "value": 8 },
>     "role": "mach5",
>     "reservation": {
>       "principal": "mach5"
>     }
>   },
>   {
>     "name": "mem",
>     "type": "SCALAR",
>     "scalar": { "value": 4096 },
>     "role": "mach5",
>     "reservation": {
>       "principal": "mach5"
>     }
>   }
> ]
> EOF
>
> The ANY "type" is part of an enumeration, defined here:
>
> https://github.com/apache/mesos/blob/master/include/mesos/authorizer/authorizer.proto#L33-L45
>
> Hope that helps,
> ~Joseph
>
> On Mon, Sep 28, 2015 at 2:51 PM, DiGiorgio, Mr. Rinaldo S. <
> rdigior...@pace.edu> wrote:
>
>>
>> On Sep 28, 2015, at 5:27 PM, Marco Massenzio <ma...@mesosphere.io> wrote:
>>
>> Hi Rinaldo,
>>
>> sorry about the trouble you're having in getting this to work!
>> If I got this one right, the original requirement was...
>>
>> I have some tasks that need to run on different types of agents.
>>
>>
>> for that, I think you can use either (or both) of `roles` and
>> `attributes` (see the Configuration doc [0] for more info).
>>
>> If you would like to run a 0.24 Mesos on your Mac for testing, you could
>> use the Mesosphere published packages[1] or, if Vagrant is more your thing,
>> feel free to "take inspiration" form [2].
>>
>> Marco,
>>
>>    Thanks — we are running 0.23, 0.24, and the current branch as of this
>> morning in three Mesos environments with Linux and Mac nodes, and working on
>> porting to Solaris. We have had various issues with building but are past most
>> of them. We are making progress on the Solaris build, and there is an issue
>> with libsvn-1, as you mentioned, with OL7.
>>
>>
>> *Why do we need Dynamic Reservations?*
>>
>> We are also working with the mesos-plugin 0.8 and 0.9 and would like to
>> change some of the behaviors of the plugin. One of the changes we want to
>> make and we may move this out of the meson-plugin into workflow plugin in
>> jenkins is to be able to reserve all the resources we need before we start
>> a series of tasks. That is what we want to use dynamic reservations for.
>> There may be issues with the jenkins workflow architecture in that “slaves”
>> have to be requested via plugins.  Mesos is new and I am sure it will
>> provide a framework to innovate  on all the following currently supported
>> scheduling options in LSF.
>>
>> Fair share, preemptive, backfill and SLA scheduling
>> High throughput scheduling
>> Multicluster scheduling
>> Topology-, resource-, and energy-aware scheduling
>>
>>
>>
>>
>> I am trying to ask for a reservation and maybe I just don’t understand
>> the definitions. I seem to be unsure about what a principal is.  Maybe that
>> is the root of my current issue.   Unfortunately I am also a teacher so I
>> notice things like I still can’t 

Re: Disk Capacity Detection

2015-07-23 Thread Joseph Wu
Chris,

There is a workaround.  If you wrap up your storage devices prior to
starting Mesos, then you can transparently use multiple disks as a single
disk.  See [1].

~Joseph

[1] https://www.mail-archive.com/user@mesos.apache.org/msg01726.html
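
In case it saves someone a search: "wrapping up" the storage devices usually means something like LVM. A sketch of the idea follows — device names and the mount point are placeholders, and since these commands are destructive and need root, the plan is printed here rather than executed:

```shell
# Print (don't run) an LVM plan that presents two disks to Mesos as one.
# /dev/sdb, /dev/sdc, and the mount point are placeholders.
plan_single_volume() {
  cat <<'EOF'
pvcreate /dev/sdb /dev/sdc
vgcreate mesos_vg /dev/sdb /dev/sdc
lvcreate -l 100%FREE -n work mesos_vg
mkfs.ext4 /dev/mesos_vg/work
mount /dev/mesos_vg/work /var/lib/mesos
mesos-slave --work_dir=/var/lib/mesos ...
EOF
}

plan_single_volume
```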

On Thu, Jul 23, 2015 at 2:31 PM, Jie Yu <yujie@gmail.com> wrote:

> Chris, Mesos currently only supports a single disk. In other words, if you
> have multiple disks, Mesos can only manage one of them. You can choose
> which disk to use by setting --work_dir to the appropriate directory under
> the given disk.
>
> On Thu, Jul 23, 2015 at 2:28 PM, Christopher Ketchum <cketc...@ucsc.edu>
> wrote:
>
>> Hi,
>>
>> I've recently started using the disk resource, and it seems like Mesos
>> slaves only detect disk resources mounted on the root directory. Is there
>> any way to point Mesos to storage mounted elsewhere? Should I just pass
>> what I know to be the cumulative storage to the resources flag?
>>
>> Thanks!
>> Chris