Re: [Breaking Change, MESOS-1865] Redirect to the leader master when current master is not a leader

2016-05-01 Thread Marco Massenzio
Hi,

sorry, I have not kept up with all the new endpoints :)
If there is already an endpoint (/redirect ?) that essentially addresses
the issue raised by MESOS-3841
 (https://issues.apache.org/jira/browse/MESOS-3841), then I'd suggest
closing it and adding a note.

(I just saw it and thought it would be useful, and fun, to fix.)
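For anyone wiring this into a client, here is a minimal sketch of what using the endpoint could look like. The `/redirect` name and the 307 behaviour are as discussed in this thread; the helper itself and the fallback return value are illustrative assumptions, not part of Mesos.

```python
import http.client

def find_leader(host, port=5050):
    """Ask any master for the leading master via the /redirect endpoint.

    http.client never follows redirects, so the 307 and its Location
    header (which points at the leading master) are visible directly.
    """
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/redirect")
        resp = conn.getresponse()
        if resp.status in (302, 307):
            return resp.getheader("Location")
        # No redirect: this master is (most likely) the leader itself.
        return "//%s:%d" % (host, port)
    finally:
        conn.close()
```

The same pattern works for any of the redirecting endpoints: inspect the status, read `Location`, and retry the request against the leader.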

-- 
*Marco Massenzio*
http://codetrips.com

On Sat, Apr 30, 2016 at 9:24 PM, haosdent <haosd...@gmail.com> wrote:

> Oh, @Marco. Thank you very much for your reply. vinodkone shepherded this,
> and it has already been committed after reviews from other kind folks.
>
> For MESOS-3841, it should be resolved now, because you can get the leading
> master via the "/redirect" endpoint. Do you have any concerns about it? I
> would be happy to address them.
>
> On Sun, May 1, 2016 at 12:12 PM, Marco Massenzio <m.massen...@gmail.com>
> wrote:
>
>> @haosdent - thanks for doing this, very useful indeed!
>>
>> On a related issue [0], I'd like to take that one on:
>>
>> - can anyone comment if that's a good idea/bad idea; and
>> - would anyone be willing to shepherd it?
>>
>> Thanks!
>>
>> [0] Master HTTP API support to get the leader (
>> https://issues.apache.org/jira/browse/MESOS-3841)
>>
>> --
>> *Marco Massenzio*
>> http://codetrips.com
>>
>> On Tue, Apr 19, 2016 at 12:34 AM, haosdent <haosd...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> We intend to introduce a breaking change[1] in the HTTP endpoints
>>> without a deprecation cycle.
>>> For the HTTP endpoints below, when a user sends a request to a master
>>> that is not the leader,
>>> the user will get a 307 (TEMPORARY_REDIRECT) redirect to the leader master.
>>>
>>> * /create-volumes
>>> * /destroy-volumes
>>> * /frameworks
>>> * /reserve
>>> * /slaves
>>> * /quota
>>> * /weights
>>> * /state
>>> * /state.json
>>> * /state-summary
>>> * /tasks
>>> * /tasks.json
>>> * /roles
>>> * /roles.json
>>> * /teardown
>>> * /maintenance/schedule
>>> * /maintenance/status
>>> * /machine/down
>>> * /machine/up
>>> * /unreserve
>>>
>>> For the other master endpoints, the behaviour is unchanged.
>>>
>>> If your existing framework/tool relied on the old behaviour, I suggest
>>> adding logic to handle the 307 redirect response.
>>> Please let me know if you have any queries/concerns. Any comments will
>>> be appreciated.
>>>
>>> Links:
>>> [1]  Tracking JIRA: https://issues.apache.org/jira/browse/MESOS-1865
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>
>>
>
>
> --
> Best Regards,
> Haosdent Huang
>


Re: [Breaking Change, MESOS-1865] Redirect to the leader master when current master is not a leader

2016-04-30 Thread Marco Massenzio
@haosdent - thanks for doing this, very useful indeed!

On a related issue [0], I'd like to take that one on:

- can anyone comment if that's a good idea/bad idea; and
- would anyone be willing to shepherd it?

Thanks!

[0] Master HTTP API support to get the leader (
https://issues.apache.org/jira/browse/MESOS-3841)

-- 
*Marco Massenzio*
http://codetrips.com

On Tue, Apr 19, 2016 at 12:34 AM, haosdent <haosd...@gmail.com> wrote:

> Hi All,
>
> We intend to introduce a breaking change[1] in the HTTP endpoints without
> a deprecation cycle.
> For the HTTP endpoints below, when a user sends a request to a master
> that is not the leader,
> the user will get a 307 (TEMPORARY_REDIRECT) redirect to the leader master.
>
> * /create-volumes
> * /destroy-volumes
> * /frameworks
> * /reserve
> * /slaves
> * /quota
> * /weights
> * /state
> * /state.json
> * /state-summary
> * /tasks
> * /tasks.json
> * /roles
> * /roles.json
> * /teardown
> * /maintenance/schedule
> * /maintenance/status
> * /machine/down
> * /machine/up
> * /unreserve
>
> For the other master endpoints, the behaviour is unchanged.
>
> If your existing framework/tool relied on the old behaviour, I suggest
> adding logic to handle the 307 redirect response.
> Please let me know if you have any queries/concerns. Any comments will be
> appreciated.
>
> Links:
> [1]  Tracking JIRA: https://issues.apache.org/jira/browse/MESOS-1865
>
> --
> Best Regards,
> Haosdent Huang
>


Re: Safe update of agent attributes

2016-02-22 Thread Marco Massenzio
IIRC you can avoid the issue by either using a different work_dir for the
agent, or removing (and, possibly, re-creating) it.

I'm afraid I don't have a running instance of Mesos on this machine and
can't test it out.

Also (and this is strictly my opinion :) I would consider a change of
attribute a "material" change for the Agent and I would avoid trying to
recover state from previous runs; but, again, there may be perfectly
legitimate cases in which this is desirable.
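To make the "remove the work_dir state" option concrete, here is a rough (and destructive!) sketch. It assumes the agent keeps its recovery checkpoint under `<work_dir>/meta/slaves/latest`, which is how recent Mesos versions lay it out; verify the path against your version before using anything like this, and note that recoverable tasks on that agent are lost.

```python
from pathlib import Path

def forget_agent_state(work_dir):
    """Remove the agent's recovery checkpoint so it registers afresh.

    WARNING: destructive. The agent loses its identity and any
    recoverable tasks. The <work_dir>/meta/slaves/latest layout is an
    assumption to check against your Mesos version.
    """
    latest = Path(work_dir) / "meta" / "slaves" / "latest"
    if latest.is_symlink() or latest.exists():
        latest.unlink()
        return True
    return False
```

After this, the agent starts as a brand-new agent (new SlaveID), so the "Incompatible slave info detected" check no longer applies.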

-- 
*Marco Massenzio*
http://codetrips.com

On Mon, Feb 22, 2016 at 12:11 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:

> Hi,
>
> We recently discovered that updating attributes on Mesos agents is a very
> risky operation: if not done properly, it can send agent(s) into a crash
> loop with errors like "Failed to perform recovery: Incompatible slave info
> detected". Combined with --recovery_timeout, this makes the situation even
> worse.
>
> In our setup, some of the attributes are generated from automated
> configuration management system, so this opens a possibility that "bad"
> configuration could be left on the machine and causing big trouble on next
> agent upgrade, if the USR1 signal was not sent on time.
>
> Some questions:
>
> 1. Does anyone have a good practice recommended on managing these
> attributes safely?
> 2. Has Mesos considered falling back to the old metadata if it detects an
> incompatibility, so agents would keep running with the old attributes
> instead of falling into a crash loop?
>
> Thanks.
>
> --
> Cheers,
>
> Zhitao Li
>


Re: Using Virtual Hosts

2016-02-11 Thread Marco Massenzio
How are you launching your tasks and are they containerized?

If you use your own framework and launch tasks in containers, you can
configure the networking mode as BRIDGED (in ContainerInfo); your Framework
will obtain the port in the response/callback it receives after the task is
launched (and it already knows the hostname/IP of the Agent whose Offer it
accepted). This information can then be fed to whatever discovery mechanism
you use (or, more trivially, into the Framework's Web UI, which itself can
be advertised to the Master via the `webui_url` field in the FrameworkInfo
protobuf [0]).

I don't know enough about Marathon to really be able to help there - but if
you use it, post the question in their user group: I'm sure there's a less
involved way to do this. :)

[0] ./include/mesos/mesos.proto, line 206

-- 
*Marco Massenzio*
http://codetrips.com

On Thu, Feb 11, 2016 at 9:27 PM, Jeff Schroeder <jeffschroe...@computer.org>
wrote:

> With a few of the newly added features, marathon-lb is actually a pretty
> elegant solution:
>
> https://github.com/mesosphere/marathon-lb
>
>
> On Thursday, February 11, 2016, Alfredo Carneiro <
> alfr...@simbioseventures.com> wrote:
>
>> Hi guys,
>>
>> I have been searching for the past few weeks about Mesos and VHosts;
>> sadly, I have not found anything useful.
>>
>> I have a Mesos cluster running some webapps. I have assigned specific
>> ports to these apps, so I access them using
>> *http://<agent-ip>:<port>*. How could I use Virtual Hosts to access
>> these apps instead, i.e. *http://myapp.com*?
>>
>> 1x Mesos Master with HAProxy and Chronos
>> 9x Mesos Slave with Docker
>>
>> Thanks,
>>
>> --
>> Alfredo Miranda
>>
>
>
> --
> Text by Jeff, typos by iPhone
>


Re: mesos 0.23, long time querying state.json data.

2016-02-01 Thread Marco Massenzio
+1 to what Neil says

plus, if you don't need all the info contained in /state, /state-summary is
a much faster option.

-- 
*Marco Massenzio*
http://codetrips.com

On Mon, Feb 1, 2016 at 8:27 AM, Neil Conway <neil.con...@gmail.com> wrote:

> There are some known performance problems with the implementation of
> the /state endpoint in prior versions of Mesos (see MESOS-2353 for
> details). In Mesos 0.27, the performance of /state should be much,
> much faster.
>
> Neil
>
> On Mon, Feb 1, 2016 at 8:02 AM, tommy xiao <xia...@gmail.com> wrote:
> > David, thanks for your quick response; I will feed back the result ASAP.
> >
> > haosdent, I am not sure, but the cluster is only 6 nodes; it is not a
> > large cluster. Through MESOS-2353, I found this description: "Looking at
> > perf data, it seems most of the time is spent doing memory allocation /
> > de-allocation." Do you know how to run that command? I can run a test
> > with it.
> >
> >
> >
> > 2016-02-01 20:09 GMT+08:00 haosdent <haosd...@gmail.com>:
> >>
> >> Maybe this related to your problem
> >> https://issues.apache.org/jira/browse/MESOS-2353 ?
> >>
> >> On Mon, Feb 1, 2016 at 8:02 PM, tommy xiao <xia...@gmail.com> wrote:
> >>>
> >>> On a bare metal server, a state.json query on port 5051 hangs for a
> >>> long time before returning the JSON data - about 5 minutes. The curious
> >>> thing is that, on the same port 5051, the /help command works correctly.
> >>> So I wonder: has anyone come across this case before? I have no clues
> >>> for debugging it.
> >>>
> >>> --
> >>> Deshi Xiao
> >>> Twitter: xds2000
> >>> E-mail: xiaods(AT)gmail.com
> >>
> >>
> >>
> >>
> >> --
> >> Best Regards,
> >> Haosdent Huang
> >
> >
> >
> >
> > --
> > Deshi Xiao
> > Twitter: xds2000
> > E-mail: xiaods(AT)gmail.com
>


Re: [VOTE] Release Apache Mesos 0.27.0 (rc2)

2016-01-29 Thread Marco Massenzio
On Fri, Jan 29, 2016 at 7:00 PM, Marco Massenzio <m.massen...@gmail.com>
wrote:

> Is there a 0.27.0-rc2 branch cut?
>
> $ git fetch --all
> Fetching origin
>
> $ git co 0.27.0-rc2
> error: pathspec '0.27.0-rc2' did not match any file(s) known to git.
>
> well, or a tag, for that matter...

$ git tag | grep 27
0.27.0-rc1


>
> --
> *Marco Massenzio*
> http://codetrips.com
>
> On Wed, Jan 27, 2016 at 11:12 PM, Michael Park <mp...@apache.org> wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 0.27.0.
>>
>> 0.27.0 includes the following:
>>
>> 
>> We added major features such as Implicit Roles, Quota, Multiple Disks and
>> more.
>>
>> We also added major bug fixes such as performance improvements to
>> state.json requests and GLOG.
>>
>> The CHANGELOG for the release is available at:
>>
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.0-rc2
>>
>> 
>>
>> The candidate for Mesos 0.27.0 release is available at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz
>>
>> The tag to be voted on is 0.27.0-rc2:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.0-rc2
>>
>> The MD5 checksum of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.md5
>>
>> The signature of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is up in Maven in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1100
>>
>> Please vote on releasing this package as Apache Mesos 0.27.0!
>>
>> The vote is open until Sat Jan 30 23:59:59 PST 2016 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 0.27.0
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>>
>> Tim, Kapil, MPark
>>
>
>


Re: [VOTE] Release Apache Mesos 0.27.0 (rc2)

2016-01-29 Thread Marco Massenzio
Is there a 0.27.0-rc2 branch cut?

$ git fetch --all
Fetching origin

$ git co 0.27.0-rc2
error: pathspec '0.27.0-rc2' did not match any file(s) known to git.


-- 
*Marco Massenzio*
http://codetrips.com

On Wed, Jan 27, 2016 at 11:12 PM, Michael Park <mp...@apache.org> wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 0.27.0.
>
> 0.27.0 includes the following:
>
> 
> We added major features such as Implicit Roles, Quota, Multiple Disks and
> more.
>
> We also added major bug fixes such as performance improvements to
> state.json requests and GLOG.
>
> The CHANGELOG for the release is available at:
>
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.0-rc2
>
> 
>
> The candidate for Mesos 0.27.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz
>
> The tag to be voted on is 0.27.0-rc2:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.0-rc2
>
> The MD5 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.md5
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is up in Maven in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1100
>
> Please vote on releasing this package as Apache Mesos 0.27.0!
>
> The vote is open until Sat Jan 30 23:59:59 PST 2016 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 0.27.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
>
> Tim, Kapil, MPark
>


Re: [VOTE] Release Apache Mesos 0.27.0 (rc2)

2016-01-29 Thread Marco Massenzio
Thanks, buddy - I keep forgetting that one!
(one assumes --all would, well, take care of that too :)

Have a great weekend!

-- 
*Marco Massenzio*
http://codetrips.com

On Fri, Jan 29, 2016 at 7:06 PM, Vinod Kone <vinodk...@gmail.com> wrote:

> Git fetch --tags
>
> @vinodkone
>
> On Jan 29, 2016, at 7:00 PM, Marco Massenzio <m.massen...@gmail.com>
> wrote:
>
> Is there a 0.27.0-rc2 branch cut?
>
> $ git fetch --all
> Fetching origin
>
> $ git co 0.27.0-rc2
> error: pathspec '0.27.0-rc2' did not match any file(s) known to git.
>
>
> --
> *Marco Massenzio*
> http://codetrips.com
>
> On Wed, Jan 27, 2016 at 11:12 PM, Michael Park <mp...@apache.org> wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 0.27.0.
>>
>> 0.27.0 includes the following:
>>
>> 
>> We added major features such as Implicit Roles, Quota, Multiple Disks and
>> more.
>>
>> We also added major bug fixes such as performance improvements to
>> state.json requests and GLOG.
>>
>> The CHANGELOG for the release is available at:
>>
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.27.0-rc2
>>
>> 
>>
>> The candidate for Mesos 0.27.0 release is available at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz
>>
>> The tag to be voted on is 0.27.0-rc2:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.27.0-rc2
>>
>> The MD5 checksum of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.md5
>>
>> The signature of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/0.27.0-rc2/mesos-0.27.0.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is up in Maven in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1100
>>
>> Please vote on releasing this package as Apache Mesos 0.27.0!
>>
>> The vote is open until Sat Jan 30 23:59:59 PST 2016 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 0.27.0
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>>
>> Tim, Kapil, MPark
>>
>
>


Re: Basic questions about use of ZooKeeper

2016-01-17 Thread Marco Massenzio
Hi Michal,

1. While watching some talk I've heard that maybe in the future ZooKeeper
> won't be needed. Is this still planned?
>

At some point there has been talk of moving towards using etcd instead of
ZooKeeper: you can look into Jira[0], and it seems that MESOS-1806[1] is the
one that has received the most attention/activity.

Others may be able to provide more detailed guidance, but the impression I
have is that it may be some time before this becomes available as a
production-ready alternative.


> 2. We're using mainly quite large boxes (>= 20 CPUs, >= 48GB RAM). Is it
> advised to put Mesos master and warm backup nodes inside ZooKeeper's
> cluster? (just to avoid wasting resources).
>
There is really no reason not to have Master/ZooKeeper servers co-located -
in fact, this is the way DCOS CE is deployed in AWS (or, at least, the way
it used to be the last time I looked into it).

Hope this helps!

[0]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20Mesos%20and%20text%20~%20%22etcd%22
[1] https://issues.apache.org/jira/browse/MESOS-1806

-- 
*Marco Massenzio*
http://codetrips.com


Re: installing a framework after teardown

2016-01-17 Thread Marco Massenzio
Hey Viktor,

I'm not clear what you mean by "re-install the same framework" - do you
mean, just restarting the binary?
If so, as Vinod pointed out, you should re-register with a SUBSCRIBE
message and obtain a new FrameworkId in the response.

And, yes, the name can stay perfectly the same (in fact, you can have
several frameworks with the same name - but different IDs - connect to the
same Master).

Are you using the C++ API or the new HTTP API?

If the latter, please have a look at the example here[0] for how to
"terminate" a framework and then reconnect it.
(in particular, see the sections "Registering a Framework" and "Terminating
a Framework").

If the former, see [1], where I set the `name` in the `FrameworkInfo` (that
one stays the same across runs) but not the ID (that one gets returned by
the `registered()`[2] method and can be used, if necessary, elsewhere in the
code, for example when accepting offers, but should otherwise stay unique to
the framework).

There are also many (better!) example frameworks in the "examples" folder of
the Mesos source code[3]; you may want to take a look there too.


[0]
https://github.com/massenz/zk-mesos/blob/develop/notebooks/Demo-API.ipynb
[1]
https://github.com/massenz/mongo_fw/blob/develop/src/mongo_scheduler.cpp#L194
[2]
https://github.com/massenz/mongo_fw/blob/develop/src/mongo_scheduler.cpp#L70
[3]
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=tree;f=src/examples
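To make the "new run, new ID" point concrete, here is a rough sketch of a SUBSCRIBE call body for the v1 scheduler HTTP API. The field names are my recollection of scheduler.proto, so double-check them against your Mesos version; the helper itself is illustrative. The key point: after a teardown, leave the ID out and let the master assign a fresh one, while `name` can stay the same.

```python
import json

def subscribe_call(name, user="root", framework_id=None):
    """Build the JSON body for a SUBSCRIBE call to /api/v1/scheduler.

    Omit framework_id on a fresh (re-)registration after a teardown;
    the master then responds with a newly assigned FrameworkID. Set it
    only when re-subscribing as the *same* framework after a disconnect.
    """
    info = {"user": user, "name": name}
    call = {"type": "SUBSCRIBE", "subscribe": {"framework_info": info}}
    if framework_id is not None:
        info["id"] = {"value": framework_id}
        call["framework_id"] = {"value": framework_id}
    return json.dumps(call)
```

So `subscribe_call("my-framework")` is what Viktor's scheduler should send after the teardown, rather than reusing the removed framework's ID.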

-- 
*Marco Massenzio*
http://codetrips.com

On Sat, Jan 16, 2016 at 12:52 AM, Viktor Sadovnikov <vik...@jv-ration.com>
wrote:

> Yes, I need to re-install the same framework. It can get another ID, but
> its name should remain the same.
> I thought the framework ID was dynamically assigned by the Master upon
> connection, and did not expect the Master to provide the same ID.
>
> On Fri, Jan 15, 2016 at 10:15 PM, Vinod Kone <vinodk...@apache.org> wrote:
>
>> What do you mean by gracefully recover? If you mean the ability to
>> reconnect, you need to change the framework id in the FrameworkInfo when
>> registering with the master.
>>
>> As a hack, you could restart the master, so that it forgets that it
>> removed the framework with id  and hence allows it to re-register with
>> the old id.
>>
>> On Fri, Jan 15, 2016 at 5:37 AM, Viktor Sadovnikov <vik...@jv-ration.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I have removed a framework from Mesos Cluster by curl -X POST -d
>>> 'frameworkId=-b036-4cb7-af53-4c837dc9521d-0002' 
>>> http://${MASTER_IP}:5050/master/teardown;.
>>> This successfully removed all the framework tasks and scheduler.
>>>
>>> However now Mesos Cluster rejects my attempts to re-install the
>>> framework. Is there a way to gracefully recover from this situation?
>>>
>>> I0115 12:54:57.916470 28856 sched.cpp:1024] Got error 'Framework has
>>> been removed'
>>> I0115 12:54:57.916509 28856 sched.cpp:1805] Asked to abort the driver
>>> I0115 12:54:57.916824 28856 sched.cpp:1070] Aborting framework
>>> '8ca5c18f-b036-4cb7-af53-4c837dc9521d-0001'
>>>
>>> With regards,
>>> Viktor
>>>
>>
>>
>


Re: Running mesos slave in Docker on CoreOS

2015-12-31 Thread Marco Massenzio
Granted, I know close to nothing about CoreOS (and very little about Docker
itself), but usually exit code 127 means a "not found" binary - are you sure
that `docker` is in the PATH of the user/process running the Mesos agent?

Much longer shot - but worth a try: look into the permissions around the
/var/run folder - what happens if you try to run the very same command that
failed, from the shell?
(but I do see that you mount it with the -v, so that should work, shouldn't
it?)

-- 
*Marco Massenzio*
http://codetrips.com

On Thu, Dec 31, 2015 at 1:17 PM, Taylor, Graham <
graham.x.tay...@capgemini.com> wrote:

> I did try removing the /proc and adding just pid=host but still no dice
> with that. Need to have a deeper dig into the docker 1.9 changelog. Will
> post back if I find anything.
>
> Thanks,
> Graham.
>
> On 31 Dec 2015, at 20:27, Tim Chen <t...@mesosphere.io> wrote:
>
> I don't think you need to mount in /proc if you have --pid=host already,
> can you try that?
>
> Tim
>
> On Thu, Dec 31, 2015 at 4:16 AM, Taylor, Graham <
> graham.x.tay...@capgemini.com> wrote:
>
>> Hey folks,
>> I’m trying to get Mesos slave up and running in a docker container on
>> CoreOS. I’ve successfully got the master up and running but anytime I start
>> the slave container I receive the following error -
>>
>> Failed to create a containerizer: Could not create DockerContainerizer:
>> Failed to create docker: Failed to get docker version: Failed to execute
>> 'docker -H unix:///var/run/docker.sock --version': exited with status 127
>>
>> I’m starting the slave container with the following command -
>>
>> /usr/bin/docker run --rm --name mesos_slave \
>> --net=host \
>> --privileged \
>> --pid=host \
>> -p 5051:5051 \
>> -v /sys:/sys \
>> -v /proc:/host/proc:ro \
>> -v /usr/bin/docker:/usr/bin/docker:ro \
>> -v /var/run/docker.sock:/var/run/docker.sock \
>> -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
>> -e "MESOS_MASTER=zk://172.31.1.11:2181,172.31.1.12:2181,
>> 172.31.1.13:2181/mesos" \
>> -e "MESOS_EXECUTOR_REGISTRATION_TIMEOUT=10mins" \
>> -e "MESOS_CONTAINERIZERS=docker" \
>> -e "MESOS_RESOURCES=ports(*):[31000-32000]" \
>> -e "MESOS_IP=172.31.1.14" \
>> -e "MESOS_WORK_DIR=/tmp/mesos" \
>> -e "MESOS_HOSTNAME=172.31.1.14" \
>> mesosphere/mesos-slave:0.25.0-0.2.70.ubuntu1404
>>
>> I’ve also tried with various other versions of the Docker image
>> (including 0.26.0) but I keep receiving the same error.
>>
>> I’m running on CoreOS beta channel (877.1.0) which has docker installed
>> and the service running -
>>
>> docker --version
>> Docker version 1.9.1, build 4419fdb-dirty
>>
>>
>> If I change the /proc mount to be /proc:/proc I get past the docker
>> version error but receive a different error -
>>
>> Error response from daemon: Cannot start container
>> 51a9b60f702a0f13f975fd2e7f4b642180d5363565e042702665098e8761b758: [8]
>> System error:
>> "/var/lib/docker/overlay/51a9b60f702a0f13f975fd2e7f4b642180d5363565e042702665098e8761b758/merged/proc"
>> cannot be mounted because it is located inside "/proc”
>>
>>
>> I had a search on the wiki and found some similar related issues
>> https://issues.apache.org/jira/browse/MESOS-3498?jql=project%20%3D%20MESOS%20AND%20text%20~%20%22Failed%20to%20execute%20%27docker%20version%22
>>  but
>> they all seem to be closed/resolved/won’t fix.
>>
>> Is anyone successfully running a slave on CoreOS and can help me fix up
>> my Docker command?
>>
>> Thanks,
>> Graham.
>>
>>
>> --
>>
>> Capgemini is a trading name used by the Capgemini Group of companies
>> which includes Capgemini UK plc, a company registered in England and Wales
>> (number 943935) whose registered office is at No. 1, Forge End, Woking,
>> Surrey, GU21 6DB.
>>
>
>
> --
>
> Capgemini is a trading name used by the Capgemini Group of companies which
> includes Capgemini UK plc, a company registered in England and Wales
> (number 943935) whose registered office is at No. 1, Forge End, Woking,
> Surrey, GU21 6DB.
>


Re: How can mesos print logs from VLOG function?

2015-12-30 Thread Marco Massenzio
Mesos uses Google Logging[0] and, according to the documentation there, the
VLOG(n) calls are only logged if the variable GLOG_v=m (with m >= n) is set
in the environment when running Mesos (the other suggested way, using --v=m,
won't work for Mesos).

Having said that, I have recently been unable to make this work - so there
may be some other trickery at work.

[0] https://google-glog.googlecode.com/svn/trunk/doc/glog.html
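A tiny sketch of the "set GLOG_v in the environment" point. The mesos-master path is a placeholder for wherever your build lives; the GLOG_v behaviour is as described in the glog documentation above.

```python
import os
import subprocess

def run_with_vlog(cmd, level, **run_kwargs):
    """Run a command with Google-logging verbosity `level`.

    Mesos reads GLOG_v from the environment (the --v flag won't work),
    so VLOG(n) statements are emitted for every n <= level.
    """
    env = dict(os.environ, GLOG_v=str(level))
    return subprocess.run(cmd, env=env, **run_kwargs)

# e.g.: run_with_vlog(["./bin/mesos-master.sh", "--work_dir=/tmp/mesos"], 2)
```

Equivalently, from a shell: prefix the command with `GLOG_v=2`.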

-- 
*Marco Massenzio*
http://codetrips.com

On Wed, Dec 30, 2015 at 12:30 AM, Nan Xiao <xiaonan830...@gmail.com> wrote:

> Hi all,
>
> I want Mesos to print logs from the VLOG function:
>
> VLOG(1) << "Executor started at: " << self()
> << " with pid " << getpid();
>
> But from mesos help:
>
> $ sudo ./bin/mesos-master.sh --help | grep -i LOG
>   --external_log_file=VALUESpecified the externally
> managed log file. This file will be
>stderr logging as the log
> file is otherwise unknown to Mesos.
>   --[no-]initialize_driver_logging Whether to automatically
> initialize Google logging of scheduler
>   --[no-]log_auto_initialize   Whether to automatically
> initialize the replicated log used for the
>registry. If this is set to
> false, the log has to be manually
>   --log_dir=VALUE  Directory path to put log
> files (no default, nothing
>does not affect logging to
> stderr).
>NOTE: 3rd party log
> messages (e.g. ZooKeeper) are
>   --logbufsecs=VALUE   How many seconds to buffer
> log messages for (default: 0)
>   --logging_level=VALUELog message at or above
> this level; possible values:
>will affect just the logs
> from log_dir (if specified) (default: INFO)
>   --[no-]quiet Disable logging to stderr
> (default: false)
>   --quorum=VALUE   The size of the quorum of
> replicas when using 'replicated_log' based
>available options are
> 'replicated_log', 'in_memory' (for testing). (default: replicated_log)
>
> I can't find related configurations.
>
> So how can Mesos print logs from the VLOG function? Thanks in advance!
>
> Best Regards
> Nan Xiao
>


Re: Is it safe to replace mesos-master in fly

2015-11-24 Thread Marco Massenzio
The closest I could find is [0], but granted, much more detail could be
desirable :)
FYI - you may also want to check out the Maintenance Primitives [1] and
upgrades [2] (which is actually not directly applicable to your stated use
case, but may be of interest for future reference).

In any event, you're doing it right.
As for the "reasonable time to wait", I'm afraid I don't really have a good
feel for it: keeping an eye on the logs will probably help, but I'm sure
other folks on this list will have a much more satisfying answer.

Let us know how you get along - and if you want to contribute back by
documenting how you did it, contributions are always welcome!

[0] http://mesos.apache.org/documentation/latest/operational-guide/
[1] http://mesos.apache.org/documentation/latest/maintenance/
[2] http://mesos.apache.org/documentation/latest/upgrades/

--
*Marco Massenzio*
Distributed Systems Engineer
http://codetrips.com

On Tue, Nov 24, 2015 at 8:41 AM, Chengwei Yang <chengwei.yang...@gmail.com>
wrote:

> Thanks @Tommy,
>
> Since I didn't find any official document about migrating mesos-master or
> resizing the mesos-master quorum, I came here to confirm before anything
> I had missed could surprise me. :-)
>
> --
> Thanks,
> Chengwei
>
> On Wed, Nov 25, 2015 at 12:07:43AM +0800, tommy xiao wrote:
> > This is the correct way to upgrade your Mesos cluster; for more details,
> > see the release notes in the Mesos documentation.
> >
> > 2015-11-24 9:47 GMT+08:00 Chengwei Yang <chengwei.yang...@gmail.com>:
> >
> > Hi all,
> >
> > We're using Mesos in production on CentOS 6 and plan to upgrade CentOS
> > to 7.1. To avoid affecting any tasks running on Mesos, we're about to
> > replace all mesos-masters on the fly.
> >
> > The procedure listed below:
> >
> > 0. 3 mesos-masters running on CentOS 6
> > 1. shutdown 1 mesos-master (CentOS 6) and bring up 1 mesos-master
> >    (CentOS 7); wait some time for the new master to sync (is there any
> >    simple way to know when?)
> > 2. repeat step 1
> >
> > NOTE: we plan to shut down the non-leaders first, and shut down the
> > leader (CentOS 6) last.
> >
> > Can we do this in such way? Or any other better suggestions?
> >
> > --
> > Thanks,
> > Chengwei
> >
> >
> >
> >
> > --
> > Deshi Xiao
> > Twitter: xds2000
> > E-mail: xiaods(AT)gmail.com
> > SECURITY NOTE: file ~/.netrc must not be accessible by others
>


Re: Zookeeper cluster changes

2015-11-09 Thread Marco Massenzio
The way I would do it in a production cluster would be *not* to use
directly IP addresses for the ZK ensemble, but instead rely on some form of
internal DNS and use internally-resolvable hostnames (eg, {zk1, zk2, ...}.
prod.example.com etc) and have the provisioning tooling (Chef, Puppet,
Ansible, what have you) handle the setting of the hostname when
restarting/replacing a failing/crashed ZK server.

This way your list of zk's to Mesos never changes, even though the FQN's
will map to different IPs / VMs.

Obviously, this may not be always desirable / feasible (eg, if your prod
environment does not support DNS resolution).

You are correct in that Mesos does not currently support dynamically
changing the ZK's addresses, but I don't know whether that's a limitation
of Mesos code or of the ZK C++ client driver.
I'll look into it and let you know what I find (if anything).

--
*Marco Massenzio*
Distributed Systems Engineer
http://codetrips.com

On Mon, Nov 9, 2015 at 6:01 AM, Donald Laidlaw <donlaid...@me.com> wrote:

> How do mesos masters and slaves react to zookeeper cluster changes? When
> the masters and slaves start they are given a set of addresses to connect
> to zookeeper. But over time, one of those zookeepers fails, and is replaced
> by a new server at a new address. How should this be handled in the mesos
> servers?
>
> I am guessing that mesos does not automatically detect and react to that
> change. But obviously we should do something to keep the mesos servers
> happy as well. What should we do?
>
> The obvious thing is to stop the mesos servers, one at a time, and restart
> them with the new configuration. But it would be really nice to be able to
> do this dynamically without restarting the server. After all, coordinating
> a rolling restart is a fairly hard job.
>
> Any suggestions or pointers?
>
> Best regards,
> Don Laidlaw
>
>
>


Re: Welcome Kapil as Mesos committer and PMC member!

2015-11-05 Thread Marco Massenzio
Awesome stuff!
Congratulations, Kapil - totally deserved!

On Thursday, November 5, 2015, Vinod Kone <vinodk...@gmail.com> wrote:

> welcome kapil!
>
> On Thu, Nov 5, 2015 at 6:49 AM, <connor@gmail.com> wrote:
>
>> Congrats Dr. Arya!
>>
>> > On Nov 5, 2015, at 02:02, Till Toenshoff <toensh...@me.com> wrote:
>> >
>> > I'm happy to announce that Kapil Arya has been voted a Mesos committer
>> and PMC member!
>> >
>> > Welcome Kapil, and thanks for all of your great contributions to the
>> project so far!
>> >
>> > Looking forward to lots more of your contributions!
>> >
>> > Thanks
>> > Till
>>
>
>

-- 
--
*Marco Massenzio*
Distributed Systems Engineer
http://codetrips.com


Re: error: 'sasl_errdetail' is deprecated: first deprecated in OS X 10.11

2015-10-12 Thread Marco Massenzio
I'm almost sure that you're running into
https://issues.apache.org/jira/browse/MESOS-3030
(there is a patch out to fix this: https://reviews.apache.org/r/39230/)

--
*Marco Massenzio*
Distributed Systems Engineer
http://codetrips.com

On Mon, Oct 12, 2015 at 4:54 PM, yuankui <kui.y...@fraudmetrix.cn> wrote:

> hello, buddies
>
> I'm compiling Mesos on Mac OS X 10.11 (El Capitan) and have come across
> some errors, as follows
> version: mesos-0.24.0 & mesos-0.25.0-rc3
>
>
> /usr/include/sasl/sasl.h:757:25: note: 'sasl_errstring' has been
> explicitly marked deprecated here
> LIBSASL_API const char *sasl_errstring(int saslerr,
>^
> ../../src/authentication/cram_md5/authenticator.cpp:334:20: error:
> 'sasl_errdetail' is deprecated: first deprecated in OS X 10.11
>  [-Werror,-Wdeprecated-declarations]
>  string error(sasl_errdetail(connection));
>   ^
> /usr/include/sasl/sasl.h:770:25: note: 'sasl_errdetail' has been
> explicitly marked deprecated here
> LIBSASL_API const char *sasl_errdetail(sasl_conn_t *conn)
> __OSX_AVAILABLE_BUT_DEPRECATED(__MAC_10_0,__MAC_10_11,__IPHONE_NA,__IPHONE_NA);
>^
> ../../src/authentication/cram_md5/authenticator.cpp:514:18: error:
> 'sasl_server_init' is deprecated: first deprecated in OS X 10.11
>  [-Werror,-Wdeprecated-declarations]
>int result = sasl_server_init(NULL, "mesos");
> ^
> /usr/include/sasl/sasl.h:1016:17: note: 'sasl_server_init' has been
> explicitly marked deprecated here
> LIBSASL_API int sasl_server_init(const sasl_callback_t *callbacks,
>^
> ../../src/authentication/cram_md5/authenticator.cpp:519:11: error:
> 'sasl_errstring' is deprecated: first deprecated in OS X 10.11
>  [-Werror,-Wdeprecated-declarations]
>  sasl_errstring(result, NULL, NULL));
>  ^
> /usr/include/sasl/sasl.h:757:25: note: 'sasl_errstring' has been
> explicitly marked deprecated here
> LIBSASL_API const char *sasl_errstring(int saslerr,
>^
> ../../src/authentication/cram_md5/authenticator.cpp:521:16: error:
> 'sasl_auxprop_add_plugin' is deprecated: first deprecated in OS X 10.11
>  [-Werror,-Wdeprecated-declarations]
>  result = sasl_auxprop_add_plugin(
>   ^
> /usr/include/sasl/saslplug.h:1013:17: note: 'sasl_auxprop_add_plugin' has
> been explicitly marked deprecated here
> LIBSASL_API int sasl_auxprop_add_plugin(const char *plugname,
>^
> ../../src/authentication/cram_md5/authenticator.cpp:528:13: error:
> 'sasl_errstring' is deprecated: first deprecated in OS X 10.11
>  [-Werror,-Wdeprecated-declarations]
>sasl_errstring(result, NULL, NULL));
>^
> /usr/include/sasl/sasl.h:757:25: note: 'sasl_errstring' has been
> explicitly marked deprecated here
> LIBSASL_API const char *sasl_errstring(int saslerr,
>^
>
> As I'm not familiar with C++, I don't know how to solve this.
>
> I believe I'm not the first one who has come across this problem, so I'm
> here for help!
> Thanks.
>
>
>


Re: Can health-checks be run by Mesos for docker tasks?

2015-10-12 Thread Marco Massenzio
Are those the stdout logs of the Agent? I don't see --launcher_dir set
there; however, if I look into one that is running off the same 0.24.1
package, this is what I see:

I1012 14:56:36.933856  1704 slave.cpp:191] Flags at startup:
--appc_store_dir="/tmp/mesos/store/appc"
--attributes="rack:r2d2;pod:demo,dev" --authenticatee="crammd5"
--cgroups_cpu_enable_pids_and_tids_count="false"
--cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
--cgroups_limit_swap="false" --cgroups_root="mesos"
--container_disk_watch_interval="15secs" --containerizers="docker,mesos"
--default_role="*" --disk_watch_interval="1mins" --docker="docker"
--docker_kill_orphans="true" --docker_remove_delay="6hrs"
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns"
--enforce_container_disk_quota="false"
--executor_registration_timeout="1mins"
--executor_shutdown_grace_period="5secs"
--fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
--frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
--hadoop_home="" --help="false" --initialize_driver_logging="true"
--ip="192.168.33.11" --isolation="cgroups/cpu,cgroups/mem"
--launcher_dir="/usr/libexec/mesos"
--log_dir="/var/local/mesos/logs/agent" --logbufsecs="0"
--logging_level="INFO" --master="zk://192.168.33.1:2181/mesos/vagrant"
--oversubscribed_resources_interval="15secs" --perf_duration="10secs"
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
--quiet="false" --recover="reconnect" --recovery_timeout="15mins"
--registration_backoff_factor="1secs"
--resource_monitoring_interval="1secs"
--resources="ports:[9000-1];ephemeral_ports:[32768-57344]"
--revocable_cpu_low_priority="true"
--sandbox_directory="/var/local/sandbox" --strict="true"
--switch_user="true" --version="false" --work_dir="/var/local/mesos/agent"
(this is run off the Vagrantfile at [0] in case you want to reproduce).
That agent is not run via the init command, though; I execute it manually
via `run-agent.sh` in the same directory.

I don't really think this matters, but I assume you also restarted the
agent after making the config changes?
(And, for your own sanity, you can double-check the version by looking at
the very head of the logs.)
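If eyeballing that "Flags at startup" line gets tedious, the quoted flags can be pulled into a dict with a few lines of Python. This is a quick sketch, not part of Mesos itself; it assumes flag values are double-quoted exactly as in the log excerpt above:

```python
import re

def parse_flags(log_line):
    # Pull --key="value" pairs out of an agent's "Flags at startup" log line,
    # so you can check which launcher_dir actually took effect.
    return dict(re.findall(r'--(\w+)="([^"]*)"', log_line))

line = ('I1012 14:56:36.933856  1704 slave.cpp:191] Flags at startup: '
        '--docker="docker" --launcher_dir="/usr/libexec/mesos" '
        '--logging_level="INFO"')
print(parse_flags(line)["launcher_dir"])  # -> /usr/libexec/mesos
```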






--
*Marco Massenzio*
Distributed Systems Engineer
http://codetrips.com

On Mon, Oct 12, 2015 at 10:50 PM, Jay Taylor <outtat...@gmail.com> wrote:

> Hi Haosdent and Mesos friends,
>
> I've rebuilt the cluster from scratch and installed mesos 0.24.1 from the
> mesosphere apt repo:
>
> $ dpkg -l | grep mesos
> ii  mesos   0.24.1-0.2.35.ubuntu1404
>  amd64Cluster resource manager with efficient resource isolation
>
> Then added the `launcher_dir' flag to /etc/mesos-slave/launcher_dir on the
> slaves:
>
> mesos-worker1a:~$ cat /etc/mesos-slave/launcher_dir
> /usr/libexec/mesos
>
> And yet the task health-checks are still being launched from the sandbox
> directory like before!
>
> I've also tested setting the MESOS_LAUNCHER_DIR env var and get the
> identical result (just as before on the cluster where many versions of
> mesos had been installed):
>
> STDOUT:
>
> --container="mesos-20151012-184440-1625401536-5050-23953-S0.62d43b8f-6cd1-4c53-9ac8-84dbfc45bbcb"
>> --docker="docker" --help="false" --initialize_driver_logging="true"
>> --logbufsecs="0" --logging_level="INFO"
>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>> --sandbox_directory="/tmp/mesos/slaves/20151012-184440-1625401536-5050-23953-S0/frameworks/20151012-184440-1625401536-5050-23953-/executors/hello-app_web-v3.33597b73-1943-41b4-a308-76132eebcc91/runs/62d43b8f-6cd1-4c53-9ac8-84dbfc45bbcb"
>> --stop_timeout="0ns"
>> --container="mesos-20151012-184440-1625401536-5050-23953-S0.62d43b8f-6cd1-4c53-9ac8-84dbfc45bbcb"
>> --docker="docker" --help="false" --initialize_driver_logging="true"
>> --logbufsecs="0" --logging_level="INFO"
>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>> --sandbox_directory="/tmp/mesos/slaves/20151012-184440-1625401536-5050-23953-S0/frameworks/20151012-184440-1625401536-5050-23953-/executors/hello-app_web-v3.33597b73

Re: Can health-checks be run by Mesos for docker tasks?

2015-10-12 Thread Marco Massenzio
On Mon, Oct 12, 2015 at 11:26 PM, Marco Massenzio <ma...@mesosphere.io>
wrote:

> Are those the stdout logs of the Agent? Because I don't see the
> --launcher-dir set, however, if I look into one that is running off the
> same 0.24.1 package, this is what I see:
>
> I1012 14:56:36.933856  1704 slave.cpp:191] Flags at startup:
> --appc_store_dir="/tmp/mesos/store/appc"
> --attributes="rack:r2d2;pod:demo,dev" --authenticatee="crammd5"
> --cgroups_cpu_enable_pids_and_tids_count="false"
> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
> --cgroups_limit_swap="false" --cgroups_root="mesos"
> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
> --default_role="*" --disk_watch_interval="1mins" --docker="docker"
> --docker_kill_orphans="true" --docker_remove_delay="6hrs"
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns"
> --enforce_container_disk_quota="false"
> --executor_registration_timeout="1mins"
> --executor_shutdown_grace_period="5secs"
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
> --hadoop_home="" --help="false" --initialize_driver_logging="true"
> --ip="192.168.33.11" --isolation="cgroups/cpu,cgroups/mem"
> --launcher_dir="/usr/libexec/mesos"
> --log_dir="/var/local/mesos/logs/agent" --logbufsecs="0"
> --logging_level="INFO" --master="zk://192.168.33.1:2181/mesos/vagrant"
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
> --registration_backoff_factor="1secs"
> --resource_monitoring_interval="1secs"
> --resources="ports:[9000-1];ephemeral_ports:[32768-57344]"
> --revocable_cpu_low_priority="true"
> --sandbox_directory="/var/local/sandbox" --strict="true"
> --switch_user="true" --version="false" --work_dir="/var/local/mesos/agent"
> (this is run off the Vagrantfile at [0] in case you want to reproduce).
> That agent is not run via the init command, though, I execute it manually
> via the `run-agent.sh` in the same directory.
>
> I don't really think this matters, but I assume you also restarted the
> agent after making the config changes?
> (and, for your own sanity - you can double check the version by looking at
> the very head of the logs).
>
>
> [0] http://github.com/massenz/zk-mesos

>
>
>
>
> --
> *Marco Massenzio*
> Distributed Systems Engineer
> http://codetrips.com
>
> On Mon, Oct 12, 2015 at 10:50 PM, Jay Taylor <outtat...@gmail.com> wrote:
>
>> Hi Haosdent and Mesos friends,
>>
>> I've rebuilt the cluster from scratch and installed mesos 0.24.1 from the
>> mesosphere apt repo:
>>
>> $ dpkg -l | grep mesos
>> ii  mesos   0.24.1-0.2.35.ubuntu1404
>>amd64Cluster resource manager with efficient resource isolation
>>
>> Then added the `launcher_dir' flag to /etc/mesos-slave/launcher_dir on
>> the slaves:
>>
>> mesos-worker1a:~$ cat /etc/mesos-slave/launcher_dir
>> /usr/libexec/mesos
>>
>> And yet the task health-checks are still being launched from the sandbox
>> directory like before!
>>
>> I've also tested setting the MESOS_LAUNCHER_DIR env var and get the
>> identical result (just as before on the cluster where many versions of
>> mesos had been installed):
>>
>> STDOUT:
>>
>> --container="mesos-20151012-184440-1625401536-5050-23953-S0.62d43b8f-6cd1-4c53-9ac8-84dbfc45bbcb"
>>> --docker="docker" --help="false" --initialize_driver_logging="true"
>>> --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/20151012-184440-1625401536-5050-23953-S0/frameworks/20151012-184440-1625401536-5050-23953-/executors/hello-app_web-v3.33597b73-1943-41b4-a308-76132eebcc91/runs/62d43b8f-6cd1-4c53-9ac8-84dbfc45bbcb"
>>> --stop_timeout="0ns"
>>> --container="mesos-20151012-184440-162540

Re: Framework control over slave recovery

2015-10-09 Thread Marco Massenzio
It sounds like a reasonable expectation that a framework be notified when
the agent(s) running one or more of its tasks start showing signs of
unhealthiness. In most instances, we would expect frameworks to happily
ignore such a situation and just let Mesos take care of the matter, but if
they do care, they should be able to know.

Not so sure about the feasibility of a per-task timeout, but the
notification would probably not be too complicated (although it does open
up a whole new area of debate around implementation and how to modify the
API to enable it).

Could you please file a Jira requesting this as a feature on the Master?

Thanks!
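For context, --slave_ping_timeout and --max_slave_ping_timeouts combine multiplicatively: an unresponsive agent is deemed failed after roughly slave_ping_timeout × max_slave_ping_timeouts. A hypothetical master invocation stretching that window (the values below are illustrative only, not recommendations):

```shell
# Illustrative only: lengthens the window before an unresponsive agent is
# marked failed to 10 x 30secs = 5 minutes, allows 30 minutes for agents to
# re-register after a master failover, and throttles agent removals.
mesos-master \
  --slave_ping_timeout=30secs \
  --max_slave_ping_timeouts=10 \
  --slave_reregister_timeout=30mins \
  --slave_removal_rate_limit=1/20mins
```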

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Fri, Oct 9, 2015 at 3:29 PM, Marcus Larsson <marcus.lars...@oracle.com>
wrote:

> Hi,
>
> On 2015-10-09 15:26, Marco Massenzio wrote:
>
> The 'marking' of the task is not immediate: Master actually waits a beat
> or two to see if the Agent reconnects, there are various flags that control
> behavior around this [0].
>
> Naive question: I am assuming that you already looked into a combination
> of:
>
> --max_slave_ping_timeouts=VALUE
> --slave_ping_timeout=VALUE
> --slave_removal_rate_limit=VALUE
> --slave_reregister_timeout=VALUE
>
> that may help with your use case?
> I'm not really an expert into these flags, so not entirely sure whether a
> combination thereof may work with your scenario.
>
>
> Yeah I've seen and tried using these flags. While they can be used to
> prevent Mesos from killing the agents too quickly, the framework will not
> be notified about the slave failing the health checks unless it times out
> completely and the task is lost. Also, ideally we would want per-task
> timeouts, whereas these settings are global.
>
> Thanks,
> Marcus
>
>
> [0] http://mesos.apache.org/documentation/latest/configuration/
>
>
>
>
> *Marco Massenzio*
>
> *Distributed Systems Engineer*
> http://codetrips.com
>
> On Fri, Oct 9, 2015 at 11:48 AM, Marcus Larsson <marcus.lars...@oracle.com
> > wrote:
>
>> Hi,
>>
>> I'm part of a project investigating the use of Mesos for a distributed
>> build and test system. For some of our tasks we would like to have more
>> control over the slave recovery policy. Currently, when a slave fails its
>> health check, it seems Mesos will always mark any task on the slave as
>> lost, and shutdown the slave when (or if) it reconnects. We would like the
>> framework to have more information and control over this.
>>
>> I found an issue [1] in JIRA that mentions implementing something like
>> this, but it seems only the part with the slave removal rate limiter was
>> implemented. What I'm wondering is if there is any support in Mesos for
>> letting the framework decide how to handle slave removal/recovery?
>>
>> For our case, we would like the framework to be notified when a slave
>> fails its health check, so that the appropriate action for the task running
>> on that slave can be taken. Some of our tasks will be very long running and
>> we don't want to restart a few days worth of work because the network was
>> down for a while.
>>
>> Thanks,
>> Marcus
>>
>> [1]: https://issues.apache.org/jira/browse/MESOS-2246
>>
>
>
>


Re: mesos-ui

2015-10-09 Thread Marco Massenzio
Re: version, I tested against 0.24.1 and it worked just fine (apart from
the fact that I could not access info about tasks running for a framework,
but that seems to be a known issue: #17, if memory serves).
And it did show resources utilized against available on the (one) node.

It definitely looks pretty, so I'm quite looking forward to where you guys
are going to take it!

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Fri, Oct 9, 2015 at 4:48 PM, Taylor, Graham <
graham.x.tay...@capgemini.com> wrote:

>
> 
>
> Capgemini is a trading name used by the Capgemini Group of companies which
> includes Capgemini UK plc, a company registered in England and Wales
> (number 943935) whose registered office is at No. 1, Forge End, Woking,
> Surrey, GU21 6DB.
> This message contains information that may be privileged or confidential
> and is the property of the Capgemini Group. It is intended only for the
> person to whom it is addressed. If you are not the intended recipient, you
> are not authorized to read, print, retain, copy, disseminate, distribute,
> or use this message or any part thereof. If you receive this message in
> error, please notify the sender immediately and delete all copies of this
> message.
>
>
> -- Forwarded message --
> From: "Taylor, Graham" <graham.x.tay...@capgemini.com>
> To: "user@mesos.apache.org" <user@mesos.apache.org>
> Cc:
> Date: Fri, 9 Oct 2015 15:48:56 +
> Subject: Re: mesos-ui
> Hey Andras,
> Yep - we’ve admittedly only tested it against 0.23 and on fairly small
> clusters at the moment. There’s a ticket to look at supporting
> different versions in the future:
> https://github.com/Capgemini/mesos-ui/issues/3
>
> @Marco - Cam should still be there, if you fire him a message on twitter
> at https://twitter.com/Wallies9 he might respond and you can hook up.
>
> Cheers,
> Graham.
>
>
> On 9 Oct 2015, at 16:26, Andras Kerekes <andras.kere...@ishisystems.com>
> wrote:
>
> Hi Graham,
>
> I was able to setup the UI in Marathon pretty quickly. Two quick things: I
> had to increase the allocated memory to 2Gb (vs the 512Mb on the website).
> Also the node level stats were showing only available resources but no
> actual allocation/utilization, I assume this might be because of the
> version
> we're using (0.22.1) vs the version it is tested against (0.23), right?
>
> The UI looks nice, thanks for open sourcing it!
>
> Andras
>
> -Original Message-
> From: Taylor, Graham [mailto:graham.x.tay...@capgemini.com
> <graham.x.tay...@capgemini.com>]
> Sent: Friday, October 09, 2015 6:17 AM
> To: user@mesos.apache.org
> Subject: Re: mesos-ui
>
>
>
>
>
>


Re: Is there any APIs for status monitering, how did the Webui got the status of mesos?

2015-10-08 Thread Marco Massenzio
Probably the most appropriate endpoint(s) would be something like
http://mesos-master:5050/system/stats.json
http://mesos-master:5050/metrics/snapshot

For a much more basic health check you can use the /health endpoint (this
just gives you back a 200 OK if the Master/Agent are... feeling well :)

I would recommend staying away from /state.json (soon to be /state) as it
takes a heavy toll on the Master and you may end up DoS'ing your own
cluster.
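As a sketch of how a monitoring program might consume /metrics/snapshot: fetch the JSON and filter the flat key/value map by prefix. The hostname, port, and the exact metric names below are illustrative; check the output of your own cluster for the real keys:

```python
import json
from urllib.request import urlopen

def fetch_snapshot(host, port=5050):
    # Queries a live master or agent; "host" is a placeholder for your cluster.
    with urlopen("http://%s:%d/metrics/snapshot" % (host, port)) as resp:
        return json.load(resp)

def summarize(snapshot, prefixes=("master/",)):
    # Keep only the metrics whose names start with one of the given prefixes.
    return {k: v for k, v in snapshot.items() if k.startswith(prefixes)}

# Demo with a trimmed sample payload (a real response has many more keys,
# and metric names can differ between releases):
sample = {
    "master/tasks_running": 12.0,
    "master/slaves_active": 3.0,
    "system/load_1min": 0.25,
}
print(summarize(sample))
```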


*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Wed, Oct 7, 2015 at 6:37 PM, Klaus Ma <kl...@cguru.net> wrote:

> Hi Chong,
>
> I think you can use Mesos’s REST API to achieve that; please refer to the
> following URL for more detail:
> http://mesos.apache.org/documentation/latest/monitoring/
>
> 
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform Symphony/DCOS Development & Support, STG, IBM GCG
> +86-10-8245 4084 | mad...@cn.ibm.com | http://www.cguru.net
>
> On Oct 8, 2015, at 09:04, Chong Chen <chong.ch...@huawei.com> wrote:
>
> Hi,
> I want to implement a program to monitor Mesos. Are there any APIs
> already implemented in Mesos that I can use to get its status, just like
> the webui does: acquire information about the amount of total resources,
> allocated resources, dispatched tasks, finished/lost tasks, and so on?
> How did the webui of Mesos get this information? I think the fastest way
> for me is to use the same method the webui does.
>
> Thanks!
>
> Best Regards,
> Chong
>
>
>


Re: mesos-tail in 0.24.1

2015-09-29 Thread Marco Massenzio
Granted, I'm not familiar at all with mesos-tail and/or mesos-resolve, but
you are correct in that this is due to the recent changes (in 0.24) to the
way we write MasterInfo data to ZooKeeper.

This is a genuine bug, thanks for reporting: would you mind terribly filing
a Jira and assigning it to me, please?
(marco-mesos)

Thanks!

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Tue, Sep 29, 2015 at 6:28 AM, Rad Gruchalski <ra...@gruchalski.com>
wrote:

> Thank you, that’s some progress:
>
> I changed the code at this line:
>
>
> https://github.com/mesosphere/mesos-cli/blob/master/mesos/cli/master.py#L107
>
> to:
>
> try:
>     parsed = json.loads(val)
>     return parsed["address"]["ip"] + ":" + str(parsed["address"]["port"])
> except Exception:
>     return val.split("@")[-1]
>
> And now it gives me the correct master. However, executing mesos-tail or
> mesos-ps does not do anything, just hangs there without any output.
> Something obviously does not work as advertised.
> Or I should possibly switch to https://github.com/mesosphere/dcos-cli (
> https://pypi.python.org/pypi/dcoscli), but will this work with just a
> regular mesos 0.24.1 installation?
>
> Kind regards,
> Radek Gruchalski
> ra...@gruchalski.com <ra...@gruchalski.com>
> de.linkedin.com/in/radgruchalski/
>
>
> *Confidentiality:*This communication is intended for the above-named
> person and may be confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>
> On Tuesday, 29 September 2015 at 15:20, haosdent wrote:
>
I think the problem here is that you use zk as the scheme in your config
file (.mesos.json) or MESOS_CLI_CONFIG (
https://github.com/mesosphere/mesos-cli/blob/master/mesos/cli/cfg.py#L42
and
https://github.com/mesosphere/mesos-cli/blob/master/mesos/cli/master.py#L119).
It's not because of 0.24.1; you would have the same issue with 0.24.0.
>
> On Tue, Sep 29, 2015 at 9:14 PM, haosdent <haosd...@gmail.com> wrote:
>
> I think you install mesos-cli from https://github.com/mesosphere/mesos-cli
>
> On Tue, Sep 29, 2015 at 8:51 PM, Rad Gruchalski <ra...@gruchalski.com>
> wrote:
>
> It seems that I found the reason for this behaviour.
> When I execute mesos-resolve, I get an output like this:
>
> 10.100.1.100:5050","port":5050,"version":"0.24.1"}
>
> I managed to get to the python sources on the machine, especially
> master.py. I verified that in my case the zookeeper_resolver is used.
> However, what gets returned from zookeeper resolver is:
>
> return val.split("@")[-1]
>
> Where the val is a JSON string:
>
>
> {"address":{"hostname":"mesos-master","ip":"10.100.1.100","port":5050},"hostname":"mesos-master","id":"20150929-113531-244404234-5050-18065","ip":...,"pid":"master@10.100.1.100:5050","port":5050,"version":"0.24.1"}
>
> Looking at these two, it is obvious why it does not work. I’m trying to
> find the code for master.py but it does not exist in
> https://github.com/apache/mesos/tree/master/src/python/interface/src/mesos/interface
> .
> Where does it come from? Is it somehow generated or is it a separate repo?
>
> Kind regards,
> Radek Gruchalski
> ra...@gruchalski.com <ra...@gruchalski.com>
> de.linkedin.com/in/radgruchalski/
>
>
> *Confidentiality:*This communication is intended for the above-named
> person and may be confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>
> On Tuesday, 29 September 2015 at 13:02, Rad Gruchalski wrote:
>
> Hi everyone,
>
> I have upgraded my development mesos environment to 0.24.1 this morning.
> It’s a clean installation with new zookeeper and everything.
> Since the upgrade I get an error while executing mesos-tail:
>
> mesos-master ~$ mesos tail -f -n 50 service
> Traceback (most recent call last):
>   File "/usr/local/bin/mesos-tail", line 11, in <module>
> sys.exit(main())
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/cli.py", line 61,
> in wrapper
> return fn(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/cmds/tail.py",
> line 55, 

Re: Fwd: [Breaking Change 0.24 & Upgrade path] ZooKeeper MasterInfo change.

2015-09-25 Thread Marco Massenzio
+1 to what Alex says.

As far as we know, the functionality we use (ephemeral sequential nodes and
writing simple data to a znode) is part of the base API offered by
ZooKeeper, and every version should support it.
(Then again, I'm not a ZK expert; if anyone knows better, please feel free
to correct me.)
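For the curious, leader detection on top of ephemeral sequential znodes boils down to picking the child with the lowest sequence suffix. A minimal sketch of that selection step (the `json.info_` node prefix is my assumption for the JSON-format nodes written as of 0.24; verify the actual names against your cluster's znode):

```python
def leading_master(children):
    # Among children created as ephemeral *sequential* nodes, the leader is
    # the one with the lowest sequence suffix. Filter to the JSON-format
    # info nodes first; other children (e.g. replicated-log nodes) are
    # ignored.
    info_nodes = [c for c in children if c.startswith("json.info_")]
    if not info_nodes:
        return None
    return min(info_nodes, key=lambda name: int(name.rsplit("_", 1)[-1]))

print(leading_master(
    ["json.info_0000000022", "json.info_0000000021", "log_replicas"]))
# -> json.info_0000000021
```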

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Fri, Sep 25, 2015 at 6:24 AM, Alex Rukletsov <a...@mesosphere.com> wrote:

> James—
>
> Marco will correct me if I'm wrong, but my understanding is that this
> change does *not* impact what ZooKeeper version you can use with Mesos. We
> have changed the format of the message stored in ZK from protobuf to JSON.
> This message is needed by frameworks for mesos master leader detection.
>
> HTH,
> Alex
>
> On Fri, Sep 25, 2015 at 11:12 AM, CCAAT <cc...@tampabay.rr.com> wrote:
>
>> On 09/25/2015 08:13 AM, Marco Massenzio wrote:
>>
>>> Folks:
>>>
>>> as a reminder, please be aware that as of Mesos 0.24.0, as announced
>>> back in June, Mesos Master will write its information (`MasterInfo`) to
>>> ZooKeeper in JSON format (see below for details).
>>>
>>
>>
>> What versions of Zookeeper are supported by this change? That is, what
>> is the oldest version of Zookeeper known to work or not work with this
>> change in Mesos?
>>
>>
>> James
>>
>>
>>
>>
>>
>>> If your framework relied on parsing the info (either de-serializing the
>>> Protocol Buffer or just looking for an "IP-like" string) this change
>>> will be a breaking change.
>>>
>>> Just to confirm (see also Vinod's comments below) any rolling upgrades
>>> (i.e., clusters with 0.22+0.23 and 0.23+0.24) of Mesos will just work.
>>>
>>> This was in conjunction with the HTTP API release and removing the need
>>> for non-C++ developers to have to link with libmesos and have to deal
>>> with Protocol Buffers.
>>>
>>> An example of how to access the new format in Python can be found in [0]
>>> and we're happy to help with other languages too.
>>> Any questions, please just ask.
>>>
>>> [0] http://github.com/massenz/zk-mesos
>>>
>>> Marco Massenzio
>>> /Distributed Systems Engineer
>>> http://codetrips.com/
>>>
>>> -- Forwarded message --
>>> From: *Vinod Kone* <vinodk...@gmail.com <mailto:vinodk...@gmail.com>>
>>> Date: Wed, Jun 24, 2015 at 4:17 PM
>>> Subject: Re: [Breaking Change 0.24 & Upgrade path] ZooKeeper MasterInfo
>>> change.
>>> To: dev <d...@mesos.apache.org <mailto:d...@mesos.apache.org>>
>>>
>>>
>>> Just to clarify, any frameworks that are using the Mesos provided
>>> bindings
>>> (aka libmesos.so) should not worry, as long as the version of the
>>> bindings
>>> and version of the mesos master are not separated by more than 1 version.
>>> In other words, you should be able to live upgrade a cluster from 0.23.0
>>> to
>>> 0.24.0.
>>>
>>> For framework schedulers that don't use the bindings (pesos, jesos etc),
>>> it
>>> is prudent to add support for JSON formatted ZNODE to their master
>>> detection code.
>>>
>>> Thanks,
>>>
>>> On Wed, Jun 24, 2015 at 4:10 PM, Marco Massenzio <ma...@mesosphere.io
>>> <mailto:ma...@mesosphere.io>>
>>> wrote:
>>>
>>> Folks,
>>>>
>>>> as heads-up, we are planning to convert the format of the MasterInfo
>>>> information stored in ZooKeeper from the Protocol Buffer binary format
>>>> to
>>>> JSON - this is in conjunction with the HTTP API development, to allow
>>>> frameworks *not* to depend on libmesos and other binary dependencies to
>>>> interact with Mesos Master nodes.
>>>>
>>>>  > *NOTE* - there is no change in 0.23 (so any Master/Slave/Framework
>>> that is
>>>  > currently working in 0.22 *will continue to work* in 0.23 too) but as
>>> of
>>>
>>>> Mesos 0.24, frameworks and other clients relying on the binary format
>>>> will
>>>> break.
>>>>
>>>> The details of the design are in this Google Doc:
>>>>
>>>>
>>>> https://docs.google.com/document/d/1i2pWJaIjnFYhuR-000NG-AC1rFKKrRh3Wn47Y2G6lRE/edit
>>>>
>>>> the actual work is detailed in MESOS-2340:
>>>> https://issues.apache.org/jira/browse/MESOS-2340
>>>>
>>>> and the patch (and associated test) are here:
>>>> https://reviews.apache.org/r/35571/
>>>> https://reviews.apache.org/r/35815/
>>>>
>>>>  > *Marco Massenzio*
>>>  > *Distributed Systems Engineer*
>>>  >
>>>
>>>
>>
>


Re: Official RPMs

2015-09-25 Thread Marco Massenzio
Yes, the plan is definitely to make the tooling available to the project:
there is nothing "secret" about it - at the moment, unfortunately, it
relies on a bit of internal infrastructure and, well, yesss, it's a bit too
crafty to be ready for "external consumption" but we're working on it!

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Fri, Sep 25, 2015 at 11:33 AM, Zameer Manji <zma...@apache.org> wrote:

> Could mesosphere donate their tooling for packaging mesos to the project?
> This way any project member or contributor can build packages and it can be
> apart of the release process.
>
> On Fri, Sep 25, 2015 at 10:53 AM, Artem Harutyunyan <ar...@mesosphere.io>
> wrote:
>
>> The repositories have been updated yesterday, and the downloads page
>> was updated today. Mesos 0.24 packages are now available at
>> https://mesosphere.com/downloads/. Thank you very much for your
>> patience!
>>
>> Cheers,
>> Artem.
>>
>> On Tue, Sep 22, 2015 at 11:02 AM, Marco Massenzio <ma...@mesosphere.io>
>> wrote:
>> > Hi guys,
>> >
>> > just wanted to let you all know that we (Mesosphere) fully intend to
>> > continue supporting distributing binary packages for the current set of
>> > supported OSes (namely, Ubuntu / Debian / RedHat / CentOS as listed in
>> [0]).
>> >
>> > Sorry that 0.24 slipped through the cracks, the person who actually
>> takes
>> > care of that and knows the magic incantations has been unwell, and a
>> number
>> > of other competing priorities got in the way - we will eventually be
>> > automating the process, so that downloadable binary packages are
>> created out
>> > of each release/RC build (and, possibly, even more often) without pesky
>> > humans getting in the way :) but this may take some time.
>> > We're building the 0.24 ones as we speak, so please bear with us while
>> this
>> > gets done.
>> >
>> > Any questions / suggestions, we'd love to hear those too!
>> >
>> > [0] https://mesosphere.com/downloads/
>> >
>> > Marco Massenzio
>> > Distributed Systems Engineer
>> > http://codetrips.com
>> >
>> > On Tue, Sep 22, 2015 at 10:54 AM, CCAAT <cc...@tampabay.rr.com> wrote:
>> >>
>> >> On 09/21/2015 03:01 PM, Vinod Kone wrote:
>> >>>
>> >>> +Jake Farrell
>> >>>
>> >>> The mesos project doesn't publish platform dependent artifacts.  We
>> >>> currently only publish platform independent artifacts like JAR (to
>> >>> apache maven) and interface EGG (to PyPI).
>> >>>
>> >>> Recently we made the decision
>> >>> <http://www.mail-archive.com/dev%40mesos.apache.org/msg33148.html>
>> for
>> >>> the project to not officially support different language (java,
>> python)
>> >>> framework libraries going forward (likely after 1.0). The project will
>> >>> only support C++ libraries which will live in the repo and link to
>> other
>> >>> language libraries from our website.
>> >>>
>> >>> The main reason was that the PMC lacks the expertise to support
>> various
>> >>> language bindings and hence we wanted to remove the support burden.
>> >>>
>> >>> Option #1) It looks like we could do a similar thing with RPMs/DEBs,
>> >>> i.e., link to 3rd party artifacts from the project website. Similar to
>> >>> the client library authors, we could hold package maintainers
>> >>> accountable by providing guidelines.
>> >>>
>> >>> Option #2) Since the project officially supports certain platforms
>> >>> (Ubuntu, CentOS, OSX) and continuously tests this via CI, we could've
>> >>> the CI continuously build and upload the packages. Not sure what's ASF
>> >>> stance on this is. I filed a ticket
>> >>> <https://issues.apache.org/jira/browse/INFRA-10385> a while ago with
>> >>> INFRA regarding something similar, but never received any response.
>> >>>
>> >>> Personally, with the direction the project is headed towards, I prefer
>> >>> #1.
>> >>
>> >>
>> >> +1 (Option #1)
>> >>
>> >> This 'Option #1' approach will require the core dev team to clearly
>> convey
>> >> what is needed for any OS supported

Fwd: [Breaking Change 0.24 & Upgrade path] ZooKeeper MasterInfo change.

2015-09-25 Thread Marco Massenzio
Folks:

as a reminder, please be aware that as of Mesos 0.24.0, as announced back
in June, Mesos Master will write its information (`MasterInfo`) to
ZooKeeper in JSON format (see below for details).

If your framework relied on parsing the info (either de-serializing the
Protocol Buffer or just looking for an "IP-like" string) this change will
be a breaking change.

Just to confirm (see also Vinod's comments below) any rolling upgrades
(i.e., clusters with 0.22+0.23 and 0.23+0.24) of Mesos will just work.

This was in conjunction with the HTTP API release and removing the need for
non-C++ developers to have to link with libmesos and have to deal with
Protocol Buffers.

An example of how to access the new format in Python can be found in [0]
and we're happy to help with other languages too.
Any questions, please just ask.

[0] http://github.com/massenz/zk-mesos
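Since the MasterInfo znode is now plain JSON, leader detection no longer needs libmesos or the protobuf bindings. The sketch below shows the core of that logic in a self-contained, runnable form; the `json.info_` znode naming and the JSON fields used are assumptions based on this thread and [0], not an authoritative spec, and a real client would fetch the znodes with a ZooKeeper library such as kazoo instead of the simulated dict.

```python
import json

def elect_leader(znodes):
    """Pick the leading master from {znode_name: json_payload}.

    ZooKeeper appends a monotonically increasing sequence number to each
    candidate's ephemeral node; the node with the lowest sequence number
    is the elected leader.
    """
    json_nodes = {name: data for name, data in znodes.items()
                  if name.startswith("json.info_")}
    if not json_nodes:
        return None  # no leader registered under /mesos
    leader = min(json_nodes, key=lambda name: int(name.rsplit("_", 1)[1]))
    return json.loads(json_nodes[leader])

# Simulated contents of the /mesos znode; payload fields are illustrative.
znodes = {
    "json.info_0000000002": json.dumps({"hostname": "master2", "port": 5050}),
    "json.info_0000000001": json.dumps({"hostname": "master1", "port": 5050}),
    "log_replicas": "",  # unrelated node, ignored by the filter above
}

info = elect_leader(znodes)
print("%s:%d" % (info["hostname"], info["port"]))  # master1:5050
```

The same filter-and-min pattern is what a non-C++ scheduler (pesos, jesos, etc.) would run against the real children of `/mesos`.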

Marco Massenzio

*Distributed Systems Engineer*
http://codetrips.com

-- Forwarded message --
From: Vinod Kone <vinodk...@gmail.com>
Date: Wed, Jun 24, 2015 at 4:17 PM
Subject: Re: [Breaking Change 0.24 & Upgrade path] ZooKeeper MasterInfo
change.
To: dev <d...@mesos.apache.org>


Just to clarify, any frameworks that are using the Mesos provided bindings
(aka libmesos.so) should not worry, as long as the version of the bindings
and version of the mesos master are not separated by more than 1 version.
In other words, you should be able to live upgrade a cluster from 0.23.0 to
0.24.0.
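The one-version compatibility rule above can be encoded as a small helper. This is a hedged sketch: the `can_live_upgrade` name and the exact encoding are mine, not part of Mesos.

```python
def can_live_upgrade(bindings_version, master_version):
    """True if the libmesos bindings and the master differ by at most one
    minor release -- the live-upgrade guarantee stated for 0.x Mesos."""
    b_major, b_minor = (int(x) for x in bindings_version.split(".")[:2])
    m_major, m_minor = (int(x) for x in master_version.split(".")[:2])
    return b_major == m_major and abs(b_minor - m_minor) <= 1

# A 0.23.0 framework against a 0.24.0 master is covered...
print(can_live_upgrade("0.23.0", "0.24.0"))  # True
# ...but skipping a release is not.
print(can_live_upgrade("0.22.1", "0.24.0"))  # False
```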

For framework schedulers that don't use the bindings (pesos, jesos etc), it
is prudent to add support for JSON formatted ZNODE to their master
detection code.

Thanks,

On Wed, Jun 24, 2015 at 4:10 PM, Marco Massenzio <ma...@mesosphere.io>
wrote:

> Folks,
>
> as heads-up, we are planning to convert the format of the MasterInfo
> information stored in ZooKeeper from the Protocol Buffer binary format to
> JSON - this is in conjunction with the HTTP API development, to allow
> frameworks *not* to depend on libmesos and other binary dependencies to
> interact with Mesos Master nodes.
>
> *NOTE* - there is no change in 0.23 (so any Master/Slave/Framework that is
> currently working in 0.22 *will continue to work* in 0.23 too) but as of
> Mesos 0.24, frameworks and other clients relying on the binary format will
> break.
>
> The details of the design are in this Google Doc:
>
>
https://docs.google.com/document/d/1i2pWJaIjnFYhuR-000NG-AC1rFKKrRh3Wn47Y2G6lRE/edit
>
> the actual work is detailed in MESOS-2340:
> https://issues.apache.org/jira/browse/MESOS-2340
>
> and the patch (and associated test) are here:
> https://reviews.apache.org/r/35571/
> https://reviews.apache.org/r/35815/
>
> *Marco Massenzio*
> *Distributed Systems Engineer*
>


Re: Official RPMs

2015-09-22 Thread Marco Massenzio
Hi guys,

just wanted to let you all know that we (Mesosphere) fully intend to
continue supporting distributing binary packages for the current set of
supported OSes (namely, Ubuntu / Debian / RedHat / CentOS as listed in [0]).

Sorry that 0.24 slipped through the cracks: the person who actually takes
care of that and knows the magic incantations has been unwell, and a number
of other competing priorities got in the way. We will eventually be
automating the process, so that downloadable binary packages are created
out of each release/RC build (and, possibly, even more often) without pesky
humans getting in the way :) but this may take some time.
We're building the 0.24 ones as we speak, so please bear with us while this
gets done.

Any questions / suggestions, we'd love to hear those too!

[0] https://mesosphere.com/downloads/

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Tue, Sep 22, 2015 at 10:54 AM, CCAAT <cc...@tampabay.rr.com> wrote:

> On 09/21/2015 03:01 PM, Vinod Kone wrote:
>
>> +Jake Farrell
>>
>> The mesos project doesn't publish platform dependent artifacts.  We
>> currently only publish platform independent artifacts like JAR (to
>> apache maven) and interface EGG (to PyPI).
>>
>> Recently we made the decision
>> <http://www.mail-archive.com/dev%40mesos.apache.org/msg33148.html> for
>> the project to not officially support different language (java, python)
>> framework libraries going forward (likely after 1.0). The project will
>> only support C++ libraries which will live in the repo and link to other
>> language libraries from our website.
>>
>> The main reason was that the PMC lacks the expertise to support various
>> language bindings and hence we wanted to remove the support burden.
>>
>> Option #1) It looks like we could do a similar thing with RPMs/DEBs,
>> i.e., link to 3rd party artifacts from the project website. Similar to
>> the client library authors, we could hold package maintainers
>> accountable by providing guidelines.
>>
>> Option #2) Since the project officially supports certain platforms
>> (Ubuntu, CentOS, OSX) and continuously tests this via CI, we could've
>> the CI continuously build and upload the packages. Not sure what's ASF
>> stance on this is. I filed a ticket
>> <https://issues.apache.org/jira/browse/INFRA-10385> a while ago with
>> INFRA regarding something similar, but never received any response.
>>
>> Personally, with the direction the project is headed towards, I prefer #1.
>>
>
> +1 (Option #1)
>
> This 'Option #1' approach will require the core dev team to clearly convey
> what is needed for any OS supported, not the chosen OSes for support. Right
> now, I'm having to parse many documents to figure out how to extend the
> gentoo ebuild for mesos. And where to cut off what I do in the ebuilds and
> what to put into the configuration documents for gentoo. Naturally the
> minimal is only what should be in the gentoo ebuild; with other items,
> such as HDFS as a compiler option. Once I get the btrfs/ceph work
> stabilized, there will be a compile time option for btrfs/ceph with the
> gentoo ebuild. Other distros that are not going that
> way should have other Distributed File System options 'baked into' their
> installation on that OS.
>
>
>
> 'Option #1' sets the stage for many OSes to be supported and the core dev
> team only has to support  a single document to clarify what any distro
> needs to robustly support mesos for their user community. This will
> facilitate a wider variety of experimentation, at the companion repos too.
> This  Option #1 approach will further accelerate adoption of Mesos on a
> very wide variety of platforms and architectures, imho. It sets the stage
> for valid benchmark performance comparison between distros; something that
> the gentoo community will no doubt win
>
> ;-)
>
> James
>
>
>
>
>
>> On Sat, Sep 19, 2015 at 3:39 AM, Carlos Sanchez <car...@apache.org
>> <mailto:car...@apache.org>> wrote:
>>
>> I'm using the same repo with some changes to build SSL enabled
>> packages
>>
>>
>> https://github.com/carlossg/mesos-deb-packaging/compare/master...carlossg:ssl
>>
>>
>> On Sat, Sep 19, 2015 at 4:22 AM, Rad Gruchalski
>> <ra...@gruchalski.com <mailto:ra...@gruchalski.com>> wrote:
>>  > Should be rather easy to package it with this little tool from
>> Mesosphere:
>>  > https://github.com/mesosphere/mesos-deb-packaging. I’ve done it
>> myself for
>>  > ubuntu 12.04 and 14.04.
>>  >

Re: Help interpreting output from running java test-framework example

2015-09-18 Thread Marco Massenzio
Thanks, Stephen - feedback much appreciated!

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Thu, Sep 17, 2015 at 5:03 PM, Stephen Boesch <java...@gmail.com> wrote:

> Compared to Yarn, Mesos is just faster. Mesos has a smaller startup time
> and the delay between tasks is smaller. The run times for terasort 100GB
> tended towards a 110 sec median on Mesos vs. about double that on Yarn.
>
> Unfortunately we require mature Multi-Tenancy/Isolation/Queues support -
> which is still in the initial stages of WIP for Mesos. So we will need to
> use YARN for the near and likely medium term.
>
>
>
> 2015-09-17 15:52 GMT-07:00 Marco Massenzio <ma...@mesosphere.io>:
>
>> Hey Stephen,
>>
>> The spark on mesos is twice as fast as yarn on our 20 node cluster. In
>>> addition Mesos is handling datasizes that yarn simply dies on. But
>>> mesos is still just taking linearly increased time compared to smaller
>>> datasizes.
>>
>>
>> Obviously delighted to hear that, BUT me not much like "but" :)
>> I've added Tim who is one of the main contributors to our Mesos/Spark
>> bindings, and it would be great to hear your use case/experience and find
>> out whether we can improve on that front too!
>>
>> As the case may be, we could also jump on a hangout if it makes
>> conversation easier/faster.
>>
>> Cheers,
>>
>> *Marco Massenzio*
>>
>> *Distributed Systems Engineer*
>> http://codetrips.com
>>
>> On Wed, Sep 9, 2015 at 1:33 PM, Stephen Boesch <java...@gmail.com> wrote:
>>
>>> Thanks Vinod. I went back to see the logs and nothing interesting.
>>> However in the process I found that my spark port was pointing to 7077
>>> instead of 5050. After re-running, spark on mesos worked!
>>>
>>> The spark on mesos is twice as fast as yarn on our 20 node cluster. In
>>> addition Mesos is handling datasizes that yarn simply dies on. But
>>> mesos is still just taking linearly increased time compared to smaller
>>> datasizes.
>>>
>>> We have significant additional work to incorporate mesos into operations
>>> and support but given the strong performance and stability characteristics
>>> we are initially seeing here that effort is likely to get underway.
>>>
>>>
>>>
>>> 2015-09-09 12:54 GMT-07:00 Vinod Kone <vinodk...@gmail.com>:
>>>
>>>> sounds like it. can you see what the slave/agent and executor logs say?
>>>>
>>>> On Tue, Sep 8, 2015 at 11:46 AM, Stephen Boesch <java...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> I am in the process of learning how to run a mesos cluster with the
>>>>> intent for it to be the resource manager for Spark.  As a small step in
>>>>> that direction a basic test of mesos was performed, as suggested by the
>>>>> Mesos Getting Started page.
>>>>>
>>>>> In the following output we see tasks launched and resources offered on
>>>>> a 20 node cluster:
>>>>>
>>>>> [stack@yarnmaster-8245 build]$ ./src/examples/java/test-framework
>>>>> $(hostname -s):5050
>>>>> I0908 18:40:10.900964 31959 sched.cpp:157] Version: 0.23.0
>>>>> I0908 18:40:10.918957 32000 sched.cpp:254] New master detected at
>>>>> master@10.64.204.124:5050
>>>>> I0908 18:40:10.921525 32000 sched.cpp:264] No credentials provided.
>>>>> Attempting to register without authentication
>>>>> I0908 18:40:10.928963 31997 sched.cpp:448] Framework registered with
>>>>> 20150908-182014-2093760522-5050-15313-
>>>>> Registered! ID = 20150908-182014-2093760522-5050-15313-
>>>>> Received offer 20150908-182014-2093760522-5050-15313-O0 with cpus:
>>>>> 16.0 and mem: 119855.0
>>>>> Launching task 0 using offer 20150908-182014-2093760522-5050-15313-O0
>>>>> Launching task 1 using offer 20150908-182014-2093760522-5050-15313-O0
>>>>> Launching task 2 using offer 20150908-182014-2093760522-5050-15313-O0
>>>>> Launching task 3 using offer 20150908-182014-2093760522-5050-15313-O0
>>>>> Launching task 4 using offer 20150908-182014-2093760522-5050-15313-O0
>>>>> Received offer 20150908-182014-2093760522-5050-15313-O1 with cpus:
>>>>> 16.0 and mem: 119855.0
>>>>> Received offer 20150908-182014-2093760522-5050-15313-O2 with

Re: Help interpreting output from running java test-framework example

2015-09-17 Thread Marco Massenzio
Hey Stephen,

The spark on mesos is twice as fast as yarn on our 20 node cluster. In
> addition Mesos is handling datasizes that yarn simply dies on. But
> mesos is still just taking linearly increased time compared to smaller
> datasizes.


Obviously delighted to hear that, BUT me not much like "but" :)
I've added Tim who is one of the main contributors to our Mesos/Spark
bindings, and it would be great to hear your use case/experience and find
out whether we can improve on that front too!

As the case may be, we could also jump on a hangout if it makes
conversation easier/faster.

Cheers,

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Wed, Sep 9, 2015 at 1:33 PM, Stephen Boesch <java...@gmail.com> wrote:

> Thanks Vinod. I went back to see the logs and nothing interesting.
> However in the process I found that my spark port was pointing to 7077
> instead of 5050. After re-running, spark on mesos worked!
>
> The spark on mesos is twice as fast as yarn on our 20 node cluster. In
> addition Mesos is handling datasizes that yarn simply dies on. But
> mesos is still just taking linearly increased time compared to smaller
> datasizes.
>
> We have significant additional work to incorporate mesos into operations
> and support but given the strong performance and stability characteristics
> we are initially seeing here that effort is likely to get underway.
>
>
>
> 2015-09-09 12:54 GMT-07:00 Vinod Kone <vinodk...@gmail.com>:
>
>> sounds like it. can you see what the slave/agent and executor logs say?
>>
>> On Tue, Sep 8, 2015 at 11:46 AM, Stephen Boesch <java...@gmail.com>
>> wrote:
>>
>>>
>>> I am in the process of learning how to run a mesos cluster with the
>>> intent for it to be the resource manager for Spark.  As a small step in
>>> that direction a basic test of mesos was performed, as suggested by the
>>> Mesos Getting Started page.
>>>
>>> In the following output we see tasks launched and resources offered on a
>>> 20 node cluster:
>>>
>>> [stack@yarnmaster-8245 build]$ ./src/examples/java/test-framework
>>> $(hostname -s):5050
>>> I0908 18:40:10.900964 31959 sched.cpp:157] Version: 0.23.0
>>> I0908 18:40:10.918957 32000 sched.cpp:254] New master detected at
>>> master@10.64.204.124:5050
>>> I0908 18:40:10.921525 32000 sched.cpp:264] No credentials provided.
>>> Attempting to register without authentication
>>> I0908 18:40:10.928963 31997 sched.cpp:448] Framework registered with
>>> 20150908-182014-2093760522-5050-15313-
>>> Registered! ID = 20150908-182014-2093760522-5050-15313-
>>> Received offer 20150908-182014-2093760522-5050-15313-O0 with cpus: 16.0
>>> and mem: 119855.0
>>> Launching task 0 using offer 20150908-182014-2093760522-5050-15313-O0
>>> Launching task 1 using offer 20150908-182014-2093760522-5050-15313-O0
>>> Launching task 2 using offer 20150908-182014-2093760522-5050-15313-O0
>>> Launching task 3 using offer 20150908-182014-2093760522-5050-15313-O0
>>> Launching task 4 using offer 20150908-182014-2093760522-5050-15313-O0
>>> Received offer 20150908-182014-2093760522-5050-15313-O1 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O2 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O3 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O4 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O5 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O6 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O7 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O8 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O9 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O10 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O11 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O12 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-2093760522-5050-15313-O13 with cpus: 16.0
>>> and mem: 119855.0
>>> Received offer 20150908-182014-20937605

Re: Basic installation question

2015-09-05 Thread Marco Massenzio
Stephen:

Klaus is correct, you are starting the Master in "standalone" mode, not
with zookeeper support: it needs adding the --zk=zk://10.xx.xx.124:2181/mesos
--quorum=1 options (at the very least).

As you correctly noted, the contents of the /mesos znode is empty and thus
the agent nodes cannot find elected Master leader (also, if you are running
more than one Master, they won't 'know' about each other and won't be able
to elect a leader).

To check that your settings work, you can (a) look in Master logs (it will
log a lot of info when connecting to ZK) and (b) see that under /mesos a
number of json.info_nn nodes will appear (whose contents are JSON so
you can double check that the contents make sense).

You can find more info here[0].

[0]
http://codetrips.com/2015/08/16/apache-mesos-leader-master-discovery-using-zookeeper-part-2/
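The standalone-vs-HA distinction above lends itself to a quick sanity check. The sketch below inspects a mesos-master command line for the flags this thread identifies as required for ZooKeeper-based HA; `--zk` and `--quorum` are the real mesos-master flag names, but the checking function itself is illustrative.

```python
def check_ha_flags(cmdline):
    """Return a list of problems with a mesos-master command line that is
    meant to run in HA mode. Heuristic sketch based on this thread's
    diagnosis, not an exhaustive flag validator."""
    problems = []
    if not any(arg.startswith("--zk=zk://") for arg in cmdline):
        problems.append("missing --zk=zk://host:port/mesos (standalone mode)")
    if not any(arg.startswith("--quorum=") for arg in cmdline):
        problems.append("missing --quorum (required together with --zk)")
    return problems

# Stephen's master: no --zk, no --quorum -> standalone mode, agents can't
# find a leader via ZooKeeper.
standalone = ["mesos-master", "--work_dir=/tmp/mesos", "--ip=10.0.0.124"]
# The corrected invocation suggested above (addresses are placeholders).
ha = ["mesos-master", "--work_dir=/var/lib/mesos",
      "--zk=zk://10.0.0.124:2181/mesos", "--quorum=1"]

print(check_ha_flags(standalone))  # two problems reported
print(check_ha_flags(ha))          # []
```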

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Fri, Sep 4, 2015 at 5:33 PM, Stephen Boesch <java...@gmail.com> wrote:

>
> I installed using yum -y install mesos. That did work.
>
> Now the master and slaves do not see each other.
>
>
> Here is the master:
> $ ps -ef | grep mesos | grep -v grep
> stack    30236 17902  0 00:09 pts/4    00:00:04
> /mnt/mesos/build/src/.libs/lt-mesos-master --work_dir=/tmp/mesos
> --ip=10.xx.xx.124
>
>
> Here is one of the 20 slaves:
>
>  ps -ef | grep mesos | grep -v grep
> root     26086     1  0 00:10 ?        00:00:00 /usr/sbin/mesos-slave
> --master=zk://10.xx.xx.124:2181/mesos --log_dir=/var/log/mesos
> root     26092 26086  0 00:10 ?        00:00:00 logger -p user.info -t
> mesos-slave[26086]
> root     26093 26086  0 00:10 ?        00:00:00 logger -p user.err -t
> mesos-slave[26086]
>
>
> Note the slave and master are on correct same ip address
>
> The /etc/mesos/zk seems to be set properly : and I do see the /mesos node
> in zookeeper is updated after restarting the master
>
> However the zookeeper node is empty:
>
> [zk: localhost:2181(CONNECTED) 10] ls /mesos
> []
>
> The node is world accessible so no permission issue:
>
> [zk: localhost:2181(CONNECTED) 12] getAcl /mesos
> 'world,'anyone
> : cdrwa
>
> Why is the zookeeper node empty?  Is this the reason the  master and
> slaves are not connecting?
>
> 2015-09-04 14:56 GMT-07:00 craig w <codecr...@gmail.com>:
>
>> No problem, they have a "downloads" link in their menu:
>> https://mesosphere.com/downloads/
>> On Sep 4, 2015 5:43 PM, "Stephen Boesch" <java...@gmail.com> wrote:
>>
>>> @Craig . That is an incomplete answer - given that such links are not
>>> presented in an obvious manner .  Maybe you managed to find  a link on
>>> their site that provides prebuilt for Centos7: if so then please share it.
>>>
>>>
>>> I had previously found a link on their site for prebuilt binaries but is
>>> based on using CDH4 (which is not possible for my company). It is also old.
>>>
>>> https://docs.mesosphere.com/tutorials/install_centos_rhel/
>>>
>>>
>>> 2015-09-04 14:27 GMT-07:00 craig w <codecr...@gmail.com>:
>>>
>>>> Mesosphere has packages prebuilt, go to their site to find how to
>>>> install
>>>> On Sep 4, 2015 5:11 PM, "Stephen Boesch" <java...@gmail.com> wrote:
>>>>
>>>>>
>>>>> After following the directions here:
>>>>> http://mesos.apache.org/gettingstarted/
>>>>>
>>>>> Which for centos7 includes the following:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>   # Change working directory.
>>>>> $ cd mesos
>>>>>
>>>>> # Bootstrap (Only required if building from git repository).
>>>>> $ ./bootstrap
>>>>>
>>>>> # Configure and build.
>>>>> $ mkdir build
>>>>> $ cd build
>>>>> $ ../configure
>>>>> $ make
>>>>>
>>>>> In order to speed up the build and reduce verbosity of the logs, you
>>>>> can append -j <number of cores> V=0 to make.
>>>>>
>>>>> # Run test suite.
>>>>> $ make check
>>>>>
>>>>> # Install (Optional).
>>>>> $ make install
>>>>>
>>>>>
>>>>>
>>>>> But the installation is not correct afterwards: here is the bin
>>>>> directory:
>>>>>
>>>>> $ ll bin
>>>>> total 92

Re: Basic installation question

2015-09-05 Thread Marco Massenzio
Thanks for follow-up, Stephen - this will be also useful to others finding
this in the archives!

Glad it eventually worked for you, I'll drop a line to our guys to update
the download page with this information, so it should hopefully be less
painful in the future for others.

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Sat, Sep 5, 2015 at 3:00 PM, Stephen Boesch <java...@gmail.com> wrote:

> Yes I had started the slaves as
>
> service mesos-slave start
>
> But had not done the correct way on the master, which is supposed to be:
>
> service mesos-master start
>
> The slaves do appear after having made that correction: thanks.
>
>
> 2015-09-05 14:55 GMT-07:00 Marco Massenzio <ma...@mesosphere.io>:
>
>> Stephen:
>>
>> Klaus is correct, you are starting the Master in "standalone" mode, not
>> with zookeeper support: it needs adding the --zk=zk://10.xx.xx.124:2181/mesos
>> --quorum=1 options (at the very least).
>>
>> As you correctly noted, the contents of the /mesos znode is empty and
>> thus the agent nodes cannot find elected Master leader (also, if you are
>> running more than one Master, they won't 'know' about each other and won't
>> be able to elect a leader).
>>
>> To check that your settings work, you can (a) look in Master logs (it
>> will log a lot of info when connecting to ZK) and (b) see that under /mesos
>> a number of json.info_nn nodes will appear (whose contents are JSON so
>> you can double check that the contents make sense).
>>
>> You can find more info here[0].
>>
>> [0]
>> http://codetrips.com/2015/08/16/apache-mesos-leader-master-discovery-using-zookeeper-part-2/
>>
>> *Marco Massenzio*
>>
>> *Distributed Systems Engineer*
>> http://codetrips.com
>>
>> On Fri, Sep 4, 2015 at 5:33 PM, Stephen Boesch <java...@gmail.com> wrote:
>>
>>>
>>> I installed using yum -y install mesos. That did work.
>>>
>>> Now the master and slaves do not see each other.
>>>
>>>
>>> Here is the master:
>>> $ ps -ef | grep mesos | grep -v grep
>>> stack    30236 17902  0 00:09 pts/4    00:00:04
>>> /mnt/mesos/build/src/.libs/lt-mesos-master --work_dir=/tmp/mesos
>>> --ip=10.xx.xx.124
>>>
>>>
>>> Here is one of the 20 slaves:
>>>
>>>  ps -ef | grep mesos | grep -v grep
>>> root     26086     1  0 00:10 ?        00:00:00 /usr/sbin/mesos-slave
>>> --master=zk://10.xx.xx.124:2181/mesos --log_dir=/var/log/mesos
>>> root     26092 26086  0 00:10 ?        00:00:00 logger -p user.info -t
>>> mesos-slave[26086]
>>> root     26093 26086  0 00:10 ?        00:00:00 logger -p user.err -t
>>> mesos-slave[26086]
>>>
>>>
>>> Note the slave and master are on correct same ip address
>>>
>>> The /etc/mesos/zk seems to be set properly : and I do see the /mesos
>>> node in zookeeper is updated after restarting the master
>>>
>>> However the zookeeper node is empty:
>>>
>>> [zk: localhost:2181(CONNECTED) 10] ls /mesos
>>> []
>>>
>>> The node is world accessible so no permission issue:
>>>
>>> [zk: localhost:2181(CONNECTED) 12] getAcl /mesos
>>> 'world,'anyone
>>> : cdrwa
>>>
>>> Why is the zookeeper node empty?  Is this the reason the  master and
>>> slaves are not connecting?
>>>
>>> 2015-09-04 14:56 GMT-07:00 craig w <codecr...@gmail.com>:
>>>
>>>> No problem, they have a "downloads" link in their menu:
>>>> https://mesosphere.com/downloads/
>>>> On Sep 4, 2015 5:43 PM, "Stephen Boesch" <java...@gmail.com> wrote:
>>>>
>>>>> @Craig . That is an incomplete answer - given that such links are not
>>>>> presented in an obvious manner .  Maybe you managed to find  a link on
>>>>> their site that provides prebuilt for Centos7: if so then please share it.
>>>>>
>>>>>
>>>>> I had previously found a link on their site for prebuilt binaries but
>>>>> is based on using CDH4 (which is not possible for my company). It is also
>>>>> old.
>>>>>
>>>>> https://docs.mesosphere.com/tutorials/install_centos_rhel/
>>>>>
>>>>>
>>>>> 2015-09-04 14:27 GMT-07:00 craig w <codecr...@gmail.com>:
>>>>>
>>>>>> Mesosphere has pack

Re: Basic installation question

2015-09-04 Thread Marco Massenzio
I think you are looking into the wrong bin/ folder (the one under top-level
mesos/) - the actual binaries are in ${MESOS_HOME}/bin/build

I am positive that the instructions work on CentOS 7.1 as I had to run all
those recently on a VM of mine.

BTW - If you are looking for the libmesos and various includes, they will
be under /usr/local (you can change that by using something like:

../configure --prefix /path/to/install/dir



*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Fri, Sep 4, 2015 at 2:10 PM, Stephen Boesch <java...@gmail.com> wrote:

>
> After following the directions here:
> http://mesos.apache.org/gettingstarted/
>
> Which for centos7 includes the following:
>
>
>
>
>   # Change working directory.
> $ cd mesos
>
> # Bootstrap (Only required if building from git repository).
> $ ./bootstrap
>
> # Configure and build.
> $ mkdir build
> $ cd build
> $ ../configure
> $ make
>
> In order to speed up the build and reduce verbosity of the logs, you can
> append -j <number of cores> V=0 to make.
>
> # Run test suite.
> $ make check
>
> # Install (Optional).
> $ make install
>
>
>
> But the installation is not correct afterwards: here is the bin directory:
>
> $ ll bin
> total 92
> -rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-tests.sh.in
> -rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-slave.sh.in
> -rw-r--r--.  1 stack stack 1772 Jul 17 23:14 valgrind-mesos-master.sh.in
> -rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-local.sh.in
> -rw-r--r--.  1 stack stack 1026 Jul 17 23:14 mesos-tests.sh.in
> -rw-r--r--.  1 stack stack  901 Jul 17 23:14 mesos-tests-flags.sh.in
> -rw-r--r--.  1 stack stack 1019 Jul 17 23:14 mesos-slave.sh.in
> -rw-r--r--.  1 stack stack 1721 Jul 17 23:14 mesos-slave-flags.sh.in
> -rw-r--r--.  1 stack stack 1366 Jul 17 23:14 mesos.sh.in
> -rw-r--r--.  1 stack stack 1026 Jul 17 23:14 mesos-master.sh.in
> -rw-r--r--.  1 stack stack  858 Jul 17 23:14 mesos-master-flags.sh.in
> -rw-r--r--.  1 stack stack 1023 Jul 17 23:14 mesos-local.sh.in
> -rw-r--r--.  1 stack stack  935 Jul 17 23:14 mesos-local-flags.sh.in
> -rw-r--r--.  1 stack stack 1466 Jul 17 23:14 lldb-mesos-tests.sh.in
> -rw-r--r--.  1 stack stack 1489 Jul 17 23:14 lldb-mesos-slave.sh.in
> -rw-r--r--.  1 stack stack 1492 Jul 17 23:14 lldb-mesos-master.sh.in
> -rw-r--r--.  1 stack stack 1489 Jul 17 23:14 lldb-mesos-local.sh.in
> -rw-r--r--.  1 stack stack 1498 Jul 17 23:14 gdb-mesos-tests.sh.in
> -rw-r--r--.  1 stack stack 1527 Jul 17 23:14 gdb-mesos-slave.sh.in
> -rw-r--r--.  1 stack stack 1530 Jul 17 23:14 gdb-mesos-master.sh.in
> -rw-r--r--.  1 stack stack 1521 Jul 17 23:14 gdb-mesos-local.sh.in
> drwxr-xr-x.  2 stack stack 4096 Jul 17 23:21 .
> drwxr-xr-x. 11 stack stack 4096 Sep  4 20:08 ..
>
> So .. two things:
>
> (a) what is missing from the installation instructions?
>
> (b) Is there an *up to date *rpm/yum installation for centos7?
>
>
>
>
>
>
>


Re: Basic installation question

2015-09-04 Thread Marco Massenzio
argh - sorry!

${MESOS_HOME}/build/bin
(I'd mixed the two around)

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Fri, Sep 4, 2015 at 2:39 PM, Marco Massenzio <ma...@mesosphere.io> wrote:

> I think you are looking into the wrong bin/ folder (the one under
> top-level mesos/) - the actual binaries are in ${MESOS_HOME}/bin/build
>
> I am positive that the instructions work on CentOS 7.1 as I had to run all
> those recently on a VM of mine.
>
> BTW - If you are looking for the libmesos and various includes, they will
> be under /usr/local (you can change that by using something like:
>
> ../configure --prefix /path/to/install/dir
>
>
>
> *Marco Massenzio*
>
> *Distributed Systems Engineer*
> http://codetrips.com
>
> On Fri, Sep 4, 2015 at 2:10 PM, Stephen Boesch <java...@gmail.com> wrote:
>
>>
>> After following the directions here:
>> http://mesos.apache.org/gettingstarted/
>>
>> Which for centos7 includes the following:
>>
>>
>>
>>
>>   # Change working directory.
>> $ cd mesos
>>
>> # Bootstrap (Only required if building from git repository).
>> $ ./bootstrap
>>
>> # Configure and build.
>> $ mkdir build
>> $ cd build
>> $ ../configure
>> $ make
>>
>> In order to speed up the build and reduce verbosity of the logs, you can
>> append -j <number of cores> V=0 to make.
>>
>> # Run test suite.
>> $ make check
>>
>> # Install (Optional).
>> $ make install
>>
>>
>>
>> But the installation is not correct afterwards: here is the bin directory:
>>
>> $ ll bin
>> total 92
>> -rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-tests.sh.in
>> -rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-slave.sh.in
>> -rw-r--r--.  1 stack stack 1772 Jul 17 23:14 valgrind-mesos-master.sh.in
>> -rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-local.sh.in
>> -rw-r--r--.  1 stack stack 1026 Jul 17 23:14 mesos-tests.sh.in
>> -rw-r--r--.  1 stack stack  901 Jul 17 23:14 mesos-tests-flags.sh.in
>> -rw-r--r--.  1 stack stack 1019 Jul 17 23:14 mesos-slave.sh.in
>> -rw-r--r--.  1 stack stack 1721 Jul 17 23:14 mesos-slave-flags.sh.in
>> -rw-r--r--.  1 stack stack 1366 Jul 17 23:14 mesos.sh.in
>> -rw-r--r--.  1 stack stack 1026 Jul 17 23:14 mesos-master.sh.in
>> -rw-r--r--.  1 stack stack  858 Jul 17 23:14 mesos-master-flags.sh.in
>> -rw-r--r--.  1 stack stack 1023 Jul 17 23:14 mesos-local.sh.in
>> -rw-r--r--.  1 stack stack  935 Jul 17 23:14 mesos-local-flags.sh.in
>> -rw-r--r--.  1 stack stack 1466 Jul 17 23:14 lldb-mesos-tests.sh.in
>> -rw-r--r--.  1 stack stack 1489 Jul 17 23:14 lldb-mesos-slave.sh.in
>> -rw-r--r--.  1 stack stack 1492 Jul 17 23:14 lldb-mesos-master.sh.in
>> -rw-r--r--.  1 stack stack 1489 Jul 17 23:14 lldb-mesos-local.sh.in
>> -rw-r--r--.  1 stack stack 1498 Jul 17 23:14 gdb-mesos-tests.sh.in
>> -rw-r--r--.  1 stack stack 1527 Jul 17 23:14 gdb-mesos-slave.sh.in
>> -rw-r--r--.  1 stack stack 1530 Jul 17 23:14 gdb-mesos-master.sh.in
>> -rw-r--r--.  1 stack stack 1521 Jul 17 23:14 gdb-mesos-local.sh.in
>> drwxr-xr-x.  2 stack stack 4096 Jul 17 23:21 .
>> drwxr-xr-x. 11 stack stack 4096 Sep  4 20:08 ..
>>
>> So .. two things:
>>
>> (a) what is missing from the installation instructions?
>>
>> (b) Is there an *up to date *rpm/yum installation for centos7?
>>
>>
>>
>>
>>
>>
>>
>


Re: Basic installation question

2015-09-04 Thread Marco Massenzio
Hey Stephen,

the Mesos packages for download from Mesosphere are available here:
https://mesosphere.com/downloads/
(for Mesos, just click on the Getting Started button - sorry, no direct URL
- it will show the steps to install on the supported distros using
apt-get/yum).

Those work and I obviously recommend them :)
But I think you wanted the "full developer experience" as you pointed to
the make steps.

Also, if you haven't looked at the tutorials in a while (as you seem to
imply in your message) I would recommend you give them another shot: we've
been doing some work on revamping them and making them more accessible.



*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Fri, Sep 4, 2015 at 2:38 PM, Stephen Boesch <java...@gmail.com> wrote:

> @Craig . That is an incomplete answer - given that such links are not
> presented in an obvious manner .  Maybe you managed to find  a link on
> their site that provides prebuilt for Centos7: if so then please share it.
>
>
> I had previously found a link on their site for prebuilt binaries but is
> based on using CDH4 (which is not possible for my company). It is also old.
>
> https://docs.mesosphere.com/tutorials/install_centos_rhel/
>
>
> 2015-09-04 14:27 GMT-07:00 craig w <codecr...@gmail.com>:
>
>> Mesosphere has packages prebuilt, go to their site to find how to install
>> On Sep 4, 2015 5:11 PM, "Stephen Boesch" <java...@gmail.com> wrote:
>>
>>>
>>> After following the directions here:
>>> http://mesos.apache.org/gettingstarted/
>>>
>>> Which for centos7 includes the following:
>>>
>>>
>>>
>>>
>>>   # Change working directory.
>>> $ cd mesos
>>>
>>> # Bootstrap (Only required if building from git repository).
>>> $ ./bootstrap
>>>
>>> # Configure and build.
>>> $ mkdir build
>>> $ cd build
>>> $ ../configure
>>> $ make
>>>
>>> In order to speed up the build and reduce verbosity of the logs, you can
>>> append -j <number of cores> V=0 to make.
>>>
>>> # Run test suite.
>>> $ make check
>>>
>>> # Install (Optional).
>>> $ make install
>>>
>>>
>>>
>>> But the installation is not correct afterwards: here is the bin
>>> directory:
>>>
>>> $ ll bin
>>> total 92
>>> -rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-tests.sh.in
>>> -rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-slave.sh.in
>>> -rw-r--r--.  1 stack stack 1772 Jul 17 23:14 valgrind-mesos-master.sh.in
>>> -rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-local.sh.in
>>> -rw-r--r--.  1 stack stack 1026 Jul 17 23:14 mesos-tests.sh.in
>>> -rw-r--r--.  1 stack stack  901 Jul 17 23:14 mesos-tests-flags.sh.in
>>> -rw-r--r--.  1 stack stack 1019 Jul 17 23:14 mesos-slave.sh.in
>>> -rw-r--r--.  1 stack stack 1721 Jul 17 23:14 mesos-slave-flags.sh.in
>>> -rw-r--r--.  1 stack stack 1366 Jul 17 23:14 mesos.sh.in
>>> -rw-r--r--.  1 stack stack 1026 Jul 17 23:14 mesos-master.sh.in
>>> -rw-r--r--.  1 stack stack  858 Jul 17 23:14 mesos-master-flags.sh.in
>>> -rw-r--r--.  1 stack stack 1023 Jul 17 23:14 mesos-local.sh.in
>>> -rw-r--r--.  1 stack stack  935 Jul 17 23:14 mesos-local-flags.sh.in
>>> -rw-r--r--.  1 stack stack 1466 Jul 17 23:14 lldb-mesos-tests.sh.in
>>> -rw-r--r--.  1 stack stack 1489 Jul 17 23:14 lldb-mesos-slave.sh.in
>>> -rw-r--r--.  1 stack stack 1492 Jul 17 23:14 lldb-mesos-master.sh.in
>>> -rw-r--r--.  1 stack stack 1489 Jul 17 23:14 lldb-mesos-local.sh.in
>>> -rw-r--r--.  1 stack stack 1498 Jul 17 23:14 gdb-mesos-tests.sh.in
>>> -rw-r--r--.  1 stack stack 1527 Jul 17 23:14 gdb-mesos-slave.sh.in
>>> -rw-r--r--.  1 stack stack 1530 Jul 17 23:14 gdb-mesos-master.sh.in
>>> -rw-r--r--.  1 stack stack 1521 Jul 17 23:14 gdb-mesos-local.sh.in
>>> drwxr-xr-x.  2 stack stack 4096 Jul 17 23:21 .
>>> drwxr-xr-x. 11 stack stack 4096 Sep  4 20:08 ..
>>>
>>> So .. two things:
>>>
>>> (a) what is missing from the installation instructions?
>>>
>>> (b) Is there an *up to date *rpm/yum installation for centos7?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>


Re: [VOTE] Release Apache Mesos 0.24.0 (rc2)

2015-09-02 Thread Marco Massenzio
+1 (non-binding)

All tests (including ROOT) pass on Ubuntu 14.04
All tests pass on CentOS 7.1; ROOT tests cause 1 failure:

[  FAILED  ] 1 test, listed below:
[  FAILED  ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS

$ cat /etc/centos-release
CentOS Linux release 7.1.1503 (Core)

This seems to be new[0], but possibly related to some limitation/setting of
my test machine (VirtualBox VM, running 2 CPUs on Ubuntu host).
Interestingly enough, I don't see the 4 failures that Vaibhav does, but my log
shows *YOU HAVE 11 DISABLED TESTS* (he has 12).

[0] https://issues.apache.org/jira/issues/?filter=12333150

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Tue, Sep 1, 2015 at 5:45 PM, Vinod Kone <vinodk...@apache.org> wrote:

> Hi all,
>
>
> Please vote on releasing the following candidate as Apache Mesos 0.24.0.
>
>
> 0.24.0 includes the following:
>
>
> 
>
> Experimental support for v1 scheduler HTTP API!
>
> This release also wraps up support for fetcher.
>
> The CHANGELOG for the release is available at:
>
>
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.24.0-rc2
>
>
> 
>
>
> The candidate for Mesos 0.24.0 release is available at:
>
> https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz
>
>
> The tag to be voted on is 0.24.0-rc2:
>
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.24.0-rc2
>
>
> The MD5 checksum of the tarball can be found at:
>
>
> https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz.md5
>
>
> The signature of the tarball can be found at:
>
>
> https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz.asc
>
>
> The PGP key used to sign the release is here:
>
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
>
> The JAR is up in Maven in a staging repository here:
>
> https://repository.apache.org/content/repositories/orgapachemesos-1066
>
>
> Please vote on releasing this package as Apache Mesos 0.24.0!
>
>
> The vote is open until Fri Sep  4 17:33:05 PDT 2015 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
>
> [ ] +1 Release this package as Apache Mesos 0.24.0
>
> [ ] -1 Do not release this package because ...
>
>
> Thanks,
>
> Vinod
>
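For anyone verifying a release candidate by hand, the checksum comparison implied by the .md5 link above can be scripted. This is a generic sketch (the dist.apache.org URLs in the vote email remain the authoritative source); it hashes a locally downloaded copy of the tarball and compares it against the published digest:

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks so
    large tarballs don't have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_digest):
    """Compare a local tarball's digest against the published .md5 value,
    tolerating trailing whitespace and case differences."""
    return file_md5(path) == published_digest.strip().lower()
```

(The PGP signature check is separate: `gpg --verify mesos-0.24.0.tar.gz.asc` after importing the KEYS file.)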


Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread Marco Massenzio
@Steven - agreed!
As mentioned, if we can reduce the "footprint of unnecessary CHECKs" (so to
speak) I'm all for it - let's document and add Jiras for that, by all means.

@Scott - LoL: you certainly didn't; I was more worried my email would ;-)

Thanks, guys!

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Wed, Sep 2, 2015 at 10:59 AM, Steven Schlansker <
sschlans...@opentable.com> wrote:

> I 100% agree with your philosophy here, and I suspect it's something
> shared in the Mesos community.
>
> I just think that we can restrict the domain of the failure to a smaller
> reasonable window -- once you are in the context of "I am doing work to
> launch a specific task", there is a well defined "success / failure / here
> is an error message" path defined already.  Users expect tasks to fail and
> can see the errors.
>
> I think that a lot of these assertions are in fact more appropriate as
> task failures.  But I agree that they should be fatal to *some* part of the
> system, just not the whole agent entirely.
>
> On Sep 1, 2015, at 4:33 PM, Marco Massenzio <ma...@mesosphere.io> wrote:
>
> > That's one of those areas for discussions that is so likely to generate
> a flame war that I'm hesitant to wade in :)
> >
> > In general, I would agree with the sentiment expressed there:
> >
> > > If the task fails, that is unfortunate, but not the end of the world.
> Other tasks should not be affected.
> >
> > which is, in fact, to large extent exactly what Mesos does; the example
> given in MESOS-2684, as it happens, is for a "disk full failure" - carrying
> on as if nothing had happened, is only likely to lead to further (and
> worse) disappointment.
> >
> > The general philosophy back at Google (and which certainly informs the
> design of Borg[0]) was "fail early, fail hard" so that either (a) the
> service is restarted and hopefully the root cause cleared or (b) someone
> (who can hopefully do something) will be alerted about it.
> >
> > I think it's ultimately a matter of scale: up to a few tens of servers,
> you can assume there is some sort of 'log-monitor' that looks out for
> errors and other anomalies and alerts humans that will then take a look and
> possibly apply some corrective action - when you're up to hundreds or
> thousands (definitely Mesos territory) that's not practical: the system
> should either self-heal or crash-and-restart.
> >
> > All this to say, that it's difficult to come up with a general
> *automated* approach to unequivocally decide if a failure is "fatal" or
> could just be safely "ignored" (after appropriate error logging) - in
> general, when in doubt it's probably safer to "noisily crash & restart" and
> rely on the overall system's HA architecture to take care of replication
> and consistency.
> > (and an intelligent monitoring system that only alerts when some failure
> threshold is exceeded).
> >
> > From what I've seen so far (granted, still a novice here) it seems that
> Mesos subscribes to this notion, assuming that Agent Nodes will come and
> go, and usually Tasks survive (for a certain amount of time anyway) a Slave
> restart (obviously, if the physical h/w is the ultimate cause of failure,
> well, then all bets are off).
> >
> > Having said all that - if there are areas where we have been over-eager
> with our CHECKs, we should definitely revisit that and make it more
> crash-resistant, absolutely.
> >
> > [0] http://research.google.com/pubs/pub43438.html
> >
> > Marco Massenzio
> > Distributed Systems Engineer
> > http://codetrips.com
> >
> > On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
> sschlans...@opentable.com> wrote:
> >
> >
> > On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
> > >
> > > tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354]
> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
> >
> > I reported a similar bug a while back:
> >
> > https://issues.apache.org/jira/browse/MESOS-2684
> >
> > This seems to be a class of bugs where some filesystem operations which
> may fail for unforeseen reasons are written as assertions which crash the
> process, rather than failing only the task and communicating back the error
> reason.
> >
> >
> >
>
>
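The alternative Steven argues for in this thread (surfacing a filesystem error as a failure of the one affected task, rather than letting a CHECK-style assertion crash the whole agent) can be sketched generically. This is illustrative Python, not the actual Mesos C++ code paths; the status dictionary and state names merely mimic Mesos task states:

```python
import os


def launch_task(task_id, sandbox_path):
    """Attempt the filesystem setup for one task. On failure, return a
    task-level error status instead of aborting the whole process, which
    is what a CHECK_SOME-style assertion would do."""
    try:
        # Rough equivalent of os::touch(path): create a sandbox marker file.
        with open(os.path.join(sandbox_path, ".launched"), "w"):
            pass
    except OSError as e:
        # Surface the error as a task failure the framework can observe;
        # other tasks on the agent are unaffected.
        return {"task_id": task_id, "state": "TASK_FAILED", "message": str(e)}
    return {"task_id": task_id, "state": "TASK_RUNNING", "message": ""}
```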


Re: [VOTE] Release Apache Mesos 0.24.0 (rc1)

2015-09-01 Thread Marco Massenzio
Hey guys,

just a quick note to bring the conversation back on track for the 0.24-RC1
release.
Is my understanding correct that there are currently no binding -1's?

@Vinod: what do you think, are we good to release?

Thanks!

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Tue, Sep 1, 2015 at 1:49 AM, Dario Rexin <dario.re...@me.com> wrote:

> One more question. From the Mesos code it doesn’t look like events are
> being split or combined, so given I have a client that gives me access to
> the individual chunks, is it safe to assume that each chunk contains
> exactly one event? Because that would make parsing the events a lot easier
> for me.
>
> Thanks,
> Dario
>
> On Sep 1, 2015, at 8:42 AM, dario.re...@me.com wrote:
>
> Hi Vinod,
>
> thanks for the explanation, I got it now.
>
> Thanks,
> Dario
>
> On 31.08.2015, at 23:47, Vinod Kone <vinodk...@apache.org> wrote:
>
> I think you might be confused with the HTTP chunked encoding and RecordIO
> encoding. Most HTTP client libraries dechunk the stream before presenting
> it to the application. So the application needs to know the encoding of the
> dechunked data to be able to process it.
>
> In Mesos's case, the server (master here) can encode it in JSON or
> Protobuf. We wanted to have a consistent way to encode both these formats
> and Record-IO format was the one we settled on. Note that this format is
> also used by the Twitter streaming API
> <https://dev.twitter.com/streaming/overview/processing> (see delimited
> messages section).
>
> HTH,
>
> On Mon, Aug 31, 2015 at 2:09 PM, Dario Rexin <dario.re...@me.com> wrote:
>
>> Hi Vino,
>>
>> On Aug 31, 2015, at 9:36 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>
>> Hi Dario,
>>
>> Can you test with "curl --no-buffer" option? Looks like your stdout might
>> be line-buffered.
>>
>>
>> that did the trick, thanks!
>>
>>
>> The reason we used record-io formatting is to be consistent in how we
>> stream protobuf and json encoded data.
>>
>>
>> How does simple chunked encoding prevent you from doing this?
>>
>> Thanks,
>> Dario
>>
>> On Fri, Aug 28, 2015 at 2:04 PM, <dario.re...@me.com> wrote:
>>
>>> Anand,
>>>
>>> thanks for the explanation. I didn't think about the case when you have
>>> to split a message, now it makes sense.
>>>
>>> But the case I observed with curl is still weird. Even when splitting a
>>> message, it should still receive both parts almost at the same time. Do you
>>> have any idea why it could behave like this?
>>>
>>> On 28.08.2015, at 21:31, Anand Mazumdar <an...@mesosphere.io> wrote:
>>>
>>> Dario,
>>>
>>> Most HTTP libraries/parsers ( including one that Mesos uses internally )
>>> provide a way to specify a default size of each chunk. If a Mesos Event is
>>> too big , it would get split into smaller chunks and vice-versa.
>>>
>>> -anand
>>>
>>> On Aug 28, 2015, at 11:51 AM, dario.re...@me.com wrote:
>>>
>>> Anand,
>>>
>>> in the example from my first mail you can see that curl prints the size
>>> of a message and then waits for the next message and only when it receives
>>> that message it will print the prior message plus the size of the next
>>> message, but not the actual message.
>>>
>>> What's the benefit of encoding multiple messages in a single chunk? You
>>> could simply create a single chunk per event.
>>>
>>> Cheers,
>>> Dario
>>>
>>> On 28.08.2015, at 19:43, Anand Mazumdar <an...@mesosphere.io> wrote:
>>>
>>> Dario,
>>>
>>> Can you shed a bit more light on what you still find puzzling about the
>>> CURL behavior after my explanation ?
>>>
>>> PS: A single HTTP chunk can have 0 or more Mesos (Scheduler API) Events.
>>> So in your example, the first chunk had complete information about the
>>> first “event”, followed by partial information about the subsequent event
>>> from another chunk.
>>>
>>> As for the benefit of using RecordIO format here, how else do you think
>>> we could have de-marcated two events in the response ?
>>>
>>> -anand
>>>
>>>
>>> On Aug 28, 2015, at 10:01 AM, dario.re...@me.com wrote:
>>>
>>> Anand,
>>>
>>> thanks for the explanation. I'm still a little puzzled why curl behaves
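For reference, the Record-IO framing discussed throughout this thread prefixes each event with its length in bytes, written as a base-ten ASCII string terminated by a newline. A minimal decoder, sketched here independently of any HTTP chunking concerns (assume the full byte stream has already been dechunked by the HTTP client), could look like:

```python
def decode_recordio(data):
    """Split a Record-IO byte stream ("<length>\n<record>" repeated)
    into a list of raw record payloads."""
    records, pos = [], 0
    while pos < len(data):
        newline = data.index(b"\n", pos)      # end of the ASCII length prefix
        length = int(data[pos:newline])       # decimal byte count of the record
        start = newline + 1
        records.append(data[start:start + length])
        pos = start + length                  # next record starts right after
    return records
```

A real client would buffer partial reads, since a record boundary need not coincide with a chunk boundary.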
>>>

Re: [VOTE] Release Apache Mesos 0.24.0 (rc1)

2015-09-01 Thread Marco Massenzio
Cool - I'll ping Joseph on that one.

(the -1 from Nik was related to the known ROOT test issues that -if memory
serves- we agreed were non-blocking: I'll follow up with him too)

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Tue, Sep 1, 2015 at 10:49 AM, Vinod Kone <vinodk...@apache.org> wrote:

> Thanks for the nudge Marco.
>
> There was a binding -1 from Niklas.
>
> I'm planning to cut an RC2. The cherry picks I've selected so far are in
> MESOS-2562 <https://issues.apache.org/jira/browse/MESOS-2562>.
>
> The only one I'm currently waiting on to get a resolution for is
> https://issues.apache.org/jira/browse/MESOS-3345.
>
> On Tue, Sep 1, 2015 at 10:44 AM, Marco Massenzio <ma...@mesosphere.io>
> wrote:
>
>> Hey guys,
>>
>> just a quick note to bring back the conversation on track to the 0.24-RC1
>> release.
>> Is my understanding correct that there are currently no binding -1's?
>>
>> @Vinod: what do you think, are we good to release?
>>
>> Thanks!
>>
>> *Marco Massenzio*
>>
>> *Distributed Systems Engineer* http://codetrips.com
>>
>> On Tue, Sep 1, 2015 at 1:49 AM, Dario Rexin <dario.re...@me.com> wrote:
>>
>>> One more question. From the Mesos code it doesn’t look like events are
>>> being split or combined, so given I have a client that gives me access to
>>> the individual chunks, is it safe to assume that each chunk contains
>>> exactly one event? Because that would make parsing the events a lot easier
>>> for me.
>>>
>>> Thanks,
>>> Dario
>>>
>>> On Sep 1, 2015, at 8:42 AM, dario.re...@me.com wrote:
>>>
>>> Hi Vinod,
>>>
>>> thanks for the explanation, I got it now.
>>>
>>> Thanks,
>>> Dario
>>>
>>> On 31.08.2015, at 23:47, Vinod Kone <vinodk...@apache.org> wrote:
>>>
>>> I think you might be confused with the HTTP chunked encoding and
>>> RecordIO encoding. Most HTTP client libraries dechunk the stream before
>>> presenting it to the application. So the application needs to know the
>>> encoding of the dechunked data to be able to process it.
>>>
>>> In Mesos's case, the server (master here) can encode it in JSON or
>>> Protobuf. We wanted to have a consistent way to encode both these formats
>>> and Record-IO format was the one we settled on. Note that this format is
>>> also used by the Twitter streaming API
>>> <https://dev.twitter.com/streaming/overview/processing> (see delimited
>>> messages section).
>>>
>>> HTH,
>>>
>>> On Mon, Aug 31, 2015 at 2:09 PM, Dario Rexin <dario.re...@me.com> wrote:
>>>
>>>> Hi Vino,
>>>>
>>>> On Aug 31, 2015, at 9:36 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>>>
>>>> Hi Dario,
>>>>
>>>> Can you test with "curl --no-buffer" option? Looks like your stdout
>>>> might be line-buffered.
>>>>
>>>>
>>>> that did the trick, thanks!
>>>>
>>>>
>>>> The reason we used record-io formatting is to be consistent in how we
>>>> stream protobuf and json encoded data.
>>>>
>>>>
>>>> How does simple chunked encoding prevent you from doing this?
>>>>
>>>> Thanks,
>>>> Dario
>>>>
>>>> On Fri, Aug 28, 2015 at 2:04 PM, <dario.re...@me.com> wrote:
>>>>
>>>>> Anand,
>>>>>
>>>>> thanks for the explanation. I didn't think about the case when you
>>>>> have to split a message, now it makes sense.
>>>>>
>>>>> But the case I observed with curl is still weird. Even when splitting
>>>>> a message, it should still receive both parts almost at the same time. Do
>>>>> you have any idea why it could behave like this?
>>>>>
>>>>> On 28.08.2015, at 21:31, Anand Mazumdar <an...@mesosphere.io> wrote:
>>>>>
>>>>> Dario,
>>>>>
>>>>> Most HTTP libraries/parsers ( including one that Mesos uses internally
>>>>> ) provide a way to specify a default size of each chunk. If a Mesos Event
>>>>> is too big , it would get split into smaller chunks and vice-versa.
>>>>>
>>>>> -anand
>>>>>
>>>>> On Aug 28, 2015, at 11:51 AM, dario.

Re: [VOTE] Release Apache Mesos 0.24.0 (rc1)

2015-09-01 Thread Marco Massenzio
Awesome - we'll be running RC2 through our CI env and will let you know the
outcome as soon as we have it.
Thanks!

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Tue, Sep 1, 2015 at 11:42 AM, Vinod Kone <vinodk...@apache.org> wrote:

> My only concern is that if we decide to change the protobuf -> json
> conversion for int64 (from json number to string), we should do that in the
> scheduler http api as well (Resource protobuf uses int64 for ports).
>
> But since the scheduler http api is labeled beta for 0.24, we can still
> change the semantics in 0.25.
>
> So, I'll go ahead and call the vote for RC2 today.
>
> On Tue, Sep 1, 2015 at 11:05 AM, Marco Massenzio <ma...@mesosphere.io>
> wrote:
>
>> Cool - I'll ping Joseph on that one.
>>
>> (the -1 from Nik was related to the known ROOT test issues that -if
>> memory serves- we agreed were non-blocking: I'll follow up with him too)
>>
>> *Marco Massenzio*
>>
>> *Distributed Systems Engineer* http://codetrips.com
>>
>> On Tue, Sep 1, 2015 at 10:49 AM, Vinod Kone <vinodk...@apache.org> wrote:
>>
>>> Thanks for the nudge Marco.
>>>
>>> There was a binding -1 from Niklas.
>>>
>>> I'm planning to cut an RC2. The cherry picks I've selected so far are in
>>> MESOS-2562 <https://issues.apache.org/jira/browse/MESOS-2562>.
>>>
>>> The only one I'm currently waiting on to get a resolution for is
>>> https://issues.apache.org/jira/browse/MESOS-3345.
>>>
>>> On Tue, Sep 1, 2015 at 10:44 AM, Marco Massenzio <ma...@mesosphere.io>
>>> wrote:
>>>
>>>> Hey guys,
>>>>
>>>> just a quick note to bring back the conversation on track to the
>>>> 0.24-RC1 release.
>>>> Is my understanding correct that there are currently no binding -1's?
>>>>
>>>> @Vinod: what do you think, are we good to release?
>>>>
>>>> Thanks!
>>>>
>>>> *Marco Massenzio*
>>>>
>>>> *Distributed Systems Engineer* http://codetrips.com
>>>>
>>>> On Tue, Sep 1, 2015 at 1:49 AM, Dario Rexin <dario.re...@me.com> wrote:
>>>>
>>>>> One more question. From the Mesos code it doesn’t look like events are
>>>>> being split or combined, so given I have a client that gives me access to
>>>>> the individual chunks, is it safe to assume that each chunk contains
>>>>> exactly one event? Because that would make parsing the events a lot easier
>>>>> for me.
>>>>>
>>>>> Thanks,
>>>>> Dario
>>>>>
>>>>> On Sep 1, 2015, at 8:42 AM, dario.re...@me.com wrote:
>>>>>
>>>>> Hi Vinod,
>>>>>
>>>>> thanks for the explanation, I got it now.
>>>>>
>>>>> Thanks,
>>>>> Dario
>>>>>
>>>>> On 31.08.2015, at 23:47, Vinod Kone <vinodk...@apache.org> wrote:
>>>>>
>>>>> I think you might be confused with the HTTP chunked encoding and
>>>>> RecordIO encoding. Most HTTP client libraries dechunk the stream before
>>>>> presenting it to the application. So the application needs to know the
>>>>> encoding of the dechunked data to be able to process it.
>>>>>
>>>>> In Mesos's case, the server (master here) can encode it in JSON or
>>>>> Protobuf. We wanted to have a consistent way to encode both these formats
>>>>> and Record-IO format was the one we settled on. Note that this format is
>>>>> also used by the Twitter streaming API
>>>>> <https://dev.twitter.com/streaming/overview/processing> (see
>>>>> delimited messages section).
>>>>>
>>>>> HTH,
>>>>>
>>>>> On Mon, Aug 31, 2015 at 2:09 PM, Dario Rexin <dario.re...@me.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Vino,
>>>>>>
>>>>>> On Aug 31, 2015, at 9:36 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>>>>>
>>>>>> Hi Dario,
>>>>>>
>>>>>> Can you test with "curl --no-buffer" option? Looks like your stdout
>>>>>> might be line-buffered.
>>>>>>
>>>>>>
>>>>>> that did the trick, thanks!
>>

Re: mesos-slave crashing with CHECK_SOME

2015-09-01 Thread Marco Massenzio
That's one of those areas for discussions that is so likely to generate a
flame war that I'm hesitant to wade in :)

In general, I would agree with the sentiment expressed there:

> If the task fails, that is unfortunate, but not the end of the world.
Other tasks should not be affected.

which is, in fact, to a large extent exactly what Mesos does; the example
given in MESOS-2684, as it happens, is for a "disk full" failure - carrying
on as if nothing had happened is only likely to lead to further (and
worse) disappointment.

The general philosophy back at Google (and which certainly informs the
design of Borg[0]) was "fail early, fail hard" so that either (a) the
service is restarted and hopefully the root cause cleared or (b) someone
(who can hopefully do something) will be alerted about it.

I think it's ultimately a matter of scale: up to a few tens of servers, you
can assume there is some sort of 'log-monitor' that looks out for errors
and other anomalies and alerts humans that will then take a look and
possibly apply some corrective action - when you're up to hundreds or
thousands (definitely Mesos territory) that's not practical: the system
should either self-heal or crash-and-restart.

All this to say: it's difficult to come up with a general *automated*
approach to unequivocally decide if a failure is "fatal" or could just be
safely "ignored" (after appropriate error logging) - in general, when in
doubt it's probably safer to "noisily crash & restart" and rely on the
overall system's HA architecture to take care of replication and
consistency.
(and an intelligent monitoring system that only alerts when some failure
threshold is exceeded).

From what I've seen so far (granted, still a novice here) it seems that
Mesos subscribes to this notion, assuming that Agent Nodes will come and
go, and usually Tasks survive (for a certain amount of time anyway) a Slave
restart (obviously, if the physical h/w is the ultimate cause of failure,
well, then all bets are off).

Having said all that - if there are areas where we have been over-eager
with our CHECKs, we should definitely revisit that and make it more
crash-resistant, absolutely.

[0] http://research.google.com/pubs/pub43438.html

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
sschlans...@opentable.com> wrote:

>
>
> On Aug 31, 2015, at 11:54 AM, Scott Rankin <sran...@motus.com> wrote:
> >
> > tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354]
> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
>
> I reported a similar bug a while back:
>
> https://issues.apache.org/jira/browse/MESOS-2684
>
> This seems to be a class of bugs where some filesystem operations which
> may fail for unforeseen reasons are written as assertions which crash the
> process, rather than failing only the task and communicating back the error
> reason.
>
>
>


Re: Prepping for next release

2015-09-01 Thread Marco Massenzio
Uhm, that's a tricky one...
Considering that JDK6 was EOL'd in 2011[0] and even JDK7 is now officially
out of support from Oracle, I don't think this should be a major issue?

I'm also assuming that, if anyone really needs JDK6, they can build
from source by simply running `mvn package` and replacing the JAR?
(not terribly familiar with our build process, so no idea if that would
work at all).

[0] http://www.oracle.com/technetwork/java/eol-135779.html

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Tue, Sep 1, 2015 at 4:46 PM, Vinod Kone <vinodk...@apache.org> wrote:

> +user
>
> So looks like this issue is related to JDK6 and not my maven password
> settings.
>
> Related ASF ticket: https://issues.apache.org/jira/browse/BUILDS-85
>
> The reason it worked for me, when I tagged RC1, was because I also pointed
> my maven to use JDK7.
>
> So we have couple options here:
>
> #1) (Easy) Do same thing with RC2 as we did for RC1. This does mean the
> artifacts we upload to nexus will be compiled with JDK7. IIUC, if any JVM
> based frameworks are still on JDK6 they can't link in the new artifacts?
>
> #2) (Harder) As mentioned in the ticket, have maven compile Mesos jar with
> JDK6 but use JDK7 when uploading. Not sure how easy it is to adapt our
> Mesos build tool chain for this. Anyone has expertise in this area?
>
> Thoughts?
>
>
> On Tue, Aug 18, 2015 at 3:14 PM, Vinod Kone <vinodk...@apache.org> wrote:
>
> > I re-encrypted the maven passwords and that seemed to have done the
> trick.
> > Thanks Adam!
> >
> > On Tue, Aug 18, 2015 at 1:59 PM, Adam Bordelon <a...@mesosphere.io>
> wrote:
> >
> >> Update your ~/.m2/settings.xml?
> >> Also check that the output of `gpg --list-keys` and `--list-sigs`
> matches
> >> the keypair you expect
> >>
> >> On Tue, Aug 18, 2015 at 1:48 PM, Vinod Kone <vinodk...@apache.org>
> wrote:
> >>
> >> > I definitely had to create a new gpg key because my previous one
> >> expired! I
> >> > uploaded them id.apache and our SVN repo containing KEYS.
> >> >
> >> > Do I need to do anything specific for maven?
> >> >
> >> > On Tue, Aug 18, 2015 at 1:25 PM, Adam Bordelon <a...@mesosphere.io>
> >> wrote:
> >> >
> >> > > Haven't seen that one. Are you sure you've got your gpg key properly
> >> set
> >> > up
> >> > > with Maven?
> >> > >
> >> > > On Tue, Aug 18, 2015 at 1:13 PM, Vinod Kone <vinodk...@apache.org>
> >> > wrote:
> >> > >
> >> > > > I'm getting the following error when running ./support/tag.sh. Has
> >> any
> >> > of
> >> > > > the recent release managers seen this one before?
> >> > > >
> >> > > > [ERROR] Failed to execute goal
> >> > > > org.apache.maven.plugins:maven-deploy-plugin:2.7:deploy
> >> > (default-deploy)
> >> > > on
> >> > > > project mesos: Failed to deploy artifacts: Could not transfer
> >> artifact
> >> > > > org.apache.mesos:mesos:jar:0.24.0-rc1 from/to
> apache.releases.https
> >> (
> >> > > > https://repository.apache.org/service/local/staging/deploy/maven2
> ):
> >> > > > java.lang.RuntimeException: Could not generate DH keypair: Prime
> >> size
> >> > > must
> >> > > > be multiple of 64, and can only range from 512 to 1024 (inclusive)
> >> ->
> >> > > [Help
> >> > > > 1]
> >> > > >
> >> > > > On Mon, Aug 17, 2015 at 11:23 AM, Vinod Kone <
> vinodk...@apache.org>
> >> > > wrote:
> >> > > >
> >> > > > > Update:
> >> > > > >
> >> > > > > There are 3 outstanding tickets (all related to flaky tests),
> >> that we
> >> > > are
> >> > > > > trying to resolve. Any help fixing those (esp. MESOS-3050
> >> > > > > <https://issues.apache.org/jira/browse/MESOS-3050>) would be
> >> > > > appreciated!
> >> > > > >
> >> > > > > Planning to cut an RC as soon as they are fixed (assuming no new
> >> ones
> >> > > > crop
> >> > > > > up).
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > >

Re: Recommended way to discover current master

2015-08-31 Thread Marco Massenzio
The easiest way is to query ZooKeeper directly, as you don't need to know
the list of Masters a priori; if you do know it, however, hitting any one of
them will redirect (302) to the current Leader.

If you would like to see an example of how to retrieve that info from ZK, I
have written about it here[0].
Finally, we're planning to make all this available via the Mesos Commons[1]
library (currently, there is a PR[2] waiting to be merged).


[0]
http://codetrips.com/2015/08/16/apache-mesos-leader-master-discovery-using-zookeeper-part-2/
[1] https://github.com/mesos/commons
[2] https://github.com/mesos/commons/pull/2/files
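The ZooKeeper approach boils down to listing the znodes the masters register under the election path and picking the one with the lowest sequence number, then reading the leader's address out of its JSON payload. A sketch of that selection step, given data already fetched from ZK (the znode names and the `address` field layout here follow what recent masters write, but treat both as assumptions to verify against your cluster):

```python
import json


def leading_master(znodes):
    """Given a dict of {znode_name: json_payload} for the candidate
    masters (names like 'json.info_0000000003', ephemeral-sequential),
    return the (ip, port) of the leader: lowest sequence number wins."""
    leader_name = min(znodes, key=lambda name: int(name.rsplit("_", 1)[1]))
    info = json.loads(znodes[leader_name])
    return info["address"]["ip"], info["address"]["port"]
```

A real client would also watch the election path, since the leader can change at any time.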

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Mon, Aug 31, 2015 at 10:25 AM, Philip Weaver <philip.wea...@gmail.com>
wrote:

> My framework knows the list of zookeeper hosts and the list of mesos
> master hosts.
>
> I can think of a few ways for the framework to figure out which host is
> the current master. What would be the best? Should I check in zookeeper
> directly? Does the mesos library expose an interface to discover the master
> from zookeeper or otherwise? Should I just try each possible master until
> one responds?
>
> Apologies if this is already well documented, but I wasn't able to find
> it. Thanks!
>
> - Philip
>
>


Re: Mesos-master complains about quorum being a duplicate flag on CoreOS

2015-08-31 Thread Marco Massenzio
Thanks for following up, glad we figured it out.

IMO the current behavior (and the error message) are non-intuitive and I've
filed a Jira[0] to address that.

[0] https://issues.apache.org/jira/browse/MESOS-3340
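The collision is easy to reproduce in miniature: a loader that accepts a flag both from a MESOS_-prefixed environment variable and from the command line has to decide what to do when both are present. This toy version (illustrative only, not the actual stout implementation) reports the same kind of error the master printed:

```python
def load_flags(argv, environ, prefix="MESOS_"):
    """Collect flag values from the environment (PREFIX_NAME) and from
    the command line (--name=value). A flag set in both places raises,
    mirroring the master's 'Duplicate flag' message."""
    flags = {}
    # Environment variables are loaded first...
    for key, value in environ.items():
        if key.startswith(prefix):
            flags[key[len(prefix):].lower()] = value
    # ...then command-line flags; a name already set is a duplicate.
    for arg in argv:
        if arg.startswith("--"):
            name, _, value = arg[2:].partition("=")
            if name in flags:
                raise ValueError("Duplicate flag '%s' on command line" % name)
            flags[name] = value
    return flags
```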

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Mon, Aug 31, 2015 at 1:59 AM, F21 <f21.gro...@gmail.com> wrote:

> Ah, that makes sense!
>
> I have the environment variable MESOS_QUORUM exported and it was
> conflicting with the --quorum passed to the command line.
>
> Removing the MESOS_QUORUM environment variable fixed it.
>
>
> On 31/08/2015 5:36 PM, Marco Massenzio wrote:
>
> Command line flags are parsed using stout/flags.hpp[0] and the FlagsBase
> class is derived in mesos::internal::master::Flags (see
> src/master/flags.hpp[1]).
>
> I am not sure why you are seeing that behavior on CoreOS, but I'd be
> curious to know what happens if you omit the --quorum when you start
> master: it should usually fail and complain that it's a required flag (when
> used in conjunction with --zk).  If it works, it will emit in the logs
> (towards the very beginning) all the values of the flags: what does it say
> about --quorum?
>
> Completely random question: I assume you don't already have in the
> environment a MESOS_QUORUM variable exported?
>
> If the issue persists in a "clean" OS install and a recent build, it's
> definitely a bug: it'd be great if you could please file a ticket at
> http://issues.apache.org/jira (feel free to assign to me).
>
> Thanks!
>
> [0]
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=3rdparty/libprocess/3rdparty/stout/include/stout/flags.hpp
> [1]
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/master/flags.hpp
>
> *Marco Massenzio*
>
> *Distributed Systems Engineer* http://codetrips.com
>
> On Sun, Aug 30, 2015 at 8:21 PM, F21 <f21.gro...@gmail.com> wrote:
>
>> I've gotten the mesos binaries compiled and packaged and deployed them
>> onto a CoreOS instance.
>>
>>
>> When I run the master, it complains that the quorum flag is duplicated:
>>
>> $ ./mesos-master --zk=zk://192.168.1.4/mesos --quorum=1
>> --hostname=192.168.1.4 --ip=192.168.1.4
>> Duplicate flag 'quorum' on command line
>> ...
>>
>> However, if I try and run mesos-master on Ubuntu 15.04 64-bit (where the
>> binaries were built), it seems to work properly:
>>
>> $ ./mesos-master --zk=zk://192.168.1.4/mesos --quorum=1
>> --hostname=192.168.1.4 --ip=192.168.1.4
>>
>> I0830 18:31:20.983999 2830 main.cpp:181] Build: 2015-08-30 10:11:54 by
>> I0830 18:31:20.984246 2830 main.cpp:183] Version: 0.23.0
>> I0830 18:31:20.984694 2830 main.cpp:204] Using 'HierarchicalDRF' allocator
>> --work_dir needed for replicated log based registry
>>
>> How are the command line flags parsed in mesos? What causes this strange
>> behavior on CoreOS?
>>
>>
>>
>
>


Re: Mesos-master complains about quorum being a duplicate flag on CoreOS

2015-08-31 Thread Marco Massenzio
Command line flags are parsed using stout/flags.hpp[0] and the FlagsBase
class is derived in mesos::internal::master::Flags (see
src/master/flags.hpp[1]).

I am not sure why you are seeing that behavior on CoreOS, but I'd be
curious to know what happens if you omit the --quorum when you start
master: it should usually fail and complain that it's a required flag (when
used in conjunction with --zk).  If it works, it will emit in the logs
(towards the very beginning) all the values of the flags: what does it say
about --quorum?

Completely random question: I assume you don't already have in the
environment a MESOS_QUORUM variable exported?
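To make the failure mode above concrete, here is a hedged, minimal sketch (NOT the actual stout/flags.hpp code; `loadFlags` and its signature are hypothetical) of how a loader that accepts each flag both as a `MESOS_`-prefixed environment variable and as a `--name` command-line argument ends up reporting a duplicate, which is exactly what an exported `MESOS_QUORUM` plus `--quorum` produces:

```cpp
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Hedged sketch (NOT the actual stout/flags.hpp code): a loader that, like
// Mesos, accepts each flag either as MESOS_<NAME> in the environment or as
// --<name> on the command line, and reports a duplicate when both are set.
std::string loadFlags(const std::map<std::string, std::string>& env,
                      const std::vector<std::string>& argv,
                      std::map<std::string, std::string>* values) {
  // First pass: environment variables carrying the MESOS_ prefix.
  for (const auto& entry : env) {
    if (entry.first.rfind("MESOS_", 0) == 0) {
      std::string name = entry.first.substr(6);
      for (char& c : name) c = std::tolower(static_cast<unsigned char>(c));
      (*values)[name] = entry.second;
    }
  }
  // Second pass: --name=value arguments. A name already loaded from the
  // environment counts as a duplicate -- the situation in this thread.
  for (const std::string& arg : argv) {
    if (arg.rfind("--", 0) != 0) continue;
    const size_t eq = arg.find('=');
    const std::string name = arg.substr(
        2, eq == std::string::npos ? std::string::npos : eq - 2);
    if (values->count(name) > 0) {
      return "Duplicate flag '" + name + "' on command line";
    }
    (*values)[name] = eq == std::string::npos ? "" : arg.substr(eq + 1);
  }
  return "";  // success: no duplicates
}
```

Under this (assumed) model, unsetting `MESOS_QUORUM` before starting the master makes the command-line `--quorum` the only source, which matches the fix F21 found.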

If the issue persists in a "clean" OS install and a recent build, it's
definitely a bug: it'd be great if you could please file a ticket at
http://issues.apache.org/jira (feel free to assign to me).

Thanks!

[0]
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=3rdparty/libprocess/3rdparty/stout/include/stout/flags.hpp
[1]
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=src/master/flags.hpp

*Marco Massenzio*

*Distributed Systems Engineer* http://codetrips.com

On Sun, Aug 30, 2015 at 8:21 PM, F21 <f21.gro...@gmail.com> wrote:

> I've gotten the mesos binaries compiled and packaged and deployed them
> onto a CoreOS instance.
>
>
> When I run the master, it complains that the quorum flag is duplicated:
>
> $ ./mesos-master --zk=zk://192.168.1.4/mesos --quorum=1
> --hostname=192.168.1.4 --ip=192.168.1.4
> Duplicate flag 'quorum' on command line
> ...
>
> However, if I try and run mesos-master on Ubuntu 15.04 64-bit (where the
> binaries were built), it seems to work properly:
>
> $ ./mesos-master --zk=zk://192.168.1.4/mesos --quorum=1
> --hostname=192.168.1.4 --ip=192.168.1.4
>
> I0830 18:31:20.983999 2830 main.cpp:181] Build: 2015-08-30 10:11:54 by
> I0830 18:31:20.984246 2830 main.cpp:183] Version: 0.23.0
> I0830 18:31:20.984694 2830 main.cpp:204] Using 'HierarchicalDRF' allocator
> --work_dir needed for replicated log based registry
>
> How are the command line flags parsed in mesos? What causes this strange
> behavior on CoreOS?
>
>
>


Re: Use docker start rather than docker run?

2015-08-29 Thread Marco Massenzio
Hi Paul,

+1 to what Alex/Tim say.

Maybe a (simple) example will help: a very basic framework I created
recently, does away with the Executor and only uses the Scheduler,
sending a CommandInfo structure to Mesos' Agent node to execute.

See:
https://github.com/massenz/mongo_fw/blob/develop/src/mongo_scheduler.cpp#L124

If Python is more your thing, there are examples in the Mesos repository,
or you can take a look at something I started recently to use the new
(0.24) HTTP API (NOTE - this is still very much WIP):
https://github.com/massenz/zk-mesos/blob/develop/notebooks/HTTP%20API%20Tests.ipynb

*Marco Massenzio*

*Distributed Systems Engineer* http://codetrips.com

On Fri, Aug 28, 2015 at 8:44 AM, Paul Bell arach...@gmail.com wrote:

 Alex & Tim,

 Thank you both; most helpful.

 Alex, can you dispel my confusion on this point: I keep reading that a
 framework in Mesos (e.g., Marathon) consists of a scheduler and an
 executor. This reference to executor made me think that Marathon must
 have *some* kind of presence on the slave node. But the more familiar I
 become with Mesos the less likely this seems to me. So, what does it mean
 to talk about the Marathon framework executor?

 Tim, I did come up with a simple work-around that involves re-copying the
 needed file into the container each time the application is started. For
 reasons unknown, this file is not kept in a location that would readily
 lend itself to my use of persistent storage (Docker -v). That said, I am
 keenly interested in learning how to write both custom executors &
 schedulers. Any sense for what release of Mesos will see persistent
 volumes?

 Thanks again, gents.

 -Paul



 On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen t...@mesosphere.io wrote:

 Hi Paul,

 We don't [re]start a container, since we assume that once the task has
 terminated the container is no longer reused. In Mesos, people who want
 tasks to reuse the same executor (and handle the task logic accordingly)
 opt for the custom executor route.

 We're working on a way to keep your sandbox data beyond a container
 lifecycle, which is called persistent volumes. We haven't integrated that
 with Docker containerizer yet, so you'll have to wait to use that feature.

 You could also choose to implement a custom executor for now if you like.

 Tim

 On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov a...@mesosphere.com
 wrote:

 Paul,

 that component is called DockerContainerizer and it's part of Mesos
 Agent (check
 /Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp). @Tim,
 could you answer the docker start vs. docker run question?

 On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell arach...@gmail.com wrote:

 Hi All,

 I first posted this to the Marathon list, but someone suggested I try
 it here.

 I'm still not sure what component (mesos-master, mesos-slave, marathon)
 generates the docker run command that launches containers on a slave
 node. I suppose that it's the framework executor (Marathon) on the slave
 that actually executes the docker run, but I'm not sure.

 What I'm really after is whether or not we can cause the use of docker
 start rather than docker run.

 At issue here is some persistent data inside
 /var/lib/docker/aufs/mnt/CTR_ID. docker run will by design (re)launch
 my application with a different CTR_ID effectively rendering that data
 inaccessible. But docker start will restart the container and its old
 data will still be there.

 Thanks.

 -Paul







Re: Talks at MesosCon 2015

2015-08-22 Thread Marco Massenzio
On Fri, Aug 21, 2015 at 12:07 AM, Marco Massenzio ma...@mesosphere.io
wrote:

  Great talks today, can't wait to get hands on the new APIs.

 You can ;)

 Mesos 0.25-rc1 is out for grabs and testing...


And, of course, I meant *0.24-rc1* ... this is what happens when one spends
all day building @HEAD :D

Apologies for confusion and thanks to @mpark for being eagle-eyed!


 will require building from source: not for the faint of heart, but not an
 insurmountable hurdle either.

 *Marco Massenzio*

 *Distributed Systems Engineer* http://codetrips.com

 On Thu, Aug 20, 2015 at 9:22 PM, Haripriya Ayyalasomayajula 
 aharipriy...@gmail.com wrote:

 thanks! that would be very helpful.

 Great talks today, can't wait to get hands on the new APIs.

 On Thu, Aug 20, 2015 at 10:57 AM, Chris Aniszczyk z...@twitter.com wrote:

 Yes, they will all be recorded and posted on YouTube.

 On Thu, Aug 20, 2015 at 10:30 AM, craig w codecr...@gmail.com wrote:

 An earlier post said all videos will be online a few days after the
 conference.

 On Thu, Aug 20, 2015 at 1:29 PM, Kenneth Su su.ke...@gmail.com wrote:

 Yes, agreed. I was looking for live video on YouTube today, but
 found nothing there.

 A forum would definitely help to follow up and discuss from there.

 Thanks!

 Kenneth

 On Thu, Aug 20, 2015 at 11:26 AM, Haripriya Ayyalasomayajula 
 aharipriy...@gmail.com wrote:

 Hi all,

 I'm at MesosCon 2015 today and was just curious whether all the talks /
 presentations will be captured anywhere (mesosphere blog / YouTube). It
 would be very helpful to have them recorded. There are multiple
 interesting talks scheduled at the same time, and it's not possible to
 cover them all.
 I strongly believe a forum to follow up on the talks / topics presented
 here would be helpful.

 Thanks.


 --
 Regards,
 Haripriya Ayyalasomayajula






 --

 https://github.com/mindscratch
 https://www.google.com/+CraigWickesser
 https://twitter.com/mind_scratch
 https://twitter.com/craig_links




 --
 Cheers,

 Chris Aniszczyk | Open Source | Twitter, Inc.
 @cra | +1 512 961 6719




 --
 Regards,
 Haripriya Ayyalasomayajula





Re: Talks at MesosCon 2015

2015-08-21 Thread Marco Massenzio
 Great talks today, can't wait to get hands on the new APIs.

You can ;)

Mesos 0.25-rc1 is out for grabs and testing...
will require building from source: not for the faint of heart, but not an
insurmountable hurdle either.

*Marco Massenzio*

*Distributed Systems Engineer* http://codetrips.com

On Thu, Aug 20, 2015 at 9:22 PM, Haripriya Ayyalasomayajula 
aharipriy...@gmail.com wrote:

 thanks! that would be very helpful.

 Great talks today, can't wait to get hands on the new APIs.

 On Thu, Aug 20, 2015 at 10:57 AM, Chris Aniszczyk z...@twitter.com wrote:

 Yes, they will all be recorded and posted on YouTube.

 On Thu, Aug 20, 2015 at 10:30 AM, craig w codecr...@gmail.com wrote:

 An earlier post said all videos will be online a few days after the
 conference.

 On Thu, Aug 20, 2015 at 1:29 PM, Kenneth Su su.ke...@gmail.com wrote:

 Yes, agreed. I was looking for live video on YouTube today, but
 found nothing there.

 A forum would definitely help to follow up and discuss from there.

 Thanks!

 Kenneth

 On Thu, Aug 20, 2015 at 11:26 AM, Haripriya Ayyalasomayajula 
 aharipriy...@gmail.com wrote:

 Hi all,

 I'm at MesosCon 2015 today and was just curious whether all the talks /
 presentations will be captured anywhere (mesosphere blog / YouTube). It
 would be very helpful to have them recorded. There are multiple
 interesting talks scheduled at the same time, and it's not possible to
 cover them all.
 I strongly believe a forum to follow up on the talks / topics presented
 here would be helpful.

 Thanks.


 --
 Regards,
 Haripriya Ayyalasomayajula






 --

 https://github.com/mindscratch
 https://www.google.com/+CraigWickesser
 https://twitter.com/mind_scratch
 https://twitter.com/craig_links




 --
 Cheers,

 Chris Aniszczyk | Open Source | Twitter, Inc.
 @cra | +1 512 961 6719




 --
 Regards,
 Haripriya Ayyalasomayajula




Re: Assertion `data.isNone()' failed

2015-08-18 Thread Marco Massenzio
Are you sure this is a 0.21.1 cluster? the line numbers in the logs match
the code in Mesos 0.23.0

This is, however, a genuine bug (src/launcher/fetcher.cpp#L99):

  Try<bool> available = hdfs.available();

  if (available.isError() || !available.get()) {
    return Error("Skipping fetch with Hadoop Client as "
                 "Hadoop Client not available: " + available.error());
  }

The root cause is that (probably) the HDFS client is not available on the
slave; however, we do not 'error()' but rather return a 'false' - this is
all good.
The bug is exposed in the return line, where we try to retrieve
available.error() (which we should not - it's just `false`).
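To make the failure mode concrete, here is a hedged, self-contained sketch. The `Try` class below is hypothetical (it only mimics stout's semantics, it is not the real implementation), and `hadoopStatus` only illustrates the shape of the MESOS-3287 fix: `error()` may be called on the `isError()` branch alone, because a successful `Try` carrying `false` has no error to report.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of stout's Try<T> semantics (NOT the real class):
// a Try holds either a value or an error message, and calling error() when
// it actually holds a value trips an assertion -- the same class of
// failure as the "Assertion `data.isNone()' failed" abort in the logs.
template <typename T>
class Try {
public:
  static Try some(const T& t) { return Try(t, "", false); }
  static Try failure(const std::string& m) { return Try(T(), m, true); }

  bool isError() const { return isError_; }
  const T& get() const { return value_; }
  const std::string& error() const {
    assert(isError_);  // asking a value-holding Try for its error aborts
    return message_;
  }

private:
  Try(const T& v, const std::string& m, bool e)
    : value_(v), message_(m), isError_(e) {}
  T value_;
  std::string message_;
  bool isError_;
};

// Shape of the MESOS-3287 fix, sketched: call error() only on the
// isError() branch; a successful Try carrying `false` must not be asked
// for an error message.
std::string hadoopStatus(const Try<bool>& available) {
  if (available.isError()) {
    return "Hadoop Client not available: " + available.error();
  }
  if (!available.get()) {
    return "Hadoop Client not available";
  }
  return "available";
}
```

The buggy version collapsed both branches into one condition and then unconditionally appended `available.error()`, which is what tripped the assertion.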

This was a 'latent' bug that *may* have been exposed by (my) recent
refactoring of os::shell which is used by hdfs.available() under the covers.
(this is a bit unclear, though, as that refactoring is post-0.23)

Be that as it may, I've filed
https://issues.apache.org/jira/browse/MESOS-3287: the fix is trivial and I
may be able to sneak it into 0.24 (which we're cutting now).

Thanks for reporting!

PS - bad code aside, the root cause is that the `hdfs` binary seems to be
unreachable on the slave: is it installed in the PATH of the user under
which the slave binary executes?



*Marco Massenzio*

*Distributed Systems Engineer* http://codetrips.com

On Mon, Aug 17, 2015 at 10:46 PM, Ashwanth Kumar ashwa...@indix.com wrote:

 We've a 20 node mesos cluster running mesos v0.21.1, We run marathon on
 top of this setup without any problems for ~4 months now. I'm now trying to
 get hadoop mesos https://github.com/mesos/hadoop/ integration working
 but I see the TaskTrackers that gets launched are failing with the
 following error

 I0818 05:36:35.058688 24428 fetcher.cpp:409] Fetcher Info:
 {cache_directory:\/tmp\/mesos\/fetch\/slaves\/20150706-075218-1611773194-5050-28439-S473\/hadoop,items:[{action:BYPASS_CACHE,uri:{extract:true,value:hdfs:\/\/hdfs.prod:54310\/user\/ashwanth\/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz}}],sandbox_directory:\/var\/lib\/mesos\/slaves\/20150706-075218-1611773194-5050-28439-S473\/frameworks\/20150706-075218-1611773194-5050-28439-4532\/executors\/executor_Task_Tracker_4129\/runs\/c26f52d4-4055-46fa-b999-11d73f2096dd,user:hadoop}
 I0818 05:36:35.059806 24428 fetcher.cpp:364] Fetching URI
 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
 I0818 05:36:35.059821 24428 fetcher.cpp:238] Fetching directly into the
 sandbox directory
 I0818 05:36:35.059835 24428 fetcher.cpp:176] Fetching URI
 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
 *mesos-fetcher:
 /tmp/mesos-build/mesos-repo/3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90:
 const string Try<T>::error() const [with T = bool; std::string =
 std::basic_string<char>]: Assertion `data.isNone()' failed.*
 *** Aborted at 1439876195 (unix time) try "date -d @1439876195" if you are
 using GNU date ***
 PC: @   0x343ee32635 (unknown)
 *** SIGABRT (@0x5f6c) received by PID 24428 (TID 0x7f988832f820) from PID
 24428; stack trace: ***
 @   0x343f20f710 (unknown)
 @   0x343ee32635 (unknown)
 @   0x343ee33e15 (unknown)
 @   0x343ee2b75e (unknown)
 @   0x343ee2b820 (unknown)
 @   0x408b0a Try::error()
 @   0x40cbcf download()
 @   0x4098a3 main
 @   0x343ee1ed5d (unknown)
 @   0x40aeb5 (unknown)
 Failed to synchronize with slave (it's probably exited)

 Environment
 - EC2 Machines
 - Output of lsb_release -a
 LSB Version:
  
 :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
 Distributor ID: CentOS
 Description:  CentOS release 6.5 (Final)
 Release:  6.5
 Codename: Final

 Any ideas what I'm doing wrong?

 --
 -- Ashwanth Kumar



Re: Assertion `data.isNone()' failed

2015-08-18 Thread Marco Massenzio
Hi Ashwanth,

I've pushed a fix out for review https://reviews.apache.org/r/37584/,
we'll see if it makes it in time for 0.24.

As for the version, you can quickly verify that by running `mesos-master
--version` (or just look at the very beginning of the logs, it will tell
you a bunch of stuff about version, build, etc.)

I am sorry, I don't really know enough about setting up Hadoop on Mesos to
give you any useful guidance; from a quick glance at the code, it seems to
me that, if the URI is a `hdfs://` one, the only way to retrieve the
tarball is via HDFS (so you will need the hdfs client to be available on
the Slave(s)).
If you do use an HTTP URI (http://) then it should work just fine.
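A minimal sketch of that scheme-based decision (the `fetchMethod` helper and its return values are hypothetical, not the actual fetcher code): an `hdfs://` URI can only be retrieved through the Hadoop client on the agent, while an `http(s)://` URI can be fetched directly.

```cpp
#include <string>

// Hedged sketch of the scheme check described above (hypothetical helper,
// NOT the actual fetcher code): hdfs:// needs the `hdfs` client binary on
// the agent's PATH, while http(s):// can be downloaded directly.
std::string fetchMethod(const std::string& uri, bool hdfsClientAvailable) {
  if (uri.rfind("hdfs://", 0) == 0) {
    return hdfsClientAvailable ? "hadoop-client"
                               : "error: hdfs client not in PATH";
  }
  if (uri.rfind("http://", 0) == 0 || uri.rfind("https://", 0) == 0) {
    return "direct-download";
  }
  return "copy-from-local-filesystem";
}
```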

Hopefully others will be able to chime in with a more informed view.

*Marco Massenzio*

*Distributed Systems Engineer* http://codetrips.com

On Tue, Aug 18, 2015 at 2:46 AM, Ashwanth Kumar ashwa...@indix.com wrote:

 Thanks Marco for the update.

 My understanding of the hadoop mesos framework was that the executor would
 download the hadoop distro from mapred.mesos.executor.uri and execute the
 TTs. I didn't know that to download from HDFS it needs `hdfs` binary in
 PATH. I don't have a hadoop setup on the mesos slave. Should I go ahead and
 add them?

 Regarding the line number mismatch, I installed the package through
 mesosphere not sure if that's the reason.


 On Tue, Aug 18, 2015 at 1:22 PM, Marco Massenzio ma...@mesosphere.io
 wrote:

 Are you sure this is a 0.21.1 cluster? the line numbers in the logs match
 the code in Mesos 0.23.0

 This is, however, a genuine bug (src/launcher/fetcher.cpp#L99):

   Try<bool> available = hdfs.available();

   if (available.isError() || !available.get()) {
     return Error("Skipping fetch with Hadoop Client as "
                  "Hadoop Client not available: " + available.error());
   }

 The root cause is that (probably) the HDFS client is not available on the
 slave; however, we do not 'error()' but rather return a 'false' - this is
 all good.
 The bug is exposed in the return line, where we try to retrieve
 available.error() (which we should not - it's just `false`).

 This was a 'latent' bug that *may* have been exposed by (my) recent
 refactoring of os::shell which is used by hdfs.available() under the covers.
 (this is a bit unclear, though, as that refactoring is post-0.23)

 Be that as it may, I've filed
 https://issues.apache.org/jira/browse/MESOS-3287: the fix is trivial and
 I may be able to sneak it into 0.24 (which we're cutting now).

 Thanks for reporting!

 PS - bad code aside, the root cause is that the `hdfs` binary seems to be
 unreachable on the slave: is it installed in the PATH of the user under
 which the slave binary executes?



 *Marco Massenzio*

 *Distributed Systems Engineer* http://codetrips.com

 On Mon, Aug 17, 2015 at 10:46 PM, Ashwanth Kumar ashwa...@indix.com
 wrote:

 We've a 20 node mesos cluster running mesos v0.21.1, We run marathon on
 top of this setup without any problems for ~4 months now. I'm now trying to
 get hadoop mesos https://github.com/mesos/hadoop/ integration working
 but I see the TaskTrackers that gets launched are failing with the
 following error

 I0818 05:36:35.058688 24428 fetcher.cpp:409] Fetcher Info:
 {cache_directory:\/tmp\/mesos\/fetch\/slaves\/20150706-075218-1611773194-5050-28439-S473\/hadoop,items:[{action:BYPASS_CACHE,uri:{extract:true,value:hdfs:\/\/hdfs.prod:54310\/user\/ashwanth\/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz}}],sandbox_directory:\/var\/lib\/mesos\/slaves\/20150706-075218-1611773194-5050-28439-S473\/frameworks\/20150706-075218-1611773194-5050-28439-4532\/executors\/executor_Task_Tracker_4129\/runs\/c26f52d4-4055-46fa-b999-11d73f2096dd,user:hadoop}
 I0818 05:36:35.059806 24428 fetcher.cpp:364] Fetching URI
 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
 I0818 05:36:35.059821 24428 fetcher.cpp:238] Fetching directly into the
 sandbox directory
 I0818 05:36:35.059835 24428 fetcher.cpp:176] Fetching URI
 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
 *mesos-fetcher:
 /tmp/mesos-build/mesos-repo/3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90:
 const string Try<T>::error() const [with T = bool; std::string =
 std::basic_string<char>]: Assertion `data.isNone()' failed.*
 *** Aborted at 1439876195 (unix time) try "date -d @1439876195" if you
 are using GNU date ***
 PC: @   0x343ee32635 (unknown)
 *** SIGABRT (@0x5f6c) received by PID 24428 (TID 0x7f988832f820) from
 PID 24428; stack trace: ***
 @   0x343f20f710 (unknown)
 @   0x343ee32635 (unknown)
 @   0x343ee33e15 (unknown)
 @   0x343ee2b75e (unknown)
 @   0x343ee2b820 (unknown)
 @   0x408b0a Try::error()
 @   0x40cbcf download()
 @   0x4098a3 main
 @   0x343ee1ed5d (unknown)
 @   0x40aeb5 (unknown)
 Failed to synchronize with slave (it's

Re: Assertion `data.isNone()' failed

2015-08-18 Thread Marco Massenzio
For info, the patch was committed today and made the cut to 0.24-rc1.




Thanks to @vinodkone for super-quick turnaround.



—
Sent from Mailbox

On Tue, Aug 18, 2015 at 10:45 AM, Marco Massenzio ma...@mesosphere.io
wrote:

 Hi Ashwanth,
 I've pushed a fix out for review https://reviews.apache.org/r/37584/,
 we'll see if it makes it in time for 0.24.
 As for the version, you can quickly verify that by running `mesos-master
 --version` (or just look at the very beginning of the logs, it will tell
 you a bunch of stuff about version, build, etc.)
 I am sorry, I don't really know enough about setting up Hadoop on Mesos to
 give you any useful guidance; from a quick glance at the code, it seems to
 me that, if the URI is a `hdfs://` one, the only way to retrieve the
 tarball is via HDFS (so you will need the hdfs client to be available on
 the Slave(s)).
 If you do use an HTTP URI (http://) then it should work just fine.
 Hopefully others will be able to chime in with a more informed view.
 *Marco Massenzio*
 *Distributed Systems Engineer* http://codetrips.com
 On Tue, Aug 18, 2015 at 2:46 AM, Ashwanth Kumar ashwa...@indix.com wrote:
 Thanks Marco for the update.

 My understanding of the hadoop mesos framework was that the executor would
 download the hadoop distro from mapred.mesos.executor.uri and execute the
 TTs. I didn't know that to download from HDFS it needs `hdfs` binary in
 PATH. I don't have a hadoop setup on the mesos slave. Should I go ahead and
 add them?

 Regarding the line number mismatch, I installed the package through
 mesosphere not sure if that's the reason.


 On Tue, Aug 18, 2015 at 1:22 PM, Marco Massenzio ma...@mesosphere.io
 wrote:

 Are you sure this is a 0.21.1 cluster? the line numbers in the logs match
 the code in Mesos 0.23.0

 This is, however, a genuine bug (src/launcher/fetcher.cpp#L99):

   Try<bool> available = hdfs.available();

   if (available.isError() || !available.get()) {
     return Error("Skipping fetch with Hadoop Client as "
                  "Hadoop Client not available: " + available.error());
   }

 The root cause is that (probably) the HDFS client is not available on the
 slave; however, we do not 'error()' but rather return a 'false' - this is
 all good.
 The bug is exposed in the return line, where we try to retrieve
 available.error() (which we should not - it's just `false`).

 This was a 'latent' bug that *may* have been exposed by (my) recent
 refactoring of os::shell which is used by hdfs.available() under the covers.
 (this is a bit unclear, though, as that refactoring is post-0.23)

 Be that as it may, I've filed
 https://issues.apache.org/jira/browse/MESOS-3287: the fix is trivial and
 I may be able to sneak it into 0.24 (which we're cutting now).

 Thanks for reporting!

 PS - bad code aside, the root cause is that the `hdfs` binary seems to be
 unreachable on the slave: is it installed in the PATH of the user under
 which the slave binary executes?



 *Marco Massenzio*

 *Distributed Systems Engineer* http://codetrips.com

 On Mon, Aug 17, 2015 at 10:46 PM, Ashwanth Kumar ashwa...@indix.com
 wrote:

 We've a 20 node mesos cluster running mesos v0.21.1, We run marathon on
 top of this setup without any problems for ~4 months now. I'm now trying to
 get hadoop mesos https://github.com/mesos/hadoop/ integration working
 but I see the TaskTrackers that gets launched are failing with the
 following error

 I0818 05:36:35.058688 24428 fetcher.cpp:409] Fetcher Info:
 {cache_directory:\/tmp\/mesos\/fetch\/slaves\/20150706-075218-1611773194-5050-28439-S473\/hadoop,items:[{action:BYPASS_CACHE,uri:{extract:true,value:hdfs:\/\/hdfs.prod:54310\/user\/ashwanth\/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz}}],sandbox_directory:\/var\/lib\/mesos\/slaves\/20150706-075218-1611773194-5050-28439-S473\/frameworks\/20150706-075218-1611773194-5050-28439-4532\/executors\/executor_Task_Tracker_4129\/runs\/c26f52d4-4055-46fa-b999-11d73f2096dd,user:hadoop}
 I0818 05:36:35.059806 24428 fetcher.cpp:364] Fetching URI
 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
 I0818 05:36:35.059821 24428 fetcher.cpp:238] Fetching directly into the
 sandbox directory
 I0818 05:36:35.059835 24428 fetcher.cpp:176] Fetching URI
 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
 *mesos-fetcher:
 /tmp/mesos-build/mesos-repo/3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90:
 const string Try<T>::error() const [with T = bool; std::string =
 std::basic_string<char>]: Assertion `data.isNone()' failed.*
 *** Aborted at 1439876195 (unix time) try "date -d @1439876195" if you
 are using GNU date ***
 PC: @   0x343ee32635 (unknown)
 *** SIGABRT (@0x5f6c) received by PID 24428 (TID 0x7f988832f820) from
 PID 24428; stack trace: ***
 @   0x343f20f710 (unknown)
 @   0x343ee32635 (unknown)
 @   0x343ee33e15 (unknown)
 @   0x343ee2b75e (unknown

Re: SSL in Mesos 0.23

2015-08-14 Thread Marco Massenzio
FYI - Joris is out this week; he'll probably be able to get back to you
early next week (modulo MesosCon craziness :)

*Marco Massenzio*
*Distributed Systems Engineer*

On Fri, Aug 14, 2015 at 9:14 AM, Carlos Sanchez car...@apache.org wrote:

 no suggestions?

 On Tue, Aug 11, 2015 at 6:47 PM, Vinod Kone vinodk...@apache.org wrote:
  @joris, can you help out here?
 
  On Tue, Aug 11, 2015 at 9:43 AM, Carlos Sanchez car...@apache.org
 wrote:
 
  I have tried to enable SSL with no success, even compiling from source
  with the ssl flags --enable-libevent --enable-ssl
 
  export SSL_ENABLED=true
  export SSL_SUPPORT_DOWNGRADE=false
  export SSL_REQUIRE_CERT=true
  export SSL_CERT_FILE=/etc/mesos/...
  export SSL_KEY_FILE=/etc/mesos/...
  export SSL_CA_FILE=/etc/mesos/...
 
 
  /home/ubuntu/mesos-deb-packaging/mesos-repo/build/src/mesos-master
  --work_dir=/var/lib/mesos
 
  Port 5050 is still served as plain http, no SSL
 
  Nothing about ssl shows up in the logs, any ideas?
 
  Thanks
 
 
  
   From: Dharmit Shah shahdhar...@gmail.com
   To: user@mesos.apache.org
   Cc:
   Date: Mon, 10 Aug 2015 14:13:04 +0530
   Subject: Re: SSL in Mesos 0.23
   Hi Jeff,
  
   Thanks for the suggestion.
  
   I modified the systemd service file to use
   `/etc/sysconfig/mesos-master` and `/etc/sysconfig/mesos-slave` as
   environment files for master and slave services respectively. In these
   files, I specified the environment variables that I used to specify on
   the command line.
  
   Now if I check `strings /proc/pid/environ | grep SSL` for pids of
   master and slave services, I see the environment variables that I set
   in the /etc/sysconfig/environment-file.
  
   Now that it looks like I have started the master and slave services
   with SSL enabled, how do I really confirm that communication between
   master and slaves is really happening over SSL?
  
   Also, how do I enable SSL communication for a framework like Marathon?
  
   Regards,
   Dharmit.
  
   On Fri, Aug 7, 2015 at 10:56 PM, Jeff Schroeder
   jeffschroe...@computer.org wrote:
 The sudo command defaults to env_reset (look for that in the man
 page)
which
strips all env variables sans a select few. I'd almost bet that your
SSL_*
variables are not present and were not passed to the slave. Just
 sudo
-i and
start the slaves *as root* without sudo. There is no benefit to
starting
them with sudo. You can verify what I'm saying with something along
the
lines of:
   
strings /proc/$(pidof mesos-slave)/environ | grep ^SSL_
   
   
On Friday, August 7, 2015, Dharmit Shah shahdhar...@gmail.com
 wrote:
   
Hello again,
   
Thanks for your responses. I will share what I tried after your
suggestions.
   
1. `ldd /usr/sbin/mesos-master` and `ldd /usr/sbin/mesos-slave`
returned similar output as one suggested by Craig. So, I guess, the
Mesosphere repo binaries have SSL enabled. Right?
   
2. I created SSL private key and cert on one system in my cluster
 by
referring this guide on DO [1]. Admittedly, my knowledge of SSL is
limited.
   
3. Next, I copied the key and cert to all three mesos-master nodes
and
four mesos-slave nodes. Shouldn't slave nodes be provided only with
the cert and not the private key? Whereas all master nodes may have
the private key and cert both. Or am I understanding SSL
 incorrectly
here?
   
4. After copying the cert and key, I started the mesos-master
 service
on master nodes with below command:
   
$ sudo SSL_ENABLED=true SSL_KEY_FILE=~/ssl/mesos.key
SSL_CERT_FILE=~/ssl/mesos.crt /usr/sbin/mesos-master
   
--zk=zk://172.19.10.111:2181,172.19.10.112:2181,
 172.19.10.193:2181/mesos
--port=5050 --log_dir=/var/log/mesos --acls=file:///root/acls.json
--credentials=/home/isys/mesos --quorum=2 --work_dir=/var/lib/mesos
   
I check web UI and things look good. I am not completely sure if
https should have worked for mesos web UI but, it didn't.
   
5. Next, I start slave nodes with below command:
   
  $ sudo SSL_ENABLED=true SSL_CERT_FILE=~/mesos.crt
SSL_KEY_FILE=~/mesos.key /usr/sbin/mesos-slave
   
   
--master=zk://172.19.10.111:2181,172.19.10.112:2181,
 172.19.10.193:2181/mesos
--log_dir=/var/log/mesos --containerizers=docker,mesos
--executor_registration_timeout=15mins
   
Mesos web UI reported four mesos-slave nodes in Activated mode.
 So
far so good. I am still wondering how I should verify if
communication
is happening over SSL.
   
6. To check if SSL is indeed working, I stopped one slave node and
started it without SSL using `systemctl start mesos-slave`. I was
expecting it to not get into Activated state on Mesos web UI but
 it
did. So, I think SSL is not configured properly by me.
   
I am attaching logs from the master nodes. These logs were
 generated
after starting masters with command specified in point 4.
   
Let

Re: Can't start master properly (stale state issue?); help!

2015-08-14 Thread Marco Massenzio
Thanks for the summary, Paul.

As mentioned, I'm not terribly familiar with what happens in the
'log-replicas' folder so I will not even try to comment as I don't want to
mislead you (and future readers) on red-herring chases.

I can tell you that 'zapping' the ZK data folders is essentially harmless
(as far as Mesos goes - not sure if you use them for other stuff) so long
as that happens while the Master/Agent nodes are NOT running (or you can
seriously send them into a spin), and I would strongly suggest that the
hostname/hosts files be touched *before* Mesos starts up (if you think that
was not the case, it would certainly explain the weirdness).

If you do see it again, my recommendation (or, at least, what I do when I
see Leader-related weirdness) is to use zkCli.sh and go looking into the
znode contents as I mentioned in my previous emails.

The good news is that, as of 0.24 (out probably next week) we write to ZK
in JSON, so that will be easy to parse for humans too (and non-PB-aware
code too).

The Leader is always the one with the lowest-numbered sequential node there
and you should be able to confirm that by looking at the logs too.
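That rule, the lowest-numbered sequential znode wins, can be sketched as follows. This is a hedged illustration only: the znode names used here (`info_<sequence>`) are hypothetical, and the real election is of course done by the masters themselves, not by a standalone helper.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch of the leader-selection rule above (znode names are hypothetical):
// among the ephemeral sequential znodes the masters create under the Mesos
// znode, the leader is the one carrying the lowest sequence number.
std::string electLeader(const std::vector<std::string>& znodes) {
  auto sequence = [](const std::string& z) {
    // Names look like "info_0000000003": the number follows the last '_'.
    return std::stol(z.substr(z.rfind('_') + 1));
  };
  return *std::min_element(
      znodes.begin(), znodes.end(),
      [&sequence](const std::string& a, const std::string& b) {
        return sequence(a) < sequence(b);
      });
}
```

Listing the children of the Mesos znode in zkCli.sh and applying this rule by eye is one way to cross-check what the logs report about leadership.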

Good luck with your app, it sounds fun and exciting!

*Marco Massenzio*

*Distributed Systems Engineer* http://codetrips.com

On Fri, Aug 14, 2015 at 5:53 AM, Paul Bell arach...@gmail.com wrote:

 All,

 By way of some background: I'm not running a data center (or centers).
 Rather, I work on a distributed application whose trajectory is taking it
 into a realm of many Docker containers distributed across many hosts
 (mostly virtual hosts at the outset). An environment that supports
 isolation, multi-tenancy, scalability, and some fault tolerance is
 desirable for this application. Also, the mere ability to simplify - at
 least somewhat - the management of multiple hosts is of great importance.
 So, that's more or less how I got to Mesos and to here...

 I ended up writing a Java program that configures a collection of host VMs
 as a Mesos cluster and then, via Marathon, distributes the application
 containers across the cluster. Configuring & building the cluster is
 largely a lot of SSH work. Doing the same for the application is part
 Marathon, part Docker remote API. The containers that need to talk to each
 other via TCP are connected with Weave's (http://weave.works) overlay
 network. So the main infrastructure consists of Mesos, Docker, and Weave.
 The whole thing is pretty amazing - for which I take very little credit.
 Rather, these are some wonderful technologies, and the folks who write &
 support them are very helpful. That said, I sometimes feel like I'm
 juggling chain saws!

 *In re* the issues raised on this thread:

 All Mesos components were installed via the Mesosphere packages. The 4 VMs
 in the cluster are all running Ubuntu 14.04 LTS.

 My suspicions about the IP@ 127.0.1.1 were raised a few months ago when,
 after seeing this IP in a mesos-master log when things weren't working, I
 discovered these articles:


 https://groups.google.com/forum/#!topic/marathon-framework/1qboeZTOLU4

 http://frankhinek.com/build-mesos-multi-node-ha-cluster/ (see note 2)


 So, to the point raised just now by Klaus (and earlier in the thread), the
 aforementioned configuration program does change /etc/hosts (and
 /etc/hostname) in the way Klaus suggested. But, as I mentioned to Marco &
 haosdent, I might have encountered a race condition wherein ZK &
 mesos-master saw the unchanged /etc/hosts before I altered it. I believe
 I fixed that issue yesterday.

 Also, as part of the cluster create step, I get a bit aggressive
 (perhaps unwisely) with what I believe are some state repositories.
 Specifically, I

 rm /var/lib/zookeeper/version-2/*
 rm -Rf /var/lib/mesos/replicated_log

 Should I NOT be doing this? I know from experience that zapping the
 version-2 directory (ZK's dataDir, IIRC) can solve occasional
 weirdness. Marco, is /var/lib/mesos/replicated_log what you are referring
 to when you say "some issue with the log-replica"?

 Just a day or two ago I first heard the term znode & learned a little
 about zkCli.sh. I will experiment with it more in the coming days.

 As matters now stand, I have the cluster up and running. But before I
 again deploy the application, I am trying to put the cluster through its
 paces by periodically cycling it through the states my program can bring
 about, e.g.,

 --cluster create (takes a clean VM and configures it to act as one
 or more Mesos components: ZK, master, slave)
 --cluster stop      (stops the Mesos services on each node)
 --cluster destroy   (configures the VM back to its original clean
 state)
 --cluster create
 --cluster stop
 --cluster start


 et cetera.

 *The only way I got rid of the no leading master issue that started

Re: Can't start master properly (stale state issue?); help!

2015-08-13 Thread Marco Massenzio
To be really sure about the possible root cause, I'd need to know how you
installed Mesos on your server, if it's via Mesosphere packages, the
configuration is described here:
https://open.mesosphere.com/reference/packages/

I am almost[0] sure the behavior you are seeing has something to do with how the
server resolves the hostname to an IP for your Master - unless you give an
explicit IP address to bind to (--ip) libprocess will look up the hostname,
reverse-DNS it, and resolve to an IP address: if that fails, it falls back
to localhost.
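A quick way to check whether this is the failure mode on your box (a diagnostic
sketch in Python; libprocess itself does this in C++, and the localhost fallback
shown is an assumption about its behavior):

```python
import socket

# Roughly mimic what libprocess does when no --ip is given: take the
# machine's hostname and resolve it to an address. On a stock Debian/Ubuntu
# /etc/hosts this often yields 127.0.1.1 - exactly the bogus address
# showing up in the logs above.
hostname = socket.gethostname()
try:
    resolved = socket.gethostbyname(hostname)
except socket.gaierror:
    resolved = "127.0.0.1"  # assumed fallback, mirroring the localhost case

print(hostname, "->", resolved)
if resolved.startswith("127."):
    print("WARNING: hostname resolves to loopback; pass --ip or fix /etc/hosts")
```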

If you want to try a quick hack, you can run `cat /etc/hostname` on that
server, and add a line in /etc/hosts that resolves that name to the actual
IP address (71.100.14.9, in your logs).

The other possibility is that it's really a 'stale state' in ZK - you can
either drop the znode (whichever you used for the --zk path) or launch with
a different one.

Finally, if you have the option, try running the master without using
`service start`, by SSH'ing into the server and doing something like:

/path/to/install/bin/mesos-master.sh --quorum=1 --work_dir=/tmp/mesos
--zk=zk://ZK-IP:ZK-PORT/mesos/test --ip=71.100.14.9

and see whether that works.

If none of the above helps, please let us know what you see and we'll keep
debugging it :)

BTW - the "new leading master" is a bit of a logging decoy; it's not
actually new per se - so I'm almost[0] sure the leader never changed.

[0] almost as this line confuses me:
I0813 10:19:46.601297  2612 network.hpp:466] ZooKeeper group PIDs: {
log-replica(1)@127.0.1.1:5050, log-replica(1)@71.100.14.9:5050 }
(but that's because of my lack of deep understanding of how the
log-replicas work)

*Marco Massenzio*
*Distributed Systems Engineer*

On Thu, Aug 13, 2015 at 7:37 AM, Paul Bell arach...@gmail.com wrote:

 Hi All,

 I hope someone can shed some light on this because I'm getting desperate!

 I try to start components zk, mesos-master, and marathon in that order.
 They are started via a program that SSHs to the sole host and does
 "service xxx start". Everyone starts happily enough. But the Mesos UI
 shows me:

 *This master is not the leader, redirecting in 0 seconds ... go now*

 The pattern seen in all of the mesos-master.INFO logs (one of which shown
 below) is that the mesos-master with the correct IP@ starts. But then a
 new leader is detected and becomes leading master. This new leader shows
 UPID (UPID=master@127.0.1.1:5050).

 I've tried clearing what ZK and mesos-master state I can find, but this
 problem will not go away.

 Would someone be so kind as to a) explain what is happening here and b)
 suggest remedies?

 Thanks very much.

 -Paul


 Log file created at: 2015/08/13 10:19:43
 Running on machine: 71.100.14.9
 Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
 I0813 10:19:43.225636  2542 logging.cpp:172] INFO level logging started!
 I0813 10:19:43.235213  2542 main.cpp:181] Build: 2015-05-05 06:15:50 by
 root
 I0813 10:19:43.235244  2542 main.cpp:183] Version: 0.22.1
 I0813 10:19:43.235257  2542 main.cpp:186] Git tag: 0.22.1
 I0813 10:19:43.235268  2542 main.cpp:190] Git SHA:
 d6309f92a7f9af3ab61a878403e3d9c284ea87e0
 I0813 10:19:43.245098  2542 leveldb.cpp:176] Opened db in 9.386828ms
 I0813 10:19:43.247138  2542 leveldb.cpp:183] Compacted db in 1.956669ms
 I0813 10:19:43.247194  2542 leveldb.cpp:198] Created db iterator in 13961ns
 I0813 10:19:43.247206  2542 leveldb.cpp:204] Seeked to beginning of db in
 677ns
 I0813 10:19:43.247215  2542 leveldb.cpp:273] Iterated through 0 keys in
 the db in 243ns
 I0813 10:19:43.247252  2542 replica.cpp:744] Replica recovered with log
 positions 0 - 0 with 1 holes and 0 unlearned
 I0813 10:19:43.248755  2611 log.cpp:238] Attempting to join replica to
 ZooKeeper group
 I0813 10:19:43.248924  2542 main.cpp:306] Starting Mesos master
 I0813 10:19:43.249244  2612 recover.cpp:449] Starting replica recovery
 I0813 10:19:43.250239  2612 recover.cpp:475] Replica is in EMPTY status
 I0813 10:19:43.250819  2612 replica.cpp:641] Replica in EMPTY status
 received a broadcasted recover request
 I0813 10:19:43.251014  2607 recover.cpp:195] Received a recover response
 from a replica in EMPTY status
 *I0813 10:19:43.249503  2542 master.cpp:349] Master
 20150813-101943-151938119-5050-2542 (71.100.14.9) started on
  71.100.14.9:5050*
 I0813 10:19:43.252053  2610 recover.cpp:566] Updating replica status to
 STARTING
 I0813 10:19:43.252571  2542 master.cpp:397] Master allowing
 unauthenticated frameworks to register
 I0813 10:19:43.253159  2542 master.cpp:402] Master allowing
 unauthenticated slaves to register
 I0813 10:19:43.254276  2612 leveldb.cpp:306] Persisting metadata (8 bytes)
 to leveldb took 1.816161ms
 I0813 10:19:43.254323  2612 replica.cpp:323] Persisted replica status to
 STARTING
 I0813 10:19:43.254905  2612 recover.cpp:475] Replica is in STARTING status
 I0813 10:19:43.255203  2612 replica.cpp:641] Replica in STARTING status

Re: Can't start master properly (stale state issue?); help!

2015-08-13 Thread Marco Massenzio
On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell arach...@gmail.com wrote:

 Marco  hasodent,

 This is just a quick note to say thank you for your replies.

 No problem, you're welcome.


 I will answer you much more fully tomorrow, but for now can only manage a
 few quick observations  questions:

 1. Having some months ago encountered a known problem with the IP@
 127.0.1.1 (I'll provide references tomorrow), I early on configured
 /etc/hosts, replacing myHostName 127.0.1.1 with myHostName Real_IP.
 That said, I can't rule out a race condition whereby ZK | mesos-master saw
 the original unchanged /etc/hosts before I zapped it.

 2. What is a znode and how would I drop it?

 so, the znode is the fancy name that ZK gives to the nodes in its tree
(trivially, the path) - assuming that you give Mesos the following ZK URL:
zk://10.10.0.5:2181/mesos/prod

the 'znode' would be `/mesos/prod` and you could go inspect it (using
zkCli.sh) by doing:
 ls /mesos/prod

you should see at least one (with the Master running) file: info_001 or
json.info_0001 (depending on whether you're running 0.23 or 0.24) and
you could then inspect its contents with:
 get /mesos/prod/info_001

For example, if I run a Mesos 0.23 on my localhost, against ZK on the same:

$ ./bin/mesos-master.sh --zk=zk://localhost:2181/mesos/test --quorum=1
--work_dir=/tmp/m23-2 --port=5053
I can connect to ZK via zkCli.sh and:

[zk: localhost:2181(CONNECTED) 4] ls /mesos/test
[info_06, log_replicas]
[zk: localhost:2181(CONNECTED) 6] get /mesos/test/info_06
#20150813-120952-18983104-5053-14072ц 'master@192.168.33.1:5053
* 192.168.33.120.23.0

cZxid = 0x314
dataLength = 93
 // a bunch of other metadata
numChildren = 0

(you can remove it with `rmr /mesos/test` at the zkCli.sh prompt - stop
Mesos first, or it will be a very unhappy Master :).
in the corresponding logs I see (note the new leader here too, even
though this was the one and only):

I0813 12:09:52.126509 105455616 group.cpp:656] Trying to get
'/mesos/test/info_06' in ZooKeeper
W0813 12:09:52.127071 107065344 detector.cpp:444] Leading master
master@192.168.33.1:5053 is using a Protobuf binary format when registering
with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see
MESOS-2340)
I0813 12:09:52.127094 107065344 detector.cpp:481] A new leading master
(UPID=master@192.168.33.1:5053) is detected
I0813 12:09:52.127187 103845888 master.cpp:1481] The newly elected leader
is master@192.168.33.1:5053 with id 20150813-120952-18983104-5053-14072
I0813 12:09:52.127209 103845888 master.cpp:1494] Elected as the leading
master!


At this point, I'm almost sure you're running up against some issue with
the log-replica; but I'm the least competent guy here to help you on that
one, hopefully someone else will be able to add insight here.

I start the services (zk, master, marathon; all on same host) by SSHing
  into the host & doing "service xxx start" commands.

 Again, thanks very much; and more tomorrow.

 Cordially,

 Paul

 On Thu, Aug 13, 2015 at 1:08 PM, haosdent haosd...@gmail.com wrote:

  Hello, how do you start the master? And could you try `netstat
  -antp | grep 5050` to check whether multiple master processes are running
  on the same machine?

 On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell arach...@gmail.com wrote:

 Hi All,

 I hope someone can shed some light on this because I'm getting desperate!

 I try to start components zk, mesos-master, and marathon in that order.
  They are started via a program that SSHs to the sole host and does
  "service xxx start". Everyone starts happily enough. But the Mesos UI
  shows me:

 *This master is not the leader, redirecting in 0 seconds ... go now*

 The pattern seen in all of the mesos-master.INFO logs (one of which
 shown below) is that the mesos-master with the correct IP@ starts. But
 then a new leader is detected and becomes leading master. This new leader
  shows UPID (UPID=master@127.0.1.1:5050).

 I've tried clearing what ZK and mesos-master state I can find, but this
 problem will not go away.

 Would someone be so kind as to a) explain what is happening here and b)
 suggest remedies?

 Thanks very much.

 -Paul


 Log file created at: 2015/08/13 10:19:43
 Running on machine: 71.100.14.9
 Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
 I0813 10:19:43.225636  2542 logging.cpp:172] INFO level logging started!
 I0813 10:19:43.235213  2542 main.cpp:181] Build: 2015-05-05 06:15:50 by
 root
 I0813 10:19:43.235244  2542 main.cpp:183] Version: 0.22.1
 I0813 10:19:43.235257  2542 main.cpp:186] Git tag: 0.22.1
 I0813 10:19:43.235268  2542 main.cpp:190] Git SHA:
 d6309f92a7f9af3ab61a878403e3d9c284ea87e0
 I0813 10:19:43.245098  2542 leveldb.cpp:176] Opened db in 9.386828ms
 I0813 10:19:43.247138  2542 leveldb.cpp:183] Compacted db in 1.956669ms
 I0813 10:19:43.247194  2542 leveldb.cpp:198] Created db iterator in
 13961ns
 I0813 10:19:43.247206  2542 

Re: Mesos slave help

2015-08-07 Thread Marco Massenzio
Hi Stephen,

You can see all the launch flags here:
http://mesos.apache.org/documentation/latest/configuration/
(or just running .../mesos-slave.sh --help)

If you launch it via systemd (which is actually how we run it ourselves in
DCOS) you will have to configure your nodes (master/agents) via the MESOS_*
environment variables.
In production, obviously, you want to use ZooKeeper as the discovery /
coordination method (as you correctly did here): you can obviously use
whatever you like as the znode path there, but it must be the same for all
masters/agents.

Make sure, if you run a test/dev configuration with multiple
masters/agents on the same node, to (a) configure each master on its own
port (--port) and (b) make each node point to a different work_dir (or
you'll get confusing errors around log-replicas).
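For example, a minimal agent configuration via environment variables could look
like this (the file path /etc/default/mesos-slave and all values are
assumptions based on the Mesosphere packaging conventions; check the
configuration docs linked above):

```shell
# /etc/default/mesos-slave -- sourced by the init wrapper and exported
# as MESOS_* environment variables (each --flag maps to MESOS_FLAG).
MESOS_MASTER=zk://10.10.0.5:2181/mesos   # same znode path on all nodes
MESOS_PORT=5051                          # unique port per agent on one box
MESOS_WORK_DIR=/var/lib/mesos/agent1     # unique work_dir per agent
```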

(@haosdent: I'm *almost* sure the packaging is correct, but needs the env
vars to be configured properly)

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Thu, Aug 6, 2015 at 4:12 AM, Stephen Knight skni...@pivotal.io wrote:

 Ok, that's working if I run it like this: /usr/sbin/mesos-slave
 --master=zk://172.31.x.x:2181/mesos > /dev/null 2>&1

 Thanks for your help, really appreciate it.

 On Thu, Aug 6, 2015 at 3:03 PM, haosdent haosd...@gmail.com wrote:

  Hm, you need to pass your master location, for example:

 /usr/sbin/mesos-slave --master=x.x.x.x:5050

  if you use ZooKeeper, use a format like:

 /usr/sbin/mesos-slave --master=zk://host1:port1,host2:port2,.../path

 On Thu, Aug 6, 2015 at 6:55 PM, Stephen Knight skni...@pivotal.io
 wrote:

 My system doesn't support cat with systemctl for some reason but here is
 the contents of /usr/lib/systemd/system/mesos-slave.service

 [Unit]

 Description=Mesos Slave

 After=network.target

 Wants=network.target


 [Service]

 ExecStart=/usr/bin/mesos-init-wrapper slave

 KillMode=process

 Restart=always

 RestartSec=20

 LimitNOFILE=16384

 CPUAccounting=true

 MemoryAccounting=true


 [Install]

 WantedBy=multi-user.target


 What are the required flags to start it manually?

 On Thu, Aug 6, 2015 at 2:51 PM, haosdent haosd...@gmail.com wrote:

 Or you could try systemctl cat mesos-slave.service and show us the
 file content.

 On Thu, Aug 6, 2015 at 6:49 PM, haosdent haosd...@gmail.com wrote:

  From this message, I think systemd ran mesos-slave with incorrect flags,
  and the status output is just the slave's help message. Could you try to
  start mesos-slave manually, not through systemctl?

 On Thu, Aug 6, 2015 at 6:41 PM, Stephen Knight skni...@pivotal.io
 wrote:

  systemctl gives me the following output on CentOS. The command I ran to
  start it was systemctl start mesos-slave.service

 [root@ip-172-31-35-167 mesos]# systemctl status mesos-slave.service
 -l

 mesos-slave.service - Mesos Slave

Loaded: loaded (/usr/lib/systemd/system/mesos-slave.service;
 enabled)

   Drop-In: /etc/systemd/system/mesos-slave.service.d

└─mesos-slave-containerizers.conf

Active: activating (auto-restart) (Result: exit-code) since Thu
 2015-08-06 10:38:08 UTC; 2s ago

   Process: 1472 ExecStart=/usr/bin/mesos-init-wrapper slave 
 *(code=exited,
 status=1/FAILURE)*

  Main PID: 1472 (code=exited, status=1/FAILURE)


 Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: *If
 strict=false, any expected errors (e.g., slave cannot recover*

 Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: 
 *information
 about an executor, because the slave died right before*

 Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: *the
 executor registered.) during recovery are ignored and as much*

 Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: *state
 as possible is recovered.*

 Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: 
 *(default:
 true)*

 Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: 
 *--[no-]switch_user
   Whether to run tasks as the user who*

 Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: 
 *submitted
 them rather than the user running*

 Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: *the
 slave (requires setuid permission) (default: true)*

 Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: 
 *--[no-]version
   Show version and exit. (default: 
 false)*

 Aug 06 10:38:08 ip-172-31-35-167.ec2.internal mesos-slave[1483]: 
 *--work_dir=VALUE
 Directory path to place framework work
 directories*



 I've also run strace against it, nothing sticks out:


 strace systemctl start mesos-slave.service

 execve(/bin/systemctl, [systemctl, start,
 mesos-slave.service], [/* 18 vars */]) = 0

 brk(0)  = 0x7f5c2af9f000

 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
 0) = 0x7f5c2a5c6000

 access

Re: Metering for Mesos

2015-08-07 Thread Marco Massenzio
Hi Sam,

Mesos (both Master and Agents) publish a wealth of metrics that can be used
for metering, diagnostic, fault discovery/prediction and, I presume,
accounting and billing too (that very much depends on what pricing model
you guys use).

As an example, you may want to take a look at https://github.com/nqn/nibbler
.

Hope this helps.

*Marco Massenzio*
*Distributed Systems Engineer*

On Thu, Aug 6, 2015 at 9:48 PM, Sam Chen usultra...@gmail.com wrote:

  Haosdent,
  Let me bring one example to the table. We are using Mesos and Marathon,
  and have deployed a two-tier application (the web tier is Tomcat, the
  database layer is MySQL).
  We are unsure how to charge for this service, so we are wondering whether
  Mesos or Marathon can provide a metering service for us to reference.
  Hope it's clear :)

 Sam

 On Fri, Aug 7, 2015 at 12:33 PM, haosdent haosd...@gmail.com wrote:

  You mean metering by resource? You could get every task's resource usage
  by sending an HTTP request to state.json.
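As a sketch of what that aggregation could look like once you have fetched
state.json (the payload structure here is an assumption based on Mesos ~0.23;
verify the field names against your version):

```python
def task_resources(state):
    """Sum per-framework task resource usage from a Mesos /state.json
    payload. Returns {framework_name: {"cpus": ..., "mem": ...}}."""
    usage = {}
    for fw in state.get("frameworks", []):
        for task in fw.get("tasks", []):
            r = task.get("resources", {})
            u = usage.setdefault(fw["name"], {"cpus": 0.0, "mem": 0.0})
            u["cpus"] += r.get("cpus", 0.0)
            u["mem"] += r.get("mem", 0.0)
    return usage

# Tiny hand-made example payload (real ones come from the master endpoint).
state = {"frameworks": [{"name": "marathon", "tasks": [
    {"resources": {"cpus": 0.5, "mem": 256}},
    {"resources": {"cpus": 1.0, "mem": 512}}]}]}
print(task_resources(state))  # {'marathon': {'cpus': 1.5, 'mem': 768.0}}
```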

 On Fri, Aug 7, 2015 at 12:23 PM, Sam Chen usultra...@gmail.com wrote:

  Guys,
  We are planning to use Mesos as a production platform on top of
  OpenStack. My question is: is there any solution for metering, and then
  billing? We want to bring our platform online with a pay-as-you-go model.
  Does anyone have any suggestions? Much appreciated.


 Sam




 --
 Best Regards,
 Haosdent Huang





Re: Get List of Active Slaves

2015-08-04 Thread Marco Massenzio
Now that Mesos (0.24, to be released soon) publishes the Master info to
ZooKeeper in JSON, it should be (relatively) easier to get the info about
the leading master directly from there (or even set a Watcher on the znode
to be alerted of leadership changes).
Not as easy as hitting an HTTP endpoint, granted, but that's just a hard
problem to solve anyway.
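For illustration, once the znode's contents are fetched, extracting the
leader's address is mostly JSON parsing (the field names below are assumptions
based on the MasterInfo protobuf; verify against your Mesos version):

```python
import json

def parse_master_info(raw):
    """Parse the JSON-formatted MasterInfo blob that Mesos >= 0.24 writes
    to its ZooKeeper znode and return the leader's host:port."""
    info = json.loads(raw)
    return "%s:%s" % (info["hostname"], info["port"])

# Hand-made example blob; a real one would come from zk.get() on the znode.
raw = '{"hostname": "71.100.14.9", "port": 5050, "id": "20150813-101943"}'
print(parse_master_info(raw))  # 71.100.14.9:5050
```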

I'm planning to provide sample code and a blog entry about this soon as I
have time, but it won't be before this weekend at the earliest (and more
likely the next one).

*Marco Massenzio*
*Distributed Systems Engineer*

On Tue, Aug 4, 2015 at 5:04 PM, Steven Schlansker sschlans...@opentable.com
 wrote:

 Unfortunately that sort of solution is also prone to races.
 I do not think this is really possible (at least not even remotely
 elegantly) to solve externally to Mesos itself.

 On Aug 4, 2015, at 4:49 PM, James DeFelice james.defel...@gmail.com
 wrote:

   If you're using mesos-dns I think you can query slave.mesos to get an A
  record for each. I believe it responds to SRV requests too.
 
  On Aug 4, 2015 7:29 PM, Steven Schlansker sschlans...@opentable.com
 wrote:
  Unfortunately this is racey.  If you redirect to a master just as it is
 removed from leadership, you can still get bogus data, with no indication
 anything went wrong.  Some people are reporting that this breaks tools that
 generate HTTP proxy configurations.
 
  I filed this issue a while ago as
 https://issues.apache.org/jira/browse/MESOS-1865
 
  On Aug 4, 2015, at 3:49 PM, Vinod Kone vinodk...@gmail.com wrote:
 
   Not today, no.
  
   But, you could either hit the /redirect endpoint on any master that
 should redirect you to the leading master.
  
   On Tue, Aug 4, 2015 at 3:29 PM, Nastooh Avessta (navesta) 
 nave...@cisco.com wrote:
    I see. Nope, and pointing to the leading master shows the proper
  result :) Thanks.
   
    Is there a REST equivalent to mesos-resolve, so that one can ascertain
  who is the leader without having to point to the leader?
  
   Cheers,
  
  
  
  
   Nastooh Avessta
   ENGINEER.SOFTWARE ENGINEERING
   nave...@cisco.com
   Phone: +1 604 647 1527
  
   Cisco Systems Limited
   595 Burrard Street, Suite 2123 Three Bentall Centre, PO Box 49121
   VANCOUVER
   BRITISH COLUMBIA
   V7X 1J1
   CA
   Cisco.com
  
  
  
    Think before you print.
  
   This email may contain confidential and privileged material for the
 sole use of the intended recipient. Any review, use, distribution or
 disclosure by others is strictly prohibited. If you are not the intended
 recipient (or authorized to receive for the recipient), please contact the
 sender by reply email and delete all copies of this message.
  
   For corporate legal information go to:
   http://www.cisco.com/web/about/doing_business/legal/cri/index.html
  
   Cisco Systems Canada Co, 181 Bay St., Suite 3400, Toronto, ON, Canada,
 M5J 2T3. Phone: 416-306-7000; Fax: 416-306-7099. Preferences -
 Unsubscribe – Privacy
  
  
  
   From: Vinod Kone [mailto:vinodk...@gmail.com]
   Sent: Tuesday, August 04, 2015 3:19 PM
   To: user@mesos.apache.org
   Subject: Re: Get List of Active Slaves
  
  
  
   Is that the leading master?
  
  
  
   On Tue, Aug 4, 2015 at 3:09 PM, Nastooh Avessta (navesta) 
 nave...@cisco.com wrote:
  
   Hi
  
   Trying to get the list of active slaves, via cli, e.g. curl
 http://10.4.50.80:5050/master/slaves | python -m json.tool and am not
 getting the expected results.  The returned value is empty:
  
   {
  
   slaves: []
  
   }
  
   , whereas, looking at web gui I can see that there are deployed
 slaves. Am I missing something?
  
   Cheers,
  
  
  
  
  
  
  
  
  
 




Re: How to measure the ZooKeeper Resilience on mesos cluster

2015-08-03 Thread Marco Massenzio
Distributed systems are hard - but most importantly, they all differ in
various ways.

  I feel the zookeeper is almost unstable for a cluster.

this is too a general and vague statement to be either true or false (or
provide any guidance): it all depends on how you deploy your ensemble, what
hardware it runs on, what virtualization layer you use, and how you manage
failovers and recovery.

But, way more importantly, it all depends on *your* requirements: a
configuration that works perfectly fine for a few hundred nodes,
distributed across 2-3 DCs in a geographically contained region (eg,
North America) would be woefully inadequate for a system running across 6
global DCs, covering several thousand of nodes, with tight latency
requirements.

Outside of Google (where we would use our own stuff - Borg, Chubby &
friends) I've never really had any trouble with ZK - then again, maybe the
stuff I worked on, was nowhere near as complex as what you're trying to
achieve.

My suggestion would be to try it out on a staging environment, conduct some
performance and stress test, and find out whether the performance,
stability and availability of the ZK ensemble (and, consequently, of the
Mesos cluster) meet your requirements.

Hope this helps.

*Marco Massenzio*
*Distributed Systems Engineer*

On Sun, Aug 2, 2015 at 10:15 AM, tommy xiao xia...@gmail.com wrote:

 today I was reading "ZooKeeper Resilience at Pinterest" (
 https://engineering.pinterest.com/blog/zookeeper-resilience-pinterest?route=/post/%3Aid/%3Asummary),
  I feel the zookeeper is almost unstable for a cluster.

 Does anyone have some experience with the zookeeper usage?

 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com



Re: [RESULT] [VOTE] Release Apache Mesos 0.23.0 (rc4)

2015-07-23 Thread Marco Massenzio
Great news, indeed!

Thanks, Adam, for all the hard work in driving this release to fruition,
you're a star!

*Marco Massenzio*
*Distributed Systems Engineer*

On Wed, Jul 22, 2015 at 9:29 PM, Adam Bordelon a...@mesosphere.io wrote:

 Good news, everyone!

 The vote for Mesos 0.23.0 (rc4) has passed with the following votes.

 +1 (Binding)
 --
 *** Vinod Kone
 *** Adam B
 *** Benjamin Hindman
 *** Timothy Chen

 +1 (Non-binding)
 --
 *** Vaibhav Khanduja
 *** Marco Massenzio

 There were no 0 or -1 votes.

 Known issue: `sudo make check` may fail on some OSes. These tests have
 been fixed in 0.24.0 without any changes to the rest of the code.

 Please find the release at:
 https://dist.apache.org/repos/dist/release/mesos/0.23.0

 It is recommended to use a mirror to download the release:
 http://www.apache.org/dyn/closer.cgi

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0

 The mesos-0.23.0.jar has been released to:
 https://repository.apache.org

 The website (http://mesos.apache.org) will be updated shortly to reflect
 this release.

 Thanks,
 -Adam-



 On Wed, Jul 22, 2015 at 1:20 PM, Timothy Chen tnac...@gmail.com wrote:

 +1

  The docker bridge network test failed because of some iptables rules that
  were set in the environment. I will comment on the JIRA, but it's not a
  blocker.

 Tim


  On Jul 22, 2015, at 1:07 PM, Benjamin Hindman 
 benjamin.hind...@gmail.com wrote:
 
  +1 (binding)
 
  On Ubuntu 14.04:
 
  $ make check
  ... all tests pass ...
  $ sudo make check
  ... tests with known issues fail, but ignoring because these have all
 been
  resolved and are issues with the tests alone ...
 
  Thanks Adam.
 
  On Fri, Jul 17, 2015 at 4:42 PM Adam Bordelon a...@mesosphere.io
 wrote:
 
  Hello Mesos community,
 
  Please vote on releasing the following candidate as Apache Mesos
 0.23.0.
 
  0.23.0 includes the following:
 
 
 
  - Per-container network isolation
  - Dockerized slaves will properly recover Docker containers upon
 failover.
  - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+.
 
  as well as experimental support for:
  - Fetcher Caching
  - Revocable Resources
  - SSL encryption
  - Persistent Volumes
  - Dynamic Reservations
 
  The CHANGELOG for the release is available at:
 
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc4
 
 
 
 
  The candidate for Mesos 0.23.0 release is available at:
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz
 
  The tag to be voted on is 0.23.0-rc4:
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc4
 
  The MD5 checksum of the tarball can be found at:
 
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz.md5
 
  The signature of the tarball can be found at:
 
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz.asc
 
  The PGP key used to sign the release is here:
  https://dist.apache.org/repos/dist/release/mesos/KEYS
 
  The JAR is up in Maven in a staging repository here:
  https://repository.apache.org/content/repositories/orgapachemesos-1062
 
  Please vote on releasing this package as Apache Mesos 0.23.0!
 
  The vote is open until Wed July 22nd, 17:00 PDT 2015 and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Mesos 0.23.0 (I've tested it!)
  [ ] -1 Do not release this package because ...
 
  Thanks,
  -Adam-
 
 





Re: [VOTE] Release Apache Mesos 0.23.0 (rc4)

2015-07-22 Thread Marco Massenzio
+1

Run all tests on Ubuntu 14.04 (physical box, not a VM).
All tests pass (as regular user).

`sudo make distcheck` still fails with the following errors; I am assuming
these are known issues and not deemed to be blockers?

[  FAILED  ] 9 tests, listed below:
[  FAILED  ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
[  FAILED  ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where
TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess
[  FAILED  ]
MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PerfRollForward
[  FAILED  ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf
[  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
[  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
[  FAILED  ] NsTest.ROOT_setns
[  FAILED  ] PerfTest.ROOT_Events
[  FAILED  ] PerfTest.ROOT_SamplePid

I tried to check out 0.22.1 and run the same tests, but it has several
failures and it complains about already existing cgroups hierarchies; so
I'm assuming the earlier test run left the system in an unclean state.


*Marco Massenzio*
*Distributed Systems Engineer*

On Tue, Jul 21, 2015 at 3:12 PM, Adam Bordelon a...@mesosphere.io wrote:

 +1 (binding) to Mesos 0.23.0-rc4 as 0.23.0
 As I mentioned before, for rc3, basic integration tests passed for Mesos 0
 .23.0 on CoreOS with DCOS GUI/CLI, Marathon, Chronos, Spark, HDFS,
 Cassandra, and Kafka.

 We have been tracking the Ubuntu `sudo make check` failures in
 https://issues.apache.org/jira/browse/MESOS-3079 and related CentOS ROOT_
 test failures in https://issues.apache.org/jira/browse/MESOS-3050 (some
 fixes already pulled into rc4).
 After pulling down the latest master, including a series of test code only
 fixes for MESOS-3079, `sudo make check` passed for me on Ubuntu 14.04,
 excluding only ROOT_DOCKER_Launch_Executor_Bridged (segfault tracked in
 MESOS-3123). There are at least two remaining test-only fixes tracked in
 MESOS-3079, but none of these are critical for Mesos 0.23.0, so I'm not
 inclined to call for a rc5. We can call out the ROOT_ test failures as a
 known issue with the release.

 Anybody else have any test results?

 Please vote,
 -Adam-

 On Fri, Jul 17, 2015 at 8:18 PM, Marco Massenzio ma...@mesosphere.io
 wrote:

 I am almost sure (more like hoping) I'm missing something fundamental
 here and/or there is some basic configuration missing on my box.
 Running tests as root, causes a significant number of failures.

 Has anyone else *ever* run tests as root in the last few weeks?

 Here's the headline, the full log of the failed tests attached.

 $ lsb_release -a
 LSB Version:
  
 core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:cxx-3.0-amd64:cxx-3.0-noarch:cxx-3.1-amd64:cxx-3.1-noarch:cxx-3.2-amd64:cxx-3.2-noarch:cxx-4.0-amd64:cxx-4.0-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-3.1-amd64:desktop-3.1-noarch:desktop-3.2-amd64:desktop-3.2-noarch:desktop-4.0-amd64:desktop-4.0-noarch:desktop-4.1-amd64:desktop-4.1-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.0-amd64:graphics-3.0-noarch:graphics-3.1-amd64:graphics-3.1-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch:graphics-4.1-amd64:graphics-4.1-noarch:languages-3.2-amd64:languages-3.2-noarch:languages-4.0-amd64:languages-4.0-noarch:languages-4.1-amd64:languages-4.1-noarch:multimedia-3.2-amd64:multimedia-3.2-noarch:multimedia-4.0-amd64:multimedia-4.0-noarch:multimedia-4.1-amd64:multimedia-4.1-noarch:printing-3.2-amd64:printing-3.2-noarch:printing-4.0-amd64:printing-4.0-noarch:printing-4.1-amd64:printing-4.1-noarch:qt4-3.1-amd64:qt4-3.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch
 Distributor ID: Ubuntu
 Description:*Ubuntu 14.04.2 LTS*
 Release:14.04
 Codename:   *trusty*

 $ sudo make -j12 V=0 check

 [==] 712 tests from 116 test cases ran. (318672 ms total)
 [  PASSED  ] 676 tests.
 [  FAILED  ] 36 tests, listed below:
 [  FAILED  ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
 [  FAILED  ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where
 TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess
 [  FAILED  ] SlaveRecoveryTest/0.RecoverSlaveState, where TypeParam =
 mesos::internal::slave::MesosContainerizer
 [  FAILED  ] SlaveRecoveryTest/0.RecoverStatusUpdateManager, where
 TypeParam = mesos::internal::slave::MesosContainerizer
 [  FAILED  ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam =
 mesos::internal::slave::MesosContainerizer
 [  FAILED  ] SlaveRecoveryTest/0.RecoverUnregisteredExecutor, where
 TypeParam = mesos::internal::slave::MesosContainerizer
 [  FAILED  ] SlaveRecoveryTest/0.RecoverTerminatedExecutor, where
 TypeParam = mesos::internal::slave::MesosContainerizer
 [  FAILED  ] SlaveRecoveryTest/0.RecoverCompletedExecutor, where
 TypeParam = mesos::internal::slave::MesosContainerizer

Re: Cluster of Workstations type design for a Mesos cluster

2015-07-21 Thread Marco Massenzio
You're not crazy :)

This will work just fine: the Master takes up very little CPU/RAM, and since
you plan to have it on your desktop, you could even wrap it with a
send-notify script so that, should it fail, you get an alert.
I'm not sure why you want to segment out the Agent Nodes and isolate them
from (outbound) web connectivity. The one thing to bear in mind is that you
won't be able to install packages directly (apt-get), so anything you want
to run on them will have to be installed via binary installers and the
like. Then again, you may just install the smallest-footprint OS (CoreOS
springs to mind) and maximize the resources available for tasks.

Keep us posted on how you progress, I may eventually go down the same path
:)



*Marco Massenzio*
*Distributed Systems Engineer*

On Tue, Jul 21, 2015 at 6:44 AM, Gaston, Dan dan.gas...@nshealth.ca wrote:

  Are there likely to be any issues with the Master? Given it would be an
 active desktop, it would be running all of the typical Mesos master stuff,
 plus, say, an active Ubuntu desktop environment. It would also need to host
 things like a local Docker registry and the like as well, since the compute
 nodes wouldn’t have direct access to the wider internet.



 *From:* jeffschr...@gmail.com [mailto:jeffschr...@gmail.com] *On Behalf
 Of *Jeff Schroeder
 *Sent:* Tuesday, July 21, 2015 10:42 AM
 *To:* user@mesos.apache.org
 *Subject:* Re: Cluster of Workstations type design for a Mesos cluster



  As far as mesos is concerned, compute is a commodity. This should work
 just fine. Put Aurora or Marathon on top of mesos if you need a general
 purpose scheduler and you're good to go. The nice thing is that you can add
 additional slaves as you need them. I believe homogeneous clusters are best
 if possible, but they are absolutely not a requirement of any sort.

 On Tuesday, July 21, 2015, Gaston, Dan dan.gas...@nshealth.ca wrote:

 Let’s say I had 2 high-performance workstations kicking around (dual
 6-core, 2.4GHz, xeon processors; 128 GB RAM each; etc) and a smaller
 workstation (single Xeon 4-core, 3.5GHz and 16 GB RAM) available and I
 wanted to cluster them together with Mesos. What is the best way of doing
 this? My thought was that the smaller workstation would be at my desk (the
 other two would be in the same office) because it would be used for
 development work and some general tasks but would also be the master node
 of the mesos cluster (note that HA isn’t a requirement here). This
 workstation would have two NICs, one connected to our institutional network
 and the other making up the private network between the clusters.



 Is this even doable? Normally you would have some sort of client
 submitting to the Master but in this case the Master node would be serving
 up multiple roles. The other workstations would probably not have access to
 the institutional network, so all software updates and the like would have
 to be piped through the master workstation. There would also be a
 relatively large NAS device connected into this network as well.



 Thoughts and suggestions welcome, even if it is to tell me I’m crazy. I’m
 building a small scale compute “cluster” that is fairly limited by budget
 (and the needs aren’t high either) and it may not be able to be located in
 a datacenter, hence the cluster of workstations type setup.








 Dan Gaston, PhD

 Clinical Laboratory Bioinformatician

 Department of Pathology and Laboratory Medicine

 Division of Hematopathology

 Rm 511, 5788 University Ave.

 Halifax, NS B3H 1V8







 --
 Text by Jeff, typos by iPhone



Re: [VOTE] Release Apache Mesos 0.23.0 (rc4)

2015-07-17 Thread Marco Massenzio
Ubuntu 14.04

Not sure if I'm doing something wrong: `sudo make distcheck` fails -
re-running now after a `make clean`.

If it continues failing, I'll provide more detailed log output.
In the meantime, if anyone has any suggestions as to what I may be doing
wrong, please let me know.

$ ../configure && make -j8 V=0 && make -j12 V=0 check

[==] 649 tests from 94 test cases ran. (254152 ms total)
[  PASSED  ] 649 tests.

$ sudo make -j12 V=0 distcheck

[==] 712 tests from 116 test cases ran. (325751 ms total)
[  PASSED  ] 702 tests.
[  FAILED  ] 10 tests, listed below:
[  FAILED  ] LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids
[  FAILED  ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
[  FAILED  ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where
TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess
[  FAILED  ]
MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PerfRollForward
[  FAILED  ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf
[  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
[  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
[  FAILED  ] NsTest.ROOT_setns
[  FAILED  ] PerfTest.ROOT_Events
[  FAILED  ] PerfTest.ROOT_SamplePid

10 FAILED TESTS
  YOU HAVE 12 DISABLED TESTS


*Marco Massenzio*
*Distributed Systems Engineer*

On Fri, Jul 17, 2015 at 6:49 PM, Vinod Kone vinodk...@gmail.com wrote:

 +1 (binding)

 Successfully built RPMs for CentOS5 and CentOS6 with network isolator.


 On Fri, Jul 17, 2015 at 4:56 PM, Khanduja, Vaibhav 
 vaibhav.khand...@emc.com
  wrote:

  +1
 
  Sent from my iPhone. Please excuse the typos and brevity of this message.
 
   On Jul 17, 2015, at 4:43 PM, Adam Bordelon a...@mesosphere.io wrote:
  
   Hello Mesos community,
  
   Please vote on releasing the following candidate as Apache Mesos
 0.23.0.
  
   0.23.0 includes the following:
  
 
 
   - Per-container network isolation
   - Dockerized slaves will properly recover Docker containers upon
  failover.
   - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+.
  
   as well as experimental support for:
   - Fetcher Caching
   - Revocable Resources
   - SSL encryption
   - Persistent Volumes
   - Dynamic Reservations
  
   The CHANGELOG for the release is available at:
  
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc4
  
 
 
  
   The candidate for Mesos 0.23.0 release is available at:
  
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz
  
   The tag to be voted on is 0.23.0-rc4:
  
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc4
  
   The MD5 checksum of the tarball can be found at:
  
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz.md5
  
   The signature of the tarball can be found at:
  
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz.asc
  
   The PGP key used to sign the release is here:
   https://dist.apache.org/repos/dist/release/mesos/KEYS
  
   The JAR is up in Maven in a staging repository here:
   https://repository.apache.org/content/repositories/orgapachemesos-1062
  
   Please vote on releasing this package as Apache Mesos 0.23.0!
  
   The vote is open until Wed July 22nd, 17:00 PDT 2015 and passes if a
   majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Mesos 0.23.0 (I've tested it!)
   [ ] -1 Do not release this package because ...
  
   Thanks,
   -Adam-
 



Re: [VOTE] Release Apache Mesos 0.23.0 (rc4)

2015-07-17 Thread Marco Massenzio
[  FAILED  ]
MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
[  FAILED  ]
MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
[  FAILED  ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf
[  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
[  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
[  FAILED  ] NsTest.ROOT_setns
[  FAILED  ] PerfTest.ROOT_Events
[  FAILED  ] PerfTest.ROOT_SamplePid

36 FAILED TESTS
  YOU HAVE 12 DISABLED TESTS



*Marco Massenzio*
*Distributed Systems Engineer*

On Fri, Jul 17, 2015 at 7:26 PM, Marco Massenzio ma...@mesosphere.io
wrote:

 Ubuntu 14.04

 Not sure if I'm doing something wrong, `sudo make distcheck` fails -
 re-running after a `make clean`

 If it continues failing, I'll provide more detailed log output.
 In the meantime, if anyone has any suggestions as to what I may be doing
 wrong, please let me know.

 $ ../configure && make -j8 V=0 && make -j12 V=0 check

 [==] 649 tests from 94 test cases ran. (254152 ms total)
 [  PASSED  ] 649 tests.

 $ sudo make -j12 V=0 distcheck

 [==] 712 tests from 116 test cases ran. (325751 ms total)
 [  PASSED  ] 702 tests.
 [  FAILED  ] 10 tests, listed below:
 [  FAILED  ] LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids
 [  FAILED  ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
 [  FAILED  ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where
 TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess
 [  FAILED  ]
 MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PerfRollForward
 [  FAILED  ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf
 [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
 [  FAILED  ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery
 [  FAILED  ] NsTest.ROOT_setns
 [  FAILED  ] PerfTest.ROOT_Events
 [  FAILED  ] PerfTest.ROOT_SamplePid

 10 FAILED TESTS
   YOU HAVE 12 DISABLED TESTS


 *Marco Massenzio*
 *Distributed Systems Engineer*

 On Fri, Jul 17, 2015 at 6:49 PM, Vinod Kone vinodk...@gmail.com wrote:

 +1 (binding)

 Successfully built RPMs for CentOS5 and CentOS6 with network isolator.


 On Fri, Jul 17, 2015 at 4:56 PM, Khanduja, Vaibhav 
 vaibhav.khand...@emc.com
  wrote:

  +1
 
  Sent from my iPhone. Please excuse the typos and brevity of this
 message.
 
   On Jul 17, 2015, at 4:43 PM, Adam Bordelon a...@mesosphere.io
 wrote:
  
   Hello Mesos community,
  
   Please vote on releasing the following candidate as Apache Mesos
 0.23.0.
  
   0.23.0 includes the following:
  
 
 
   - Per-container network isolation
   - Dockerized slaves will properly recover Docker containers upon
  failover.
   - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+.
  
   as well as experimental support for:
   - Fetcher Caching
   - Revocable Resources
   - SSL encryption
   - Persistent Volumes
   - Dynamic Reservations
  
   The CHANGELOG for the release is available at:
  
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc4
  
 
 
  
   The candidate for Mesos 0.23.0 release is available at:
  
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz
  
   The tag to be voted on is 0.23.0-rc4:
  
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc4
  
   The MD5 checksum of the tarball can be found at:
  
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz.md5
  
   The signature of the tarball can be found at:
  
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc4/mesos-0.23.0.tar.gz.asc
  
   The PGP key used to sign the release is here:
   https://dist.apache.org/repos/dist/release/mesos/KEYS
  
   The JAR is up in Maven in a staging repository here:
  
 https://repository.apache.org/content/repositories/orgapachemesos-1062
  
   Please vote on releasing this package as Apache Mesos 0.23.0!
  
   The vote is open until Wed July 22nd, 17:00 PDT 2015 and passes if a
   majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Mesos 0.23.0 (I've tested it!)
   [ ] -1 Do not release this package because ...
  
   Thanks,
   -Adam-
 






$ sudo make -j12 V=0 check

[==] 712 tests from 116 test cases ran. (318672 ms total)
[  PASSED  ] 676 tests.
[  FAILED  ] 36 tests, listed below:
[  FAILED  ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample
[  FAILED  ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess
[  FAILED  ] SlaveRecoveryTest/0.RecoverSlaveState, where TypeParam = mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.RecoverStatusUpdateManager, where TypeParam = mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer
[  FAILED  ] SlaveRecoveryTest/0

Re: [VOTE] Release Apache Mesos 0.23.0 (rc3)

2015-07-16 Thread Marco Massenzio
Just to add my +1




Built & ran `make check` on Ubuntu 14.04

With & without SSL / libevent

(no 'sudo' - can test all 4 variants this evening on rc4)



—
Sent from Mailbox

On Thu, Jul 16, 2015 at 3:10 PM, Timothy Chen tnac...@gmail.com wrote:

 As Adam mentioned, I also think this is not a blocker, as it only affects
 the way we test cgroups on CentOS 7.x due to a CentOS bug and
 doesn't actually impact normal Mesos operations.
 My vote is +1 as well.
 Tim
 On Thu, Jul 16, 2015 at 12:10 PM, Vinod Kone vinodk...@gmail.com wrote:
 Found a bug in HTTP API related code: MESOS-3055
 https://issues.apache.org/jira/browse/MESOS-3055

 If we don't fix this in 0.23.0, we cannot expect the 0.24.0 scheduler
 driver (that will send Calls) to properly subscribe with a 0.23.0 master. I
 could add a work around in the driver to only send Calls if the master
 version is 0.24.0, but would prefer to not have to do that.

 Also, on the review https://reviews.apache.org/r/36518/ for that bug, we
 realized that we might want to make Subscribe.force 'optional' instead of
 'required'. That's an API change, which would be nice to go into 0.23.0 as
 well.

 So, not a -1 per se, but if you are willing to cut another RC, I can land
 the fixes today. Sorry for the trouble.

 On Thu, Jul 16, 2015 at 11:48 AM, Adam Bordelon a...@mesosphere.io wrote:

 +1 (binding)
 This vote has been silent for almost a week. I assume everybody's busy
 testing. My testing results: basic integration tests passed for Mesos
 0.23.0 on CoreOS with DCOS GUI/CLI, Marathon, Chronos, Spark, HDFS,
 Cassandra, and Kafka.

 `make check` passes on Ubuntu and CentOS, but `sudo make check` fails on
 CentOS 7.1 due to errors in CentOS. See
 https://issues.apache.org/jira/browse/MESOS-3050 for more details. I'm not
 convinced this is serious enough to do another release candidate and voting
 round, but I'll let Tim and others chime in with their thoughts.

 If we don't get enough deciding votes by 6pm Pacific today, I'll extend the
 vote for another day.

 On Thu, Jul 9, 2015 at 6:09 PM, Khanduja, Vaibhav 
 vaibhav.khand...@emc.com
 wrote:

  +1
 
  Sent from my iPhone. Please excuse the typos and brevity of this message.
 
   On Jul 9, 2015, at 6:07 PM, Adam Bordelon a...@mesosphere.io wrote:
  
   Hello Mesos community,
  
   Please vote on releasing the following candidate as Apache Mesos
 0.23.0.
  
   0.23.0 includes the following:
  
 
 
   - Per-container network isolation
   - Dockerized slaves will properly recover Docker containers upon
  failover.
   - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+.
  
   as well as experimental support for:
   - Fetcher Caching
   - Revocable Resources
   - SSL encryption
   - Persistent Volumes
   - Dynamic Reservations
  
   The CHANGELOG for the release is available at:
  
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc3
  
 
 
  
   The candidate for Mesos 0.23.0 release is available at:
  
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc3/mesos-0.23.0.tar.gz
  
   The tag to be voted on is 0.23.0-rc3:
  
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc3
  
   The MD5 checksum of the tarball can be found at:
  
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc3/mesos-0.23.0.tar.gz.md5
  
   The signature of the tarball can be found at:
  
 
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc3/mesos-0.23.0.tar.gz.asc
  
   The PGP key used to sign the release is here:
   https://dist.apache.org/repos/dist/release/mesos/KEYS
  
   The JAR is up in Maven in a staging repository here:
   https://repository.apache.org/content/repositories/orgapachemesos-1060
  
   Please vote on releasing this package as Apache Mesos 0.23.0!
  
   The vote is open until Thurs July 16th, 18:00 PDT 2015 and passes if a
   majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Mesos 0.23.0
   [ ] -1 Do not release this package because ...
  
   Thanks,
   -Adam-
 


Re: [VOTE] Release Apache Mesos 0.23.0 (rc3)

2015-07-16 Thread Marco Massenzio
Adam - thanks.
Please let me know as soon as you push an rc4; if I'm still home, I can test
it against Ubuntu 14.04 with/without SSL, with/without sudo (or I can
always VPN in :)

Very minor doc update: https://reviews.apache.org/r/36532/
(feel free to ignore).

Thanks, everyone!

*Marco Massenzio*
*Distributed Systems Engineer*

On Thu, Jul 16, 2015 at 8:05 PM, Adam Bordelon a...@mesosphere.io wrote:

 Thanks, Vinod. I've got those commits in the list already. We'll pull in
 fixes for MESOS-3055 and others for rc4.
 I'll give it another night for Bernd to commit the fetcher fix and for
 Niklas to update the oversubscription doc.
 Then I'll cut rc4 tomorrow and leave the new vote open until next
 Wednesday.
 See the dashboard for status on remaining issues:
 https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12326227

 Jeff, see my cherry-pick spreadsheet to see what we're planning to pull
 into rc4:

 https://docs.google.com/spreadsheets/d/14yUtwfU0mGQ7x7UcjfzZg2o1TuRMkn5SvJvetARM7JQ/edit#gid=0
 If anybody has any other high priority fixes or doc updates that they want
 in rc4, let me know asap.

 On Thu, Jul 16, 2015 at 7:58 PM, Jeff Schroeder 
 jeffschroe...@computer.org wrote:

 What about MESOS-3055 in 0.23? Is that going to get passed up on even if
 we are going to cut another rc?

 On Thursday, July 16, 2015, Vinod Kone vinodk...@gmail.com wrote:

 -1 so that we can cherry pick MESOS-3055.

 The master crash bug is MESOS-3070
 https://issues.apache.org/jira/browse/MESOS-3070 but the fix is
 non-trivial and the bug has been in the code base since prior to 0.23.0. So I won't
 make it a blocker.

 Can't update the spreadsheet. So here are the commits I would like
 cherry-picked.

 fc85cc512b7767fc2e3921b15cf6602c0c68593e
 bfe6c07b79550bb3d1f2ab6f5344d740e6eb6f60

 Thanks Adam.

 On Thu, Jul 16, 2015 at 7:39 PM, Adam Bordelon a...@mesosphere.io
 wrote:

 The 7 day voting period has ended with only 2 binding +1s (we needed 3)
 and
 no explicit -1s.
 However, Vinod says they've found a bug that crashes master when a
 framework uses duplicate task ids.
 Vinod, can you please share the new JIRA and officially vote -1 for rc3
 if
 you want to call for an rc4?
 Assuming we'll cut an rc4, I'm tracking the JIRAs/patches to pull in
 here:


 https://docs.google.com/spreadsheets/d/14yUtwfU0mGQ7x7UcjfzZg2o1TuRMkn5SvJvetARM7JQ/edit#gid=0
 Since the rc4 changes are minor (mostly tests) and we've heavily tested
 rc3, the next vote will only last for 3 (business) days.


 On Thu, Jul 16, 2015 at 6:38 PM, Marco Massenzio ma...@mesosphere.io
 wrote:

  Just to add my +1
 
  Built & ran `make check` on Ubuntu 14.04
  With & without SSL / libevent
  (no 'sudo' - can test all 4 variants this evening on rc4)
 
  —
  Sent from Mailbox https://www.dropbox.com/mailbox
 
 
  On Thu, Jul 16, 2015 at 3:10 PM, Timothy Chen tnac...@gmail.com
 wrote:
 
  As Adam mention I also think this is not a blocker, as it only
 affects
  the way we test the cgroup on CentOS 7.x due to a CentOS bug and
  doesn't actually impact Mesos normal operations.
 
  My vote is +1 as well.
 
  Tim
 
  On Thu, Jul 16, 2015 at 12:10 PM, Vinod Kone vinodk...@gmail.com
  wrote:
   Found a bug in HTTP API related code: MESOS-3055
   https://issues.apache.org/jira/browse/MESOS-3055
  
   If we don't fix this in 0.23.0, we cannot expect the 0.24.0
 scheduler
   driver (that will send Calls) to properly subscribe with a 0.23.0
  master. I
   could add a work around in the driver to only send Calls if the
 master
   version is 0.24.0, but would prefer to not have to do that.
  
   Also, on the review https://reviews.apache.org/r/36518/ for that
  bug, we
   realized that we might want to make Subscribe.force 'optional'
 instead
  of
   'required'. That's an API change, which would be nice to go into
 0.23.0
  as
   well.
  
   So, not a -1 per se, but if you are willing to cut another RC, I
 can
  land
   the fixes today. Sorry for the trouble.
  
   On Thu, Jul 16, 2015 at 11:48 AM, Adam Bordelon 
 a...@mesosphere.io
  wrote:
  
   +1 (binding)
   This vote has been silent for almost a week. I assume everybody's
 busy
   testing. My testing results: basic integration tests passed for
 Mesos
   0.23.0 on CoreOS with DCOS GUI/CLI, Marathon, Chronos, Spark,
 HDFS,
   Cassandra, and Kafka.
  
   `make check` passes on Ubuntu and CentOS, but `sudo make check`
 fails
  on
   CentOS 7.1 due to errors in CentOS. See
   https://issues.apache.org/jira/browse/MESOS-3050 for more
 details.
  I'm not
   convinced this is serious enough to do another release candidate
 and
  voting
   round, but I'll let Tim and others chime in with their thoughts.
  
   If we don't get enough deciding votes by 6pm Pacific today, I'll
  extend the
   vote for another day.
  
   On Thu, Jul 9, 2015 at 6:09 PM, Khanduja, Vaibhav 
   vaibhav.khand...@emc.com
   wrote:
  
+1
   
Sent from my iPhone. Please excuse the typos and brevity of this
  message.
   
 On Jul 9, 2015, at 6:07 PM

Re: [VOTE] Release Apache Mesos 0.23.0 (rc2)

2015-07-09 Thread Marco Massenzio
This seems to be somewhat related to protobuf 2.4 vs. 2.5 (which Mesos uses) -
and possibly, indirectly, to Python 2.6 vs. 2.7 (a wild guess here).

The problem with Python is that it's always difficult to figure out where
it goes looking for imports (unless you have a virtualenv and/or munge
sys.path) so it may well be that it finds `mesos.interface` from the main
system site-packages folder (where you may have an old version of the
protobuf libraries) instead of the correct (for 2.5.0) place (under our
build/3rdparty/... folders).

As in the other instance, a log dump of sys.path just before the import
*may* shed some light (or add to the confusion).

IMO we should require Python == 2.7 (no idea if we can support Python 3, my
guess is we can't, because of this
https://github.com/google/protobuf/issues/9), but that's probably another
story.
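A quick way to see which copy of a module Python actually picks up is to dump `sys.path` and the loaded module's `__file__` just before the failing import. This is a generic diagnostic sketch; it uses the stdlib `json` module as a stand-in, since `mesos.interface` (or `google.protobuf`) may not be importable on the machine being debugged:

```python
import sys

# Show the directories Python searches, in order; an old system-wide
# site-packages entry appearing before the build tree would explain a
# stale protobuf being imported.
for entry in sys.path:
    print(entry)

# module.__file__ reveals which file was actually loaded. With Mesos you
# would inspect google.protobuf or mesos.interface instead of json.
import json
print("loaded from:", json.__file__)
```

Comparing the printed path against the expected build/3rdparty location tells you immediately whether the wrong protobuf is shadowing the bundled one.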

*Marco Massenzio*
*Distributed Systems Engineer*

On Thu, Jul 9, 2015 at 3:21 PM, Ian Downes idow...@twitter.com wrote:

 The ExamplesTest.PythonFramework test fails differently for me on CentOS5
 with python 2.6.6. I presume we don't require 2.7?

 [idownes@hostname build]$ MESOS_VERBOSE=1 ./bin/mesos-tests.sh
 --gtest_filter=ExamplesTest.PythonFramework
 Source directory: /home/idownes/workspace/mesos
 Build directory: /home/idownes/workspace/mesos/build
 -
 We cannot run any cgroups tests that require mounting
 hierarchies because you have the following hierarchies mounted:
 /sys/fs/cgroup/cpu, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/freezer,
 /sys/fs/cgroup/memory, /sys/fs/cgroup/perf_event
 We'll disable the CgroupsNoHierarchyTest test fixture for now.
 -
 -
 We cannot run any Docker tests because:
 Failed to get docker version: Failed to execute 'docker --version': exited
 with status 127
 -
 /usr/bin/nc
 Note: Google Test filter = trimmed
 [==] Running 1 test from 1 test case.
 [--] Global test environment set-up.
 [--] 1 test from ExamplesTest
 [ RUN  ] ExamplesTest.PythonFramework
 Using temporary directory '/tmp/ExamplesTest_PythonFramework_igPnUB'
 Traceback (most recent call last):
   File
 /home/idownes/workspace/mesos/build/../src/examples/python/test_framework.py,
 line 24, in module
 from mesos.interface import mesos_pb2
   File build/bdist.linux-x86_64/egg/mesos/interface/mesos_pb2.py, line
 4, in module
 ImportError: cannot import name enum_type_wrapper
 ../../src/tests/script.cpp:83: Failure
 Failed
 python_framework_test.sh exited with status 1
 [  FAILED  ] ExamplesTest.PythonFramework (136 ms)
 [--] 1 test from ExamplesTest (136 ms total)

 [--] Global test environment tear-down
 [==] 1 test from 1 test case ran. (169 ms total)
 [  PASSED  ] 0 tests.
 [  FAILED  ] 1 test, listed below:
 [  FAILED  ] ExamplesTest.PythonFramework

  1 FAILED TEST
   YOU HAVE 10 DISABLED TESTS

 [idownes@hostname build]$ python --version
 Python 2.6.6



 On Thu, Jul 9, 2015 at 2:53 PM, Vinod Kone vinodk...@gmail.com wrote:

 I'm assuming the 50 min Jeff mentioned was when doing a 'make check' on a
 fresh copy of mesos source code. The majority of that time should be due to
 compilation of source and test code (both of which will be sped up by -j);
 a sequential run of the test suite should be within 10 min IIRC.

 On Thu, Jul 9, 2015 at 2:40 PM, Marco Massenzio ma...@mesosphere.io
 wrote:

 @Vinod: unfortunately, the tests must be run sequentially, so (at least,
 as far as I can tell) there's virtually no speedup in 'make check' by using
 the -j switch.
 As someone else pointed out, it would be grand if we could have a 'test
 compilation' step (which can be run in parallel and speeds up) distinct
 from a 'run tests' step (which must run sequentially).

 *Marco Massenzio*
 *Distributed Systems Engineer*

 On Thu, Jul 9, 2015 at 2:28 PM, Vinod Kone vinodk...@gmail.com wrote:

 As a tangent, you can speed up the build by doing make -j#threads
 check.

 On Thu, Jul 9, 2015 at 1:35 PM, Jeff Schroeder 
 jeffschroe...@computer.org wrote:

 I'm unable to replicate the same failure on another up to date RHEL
 7.1 machine for some strange reason. Even blowing away the checkout, doing
 a fresh clone, and waiting ~50 minutes for make check to finish, it still
 pops. However on my laptop, this test passes fine. Let's chalk this one up
 to works on my *other* machine.

 =
 jschroeder@omniscience:~/git/mesos (master)$ bin/mesos-tests.sh
 --gtest_filter=ExamplesTest.PythonFramework --verbose
 Source directory: /home/jschroeder/git/mesos
 Build directory: /home/jschroeder/git/mesos
 -
 We cannot run any cgroups tests that require mounting
 hierarchies because you have the following hierarchies mounted:
 /sys/fs/cgroup

Re: [VOTE] Release Apache Mesos 0.23.0 (rc2)

2015-07-09 Thread Marco Massenzio
@Vinod: unfortunately, the tests must be run sequentially, so (at least, as
far as I can tell) there's virtually no speedup in 'make check' by using
the -j switch.
As someone else pointed out, it would be grand if we could have a 'test
compilation' step (which can run in parallel and be sped up) distinct
from a 'run tests' step (which must run sequentially).
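In practice that split can be approximated by hand. This is a sketch only: it assumes the automake tree exposes a `mesos-tests` build target and relies on the `bin/mesos-tests.sh` wrapper seen elsewhere in this thread.

```shell
# Compile the test code with parallel jobs (the part -j can speed up)...
make -j12 mesos-tests

# ...then run the suite sequentially (the slow, serialized part).
./bin/mesos-tests.sh
```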

*Marco Massenzio*
*Distributed Systems Engineer*

On Thu, Jul 9, 2015 at 2:28 PM, Vinod Kone vinodk...@gmail.com wrote:

 As a tangent, you can speed up the build by doing make -j#threads
 check.

 On Thu, Jul 9, 2015 at 1:35 PM, Jeff Schroeder jeffschroe...@computer.org
  wrote:

 I'm unable to replicate the same failure on another up to date RHEL 7.1
 machine for some strange reason. Even blowing away the checkout, doing a
 fresh clone, and waiting ~50 minutes for make check to finish, it still
  pops. However on my laptop, this test passes fine. Let's chalk this one up
 to works on my *other* machine.

 =
 jschroeder@omniscience:~/git/mesos (master)$ bin/mesos-tests.sh
 --gtest_filter=ExamplesTest.PythonFramework --verbose
 Source directory: /home/jschroeder/git/mesos
 Build directory: /home/jschroeder/git/mesos
 -
 We cannot run any cgroups tests that require mounting
 hierarchies because you have the following hierarchies mounted:
 /sys/fs/cgroup/blkio, /sys/fs/cgroup/cpu,cpuacct, /sys/fs/cgroup/cpuset,
 /sys/fs/cgroup/devices, /sys/fs/cgroup/freezer, /sys/fs/cgroup/hugetlb,
 /sys/fs/cgroup/memory, /sys/fs/cgroup/net_cls, /sys/fs/cgroup/perf_event,
 /sys/fs/cgroup/systemd
 We'll disable the CgroupsNoHierarchyTest test fixture for now.
 -
 /usr/bin/nc
 Note: Google Test filter =
 ExamplesTest.PythonFramework-DockerContainerizerTest.ROOT_DOCKER_Launch_Executor:DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged:DockerContainerizerTest.ROOT_DOCKER_Launch:DockerContainerizerTest.ROOT_DOCKER_Kill:DockerContainerizerTest.ROOT_DOCKER_Usage:DockerContainerizerTest.ROOT_DOCKER_Update:DockerContainerizerTest.ROOT_DOCKER_Recover:DockerContainerizerTest.ROOT_DOCKER_SkipRecoverNonDocker:DockerContainerizerTest.ROOT_DOCKER_Logs:DockerContainerizerTest.ROOT_DOCKER_Default_CMD:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Override:DockerContainerizerTest.ROOT_DOCKER_Default_CMD_Args:DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer:DockerContainerizerTest.DISABLED_ROOT_DOCKER_SlaveRecoveryExecutorContainer:DockerContainerizerTest.ROOT_DOCKER_NC_PortMapping:DockerContainerizerTest.ROOT_DOCKER_LaunchSandboxWithColon:DockerContainerizerTest.ROOT_DOCKER_DestroyWhileFetching:DockerContainerizerTest.ROOT_DOCKER_DestroyWhilePulling:DockerContainerizerTest.ROOT_DOCKER_ExecutorCleanupWhenLaunchFailed:DockerContainerizerTest.ROOT_DOCKER_FetchFailure:DockerContainerizerTest.ROOT_DOCKER_DockerPullFailure:DockerContainerizerTest.ROOT_DOCKER_DockerInspectDiscard:DockerTest.ROOT_DOCKER_interface:DockerTest.ROOT_DOCKER_CheckCommandWithShell:DockerTest.ROOT_DOCKER_CheckPortResource:DockerTest.ROOT_DOCKER_CancelPull:DockerTest.ROOT_DOCKER_MountRelative:DockerTest.ROOT_DOCKER_MountAbsolute:CpuIsolatorTest/1.UserCpuUsage:CpuIsolatorTest/1.SystemCpuUsage:RevocableCpuIsolatorTest.ROOT_CGROUPS_RevocableCpu:LimitedCpuIsolatorTest.ROOT_CGROUPS_Cfs:LimitedCpuIsolatorTest.ROOT_CGROUPS_Cfs_Big_Quota:LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids:MemIsolatorTest/0.MemUsage:MemIsolatorTest/1.MemUsage:MemIsolatorTest/2.MemUsage:PerfEventIsolatorTest.ROOT_CGROUPS_Sample:SharedFilesystemIsolatorTest.ROOT_RelativeVolume:SharedFilesystemIsolatorTest.ROOT_AbsoluteVolume:NamespacesPidIsolatorTest.ROOT_PidNamespace:UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup:Use
rCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup:UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup:MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PerfRollForward:MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward:MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward:SlaveTest.ROOT_RunTaskWithCommandInfoWithoutUser:SlaveTest.DISABLED_ROOT_RunTaskWithCommandInfoWithUser:ContainerizerTest.ROOT_CGROUPS_BalloonFramework:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Enabled:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Subsystems:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Mounted:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Get:CgroupsAnyHierarchyTest.ROOT_CGROUPS_NestedCgroups:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Tasks:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Read:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Write:CgroupsAnyHierarchyTest.ROOT_CGROUPS_Cfs_Big_Quota:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Busy:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_SubsystemsHierarchy:CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FindCgroupSubsystems:CgroupsAnyHierarchyWithCpuMemoryTes

Re: Multi-masters

2015-07-07 Thread Marco Massenzio
(I'm sure I'm missing something here, so please forgive if I'm stating the
obvious)

This is actually very well supported right now: you can use slave
attributes (if, e.g., you want to name the various clusters differently and
launch tasks according to those criteria) that would be passed on to the
Frameworks along with the resource offers: the frameworks could then decide
whether to accept the offer and launch tasks based on whatever logic you
want to implement.

You could use something like --attributes="cluster:01z99;os:ubuntu-14-04;jdk:8"
or whatever makes sense (quoting the value keeps the shell from splitting on
the semicolons).
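To illustrate the attribute-based placement described above, here is a minimal Python sketch of the framework-side logic: parsing a semicolon-separated attribute string like the one in the --attributes example, then matching an offer's attributes against a required set. The helper names are hypothetical, and real Mesos attributes may also carry ranges or sets rather than plain text values:

```python
def parse_attributes(attr_flag):
    """Parse a semicolon-separated attribute string such as
    "cluster:01z99;os:ubuntu-14-04;jdk:8" into a dict."""
    attrs = {}
    for pair in attr_flag.split(";"):
        # Split only on the first colon: values may contain hyphens etc.
        key, _, value = pair.strip().partition(":")
        attrs[key] = value
    return attrs


def offer_matches(offer_attrs, required):
    """Accept an offer only if every required attribute is present
    and has the expected value."""
    return all(offer_attrs.get(k) == v for k, v in required.items())
```

A scheduler would apply something like offer_matches() to the attributes carried by each resource offer, launching tasks on offers that fit and declining the rest.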

*Marco Massenzio*
*Distributed Systems Engineer*

On Tue, Jul 7, 2015 at 8:55 AM, CCAAT cc...@tampabay.rr.com wrote:

 Hello team_mesos,

 Is there any reason one set of (3) masters cannot talk to and manage
 several (many) different slave clusters of (3)? These slave clusters
 would be different arch, different mixes of resources and be running
 different frameworks, but all share/use the same (3) masters.


 Ideas on how to architect this experiment, would be keenly appreciated.


 James




Re: [RESULT] [VOTE] Release Apache Mesos 0.23.0 (rc1)

2015-07-07 Thread Marco Massenzio
As a general rule, we should not include anything other than the fixes in an 
RC, to avoid introducing further bugs in a never-ending cycle.




Please keep the cherry-picking strictly limited to a very narrow set (which I'm 
sure you're already doing, but your email seemed to imply otherwise ;-)




Thanks!



—
Sent from Mailbox

On Tue, Jul 7, 2015 at 3:56 PM, Adam Bordelon a...@mesosphere.io wrote:

 In case it wasn't obvious, rc1 did not pass the vote, due to a few build
 and unit test issues.
 Most of those fixes have been committed, so we will cut rc2 when the last
 blocker is resolved.
 This is your last chance to get any recently committed patches or resolved
 issues into 0.23.0.
 I am tracking the 0.23.0-rc2 cherry picks in
 https://docs.google.com/spreadsheets/d/14yUtwfU0mGQ7x7UcjfzZg2o1TuRMkn5SvJvetARM7JQ/edit#gid=0
 Please contact me ASAP if you want anything else included.
 Thanks,
 -Adam-
 P.S. 0.23 Dashboard is still in action:
 https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12326227
 On Tue, Jul 7, 2015 at 1:59 PM, Adam Bordelon a...@mesosphere.io wrote:
  -1 (non-binding) Network isolator will not compile.
 https://issues.apache.org/jira/browse/MESOS-3002

 The changes for MESOS-2800
 https://issues.apache.org/jira/browse/MESOS-2800 to Rename
 Option&lt;T&gt;::get(const T& _t) to getOrElse() happened after the 0.23.0-rc1
 cut and are not planned for cherry-picking into the release.
 The Fix Version of MESOS-2800
 https://issues.apache.org/jira/browse/MESOS-2800 is 0.24.0, so the
 Affects Version of MESOS-3002
 https://issues.apache.org/jira/browse/MESOS-3002 is really 0.24.0, and
 hence its Target Version should also be 0.24.0.
 Please let me know otherwise if you actually saw this build error when
 building from the 0.23.0-rc1 tag.

 On Tue, Jul 7, 2015 at 11:48 AM, Paul Brett pbr...@twitter.com wrote:

 -1 (non-binding) Network isolator will not compile.
 https://issues.apache.org/jira/browse/MESOS-3002


 On Tue, Jul 7, 2015 at 11:38 AM, Alexander Rojas alexan...@mesosphere.io
  wrote:

 +1 (non-binding)

 Ubuntu Server 15.04 gcc 4.9.2 and clang 3.6.0

 OS X Yosemite clang Apple LLVM based on 3.6.0


 On 06 Jul 2015, at 21:14, Jörg Schad jo...@mesosphere.io wrote:

 After more testing:
 -1 (non-binding)
 Docker tests failing on CentOS Linux release 7.1.1503 (Core); Tim is
 already on the issue (see MESOS-2996)


 On Mon, Jul 6, 2015 at 8:59 PM, Kapil Arya ka...@mesosphere.io wrote:

 +1 (non-binding)

 OpenSUSE Tumbleweed, Linux 4.0.3 / gcc 4.8.3

 On Mon, Jul 6, 2015 at 2:33 PM, Ben Whitehead 
 ben.whiteh...@mesosphere.io wrote:

 +1 (non-binding)

 openSUSE 13.2 Linux 3.16.7 / gcc-4.8.3
 Tested running Marathon 0.9.0-RC3 and Cassandra on Mesos
 0.1.1-SNAPSHOT.

 On Mon, Jul 6, 2015 at 6:57 AM, Till Toenshoff toensh...@me.com
 wrote:

 Even though Alex has IMHO already “busted” this vote ;) .. THANKS
 ALEX! … ,
 here are my results.

 +1

 OS 10.10.4 (14E46) + Apple LLVM version 6.1.0 (clang-602.0.53) (based
 on LLVM 3.6.0svn), make check - OK
 Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-32-generic x86_64) + gcc (Ubuntu
 4.8.2-19ubuntu1) 4.8.2, make check - OK




 On Jul 6, 2015, at 3:22 PM, Alex Rukletsov a...@mesosphere.com
 wrote:

 -1

 Compilation error on Mac OS 10.10.4 with clang 3.5, which is
 supported according to release notes.
 More details: https://issues.apache.org/jira/browse/MESOS-2991

 On Mon, Jul 6, 2015 at 11:55 AM, Jörg Schad jo...@mesosphere.io
 wrote:

 P.S. to my prior +1
 Tested on ubuntu-trusty-14.04 including docker.

 On Sun, Jul 5, 2015 at 6:44 PM, Jörg Schad jo...@mesosphere.io
 wrote:

 +1

 On Sun, Jul 5, 2015 at 4:36 PM, Nikolaos Ballas neXus 
 nikolaos.bal...@nexusgroup.com wrote:

  +1



  Sent from my Samsung device


  Original message 
 From: tommy xiao xia...@gmail.com
 Date: 05/07/2015 15:14 (GMT+01:00)
 To: user@mesos.apache.org
 Subject: Re: [VOTE] Release Apache Mesos 0.23.0 (rc1)

  +1

 2015-07-04 12:32 GMT+08:00 Weitao zhouwtl...@gmail.com:

  +1

 Sent from my iPhone

 On Jul 4, 2015, at 09:41, Marco Massenzio ma...@mesosphere.io wrote:

   +1

  *Marco Massenzio*
 *Distributed Systems Engineer*

 On Fri, Jul 3, 2015 at 12:25 PM, Adam Bordelon 
 a...@mesosphere.io wrote:

 Hello Mesos community,

 Please vote on releasing the following candidate as Apache Mesos
 0.23.0.

 0.23.0 includes the following:

 
  - Per-container network isolation
 - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+.
 - Dockerized slaves will properly recover Docker containers upon
 failover.

 as well as experimental support for:
  - Fetcher Caching
  - Revocable Resources
  - SSL encryption
  - Persistent Volumes
  - Dynamic Reservations

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc1

Re: Java detector for mesos masters and leader

2015-07-07 Thread Marco Massenzio
Hi Donald,

the information stored in the Zookeeper znode is a serialized Protocol
Buffer (see MasterInfo in mesos/mesos.proto
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob;f=include/mesos/mesos.proto;h=3dd4a5b7a4b3bc56bdc690d6adf05f88c0d28273;hb=HEAD);
here is a brief explanation of what is in there, plus an example as to how
to retrieve that info (in Python - but Java would work pretty much the
same):
http://codetrips.com/2015/06/12/apache-mesos-leader-master-discovery-using-zookeeper/

Please be aware that, as of 0.24 (currently planned for mid-August), we
plan to publish that information *only* in JSON (exactly to help all the
folks like you) so the method presented there will no longer work (for all
intents and purposes, the serialized MasterInfo to ZK is considered
deprecated as of 0.23 which is going out any day now: we're currently
testing a RC).

Note that if you intend to follow the leader you will need to set a
Watcher on the node itself or, perhaps better, on the znode path, so as
to get a callback whenever anything changes: the elected leader will always
be the lowest-numbered ephemeral znode (I am guessing you know all this,
but feel free to ping me if you need more info).
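The leader-selection rule above (the lowest-numbered ephemeral znode wins) can be sketched independently of any ZooKeeper client library. Note that the `info_` child-name prefix is an assumption for illustration and may differ between Mesos versions, so check the actual children of your election path:

```python
def leading_master_znode(children, prefix="info_"):
    """Given the child names of the Mesos ZooKeeper election path,
    return the znode name of the elected leader: the candidate with
    the lowest ephemeral sequence number, or None if no leader."""
    candidates = [c for c in children if c.startswith(prefix)]
    if not candidates:
        return None
    # ZooKeeper appends a zero-padded sequence number to each
    # ephemeral-sequential znode; the smallest one is the leader.
    return min(candidates, key=lambda c: int(c[len(prefix):]))
```

With a client such as kazoo, you would pass the result of get_children() on the election path to this function, read the winning znode's data to obtain the MasterInfo, and set a watch on the path to be notified when leadership changes.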

Hope this helps.


*Marco Massenzio*
*Distributed Systems Engineer*

On Tue, Jul 7, 2015 at 6:02 AM, Donald Laidlaw donlaid...@me.com wrote:

 Has anyone ever developed Java code to detect the mesos masters and
 leader, given a zookeeper connection?

 The reason I ask is because I would like to monitor mesos to report
 various metrics reported by the master. This requires detecting and
 tracking the leading master to query its /metrics/snapshot REST endpoint.

 Thanks,
 -Don


Re: [VOTE] Release Apache Mesos 0.23.0 (rc1)

2015-07-03 Thread Marco Massenzio
+1

*Marco Massenzio*
*Distributed Systems Engineer*

On Fri, Jul 3, 2015 at 12:25 PM, Adam Bordelon a...@mesosphere.io wrote:

 Hello Mesos community,

 Please vote on releasing the following candidate as Apache Mesos 0.23.0.

 0.23.0 includes the following:

 
 - Per-container network isolation
 - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+.
 - Dockerized slaves will properly recover Docker containers upon failover.

 as well as experimental support for:
 - Fetcher Caching
 - Revocable Resources
 - SSL encryption
 - Persistent Volumes
 - Dynamic Reservations

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc1

 

 The candidate for Mesos 0.23.0 release is available at:
 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz

 The tag to be voted on is 0.23.0-rc1:
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc1

 The MD5 checksum of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz.md5

 The signature of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz.asc

 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS

 The JAR is up in Maven in a staging repository here:
 https://repository.apache.org/content/repositories/orgapachemesos-1056

 Please vote on releasing this package as Apache Mesos 0.23.0!

 The vote is open until Fri July 10th, 12:00 PDT 2015 and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Mesos 0.23.0
 [ ] -1 Do not release this package because ...

 Thanks,
 -Adam-



Re: mesos cluster can't fit federation cluster

2015-07-02 Thread Marco Massenzio
On Wed, Jul 1, 2015 at 11:38 PM, tommy xiao xia...@gmail.com wrote:

 Hi Marco,

 I want fault tolerance for slave nodes across multiple datacenters, but I
 found the possible setup methods are not production-ready.


what kind of fault-tolerance are you looking for here?
Against one (or either) of the DC going away or network partitioning? or
one (or more) of the racks in one DC to go away?

Depending on what you want to protect yourself against there may be
different ways to achieve that.
I'm sorry I haven't been around Mesos long enough to really be
knowledgeable about the specifics here; but have built HA systems before
around VPCs and On-Prem solutions, and I know bi-di routing can be achieved
using gateways and/or VPN (dedicated) links (we also solved that very issue
at Google too, but I can't talk about that :).

I'm sure the Twitter folks have solved that same problem too, but I'm
guessing they may not be able to share much either?


 2015-07-02 1:38 GMT+08:00 Marco Massenzio ma...@mesosphere.io:

 Hi Tommy,

 not sure what your use-case is, but you are correct, the master/slave
 nodes need to have bi-directional connectivity.
 However, there is no fundamental reason why those have to be public IPs
 - so long as they are routable (either via DNS discovery and / or VPN or
 other network-layer mechanisms) that will work.
 (I mean, without even thinking too hard about this - so I may be entirely
 wrong here - you could place a couple of Nginx/HAproxy nodes with two NICs,
 one visible to the Slaves, the other in the VPC subnet, and forward all
 traffic? I'm sure I'm missing something here :)

 When you launch the master nodes, you specify the NICs they need to
 listen to via the --ip option, while the slave nodes have the --master flag
 that should have either a hostname:port or ip:port argument: so long as
 they are routable, this *should* work (although, admittedly, I've never
 tried this personally).

 One concern I would have in such an arrangement though, would be about
 network partitioning: if the DC/DC connectivity were to drop, you'd
 suddenly lose all master/slave connectivity; it's also not clear to me that
 having sectioned the Masters from the Slaves would give you better
 availability and/or reliability and/or security?
 It would be great to understand the use-case, so we could see what could
 be added (if anything) to Mesos going forward.


 *Marco Massenzio*
 *Distributed Systems Engineer*

 On Wed, Jul 1, 2015 at 9:15 AM, tommy xiao xia...@gmail.com wrote:

 Hello,

 I would like to deploy master nodes in a private zone, and setup mesos
 slaves in another datacenter. But the multi-datacenter mode doesn't work:
 it requires that the slave nodes can reach the master nodes on a public
 network IP, and in the production zone the gateway IP does not belong to
 the master nodes. Does anyone have experience with a similar
 multi-datacenter deployment?

 I prefer the Kubernetes cluster proposal.

 https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/proposals/federation-high-level-arch.png


 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com





 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com



Re: step-step guide for New-to-Mesos

2015-07-01 Thread Marco Massenzio
Hey Hajira,

you may find this blog entry useful:
https://mesosphere.com/blog/2015/04/02/continuous-deployment-with-mesos-marathon-docker/

A bit older, but more specific to Docker, please have a look here:
https://mesosphere.github.io/marathon/docs/native-docker.html

generally speaking, there is a lot of info available at:
http://docs.mesosphere.com you could find useful too.

Obviously, you can launch containers using a very simple framework, but
that's largely not necessary; and most certainly, don't change the
mesos.proto contents (this will prevent a lot of stuff from working): that
is meant to be a read-only file (well, unless one is doing development on
Mesos itself).

We have made RENDLER publicly available as an example framework:
https://github.com/mesosphere/RENDLER

HTH

*Marco Massenzio*
*Distributed Systems Engineer*

On Wed, Jul 1, 2015 at 9:23 AM, haosdent haosd...@gmail.com wrote:

 Sorry, marthon should be marathon https://mesosphere.github.io/marathon/

 On Wed, Jul 1, 2015 at 9:23 PM, haosdent haosd...@gmail.com wrote:

 Hi, @Hajira

 Next step is to run tasks in containers in Mesos.
 You want to run something like a web application in Docker or similar? You
 could try marthon or another existing framework first. I think you don't need to
 write a framework.


 On Wed, Jul 1, 2015 at 9:07 PM, Hajira Jabeen hajirajab...@gmail.com
 wrote:


 Hello,

 Being new to Mesos (and everything related to big data),
 I have been able to install mesos and run example frameworks.
 Next step is to run tasks in containers in Mesos.

 Do I have to write a framework for this, or just change the
 ContainerInfo etc. fields in the mesos.proto file?

 Is there any step-step working guide ?

 Mesos documentation assumes a lot of background knowledge that I do not
 have ..

 Any help and pointers will be appreciated ..

 Regards

 Hajira





 On 30 June 2015 at 00:23, Andras Kerekes andras.kere...@ishisystems.com
  wrote:

 Hi,



 Is there a preferred way to do service discovery in Mesos via mesos-dns
 running on CoreOS? I’m trying to implement a simple app which consists of
 two docker containers and one of them (A) depends on the other (B). What
 I’d like to do is to tell container A to use a fix dns name
 (containerB.marathon.mesos in case of mesos-dns) to find the other service.
 There are at least 3 different ways I think it can be done, but the 3 I
 found all have some shortcomings.



 1.   Use SRV records to get the port along with the IP. Con: I’d
 prefer not to build the logic of handling SRV records into the app, it can
 be a legacy app that is difficult to modify

 2.   Use haproxy on slaves and connect via a well-known port on
 localhost. Cons: the Marathon provided script does not run on CoreOS, also
 I don’t know how to run haproxy on CoreOS outside of a docker container. If
 it is running in a docker container, then how can it dynamically allocate
 ports on localhost if a new service is discovered in Marathon/Mesos?

 3.   Use dedicated port to bind the containers to. Con: I can have
 only as many instances of a service as many slaves I have because they bind
 to the same port.



 What other alternatives are there?



 Thanks,

 Andras





 --
 Best Regards,
 Haosdent Huang




 --
 Best Regards,
 Haosdent Huang



Re: mesos cluster can't fit federation cluster

2015-07-01 Thread Marco Massenzio
Hi Tommy,

not sure what your use-case is, but you are correct, the master/slave nodes
need to have bi-directional connectivity.
However, there is no fundamental reason why those have to be public IPs -
so long as they are routable (either via DNS discovery and / or VPN or
other network-layer mechanisms) that will work.
(I mean, without even thinking too hard about this - so I may be entirely
wrong here - you could place a couple of Nginx/HAproxy nodes with two NICs,
one visible to the Slaves, the other in the VPC subnet, and forward all
traffic? I'm sure I'm missing something here :)

When you launch the master nodes, you specify the NICs they need to listen
to via the --ip option, while the slave nodes have the --master flag that
should have either a hostname:port or ip:port argument: so long as they are
routable, this *should* work (although, admittedly, I've never tried this
personally).

One concern I would have in such an arrangement though, would be about
network partitioning: if the DC/DC connectivity were to drop, you'd
suddenly lose all master/slave connectivity; it's also not clear to me that
having sectioned the Masters from the Slaves would give you better
availability and/or reliability and/or security?
It would be great to understand the use-case, so we could see what could be
added (if anything) to Mesos going forward.


*Marco Massenzio*
*Distributed Systems Engineer*

On Wed, Jul 1, 2015 at 9:15 AM, tommy xiao xia...@gmail.com wrote:

 Hello,

 I would like to deploy master nodes in a private zone, and setup mesos
 slaves in another datacenter. But the multi-datacenter mode doesn't work:
 it requires that the slave nodes can reach the master nodes on a public
 network IP, and in the production zone the gateway IP does not belong to
 the master nodes. Does anyone have experience with a similar
 multi-datacenter deployment?

 I prefer the Kubernetes cluster proposal.

 https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/proposals/federation-high-level-arch.png


 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com



Re: [Breaking Change 0.24 Upgrade path] ZooKeeper MasterInfo change.

2015-06-24 Thread Marco Massenzio
Folks,

as heads-up, we are planning to convert the format of the MasterInfo
information stored in ZooKeeper from the Protocol Buffer binary format to
JSON - this is in conjunction with the HTTP API development, to allow
frameworks *not* to depend on libmesos and other binary dependencies to
interact with Mesos Master nodes.

*NOTE* - there is no change in 0.23 (so any Master/Slave/Framework that is
currently working in 0.22 *will continue to work* in 0.23 too) but as of
Mesos 0.24, frameworks and other clients relying on the binary format will
break.

The details of the design are in this Google Doc:
https://docs.google.com/document/d/1i2pWJaIjnFYhuR-000NG-AC1rFKKrRh3Wn47Y2G6lRE/edit

the actual work is detailed in MESOS-2340:
https://issues.apache.org/jira/browse/MESOS-2340

and the patch (and associated test) are here:
https://reviews.apache.org/r/35571/
https://reviews.apache.org/r/35815/
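To give a feel for what consuming the new format might look like, here is a hedged sketch that decodes a JSON-serialized MasterInfo blob read from the leader's znode. The field names (`hostname`, `port`) follow MasterInfo in mesos.proto, but verify them against the version you run; the example payload in the comment is invented for illustration:

```python
import json


def parse_master_info(raw_bytes):
    """Decode a JSON-serialized MasterInfo blob (as read from the
    leading master's znode) and return (hostname, port)."""
    info = json.loads(raw_bytes.decode("utf-8"))
    return info["hostname"], int(info["port"])


# Example (invented) payload:
# b'{"id": "...", "hostname": "master1.example.com", "port": 5050}'
```

A framework without libmesos could then connect to the returned hostname:port directly, which is exactly the dependency-free path this change is meant to enable.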

*Marco Massenzio*
*Distributed Systems Engineer*


Re: mesosphere.io broken?

2015-06-17 Thread Marco Massenzio
Just to add some color to the Elastic Mesos thing, we're working with
Google to enable deploying a complete DCOS cluster on GCP using their brand
new Deployment Manager (v2) via the Click-to-Deploy framework.

We have these working on an experimental basis: we need to conduct a bit
more testing and work on a couple of rough edges before we can release
them beta for people to have a good user experience.

I must say it's pretty exciting to click a button and see shortly afterwards
a full Mesos Cluster come to life on Google Cloud, so I'm really itching to
get the templates in a state where they can be used by other folks!



*Marco Massenzio*
*Distributed Systems Engineer*

On Wed, Jun 17, 2015 at 4:30 AM, Alex Rukletsov a...@mesosphere.com wrote:

 For downloads, use https://mesosphere.com/downloads/
 Elastic Mesos has been decommissioned, use https://google.mesosphere.com/
 or https://digitalocean.mesosphere.com/ but keep in mind they will be
 decommissioned soon (~1 month) as well. However, if you want to try DCOS
 installation on AWS, check https://mesosphere.com/product/

 On Wed, Jun 17, 2015 at 12:51 PM, Brian Candler b.cand...@pobox.com
 wrote:

 Looking for Mesos .deb packages, on Google I find links to
 http://mesosphere.io/downloads/
 http://elastic.mesosphere.io/
 but these are giving 503 Service Unavailable errors.

 Is there a problem, or have these sites gone / migrated away?





Re: Introducing BDS: A datacenter scripting language

2015-05-15 Thread Marco Massenzio
That's awesome, Pablo - will definitely be fooling around with it!
Thanks for using Mesos, BTW - always good to see folks building cool stuff
on top of it :)

*Marco Massenzio*
*Distributed Systems Engineer*

On Thu, May 14, 2015 at 6:45 PM, Pablo Cingolani 
pablo.e.cingol...@gmail.com wrote:


 Hi Everyone,
   I've been working on a simple programming language to create large
 data pipelines on Mesos. The language is called BDS which stands
 for BigDataScript (yes, the name is kind of a joke for all jargon-lovers
 out there) and here is the web page:

http://pcingola.github.io/BigDataScript/

  Needless to say, it's open source and the code is available on GitHub.
 At the moment I'm using BDS mostly for analysis of large genetic datasets
 on our 25,000 core cluster, but it should scale to large(er) clusters as
 well.

   BDS has a few interesting features:
 - Runs on Mesos (obviously) as well as SunGridEngine, Torque,
   MOAB, a large server or just your laptop.

 - You can develop on your laptop (without having to install Mesos or
any cluster management system) and then deploy your script to a
 Mesos
cluster/datacenter without modification.

 - It performs automatic task dependency and schedules tasks according
 to
   the implicit (or explicit) DAG.

 - It has lazy processing. Checks whether performing a task is
 necessary and
   skips tasks whose output does not need to be updated (make-style).

 - It performs automatic checkpointing and has absolute serialization,
 so you
   can copy the checkpoint file to another computer and continue
 running
    exactly where you left off.

 - It can handle several parallel pipeline branches (threads).

 - Allows to define DAGs in a declarative form (using 'goals').

 - Cleans up stale files (and queues tasks in non-Mesos cluster).

 Other cool features:

  - Automatically parses command line options in your scripts (it also
 creates help for you)
  - Logs every process's stdout / stderr and exit status
  - It has a built in debugger
  - It has a built in unit testing framework

   You can read more about all these features here:

http://pcingola.github.io/BigDataScript/bigDataScript_manual.html

   I hope you find it useful and please do send me any
 feedback you have.
   Yours

   Pablo






Re: Cisco is Powered By Mesos

2015-05-12 Thread Marco Massenzio
Thanks, Keith, for sharing this!
That's pretty cool stuff, I guess we'll have to check Shipped out ;)

Thanks for using Mesos!

*Marco Massenzio*
*Distributed Systems Engineer*

On Tue, May 12, 2015 at 1:38 PM, Keith Chambers (kechambe) 
kecha...@cisco.com wrote:

  Hello Adam,

  Yesterday at Cloud Foundry Summit Cisco first discussed our product
 called “Shipped” so I guess I can talk about it now.  :-)

  Our tag line is “Your idea running in production in 5 minutes.”  It’s
 developed by developers for developers.  Shipped makes it simple to create
 on-demand production like dev environments, build applications using
 microservices patterns, and deploy them to an instance of the open source
 microservices-infrastructure
 https://github.com/CiscoCloud/microservices-infrastructure container
 runtime (multi-dc Marathon).

  Shipped integrates with tools developers *actually like using*.  We
 leverage GitHub for authentication and source control, Vagrant for
 on-demand developer environments, and Bintray for wickedly fast Docker
 repos.  The Shipped CI service is powered by open source Drone, which we
 have 2 full time developers working on.  We’re also developing a Drone
 framework for Mesos that we will release to GitHub under Apache license.

  Shipped maintains a “timeline” for every project.  The timeline is a
 chronological history of high value events across Dev and Ops.  i.e., pull
 requests, failed builds, production failures, etc.  One killer feature of
 Shipped is that we automatically integrate the project timeline with a room
 in Cisco Spark http://www.webex.com/ciscospark/ (similar to Slack).
 This makes it simple for teams to work together and deliver software
 quicker — honestly it’s pretty slick!

  Shipped itself runs on top of Marathon in Docker containers.  We have a
 number of microservices, all written in Go and all using Cassandra for
 their backend DB.  We use the excellent Kafka framework from Joe Stein for
 cross service messaging and event collection.  We are interested in
 creating a multi-DC Cassandra Mesos framework, but for now Cassandra is on
 VMs.

  We’re at 50 Mesos followers nodes now and growing quickly.

  Thanks!
 Keith





   From: Adam Bordelon a...@mesosphere.io
 Reply-To: user@mesos.apache.org user@mesos.apache.org
 Date: Monday, May 11, 2015 at 10:50 PM
 To: user@mesos.apache.org user@mesos.apache.org
 Subject: Re: Cisco is Powered By Mesos

   Glad to hear it Keith! We're very excited to have you in the community.
  I've added Cisco to the adopters list, and it will go out with the next
 website update.
  Can you share any juicy details about how you're using Mesos and at what
 scale?

 On Mon, May 11, 2015 at 10:20 AM, Keith Chambers (kechambe) 
 kecha...@cisco.com wrote:

  We use Mesos in production at Cisco.

  Please add us to the “Powered By Mesos” list too!
 https://mesos.apache.org/documentation/latest/powered-by-mesos/

  Keith  :-)






Re: Writing outside the sandbox

2015-05-09 Thread Marco Massenzio
Out of my own curiosity (sorry, I have no fresh insights into the issue
here) did you try to run the script and write to a non-NFS mounted
directory? (same ownership/permissions)

This way we could at least find out whether it's something related to NFS,
or a more general permission-related issue.

*Marco Massenzio*
*Distributed Systems Engineer*

On Sat, May 9, 2015 at 5:10 AM, John Omernik j...@omernik.com wrote:

 Here is the testing I am doing. I used a simple script (run.sh). It writes
 the user it is running as to stderr (so it's the same log as the errors
 from file writing) and then tries to make a directory in NFS, and then
 touch a file in NFS.  Note: this script, run directly, works on every node.
 You can see the JSON I used in Marathon, and in the sandbox results you
 can see the user is indeed darkness and the directory cannot be created.
 However, when run directly as the same user, the script creates the
 directory with no issue.  Now, I realize this COULD still be an NFS quirk,
 but this testing points at some restriction in how Marathon kicks off the
 cmd.  Any thoughts on where to look would be very helpful!

 John



 Script:

 #!/bin/bash
 echo "Writing whoami to stderr for one stop logging" 1>&2
 whoami 1>&2
 mkdir /mapr/brewpot/mesos/storm/test/test1
 touch /mapr/brewpot/mesos/storm/test/test1/testing.go



 Run Via Marathon


 {
   "cmd": "/mapr/brewpot/mesos/storm/run.sh",
   "cpus": 1.0,
   "mem": 1024,
   "id": "permtest",
   "user": "darkness",
   "instances": 1
 }


 I0509 07:02:52.457242  9562 exec.cpp:132] Version: 0.21.0
 I0509 07:02:52.462700  9570 exec.cpp:206] Executor registered on slave
 20150505-145508-1644210368-5050-8608-S0
 Writing whoami to stderr for one stop logging
 darkness
 mkdir: cannot create directory `/mapr/brewpot/mesos/storm/test/test1':
 Permission denied
 touch: cannot touch `/mapr/brewpot/mesos/storm/test/test1/testing.go': No
 such file or directory


 Run Via Shell:


 $ /mapr/brewpot/mesos/storm/run.sh
 Writing whoami to stderr for one stop logging
 darkness
 darkness@hadoopmapr1:/mapr/brewpot/mesos/storm$ ls ./test/
 test1
 darkness@hadoopmapr1:/mapr/brewpot/mesos/storm$ ls ./test/test1/
 testing.go


 On Sat, May 9, 2015 at 3:14 AM, Adam Bordelon a...@mesosphere.io wrote:

 I don't know of anything inside of Mesos that would prevent you from
 writing to NFS. Maybe examine the environment variables set when running as
 that user. Or are you running in a Docker container? Those can have
 additional restrictions.

 On Fri, May 8, 2015 at 4:44 PM, John Omernik j...@omernik.com wrote:

 I am doing something where people may recommend against my course of
 action. However, I am curious if there is a way: basically, I have a
 process being kicked off in Marathon that is trying to write to an NFS
 location. The permissions of the user running the task and the NFS
 location are good. So what component of Mesos or Marathon is keeping me
 from writing there? (I am getting permission denied.) Is this one of
 those things that is just not allowed, or is there an option to pass to
 Marathon to allow this?  Thanks!

 --
 Sent from my iThing






Re: Google Borg paper

2015-04-17 Thread Marco Massenzio
At Google there are always two ways to do everything: the deprecated one and
the one that's not quite ready yet.

I'm sure Borg is alive and well (but deprecated) and Omega has been
deployed (but ain't quite ready yet)

They were already working on it in 2010, I'm sure they're still at it.

Will confirm soon as I find out more.
On Apr 16, 2015 9:08 PM, Christos Kozyrakis kozyr...@gmail.com wrote:

 Maxime,
 to the best of my knowledge Borg is still doing just fine at Google. It
 may have been enhanced by the Omega effort but it has not been replaced.
 Nevertheless, I will let any Googlers on the list go into details.
 Christos

 On Thu, Apr 16, 2015 at 4:19 PM, Maxime Brugidou 
 maxime.brugi...@gmail.com wrote:

 Hi,

 Not sure if everyone noticed but Google just published a paper about the
 Borg architecture. I guess it's been replaced by Omega now internally at
 Google (if anyone from Google can confirm?)

 It might be of interest for Mesos :)

 http://research.google.com/pubs/pub43438.html

 Best,
 Maxime




 --
 Christos