Re: [VOTE] Release Apache Mesos 0.24.0 (rc2)

2015-09-02 Thread Marco Massenzio
+1 (non-binding)

All tests (including ROOT) pass on Ubuntu 14.04
All tests pass on CentOS 7.1; ROOT tests cause 1 failure:

[  FAILED  ] 1 test, listed below:
[  FAILED  ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS

$ cat /etc/centos-release
CentOS Linux release 7.1.1503 (Core)

This seems to be new[0], but possibly related to some limitation/setting of
my test machine (a VirtualBox VM with 2 CPUs on an Ubuntu host).
Interestingly enough, I don't see the 4 failures Vaibhav reported, but my
log shows *YOU HAVE 11 DISABLED TESTS* (he has 12).

[0] https://issues.apache.org/jira/issues/?filter=12333150

*Marco Massenzio*

*Distributed Systems Engineer*
*http://codetrips.com*

On Tue, Sep 1, 2015 at 5:45 PM, Vinod Kone  wrote:

> Hi all,
>
>
> Please vote on releasing the following candidate as Apache Mesos 0.24.0.
>
>
> 0.24.0 includes the following:
>
>
> 
>
> Experimental support for v1 scheduler HTTP API!
>
> This release also wraps up support for the fetcher.
>
> The CHANGELOG for the release is available at:
>
>
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.24.0-rc2
>
>
> 
>
>
> The candidate for Mesos 0.24.0 release is available at:
>
> https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz
>
>
> The tag to be voted on is 0.24.0-rc2:
>
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.24.0-rc2
>
>
> The MD5 checksum of the tarball can be found at:
>
>
> https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz.md5
>
>
> The signature of the tarball can be found at:
>
>
> https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz.asc
>
>
> The PGP key used to sign the release is here:
>
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
>
> The JAR is up in Maven in a staging repository here:
>
> https://repository.apache.org/content/repositories/orgapachemesos-1066
>
>
> Please vote on releasing this package as Apache Mesos 0.24.0!
>
>
> The vote is open until Fri Sep  4 17:33:05 PDT 2015 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
>
> [ ] +1 Release this package as Apache Mesos 0.24.0
>
> [ ] -1 Do not release this package because ...
>
>
> Thanks,
>
> Vinod
>
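For anyone verifying the candidate locally, a minimal sketch (assuming
standard curl/gpg/md5sum tooling; the URLs are the ones quoted above):

```
# Fetch the tarball, its MD5, and its signature (URLs from the vote email).
BASE=https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2
curl -O $BASE/mesos-0.24.0.tar.gz
curl -O $BASE/mesos-0.24.0.tar.gz.md5
curl -O $BASE/mesos-0.24.0.tar.gz.asc

# Import the release KEYS, then check the signature and integrity.
curl -s https://dist.apache.org/repos/dist/release/mesos/KEYS | gpg --import
gpg --verify mesos-0.24.0.tar.gz.asc mesos-0.24.0.tar.gz
md5sum mesos-0.24.0.tar.gz   # compare against the .md5 file's contents
```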


Re: mesos-master resource offer details

2015-09-02 Thread Haripriya Ayyalasomayajula
Alex,

The problem I am facing is that there are no allocations made. The
mesos-master sends 5 offers to Marathon, but Marathon DECLINEs all the
offers. I am trying to debug why it is rejecting them. I traced through the
source code and saw that it calls the ResourceMatcher to match the
resources offered against the resources available; in my case it reports a
problem with the CPUs offered (insufficient resources). I am trying to get
the details of the resource offer - the CPUs being offered - and I'm stuck
there.

I'd really appreciate any suggestions! Thanks.

On Wed, Sep 2, 2015 at 9:54 AM, Alex Rukletsov <a...@mesosphere.com> wrote:

> To what Haosdent said: you cannot get a list of offers from the master
> logs, but you can get a list of allocations from the built-in allocator if
> you bump up the log level (GLOG_v=2).
>
> On Wed, Sep 2, 2015 at 7:36 AM, haosdent <haosd...@gmail.com> wrote:
>
>> If the offer is rejected by your framework, could you find this log in
>> mesos:
>>
>> ```
>> xxx Processing DECLINE call for offers xxx
>> ```
>>
>> On Wed, Sep 2, 2015 at 1:31 PM, haosdent <haosd...@gmail.com> wrote:
>>
>>> >Well, the log you mentioned above is when the resource offer is
>>> accepted and mesos-master then allocates the cpu.
>>> Hi @Haripriya, as far as I know, the log I showed above is from the
>>> allocator allocating resources and making an offer, which then triggers
>>> Master::offer to send the offer to frameworks. So the log above is not
>>> from an offer being accepted; it is emitted before the offer is sent to
>>> the framework, and it contains the details of that offer.
>>>
>>> For your problem:
>>> >In my case, the offer is being rejected
>>> Do you mean the offer is rejected by your framework after your framework
>>> receives it? Or do you mean your framework never receives offers from
>>> Mesos?
>>>
>>>
>>> On Wed, Sep 2, 2015 at 1:51 AM, Haripriya Ayyalasomayajula <
>>> aharipriy...@gmail.com> wrote:
>>>
>>>> Well, the log you mentioned above is when the resource offer is
>>>> accepted and mesos-master then allocates the cpu. In my case, the offer is
>>>> being rejected. I am trying to debug the reason why the resource offer
>>>> is being rejected.
>>>>
>>>> On Tue, Sep 1, 2015 at 10:00 AM, haosdent <haosd...@gmail.com> wrote:
>>>>
>>>>> Yes, at the default log level the Mesos code only prints the number of
>>>>> offers. If you want more details, you could start with the environment
>>>>> variable GLOG_v=2 set. Then you should see a message similar to this:
>>>>>
>>>>> I0902 00:55:17.465920 143396864 hierarchical.hpp:935] Allocating
>>>>> cpus(*):x; mem(*):x; disk(*):x; ports(*):[x-x] on slave
>>>>> 20150902-005512-16777343-5050-46447-S0 to framework 20150902-00551
>>>>> 2-16777343-5050-46447-
>>>>>
>>>>> But using GLOG_v=2 produces a lot of logs. If you just want the
>>>>> resources allocated to a task or executor, you could get that
>>>>> information from the slave's state.json endpoint.
>>>>>
>>>>> On Wed, Sep 2, 2015 at 12:41 AM, Haripriya Ayyalasomayajula <
>>>>> aharipriy...@gmail.com> wrote:
>>>>>
>>>>>> Thanks, but is there no way to get the details of the resource offer
>>>>>> without tweaking the source code of the framework scheduler? I don't
>>>>>> see anything in my logs.
>>>>>>
>>>>>> All I can see is
>>>>>>
>>>>>> mesos-master: Sending 5 offers to framework 20150815- (marathon)
>>>>>> at scheduler-50ajaja@pqr
>>>>>>
>>>>>> I can't find any other details in the logs..
>>>>>>
>>>>>> On Mon, Aug 31, 2015 at 8:36 PM, haosdent <haosd...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, Haripriya.
>>>>>>>
>>>>>>> >1. I am trying to see the details of the resource offer made by
>>>>>>> the mesos master. I can see in the logs that there are 5 resource offers
>>>>>>> made but I am not sure where to get the details of the resource offers -
>>>>>>> the cpu, memory etc.
>>>>>>>
>>>>>>> You could print offer details in your framework
>>>>>>> Scheduler#resourceOffers method. These offer messages can also be
>>>>>>> found in the Mesos log.

Re: mesos-master resource offer details

2015-09-02 Thread Alex Rukletsov
If my understanding of how the Mesos allocation algorithm works is correct,
there must be allocations made if there are offers made. The allocator
performs allocation, which is used by the master to generate offers to
frameworks, which, in turn, may be accepted or declined. Have you tried
increasing the log level for the master as suggested?

To help you with your problem, could you please describe the setup you use?
Specifically: how "fat" your agents (aka slaves) are, what task description
you send to Marathon, and what resources are available in the cluster
(state.json).
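
(For reference, a quick way to see what each agent actually advertises - a
sketch, assuming jq is installed and the master on the default port 5050:)

```
# Per-agent totals and current usage, from the master's state endpoint.
curl -s http://localhost:5050/master/state.json \
  | jq '.slaves[] | {hostname, resources, used_resources}'
```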

On Wed, Sep 2, 2015 at 7:02 PM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:

> Alex,
>
> The problem I am facing is that there are no allocations made. The
> mesos-master sends 5 offers to Marathon, but Marathon DECLINEs all the
> offers. I am trying to debug why it is rejecting them. I traced through the
> source code and saw that it calls the ResourceMatcher to match the
> resources offered against the resources available; in my case it reports a
> problem with the CPUs offered (insufficient resources). I am trying to get
> the details of the resource offer - the CPUs being offered - and I'm stuck
> there.
>
> I'd really appreciate any suggestions! Thanks.
>
> On Wed, Sep 2, 2015 at 9:54 AM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> To what Haosdent said: you cannot get a list of offers from the master
>> logs, but you can get a list of allocations from the built-in allocator if
>> you bump up the log level (GLOG_v=2).
>>
>> On Wed, Sep 2, 2015 at 7:36 AM, haosdent <haosd...@gmail.com> wrote:
>>
>>> If the offer is rejected by your framework, could you find this log in
>>> mesos:
>>>
>>> ```
>>> xxx Processing DECLINE call for offers xxx
>>> ```
>>>
>>> On Wed, Sep 2, 2015 at 1:31 PM, haosdent <haosd...@gmail.com> wrote:
>>>
>>>> >Well, the log you mentioned above is when the resource offer is
>>>> accepted and mesos-master then allocates the cpu.
>>>> Hi @Haripriya, as far as I know, the log I showed above is from the
>>>> allocator allocating resources and making an offer, which then triggers
>>>> Master::offer to send the offer to frameworks. So the log above is not
>>>> from an offer being accepted; it is emitted before the offer is sent to
>>>> the framework, and it contains the details of that offer.
>>>>
>>>> For your problem:
>>>> >In my case, the offer is being rejected
>>>> Do you mean the offer is rejected by your framework after your framework
>>>> receives it? Or do you mean your framework never receives offers from
>>>> Mesos?
>>>>
>>>>
>>>> On Wed, Sep 2, 2015 at 1:51 AM, Haripriya Ayyalasomayajula <
>>>> aharipriy...@gmail.com> wrote:
>>>>
>>>>> Well, the log you mentioned above is when the resource offer is
>>>>> accepted and mesos-master then allocates the cpu. In my case, the offer is
>>>>> being rejected. I am trying to debug the reason why the resource offer
>>>>> is being rejected.
>>>>>
>>>>> On Tue, Sep 1, 2015 at 10:00 AM, haosdent <haosd...@gmail.com> wrote:
>>>>>
>>>>>> Yes, at the default log level the Mesos code only prints the number of
>>>>>> offers. If you want more details, you could start with the environment
>>>>>> variable GLOG_v=2 set. Then you should see a message similar to this:
>>>>>>
>>>>>> I0902 00:55:17.465920 143396864 hierarchical.hpp:935] Allocating
>>>>>> cpus(*):x; mem(*):x; disk(*):x; ports(*):[x-x] on slave
>>>>>> 20150902-005512-16777343-5050-46447-S0 to framework 20150902-00551
>>>>>> 2-16777343-5050-46447-
>>>>>>
>>>>>> But using GLOG_v=2 produces a lot of logs. If you just want the
>>>>>> resources allocated to a task or executor, you could get that
>>>>>> information from the slave's state.json endpoint.
>>>>>>
>>>>>> On Wed, Sep 2, 2015 at 12:41 AM, Haripriya Ayyalasomayajula <
>>>>>> aharipriy...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks, but is there no way to get the details of the resource offer
>>>>>>> without tweaking the source code of the framework scheduler? I don't
>>>>>>> see anything in my logs.

Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread Steven Schlansker
I 100% agree with your philosophy here, and I suspect it's something shared in 
the Mesos community.

I just think that we can restrict the domain of the failure to a smaller 
reasonable window -- once you are in the context of "I am doing work to launch 
a specific task", there is a well defined "success / failure / here is an error 
message" path defined already.  Users expect tasks to fail and can see the 
errors.

I think that a lot of these assertions are in fact more appropriate as task 
failures.  But I agree that they should be fatal to *some* part of the system, 
just not the whole agent entirely.

On Sep 1, 2015, at 4:33 PM, Marco Massenzio  wrote:

> That's one of those areas for discussions that is so likely to generate a 
> flame war that I'm hesitant to wade in :)
> 
> In general, I would agree with the sentiment expressed there:
> 
> > If the task fails, that is unfortunate, but not the end of the world. Other 
> > tasks should not be affected.
> 
> which is, in fact, to a large extent exactly what Mesos does; the example given 
> in MESOS-2684, as it happens, is for a "disk full failure" - carrying on as 
> if nothing had happened, is only likely to lead to further (and worse) 
> disappointment.
> 
> The general philosophy back at Google (and which certainly informs the design 
> of Borg[0]) was "fail early, fail hard" so that either (a) the service is 
> restarted and hopefully the root cause cleared or (b) someone (who can 
> hopefully do something) will be alerted about it.
> 
> I think it's ultimately a matter of scale: up to a few tens of servers, you 
> can assume there is some sort of 'log-monitor' that looks out for errors and 
> other anomalies and alerts humans that will then take a look and possibly 
> apply some corrective action - when you're up to hundreds or thousands 
> (definitely Mesos territory) that's not practical: the system should either 
> self-heal or crash-and-restart.
> 
> All this to say, that it's difficult to come up with a general *automated* 
> approach to unequivocally decide if a failure is "fatal" or could just be 
> safely "ignored" (after appropriate error logging) - in general, when in 
> doubt it's probably safer to "noisily crash & restart" and rely on the 
> overall system's HA architecture to take care of replication and consistency.
> (and an intelligent monitoring system that only alerts when some failure 
> threshold is exceeded).
> 
> From what I've seen so far (granted, still a novice here) it seems that Mesos 
> subscribes to this notion, assuming that Agent Nodes will come and go, and 
> usually Tasks survive (for a certain amount of time anyway) a Slave restart 
> (obviously, if the physical h/w is the ultimate cause of failure, well, then 
> all bets are off).
> 
> Having said all that - if there are areas where we have been over-eager with 
> our CHECKs, we should definitely revisit that and make it more 
> crash-resistant, absolutely.
> 
> [0] http://research.google.com/pubs/pub43438.html
> 
> Marco Massenzio
> Distributed Systems Engineer
> http://codetrips.com
> 
> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker 
>  wrote:
> 
> 
> On Aug 31, 2015, at 11:54 AM, Scott Rankin  wrote:
> >
> > tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354] 
> > CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
> 
> I reported a similar bug a while back:
> 
> https://issues.apache.org/jira/browse/MESOS-2684
> 
> This seems to be a class of bugs where some filesystem operations which may 
> fail for unforeseen reasons are written as assertions which crash the 
> process, rather than failing only the task and communicating back the error 
> reason.
> 
> 
> 



Re: mesos-master resource offer details

2015-09-02 Thread Alex Rukletsov
To what Haosdent said: you cannot get a list of offers from master logs,
but you can get a list of allocations from the built-in allocator if you
bump up the log level (GLOG_v=2).
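
A sketch of what that looks like in practice (the flags and paths here are
illustrative, not from this thread):

```
# Start the master with verbose (level-2) glog output so the built-in
# allocator logs every allocation it makes.
GLOG_v=2 mesos-master --work_dir=/var/lib/mesos --log_dir=/var/log/mesos

# Then look for the allocator's per-agent allocation lines:
grep "Allocating" /var/log/mesos/mesos-master.INFO
```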

On Wed, Sep 2, 2015 at 7:36 AM, haosdent <haosd...@gmail.com> wrote:

> If the offer is rejected by your framework, could you find this log in
> mesos:
>
> ```
> xxx Processing DECLINE call for offers xxx
> ```
>
> On Wed, Sep 2, 2015 at 1:31 PM, haosdent <haosd...@gmail.com> wrote:
>
>> >Well, the log you mentioned above is when the resource offer is
>> accepted and mesos-master then allocates the cpu.
>> Hi @Haripriya, as far as I know, the log I showed above is from the
>> allocator allocating resources and making an offer, which then triggers
>> Master::offer to send the offer to frameworks. So the log above is not
>> from an offer being accepted; it is emitted before the offer is sent to
>> the framework, and it contains the details of that offer.
>>
>> For your problem:
>> >In my case, the offer is being rejected
>> Do you mean the offer is rejected by your framework after your framework
>> receives it? Or do you mean your framework never receives offers from
>> Mesos?
>>
>>
>> On Wed, Sep 2, 2015 at 1:51 AM, Haripriya Ayyalasomayajula <
>> aharipriy...@gmail.com> wrote:
>>
>>> Well, the log you mentioned above is when the resource offer is accepted
>>> and mesos-master then allocates the cpu. In my case, the offer is being
>>> rejected. I am trying to debug the reason why the resource offer is
>>> being rejected.
>>>
>>> On Tue, Sep 1, 2015 at 10:00 AM, haosdent <haosd...@gmail.com> wrote:
>>>
>>>> Yes, at the default log level the Mesos code only prints the number of
>>>> offers. If you want more details, you could start with the environment
>>>> variable GLOG_v=2 set. Then you should see a message similar to this:
>>>>
>>>> I0902 00:55:17.465920 143396864 hierarchical.hpp:935] Allocating
>>>> cpus(*):x; mem(*):x; disk(*):x; ports(*):[x-x] on slave
>>>> 20150902-005512-16777343-5050-46447-S0 to framework 20150902-00551
>>>> 2-16777343-5050-46447-
>>>>
>>>> But using GLOG_v=2 produces a lot of logs. If you just want the
>>>> resources allocated to a task or executor, you could get that
>>>> information from the slave's state.json endpoint.
>>>>
>>>> On Wed, Sep 2, 2015 at 12:41 AM, Haripriya Ayyalasomayajula <
>>>> aharipriy...@gmail.com> wrote:
>>>>
>>>>> Thanks, but is there no way to get the details of the resource offer
>>>>> without tweaking the source code of the framework scheduler? I don't
>>>>> see anything in my logs.
>>>>>
>>>>> All I can see is
>>>>>
>>>>> mesos-master: Sending 5 offers to framework 20150815- (marathon)
>>>>> at scheduler-50ajaja@pqr
>>>>>
>>>>> I can't find any other details in the logs..
>>>>>
>>>>> On Mon, Aug 31, 2015 at 8:36 PM, haosdent <haosd...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Haripriya.
>>>>>>
>>>>>> >1. I am trying to see the details of the resource offer made by the
>>>>>> mesos master. I can see in the logs that there are 5 resource offers made
>>>>>> but I am not sure where to get the details of the resource offers - the
>>>>>> cpu, memory etc.
>>>>>>
>>>>>> You could print offer details in your framework
>>>>>> Scheduler#resourceOffers method. These offer messages can also be
>>>>>> found in the Mesos log.
>>>>>>
>>>>>> >2. How can I list the number of slaves registered with the master
>>>>>> and the details of the slaves on the command line (apart from seeing
>>>>>> it in the UI)?
>>>>>>
>>>>>> We have some endpoints (state.json and state-summary) on the master and
>>>>>> slave that expose this information; you could get it with:
>>>>>>
>>>>>> ```
>>>>>> curl -s "http://localhost:5050/master/state-summary" | jq .slaves
>>>>>> ```
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 1, 2015 at 6:47 AM, Haripriya Ayyalasomayajula <
>>>>>> aharipriy...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm having trouble with some basic details:
>>>>>>>
>>>>>>> 1. I am trying to see the details of the resource offer made by the
>>>>>>> mesos master. I can see in the logs that there are 5 resource offers 
>>>>>>> made
>>>>>>> but I am not sure where to get the details of the resource offers - the
>>>>>>> cpu, memory etc.
>>>>>>>
>>>>>>> 2. How can I list the number of slaves registered with the master and
>>>>>>> the details of the slaves on the command line (apart from seeing it in
>>>>>>> the UI)?
>>>>>>>
>>>>>>> Thanks for the help.
>>>>>>>
>>>>>>> --
>>>>>>> Regards,
>>>>>>> Haripriya Ayyalasomayajula
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Haosdent Huang
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Haripriya Ayyalasomayajula
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Haosdent Huang
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Haripriya Ayyalasomayajula
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>


Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread haosdent
If you could show the content of the path in the CHECK_SOME, it would be
easier to debug. According to the log in
https://groups.google.com/forum/#!topic/marathon-framework/oKXhfQUcoMQ and
the 0.22.1 code:

const string& path = paths::getExecutorSentinelPath(
    metaDir, info.id(), framework->id, executor->id,
    executor->containerId);

framework->id ==> 20141209-011108-1378273290-5050-23221-0001
executor->id ==> tools.1d52eed8-062c-11e5-90d3-f2a3161ca8ab

metaDir can be derived from your slave's work_dir, and info.id() is your
slave ID; can you see the executor->containerId in the complete slave log?
And if you can reproduce this problem every time, it would be very helpful
if you added a trace log to the slave and recompiled it.
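
For anyone following along, a sketch of where that path lives on disk,
assuming the standard checkpoint layout under the agent's work_dir (the
directory values below are illustrative; the IDs are the ones quoted above):

```
# Walk the agent's meta directory toward the executor's run directories;
# the sentinel file the CHECK_SOME touches lives under this tree.
WORK_DIR=/tmp/mesos                 # assumption: your slave's --work_dir
SLAVE_ID=<your-slave-id>            # info.id() from the slave log
FRAMEWORK_ID=20141209-011108-1378273290-5050-23221-0001
EXECUTOR_ID=tools.1d52eed8-062c-11e5-90d3-f2a3161ca8ab

ls -R "$WORK_DIR/meta/slaves/$SLAVE_ID/frameworks/$FRAMEWORK_ID/executors/$EXECUTOR_ID/runs"
```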

On Thu, Sep 3, 2015 at 12:49 AM, Tim Chen  wrote:

> Hi Scott,
>
> I wonder if you can try the latest Mesos and see if you can repro this?
>
> And if it is can you put down the example task and steps? I couldn't see
> disk full in your slave log so I'm not sure if it's exactly the same
> problem of MESOS-2684.
>
> Tim
>
> On Wed, Sep 2, 2015 at 5:15 AM, Scott Rankin  wrote:
>
>> Hi Marco,
>>
>> I certainly don’t want to start a flame war, and I actually realized
>> after I added my comment to MESOS-2684 that it’s not quite the same thing.
>>
>> As far as I can tell, in our situation, there’s no underlying disk
>> issue.  It seems like this is some sort of race condition (maybe?) with
>> docker containers and executors shutting down.  I’m perfectly happy with
>> Mesos choosing to shut down in the case of a failure or unexpected
>> situation – that’s a methodology that we adopt ourselves.  I’m just trying
>> to get a little more information about what the underlying issue is so that
>> we can resolve it. I don’t know enough about Mesos internals to be able to
>> answer that question just yet.
>>
>> It’s also inconvenient because, while Mesos is well-behaved and restarts
>> gracefully, as of 0.22.1, it’s not recovering the Docker executors – so a
>> mesos-slave crash also brings down applications.
>>
>> Thanks,
>> Scott
>>
>> From: Marco Massenzio
>> Reply-To: "user@mesos.apache.org"
>> Date: Tuesday, September 1, 2015 at 7:33 PM
>> To: "user@mesos.apache.org"
>> Subject: Re: mesos-slave crashing with CHECK_SOME
>>
>> That's one of those areas for discussions that is so likely to generate a
>> flame war that I'm hesitant to wade in :)
>>
>> In general, I would agree with the sentiment expressed there:
>>
>> > If the task fails, that is unfortunate, but not the end of the world.
>> Other tasks should not be affected.
>>
>> which is, in fact, to a large extent exactly what Mesos does; the example
>> given in MESOS-2684, as it happens, is for a "disk full failure" - carrying
>> on as if nothing had happened, is only likely to lead to further (and
>> worse) disappointment.
>>
>> The general philosophy back at Google (and which certainly informs the
>> design of Borg[0]) was "fail early, fail hard" so that either (a) the
>> service is restarted and hopefully the root cause cleared or (b) someone
>> (who can hopefully do something) will be alerted about it.
>>
>> I think it's ultimately a matter of scale: up to a few tens of servers,
>> you can assume there is some sort of 'log-monitor' that looks out for
>> errors and other anomalies and alerts humans that will then take a look and
>> possibly apply some corrective action - when you're up to hundreds or
>> thousands (definitely Mesos territory) that's not practical: the system
>> should either self-heal or crash-and-restart.
>>
>> All this to say, that it's difficult to come up with a general
>> *automated* approach to unequivocally decide if a failure is "fatal" or
>> could just be safely "ignored" (after appropriate error logging) - in
>> general, when in doubt it's probably safer to "noisily crash & restart" and
>> rely on the overall system's HA architecture to take care of replication
>> and consistency.
>> (and an intelligent monitoring system that only alerts when some failure
>> threshold is exceeded).
>>
>> From what I've seen so far (granted, still a novice here) it seems that
>> Mesos subscribes to this notion, assuming that Agent Nodes will come and
>> go, and usually Tasks survive (for a certain amount of time anyway) a Slave
>> restart (obviously, if the physical h/w is the ultimate cause of failure,
>> well, then all bets are off).
>>
>> Having said all that - if there are areas where we have been over-eager
>> with our CHECKs, we should definitely revisit that and make it more
>> crash-resistant, absolutely.
>>
>> [0] http://research.google.com/pubs/pub43438.html
>>
>> *Marco Massenzio*
>>
>> *Distributed Systems Engineer*
>> *http://codetrips.com*
>>
>> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
>> sschlans...@opentable.com> wrote:
>>
>>>
>>>
>>> On Aug 31, 2015, at 11:54 AM, Scott Rankin  wrote:
>>> >
>> > tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354]
>> > CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory

Re: mesos-master resource offer details

2015-09-02 Thread Vinod Kone
Sounds like you should bump up the logging level of marathon. Did you ask
in the marathon mailing list?

On Wed, Sep 2, 2015 at 10:02 AM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:

> Alex,
>
> The problem I am facing is that there are no allocations made. The
> mesos-master sends 5 offers to Marathon, but Marathon DECLINEs all the
> offers. I am trying to debug why it is rejecting them. I traced through the
> source code and saw that it calls the ResourceMatcher to match the
> resources offered against the resources available; in my case it reports a
> problem with the CPUs offered (insufficient resources). I am trying to get
> the details of the resource offer - the CPUs being offered - and I'm stuck
> there.
>
> I'd really appreciate any suggestions! Thanks.
>
> On Wed, Sep 2, 2015 at 9:54 AM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> To what Haosdent said: you cannot get a list of offers from the master
>> logs, but you can get a list of allocations from the built-in allocator if
>> you bump up the log level (GLOG_v=2).
>>
>> On Wed, Sep 2, 2015 at 7:36 AM, haosdent <haosd...@gmail.com> wrote:
>>
>>> If the offer is rejected by your framework, could you find this log in
>>> mesos:
>>>
>>> ```
>>> xxx Processing DECLINE call for offers xxx
>>> ```
>>>
>>> On Wed, Sep 2, 2015 at 1:31 PM, haosdent <haosd...@gmail.com> wrote:
>>>
>>>> >Well, the log you mentioned above is when the resource offer is
>>>> accepted and mesos-master then allocates the cpu.
>>>> Hi @Haripriya, as far as I know, the log I showed above is from the
>>>> allocator allocating resources and making an offer, which then triggers
>>>> Master::offer to send the offer to frameworks. So the log above is not
>>>> from an offer being accepted; it is emitted before the offer is sent to
>>>> the framework, and it contains the details of that offer.
>>>>
>>>> For your problem:
>>>> >In my case, the offer is being rejected
>>>> Do you mean the offer is rejected by your framework after your framework
>>>> receives it? Or do you mean your framework never receives offers from
>>>> Mesos?
>>>>
>>>>
>>>> On Wed, Sep 2, 2015 at 1:51 AM, Haripriya Ayyalasomayajula <
>>>> aharipriy...@gmail.com> wrote:
>>>>
>>>>> Well, the log you mentioned above is when the resource offer is
>>>>> accepted and mesos-master then allocates the cpu. In my case, the offer is
>>>>> being rejected. I am trying to debug the reason why the resource offer
>>>>> is being rejected.
>>>>>
>>>>> On Tue, Sep 1, 2015 at 10:00 AM, haosdent <haosd...@gmail.com> wrote:
>>>>>
>>>>>> Yes, at the default log level the Mesos code only prints the number of
>>>>>> offers. If you want more details, you could start with the environment
>>>>>> variable GLOG_v=2 set. Then you should see a message similar to this:
>>>>>>
>>>>>> I0902 00:55:17.465920 143396864 hierarchical.hpp:935] Allocating
>>>>>> cpus(*):x; mem(*):x; disk(*):x; ports(*):[x-x] on slave
>>>>>> 20150902-005512-16777343-5050-46447-S0 to framework 20150902-00551
>>>>>> 2-16777343-5050-46447-
>>>>>>
>>>>>> But using GLOG_v=2 produces a lot of logs. If you just want the
>>>>>> resources allocated to a task or executor, you could get that
>>>>>> information from the slave's state.json endpoint.
>>>>>>
>>>>>> On Wed, Sep 2, 2015 at 12:41 AM, Haripriya Ayyalasomayajula <
>>>>>> aharipriy...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks, but is there no way to get the details of the resource offer
>>>>>>> without tweaking the source code of the framework scheduler? I don't
>>>>>>> see anything in my logs.
>>>>>>>
>>>>>>> All I can see is
>>>>>>>
>>>>>>> mesos-master: Sending 5 offers to framework 20150815- (marathon)
>>>>>>> at scheduler-50ajaja@pqr
>>>>>>>
>>>>>>> I can't find any other details in the logs..
>>>>>>>
>>>>>>> On Mon, Aug 31, 2015 at 8:36 PM, haosdent 

Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread Tim Chen
Hi Scott,

I wonder if you can try the latest Mesos and see if you can repro this?

And if it is can you put down the example task and steps? I couldn't see
disk full in your slave log so I'm not sure if it's exactly the same
problem of MESOS-2684.

Tim

On Wed, Sep 2, 2015 at 5:15 AM, Scott Rankin  wrote:

> Hi Marco,
>
> I certainly don’t want to start a flame war, and I actually realized after
> I added my comment to MESOS-2684 that it’s not quite the same thing.
>
> As far as I can tell, in our situation, there’s no underlying disk issue.
> It seems like this is some sort of race condition (maybe?) with docker
> containers and executors shutting down.  I’m perfectly happy with Mesos
> choosing to shut down in the case of a failure or unexpected situation –
> that’s a methodology that we adopt ourselves.  I’m just trying to get a
> little more information about what the underlying issue is so that we can
> resolve it. I don’t know enough about Mesos internals to be able to answer
> that question just yet.
>
> It’s also inconvenient because, while Mesos is well-behaved and restarts
> gracefully, as of 0.22.1, it’s not recovering the Docker executors – so a
> mesos-slave crash also brings down applications.
>
> Thanks,
> Scott
>
> From: Marco Massenzio
> Reply-To: "user@mesos.apache.org"
> Date: Tuesday, September 1, 2015 at 7:33 PM
> To: "user@mesos.apache.org"
> Subject: Re: mesos-slave crashing with CHECK_SOME
>
> That's one of those areas for discussions that is so likely to generate a
> flame war that I'm hesitant to wade in :)
>
> In general, I would agree with the sentiment expressed there:
>
> > If the task fails, that is unfortunate, but not the end of the world.
> Other tasks should not be affected.
>
> which is, in fact, to a large extent exactly what Mesos does; the example
> given in MESOS-2684, as it happens, is for a "disk full failure" - carrying
> on as if nothing had happened, is only likely to lead to further (and
> worse) disappointment.
>
> The general philosophy back at Google (and which certainly informs the
> design of Borg[0]) was "fail early, fail hard" so that either (a) the
> service is restarted and hopefully the root cause cleared or (b) someone
> (who can hopefully do something) will be alerted about it.
>
> I think it's ultimately a matter of scale: up to a few tens of servers,
> you can assume there is some sort of 'log-monitor' that looks out for
> errors and other anomalies and alerts humans that will then take a look and
> possibly apply some corrective action - when you're up to hundreds or
> thousands (definitely Mesos territory) that's not practical: the system
> should either self-heal or crash-and-restart.
>
> All this to say, that it's difficult to come up with a general *automated*
> approach to unequivocally decide if a failure is "fatal" or could just be
> safely "ignored" (after appropriate error logging) - in general, when in
> doubt it's probably safer to "noisily crash & restart" and rely on the
> overall system's HA architecture to take care of replication and
> consistency.
> (and an intelligent monitoring system that only alerts when some failure
> threshold is exceeded).
>
> From what I've seen so far (granted, still a novice here) it seems that
> Mesos subscribes to this notion, assuming that Agent Nodes will come and
> go, and usually Tasks survive (for a certain amount of time anyway) a Slave
> restart (obviously, if the physical h/w is the ultimate cause of failure,
> well, then all bets are off).
>
> Having said all that - if there are areas where we have been over-eager
> with our CHECKs, we should definitely revisit that and make it more
> crash-resistant, absolutely.
>
> [0] http://research.google.com/pubs/pub43438.html
>
> *Marco Massenzio*
>
> *Distributed Systems Engineer*
> *http://codetrips.com*
>
> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
> sschlans...@opentable.com> wrote:
>
>>
>>
>> On Aug 31, 2015, at 11:54 AM, Scott Rankin  wrote:
>> >
>> > tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354]
>> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
>>
>> I reported a similar bug a while back:
>>
>> https://issues.apache.org/jira/browse/MESOS-2684
>>
>> This seems to be a class of bugs where some filesystem operations which
>> may fail for unforeseen reasons are written as assertions which crash the
>> process, rather than failing only the task and communicating back the error
>> reason.
>>
>>
>>

Re: Prepping for next release

2015-09-02 Thread Kevin Sweeney
I'd be in favor of setting that flag to Java 7 as well - just because
classes are compiled in Java 6 format doesn't mean the standard library
classes they reference will be available on Java 6 - your compiler
classpath contains Java 7's rt.jar, which contains classes that don't exist
in Java 6's rt.jar.
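
For what it's worth, a quick way to check what a staged jar was actually
compiled for (a sketch; class-file version 50.0 is Java 6 bytecode, 51.0 is
Java 7):

```
# Pull one class out of the jar and inspect its class-file version.
unzip -p mesos-0.24.0.jar org/apache/mesos/Executor.class > Executor.class
file Executor.class   # e.g. "compiled Java class data, version 50.0 (Java 1.6)"

# Alternatively, via javap:
javap -verbose -classpath mesos-0.24.0.jar org.apache.mesos.Executor \
  | grep "major version"
```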

On Tue, Sep 1, 2015 at 5:08 PM, Vinod Kone  wrote:

> Actually looking at the RC1 jar more closely, it looks like the classes
> are built for 1.6 (our pom file actually sets this via the maven compiler
> plugin).
>
> $ file ~/Downloads/Executor.class
>
> /Users/vinod/Downloads/Executor.class: compiled Java class data, version
> 50.0 (Java 1.6)
>
> The confusing part (for me) is that the jar's manifest says "Build-Jdk:
> 1.7.0_60" but AFAICT that just means JDK7 was used to build the JAR. It
> has nothing to do with the version of the generated byte code.
>
> So, I think we are OK here.
>
>
> On Tue, Sep 1, 2015 at 5:03 PM, Kevin Sweeney  wrote:
>
>> I'm generally in favor of dropping support for JDK6 as it's been
>> end-of-life for years.
>>
>> On Tue, Sep 1, 2015 at 4:46 PM, Vinod Kone  wrote:
>>
>>> +user
>>>
>>> So it looks like this issue is related to JDK6 and not my maven password
>>> settings.
>>>
>>> Related ASF ticket: https://issues.apache.org/jira/browse/BUILDS-85
>>>
>>> The reason it worked for me, when I tagged RC1, was because I also
>>> pointed my maven to use JDK7.
>>>
>>> So we have couple options here:
>>>
>>> #1) (Easy) Do the same thing with RC2 as we did for RC1. This does mean
>>> the artifacts we upload to nexus will be compiled with JDK7. IIUC, if any
>>> JVM-based frameworks are still on JDK6 they can't link in the new
>>> artifacts?
>>>
>>> #2) (Harder) As mentioned in the ticket, have maven compile the Mesos jar
>>> with JDK6 but use JDK7 when uploading. Not sure how easy it is to adapt
>>> our Mesos build tool chain for this. Does anyone have expertise in this
>>> area?
>>>
>>> Thoughts?
>>>
>>>
>>> On Tue, Aug 18, 2015 at 3:14 PM, Vinod Kone 
>>> wrote:
>>>
 I re-encrypted the maven passwords and that seemed to have done the
 trick. Thanks Adam!

 On Tue, Aug 18, 2015 at 1:59 PM, Adam Bordelon 
 wrote:

> Update your ~/.m2/settings.xml?
> Also check that the output of `gpg --list-keys` and `--list-sigs`
> matches
> the keypair you expect
>
> On Tue, Aug 18, 2015 at 1:48 PM, Vinod Kone 
> wrote:
>
> > I definitely had to create a new gpg key because my previous one
> expired! I
> > uploaded them id.apache and our SVN repo containing KEYS.
> >
> > Do I need to do anything specific for maven?
> >
> > On Tue, Aug 18, 2015 at 1:25 PM, Adam Bordelon 
> wrote:
> >
> > > Haven't seen that one. Are you sure you've got your gpg key
> properly set
> > up
> > > with Maven?
> > >
> > > On Tue, Aug 18, 2015 at 1:13 PM, Vinod Kone 
> > wrote:
> > >
> > > > I'm getting the following error when running ./support/tag.sh.
> Has any
> > of
> > > > the recent release managers seen this one before?
> > > >
> > > > [ERROR] Failed to execute goal
> > > > org.apache.maven.plugins:maven-deploy-plugin:2.7:deploy
> > (default-deploy)
> > > on
> > > > project mesos: Failed to deploy artifacts: Could not transfer
> artifact
> > > > org.apache.mesos:mesos:jar:0.24.0-rc1 from/to
> apache.releases.https (
> > > >
> https://repository.apache.org/service/local/staging/deploy/maven2):
> > > > java.lang.RuntimeException: Could not generate DH keypair: Prime
> size
> > > must
> > > > be multiple of 64, and can only range from 512 to 1024
> (inclusive) ->
> > > [Help
> > > > 1]
> > > >
> > > > On Mon, Aug 17, 2015 at 11:23 AM, Vinod Kone <
> vinodk...@apache.org>
> > > wrote:
> > > >
> > > > > Update:
> > > > >
> > > > > There are 3 outstanding tickets (all related to flaky tests),
> that we
> > > are
> > > > > trying to resolve. Any help fixing those (esp. MESOS-3050
> > > > > ) would be
> > > > appreciated!
> > > > >
> > > > > Planning to cut an RC as soon as they are fixed (assuming no
> new ones
> > > > crop
> > > > > up).
> > > > >
> > > > > Thanks,
> > > > >
> > > > > On Fri, Aug 14, 2015 at 7:50 AM, James DeFelice <
> > > > james.defel...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Awesome - thanks so much!
> > > > >>
> > > > >> On Fri, Aug 14, 2015 at 9:37 AM, Bernd Mathiske <
> > be...@mesosphere.io>
> > > > >> wrote:
> > > > >>
> > > > >> > I just committed it. Thanks, James!
> > > > >> >
> > > > >> > 

How does mesos determine how much memory on a node is available for offer?

2015-09-02 Thread F21
I have 3 CoreOS nodes running in vagrant. Mesos is run natively (not in 
docker containers).


There is 1 master/slave and 2 slaves.

If I ssh into one of my slaves and run free -m, I see:

Total:   2005
Used:    1342
Free:     662
Shared:   273
Buffers:   13
Cached:  1210

In the Mesos web UI, I see that the slave has 1002 MB of memory to offer.

How is this 1002 MB determined (I am running the masters and slaves with 
stock defaults and no customizations)?


Is the 1002MB included in the used memory (1342)? If so, why is 662MB
free? That seems wasteful, and I would expect it to be able to offer
another 500MB, making the total 1502MB.


Re: How does mesos determine how much memory on a node is available for offer?

2015-09-02 Thread Anand Mazumdar
In case you don’t specify the resources via the "--resources" flag when you
start your agent, it picks up the default values. (Example:
--resources="cpus:4;mem:1024;disk:2")

The default value for memory is here: 
https://github.com/apache/mesos/blob/master/src/slave/constants.cpp#L46 


-anand

> On Sep 2, 2015, at 6:12 PM, F21  wrote:
> 
> I have 3 CoreOS nodes running in vagrant. Mesos is run natively (not in 
> docker containers).
> 
> There is 1 master/slave and 2 slaves.
> 
> If I ssh into one of my slaves and run free -m, I see:
> 
> Total:   2005
> Used:    1342
> Free:     662
> Shared:   273
> Buffers:   13
> Cached:  1210
> 
> In the Mesos web UI, I see that the slave has 1002 MB of memory to offer.
> 
> How is this 1002 MB determined (I am running the masters and slaves with 
> stock defaults and no customizations)?
> 
> Is the 1002MB included in the used memory (1342)? If so, why is 662MB
> free? That seems wasteful, and I would expect it to be able to offer
> another 500MB, making the total 1502MB.



Re: How does mesos determine how much memory on a node is available for offer?

2015-09-02 Thread F21
There seems to be some dynamic behavior here. I just bumped the memory for
each VM up to 2.5GB, and now Mesos is offering 1.5GB on its slave. Is there
some percentage value I can set so that more memory is available to Mesos?
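
If I am reading the detection code that Anand linked correctly, the agent
leaves 1 GB of headroom when the machine has more than 2 GB of RAM and
offers half the memory otherwise, which would match both observations
(2005 / 2 is about 1002 MB; 2560 - 1024 is about 1.5 GB). I am not aware of
a percentage knob, but you can pin the value explicitly - a sketch, with
illustrative values:

```
# Override auto-detection and advertise the agent's memory explicitly
# (master address and amounts are illustrative).
mesos-slave --master=127.0.0.1:5050 \
  --resources="cpus:2;mem:2005;disk:10240"
```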


On 3/09/2015 11:23 AM, Anand Mazumdar wrote:
In case you don’t specify the resources via the "--resources" flag when you
start your agent, it picks up the default values. (Example:
--resources="cpus:4;mem:1024;disk:2")


The default value for memory is here: 
https://github.com/apache/mesos/blob/master/src/slave/constants.cpp#L46


-anand



API client libraries

2015-09-02 Thread Vinod Kone
Hi folks,

Now that the v1 scheduler HTTP API (beta) is on the verge of being
released, I wanted to open up the discussion about client libraries for the
API. Mainly around support and home for the libs.
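
(For context, a raw call against the new API looks roughly like this - a
sketch, with the framework fields made up for illustration:)

```
# SUBSCRIBE against the v1 scheduler HTTP API; the response is a streaming
# (RecordIO) connection of scheduler events.
curl -s -X POST http://localhost:5050/api/v1/scheduler \
  -H 'Content-Type: application/json' \
  -d '{"type": "SUBSCRIBE",
       "subscribe": {
         "framework_info": {"user": "test", "name": "example-framework"}}}'
```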

One idea is that, going forward, the only supported client library would be
C++ library which will live in the mesos repo. All other client libraries
(java, python, go etc) will not be officially supported but linked to from
our webpage/docs.

Pros:
--> The PMC/committers won't have the burden to maintain client libraries
in languages we don't have expertise in.
--> Gives more control (reviews, releases) and attribution (could live in
the author's org's or personal repo) to 3rd party client library authors

Cons:
--> Might be a step backward because we would be officially dropping
support for Java and Python. This is probably a good thing?
--> No quality control of the libraries by the PMC. Need co-ordination with
library authors to incorporate API changes. Could lead to bad user
experience.

I've taken a quick look at what other major projects do and it looks like
most of them officially support a few api libs and then link to 3rdparty
libs.

Docker: No official library? Links to 3rd party libs.

GitHub: Official support for Ruby, .Net, Obj-C. Links to 3rd party libs.

Google: All official libraries? No links to 3rd party libs?

K8S: Official support for Go. Links to 3rd party libs.

Twitter: Official support for Java. Links to 3rd party libs.


Is this the way we want to go? This does mean we won't need a mesos/commons
repo, because the project would not be officially supporting 3rd party
libs. The supported C++ libs will live in the mesos repo.

Thoughts?


Re: API client libraries

2015-09-02 Thread Artem Harutyunyan
Thanks for bringing this up, Vinod!

We have to make sure that there are reference library implementations for
at least Python, Java, and Go. They may end up being owned and maintained
by the community, but I feel that Mesos developers should at least
kickstart the process and incubate those libraries. Once the initial
implementations of those libraries are in place we should also make sure to
have reference usage examples for them (like we do right now with Rendler).

In any case, this is a very important topic so I will go ahead and add it
to tomorrow's community sync agenda.

Cheers,
Artem.

On Wed, Sep 2, 2015 at 11:49 AM, Vinod Kone  wrote:

> Hi folks,
>
> Now that the v1 scheduler HTTP API (beta) is on the verge of being
> released, I wanted to open up the discussion about client libraries for the
> API. Mainly around support and home for the libs.
>
> One idea is that, going forward, the only supported client library would be
> C++ library which will live in the mesos repo. All other client libraries
> (java, python, go etc) will not be officially supported but linked to from
> our webpage/docs.
>
> Pros:
> --> The PMC/committers won't have the burden to maintain client libraries
> in languages we don't have expertise in.
> --> Gives more control (reviews, releases) and attribution (could live in
> the author's org's or personal repo) to 3rd party client library authors
>
> Cons:
> --> Might be a step backward because we would be officially dropping
> support for Java and Python. This is probably a good thing?
> --> No quality control of the libraries by the PMC. Need co-ordination with
> library authors to incorporate API changes. Could lead to bad user
> experience.
>
> I've taken a quick look at what other major projects do and it looks like
> most of them officially support a few api libs and then link to 3rdparty
> libs.
>
> Docker
> <https://docs.docker.com/reference/api/remote_api_client_libraries/#docker-remote-api-client-libraries>:
> No official library? Links to 3rd party libs.
>
> GitHub : Official support for
> Ruby, .Net, Obj-C. Links to 3rd party libs.
>
> Google : All
> official libraries? No links to 3rd party libs?
>
> K8S : Official
> support for Go. Links to 3rd party libs.
>
> Twitter : Official
> support for Java. Links to 3rd party libs.
>
>
> Is this the way we want to go? This does mean we won't need a mesos/commons
> repo, because the project would not be officially supporting 3rd party
> libs. The supported C++ libs will live in the mesos repo.
>
> Thoughts?
>


Re: Apache Mesos Community Sync

2015-09-02 Thread Vinod Kone
We'll have the next community sync tomorrow (Sept 3rd) at 3 PM PST.

Please add items to the agenda.


On Wed, Aug 5, 2015 at 4:12 PM, Vinod Kone  wrote:

> We'll have the next community sync tomorrow at 3 PM PST.
>
> Please add items to the agenda.
>
> Thanks,
>
> On Thu, Jul 2, 2015 at 11:24 AM, Joris Van Remoortere  > wrote:
>
>> Reminder: The Mesos Community Developer Sync will be happening today at
>> 3pm Pacific.
>>
>> To participate remotely, join the Google hangout:
>> https://plus.google.com/hangouts/_/twitter.com/mesos-sync
>>
>> On Thu, Jun 18, 2015 at 7:22 AM, Adam Bordelon 
>> wrote:
>>
>>> Reminder: We're hosting a developer community sync at Mesosphere HQ this
>>> morning from 9-11am Pacific.
>>>
>>> The agenda is pretty bare, so please add more topics you would like to
>>> discuss:
>>>
>>> https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit
>>>
>>> If you want to join in person, just show up to 88 Stevenson St, ring the
>>> buzzer, take the elevator up to 2nd floor, and then you can take the stairs
>>> up to the 3rd floor dining room, or ask somebody to let you up the elevator
>>> to the 3rd floor.
>>>
>>> To participate remotely, join the Google hangout:
>>> https://plus.google.com/hangouts/_/mesosphere.io/mesos-developer
>>>
>>> On Mon, Jun 15, 2015 at 10:46 AM, Adam Bordelon 
>>> wrote:
>>>
 As previously mentioned, we would like to host additional Mesos
 developer syncs at our new Mesosphere HQ at 88 Stevenson St (tucked behind
 Market & 2nd), starting this Thursday from 9-11am Pacific. We opted for an
 earlier slot so that the European developer community can participate.

 Now that we are having these more frequently, it would be great to dive
 deeper into designs for upcoming features as well as discuss longstanding
 issues. While high-level status updates are useful, they should be a small
 part of these meetings so that we can address issues currently facing our
 developers.

 Please add agenda items to the same doc we've been using for previous
 meetings' Agenda/Notes:

 https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit

 Join in person if you can, or join remotely via hangout:
 https://plus.google.com/hangouts/_/mesosphere.io/mesos-developer

 Thanks,
 -Adam-


 On Thu, May 28, 2015 at 10:08 AM, Vinod Kone 
 wrote:

> Cool.
>
> Here's the agenda doc
> <https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit#>
> for next week that folks can fill in.
>
> On Thu, May 28, 2015 at 9:52 AM, Adam Bordelon 
> wrote:
>
> > Looks like next week, Thursday June 4th on my calendar.
> > I thought it was always the first Thursday of the month.
> >
> > On Thu, May 28, 2015 at 9:33 AM, Vinod Kone 
> wrote:
> >
> > > Do we have community sync today or next week? I'm a bit confused.
> > >
> > > @vinodkone
> > >
> > > > On Apr 1, 2015, at 3:18 AM, Adam Bordelon 
> wrote:
> > > >
> > > > Reminder: We're having another Mesos Developer Community Sync
> this
> > > > Thursday, April 2nd from 3-5pm Pacific.
> > > >
> > > > Agenda:
> > > >
> > >
> >
> https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing
> > > > To Join: follow the BlueJeans instructions from the recurring
> meeting
> > > > invite at the start of this thread.
> > > >
> > > >> On Fri, Mar 6, 2015 at 11:11 AM, Vinod Kone <
> vinodk...@apache.org>
> > > wrote:
> > > >>
> > > >> Hi folks,
> > > >>
> > > >> We are planning to do monthly Mesos community meetings.
> Tentatively
> > > these
> > > >> are scheduled to occur on 1st Thursday of every month at 3 PM
> PST. See
> > > >> below for details to join the meeting remotely.
> > > >>
> > > >> This is a forum to ask questions/discuss about upcoming
> features,
> > > process
> > > >> etc. Everyone is welcome to join. Feel free to add items to the
> agenda
> > > for
> > > >> the next meeting here
> > > >> <
> > > >>
> > >
> >
> https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing
> > > >> .
> > > >>
> > > >> Cheers,
> > > >>
> > > >> On Thu, Mar 5, 2015 at 11:23 AM, Vinod Kone via Blue Jeans
> Network <
> > > >> inv...@bluejeans.com> wrote:
> > > >>
> > > 

Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread Marco Massenzio
@Steven - agreed!
As mentioned, if we can reduce the "footprint of unnecessary CHECKs" (so to
speak) I'm all for it - let's document and add Jiras for that, by all means.

@Scott - LoL: you certainly didn't; I was more worried my email would ;-)

Thanks, guys!

*Marco Massenzio*

*Distributed Systems Engineer*
*http://codetrips.com*

On Wed, Sep 2, 2015 at 10:59 AM, Steven Schlansker <
sschlans...@opentable.com> wrote:

> I 100% agree with your philosophy here, and I suspect it's something
> shared in the Mesos community.
>
> I just think that we can restrict the domain of the failure to a smaller
> reasonable window -- once you are in the context of "I am doing work to
> launch a specific task", there is a well defined "success / failure / here
> is an error message" path defined already.  Users expect tasks to fail and
> can see the errors.
>
> I think that a lot of these assertions are in fact more appropriate as
> task failures.  But I agree that they should be fatal to *some* part of the
> system, just not the whole agent entirely.
>
> On Sep 1, 2015, at 4:33 PM, Marco Massenzio  wrote:
>
> > That's one of those areas for discussions that is so likely to generate
> a flame war that I'm hesitant to wade in :)
> >
> > In general, I would agree with the sentiment expressed there:
> >
> > > If the task fails, that is unfortunate, but not the end of the world.
> Other tasks should not be affected.
> >
> > which is, in fact, to a large extent exactly what Mesos does; the example
> given in MESOS-2684, as it happens, is for a "disk full failure" - carrying
> on as if nothing had happened, is only likely to lead to further (and
> worse) disappointment.
> >
> > The general philosophy back at Google (and which certainly informs the
> design of Borg[0]) was "fail early, fail hard" so that either (a) the
> service is restarted and hopefully the root cause cleared or (b) someone
> (who can hopefully do something) will be alerted about it.
> >
> > I think it's ultimately a matter of scale: up to a few tens of servers,
> you can assume there is some sort of 'log-monitor' that looks out for
> errors and other anomalies and alerts humans that will then take a look and
> possibly apply some corrective action - when you're up to hundreds or
> thousands (definitely Mesos territory) that's not practical: the system
> should either self-heal or crash-and-restart.
> >
> > All this to say, that it's difficult to come up with a general
> *automated* approach to unequivocally decide if a failure is "fatal" or
> could just be safely "ignored" (after appropriate error logging) - in
> general, when in doubt it's probably safer to "noisily crash & restart" and
> rely on the overall system's HA architecture to take care of replication
> and consistency.
> > (and an intelligent monitoring system that only alerts when some failure
> threshold is exceeded).
> >
> > From what I've seen so far (granted, still a novice here) it seems that
> Mesos subscribes to this notion, assuming that Agent Nodes will come and
> go, and usually Tasks survive (for a certain amount of time anyway) a Slave
> restart (obviously, if the physical h/w is the ultimate cause of failure,
> well, then all bets are off).
> >
> > Having said all that - if there are areas where we have been over-eager
> with our CHECKs, we should definitely revisit that and make it more
> crash-resistant, absolutely.
> >
> > [0] http://research.google.com/pubs/pub43438.html
> >
> > Marco Massenzio
> > Distributed Systems Engineer
> > http://codetrips.com
> >
> > On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
> sschlans...@opentable.com> wrote:
> >
> >
> > On Aug 31, 2015, at 11:54 AM, Scott Rankin  wrote:
> > >
> > > tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354]
> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
> >
> > I reported a similar bug a while back:
> >
> > https://issues.apache.org/jira/browse/MESOS-2684
> >
> > This seems to be a class of bugs where some filesystem operations which
> may fail for unforeseen reasons are written as assertions which crash the
> process, rather than failing only the task and communicating back the error
> reason.
> >
> >
> >
>
>


Help us review #MesosCon Europe 2015 proposals

2015-09-02 Thread Dave Lester
A total of 65 proposals were submitted for #MesosCon Europe. Similar to
previous MesosCon events, the program committee is opening these
proposals up for community review/feedback to better inform our
decisions about what should be included in the program.

In order to make it easier to review a subset of the proposals, we’ve
segmented them based upon three loose themes: Frameworks, Users / Ops,
and Mesos Internals and Extensions. We encourage you to review proposals
based upon one theme, or all three!

Frameworks (14 Proposals): bit.ly/MesosConEU2015Frameworks Talks on how
frameworks can be used, developed, and integrate with Mesos.

Users / Ops (29 Proposals): bit.ly/MesosConEU2015UsersOps A combination
of talks that are use cases (how company x uses Mesos), and operations-
focused (how we deploy x, use Docker, etc).

Mesos Internals and Extensions (22 Proposals):
bit.ly/MesosCon2015EUInternalsExt Features of the Mesos core, or
software integrations with the internals of Mesos. Some proposals have
overlap with frameworks and ops, but most are focused on the
foundational aspects of how Mesos works.

The forms above also include an opportunity to indicate which sessions
you didn't see proposed but would like to attend.

Thanks in advance for your participation! The forms will close on EOD
this upcoming Monday, September 7th.

Dave


Re: API client libraries

2015-09-02 Thread Vinod Kone
On Wed, Sep 2, 2015 at 11:49 AM, Vinod Kone  wrote:

> --> Might be a step backward because we would be officially dropping
> support for Java and Python. This is probably a good thing?
>

s/officially dropping support/dropping official support/


Re: API client libraries

2015-09-02 Thread CCAAT
@ Vinod:: An excellent idea as the code bases mature. It will force clear
delineation of functionality and allow those "other language" experts to
define their code for Mesos more clearly.


@ Artem:: Another excellent point. The mesos "core team" will have to 
still work with the other language/module teams to define things and
debug some codes that use core interfaces, API and common 
inter-operative constructs.



Furthermore this sort of code maturity will set the stage for other 
languages to bring enhanced functionality to Mesos.



Last, separating the C/C++ will facilitate efforts to run Mesos as close
as possible to 'bare metal' on a variety of processors, GPUs, and memory
types (RDMA), which are all available now with GCC 5.x. This effort will
most likely result in a tremendous performance boost for Mesos and all the
companion codes.


A smashingly outstanding idea


James



On 09/02/2015 02:01 PM, Artem Harutyunyan wrote:

Thanks for bringing this up, Vinod!

We have to make sure that there are reference library implementations
for at least Python, Java, and Go. They may end up being owned and
maintained by the community, but I feel that Mesos developers should at
least kickstart the process and incubate those libraries. Once the
initial implementations of those libraries are in place we should also
make sure to have reference usage examples for them (like we do right
now with Rendler).

In any case, this is a very important topic so I will go ahead and add
it to tomorrow's community sync agenda.

Cheers,
Artem.

On Wed, Sep 2, 2015 at 11:49 AM, Vinod Kone > wrote:

Hi folks,

Now that the v1 scheduler HTTP API (beta) is on the verge of being
released, I wanted to open up the discussion about client libraries
for the
API. Mainly around support and home for the libs.

One idea is that, going forward, the only supported client library
would be
C++ library which will live in the mesos repo. All other client
libraries
(java, python, go etc) will not be officially supported but linked
to from
our webpage/docs.

Pros:
--> The PMC/committers won't have the burden to maintain client
libraries
in languages we don't have expertise in.
--> Gives more control (reviews, releases) and attribution (could
live in
the author's org's or personal repo) to 3rd party client library authors

Cons:
--> Might be a step backward because we would be officially dropping
support for Java and Python. This is probably a good thing?
--> No quality control of the libraries by the PMC. Need
co-ordination with
library authors to incorporate API changes. Could lead to bad user
experience.

I've taken a quick look at what other major projects do and it looks
like
most of them officially support a few api libs and then link to 3rdparty
libs.

Docker: No official library? Links to 3rd party libs.

GitHub: Official support for Ruby, .Net, Obj-C. Links to 3rd party libs.

Google: All official libraries? No links to 3rd party libs?

K8S: Official support for Go. Links to 3rd party libs.

Twitter: Official support for Java. Links to 3rd party libs.


Is this the way we want to go? This does mean we won't need a mesos/commons
repo, because the project would not be officially supporting 3rd party
libs. The supported C++ libs will live in the mesos repo.

Thoughts?






Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread Scott Rankin
Hi Marco,

I certainly don’t want to start a flame war, and I actually realized after I 
added my comment to MESOS-2684 that it’s not quite the same thing.

As far as I can tell, in our situation, there’s no underlying disk issue.  It 
seems like this is some sort of race condition (maybe?) with docker containers 
and executors shutting down.  I’m perfectly happy with Mesos choosing to shut 
down in the case of a failure or unexpected situation – that’s a methodology 
that we adopt ourselves.  I’m just trying to get a little more information 
about what the underlying issue is so that we can resolve it. I don’t know 
enough about Mesos internals to be able to answer that question just yet.

It’s also inconvenient because, while Mesos is well-behaved and restarts 
gracefully, as of 0.22.1, it’s not recovering the Docker executors – so a 
mesos-slave crash also brings down applications.

Thanks,
Scott

From: Marco Massenzio
Reply-To: "user@mesos.apache.org"
Date: Tuesday, September 1, 2015 at 7:33 PM
To: "user@mesos.apache.org"
Subject: Re: mesos-slave crashing with CHECK_SOME

That's one of those areas for discussions that is so likely to generate a flame 
war that I'm hesitant to wade in :)

In general, I would agree with the sentiment expressed there:

> If the task fails, that is unfortunate, but not the end of the world. Other 
> tasks should not be affected.

which is, in fact, to a large extent exactly what Mesos does; the example given 
in MESOS-2684, as it happens, is for a "disk full failure" - carrying on as if 
nothing had happened, is only likely to lead to further (and worse) 
disappointment.

The general philosophy back at Google (and which certainly informs the design 
of Borg[0]) was "fail early, fail hard" so that either (a) the service is 
restarted and hopefully the root cause cleared or (b) someone (who can 
hopefully do something) will be alerted about it.

I think it's ultimately a matter of scale: up to a few tens of servers, you can 
assume there is some sort of 'log-monitor' that looks out for errors and other 
anomalies and alerts humans that will then take a look and possibly apply some 
corrective action - when you're up to hundreds or thousands (definitely Mesos 
territory) that's not practical: the system should either self-heal or 
crash-and-restart.

All this to say, that it's difficult to come up with a general *automated* 
approach to unequivocally decide if a failure is "fatal" or could just be 
safely "ignored" (after appropriate error logging) - in general, when in doubt 
it's probably safer to "noisily crash & restart" and rely on the overall 
system's HA architecture to take care of replication and consistency.
(and an intelligent monitoring system that only alerts when some failure 
threshold is exceeded).

From what I've seen so far (granted, still a novice here) it seems that Mesos 
subscribes to this notion, assuming that Agent Nodes will come and go, and 
usually Tasks survive (for a certain amount of time anyway) a Slave restart 
(obviously, if the physical h/w is the ultimate cause of failure, well, then 
all bets are off).

Having said all that - if there are areas where we have been over-eager with 
our CHECKs, we should definitely revisit that and make it more crash-resistant, 
absolutely.

[0] http://research.google.com/pubs/pub43438.html

Marco Massenzio
Distributed Systems Engineer
http://codetrips.com

On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker 
> wrote:


On Aug 31, 2015, at 11:54 AM, Scott Rankin 
> wrote:
>
> tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354] 
> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory

I reported a similar bug a while back:

https://issues.apache.org/jira/browse/MESOS-2684

This seems to be a class of bugs where some filesystem operations which may 
fail for unforeseen reasons are written as assertions which crash the process, 
rather than failing only the task and communicating back the error reason.



