Re: Welcome Andrei Sekretenko as a new committer and PMC member!

2020-01-21 Thread Klaus Ma
Congratulations!

On Wed, Jan 22, 2020 at 5:42 AM Benjamin Mahler  wrote:

> Please join me in welcoming Andrei Sekretenko as the newest committer and
> PMC member!
>
> Andrei has been active in the project for almost a year at this point and
> has been a productive and collaborative member of the community.
>
> He has helped out a lot with allocator work, both with code and
> investigations of issues. He made improvements to multi-role framework
> scalability (which includes the addition of the UPDATE_FRAMEWORK call), and
> exposed metrics for per-role quota consumption.
>
> He has also investigated, identified, and followed up on important bugs.
> One such example is the message re-ordering issue he is currently working
> on: https://issues.apache.org/jira/browse/MESOS-10023
>
> Thanks for all your work so far Andrei, I'm looking forward to more of your
> contributions in the project.
>
> Ben
>


Re: Welcome Benno Evers as committer and PMC member!

2019-02-03 Thread Klaus Ma
Congratulations!

-- Klaus


On Thu, Jan 31, 2019 at 7:31 PM Andrei Budnik  wrote:

> Congratulations!
>
> On Thu, Jan 31, 2019 at 2:41 AM Benjamin Mahler 
> wrote:
>
>> Welcome Benno! Thanks for all the great contributions
>>
>> On Wed, Jan 30, 2019 at 6:21 PM Alex R  wrote:
>>
>> > Folks,
>> >
>> > Please welcome Benno Evers as an Apache committer and PMC member of the
>> > Apache Mesos!
>> >
>> > Benno has been active in the project for more than a year now and has
>> made
>> > significant contributions, including:
>> >   * Agent reconfiguration, MESOS-1739
>> >   * Memory profiling, MESOS-7944
>> >   * "/state" performance improvements, MESOS-8345
>> >
>> > I have been working closely with Benno, paired up on, and shepherded
>> some
>> > of his work. Benno has very strong technical knowledge in several areas
>> and
>> > he is willing to share it with others and help his peers.
>> >
>> > Benno, thanks for all your contributions so far and looking forward to
>> > continuing to work with you on the project!
>> >
>> > Alex.
>> >
>>
>


Re: Welcome Zhitao Li as Mesos Committer and PMC Member

2018-03-13 Thread Klaus Ma
Congrats !


Da (Klaus), Ma (马达) | PMP® | Kubernetes Maintainer, Architect
IBM Cloud Private, IBM Spectrum Computing, IBM System
+86-10-8245 4084 | mad...@cn.ibm.com | @k82cn <http://github.com/k82cn>

On Tue, Mar 13, 2018 at 7:44 AM, Yan Xu <y...@jxu.me> wrote:

> Congrats!
>
> On Mon, Mar 12, 2018 at 4:40 PM Qian Zhang <zhq527...@gmail.com> wrote:
>
>> Congrats Zhitao!
>>
>>
>> Regards,
>> Qian Zhang
>>
>> On Tue, Mar 13, 2018 at 6:30 AM, Jason Lai <ja...@jasonlai.net> wrote:
>>
>>> Huge congrats, Zhitao!
>>>
>>> It is super awesome to have you represent Uber for our Mesos open source
>>> efforts! Well deserved!
>>>
>>> Jason
>>>
>>> On Mon, Mar 12, 2018 at 3:28 PM Chun-Hung Hsiao <chhs...@mesosphere.io>
>>> wrote:
>>>
>>> > Congrats Zhitao!
>>> >
>>> > On Mon, Mar 12, 2018 at 2:51 PM, Benjamin Mahler <bmah...@apache.org>
>>> > wrote:
>>> >
>>> > > Welcome Zhitao! Thanks for your contributions so far
>>> > >
>>> > > On Mon, Mar 12, 2018 at 2:02 PM, Gilbert Song <gilb...@apache.org>
>>> > wrote:
>>> > >
>>> > > > Hi,
>>> > > >
>>> > > > I am excited to announce that the PMC has voted Zhitao Li as a new
>>> > > > committer and member of PMC for the Apache Mesos project. Please
>>> join
>>> > me
>>> > > to
>>> > > > congratulate Zhitao!
>>> > > >
>>> > > > Zhitao has been an active contributor to Mesos for one and a half
>>> > years.
>>> > > > His main contributions include:
>>> > > >
>>> > > >- Designed and implemented Container Image Garbage Collection (
>>> > > >MESOS-4945 <https://issues.apache.org/jira/browse/MESOS-4945>);
>>> > > >- Designed and implemented part of the HTTP Operator API
>>> (MESOS-6007
>>> > > ><https://issues.apache.org/jira/browse/MESOS-6007>);
>>> > > >- Reported and fixed a lot of bugs
>>> > > ><https://issues.apache.org/jira/issues/?jql=type%20%3D%
>>> > > 20Bug%20AND%20(assignee%20%3D%20zhitao%20OR%20reporter%20%
>>> > > 3D%20zhitao%20)%20ORDER%20BY%20priority%20>
>>> > > >.
>>> > > >
>>> > > > Zhitao spares no effort to improve the project quality and to
>>> propose
>>> > > > ideas. Thank you Zhitao for all contributions!
>>> > > >
>>> > > > Here is his committer candidate checklist for your perusal:
>>> > > > https://docs.google.com/document/d/1HGz7iBdo1Q9z9c8fNRgNNLnj0XQ_
>>> > > > PhDhjXLAfOx139s/
>>> > > >
>>> > > > Congrats Zhitao!
>>> > > >
>>> > > > Cheers,
>>> > > > Gilbert
>>> > > >
>>> > >
>>> >
>>>
>>
>> --
> Sent from mobile
>


Re: Welcome Andrew Schwartzmeyer as a new committer and PMC member!

2017-11-28 Thread Klaus Ma
Congratulations

On Tue, Nov 28, 2017 at 3:08 PM Andrew Schwartzmeyer <
and...@schwartzmeyer.com> wrote:

> Thank you everyone for the welcome!
>
> It's been great working with you this past year, and I'm glad to
> continue making this great project even better.
>
> Thanks again,
>
> Andy
>
> On 11/27/2017 3:00 pm, Joseph Wu wrote:
> > Hi devs & users,
> >
> > I'm happy to announce that Andrew Schwartzmeyer has become a new
> > committer
> > and member of the PMC for the Apache Mesos project.  Please join me in
> > congratulating him!
> >
> > Andrew has been an active contributor to Mesos for about a year.  He
> > has
> > been the primary contributor behind our efforts to change our default
> > build
> > system to CMake and to port Mesos onto Windows.
> >
> > Here is his committer candidate checklist for your perusal:
> > https://docs.google.com/document/d/1MfJRYbxxoX2-A-
> > g8NEeryUdUi7FvIoNcdUbDbGguH1c/
> >
> > Congrats Andy!
> > ~Joseph
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Software Architect
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: Welcome James Peach as a new committer and PMC member!

2017-09-07 Thread Klaus Ma
Congrats !!


Da (Klaus), Ma (马达) | PMP® | R of IBM Cloud private
IBM Spectrum Computing, IBM System
+86-10-8245 4084 | mad...@cn.ibm.com | @k82cn <http://github.com/k82cn>

On Thu, Sep 7, 2017 at 3:08 PM, tommy xiao <xia...@gmail.com> wrote:

> Congrats James! Well deserved!
>
> 2017-09-07 14:54 GMT+08:00 Ben Lin <ben@mesosphere.io>:
>
>> Congrats!!
>>
>> --
>> *From:* Oucema Bellagha <oucema.bella...@hotmail.com>
>> *Sent:* Thursday, September 7, 2017 2:51:44 PM
>> *To:* user@mesos.apache.org
>> *Subject:* Re: Welcome James Peach as a new committer and PMC member!
>>
>> Congrats my friend !
>>
>> --
>> *From:* xuj...@apple.com <xuj...@apple.com> on behalf of Yan Xu <
>> xuj...@apple.com>
>> *Sent:* Wednesday, September 6, 2017 9:08:42 PM
>> *To:* dev; user
>> *Subject:* Welcome James Peach as a new committer and PMC member!
>>
>> Hi Mesos devs and users,
>>
>> Please welcome James Peach as a new Apache Mesos committer and PMC member.
>>
>> James has been an active contributor to Mesos for over two years now. He
>> has made many great contributions to the project which include XFS disk
>> isolator, improvement to Linux capabilities support and IPC namespace
>> isolator. He's super active on the mailing lists and slack channels, always
>> eager to help folks in the community and he has been helping with a lot of
>> Mesos reviews as well.
>>
>> Here is his formal committer candidate checklist:
>>
>> https://docs.google.com/document/d/19G5zSxhrRBdS6GXn9KjCznjX
>> 3cp0mUbck6Jy1Hgn3RY/edit?usp=sharing
>> <https://docs.google.com/document/d/19G5zSxhrRBdS6GXn9KjCznjX3cp0mUbck6Jy1Hgn3RY/edit?usp=sharing>
>>
>> Congrats James!
>>
>> Yan
>>
>>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>


Re: Welcome Greg Mann as a new committer and PMC member!

2017-06-13 Thread Klaus Ma
Congrats!


> On 14 Jun 2017, at 06:29, Ben Lin  wrote:
> 
> Congrats Greg, well deserved!
> 
> From: Jie Yu 
> Sent: Wednesday, June 14, 2017 5:54:48 AM
> To: user
> Cc: dev
> Subject: Re: Welcome Greg Mann as a new committer and PMC member!
>  
> Congrats Greg!
> 
> On Tue, Jun 13, 2017 at 2:42 PM, Vinod Kone  > wrote:
> Hi folks,
> 
> Please welcome Greg Mann as the newest committer and PMC member of the Apache 
> Mesos project.
> 
> Greg has been an active contributor to the Mesos project for close to 2 years 
> now and has made many solid contributions. His biggest source code 
> contribution to the project has been around adding authentication support for 
> default executor. This was a major new feature that involved quite a few 
> moving parts. Additionally, he also worked on improving the scheduler and 
> executor APIs.
> 
> Here is his more formal checklist for your perusal.
> 
> https://docs.google.com/document/d/1S6U5OFVrl7ySmpJsfD4fJ3_R8JYRRc5spV0yKrpsGBw/edit
>  
> 
> 
> Thanks,
> Vinod
> 
> 



Re: Welcome Gilbert Song as a new committer and PMC member!

2017-05-25 Thread Klaus Ma
Congrats! 


> On 25 May 2017, at 03:39, Greg Mann  > wrote:
> 
> Congratulations Gilbert!! :D
> 
> On Wed, May 24, 2017 at 12:01 PM, Avinash Sridharan  > wrote:
> Congrats Gilbert !! Very well deserved !!
> 
> On Wed, May 24, 2017 at 11:56 AM, Timothy Chen  > wrote:
> 
> > Congrats! Rocking the containerizer world!
> >
> > Tim
> >
> > On Wed, May 24, 2017 at 11:23 AM, Zhitao Li  > > wrote:
> > > Congrats Gilbert!
> > >
> > > On Wed, May 24, 2017 at 11:08 AM, Yan Xu  > > > wrote:
> > >
> > >> Congrats! Well deserved!
> > >>
> > >> ---
> > >> Jiang Yan Xu | @xujyan
> > >>
> > >> On Wed, May 24, 2017 at 10:54 AM, Vinod Kone  > >> >
> > wrote:
> > >>
> > >>> Congrats Gilbert!
> > >>>
> > >>> On Wed, May 24, 2017 at 1:32 PM, Neil Conway  > >>> >
> > >>> wrote:
> > >>>
> > >>> > Congratulations Gilbert! Well-deserved!
> > >>> >
> > >>> > Neil
> > >>> >
> > >>> > On Wed, May 24, 2017 at 10:32 AM, Jie Yu  > >>> > >
> > wrote:
> > >>> > > Hi folks,
> > >>> > >
> > >>> > > I'm happy to announce that the PMC has voted Gilbert Song as a new
> > >>> > committer
> > >>> > > and member of PMC for the Apache Mesos project. Please join me to
> > >>> > > congratulate him!
> > >>> > >
> > >>> > > Gilbert has been working on Mesos project for 1.5 years now. His
> > main
> > >>> > > contribution is his work on unified containerizer, nested container
> > >>> (aka
> > >>> > > Pod) support. He also helped a lot of folks in the community
> > regarding
> > >>> > their
> > >>> > > patches, questions and etc. He also played an important role
> > >>> organizing
> > >>> > > MesosCon Asia last year and this year!
> > >>> > >
> > >>> > > His formal committer checklist can be found here:
> > >>> > > https://docs.google.com/document/d/1iSiqmtdX_0CU-YgpViA6r6PU_aMCVuxuNUZ458FR7Qw/edit?usp=sharing
> > >>> > >
> > >>> > > Welcome, Gilbert!
> > >>> > >
> > >>> > > - Jie
> > >>> >
> > >>>
> > >>
> > >>
> > >
> > >
> > > --
> > > Cheers,
> > >
> > > Zhitao Li
> >
> 
> 
> 
> --
> Avinash Sridharan, Mesosphere
> +1 (323) 702 5245 
> 



Re: Welcome Gilbert Song as a new committer and PMC member!

2017-05-24 Thread Klaus Ma
Congratulations Gilbert!

On Thu, May 25, 2017 at 3:39 AM Greg Mann <g...@mesosphere.io> wrote:

> Congratulations Gilbert!! :D
>
> On Wed, May 24, 2017 at 12:01 PM, Avinash Sridharan <avin...@mesosphere.io
> > wrote:
>
>> Congrats Gilbert !! Very well deserved !!
>>
>> On Wed, May 24, 2017 at 11:56 AM, Timothy Chen <tnac...@gmail.com> wrote:
>>
>> > Congrats! Rocking the containerizer world!
>> >
>> > Tim
>> >
>> > On Wed, May 24, 2017 at 11:23 AM, Zhitao Li <zhitaoli...@gmail.com>
>> wrote:
>> > > Congrats Gilbert!
>> > >
>> > > On Wed, May 24, 2017 at 11:08 AM, Yan Xu <y...@jxu.me> wrote:
>> > >
>> > >> Congrats! Well deserved!
>> > >>
>> > >> ---
>> > >> Jiang Yan Xu <y...@jxu.me> | @xujyan <https://twitter.com/xujyan>
>> > >>
>> > >> On Wed, May 24, 2017 at 10:54 AM, Vinod Kone <vinodk...@apache.org>
>> > wrote:
>> > >>
>> > >>> Congrats Gilbert!
>> > >>>
>> > >>> On Wed, May 24, 2017 at 1:32 PM, Neil Conway <neil.con...@gmail.com
>> >
>> > >>> wrote:
>> > >>>
>> > >>> > Congratulations Gilbert! Well-deserved!
>> > >>> >
>> > >>> > Neil
>> > >>> >
>> > >>> > On Wed, May 24, 2017 at 10:32 AM, Jie Yu <yujie@gmail.com>
>> > wrote:
>> > >>> > > Hi folks,
>> > >>> > >
> >> > >>> > > I'm happy to announce that the PMC has voted Gilbert Song as a
>> new
>> > >>> > committer
>> > >>> > > and member of PMC for the Apache Mesos project. Please join me
>> to
>> > >>> > > congratulate him!
>> > >>> > >
>> > >>> > > Gilbert has been working on Mesos project for 1.5 years now. His
>> > main
>> > >>> > > contribution is his work on unified containerizer, nested
>> container
>> > >>> (aka
>> > >>> > > Pod) support. He also helped a lot of folks in the community
>> > regarding
>> > >>> > their
>> > >>> > > patches, questions and etc. He also played an important role
>> > >>> organizing
>> > >>> > > MesosCon Asia last year and this year!
>> > >>> > >
>> > >>> > > His formal committer checklist can be found here:
>> > >>> > > https://docs.google.com/document/d/1iSiqmtdX_0CU-YgpViA6r6PU_
>> > >>> > aMCVuxuNUZ458FR7Qw/edit?usp=sharing
>> > >>> > >
>> > >>> > > Welcome, Gilbert!
>> > >>> > >
>> > >>> > > - Jie
>> > >>> >
>> > >>>
>> > >>
>> > >>
>> > >
>> > >
>> > > --
>> > > Cheers,
>> > >
>> > > Zhitao Li
>> >
>>
>>
>>
>> --
>> Avinash Sridharan, Mesosphere
>> +1 (323) 702 5245
>>
>
> --

Regards,

Da (Klaus), Ma (马达), PMP® | Software Architect
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: Could mesos be a replacement for yarn?

2016-12-20 Thread Klaus Ma
I don't think so :(. There's an Apache incubator project named Myriad that
can run YARN on Mesos, but it's still far from production-ready.

On Wed, Dec 21, 2016 at 12:01 AM Dima Fadeyev <linuxrem...@gmail.com> wrote:

> Hello, everyone,
>
> Is it possible to run software from hadoop ecosystem on mesos? Do these
> work: Hive, Oozie, Sqoop, MRv2, Pig?
>
> Thanks in advance and best regards
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Software Architect
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: Structured logging for Mesos (or c++ glog)

2016-12-19 Thread Klaus Ma
Why not `logstash`? I think this is exactly the kind of use case `logstash` targets.

On Tue, Dec 20, 2016 at 7:35 AM Zhitao Li <zhitaoli...@gmail.com> wrote:

> Great.
>
> I also found this old thread
> http://search-hadoop.com/m/Mesos/0Vlr6meKs116T2k1?subj=Mapped+diagnostics+context+Adding+internal+Mesos+IDs+as+context+to+the+logs
>  on
> dev list, which seems no consensus has been made.
>
> Maybe we can talk about this in the next community sync?
>
> On Mon, Dec 19, 2016 at 3:25 PM, James Peach <jor...@gmail.com> wrote:
>
>
> > On Dec 19, 2016, at 2:54 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> >
> > Hi James,
> >
> > Stitching events together is only one possible use case, and I'm not
> exactly sure what you meant by direct event logging.
> >
> > Taking the hierarchical allocator for example. In a multi-framework
> cluster, sometimes I want to comb through various loggings and present a
> trace on how allocation has affected a particular framework (by its
> framework id) and/or w.r.t an agent (by its agent id).
> >
> > Being able to systematically extract structured field values like
> framework_id or agent_id, regardless of the actual logging pattern, will
> be tremendously valuable in such use cases.
>
> I think we are talking about similar things. Many servers do both
> free-form error logging and structured event logging. I'm thinking of event
> logging formats that are customizable by the operator and allow the
> interpolation of context-specific data items (e.g. HTTP access logs from many
> different server implementations).
>
> J
>
>
>
>
> --
> Cheers,
>
> Zhitao Li
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Software Architect
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
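
The mechanical-extraction idea Zhitao describes (pulling framework_id or
agent_id out of log lines regardless of the message format) can be sketched
with structured JSON logging. This is an illustrative Python stand-in, not
glog or Mesos code; the function and field names here are assumptions:

```python
import json
import sys
import time

def log_event(message, **context):
    """Emit one log line as JSON so context fields stay machine-readable."""
    record = {"ts": time.time(), "msg": message, **context}
    sys.stdout.write(json.dumps(record, sort_keys=True) + "\n")
    return record

# Context fields ride along as first-class keys, so building a trace for one
# framework is a filter on framework_id rather than a regex over free text.
rec = log_event("allocated resources", framework_id="fw-1", agent_id="agent-9")
print(rec["framework_id"])  # fw-1
```

A pipeline such as logstash (as suggested above) can then index these lines
directly, with no per-message parsing pattern required.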


Re: How to shutdown mesos-agent gracefully?

2016-10-12 Thread Klaus Ma
I'd like to notify the framework to kill its tasks and then terminate the
mesos-agent. As for the Maintenance feature, I can't remember whether the
slave info will be cleaned up if that slave never re-registers.

On Wed, Oct 12, 2016 at 10:13 PM Alex Rukletsov <a...@mesosphere.com> wrote:

> To make sure: you are aware of SIGUSR1?
>
> On Tue, Oct 11, 2016 at 5:37 PM, tommy xiao <xia...@gmail.com> wrote:
>
> > Hi Ma,
> >
> > could you please input more background, why Maintenance feature  is not
> > best option for your request?
> >
> > 2016-10-11 14:47 GMT+08:00 haosdent <haosd...@gmail.com>:
> >
> > > gracefully means not affect running tasks?
> > >
> > > On Tue, Oct 11, 2016 at 2:36 PM, Klaus Ma <klaus1982...@gmail.com>
> > wrote:
> > >
> > >> It seems there's no way to shut down the mesos-agent gracefully.
> > >> The Maintenance feature expects the agents to re-register in the future.
> > >>
> > >> Thanks
> > >> Klaus
> > >> --
> > >>
> > >> Regards,
> > >> 
> > >> Da (Klaus), Ma (马达), PMP® | Software Architect
> > >> IBM Platform Development & Support, STG, IBM GCG
> > >> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
> > >>
> > >
> > >
> > >
> > > --
> > > Best Regards,
> > > Haosdent Huang
> > >
> >
> >
> >
> > --
> > Deshi Xiao
> > Twitter: xds2000
> > E-mail: xiaods(AT)gmail.com
> >
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Software Architect
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
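
Alex's SIGUSR1 pointer above is the closest thing to a graceful stop: the
agent is documented to unregister from the master when it receives SIGUSR1,
so the master can notify frameworks promptly instead of waiting out a
health-check timeout. The signal-driven pattern, demonstrated with a
self-contained stand-in process (a POSIX-shell sketch, not a real agent):

```shell
#!/bin/sh
# Stand-in "agent": on USR1 it performs cleanup ("unregisters") and exits 0,
# mimicking what a Mesos agent does on SIGUSR1. On a real host you would
# send the signal to the mesos-agent process instead.
agent() {
  trap 'echo "unregistering from master"; exit 0' USR1
  while :; do sleep 1; done
}

agent &
pid=$!
sleep 1              # give the background process time to install its trap
kill -USR1 "$pid"    # request the graceful shutdown
wait "$pid"
status=$?
echo "agent exited with status $status"
```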


How to shutdown mesos-agent gracefully?

2016-10-11 Thread Klaus Ma
It seems there's no way to shut down the mesos-agent gracefully. The
Maintenance feature expects the agents to re-register in the future.

Thanks
Klaus
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Software Architect
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: How many roles are we supported?

2016-09-08 Thread Klaus Ma
@Zhitao, thanks; that's helpful :).

On Thu, Sep 8, 2016 at 10:28 PM Zhitao Li <zhitaoli...@gmail.com> wrote:

> I'll share some of our targets which we aim to support per Mesos cluster,
> which may not be representative:
> - up to about 100 roles;
> - up to low hundreds of frameworks;
> - up to low tens of thousands of agents.
>
> On Thu, Sep 8, 2016 at 12:42 AM, Klaus Ma <klaus1982...@gmail.com> wrote:
>
> > any suggestion?
> >
> > On Wed, Sep 7, 2016 at 11:35 AM Klaus Ma <klaus1982...@gmail.com> wrote:
> >
> >> + user@
> >>
> >>
> >> On Wed, Sep 7, 2016 at 11:31 AM Klaus Ma <klaus1982...@gmail.com>
> wrote:
> >>
> >>> IMO, it does not make sense to let user to try it :). It's better for
> us
> >>> (Mesos Dev) to provide suggestion :).
> >>>
> >>> On Wed, Sep 7, 2016 at 11:27 AM Zhitao Li <zhitaoli...@gmail.com>
> wrote:
> >>>
> >>>> I think polling user group for how people uses or plan to use Mesos
> will
> >>>> help.
> >>>>
> >>>> I personally already know at least two different ways of modeling
> >>>> multiple
> >>>> workloads to roles and frameworks in Mesos, which results in quite
> >>>> different numbers for roles and frameworks even for similar sized
> >>>> cluster.
> >>>>
> >>>> On Tue, Sep 6, 2016 at 7:54 PM, Klaus Ma <klaus1982...@gmail.com>
> >>>> wrote:
> >>>>
> >>>> > Question on Mesos's scalability of 1.0: how many roles are we going
> to
> >>>> > support? how many nodes are we going to support? how many frameworks
> >>>> are we
> >>>> > going to support? ...
> >>>> >
> >>>> > When using Mesos as resource manager, those info is important to us
> >>>> when
> >>>> > proposing solution.
> >>>> >
> >>>> > And in community, it's better for us to have a target for
> performance
> >>>> > related project; it takes time to keeping improving the performance
> >>>> :).
> >>>> >
> >>>> > Thanks
> >>>> > Klaus
> >>>> > --
> >>>> >
> >>>> > Regards,
> >>>> > 
> >>>> > Da (Klaus), Ma (马达), PMP® | Software Architect
> >>>> > IBM Platform Development & Support, STG, IBM GCG
> >>>> > +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Cheers,
> >>>>
> >>>> Zhitao Li
> >>>>
> >>> --
> >>>
> >>> Regards,
> >>> 
> >>> Da (Klaus), Ma (马达), PMP® | Software Architect
> >>> IBM Platform Development & Support, STG, IBM GCG
> >>> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
> >>>
> >> --
> >>
> >> Regards,
> >> 
> >> Da (Klaus), Ma (马达), PMP® | Software Architect
> >> IBM Platform Development & Support, STG, IBM GCG
> >> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
> >>
> > --
> >
> > Regards,
> > 
> > Da (Klaus), Ma (马达), PMP® | Software Architect
> > IBM Platform Development & Support, STG, IBM GCG
> > +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
> >
>
>
>
> --
> Cheers,
>
> Zhitao Li
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Software Architect
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: How many roles are we supported?

2016-09-08 Thread Klaus Ma
Any suggestions?

On Wed, Sep 7, 2016 at 11:35 AM Klaus Ma <klaus1982...@gmail.com> wrote:

> + user@
>
>
> On Wed, Sep 7, 2016 at 11:31 AM Klaus Ma <klaus1982...@gmail.com> wrote:
>
>> IMO, it does not make sense to ask users to try it :). It's better for us
>> (Mesos Dev) to provide suggestions :).
>>
>> On Wed, Sep 7, 2016 at 11:27 AM Zhitao Li <zhitaoli...@gmail.com> wrote:
>>
>>> I think polling user group for how people uses or plan to use Mesos will
>>> help.
>>>
>>> I personally already know at least two different ways of modeling
>>> multiple
>>> workloads to roles and frameworks in Mesos, which results in quite
>>> different numbers for roles and frameworks even for similar sized
>>> cluster.
>>>
>>> On Tue, Sep 6, 2016 at 7:54 PM, Klaus Ma <klaus1982...@gmail.com> wrote:
>>>
>>> > Question on Mesos's scalability of 1.0: how many roles are we going to
>>> > support? how many nodes are we going to support? how many frameworks
>>> are we
>>> > going to support? ...
>>> >
>>> > When using Mesos as resource manager, those info is important to us
>>> when
>>> > proposing solution.
>>> >
>>> > And in community, it's better for us to have a target for performance
>>> > related project; it takes time to keeping improving the performance :).
>>> >
>>> > Thanks
>>> > Klaus
>>> > --
>>> >
>>> > Regards,
>>> > 
>>> > Da (Klaus), Ma (马达), PMP® | Software Architect
>>> > IBM Platform Development & Support, STG, IBM GCG
>>> > +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
>>> >
>>>
>>>
>>>
>>> --
>>> Cheers,
>>>
>>> Zhitao Li
>>>
>> --
>>
>> Regards,
>> 
>> Da (Klaus), Ma (马达), PMP® | Software Architect
>> IBM Platform Development & Support, STG, IBM GCG
>> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
>>
> --
>
> Regards,
> 
> Da (Klaus), Ma (马达), PMP® | Software Architect
> IBM Platform Development & Support, STG, IBM GCG
> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Software Architect
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: How many roles are we supported?

2016-09-06 Thread Klaus Ma
+ user@

On Wed, Sep 7, 2016 at 11:31 AM Klaus Ma <klaus1982...@gmail.com> wrote:

> IMO, it does not make sense to ask users to try it :). It's better for us
> (Mesos Dev) to provide suggestions :).
>
> On Wed, Sep 7, 2016 at 11:27 AM Zhitao Li <zhitaoli...@gmail.com> wrote:
>
>> I think polling the user group for how people use or plan to use Mesos will
>> help.
>>
>> I personally already know at least two different ways of modeling multiple
>> workloads to roles and frameworks in Mesos, which results in quite
>> different numbers for roles and frameworks even for similar sized cluster.
>>
>> On Tue, Sep 6, 2016 at 7:54 PM, Klaus Ma <klaus1982...@gmail.com> wrote:
>>
>> > Question on Mesos's scalability of 1.0: how many roles are we going to
>> > support? how many nodes are we going to support? how many frameworks
>> are we
>> > going to support? ...
>> >
>> > When using Mesos as resource manager, those info is important to us when
>> > proposing solution.
>> >
>> > And in community, it's better for us to have a target for performance
>> > related project; it takes time to keeping improving the performance :).
>> >
>> > Thanks
>> > Klaus
>> > --
>> >
>> > Regards,
>> > 
>> > Da (Klaus), Ma (马达), PMP® | Software Architect
>> > IBM Platform Development & Support, STG, IBM GCG
>> > +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
>> >
>>
>>
>>
>> --
>> Cheers,
>>
>> Zhitao Li
>>
> --
>
> Regards,
> 
> Da (Klaus), Ma (马达), PMP® | Software Architect
> IBM Platform Development & Support, STG, IBM GCG
> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Software Architect
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


RE: Request offers based on Slave Attributes

2016-05-31 Thread Klaus Ma
Hi Nihal,
Currently, the Mesos master/allocator will ignore "requestResources"; as
Guangya said, please filter the offers you receive by the agents' attributes.

----
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

Date: Tue, 31 May 2016 14:58:39 +0800
Subject: Re: Request offers based on Slave Attributes
From: gyliu...@gmail.com
To: user@mesos.apache.org

I think that you can filter the offers based on slave attributes; all of the
offers carry the agent's attributes in the offer message:
https://github.com/apache/mesos/blob/master/src/master/master.cpp#L5637
On Tue, May 31, 2016 at 2:33 PM, Nihal Harish <nihal42har...@gmail.com> wrote:
Hi,

The mesos.proto file allows us to specify request offers based on only
slave_id and resources. Is there some way to request offers based on slave
attributes and/or hostname?

Thanks.

Regards,
Nihal
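
A sketch of the workaround described above: since the built-in allocator
ignores `requestResources`, the scheduler takes the offers it receives and
filters them by agent attributes itself. Plain dicts stand in for the
protobuf `Offer` messages; the shapes and names are illustrative
assumptions, not the real Mesos API:

```python
def filter_offers(offers, required):
    """Keep only offers whose agent attributes include every required name/value pair."""
    matched = []
    for offer in offers:
        attrs = {a["name"]: a["value"] for a in offer.get("attributes", [])}
        if all(attrs.get(name) == value for name, value in required.items()):
            matched.append(offer)
    return matched

offers = [
    {"id": "o1", "attributes": [{"name": "rack", "value": "r1"},
                                {"name": "os", "value": "linux"}]},
    {"id": "o2", "attributes": [{"name": "rack", "value": "r2"}]},
]
print([o["id"] for o in filter_offers(offers, {"rack": "r1"})])  # ['o1']
```

Offers that don't match would then be declined (ideally with a long filter
timeout) so the allocator can hand them to other frameworks.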


  

RE: 1.0 Release Candidate

2016-05-29 Thread Klaus Ma
v1.0, exciting :).

Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

> Date: Sun, 29 May 2016 02:31:47 -0700
> Subject: Re: 1.0 Release Candidate
> From: a...@mesosphere.io
> To: user@mesos.apache.org
> CC: vinodk...@apache.org; d...@mesos.apache.org
> 
> FYI, I made an alternate 1.0 release dashboard with a longer timeframe for
> the created vs. resolved chart, and added a couple of my favorite widgets.
> Feel free to use anything you find helpful.
> 
> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12328256
> 
> On Thu, May 26, 2016 at 9:44 PM, Vinod Kone <vinodk...@apache.org> wrote:
> 
> > This is the release dashboard:
> >
> > https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12328255
> >
> > *NOTE: *If you have set a Fix Version of 0.29.0 on a ticket that is not a
> > blocker for 0.29.0/1.0 release, please unset the fix version.
> >
> > On Thu, May 26, 2016 at 3:44 PM, Vinod Kone <vinodk...@apache.org> wrote:
> >
> >> Thanks for asking the questions Zameer. Wanted to give some clarification
> >> regarding the thought process for releasing 1.0.
> >>
> >> The reason for cutting  a 1.0, is because we want to signal that the
> >> Mesos project has reached a level of maturity to the wider community. Among
> >> other things we are confident at this point that the *foundations* we laid
> >> for the new APIs are mature and could be evolved in a backwards compatible
> >> way. We laid the foundations almost a year ago (at last MesosCon) and
> >> since then have been busy implementing the backend to drive the API.  Even
> >> the newly released design doc for the operator API is built on the same
> >> foundations as the scheduler/executor APIs. While we have been tweaking the
> >> API backend for a while now the API definitions have mostly stayed the
> >> same. Part of the reason it took this long is because we really wanted to
> >> be sure the basic building blocks were solid.
> >>
> >> MesosCon is a great opportunity for us to drum up excitement about the
> >> new APIs and invite them to start using/testing it. Like any other OSS
> >> project, as people and organizations start using the new APIs in staging
> >> and production, we will make stability and implementation improvements. The
> >> long period for the RC will also help catching issues with API foundations
> >> themselves. We have had a bit of chicken and egg problem having people
> >> consume the new APIs because most don't want to use it in production unless
> >> it is declared production ready and we can't call it production ready until
> >> someone uses them in production.
> >>
> >> Having said all that stability and production readiness is paramount for
> >> the project.  That is never going to change. In the case of the new APIs,
> >> we have developed C++ frameworks using the new APIs and having been running
> >> them as part of ASF CI for months now. Mesosphere, for example, also has an
> >> internal cluster where frameworks using these new APIs have been baking for
> >> a while and had done (and doing) rigorous tests (network partitions,
> >> scaling tests, functional tests). Community members from IBM have also been
> >> instrumental in testing the new APIs. We are hoping after 1.0 more people
> >> would be willing and excited to consume these new APIs and stress test in
> >> their environments.
> >>
> >> At the end of the day, while new APIs are an important part of Mesos 1.0
> >> it's not the only reason for cutting a 1.0 release. Mesos has a slew of
> >> exciting features and a thriving eco system and we would love to have more
> >> people excited and get a taste of it. 1.0 is just a start...
> >>
> >> Hope that helps,
> >>
> >>
> >> On Wed, May 25, 2016 at 4:57 PM, Zameer Manji <zma...@apache.org> wrote:
> >>
> >>> I might be in the minority here, but I think cutting an RC for 1.0 right
> >>> now is very aggressive. Does there exist even a single framework that
> >>> uses
> >>> the Scheduler HTTP API or the Executor HTTP API? Does anyone even use
> >>> these
> >>> APIs in production? Is there a single entity that uses the Operator API
> >>> to
> >>> manage agents?
> >>>
> >>> I think cutting an RC right now is 100% premature until the community can

Re: interplay of reservations and quotas

2016-05-09 Thread Klaus Ma
Currently (0.28+), reserved resources are also counted toward quota. In your
case, the framework will get the 50 reserved CPUs plus 50 CPUs from "*"; the
other 100 CPUs will NOT be offered to this framework, even if there are no
other frameworks.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Tue, May 10, 2016 at 1:42 AM, haosdent <haosd...@gmail.com> wrote:

> >Will it take the 50 CPUs from the reservations and just block another 50
> CPUs from other slaves within the cluster?
> Is the framework written by yourself? This depends on how your framework
> implement resourceOffers. Suppose you assign your framework with `bar` and
> reserved by this role as well. You still could accept the offer which role
> is "*" in `Scheduler::resourceOffers`.
>
> On Mon, May 9, 2016 at 11:58 PM, Sebastian Kuepers <
> sebastian.kuep...@publicispixelpark.de> wrote:
>
>> Hi,
>>
>>
>> I have a framework for which I want to guarantee resources on my cluster -
>> independently from the slaves.
>>
>> Let's say 100 CPU on a 200 CPU cluster.
>>
>>
>> I could very well use quotas for that. I create a role for this framework
>> and create the quota for this role with the 100 CPUs.
>>
>>
>> But in my cluster I have a couple of slaves, which are great for the
>> framework to run on, because they have a lot of memory for example.
>>
>>
>> But they can only provide 50 of the 100 CPU I want to get for sure for
>> this framework.​
>>
>> So I would do start these slaves with reservations for this role.
>>
>>
>> Does the quota mechanism now take this into account when trying to
>> block and allocate resources for this framework?
>>
>>
>> Will it take the 50 CPUs from the reservations and just block another 50
>> CPUs from other slaves within the cluster?
>>
>>
>> Thanks for your help,
>>
>> Sebastian
>>
>>
>>
>>
>>
>>
>> 
>> Disclaimer The information in this email and any attachments may contain
>> proprietary and confidential information that is intended for the
>> addressee(s) only. If you are not the intended recipient, you are hereby
>> notified that any disclosure, copying, distribution, retention or use of
>> the contents of this information is prohibited. When addressed to our
>> clients or vendors, any information contained in this e-mail or any
>> attachments is subject to the terms and conditions in any governing
>> contract. If you have received this e-mail in error, please immediately
>> contact the sender and delete the e-mail.
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>


Re: Change the role of a framework

2016-04-29 Thread Klaus Ma
Maybe we can create a document for it under the FAQ umbrella :).


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Fri, Apr 29, 2016 at 11:13 AM, Vinod Kone <vinodk...@apache.org> wrote:

> I think what you did seems correct.
>
> On Thu, Apr 28, 2016 at 6:31 PM, Shuai Lin <linshuai2...@gmail.com> wrote:
>
>> Hi list,
>>
>> For some reason I need to change the role of an existing framework
>> (marathon)  from the default role "*" to a specific role, say "services", I
>> don't find any existing documentation on this, so here are the steps that I
>> take on a staging cluster:
>>
>> - stop all HA marathon instances, leaving only one running
>>
>> - set the marathon role (/etc/marathon/conf/mesos_role), and restart
>> marathon
>>   - at this moment marathon is still using the "*" role because the master
>> won't update the role of a framework when it re-registers
>>   - for that to happen we need to do a mesos master failover
>>
>> - stop the current active mesos-master, so marathon would use the new
>> role after the master failover
>>
>> - now: marathon is using "services" role, which means it would accept
>> resources from both slaves with default '*' role and slaves with "services"
>> role
>>
>> - for each slave:
>>   - stop the slave
>>   - change the role (/etc/mesos-slave/default_role) to "services"
>>   - remove /tmp/mesos/meta/slaves
>>   - restart docker (otherwise the old running executors/tasks won't be
>> killed)
>>   - restart the slave
>>
>> During the process all running tasks are killed and restarted, but that's
>> acceptable to me.
>>
>> Now all slaves are running with role "services" and marathon is running
>> with role "services".  So far the cluster seems to be working fine, but I'm
>> not sure whether the steps I took have any unnoticed impacts, since this is a
>> somewhat undocumented procedure.
>>
>> Any comments?
>>
>> Regards,
>> Shuai
>>
>>
>>
>>
>


Re: Altering agent resources after startup

2016-04-20 Thread Klaus Ma
@Aaron, thanks for your info; I think MESOS-3059 covers your case.

On Wed, Apr 20, 2016 at 5:24 PM Aaron Carey <aca...@ilm.com> wrote:

> 2 frameworks minimum, sometimes more (depends what we're doing at the
> time). Marathon is always running and we have a couple of custom frameworks
> too..
>
>
> --
>
> Aaron Carey
> Production Engineer - Cloud Pipeline
> Industrial Light & Magic
> London
> 020 3751 9150
>
> --
> *From:* Klaus Ma [klaus1982...@gmail.com]
> *Sent:* 20 April 2016 10:22
> *To:* user@mesos.apache.org
>
> *Subject:* Re: Altering agent resources after startup
> For labels/attributes, I think this is a case we need to pay attention to:
> the allocator does not take labels/attributes into account when allocating,
> so the resources may be assigned to different frameworks. @Aaron, how many
> frameworks are you running?
>
> Thanks
> Klaus
>
> On Wed, Apr 20, 2016 at 5:18 PM Aaron Carey <aca...@ilm.com> wrote:
>
>> Ah thank you! I tried searching Jira but didn't find that ticket.
>>
>> Yes, I think you might be right about the attributes, although I don't
>> seem to be able to get to the MESOS-3059 ticket in Jira; do you know if
>> it's on the roadmap?
>>
>> Thanks,
>> Aaron
>>
>> --
>>
>>
>> Aaron Carey
>> Production Engineer - Cloud Pipeline
>> Industrial Light & Magic
>> London
>> 020 3751 9150
>>
>> --
>> *From:* haosdent [haosd...@gmail.com]
>> *Sent:* 20 April 2016 10:12
>> *To:* user
>> *Subject:* Re: Altering agent resources after startup
>>
>> There is a ticket, [Allow slave reconfiguration on restart](
>> https://issues.apache.org/jira/browse/MESOS-1739), related to this, but it
>> is not implemented yet. Your requirement seems not to be about changing the
>> agent's resources dynamically, though; it looks more like changing the
>> agent's labels/attributes dynamically.
>>
>> On Wed, Apr 20, 2016 at 4:56 PM, Aaron Carey <aca...@ilm.com> wrote:
>>
>>> Hi All,
>>>
>>> I was wondering if it was possible somehow to alter an agent's resources
>>> after it has started?
>>>
>>> Example: we are dynamically attaching and detaching EBS volumes to EC2
>>> hosts running as agents. (This is part of our docker volume setup using
>>> RexRay). When a host has an EBS volume attached to it I'd like to be able
>>> to mark that as a new resource on the agent. Note that it's not the disk
>>> space we care about here, just the name of the volume itself. This would
>>> then allow us to schedule tasks that require access to the data on that EBS
>>> volume all on the same host.
>>>
>>> Anyone have any ideas?
>>>
>>> Thanks!
>>>
>>> Aaron
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
> --
>
> Regards,
> 
> Da (Klaus), Ma (马达), PMP® | Advisory Software Engineer
> IBM Platform Development & Support, STG, IBM GCG
> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Advisory Software Engineer
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: Altering agent resources after startup

2016-04-20 Thread Klaus Ma
Hi Aaron,

Currently, an agent's resources can NOT be updated after it has started; the
QoS controller can only report revocable resources. But detecting resources
on the fly is a reasonable requirement; would you help open a JIRA for this?
I think there are two sub-requirements in this scenario:

1. The agent's resources can be updated on the fly; this is different from
MESOS-1739, which focuses on agent restart
2. Self-defined (custom) resources that are consumed only by particular tasks

If you have any comments, please let me know.

Thanks
Klaus


On Wed, Apr 20, 2016 at 4:56 PM Aaron Carey <aca...@ilm.com> wrote:

> Hi All,
>
> I was wondering if it was possible somehow to alter an agent's resources
> after it has started?
>
> Example: we are dynamically attaching and detaching EBS volumes to EC2
> hosts running as agents. (This is part of our docker volume setup using
> RexRay). When a host has an EBS volume attached to it I'd like to be able
> to mark that as a new resource on the agent. Note that it's not the disk
> space we care about here, just the name of the volume itself. This would
> then allow us to schedule tasks that require access to the data on that EBS
> volume all on the same host.
>
> Anyone have any ideas?
>
> Thanks!
>
>
> Aaron
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Advisory Software Engineer
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: Checking success of resource reservations

2016-04-15 Thread Klaus Ma
Please try "curl -s http://mesos_master:5050/roles | python -m json.tool"
to get each role's info, including reservations.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Sat, Apr 16, 2016 at 2:43 AM, Sammy Nguyen <sngu...@comprehend.com>
wrote:

> Hi everyone,
>
> I am making resource reservations and creating persistent volumes through
> the operator HTTP endpoints on v0.28.0. In order to see if the requests
> went through, the docs (
> http://mesos.apache.org/documentation/latest/reservation/) say to check
> at the appropriate slave's /state endpoint. However, we are not seeing
> anything in the JSON response from that endpoint which would indicate
> success of the reservation. Can anyone provide guidance on this?
>
> For context, I am working on a script to reserve or unreserve disk and
> create or destroy persistent volumes as needed, and we would like to fail
> early if the reservation or persistent volume cannot be made.
>
> Thanks,
>
> *Sammy Nguyen*
>
>


Re: Framework taking default resources even though a role is specified

2016-04-15 Thread Klaus Ma
Which version are you using? For your requirement, I think you can try
quota; currently, resources beyond the quota will not be offered to a
framework whose quota is satisfied. Quota also includes reserved resources.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Sat, Apr 16, 2016 at 4:54 AM, Rodrick Brown <rodr...@orchardplatform.com>
wrote:

> You can try setting constraints on tasks in both Chronos and marathon that
> will limit deployment to only a certain set of nodes.
>
> Sent from Outlook for iPhone <https://aka.ms/wp8k5y>
>
>
>
>
> On Fri, Apr 15, 2016 at 1:35 PM -0700, "June Taylor" <j...@umn.edu> wrote:
>
> Evan,
>>
>> I'm not sure about it. We're new to the Mesos system and still learning.
>> We want to be able to classify resources so that our developers can run
>> tasks against them easily, without using more than they are permitted. It
>> seemed like resource roles were the appropriate solution, but they may not
>> go far enough if Mesos will still spill over into default resources.
>>
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Fri, Apr 15, 2016 at 3:27 PM, Evan Krall <kr...@yelp.com> wrote:
>>
>>> My understanding is that your framework would have to know not to accept
>>> offers for * resources. Marathon has an option to specify which roles to
>>> accept for a particular app, and has command line options for controlling
>>> the default. Maybe pyspark has something similar?
>>>
>>> On Fri, Apr 15, 2016 at 1:24 PM, June Taylor <j...@umn.edu> wrote:
>>>
>>>> Yep - we're waiting for it.
>>>>
>>>>
>>>> Thanks,
>>>> June Taylor
>>>> System Administrator, Minnesota Population Center
>>>> University of Minnesota
>>>>
>>>> On Fri, Apr 15, 2016 at 3:23 PM, Anand Mazumdar <an...@mesosphere.io>
>>>> wrote:
>>>>
>>>>> FWIW, we recently fixed `mesos-execute` (command scheduler) to add
>>>>> support for roles. It should be available in the next release (0.29).
>>>>>
>>>>> https://issues.apache.org/jira/browse/MESOS-4744
>>>>>
>>>>> -anand
>>>>>
>>>>> On Apr 15, 2016, at 11:41 AM, June Taylor <j...@umn.edu> wrote:
>>>>>
>>>>> Ken,
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> Is there a way to ensure a framework only receives the reserved
>>>>> resources?
>>>>>
>>>>> I would go ahead and take everything out of the * role, however, the
>>>>> 'mesos-execute' command doesn't support specifying a role, so that's the
>>>>> only way we can currently get mesos-execute to co-exist with pyspark.
>>>>>
>>>>> Any other thoughts from the group?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> June Taylor
>>>>> System Administrator, Minnesota Population Center
>>>>> University of Minnesota
>>>>>
>>>>> On Fri, Apr 15, 2016 at 11:54 AM, Ken Sipe <kens...@gmail.com> wrote:
>>>>>
>>>>>> The framework with role “production” will receive production
>>>>>> resources and * resources
>>>>>> All other frameworks (assuming no role) will only receive * resources
>>>>>>
>>>>>> ken
>>>>>>
>>>>>> > On Apr 15, 2016, at 11:38 AM, June Taylor <j...@umn.edu> wrote:
>>>>>> >
>>>>>> > We have a small cluster with 3 nodes in the * resource role
>>>>>> default, and 3 nodes in a "production" resource role.
>>>>>> >
>>>>>> > Starting up a framework which requests "production" properly
>>>>>> executes on the expected nodes, however, today we noticed that this job
>>>>>> also started up executors under the * resource role as well.
>>>>>> >
>>>>>> > We expect these tasks to only go on nodes with the "production"
>>>>>> resource role. Can you advise further?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > June Taylor
>>>>>> > S

Re: [Proposal] Remove the default value for agent work_dir

2016-04-13 Thread Klaus Ma
+1


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Wed, Apr 13, 2016 at 11:14 PM, Paul <arach...@gmail.com> wrote:

> +1
>
> On Apr 13, 2016, at 11:01 AM, Ken Sipe <kens...@gmail.com> wrote:
>
> +1
>
> On Apr 12, 2016, at 5:58 PM, Greg Mann <g...@mesosphere.io> wrote:
>
> Hey folks!
> A number of situations have arisen in which the default value of the Mesos
> agent `--work_dir` flag (/tmp/mesos) has caused problems on systems in
> which the automatic cleanup of '/tmp' deletes agent metadata. To resolve
> this, we would like to eliminate the default value of the agent
> `--work_dir` flag. You can find the relevant JIRA here
> <https://issues.apache.org/jira/browse/MESOS-5064>.
>
> We considered simply changing the default value to a more appropriate
> location, but decided against this because the expected filesystem
> structure varies from platform to platform, and because it isn't guaranteed
> that the Mesos agent would have access to the default path on a particular
> platform.
>
> Eliminating the default `--work_dir` value means that the agent would exit
> immediately if the flag is not provided, whereas currently it launches
> successfully in this case. This will break existing infrastructure which
> relies on launching the Mesos agent without specifying the work directory.
> I believe this is an acceptable change because '/tmp/mesos' is not a
> suitable location for the agent work directory except for short-term local
> testing, and any production scenario that is currently using this location
> should be altered immediately.
>
> If you have any thoughts/opinions/concerns regarding this change, please
> let us know!
>
> Cheers,
> Greg
>
>
>


Re: Failed to locate systemd cgroups hierarchy

2016-04-06 Thread Klaus Ma
Try mounting /sys and /cgroup into the container.

On Wed, Apr 6, 2016 at 8:34 PM DiGiorgio, Mr. Rinaldo S. <
rdigior...@pace.edu> wrote:

> Hi,
>
> I am using lxc to create a container and I must be missing a
> dependent package. Has anyone seen and resolved the following.
>
>
> I0406 12:28:53.765508 18891 main.cpp:223] Build: 2016-03-17 17:47:25 by
> root
> I0406 12:28:53.765622 18891 main.cpp:225] Version: 0.28.0
> I0406 12:28:53.765653 18891 main.cpp:228] Git tag: 0.28.0
> I0406 12:28:53.765666 18891 main.cpp:232] Git SHA:
> 961edbd82e691a619a4c171a7aadc9c32957fa73
> I0406 12:28:53.769107 18891 systemd.cpp:236] systemd version `219` detected
> I0406 12:28:53.769143 18891 main.cpp:240] Inializing systemd state
> I0406 12:28:53.773336 18891 systemd.cpp:324] Started systemd slice
> `mesos_executors.slice`
> Failed to initialize systemd: Failed to locate systemd cgroups hierarchy:
> does not exist
>
>
> Rinaldo

-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Advisory Software Engineer
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


RE: Port Resource Offers

2016-03-29 Thread Klaus Ma
Yes, all port resources must be ranges for now, e.g. 31000-35000.
There's already a JIRA (MESOS-4627: Improve Ranges parsing to handle single
values) on that; patches are pending review :).

Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
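To illustrate why the ranges-only expectation matters, here is a small sketch (my own illustrative parser, not the actual Mesos C++ range parser): each element of a `ports(*):[...]` declaration is expected to be a "begin-end" range, so a lone port such as 6379 would be written as the degenerate range 6379-6379.

```python
# Sketch of range-style port parsing (illustrative, not the Mesos
# parser): every element is a "begin-end" range; a bare value is
# treated here as the degenerate range value-value.

def parse_port_ranges(spec):
    """Parse '[6379-6379, 31000-35000]' into a list of (begin, end) tuples."""
    ranges = []
    for part in spec.strip("[]").split(","):
        begin, _, end = part.strip().partition("-")
        if not end:  # single value: degenerate range
            end = begin
        ranges.append((int(begin), int(end)))
    return ranges

print(parse_port_ranges("[6379-6379, 31000-35000]"))
# [(6379, 6379), (31000, 35000)]
```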

Date: Tue, 29 Mar 2016 10:51:44 +0100
Subject: Port Resource Offers
From: pradeep.chhetr...@gmail.com
To: user@mesos.apache.org

Hello,

I am running mesos slaves with a modified port announcement:

$ cat /etc/mesos-slave/resources
ports(*):[6379, 9200, 9300, 27017, 31000-35000]

I can see that this is being picked up when starting the mesos slaves in the
ps output:

--resources=ports(*):[6379, 9200, 9300, 27017, 31000-35000]

However, when I hit the /state.json endpoint of mesos-master, I am seeing
this: [screenshot omitted]

I can see that tasks are being assigned ports in the range 9300-27017. Some
of these ports are already used by other applications running on each mesos
slave, but they are being announced anyway. I am not sure whether this will
cause issues; I am assuming that it will always check whether a port is
already bound by some other process before assigning it to a task.

Going through the code and test cases, it looks like Mesos always expects
port resources in ranges:

https://github.com/apache/mesos/blob/master/src/v1/resources.cpp#L1255-L1263

So I guess I should always define ports as ranges rather than as individual
ports. It would be helpful if someone could confirm whether this is the
expected behaviour and my configuration is wrong.
-- 
Regards,
Pradeep Chhetri
  

Re: Can mesos support supports multi-datacenter and multi-region configurations for failure isolation and scalability.

2016-03-20 Thread Klaus Ma
There's some discussion at https://issues.apache.org/jira/browse/MESOS-3548
.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Sat, Mar 19, 2016 at 8:16 PM, tommy xiao <xia...@gmail.com> wrote:

> Recently, I read an article on the Nomad blog comparing Nomad with Mesos
> and Marathon; the blog said:
> ```
> Mesos does not support federation or multiple failure isolation regions.
> Nomad supports multi-datacenter and multi-region configurations for failure
> isolation and scalability.
> ```
>
> How does Mesos support multi-datacenter and multi-region features?
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>


Re: Instability on Mesos 0.27

2016-03-18 Thread Klaus Ma
If a Mesos daemon crashed, I'd suggest logging a JIRA and attaching more
detail, e.g. reproduction steps and master/agent logs.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Thu, Mar 17, 2016 at 8:26 AM, Vinod Kone <vinodk...@apache.org> wrote:

> Hey Gabriel,
>
> Could you share more details on what the crashes are and what your setup
> is (docker containerizer?). Any logs (master, agent, application) that can
> shed light would be useful to diagnose.
>
> On Wed, Mar 16, 2016 at 5:12 PM, Alfredo Carneiro <
> alfr...@simbioseventures.com> wrote:
>
>> Hello guys,
>>
>> I am using Mesos 0.27 with different kinds of applications, such as
>> crawlers, databases and websites. However, I have faced many crashes and I
>> couldn't figure out what the problem is.
>>
>> We have 14 machines with 8 GB of RAM and 4 CPUs each. Usually, we run about
>> 40 instances of our crawler, which start stopping out of nowhere (though
>> the containers keep running). The day before yesterday we decided to test
>> our entire infrastructure and we scaled our crawler up to 110 instances.
>> Unfortunately, today we faced a big crash that affected mainly our
>> crawler and our databases.
>>
>> So, I am wondering if anyone else has the same problem, such as apps
>> that crash out of nowhere, or anything else that could be related to some
>> instability in Mesos.
>>
>> --
>> Alfredo Miranda
>>
>
>


Re: Mesos 0.27 and docker

2016-03-11 Thread Klaus Ma
Hi Walter,

I think you're talking about Docker in Mesos without Marathon; if so,
please check the second doc:
http://mesos.apache.org/documentation/latest/docker-containerizer/


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Fri, Mar 11, 2016 at 7:24 PM, Walter Heestermans (TME) <
walter.heesterm...@external.toyota-europe.com> wrote:

> You are specifying two ways, what’s the preferred way?
>
>
>
> Walter
>
>
>
>
>
> *From:* Rad Gruchalski [mailto:ra...@gruchalski.com]
> *Sent:* 11 March 2016 12:20
> *To:* user@mesos.apache.org
> *Subject:* Re: Mesos 0.27 and docker
>
>
>
> Walter,
>
>
>
> All you need to know to start is documented here:
> https://mesosphere.github.io/marathon/docs/native-docker.html.
>
> That’s with Marathon, if you are planning on using it directly with Mesos,
> http://mesos.apache.org/documentation/latest/docker-containerizer/
>
> No problem using latest Docker, I have a 0.27.2 cluster with Docker 1.10.2
> (docker-engine). All working perfectly fine.
>
> Kind regards,
> Radek Gruchalski
> ra...@gruchalski.com <ra...@gruchalski.com>
> de.linkedin.com/in/radgruchalski/
>
>
> *Confidentiality: *This communication is intended for the above-named
> person and may be confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>
> On Friday, 11 March 2016 at 11:20, Walter Heestermans (TME) wrote:
>
> Hi,
>
>
>
> I’m new to using Mesos, and I’d like to study Docker containerization
> inside Mesos.
>
>
>
> Can somebody provide me with some interesting links, including links to
> samples on how to use and configure it,…
>
>
>
> Walter
>
>
>
>
>
> This e-mail may contain confidential information. If you are not an
> addressee or otherwise authorised to receive this message, you should not
> use, copy, disclose or take any action based on this e-mail. If you have
> received this e-mail in error, please inform the sender promptly and delete
> this message and any attachments immediately.
>
>


Re: mesos master 0.27.1 needs g++ to execute ?

2016-03-06 Thread Klaus Ma
Hi KR,

Can you share the error message?


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Sun, Mar 6, 2016 at 8:22 PM, kr di <k...@outlook.com> wrote:

> I have successfully dockerized my compiled version of mesos master for
> version 0.25.  In moving closer to production, I am upgrading to 0.27.1.
> However, in dockerizing 0.27.1, the mesos master seems to need g++ (plus a
> few libraries) to execute, which was not the case with mesos 0.25.  I am not
> speaking of compiling mesos: I successfully downloaded the mesos tar file
> and compiled mesos (0.25 and 0.27.1) on RHEL 7.2.  Mesos master 0.27.1 runs
> without issues on my system with development tools.
>
>
>
> The issue is that I am hesitant to put development tools on a production
> system without understanding why g++ is needed. Could someone briefly
> explain why I need g++ to run mesos 0.27.1 (and maybe why mesos 0.25 did
> not require g++)? And is there a work-around for not having
> development tools (g++) for the mesos master?
>
>
>
> Awesome software btw.
>
>
>
> Thank you,
>
>
>
> KR
>
>
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>


Re: limiting nodes to specific frameworks

2016-02-24 Thread Klaus Ma
The static reservation feature will help, but there's a limitation: Marathon
can only manage resources from one role.
For example, if we use static reservations for two resource groups, "master"
(--resources="cpus(master):16") and "agent" (--resources="cpus(agent):16"),
Marathon (--mesos_roles="master" or --mesos_roles="agent") can only get
resources from one of them: "master" or "agent".

There's an epic to support multiple roles in Mesos (MESOS-1763).


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Wed, Feb 24, 2016 at 10:21 PM, Guangya Liu <gyliu...@gmail.com> wrote:

> You can use static reservation to make your marathon framework run on
> a specific node; the steps are as follows. 0.24 does not support implicit
> roles, so you need to restart the master to add the role flags.
>
> 1) Restart mesos master with role flags, such as --roles=marathon
> 2) Start up the agent on the master node with a static reservation, such as
> --resources="cpus(marathon):16;mem(marathon):24576"
> 3) Start marathon with a role flag: --mesos_roles=marathon, refer to
> https://github.com/mesosphere/marathon/blob/master/docs/docs/command-line-flags.md
> for detail.
>
> Then you can see that your marathon framework will start on the node
> with the static reservation. Hope this helps.
>
> Thanks,
>
> Gaungya
>
> On Wed, Feb 24, 2016 at 9:41 PM, Clarke, Trevor <tcla...@ball.com> wrote:
>
>> I've got a custom framework running in mesos (0.24.1 for now). It
>> supports failover and I'd like to be able to start the framework daemons
>> (scheduler, etc.) from Marathon so I can automatically handle scaling and
>> restart. I'm running a small cluster where the mesos master is also the
>> primary master for zookeeper, marathon, my framework, and other support
>> daemons. I've got a small backup master and want to allow failover to slave
>> nodes as needed.
>>
>> My question, can I run the mesos slave on the master node but restrict it
>> to the Marathon framework only so it won't show up in my framework? Or do I
>> need to run a separate set of mesos daemons to do that? Are there other
>> ways to handle this sort of setup?
>>
>> --
>> Trevor R.H. Clarke
>> Software Engineer, Ball Aerospace
>> (937)320-7087
>>
>>
>>
>>
>> This message and any enclosures are intended only for the addressee.
>> Please
>> notify the sender by email if you are not the intended recipient.  If you
>> are
>> not the intended recipient, you may not use, copy, disclose, or
>> distribute this
>> message or its contents or enclosures to any other person and any such
>> actions
>> may be unlawful.  Ball reserves the right to monitor and review all
>> messages
>> and enclosures sent to or from this email address.
>>
>
>
>
> --
> Guangya Liu (刘光亚)
> Senior Software Engineer
> DCOS and OpenStack Development
> IBM Platform Computing
> Systems and Technology Group
>


Re: Mesos sometimes not allocating the entire cluster

2016-02-22 Thread Klaus Ma
@Tom, one more question: what is your task run time? If the task run time
is too short, e.g. 100ms, the resources will be returned to the allocator
when the task finishes and will not be allocated again until the next
allocation cycle.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
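A back-of-the-envelope illustration of Klaus's point (toy numbers and a toy model, not Mesos code): resources freed by a finished task sit idle until the allocator's next periodic allocation, so very short tasks cap utilization.

```python
import math

# Toy illustration of allocation-cycle latency (not Mesos code):
# freed resources wait for the next allocation cycle boundary, so
# a task much shorter than the interval leaves them mostly idle.

def utilization(task_runtime_s, allocation_interval_s):
    """Fraction of the occupied time the resources are actually busy."""
    # The task runs, then the resources idle until the next cycle boundary.
    cycle = math.ceil(task_runtime_s / allocation_interval_s) * allocation_interval_s
    return task_runtime_s / cycle

print(round(utilization(0.1, 1.0), 2))   # ~10% busy for 100ms tasks
print(round(utilization(10.0, 1.0), 2))  # long tasks amortize the cycle
```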

On Tue, Feb 23, 2016 at 10:25 AM, Guangya Liu <gyliu...@gmail.com> wrote:

> Hi Tom,
>
> I saw that the two frameworks with roles are consuming most of the
> resources, so I think that you can do more testing by removing the two
> frameworks with roles.
>
> Another thing I want to mention is that the DRF allocator may have some
> issues when there are plenty of frameworks; the community is trying to
> improve this with some projects, such as 'Optimistic Offer MESOS-1607',
> 'Quota Enhancement MESOS-1791', etc.
>
> The issues for allocator include the following etc:
> https://issues.apache.org/jira/browse/MESOS-4302
> https://issues.apache.org/jira/browse/MESOS-3202 << You may take a look
> at this one in detail.
> https://issues.apache.org/jira/browse/MESOS-3078
>
> Hope this helps.
>
> Thanks,
>
> Guangya
>
>
> On Tue, Feb 23, 2016 at 1:53 AM, Tom Arnfeld <t...@duedil.com> wrote:
>
>> Hi Guangya,
>>
>> Most of the agents do not have a role, so they use the default wildcard
>> role for resources. Also none of the frameworks have a role, therefore they
>> fall into the wildcard role too.
>>
>> Frameworks are being offered resources *up to a certain level of
>> fairness* but no further. The issue appears to be inside the allocator,
>> relating to how it is deciding how many resources each framework should get
>> within the role (wildcard ‘*') in relation to fairness.
>>
>> We seem to have circumvented the problem in the allocator by creating two 
>> *completely
>> new* roles and putting *one framework in each*. No agents have this role
>> assigned to any resources, but by doing this we seem to have got around the
>> bug in the allocator that’s causing strange fairness allocations, resulting
>> in no offers being sent.
>>
>> I’m going to look into defining a reproducible test case for this
>> scheduling situation to coax the allocator into behaving this way in a test
>> environment.
>>
>> Tom.
>>
>> On 22 Feb 2016, at 15:39, Guangya Liu <gyliu...@gmail.com> wrote:
>>
> If none of the frameworks has a role, then no framework can consume reserved
> resources, so I think that at least the frameworks
> 20160219-164457-67375276-5050-28802-0014 and
> 20160219-164457-67375276-5050-28802-0015 should have roles.
>>
>> Can you please show some detail for the following:
>> 1) Master start command or master http endpoint for flags.
>> 2) All slave start command or slave http endpoint for flags
>> 3) master http endpoint for state
>>
>> Thanks,
>>
>> Guangya
>>
>> On Mon, Feb 22, 2016 at 10:57 PM, Tom Arnfeld <t...@duedil.com> wrote:
>>
>>> Ah yes sorry my mistake, there are a couple of agents with a *dev* role
>>> and only one or two frameworks connect to the cluster with that role, but
>>> not very often. Whether they’re connected or not doesn’t seem to cause any
>>> change in allocation behaviour.
>>>
>>> No other agents have roles.
>>>
>>> 974 2420 I0219 18:08:37.504587 28808 hierarchical.hpp:941] Allocating
>>> ports(*):[3000-5000]; cpus(*):0.5; mem(*):16384; disk(*):51200 on slave
>>> 20160112-174949-84152492-5050-19807-S316 to framework
>>> 20160219-164457-67375276-5050-28802-0014
>>>
>>> This agent should have another 9.5 cpus reserved by some role, and no
>>> framework is configured to use resources from this role, thus the resources
>>> in this role are being wasted.  I think that the following agents may also
>>> have some reserved resources configured:
>>> 20160112-174949-84152492-5050-19807-S317,
>>> 20160112-174949-84152492-5050-19807-S322 and even more agents.
>>>
>>>
>>> I don’t think that’s correct, this is likely to be an offer for a slave
>>> where 9CPUs are currently allocated to an executor.
>>>
>>> I can verify via the agent configuration and HTTP endpoints that most of
>>> the agents do not have a role, and none of the frameworks do.
>>>
>>> Tom.
>>>
>>> On 22 Feb 2016, at 14:09, Guangya Liu <gyliu...@gmail.com> wrote:
>>>
>>> Hi Tom,
>>>
>>> I think that your cluster should have some role, we

Re: Reusing Task IDs

2016-02-21 Thread Klaus Ma
Yes, it's dangerous to reuse TaskIDs; there's a JIRA (MESOS-3070) where the
master will crash on failover with a duplicated TaskID.

Here's the case of *MESOS-3070*:
T1: launch task (t1) on agent (agent_1)
T2: master failover
T3: launch another task (t1) on agent (agent_2) before agent_1
re-registers
T4: agent_1 re-registers; the master will crash because of a `CHECK` when
adding task (t1) back to the master
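The T4 crash can be sketched with a toy model (illustrative Python, not the actual master code; the class and method names are mine): the master keeps one record per TaskID, so when a re-registering agent reports a task whose ID another agent already holds, the uniqueness CHECK fires.

```python
# Toy model of the MESOS-3070 failure mode (not Mesos source): a
# TaskID must be unique in the master's task map, so re-adding a
# duplicate trips the CHECK and, in the real master, aborts it.

class Master:
    def __init__(self):
        self.tasks = {}  # task_id -> agent_id

    def add_task(self, task_id, agent_id):
        # Mirrors the uniqueness CHECK: a TaskID may appear only once.
        assert task_id not in self.tasks, (
            "CHECK failed: duplicate task %s (already on %s)"
            % (task_id, self.tasks[task_id]))
        self.tasks[task_id] = agent_id

master = Master()                      # freshly failed-over master
master.add_task("t1", "agent_2")       # T3: new task reuses the ID "t1"
try:
    master.add_task("t1", "agent_1")   # T4: agent_1 re-registers old "t1"
except AssertionError as e:
    print(e)  # in the real master this is a crash, not a catchable error
```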

Is there any special case where a framework has to reuse a TaskID? If not, I
think we should ask frameworks to avoid reusing TaskIDs.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
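The agent-recovery hazard Erik describes below can be sketched the same way (again a toy model, not the actual slave.cpp logic; the function name is mine): recovery replays each task's recorded status-update stream, and any terminal update in the stream marks the task for termination, even if a later TASK_RUNNING belongs to a new task that reused the TaskID and work directory.

```python
# Sketch of the recovery hazard with reused TaskIDs (illustrative,
# not Mesos source): any terminal update ever recorded under a
# TaskID's task.updates file dooms the task on agent recovery.

TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

def should_terminate_on_recovery(update_stream):
    """True if any update recorded for this TaskID was terminal."""
    return any(u in TERMINAL for u in update_stream)

# task.updates for a reused TaskID: old run finished, new run running.
stream = ["TASK_RUNNING", "TASK_FINISHED", "TASK_RUNNING"]
print(should_terminate_on_recovery(stream))  # True: new task gets killed
```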

On Mon, Feb 22, 2016 at 12:24 PM, Erik Weathers <eweath...@groupon.com>
wrote:

> tldr; *Reusing TaskIDs clashes with the mesos-agent recovery feature.*
>
> Adam Bordelon wrote:
> > Reusing taskIds may work if you're guaranteed to never be running two
> instances of the same taskId simultaneously
>
> I've encountered another scenario where reusing TaskIDs is dangerous, even
> if you meet the guarantee of never running 2 task instances with the same
> TaskID simultaneously.
>
> *Scenario leading to a problem:*
>
> Say you have a task with ID "T1", which terminates for some reason, so its
> terminal status update gets recorded into the agent's current "run" in the
> task's updates file:
>
>
> MESOS_WORK_DIR/meta/slaves/latest/frameworks/FRAMEWORK_ID/executors/EXECUTOR_ID/runs/latest/tasks/T1/task.updates
>
> Then say a new task is launched with the same ID of T1, and it gets
> scheduled under the same Executor on the same agent host. In that case, the
> task will be reusing the same work_dir path, and thus have the already
> recorded "terminal status update" in its task.updates file.  So the updates
> file has a stream of updates that might look like this:
>
>- TASK_RUNNING
>- TASK_FINISHED
>- TASK_RUNNING
>
> Say you subsequently restart the mesos-slave/agent, expecting all tasks to
> survive the restart via the recovery process.  Unfortunately, T1 is
> terminated because the task recovery logic
> <https://github.com/apache/mesos/blob/0.27.0/src/slave/slave.cpp#L5701-L5708> [1]
> looks at the current run's tasks' task.updates files, searching for tasks
> with "terminal status updates", and then terminating any such tasks.  So,
> even though T1 was actually running just fine, it gets terminated because
> at some point in its previous incarnation it had a "terminal status update"
> recorded.
>
> *Leads to inconsistent state*
>
> Compounding the problem, this termination is done without informing the
> Executor, and thus the process underlying the task continues to run, even
> though mesos thinks it's gone.  Which is really bad since it leaves the
> host with a different state than mesos thinks exists. e.g., if the task had
> a port resource, then mesos incorrectly thinks the port is now free, so a
> framework might try to launch a task/executor that uses the port, but it
> will fail because the process cannot bind to the port.
>
> *Change recovery code or just update comments in mesos.proto?*
>
> Perhaps this behavior could be considered a "bug" and the recovery logic
> that processes tasks status updates could be modified to ignore "terminal
> status updates" if there is a subsequent TASK_RUNNING update in the
> task.updates file.  If that sounds like a desirable change, I'm happy to
> file a JIRA issue for that and work on the fix myself.
>
> If we think the recovery logic is fine as it is, then we should update these
> comments
> <https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66> [2]
> in mesos.proto since they are incorrect given the behavior I just
> encountered:
>
> A framework generated ID to distinguish a task. The ID must remain
>> unique while the task is active. However, a framework can reuse an
>> ID _only_ if a previous task with the same ID has reached a
>> terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.).
>
>
> *Conclusion*
>
> It is dangerous indeed to reuse a TaskID for separate task runs, even if
> they are guaranteed to not be running concurrently.
>
> - Erik
>
>
> P.S., I encountered this problem while trying to use mesos-agent recovery
> with the storm-mesos framework <https://github.com/mesos/storm> [3].
> Notably, this framework sets the TaskID to
> "-" for the storm worker tasks, so when a
> storm worker dies and is reborn on that host, the TaskID gets reused.  But
> then the task doesn't survive an agent restart (even though the worker
> *process* 
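The recovery check Erik describes, and his proposed fix, can be sketched in plain Python (a toy model of the agent replaying a task.updates stream — not the real C++ recovery code):

```python
# Illustrative model of the recovery behavior described above: the agent
# scans each task's recorded status updates and treats the task as dead if
# ANY terminal update appears in the stream -- even when a later
# TASK_RUNNING (from a re-used TaskID) follows it.

TERMINAL = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

def survives_recovery(updates):
    # Sketch of the current behavior: one terminal update anywhere
    # in the stream => the task is terminated on recovery.
    return not any(u in TERMINAL for u in updates)

def survives_recovery_fixed(updates):
    # Sketch of the proposed fix: only the LAST recorded update decides.
    return bool(updates) and updates[-1] not in TERMINAL

stream = ["TASK_RUNNING", "TASK_FINISHED", "TASK_RUNNING"]  # re-used TaskID
print(survives_recovery(stream))        # False: task is wrongly terminated
print(survives_recovery_fixed(stream))  # True: task keeps running
```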

RE: Mesos sometimes not allocating the entire cluster

2016-02-20 Thread Klaus Ma
Hi Tom,

What's the allocation interval? Can you try reducing the frameworks' filter
timeout?
According to the log, there are ~12 frameworks on a cluster with ~42 agents;
the filter duration is 5 sec, and there are ~60 filter events per second
(e.g. 65 at 18:08:34). For example, framework
(20160219-164457-67375276-5050-28802-0015) only got resources from 6 agents
and filtered the other 36 agents at 18:08:35 (egrep "Alloca|Filtered"
mesos-master.log | grep "20160219-164457-67375276-5050-28802-0015" | grep
"18:08:35").
Thanks,
Klaus
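The filter mechanism at play here can be put in toy form (plain Python, not the allocator's C++): when a framework declines an agent's offer with refuse_seconds=T, the allocator skips that agent for that framework until T elapses, so a long timeout widens the window in which resources sit idle.

```python
# Toy model of the allocator's decline filters (illustrative only).

class Filters:
    def __init__(self):
        self.until = {}  # (framework, agent) -> time the filter expires

    def decline(self, fw, agent, now, refuse_seconds):
        # Declining an offer hides this agent from this framework
        # for refuse_seconds.
        self.until[(fw, agent)] = now + refuse_seconds

    def is_filtered(self, fw, agent, now):
        return now < self.until.get((fw, agent), 0)

f = Filters()
f.decline("fw-0015", "S316", now=0.0, refuse_seconds=5.0)
print(f.is_filtered("fw-0015", "S316", now=3.0))  # True: still hidden
print(f.is_filtered("fw-0015", "S316", now=6.0))  # False: re-offerable
# A shorter refuse_seconds shrinks the window during which the agent
# cannot be re-offered to that framework.
```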
From: t...@duedil.com
Subject: Re: Mesos sometimes not allocating the entire cluster
Date: Sat, 20 Feb 2016 16:36:54 +
To: user@mesos.apache.org

Hi Guangya,
Indeed we have about ~45 agents. I’ve attached the log from the master…


Hope there’s something here that highlights the issue, we can’t find anything 
that we can’t explain.
Cheers,
Tom.


On 19 Feb 2016, at 03:02, Guangya Liu  wrote:

Hi Tom,
After the patch was applied, there is no need to restart the frameworks, only 
the mesos master.
One question: from your log it seems your cluster has at least 36 agents, 
right? I ask because if there are more frameworks than agents, frameworks with 
low weight may sometimes not be able to get resources.
Can you please enable GLOG_v=2 on the mesos master for a while and put the log 
somewhere for us to check? (Do not enable this for long, or you will get 
flooded with log messages.) These log messages may shed some light on your 
problem.
There is also another patch in flight that addresses a separate allocator 
performance issue; it may not help you much, but you can still take a look: 
https://issues.apache.org/jira/browse/MESOS-4694
Thanks,
Guangya
On Fri, Feb 19, 2016 at 2:19 AM, Tom Arnfeld  wrote:
Hi Ben,
We've rolled that patch out (applied over 0.23.1) on our production cluster and 
have seen little change, the master is still not sending any offers to those 
frameworks. We did this upgrade online, so would there be any reason the fix 
wouldn't have helped (other than it not being the cause)? Would we need to 
restart the frameworks (so they get new IDs) to see the effect?
It's not that the master is never sending them offers, it's that it does it up 
to a certain point... for different types of frameworks (all using libmesos) 
but then no more, regardless of how much free resource is available... the free 
resources are offered to some frameworks, but not all. Is there any way for us 
to do more introspection into the state of the master / allocator to try and 
debug? Right now we're at a bit of a loss of where to start diving in...
Much appreciated as always,
Tom.
On 18 February 2016 at 10:21, Tom Arnfeld  wrote:
Hi Ben,
I've only just seen your email! Really appreciate the reply, that's certainly 
an interesting bug and we'll try that patch and see how we get on.
Cheers,
Tom.
On 29 January 2016 at 19:54, Benjamin Mahler  wrote:
Hi Tom,
I suspect you may be tripping the following 
issue:https://issues.apache.org/jira/browse/MESOS-4302

Please have a read through this and see if it applies here. You may also be 
able to apply the fix to your cluster to see if that helps things.
Ben
On Wed, Jan 20, 2016 at 10:19 AM, Tom Arnfeld  wrote:
Hey,
I've noticed some interesting behaviour recently when we have lots of different 
frameworks connected to our Mesos cluster at once, all using a variety of 
different shares. Some of the frameworks don't get offered more resources (for 
long periods of time, hours even) leaving the cluster under utilised.
Here's an example state where we see this happen..
Framework 1 - 13% (user A)
Framework 2 - 22% (user B)
Framework 3 - 4% (user C)
Framework 4 - 0.5% (user C)
Framework 5 - 1% (user C)
Framework 6 - 1% (user C)
Framework 7 - 1% (user C)
Framework 8 - 0.8% (user C)
Framework 9 - 11% (user D)
Framework 10 - 7% (user C)
Framework 11 - 1% (user C)
Framework 12 - 1% (user C)
Framework 13 - 6% (user E)
In this example, there's another ~30% of the cluster that is unallocated, and 
it stays like this for a significant amount of time until something changes, 
perhaps another user joins and allocates the rest. Chunks of this spare 
resource are offered to some of the frameworks, but not all of them.
I had always assumed that when lots of frameworks were involved, eventually the 
frameworks that would keep accepting resources indefinitely would consume the 
remaining resource, as every other framework had rejected the offers.
Could someone elaborate a little on how the DRF allocator / sorter handles this 
situation, is this likely to be related to the different users being used? Is 
there a way to mitigate this?
We're running version 0.23.1.
Cheers,
Tom.
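The DRF ordering Tom asks about can be sketched in a few lines (simplified: Mesos 0.23 actually sorts users first and then each user's frameworks, and the numbers below are made up). The sorter ranks clients by dominant share — the maximum, over all resources, of allocated/total — and offers to the lowest first.

```python
# Simplified DRF sorter (illustrative only).

def dominant_share(allocated, total):
    # A client's dominant share is its largest fractional share
    # of any single resource.
    return max(allocated.get(r, 0.0) / total[r] for r in total)

total = {"cpus": 100.0, "mem": 1000.0}
frameworks = {
    "fw1": {"cpus": 13.0, "mem": 50.0},  # dominant share 0.13
    "fw2": {"cpus": 22.0, "mem": 40.0},  # dominant share 0.22
    "fw4": {"cpus": 0.5, "mem": 5.0},    # dominant share 0.005
}
# The allocator offers to the lowest dominant share first.
order = sorted(frameworks, key=lambda f: dominant_share(frameworks[f], total))
print(order)  # ['fw4', 'fw1', 'fw2']
```

Declined resources re-enter this cycle only after the decline filter expires, which is why low-share frameworks can keep receiving (and re-declining) the same offers while others wait.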








-- 
Guangya Liu (刘光亚)
Senior Software Engineer
DCOS and OpenStack Development
IBM Platform Computing
Systems and Technology Group




Re: Specifying a preferred host with a Resource Request

2016-02-08 Thread Klaus Ma
me offers for *every* host (where resources
>>>are available). Or, all available offers are broadcasted to all 
>>> frameworks.
>>>
>>> Are there alternatives that I can use to support this use case and ensure
>>> that the wait time for an available resource is limited (say about a minute
>>> or two)? It can still be a best-effort guarantee and not a strict one.
>>>
>>>
>>>
>>> Thanks again,
>>> Jagadish
>>>
>>> --
>>> Jagadish
>>>
>>>
>>>
>>> On Fri, Feb 5, 2016 at 6:46 PM, Guangya Liu <gyliu...@gmail.com> wrote:
>>>
>>>> Hi Jagadish,
>>>>
>>>> Even though Mesos has the interface "requestResources", it is not
>>>> implemented in the built-in allocator at the moment, so the call
>>>> "driver.requestResources(resources);" will not work.
>>>>
>>>> Is it possible for you to update your framework logic as follows:
>>>> 1) the framework gets resource offers from the mesos master
>>>> 2) the framework filters the resource offers based on its preferences
>>>>
>>>> The problem with such a solution is that the framework sometimes may not
>>>> get its preferred resources if the preferred resources were offered to
>>>> other frameworks.
>>>>
>>>> Can you please file a JIRA ticket requesting implementation of the
>>>> "requestResources" API?
>>>> It would be great if you can append some background for your request so
>>>> that the community can evaluate how to move this forward.
>>>>
>>>> Thanks,
>>>>
>>>> Guangya
>>>>
>>>>
>>>> On Sat, Feb 6, 2016 at 6:45 AM, Jagadish Venkatraman <
>>>> jagadish1...@gmail.com> wrote:
>>>>
>>>>> I have fair experience in writing frameworks on Yarn. In the Yarn
>>>>> world,
>>>>> the amClient supports a method where I can specify the preferredHost
>>>>> with
>>>>> the resource request.
>>>>>
>>>>> Is there a way to specify a preferred host with the resource request in
>>>>> Mesos?
>>>>>
>>>>> I currently do:
>>>>>
>>>>> driver.requestResources (resources);
>>>>>
>>>>> I don't find a way to associate a preferred hostname with a resource
>>>>> request. A code sample will be really helpful. (for example, I want 1G
>>>>> mem,
>>>>> 1 cpu core preferably on host: xyz.aws.com)
>>>>>
>>>>> Thanks,
>>>>> Jagadish
>>>>>
>>>>> --
>>>>> Jagadish V,
>>>>> Graduate Student,
>>>>> Department of Computer Science,
>>>>> Stanford University
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Jagadish V,
>>> Graduate Student,
>>> Department of Computer Science,
>>> Stanford University
>>>
>>>
>>
>
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Advisory Software Engineer
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: Managing Persistency via Frameworks (HDFS, Cassandra)

2016-02-08 Thread Klaus Ma
Hi Andreas,

I think Mesosphere has done some work on these questions; could you check the
related repos at https://github.com/mesosphere ?


On Mon, Feb 8, 2016 at 9:43 PM Andreas Fritzler <andreas.fritz...@gmail.com>
wrote:

> Hi,
>
> I have a couple of questions around the persistency topic within a Mesos
> cluster:
>
> 1. Any takes on the quality of the HDFS [1] and the Cassandra [2]
> frameworks? Does anybody have any experiences in running those frameworks
> in production?
>
> 2. How well are those frameworks performing if I want to use them to
> separate tenants on one Mesos cluster? (HDFS is not dockerized yet?)
>
> 3. How about scaling out/down existing framework instances? Is that even
> possible? Couldn't find anything in the docs/github.
>
> 4. Upgrading a running instance: wondering how that is managed in those
> frameworks. There is an open issue for the HDFS [3] part. For cassandra the
> scheduler update seems to be smooth, however changing the underlying
> Cassandra version seems to be tricky [4].
>
> Regards,
> Andreas
>
> [1] https://github.com/mesosphere/hdfs
> [2] https://github.com/mesosphere/cassandra-mesos
> [3] https://github.com/mesosphere/hdfs/issues/23
> [4] https://github.com/mesosphere/cassandra-mesos/issues/137
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Advisory Software Engineer
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: Specifying a preferred host with a Resource Request

2016-02-06 Thread Klaus Ma
Hi Jagadish,

This is the use case for dynamic reservation and persistent volumes :).
Here's the related document:

Reservation: http://mesos.apache.org/documentation/latest/reservation/
Persistent Volume:
http://mesos.apache.org/documentation/latest/persistent-volume/

Thanks
Klaus
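As a rough illustration of the reservation doc linked above, a dynamic reservation is expressed as a resources payload posted (along with an agent ID) to the master's /reserve operator endpoint. The sketch below just builds that JSON shape; field names follow the reservation documentation, while the role and principal values are hypothetical.

```python
import json

# Hedged sketch of the resources payload for dynamically reserving
# cpus/mem for a role on a specific agent via the master's /reserve
# endpoint. The role and principal below are made-up examples.

def reserve_payload(role, principal, cpus, mem_mb):
    return [
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": cpus},
         "role": role, "reservation": {"principal": principal}},
        {"name": "mem", "type": "SCALAR", "scalar": {"value": mem_mb},
         "role": role, "reservation": {"principal": principal}},
    ]

body = reserve_payload("samza", "samza-principal", 1.0, 1024.0)
print(json.dumps(body, indent=2))
# POSTed as form fields: slaveId=<agent id>&resources=<this JSON>
```

Once reserved, those resources are only offered to frameworks registered under the role, which is what gives a stateful framework its "come back to the same host" guarantee.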

On Sun, Feb 7, 2016 at 2:31 AM Jagadish Venkatraman <jagadish1...@gmail.com>
wrote:

> Hi Guangya,
>
> Thanks for the response! Let me provide more background to this request.
>
> *Background:*
> I work on Apache Samza <http://samza.apache.org> , a distributed stream
> processing framework. Currently Samza supports only Yarn as a resource
> manager. (there have been requests to run Samza with mesos). A cluster (200
> nodes 'ish) runs many Samza Jobs (about 3500). Each Samza Job has its own
> framework that requests resources (containers) for the job to run. Each
> such container uses GBs of local state
> <http://radar.oreilly.com/2014/07/why-local-state-is-a-fundamental-primitive-in-stream-processing.html>
>   .
> When such a container(resource) is started on a different host by the
> framework, the local state must be re-bootstrapped.  (this results in a
> long bootstrap time, which is essentially down time).
>
> The same is true for Apache Kafka <http://kafka.apache.org/>, a
> distributed pub-sub logging system.  When a Kafka broker must be restarted
> by the framework, it should ideally be re-started on the same host.
> (otherwise, each broker has to re-bootstrap several GBs of logs from its
> peers before it can start to service a request.)
>
> I'm sure many stateful services have similar requirements.
>
> >> Is it possible that you update your framework logic as this:
> 1) the framework gets resource offers from the mesos master
> 2) the framework filters the resource offers based on its preferences
>
> I can certainly do that. But, here's my concern:
>
>- Is the offering of resources to frameworks round-robin across the
>available pool of hosts? I want to ensure that the wait time for a
>resource is bounded.
>- Are there tunables that we can set to be more 'fair' (in terms of
>variety of hosts) when offers are made? For example, every framework
>will receive at least some offers for *every* host (where resources are
>available). Or, all available offers are broadcast to all frameworks.
>
> Are there alternatives that I can use to support this use case and ensure
> that the wait time for an available resource is limited (say about a minute
> or two)? It can still be a best-effort guarantee and not a strict one.
>
>
>
> Thanks again,
> Jagadish
>
> --
> Jagadish
>
>
>
> On Fri, Feb 5, 2016 at 6:46 PM, Guangya Liu <gyliu...@gmail.com> wrote:
>
>> Hi Jagadish,
>>
>> Even though Mesos has the interface "requestResources", it is not
>> implemented in the built-in allocator at the moment, so the call
>> "driver.requestResources(resources);" will not work.
>>
>> Is it possible for you to update your framework logic as follows:
>> 1) the framework gets resource offers from the mesos master
>> 2) the framework filters the resource offers based on its preferences
>>
>> The problem with such a solution is that the framework sometimes may not
>> get its preferred resources if the preferred resources were offered to
>> other frameworks.
>>
>> Can you please file a JIRA ticket requesting implementation of the
>> "requestResources" API?
>> It would be great if you can append some background for your request so
>> that the community can evaluate how to move this forward.
>>
>> Thanks,
>>
>> Guangya
>>
>>
>> On Sat, Feb 6, 2016 at 6:45 AM, Jagadish Venkatraman <
>> jagadish1...@gmail.com> wrote:
>>
>>> I have fair experience in writing frameworks on Yarn. In the Yarn world,
>>> the amClient supports a method where I can specify the preferredHost with
>>> the resource request.
>>>
>>> Is there a way to specify a preferred host with the resource request in
>>> Mesos?
>>>
>>> I currently do:
>>>
>>> driver.requestResources (resources);
>>>
>>> I don't find a way to associate a preferred hostname with a resource
>>> request. A code sample will be really helpful. (for example, I want 1G
>>> mem,
>>> 1 cpu core preferably on host: xyz.aws.com)
>>>
>>> Thanks,
>>> Jagadish
>>>
>>> --
>>> Jagadish V,
>>> Graduate Student,
>>> Department of Computer Science,
>>> Stanford University
>>>
>>
>>
>
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>
-- 

Regards,

Da (Klaus), Ma (马达), PMP® | Advisory Software Engineer
IBM Platform Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me


Re: Mesos sometimes not allocating the entire cluster

2016-01-22 Thread Klaus Ma
Can you share the whole log of the master? It'll be helpful :).


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Thu, Jan 21, 2016 at 11:57 PM, Tom Arnfeld <t...@duedil.com> wrote:

> Guangya - Nope, there's no outstanding offers for any frameworks, the ones
> that are getting offers are responding properly.
>
> Klaus - This was just a sample of logs for a single agent, the cluster has
> at  least ~40 agents at any one time.
>
> On 21 January 2016 at 15:20, Guangya Liu <gyliu...@gmail.com> wrote:
>
>> Can you please help check if some outstanding offers in cluster which
>> does not accept by any framework? You can check this via the endpoint of
>> /master/state.json endpoint.
>>
>> If there are some outstanding offers, you can start the master with a
>> offer_timeout flag to let master rescind some offers if those offers are
>> not accepted by framework.
>>
>> Cited from
>> https://github.com/apache/mesos/blob/master/docs/configuration.md
>>
>> --offer_timeout=VALUE Duration of time before an offer is rescinded from
>> a framework.
>>
>> This helps fairness when running frameworks that hold on to offers, or
>> frameworks that accidentally drop offers.
>>
>> Thanks,
>>
>> Guangya
>>
>> On Thu, Jan 21, 2016 at 9:44 PM, Tom Arnfeld <t...@duedil.com> wrote:
>>
>>> Hi Klaus,
>>>
>>> Sorry I think I explained this badly, these are the logs for one slave
>>> (that's empty) and we can see that it is making offers to some frameworks.
>>> In this instance, the Hadoop framework (and others) are not among those
>>> getting any offers, they get offered nothing. The allocator is deciding to
>>> send offers in a loop to a certain set of frameworks, starving others.
>>>
>>> On 21 January 2016 at 13:17, Klaus Ma <klaus1982...@gmail.com> wrote:
>>>
>>>> Yes, it seems the Hadoop framework did not consume all offered resources:
>>>> if a framework launches a task (1 CPU) on an offer (10 CPUs), the other 9
>>>> CPUs are returned to the master (recoverResources).
>>>>
>>>> 
>>>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>>>> Platform OpenSource Technology, STG, IBM GCG
>>>> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>>>>
>>>> On Thu, Jan 21, 2016 at 6:46 PM, Tom Arnfeld <t...@duedil.com> wrote:
>>>>
>>>>> Thanks everyone!
>>>>>
>>>>> Stephan - There's a couple of useful points there, will definitely
>>>>> give it a read.
>>>>>
>>>>> Klaus - Thanks, we're running a bunch of different frameworks, in that
>>>>> list there's Hadoop MRv1, Apache Spark, Marathon and a couple of home 
>>>>> grown
>>>>> frameworks we have. In this particular case the Hadoop framework is the
>>>>> major concern, as it's designed to continually accept offers until it has
>>>>> enough slots it needs. With the example I gave above, we observe that the
>>>>> master is never sending any sizeable offers to some of these frameworks
>>>>> (the ones with the larger shares), which is where my confusion stems from.
>>>>>
>>>>> I've attached a snippet of our active master logs which show the
>>>>> activity for a single slave (which has no active executors). We can see
>>>>> that it's cycling though sending and recovering declined offers from a
>>>>> selection of different frameworks (in order) but I can say that not all of
>>>>> the frameworks are receiving these offers, in this case that's the Hadoop
>>>>> framework.
>>>>>
>>>>>
>>>>> On 21 January 2016 at 00:26, Klaus Ma <klaus1982...@gmail.com> wrote:
>>>>>
>>>>>> Hi Tom,
>>>>>>
>>>>>> Which framework are you using, e.g. Swarm, Marathon or something
>>>>>> else? and which language package are you using?
>>>>>>
>>>>>> DRF will sort role/framework by allocation ratio, and offer all
>>>>>> "available" resources by slave; but if the resources it too small (<
>>>>>> 0.1CPU) or the resources was reject/declined by framework, the resources
>>>>>> will not offer it until filter timeout. For example, in Swarm 1.0, the
>>>>>> defa

Re: Mesos sometimes not allocating the entire cluster

2016-01-21 Thread Klaus Ma
Yes, it seems the Hadoop framework did not consume all offered resources: if a
framework launches a task (1 CPU) on an offer (10 CPUs), the other 9 CPUs are
returned to the master (recoverResources).
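The accounting here can be put in toy form (plain Python, not Mesos code): accepting an offer with a smaller task implicitly declines the remainder, which goes back to the allocator.

```python
# Toy model of offer acceptance: launching a 1-CPU task against a
# 10-CPU offer returns the unused 9 CPUs to the allocator
# (the recoverResources step mentioned above).

def accept_offer(offer_cpus, task_cpus):
    assert task_cpus <= offer_cpus, "task cannot exceed the offer"
    used = task_cpus
    recovered = offer_cpus - task_cpus  # goes back to the allocator
    return used, recovered

used, recovered = accept_offer(10.0, 1.0)
print(used, recovered)  # 1.0 9.0
```

The recovered portion is then subject to the framework's decline filter before it can be re-offered, which is where the 5s delay discussed in this thread comes from.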


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Thu, Jan 21, 2016 at 6:46 PM, Tom Arnfeld <t...@duedil.com> wrote:

> Thanks everyone!
>
> Stephan - There's a couple of useful points there, will definitely give it
> a read.
>
> Klaus - Thanks, we're running a bunch of different frameworks, in that
> list there's Hadoop MRv1, Apache Spark, Marathon and a couple of home grown
> frameworks we have. In this particular case the Hadoop framework is the
> major concern, as it's designed to continually accept offers until it has
> enough slots it needs. With the example I gave above, we observe that the
> master is never sending any sizeable offers to some of these frameworks
> (the ones with the larger shares), which is where my confusion stems from.
>
> I've attached a snippet of our active master logs which show the activity
> for a single slave (which has no active executors). We can see that it's
> cycling though sending and recovering declined offers from a selection of
> different frameworks (in order) but I can say that not all of the
> frameworks are receiving these offers, in this case that's the Hadoop
> framework.
>
>
> On 21 January 2016 at 00:26, Klaus Ma <klaus1982...@gmail.com> wrote:
>
>> Hi Tom,
>>
>> Which framework are you using, e.g. Swarm, Marathon or something else?
>> and which language package are you using?
>>
>> DRF will sort role/framework by allocation ratio, and offer all
>> "available" resources by slave; but if the resources it too small (<
>> 0.1CPU) or the resources was reject/declined by framework, the resources
>> will not offer it until filter timeout. For example, in Swarm 1.0, the
>> default filter timeout 5s (because of go scheduler API); so here is case
>> that may impact the utilisation: the Swarm got one slave with 16 CPUS, but
>> only launch one container with 1 CPUS; the other 15 CPUS will return back
>>  to master and did not re-offer until filter timeout (5s).
>> I had pull a request to make Swarm's parameters configurable, refer to
>> https://github.com/docker/swarm/pull/1585. I think you can check this
>> case by master log.
>>
>> If any comments, please let me know.
>>
>> 
>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>> Platform OpenSource Technology, STG, IBM GCG
>> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>>
>> On Thu, Jan 21, 2016 at 2:19 AM, Tom Arnfeld <t...@duedil.com> wrote:
>>
>>> Hey,
>>>
>>> I've noticed some interesting behaviour recently when we have lots of
>>> different frameworks connected to our Mesos cluster at once, all using a
>>> variety of different shares. Some of the frameworks don't get offered more
>>> resources (for long periods of time, hours even) leaving the cluster under
>>> utilised.
>>>
>>> Here's an example state where we see this happen..
>>>
>>> Framework 1 - 13% (user A)
>>> Framework 2 - 22% (user B)
>>> Framework 3 - 4% (user C)
>>> Framework 4 - 0.5% (user C)
>>> Framework 5 - 1% (user C)
>>> Framework 6 - 1% (user C)
>>> Framework 7 - 1% (user C)
>>> Framework 8 - 0.8% (user C)
>>> Framework 9 - 11% (user D)
>>> Framework 10 - 7% (user C)
>>> Framework 11 - 1% (user C)
>>> Framework 12 - 1% (user C)
>>> Framework 13 - 6% (user E)
>>>
>>> In this example, there's another ~30% of the cluster that is
>>> unallocated, and it stays like this for a significant amount of time until
>>> something changes, perhaps another user joins and allocates the rest.
>>> Chunks of this spare resource are offered to some of the frameworks, but
>>> not all of them.
>>>
>>> I had always assumed that when lots of frameworks were involved,
>>> eventually the frameworks that would keep accepting resources indefinitely
>>> would consume the remaining resource, as every other framework had rejected
>>> the offers.
>>>
>>> Could someone elaborate a little on how the DRF allocator / sorter
>>> handles this situation, is this likely to be related to the different users
>>> being used? Is there a way to mitigate this?
>>>
>>> We're running version 0.23.1.
>>>
>>> Cheers,
>>>
>>> Tom.
>>>
>>
>>
>


Re: Mesos sometimes not allocating the entire cluster

2016-01-21 Thread Klaus Ma
Do you mean that the one slave is offered to some frameworks while the others
are starving?
The Mesos allocator (DRF) offers resources host by host; so if there's only
one host, the other frameworks cannot get resources. We have several JIRAs on
how to balance resources between frameworks.



Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Thu, Jan 21, 2016 at 9:44 PM, Tom Arnfeld <t...@duedil.com> wrote:

> Hi Klaus,
>
> Sorry I think I explained this badly, these are the logs for one slave
> (that's empty) and we can see that it is making offers to some frameworks.
> In this instance, the Hadoop framework (and others) are not among those
> getting any offers, they get offered nothing. The allocator is deciding to
> send offers in a loop to a certain set of frameworks, starving others.
>
> On 21 January 2016 at 13:17, Klaus Ma <klaus1982...@gmail.com> wrote:
>
>> Yes, it seems the Hadoop framework did not consume all offered resources:
>> if a framework launches a task (1 CPU) on an offer (10 CPUs), the other 9
>> CPUs are returned to the master (recoverResources).
>>
>> 
>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>> Platform OpenSource Technology, STG, IBM GCG
>> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>>
>> On Thu, Jan 21, 2016 at 6:46 PM, Tom Arnfeld <t...@duedil.com> wrote:
>>
>>> Thanks everyone!
>>>
>>> Stephan - There's a couple of useful points there, will definitely give
>>> it a read.
>>>
>>> Klaus - Thanks, we're running a bunch of different frameworks, in that
>>> list there's Hadoop MRv1, Apache Spark, Marathon and a couple of home grown
>>> frameworks we have. In this particular case the Hadoop framework is the
>>> major concern, as it's designed to continually accept offers until it has
>>> enough slots it needs. With the example I gave above, we observe that the
>>> master is never sending any sizeable offers to some of these frameworks
>>> (the ones with the larger shares), which is where my confusion stems from.
>>>
>>> I've attached a snippet of our active master logs which show the
>>> activity for a single slave (which has no active executors). We can see
>>> that it's cycling though sending and recovering declined offers from a
>>> selection of different frameworks (in order) but I can say that not all of
>>> the frameworks are receiving these offers, in this case that's the Hadoop
>>> framework.
>>>
>>>
>>> On 21 January 2016 at 00:26, Klaus Ma <klaus1982...@gmail.com> wrote:
>>>
>>>> Hi Tom,
>>>>
>>>> Which framework are you using, e.g. Swarm, Marathon or something else?
>>>> and which language package are you using?
>>>>
>>>> DRF will sort role/framework by allocation ratio, and offer all
>>>> "available" resources by slave; but if the resources it too small (<
>>>> 0.1CPU) or the resources was reject/declined by framework, the resources
>>>> will not offer it until filter timeout. For example, in Swarm 1.0, the
>>>> default filter timeout 5s (because of go scheduler API); so here is case
>>>> that may impact the utilisation: the Swarm got one slave with 16 CPUS, but
>>>> only launch one container with 1 CPUS; the other 15 CPUS will return back
>>>>  to master and did not re-offer until filter timeout (5s).
>>>> I had pull a request to make Swarm's parameters configurable, refer to
>>>> https://github.com/docker/swarm/pull/1585. I think you can check this
>>>> case by master log.
>>>>
>>>> If any comments, please let me know.
>>>>
>>>> 
>>>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>>>> Platform OpenSource Technology, STG, IBM GCG
>>>> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>>>>
>>>> On Thu, Jan 21, 2016 at 2:19 AM, Tom Arnfeld <t...@duedil.com> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> I've noticed some interesting behaviour recently when we have lots of
>>>>> different frameworks connected to our Mesos cluster at once, all using a
>>>>> variety of different shares. Some of the frameworks don't get offered more
>>>>> resources (for long periods of time, hours even) leaving the cluster under
>>>>> utilised.
>>>>>

Re: Mesos sometimes not allocating the entire cluster

2016-01-20 Thread Klaus Ma
Hi Tom,

Which framework are you using, e.g. Swarm, Marathon or something else? and
which language package are you using?

DRF sorts roles/frameworks by allocation ratio and offers all "available"
resources slave by slave; but if the resources are too small (< 0.1 CPU) or
were rejected/declined by the framework, they will not be offered again until
the filter times out. For example, in Swarm 1.0 the default filter timeout is
5s (because of the Go scheduler API); so here is a case that may hurt
utilisation: Swarm gets one slave with 16 CPUs but only launches one container
with 1 CPU; the other 15 CPUs are returned to the master and are not
re-offered until the filter times out (5s).
I have a pull request to make Swarm's parameters configurable; refer to
https://github.com/docker/swarm/pull/1585. I think you can check this case
in the master log.

If any comments, please let me know.

----
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Thu, Jan 21, 2016 at 2:19 AM, Tom Arnfeld <t...@duedil.com> wrote:

> Hey,
>
> I've noticed some interesting behaviour recently when we have lots of
> different frameworks connected to our Mesos cluster at once, all using a
> variety of different shares. Some of the frameworks don't get offered more
> resources (for long periods of time, hours even) leaving the cluster under
> utilised.
>
> Here's an example state where we see this happen..
>
> Framework 1 - 13% (user A)
> Framework 2 - 22% (user B)
> Framework 3 - 4% (user C)
> Framework 4 - 0.5% (user C)
> Framework 5 - 1% (user C)
> Framework 6 - 1% (user C)
> Framework 7 - 1% (user C)
> Framework 8 - 0.8% (user C)
> Framework 9 - 11% (user D)
> Framework 10 - 7% (user C)
> Framework 11 - 1% (user C)
> Framework 12 - 1% (user C)
> Framework 13 - 6% (user E)
>
> In this example, there's another ~30% of the cluster that is unallocated,
> and it stays like this for a significant amount of time until something
> changes, perhaps another user joins and allocates the rest, or chunks of
> this spare resource are offered to some of the frameworks, but not all of
> them.
>
> I had always assumed that when lots of frameworks were involved,
> eventually the frameworks that would keep accepting resources indefinitely
> would consume the remaining resource, as every other framework had rejected
> the offers.
>
> Could someone elaborate a little on how the DRF allocator / sorter handles
> this situation, is this likely to be related to the different users being
> used? Is there a way to mitigate this?
>
> We're running version 0.23.1.
>
> Cheers,
>
> Tom.
>


Re: Share GPU resources via attributes or as custom resources (INTERNAL)

2016-01-15 Thread Klaus Ma
Yes, "attributes" is the way for now.
But once Marathon supports Mesos' multiple roles (MESOS-1763), you will be
able to use role information to define resource groups.
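As a sketch of the attribute-based approach (the attribute name `hasGpu`
comes from this thread; the Marathon app definition below is illustrative
and untested, not a config I have verified): start the agent with
`--attributes="hasGpu:true"`, then constrain the app to matching agents.

```json
{
  "id": "/gpu-job",
  "cpus": 1,
  "mem": 1024,
  "instances": 1,
  "constraints": [["hasGpu", "LIKE", "true"]]
}
```

Note that attributes only steer placement; they do not count GPUs down, so
more jobs than GPUs can still land on the machine.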


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Fri, Jan 15, 2016 at 4:22 PM, <humberto.caste...@telenor.com> wrote:

> Thanks Haosdent,
>
>
>
> If what you say about Marathon is right (i.e., that Marathon’s constraints
> only work with Mesos’ attributes), then I cannot use --resources="gpu(*):4",
> since I have no way in Marathon to specify my job needs a GPU resource (at
> least using the web interface), right?
>
>
>
> I guess I will have to experiment with attributes.
>
>
>
> Cheers,
>
> Humberto
>
>
>
>
>
>
>
> *From:* haosdent [mailto:haosd...@gmail.com]
> *Sent:* 14. januar 2016 19:07
> *To:* user
> *Subject:* Re: Share GPU resources via attributes or as custom resources
> (INTERNAL)
>
>
>
> >Then, if a job is sent to the machine when the 4 GPUs are already busy,
> the job will fail to start, right?
>
> I not sure this. But if job fail, Marathon would retry as you said.
>
>
>
> >a job is sent to the machine, all 4 GPUs will become busy
>
> If you specify your task only use 1 gpu in resources field. I think Mesos
> could continue provide offers which have gpu. And I remember
> Marathon constraints only could work with --attributes.
>
>
>
> On Fri, Jan 15, 2016 at 1:02 AM, <humberto.caste...@telenor.com> wrote:
>
> I have a machine with 4 GPUs and want to use Mesos+Marathon to schedule
> the jobs to be run in the machine. Each job will use maximum 1 GPU and
> sharing 1 GPU between small jobs would be ok.
> I know Mesos does not directly support GPUs, but it seems I might use
> custom resources or attributes to do what I want. But how exactly should
> this be done?
>
> If I use --attributes="hasGpu:true", would a job be sent to the machine
> when another job is already running in the machine (and only using 1 GPU)?
> I would say all jobs requesting a machine with a hasGpu attribute would be
> sent to the machine (as long as it has free CPU and memory resources).
> Then, if a job is sent to the machine when the 4 GPUs are already busy, the
> job will fail to start, right? Could then Marathon be used to re-send the
> job after some time, until it is accepted by the machine?
>
> If I specify --resources="gpu(*):4", it is my understanding that once a
> job is sent to the machine, all 4 GPUs will become busy to the eyes of
> Mesos (even if this is not really true). If that is right, would this
> work-around work: specify 4 different resources: gpu:A, gpu:B, gpu:C and
> gpu:D; and use constraints in Marathon like this  "constraints": [["gpu",
> "LIKE", " [A-D]"]]?
>
> Cheers
>
>
>
>
>
> --
>
> Best Regards,
>
> Haosdent Huang
>


Re: How to start two slaves on a single machine?

2016-01-12 Thread Klaus Ma
The resources of a slave can be defined with "--resources", but you cannot
specify a relative value like "50% of the CPUs". There is a module in the
agent that reports how many resources the current slave can use.

For this case, "--resources" is enough for him :).
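For example, a rough sketch of splitting a 16-CPU / 64 GB host between two
agents (the ports, paths, master address, and resource values are
illustrative; each agent needs its own port and work_dir):

```shell
# Agent 1: first half of the machine
mesos-slave --master=zk://x:2181/mesos --port=5051 \
  --work_dir=/var/lib/mesos/agent1 \
  --resources="cpus:8;mem:32768"

# Agent 2: second half, on a different port and work_dir
mesos-slave --master=zk://x:2181/mesos --port=5052 \
  --work_dir=/var/lib/mesos/agent2 \
  --resources="cpus:8;mem:32768"
```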


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Tue, Jan 12, 2016 at 9:00 PM, Du, Fan <fan...@intel.com> wrote:

> How to make slave_a to use first half of cpu/memory and slave_b use the
> rest of it?
>
>
> On 2016/1/12 20:54, haosdent wrote:
>
>> Yes, need use different work_dir and port.
>>
>> On Tue, Jan 12, 2016 at 8:42 PM, Du, Fan <fan...@intel.com
>> <mailto:fan...@intel.com>> wrote:
>>
>> Just my 2 cents.
>>
>> I guess the spew is caused by the same work_dir.
>> Even with two different work_dir, how does cpu/mem resources are
>> partitioned for two slave instances?
>> I'm not aware how current resources parsing logic support this(
>> probably not).
>> but why not use slave docker image to do the resource partition?
>> that's what docker meant to be here.
>>
>> On 2016/1/12 19:58, Shiyao Ma wrote:
>>
>> Hi,
>>
>> When trying starting two slaves on a single host, I encoutered
>> with the
>> following error:
>>
>> paste: http://sprunge.us/bLKb
>>
>> Apparently, the second slave was *mis-understood* as the
>> recovery of the
>> first.
>>
>> The slaves are configured identically other than the ports.
>>
>>
>> Regards.
>>
>> --
>>
>> 吾輩は猫である。ホームーページはhttps://introo.me
>> <http://introo.me>。
>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>


Re: Why are FrameworkToExecutorMessage and ExecutorToFrameworkMessage transmitted along different paths?

2016-01-05 Thread Klaus Ma
Re the different paths: the master knows the slave's details, e.g. hostname
and port, while the slave already has the scheduler's info, so when the
executor sends a message to the scheduler, the slave can deliver it
directly.

Re FrameworkToExecutorMessage reliability: yes, the framework developer
needs to guarantee reliability, e.g. across master/slave failover.
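A common pattern for building reliability on top of the best-effort call is
application-level sequence numbers plus acks. A minimal sketch (illustrative
only; `ReliableMessenger` and the in-memory transport are made up for this
example and are not part of the Mesos API):

```python
# Sketch of an at-least-once delivery scheme layered over a best-effort
# send, e.g. driver.sendFrameworkMessage(). Illustrative only.
import itertools

class ReliableMessenger:
    def __init__(self, send):
        self._send = send           # underlying best-effort send(seq, payload)
        self._seq = itertools.count()
        self.pending = {}           # seq -> payload, resent until acked

    def send(self, payload):
        seq = next(self._seq)
        self.pending[seq] = payload
        self._send(seq, payload)
        return seq

    def on_ack(self, seq):
        # Peer echoed the sequence number back; stop retrying it.
        self.pending.pop(seq, None)

    def resend_unacked(self):
        # Call periodically (or after failover) to retransmit.
        for seq, payload in self.pending.items():
            self._send(seq, payload)

# Demo with an in-memory transport standing in for the driver call:
sent = []
m = ReliableMessenger(lambda seq, payload: sent.append((seq, payload)))
m.send("status-update")
m.send("config-change")
m.on_ack(0)           # peer acked message 0
m.resend_unacked()    # message 1 goes out again
```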


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Tue, Jan 5, 2016 at 10:47 PM, sujz.buaa <sujz.b...@qq.com> wrote:

> Hi, all:
>
>
> I am using mesos-0.22.0, I noticed that FrameworkToExecutorMessage is sent 
> along path:
> Scheduler->Master->Slave->Executor,
> while ExecutorToFrameworkMessage is sent along path:
> Executor->Slave->Scheduler,
>
>
> So is there some reason or benefit for bypassing master while transmitting 
> ExecutorToFrameworkMessage?
>
> One more question, FrameworkToExecutorMessage and ExecutorToFrameworkMessage 
> are instantiated
> in function SendFrameworkMessage,
> declaration of SendFrameworkMessage in include/mesos/scheduler.hpp and 
> include/mesos/executor.hpp:
>   // Sends a message from the framework to one of its executors. These
>   // messages are *best effort*; do not expect a framework message to be
>   // retransmitted in any reliable fashion.
>   virtual Status sendFrameworkMessage(
>   const ExecutorID& executorId,
>   const SlaveID& slaveId,
>   const std::string& data) = 0;
>
>
> I guess that protobuf message are transmitted with TCP, so does this comment 
> mean I have to guarantee reliability by myself even with TCP? What's special 
> for these
>
> two messages compared with other protobuf messages, If no, do we have to 
> guarantee reliability all by ourselves?
>
> Thank you very much and best regards !
>


Re: Running mesos slave in Docker on CoreOS

2016-01-01 Thread Klaus Ma
@Graham, I just checked mesosphere/mesos-slave; the image does not seem to
include the Docker binaries. So I think you can try mounting the docker
binary into this image (paying attention to its dependencies), or rebuild
the image yourself.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Fri, Jan 1, 2016 at 5:32 AM, Marco Massenzio <m.massen...@gmail.com>
wrote:

> Provided that I know close to nothing about CoreOS (and very little about
> docker itself) usually the 127 exit code is for a "not found" binary - are
> you sure that `docker` is in the PATH of the user/process running the Mesos
> agent?
>
> Much longer shot - but worth a try: look into the permissions around the
> /var/run folder - what happens if you try to run the very same command that
> failed, from the shell?
> (but I do see that you mount it with the -v, so that should work,
> shouldn't it?)
>
> --
> *Marco Massenzio*
> http://codetrips.com
>
> On Thu, Dec 31, 2015 at 1:17 PM, Taylor, Graham <
> graham.x.tay...@capgemini.com> wrote:
>
>> I did try removing the /proc and adding just pid=host but still no dice
>> with that. Need to have a deeper dig into the docker 1.9 changelog. Will
>> post back if I find anything.
>>
>> Thanks,
>> Graham.
>>
>> On 31 Dec 2015, at 20:27, Tim Chen <t...@mesosphere.io> wrote:
>>
>> I don't think you need to mount in /proc if you have --pid=host already,
>> can you try that?
>>
>> Tim
>>
>> On Thu, Dec 31, 2015 at 4:16 AM, Taylor, Graham <
>> graham.x.tay...@capgemini.com> wrote:
>>
>>> Hey folks,
>>> I’m trying to get Mesos slave up and running in a docker container on
>>> CoreOS. I’ve successfully got the master up and running but anytime I start
>>> the slave container I receive the following error -
>>>
>>> Failed to create a containerizer: Could not create DockerContainerizer:
>>> Failed to create docker: Failed to get docker version: Failed to execute
>>> 'docker -H unix:///var/run/docker.sock --version': exited with status 127
>>>
>>> I’m starting the slave container with the following command -
>>>
>>> /usr/bin/docker run --rm --name mesos_slave \
>>> --net=host \
>>> --privileged \
>>> --pid=host \
>>> -p 5051:5051 \
>>> -v /sys:/sys \
>>> -v /proc:/host/proc:ro \
>>> -v /usr/bin/docker:/usr/bin/docker:ro \
>>> -v /var/run/docker.sock:/var/run/docker.sock \
>>> -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
>>> -e "MESOS_MASTER=zk://172.31.1.11:2181,172.31.1.12:2181,
>>> 172.31.1.13:2181/mesos" \
>>> -e "MESOS_EXECUTOR_REGISTRATION_TIMEOUT=10mins" \
>>> -e "MESOS_CONTAINERIZERS=docker" \
>>> -e "MESOS_RESOURCES=ports(*):[31000-32000]" \
>>> -e "MESOS_IP=172.31.1.14" \
>>> -e "MESOS_WORK_DIR=/tmp/mesos" \
>>> -e "MESOS_HOSTNAME=172.31.1.14" \
>>> mesosphere/mesos-slave:0.25.0-0.2.70.ubuntu1404
>>>
>>> I’ve also tried with various other versions of the Docker image
>>> (including 0.26.0) but I keep receiving the same error.
>>>
>>> I’m running on CoreOS beta channel (877.1.0) which has docker installed
>>> and the service running -
>>>
>>> docker --version
>>> Docker version 1.9.1, build 4419fdb-dirty
>>>
>>>
>>> If I change the /proc mount to be /proc:/proc I get past the docker
>>> version error but receive a different error -
>>>
>>> Error response from daemon: Cannot start container
>>> 51a9b60f702a0f13f975fd2e7f4b642180d5363565e042702665098e8761b758: [8]
>>> System error:
>>> "/var/lib/docker/overlay/51a9b60f702a0f13f975fd2e7f4b642180d5363565e042702665098e8761b758/merged/proc"
>>> cannot be mounted because it is located inside "/proc”
>>>
>>>
>>> I had a search on the wiki and found some similar related issues
>>> https://issues.apache.org/jira/browse/MESOS-3498?jql=project%20%3D%20MESOS%20AND%20text%20~%20%22Failed%20to%20execute%20%27docker%20version%22
>>>  but
>>> they all seem to be closed/resolved/won’t fix.
>>>
>>> Is anyone successfully running a slave on CoreOS and can help me fix up
>>> my Docker command?
>>>
>>> Thanks,
>>> Graham.
>>>
>>>
>>> --
>>>
>>> Capgemini is a trading name used by the Capgemini Group of companies
>>> which includes Capgemini UK plc, a company registered in England and Wales
>>> (number 943935) whose registered office is at No. 1, Forge End, Woking,
>>> Surrey, GU21 6DB.
>>>
>>
>>
>> --
>>
>> Capgemini is a trading name used by the Capgemini Group of companies
>> which includes Capgemini UK plc, a company registered in England and Wales
>> (number 943935) whose registered office is at No. 1, Forge End, Woking,
>> Surrey, GU21 6DB.
>>
>
>


Re: make slaves not getting tasks anymore

2015-12-30 Thread Klaus Ma
Hi Mike,

Which framework are you using? Have you looked at the maintenance
primitives' scheduling feature? My understanding is that a framework should
not dispatch tasks to an agent scheduled for maintenance, so the operator
can wait for all tasks to finish before taking any action.

"When maintenance is triggered by the operator" covers the case where some
tasks take too long to finish, so the operator can take action and shut
them down.

For restarting the agent with new attributes, there is a JIRA (MESOS-1739)
about it.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Wed, Dec 30, 2015 at 7:43 PM, Mike Michel <mike.mic...@mmbash.de> wrote:

> Hi,
>
>
>
> i need to update slaves from time to time and looking for a way to take
> them out of the cluster but without killing the running tasks. I need to
> wait until all tasks are done and during this time no new tasks should be
> started on this slave. My first idea was to set a constraint
> „status:online“ for every task i start and then change the attribute of the
> slave to „offline“, restart slave process while executer still runs the
> tasks but it seems if you change the attributes of a slave it can not
> connect to the cluster without rm -rf /tmp before which will kill all tasks.
>
>
>
> Also the maintenance mode seems not to be an option:
>
>
>
> „When maintenance is triggered by the operator, all agents on the machine
> are told to shutdown. These agents are subsequently removed from the master
> which causes tasks to be updated as TASK_LOST. Any agents from machines
> in maintenance are also prevented from registering with the master.“
>
>
>
> Is there another way?
>
>
>
>
>
> Cheers
>
>
>
> Mike
>


Re: More filters on /master/tasks enpoint and filters on /master/state

2015-12-28 Thread Klaus Ma
+1

It would also reduce the master's workload; but instead of label filtering,
I'd like to keep the master simpler: return tasks page by page and let the
framework/dashboard filter them itself.
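Until such filtering exists server-side, a client can do it after fetching
the task list. A minimal sketch of label filtering (the sample task dicts
are made up; the `labels` shape mirrors how Mesos serializes Labels as a
list of key/value objects, but treat that as an assumption):

```python
# Client-side filtering of /master/tasks output by label. Illustrative only.

def filter_tasks_by_label(tasks, key, value):
    """Keep only tasks carrying the given key/value label."""
    def has_label(task):
        return any(l.get("key") == key and l.get("value") == value
                   for l in task.get("labels", []))
    return [t for t in tasks if has_label(t)]

tasks = [
    {"id": "web-1", "labels": [{"key": "tier", "value": "frontend"}]},
    {"id": "db-1",  "labels": [{"key": "tier", "value": "backend"}]},
    {"id": "job-1"},  # no labels at all
]
frontend = filter_tasks_by_label(tasks, "tier", "frontend")
```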



Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Tue, Dec 29, 2015 at 6:09 AM, Diogo Gomes <diogo...@gmail.com> wrote:

> Hi guys, I would like your opinion about a future feature proposal.
>
> Currently, we can use HTTP API to list all our tasks running in our
> cluster using /master/tasks, but you have to list all tasks or limit/offset
> this list, we cannot filter this. I would like to filter this, using
> labels, for example. The use case will be to use mesos to fill our load
> balancer with tasks data.
>
>
> Marathon currently provides something like this, but only for his tasks,
> using /v2/apps/?label=[key]==[value]
>
>
> Diogo Gomes
>


Re: The issue of "Failed to shutdown socket with fd xx: Transport endpoint is not connected" on Mesos master

2015-12-28 Thread Klaus Ma
It seems Kubernetes is down; could you help check the Kubernetes
scheduler's (km) status?


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Mon, Dec 28, 2015 at 6:35 PM, Nan Xiao <xiaonan830...@gmail.com> wrote:

> Hi all,
>
> Greetings from me!
>
> I am trying to follow this tutorial
> (
> https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/mesos.md
> )
> to deploy "k8s on Mesos" on local machines: The k8s is the newest
> master branch, and Mesos is the 0.26 edition.
>
> After running Mesos master(IP:15.242.100.56), Mesos
> slave(IP:15.242.100.16),, and the k8s(IP:15.242.100.60), I can see the
> following logs from Mesos master:
>
> ..
> I1227 22:52:34.494478  8069 master.cpp:4269] Received update of slave
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 at slave(1)@15.242.100.16:5051
> (pqsfc016.ftc.rdlabs.hpecorp.net) with total oversubscribed resources
> I1227 22:52:34.494940  8065 hierarchical.cpp:400] Slave
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0
> (pqsfc016.ftc.rdlabs.hpecorp.net) updated with oversubscribed
> resources  (total: cpus(*):32; mem(*):127878; disk(*):4336;
> ports(*):[31000-32000], allocated: )
> I1227 22:53:06.740757  8053 http.cpp:334] HTTP GET for
> /master/state.json from 15.242.100.60:56219 with
> User-Agent='Go-http-client/1.1'
> I1227 22:53:07.736419 8065 http.cpp:334] HTTP GET for
> /master/state.json from 15.242.100.60:56241 with
> User-Agent='Go-http-client/1.1'
> I1227 22:53:07.767196  8070 http.cpp:334] HTTP GET for
> /master/state.json from 15.242.100.60:56252 with
> User-Agent='Go-http-client/1.1'
> I1227 22:53:08.808171  8053 http.cpp:334] HTTP GET for
> /master/state.json from 15.242.100.60:56272 with
> User-Agent='Go-http-client/1.1'
> I1227 22:53:08.815811  8060 master.cpp:2176] Received SUBSCRIBE call
> for framework 'Kubernetes' at scheduler(1)@15.242.100.60:59488
> I1227 22:53:08.816182 8060 master.cpp:2247] Subscribing framework
> Kubernetes with checkpointing enabled and capabilities [  ]
> I1227 22:53:08.817294  8052 hierarchical.cpp:195] Added framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-
> I1227 22:53:08.817464  8050 master.cpp:1122] Framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6- (Kubernetes) at
> scheduler(1)@15.242.100.60:59488 disconnected
> E1227 22:53:08.817497 8073 process.cpp:1911] Failed to shutdown
> socket with fd 17: Transport endpoint is not connected
> I1227 22:53:08.817533  8050 master.cpp:2472] Disconnecting framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6- (Kubernetes) at
> scheduler(1)@15.242.100.60:59488
> I1227 22:53:08.817595 8050 master.cpp:2496] Deactivating framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6- (Kubernetes) at
> scheduler(1)@15.242.100.60:59488
> I1227 22:53:08.817797 8050 master.cpp:1146] Giving framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6- (Kubernetes) at
> scheduler(1)@15.242.100.60:59488 7625.14222623576weeks to failover
> W1227 22:53:08.818389 8062 master.cpp:4840] Master returning
> resources offered to framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6- because the framework has
> terminated or is inactive
> I1227 22:53:08.818397  8052 hierarchical.cpp:273] Deactivated
> framework 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-
> I1227 22:53:08.819046  8066 hierarchical.cpp:744] Recovered
> cpus(*):32; mem(*):127878; disk(*):4336; ports(*):[31000-32000]
> (total: cpus(*):32; mem(*):127878; disk(*):4336;
> ports(*):[31000-32000], allocated: ) on slave
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-S0 from framework
> 9c3c6c78-0b62-4eaa-b27a-498f172e7fe6-
> ..
>
> I can't figure out why Mesos master complains "Failed to shutdown
> socket with fd 17: Transport endpoint is not connected".
> Could someone give some clues on this issue?
>
> Thanks very much in advance!
>
> Best Regards
> Nan Xiao
>


Re: Can tasks from multiple frameworks simultaneously run on the same slave node?

2015-12-04 Thread Klaus Ma
Re (a): yes, that's the current behavior, but MESOS-3765
<https://issues.apache.org/jira/browse/MESOS-3765> aims to enhance it.
Re (b): partly:
    0. I'm not sure I understand your point about "actually used to launch
tasks"; my point is that Mesos does not know on its own whether a "task" is
healthy. Mesos depends on the framework (executor) to report correct
status, though it can monitor some conditions, such as an executor crash.
    1. If you call launchTask with 1 CPU against an offer that includes 3
CPUs, the other 2 CPUs are returned to the allocator.
    2. There is also an offer_timeout on the master to reclaim outstanding
offers.

Re (c): no; the offer timeout and the return of unused resources both give
resources back, and if the resources do not match its requirements, the
framework can also decline the offer.

If resources are available, other frameworks can launch tasks, so multiple
frameworks may run tasks on the same slave at the same time. There are also
several features to improve multi-tenancy, such as Quota (MESOS-1791
<https://issues.apache.org/jira/browse/MESOS-1791>), Optimistic Offers
(MESOS-1607 <https://issues.apache.org/jira/browse/MESOS-1607>), and
implicit/dynamic roles.
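Point (b).1 can be sketched as simple arithmetic on the offer (the values
are illustrative, and this is not Mesos code):

```python
# Sketch of what the master does with a partially-used offer: resources
# not consumed by launched tasks go back to the allocator.

def recover_unused(offered, used):
    """Leftover resources after tasks are launched against an offer."""
    return {r: offered[r] - used.get(r, 0.0) for r in offered}

offered = {"cpus": 3.0, "mem": 6144.0}
used    = {"cpus": 1.0, "mem": 1024.0}    # one task with 1 CPU, 1 GB
leftover = recover_unused(offered, used)  # returned to the allocator
```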




Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Fri, Dec 4, 2015 at 5:44 PM, Daniel <dan...@abde.me> wrote:

> Hi, I'm confused with Mesos's resource offering mechanism:
>
> (a) An offer includes all available resources in a slave node.
>
> (b) A framework would occupy the resources associated with an offer,
> regardless of whether the resources were actually used to launch tasks,
> unless the offer was explicitly declined by calling declineOffer(offerId).
>
> (c) An offer must be declined in its entirety.
>
> Then it seems that if a framework does not decline an offer from a slave
> node, other frameworks have no access to the resources on the same slave
> node. Am I correct? Can tasks from mulliple frameworks simultaneously run
> on the same slave node?
>
> Thanks a lot :-)
>


Re: Multiple active masters

2015-11-30 Thread Klaus Ma
If you run two agents on each slave host, the two clusters will share
resources all the time. One option I can imagine is to use a meta-framework
that connects to the two clusters and balances resources between them.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Mon, Nov 30, 2015 at 8:01 PM, tommy xiao <xia...@gmail.com> wrote:

> run two slave in same node is bad idea on production env. i only say, why
> not install two cluster in same DC?
>
> 2015-11-30 14:20 GMT+08:00 Chengwei Yang <chengwei.yang...@gmail.com>:
>
>> On Fri, Nov 27, 2015 at 04:16:44PM +0100, Elouan Keryell-Even wrote:
>> > Hi,
>> >
>> > I don't think this is possible yet, but is it planned some day to be
>> able to
>> > work with multiple active Mesos masters?
>> >
>> > I mean, not just a high-availability configuration with one elected
>> master and
>> > a few others idle masters (in case of failure), but really two masters
>> > exploiting a shared pool of slaves.
>> >
>> >
>> > Our use case:
>> >
>> > We are using Mesos to manage resources of two separated clusters. For
>> now, one
>> > cluster acts as a "commandment center" (since it runs the Mesos
>> master), while
>> > the other only runs mesos slaves.
>> >
>> > We would want to have a perfectly symetric architecture, where each
>> cluster
>> > could borrow a few slaves from the other cluster to run some of his
>> jobs.
>> >
>> > In my mind that implies each slave has to be managed by two mesos
>> masters at
>> > the same time.
>>
>> Try run two mesos-slave on the same node though I couldn't buy-in your
>> use case?
>>
>> Of course, you have to configure different LIBPROCESS_PORT to avoid
>> conflict
>> and any other options you want, I didn't do this so I'm not sure if this
>> works.
>>
>>
>> --
>> Thanks,
>> Chengwei
>>
>> >
>> >
>> > I'm curious to have your opinion on this :)
>> >
>> > Elouan Keryell-Even
>> > Software Engineer @ Atos Integration
>> > Toulouse, France
>> > +33 6 64 61 29 56
>> > SECURITY NOTE: file ~/.netrc must not be accessible by others
>>
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>


Re: Multiple active masters

2015-11-27 Thread Klaus Ma
That's interesting :). But as far as I know, there's no such migration
feature in Mesos. Maybe you can start an EPIC for this requirement.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Fri, Nov 27, 2015 at 11:16 PM, Elouan Keryell-Even <
elouan.kery...@gmail.com> wrote:

> Hi,
>
> I don't think this is possible yet, but is it planned some day to be able
> to work with multiple active Mesos masters?
>
> I mean, not just a high-availability configuration with one elected master
> and a few others idle masters (in case of failure), but really two masters
> exploiting a shared pool of slaves.
>
>
> Our use case:
>
> We are using Mesos to manage resources of two separated clusters. For now,
> one cluster acts as a "commandment center" (since it runs the Mesos
> master), while the other only runs mesos slaves.
>
> We would want to have a perfectly symetric architecture, where each
> cluster could borrow a few slaves from the other cluster to run some of his
> jobs.
>
> In my mind that implies each slave has to be managed by two mesos masters
> at the same time.
>
>
> I'm curious to have your opinion on this :)
>
> Elouan Keryell-Even
> Software Engineer @ Atos Integration
> Toulouse, France
> +33 6 64 61 29 56
>


Re: Failed to authenticate

2015-11-10 Thread Klaus Ma
Would you help to log a JIRA so we can trace it?


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Tue, Nov 10, 2015 at 5:12 PM, Pradeep Kiruvale <pradeepkiruv...@gmail.com
> wrote:

> This issue is only on centos 7, On ubuntu its working fine.
>
> Any idea?
>
> Regards,
> Pradeep
>
>
> On 9 November 2015 at 17:32, Pradeep Kiruvale <pradeepkiruv...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I am getting authentication issue on my mesos cluster
>>
>> Please find the slave side and master side logs.
>>
>> Regards,
>> Pradeep
>>
>> *Slave  logs *
>>
>> W1110 01:54:18.641191 111550 slave.cpp:877] Authentication timed out
>> W1110 01:54:18.641309 111550 slave.cpp:841] Failed to authenticate with
>> master master@192.168.0.102:5050: Authentication discarded
>> I1110 01:54:18.641355 111550 slave.cpp:792] Authenticating with master
>> master@192.168.0.102:5050
>> I1110 01:54:18.641369 111550 slave.cpp:797] Using default CRAM-MD5
>> authenticatee
>> I1110 01:54:18.641616 111539 authenticatee.cpp:123] Creating new client
>> SASL connection
>> W1110 01:54:23.646075 111555 slave.cpp:877] Authentication timed out
>> W1110 01:54:23.646205 111555 slave.cpp:841] Failed to authenticate with
>> master master@192.168.0.102:5050: Authentication discarded
>> I1110 01:54:23.646266 111555 slave.cpp:792] Authenticating with master
>> master@192.168.0.102:5050
>> I1110 01:54:23.646286 111555 slave.cpp:797] Using default CRAM-MD5
>> authenticatee
>> I1110 01:54:23.646406 111544 authenticatee.cpp:123] Creating new client
>> SASL connection
>> W1110 01:54:28.651070 111554 slave.cpp:877] Authentication timed out
>> W1110 01:54:28.651206 111554 slave.cpp:841] Failed to authenticate with
>> master master@192.168.0.102:5050: Authentication discarded
>> I1110 01:54:28.651257 111554 slave.cpp:792] Authenticating with master
>> master@192.168.0.102:5050
>>
>>
>> *Master logs*
>>
>> E1109 17:27:36.455260 27950 process.cpp:1911] Failed to shutdown socket
>> with fd 11: Transport endpoint is not connected
>> W1109 17:27:36.455517 27949 master.cpp:5177] Failed to authenticate
>> slave(1)@192.168.0.169:5051: Failed to communicate with authenticatee
>> E1109 17:27:36.455602 27950 process.cpp:1911] Failed to shutdown socket
>> with fd 12: Transport endpoint is not connected
>> I1109 17:27:41.459787 27946 master.cpp:5150] Authenticating slave(1)@
>> 192.168.0.169:5051
>> I1109 17:27:41.460211 27946 authenticator.cpp:100] Creating new server
>> SASL connection
>> E1109 17:27:41.460376 27950 process.cpp:1911] Failed to shutdown socket
>> with fd 11: Transport endpoint is not connected
>> W1109 17:27:41.460578 27947 master.cpp:5177] Failed to authenticate
>> slave(1)@192.168.0.169:5051: Failed to communicate with authenticatee
>> E1109 17:27:41.460695 27950 process.cpp:1911] Failed to shutdown socket
>> with fd 12: Transport endpoint is not connected
>> I1109 17:27:46.460510 27948 master.cpp:5150] Authenticating slave(1)@
>> 192.168.0.169:5051
>> I1109 17:27:46.460930 27944 authenticator.cpp:100] Creating new server
>> SASL connection
>> E1109 17:27:46.461139 27950 process.cpp:1911] Failed to shutdown socket
>> with fd 11: Transport endpoint is not connected
>> W1109 17:27:46.461392 27944 master.cpp:5177] Failed to authenticate
>> slave(1)@192.168.0.169:5051: Failed to communicate with authenticatee
>> E1109 17:27:46.461444 27950 process.cpp:1911] Failed to shutdown socket
>> with fd 12: Transport endpoint is not connected
>> I1109 17:27:51.466349 27945 master.cpp:5150] Authenticating slave(1)@
>> 192.168.0.169:5051
>> I1109 17:27:51.466747 27945 authenticator.cpp:100] Creating new server
>> SASL connection
>>
>>
>


Re: spark mesos shuffle service failing under marathon

2015-11-07 Thread Klaus Ma
Can you share more logs? I used to run the Spark shuffle service in a
Mesos + Marathon cluster; logs will be helpful to identify the issue.


Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Thu, Nov 5, 2015 at 4:29 AM, Dean Wampler <dean.wamp...@typesafe.com>
wrote:

> Can you find anything in the logs that would indicate a failure?
>
> On Wed, Nov 4, 2015 at 9:23 PM, Rodrick Brown <rodr...@orchard-app.com>
> wrote:
>
>> Starting the mesos shuffle service seems to background the process so
>> when ever marathon tries to bring up this process it constantly keeps
>> trying to start and never registers as started? Is there a fix for this?
>>
>>
>> --
>>
>> [image: Orchard Platform] <http://www.orchardplatform.com/>
>>
>> Rodrick Brown / DevOPs Engineer
>> +1 917 445 6839 / rodr...@orchardplatform.com
>> <char...@orchardplatform.com>
>>
>> Orchard Platform
>> 101 5th Avenue, 4th Floor, New York, NY 10003
>> http://www.orchardplatform.com
>>
>> Orchard Blog <http://www.orchardplatform.com/blog/> | Marketplace
>> Lending Meetup <http://www.meetup.com/Peer-to-Peer-Lending-P2P/>
>>
>>
>> *NOTICE TO RECIPIENTS*: This communication is confidential and intended
>> for the use of the addressee only. If you are not an intended recipient of
>> this communication, please delete it immediately and notify the sender
>> by return email. Unauthorized reading, dissemination, distribution or
>> copying of this communication is prohibited. This communication does not 
>> constitute
>> an offer to sell or a solicitation of an indication of interest to purchase
>> any loan, security or any other financial product or instrument, nor is it
>> an offer to sell or a solicitation of an indication of interest to purchase
>> any products or services to any persons who are prohibited from receiving
>> such information under applicable law. The contents of this communication
>> may not be accurate or complete and are subject to change without notice.
>> As such, Orchard App, Inc. (including its subsidiaries and affiliates,
>> "Orchard") makes no representation regarding the accuracy or
>> completeness of the information contained herein. The intended recipient is
>> advised to consult its own professional advisors, including those
>> specializing in legal, tax and accounting matters. Orchard does not
>> provide legal, tax or accounting advice.
>>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> Typesafe <http://typesafe.com>
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> @deanwampler <http://twitter.com/deanwampler>
>
>
>


How to enable docker container in Mesos-Agent within a Docker

2015-10-27 Thread Klaus Ma
Hi team,

I'd like to start a mesos-agent inside a Docker container with the docker
containerizer enabled. I can run a sleep container from within a ubuntu
container with the following args, but I cannot start a Mesos agent. Refer
to the following output for details:

$sudo docker run --privileged --net=host -v /cgroup:/cgroup -v
/var/run/docker.sock:/var/run/docker.sock -v `which
docker`:/usr/local/bin/docker mesostest/mesos mesos-slave
--master=zk://x:2181/mesos --containerizers=docker,mesos

​I1028 01:43:38.415841 1 main.cpp:190] Build: 2015-10-26 19:18:04 by
engbuild
I1028 01:43:38.416102 1 main.cpp:192] Version: 0.26.0
I1028 01:43:38.416353 1 main.cpp:199] Git SHA:
e3e24f32a0e4b89dee416969c5bc67ae9742007c
Failed to create a containerizer: Could not create DockerContainerizer:
Failed to create docker: Failed to find a mounted cgroups hierarchy for the
'cpu' subsystem; you probably need to mount cgroups manually
​

-- 
​Klaus​




Re: How to enable docker container in Mesos-Agent within a Docker

2015-10-27 Thread Klaus Ma
:(. It does not work. Both linux & posix are tested. And according to the
description of launcher, linux is necessary for docker. It's strange that
/cgroup is empty, but I can start mesos-slave with docker container outside
the docker.

On Wed, Oct 28, 2015 at 9:54 AM, haosdent <haosd...@gmail.com> wrote:

> try add -e MESOS_LAUNCHER=posix
> On Oct 28, 2015 9:49 AM, "Klaus Ma" <klaus1982...@gmail.com> wrote:
>
>> Hi team,
>>
>> I'd like to start a mesos-agent in Docker with docker container, I can
>> run sleep docker within a ubuntu docker with following args; but I can not
>> start an Mesos Agent. Refer to the following for the detail of output:
>>
>> $sudo docker run --privileged --net=host -v /cgroup:/cgroup -v
>> /var/run/docker.sock:/var/run/docker.sock -v `which
>> docker`:/usr/local/bin/docker mesostest/mesos mesos-slave
>> --master=zk://x:2181/mesos --containerizers=docker,mesos
>>
>> ​I1028 01:43:38.415841 1 main.cpp:190] Build: 2015-10-26 19:18:04 by
>> engbuild
>> I1028 01:43:38.416102 1 main.cpp:192] Version: 0.26.0
>> I1028 01:43:38.416353 1 main.cpp:199] Git SHA:
>> e3e24f32a0e4b89dee416969c5bc67ae9742007c
>> Failed to create a containerizer: Could not create DockerContainerizer:
>> Failed to create docker: Failed to find a mounted cgroups hierarchy for the
>> 'cpu' subsystem; you probably need to mount cgroups manually
>> ​
>>
>> --
>> ​Klaus​
>>
>> <http://k82.me>
>>
>


-- 
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me


Re: How to enable docker container in Mesos-Agent within a Docker

2015-10-27 Thread Klaus Ma
It worked by adding -v /sys:/sys, but not sure whether any security concern.
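For reference, the variant that worked here (mounting /sys, which exposes the
host's cgroup hierarchy under /sys/fs/cgroup) looks roughly like this; the
image name and the ZooKeeper address are placeholders from the original
command, and this is a sketch rather than a hardened setup:

```shell
# --privileged plus host networking, the Docker socket/binary, and /sys.
# Mounting /sys lets the containerized agent find the mounted cgroup
# hierarchies, which is what the "Failed to find a mounted cgroups
# hierarchy for the 'cpu' subsystem" error was complaining about.
# Note: --privileged and a writable /sys give the container broad access
# to the host, which is the security concern mentioned above.
sudo docker run --privileged --net=host \
  -v /sys:/sys \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v $(which docker):/usr/local/bin/docker \
  mesostest/mesos mesos-slave \
  --master=zk://<zk-host>:2181/mesos \
  --containerizers=docker,mesos
```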

On Wed, Oct 28, 2015 at 10:22 AM, Klaus Ma <klaus1982...@gmail.com> wrote:

> :(. It does not work. Both linux & posix are tested. And according to the
> description of launcher, linux is necessary for docker. It's strange that
> /cgroup is empty, but I can start mesos-slave with docker container outside
> the docker.
>
> On Wed, Oct 28, 2015 at 9:54 AM, haosdent <haosd...@gmail.com> wrote:
>
>> try add -e MESOS_LAUNCHER=posix
>> On Oct 28, 2015 9:49 AM, "Klaus Ma" <klaus1982...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>> I'd like to start a mesos-agent in Docker with docker container, I can
>>> run sleep docker within a ubuntu docker with following args; but I can not
>>> start an Mesos Agent. Refer to the following for the detail of output:
>>>
>>> $sudo docker run --privileged --net=host -v /cgroup:/cgroup -v
>>> /var/run/docker.sock:/var/run/docker.sock -v `which
>>> docker`:/usr/local/bin/docker mesostest/mesos mesos-slave
>>> --master=zk://x:2181/mesos --containerizers=docker,mesos
>>>
>>> ​I1028 01:43:38.415841 1 main.cpp:190] Build: 2015-10-26 19:18:04 by
>>> engbuild
>>> I1028 01:43:38.416102 1 main.cpp:192] Version: 0.26.0
>>> I1028 01:43:38.416353 1 main.cpp:199] Git SHA:
>>> e3e24f32a0e4b89dee416969c5bc67ae9742007c
>>> Failed to create a containerizer: Could not create DockerContainerizer:
>>> Failed to create docker: Failed to find a mounted cgroups hierarchy for the
>>> 'cpu' subsystem; you probably need to mount cgroups manually
>>> ​
>>>
>>> --
>>> ​Klaus​
>>>
>>> <http://k82.me>
>>>
>>
>
>
> --
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform Symphony/DCOS Development & Support, STG, IBM GCG
> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>



-- 
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me


Any ready deb package of Mesos

2015-10-20 Thread Klaus Ma
Hi team,

Is there any ready deb package of Mesos to download?

-- 
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://www.cguru.net


RE: Is there any APIs for status monitering, how did the Webui got the status of mesos?

2015-10-07 Thread Klaus Ma
Hi Chong,

I think you can use Mesos's REST API to achieve that; please refer to the
following URL for more detail:
http://mesos.apache.org/documentation/latest/monitoring/
 
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer 
Platform Symphony/DCOS Development & Support, STG, IBM GCG 
+86-10-8245 4084 | mad...@cn.ibm.com | http://www.cguru.net

On Oct 8, 2015, at 09:04, Chong Chen <chong.ch...@huawei.com> wrote:

Hi, I want to implement a program to monitor Mesos. Are there any APIs already
implemented in Mesos that I can use to get its status, just like what the web
UI does: the amount of total resources, allocated resources, dispatched tasks,
finished/lost tasks, and so on? How does the Mesos web UI get this information?
I think the fastest way for me is to use the same method the web UI does.
Thanks!

Best Regards,
Chong
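As a concrete starting point, the master's /state endpoint (the same JSON the
web UI polls) can be aggregated with a few lines of code. A minimal sketch,
assuming a response shaped like the endpoint's "slaves" array; the sample
payload below is illustrative, not real output:

```python
def summarize_state(state):
    """Sum the resources advertised by each agent in a Mesos /state
    response; each entry in "slaves" carries a "resources" dict."""
    totals = {"cpus": 0.0, "mem": 0.0}
    for agent in state.get("slaves", []):
        res = agent.get("resources", {})
        totals["cpus"] += res.get("cpus", 0.0)
        totals["mem"] += res.get("mem", 0.0)
    return totals

# In practice the JSON would come from http://<master>:5050/state
# (or /state.json on older releases); this payload is made up.
sample = {"slaves": [
    {"resources": {"cpus": 4.0, "mem": 1024.0}},
    {"resources": {"cpus": 8.0, "mem": 2048.0}},
]}
print(summarize_state(sample))  # -> {'cpus': 12.0, 'mem': 3072.0}
```

Task counts (finished/lost) can be pulled the same way from the
/metrics/snapshot endpoint documented at the monitoring URL above.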
  

[Behavior Update] Command-line flags will take precedence over OS Env variables

2015-09-15 Thread Klaus Ma
Hi team,
As we know, the CLI can be configured by environment vars & CLI parameters,
and the current behavior is unclear whenever there is conflicting or
duplicated configuration between the env and the CLI, or on the CLI itself.
After the discussion on JIRA (MESOS-3340), we are planning to update this
behavior so that CLI flags overwrite env vars; and I'd like to refer to
Michael's case to explain why:

Imagine people typically have export MESOS_IP=127.0.0.1 in their bashrc, which
they use in most cases by default, and provide --ip=127.168.1.2 on occasions
(e.g. testing) when they want to override it.

If there are any comments, please let me know.
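The proposed precedence can be sketched in a few lines; the function and dict
shapes below are illustrative, not Mesos's actual flag-parsing code:

```python
def resolve_ip(cli_flags, env):
    """Resolution order proposed in MESOS-3340: an explicit --ip flag on
    the command line wins over the MESOS_IP environment variable."""
    if "ip" in cli_flags:       # explicit CLI flag: highest precedence
        return cli_flags["ip"]
    return env.get("MESOS_IP")  # otherwise fall back to the environment

# MESOS_IP exported in ~/.bashrc for everyday use...
env = {"MESOS_IP": "127.0.0.1"}
print(resolve_ip({}, env))                     # -> 127.0.0.1
# ...overridden occasionally with --ip (e.g. for testing):
print(resolve_ip({"ip": "127.168.1.2"}, env))  # -> 127.168.1.2
```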

Regards,
Klaus Ma (马达), PMP® | http://www.cguru.net

Re: Setting maximum per-node resources in offers

2015-09-08 Thread Klaus Ma
If it's the only framework, you will receive offers for all nodes from 
Mesos. Your framework can then schedule those resources itself to run tasks 
on each node.
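One knob that gets close to this is the agent's --resources flag, which caps
what a single node advertises; a sketch, with illustrative numbers:

```shell
# Advertise at most 8 CPUs (and, say, 16 GB) from this node; offers from
# this agent will then never exceed these values, regardless of the
# physical hardware. The master/ZooKeeper address is a placeholder.
mesos-slave --master=zk://<zk-host>:2181/mesos \
  --resources="cpus:8;mem:16384"
```

Note this only caps per-node offers; it does not by itself guarantee that a
32-core request is spread across exactly 4 nodes — that combination is up to
the framework's scheduler when it accepts offers.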


On 2015年09月09日 03:03, RJ Nowling wrote:

Hi all,

I have a smallish cluster with a lot of cores and RAM per node.  I 
want to support multiple users so I'd like to set up Mesos to provide 
a maximum of 8 cores per node in the resource offers.  Resource offers 
should include multiple nodes to reach the requirements of the user.  
For example, if the user requests 32 cores, I would like 8 cores from 
each of 4 nodes.


Is this possible?  Or can someone suggest alternatives?

Thanks,
RJ


--
Klaus Ma (马达), PMP® | http://www.cguru.net



RE: Basic installation question

2015-09-05 Thread Klaus Ma
Can you share the command lines of the master & slave? According to the following 
information, it seems the master runs without the "--zk" option.



Regards,
Klaus Ma (马达), PMP® | http://www.cguru.net

Date: Fri, 4 Sep 2015 17:33:45 -0700
Subject: Re: Basic installation question
From: java...@gmail.com
To: user@mesos.apache.org


I installed using yum -y install mesos. That did work.   
Now the master and slaves do not see each other.

Here is the master:
$ ps -ef | grep mesos | grep -v grep
stack    30236 17902  0 00:09 pts/4    00:00:04 /mnt/mesos/build/src/.libs/lt-mesos-master --work_dir=/tmp/mesos --ip=10.xx.xx.124

Here is one of the 20 slaves:
$ ps -ef | grep mesos | grep -v grep
root     26086     1  0 00:10 ?        00:00:00 /usr/sbin/mesos-slave --master=zk://10.xx.xx.124:2181/mesos --log_dir=/var/log/mesos
root     26092 26086  0 00:10 ?        00:00:00 logger -p user.info -t mesos-slave[26086]
root     26093 26086  0 00:10 ?        00:00:00 logger -p user.err -t mesos-slave[26086]

Note the slave and master are on the same, correct IP address.
The /etc/mesos/zk file seems to be set properly, and I do see the /mesos node in 
ZooKeeper is updated after restarting the master.
However, the ZooKeeper node is empty:
[zk: localhost:2181(CONNECTED) 10] ls /mesos
[]
The node is world accessible, so there is no permission issue:
[zk: localhost:2181(CONNECTED) 12] getAcl /mesos
'world,'anyone: cdrwa
Why is the ZooKeeper node empty? Is this the reason the master and slaves are 
not connecting?
2015-09-04 14:56 GMT-07:00 craig w <codecr...@gmail.com>:
No problem, they have a "downloads" link in their menu: 
https://mesosphere.com/downloads/
On Sep 4, 2015 5:43 PM, "Stephen Boesch" <java...@gmail.com> wrote:
@Craig: that is an incomplete answer, given that such links are not presented 
in an obvious manner. Maybe you managed to find a link on their site that 
provides prebuilt packages for CentOS 7; if so, then please share it.
I had previously found a link on their site for prebuilt binaries, but it is 
based on using CDH4 (which is not possible for my company). It is also old: 
https://docs.mesosphere.com/tutorials/install_centos_rhel/


2015-09-04 14:27 GMT-07:00 craig w <codecr...@gmail.com>:
Mesosphere has packages prebuilt, go to their site to find how to install
On Sep 4, 2015 5:11 PM, "Stephen Boesch" <java...@gmail.com> wrote:

After following the directions here:   http://mesos.apache.org/gettingstarted/
Which for centos7 includes the following:



  # Change working directory.
$ cd mesos

# Bootstrap (Only required if building from git repository).
$ ./bootstrap

# Configure and build.
$ mkdir build
$ cd build
$ ../configure
$ make
In order to speed up the build and reduce verbosity of the logs, you can 
append -j  V=0 to make.

# Run test suite.
$ make check

# Install (Optional).
$ make install

But the installation is not correct afterwards: here is the bin directory:
$ ll bin
total 92
-rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-tests.sh.in
-rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-slave.sh.in
-rw-r--r--.  1 stack stack 1772 Jul 17 23:14 valgrind-mesos-master.sh.in
-rw-r--r--.  1 stack stack 1769 Jul 17 23:14 valgrind-mesos-local.sh.in
-rw-r--r--.  1 stack stack 1026 Jul 17 23:14 mesos-tests.sh.in
-rw-r--r--.  1 stack stack  901 Jul 17 23:14 mesos-tests-flags.sh.in
-rw-r--r--.  1 stack stack 1019 Jul 17 23:14 mesos-slave.sh.in
-rw-r--r--.  1 stack stack 1721 Jul 17 23:14 mesos-slave-flags.sh.in
-rw-r--r--.  1 stack stack 1366 Jul 17 23:14 mesos.sh.in
-rw-r--r--.  1 stack stack 1026 Jul 17 23:14 mesos-master.sh.in
-rw-r--r--.  1 stack stack  858 Jul 17 23:14 mesos-master-flags.sh.in
-rw-r--r--.  1 stack stack 1023 Jul 17 23:14 mesos-local.sh.in
-rw-r--r--.  1 stack stack  935 Jul 17 23:14 mesos-local-flags.sh.in
-rw-r--r--.  1 stack stack 1466 Jul 17 23:14 lldb-mesos-tests.sh.in
-rw-r--r--.  1 stack stack 1489 Jul 17 23:14 lldb-mesos-slave.sh.in
-rw-r--r--.  1 stack stack 1492 Jul 17 23:14 lldb-mesos-master.sh.in
-rw-r--r--.  1 stack stack 1489 Jul 17 23:14 lldb-mesos-local.sh.in
-rw-r--r--.  1 stack stack 1498 Jul 17 23:14 gdb-mesos-tests.sh.in
-rw-r--r--.  1 stack stack 1527 Jul 17 23:14 gdb-mesos-slave.sh.in
-rw-r--r--.  1 stack stack 1530 Jul 17 23:14 gdb-mesos-master.sh.in
-rw-r--r--.  1 stack stack 1521 Jul 17 23:14 gdb-mesos-local.sh.in
drwxr-xr-x.  2 stack stack 4096 Jul 17 23:21 .
drwxr-xr-x. 11 stack stack 4096 Sep  4 20:08 ..
So .. two things:
(a) what is missing from the installation instructions?
(b) Is there an up to date rpm/yum installation for centos7?











  

Re: How does mesos determine how much memory on a node is available for offer?

2015-09-03 Thread Klaus Ma
Yes; if totalMemory > 2G, report totalMemory - 1G; otherwise, report 
totalMemory/2.


On 2015年09月03日 20:11, Alex Rukletsov wrote:
Mesos agent (aka slave) estimates the memory available and advertises 
all of it minus 1GB. If there is less than 2GB available, only half is 
advertised [1].


[1]: 
https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L98
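The rule described above can be sketched as follows; this is a paraphrase of
the linked containerizer logic, written in MB for concreteness rather than a
copy of the C++ code:

```python
def advertised_mem_mb(total_mb):
    """Memory a Mesos agent offers by default, per the rule above:
    everything minus 1 GB when more than 2 GB is present, otherwise half."""
    GB = 1024  # MB per GB
    if total_mb > 2 * GB:
        return total_mb - GB   # large host: hold back 1 GB for the system
    return total_mb // 2       # small host: hold back half

print(advertised_mem_mb(4096))  # -> 3072 (4 GB box offers 3 GB)
print(advertised_mem_mb(2560))  # -> 1536 (2.5 GB box offers 1.5 GB)
print(advertised_mem_mb(1500))  # -> 750  (small box offers half)
```

This matches the behavior reported in this thread: a 2.5 GB VM ends up
advertising about 1.5 GB. Overriding with --resources bypasses the heuristic.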


On Thu, Sep 3, 2015 at 4:01 AM, Anand Mazumdar <an...@mesosphere.io 
<mailto:an...@mesosphere.io>> wrote:


My bad; seeing the 1002mb (~1024) number made me think the agent
was not able to get the memory estimates from the OS and was
defaulting to the constant values.

The slave executes a `sysinfo` system call and populates the
memory numbers based on it. If you want more fine-grained
control, try to specify it directly using the --resources flag as I
had mentioned earlier.

-anand


On Sep 2, 2015, at 6:48 PM, F21 <f21.gro...@gmail.com
<mailto:f21.gro...@gmail.com>> wrote:

There seems to be some dynamic behavior to it. I just bumped the
memory for each VM up to 2.5GB and now Mesos is offering 1.5GB on
its slave. Is there some percentage value that I can set so that
more memory is available to Mesos?

On 3/09/2015 11:23 AM, Anand Mazumdar wrote:

In case you don't specify the resources via the "--resources" flag
when you start your agent, it picks up the default values.
(Example: --resources="cpus:4;mem:1024;disk:2")

The default value for memory is here:
https://github.com/apache/mesos/blob/master/src/slave/constants.cpp#L46

-anand






--
Klaus Ma (马达), PMP® | http://www.cguru.net



RE: Can't start master properly (stale state issue?); help!

2015-08-13 Thread Klaus Ma
I used to meet a similar issue with ZooKeeper + Mesos; I resolved it by removing 
127.0.1.1 from /etc/hosts; here is an example:
klaus@klaus-OptiPlex-780:~/Workspace/mesos$ cat /etc/hosts
127.0.0.1   localhost
127.0.1.1   klaus-OptiPlex-780   <== remove this line, and add a new line 
mapping the real IP (e.g. 192.168.1.100) to the hostname
...
BTW, please also clean up the log directory and re-start ZK & Mesos.

If any more comments, please let me know.

Regards,
Klaus Ma (马达), PMP® | http://www.cguru.net

Date: Thu, 13 Aug 2015 12:20:34 -0700
Subject: Re: Can't start master properly (stale state issue?); help!
From: ma...@mesosphere.io
To: user@mesos.apache.org


On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell arach...@gmail.com wrote:
Marco & haosdent,
This is just a quick note to say thank you for your replies.
No problem, you're welcome. I will answer you much more fully tomorrow, but for 
now can only manage a few quick observations & questions:
1. Having some months ago encountered a known problem with the IP@ 127.0.1.1 
(I'll provide references tomorrow), I early on configured /etc/hosts, replacing 
myHostName 127.0.1.1 with myHostName Real_IP. That said, I can't rule out 
a race condition whereby ZK | mesos-master saw the original unchanged 
/etc/hosts before I zapped it.
2. What is a znode and how would I drop it?
so, the znode is the fancy name that ZK gives to the nodes in its tree 
(trivially, the path) - assuming that you give Mesos the following ZK 
URL:zk://10.10.0.5:2181/mesos/prod
the 'znode' would be `/mesos/prod` and you could go inspect it (using zkCli.sh) 
by doing: ls /mesos/prod
you should see at least one (with the Master running) file: info_001 or 
json.info_0001 (depending on whether you're running 0.23 or 0.24) and you 
could then inspect its contents with: get /mesos/prod/info_001
For example, if I run a Mesos 0.23 on my localhost, against ZK on the same:







$ ./bin/mesos-master.sh --zk=zk://localhost:2181/mesos/test --quorum=1 
--work_dir=/tmp/m23-2 --port=5053I can connect to ZK via zkCli.sh and:
[zk: localhost:2181(CONNECTED) 4] ls /mesos/test
[info_06, log_replicas]
[zk: localhost:2181(CONNECTED) 6] get /mesos/test/info_06
#20150813-120952-18983104-5053-14072ц 'master@192.168.33.1:5053* 
192.168.33.120.23.0

cZxid = 0x314
dataLength = 93
 // a bunch of other metadata
numChildren = 0
(you can remove it with - you guessed it - `rm -f /mesos/test` at the CLI 
prompt - stop Mesos first, or it will be a very unhappy Master :).
in the corresponding logs I see (note the new leader here too, even though 
this was the one and only):
I0813 12:09:52.126509 105455616 group.cpp:656] Trying to get 
'/mesos/test/info_06' in ZooKeeper
W0813 12:09:52.127071 107065344 detector.cpp:444] Leading master 
master@192.168.33.1:5053 is using a Protobuf binary format when registering 
with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0813 12:09:52.127094 107065344 detector.cpp:481] A new leading master 
(UPID=master@192.168.33.1:5053) is detected
I0813 12:09:52.127187 103845888 master.cpp:1481] The newly elected leader is 
master@192.168.33.1:5053 with id 20150813-120952-18983104-5053-14072
I0813 12:09:52.127209 103845888 master.cpp:1494] Elected as the leading master!

At this point, I'm almost sure you're running up against some issue with the 
log-replica; but I'm the least competent guy here to help you on that one, 
hopefully someone else will be able to add insight here.
I start the services (zk, master, marathon; all on the same host) by SSHing into 
the host & doing 'service ... start' commands.
Again, thanks very much; and more tomorrow.
Cordially,
Paul
On Thu, Aug 13, 2015 at 1:08 PM, haosdent haosd...@gmail.com wrote:
Hello, how did you start the master? And could you try using "netstat -antp | grep 
5050" to find whether there are multiple master processes running on the same 
machine or not?
On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell arach...@gmail.com wrote:
Hi All,
I hope someone can shed some light on this because I'm getting desperate!
I try to start components zk, mesos-master, and marathon in that order. They 
are started via a program that SSHs to the sole host and does service xxx 
start. Everyone starts happily enough. But the Mesos UI shows me:
This master is not the leader, redirecting in 0 seconds ... go now

The pattern seen in all of the mesos-master.INFO logs (one of which is shown 
below) is that the mesos-master with the correct IP@ starts. But then a new 
leader is detected and becomes the leading master. This new leader shows 
UPID=master@127.0.1.1:5050.
I've tried clearing what ZK and mesos-master state I can find, but this problem 
will not go away.
Would someone be so kind as to a) explain what is happening here and b) suggest 
remedies?
Thanks very much

Re: Ambari and Mesos

2015-08-12 Thread Klaus Ma
Ambari provides several features, such as monitoring/alerting and cluster 
management:
- regarding monitoring/alerting, Mesos already provides a metrics API; we can 
follow Ambari's Sink protocol to show those metrics in Ambari
- regarding cluster management, I suggest having Ambari acquire 
resources from Mesos and start the related daemons through Mesos; not only YARN, 
but also Kafka, Storm and others


Regards
--
Klaus Ma (马达), PMP® | Advisory Software Engineer
Platform Symphony & MapReduce Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://www.cguru.net

On 08/12/2015 03:20 PM, haosdent wrote:

Agree, there are need more works to integrate them.

On Wed, Aug 12, 2015 at 3:14 PM, Stephen Knight skni...@pivotal.io 
mailto:skni...@pivotal.io wrote:


Well essentially I want to bring Mesos and Hadoop together,
particularly MRV2. Instead of relying on 2 different schedulers,
it would be great to have a framework that allows the datanode to
be a docker container provisioned in Mesos as an Ambari target.

I see there is some work from Twitter/Ebay on Project Myriad which
would bring the two together. Have to wait and see how that turns
out i guess.

On Tue, Aug 11, 2015 at 11:42 PM, Tim St Clair
tstcl...@redhat.com mailto:tstcl...@redhat.com wrote:

Depends entirely what you are trying to do, b/c Ambari has a
host of capabilities which make it a full management tool for
HDP.

So it may make more sense to break-down what features you are
looking to integrate...

Cheers,
Tim



*From: *Stephen Knight skni...@pivotal.io
mailto:skni...@pivotal.io
*To: *user@mesos.apache.org mailto:user@mesos.apache.org
*Sent: *Tuesday, August 11, 2015 8:56:35 AM
*Subject: *Ambari and Mesos


Hi guys,

Is there a way to integrate Ambari and Mesos? I see that
Mesos is capable of managing Hadoop in the sense that it
will distribute out resource but how would you couple that
with Ambari?

Any advice is greatly appreciated.

Thanks




-- 
Cheers,

Timothy St. Clair
Red Hat Inc.





--
Best Regards,
Haosdent Huang


--
Klaus Ma (马达), PMP® | Advisory Software Engineer
Platform Symphony & MapReduce Development & Support, STG, IBM GCG
+86-10-8245 4084 | mad...@cn.ibm.com | http://www.cguru.net



Re: Mesos slave help

2015-08-06 Thread Klaus Ma

Hi Stephen,

Would you share the logs of the master & slave?

Thanks
Klaus

On 2015年08月06日 16:07, Stephen Knight wrote:

Hi,

I was wondering if anyone can help me. I have a test setup, 1 
master/zookeeper and 2 slaves on Ubuntu 14.04.


When I initialize the slaves the first time it all works and they 
register with the master (I can see it on x.x.x.x:5050) but when I 
reboot those slaves for any reason, they never re-register. Am I 
missing something?


Thx


--
---
Stephen Knight
Infrastructure Consultant

Pivotal Services @ EMC
+971 (0)56 538 2071

skni...@pivotal.io mailto:skni...@pivotal.io
stephen.knig...@emc.com mailto:stephen.knig...@emc.com

Pivotal.io
