Re: Release policy and 1.6 release schedule

2018-03-16 Thread Jie Yu
Thanks Greg for starting this thread!


> My primary motivation here is to bring our documented policy in line with
> our practice, whatever that may be


+100

> Do people think that we should attempt to bring our release cadence more in
> line with our current stated policy, or should the policy be changed to
> reflect our current practice?


I think a minor release every 2 months is probably too aggressive. I don't
have concrete data, but my feeling is that folks upgrade Mesos
infrequently. I know that many users are still on 1.2.x.

I'd actually suggest that we have a *minimum* interval between two releases
(e.g., 3 months), and provide some buffer for the release process. That
means we'd expect about 3 releases per year, which matches what we did last
year.

We would then use our dev sync to coordinate on a release (and elect a
release manager) once the minimum release interval has elapsed.

- Jie

On Wed, Mar 14, 2018 at 9:51 AM, Zhitao Li wrote:

> An additional data point is how long it takes from the first RC being cut
> until the vote on the final release tag passes. That probably indicates how
> smooth the release process is and how good the quality control measures are.
>
> I would argue for not delaying releases for new features and for aligning
> with the schedule we declared in the policy. That makes it easier for
> upstream projects to gauge when a feature will be ready and when they can
> try it out.
>
> On Tue, Mar 13, 2018 at 3:10 PM, Greg Mann wrote:
>
> > Hi folks,
> > During the recent API working group meeting [1], we discussed the release
> > schedule. This has been a recurring topic of discussion in the developer
> > sync meetings, and while our official policy still specifies time-based
> > releases at a bi-monthly cadence, in practice we tend to gate our releases
> > on the completion of certain features, and our releases go out on a
> > less-frequent basis. Here are the dates of our last few release blog posts,
> > which I'm assuming correlate pretty well with the actual release dates:
> >
> > 1.5.0: 2/8/18
> > 1.4.0: 9/18/17
> > 1.3.0: 6/7/17
> > 1.2.0: 3/8/17
> > 1.1.0: 11/10/16
> >
> > Our current cadence seems to be around 3-4 months between releases, while
> > our documentation states that we release every two months [2]. My primary
> > motivation here is to bring our documented policy in line with our
> > practice, whatever that may be. Do people think that we should attempt to
> > bring our release cadence more in line with our current stated policy, or
> > should the policy be changed to reflect our current practice?
> >
> > If we were to attempt to align with our stated policy for 1.6.0, then we
> > would release around April 8, which would probably mean cutting an RC
> > sometime around the end of March or beginning of April. This is very soon!
> > :)
> >
>
> > I'm currently working with Gastón on offer operation feedback, and I'm not
> > sure that we would have it ready in time for an early April release date.
> > Personally, I would be OK with this, since we could land the feature in
> > 1.7.0 in June. However, I'm not sure how well this schedule would work for
> > the features that other people are currently working on.
> >
>
> A highly important feature our org needs is resizing of persistent volumes.
> I think it has a good chance of making the stated 1.6 schedule.
>
>
> >
> > I'm curious to hear people's thoughts on this, developers and users alike!
> >
> > Cheers,
> > Greg
> >
> >
> > [1] https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit#
> > [2] http://mesos.apache.org/documentation/latest/versioning/#release-schedule
> >
>
>
>
> --
> Cheers,
>
> Zhitao Li
>
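
The cadence Greg describes can be checked directly from the blog post dates he lists. The following is a minimal sketch, not part of the original thread; it assumes the dates are in M/D/YY form and that the blog post dates approximate the real release dates, as Greg himself assumes.

```python
from datetime import date

# Release blog post dates as listed in Greg's message (written M/D/YY there).
releases = [
    ("1.1.0", date(2016, 11, 10)),
    ("1.2.0", date(2017, 3, 8)),
    ("1.3.0", date(2017, 6, 7)),
    ("1.4.0", date(2017, 9, 18)),
    ("1.5.0", date(2018, 2, 8)),
]


def release_gaps(items):
    """Return (version, days since the previous release) for every release after the first."""
    return [
        (version, (day - previous_day).days)
        for (_, previous_day), (version, day) in zip(items, items[1:])
    ]


gaps = release_gaps(releases)
average_days = sum(days for _, days in gaps) / len(gaps)

for version, days in gaps:
    print(f"{version}: {days} days after the previous minor release")
print(f"average gap: {average_days:.1f} days")
```

The gaps come out at 118, 91, 103, and 143 days, averaging about 114 days, which is closer to Jie's proposed three-month minimum than to the documented two-month cadence.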


Re: Re: Re: Status update: task 1 is in state TASK_ERROR

2018-03-16 Thread Benjamin Mahler
What kind of tasks are you trying to run?

If you want to run commands or containers, you can just use the built-in
DEFAULT executor:
https://github.com/apache/mesos/blob/1.5.0/include/mesos/v1/mesos.proto#L713-L725

If you need a custom executor because your tasks are not commands or
containers, then you can implement your own custom executor:
https://github.com/apache/mesos/blob/1.5.0/include/mesos/v1/mesos.proto#L727-L730

In the latter case, you will have to implement your own executor or use an
existing third party executor. If implementing your own, you need to speak
the v1 protocol to the agent. We maintain a listing of known executor API
libraries here:
http://mesos.apache.org/documentation/latest/api-client-libraries/#executor-api
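
For the custom executor case, "speaking the v1 protocol" starts with the executor sending a SUBSCRIBE call to the agent; until that happens the agent reports the executor as unregistered. The sketch below only builds the request body and URL and sends nothing. It assumes the agent exposes `MESOS_FRAMEWORK_ID`, `MESOS_EXECUTOR_ID`, and `MESOS_AGENT_ENDPOINT` in the executor's environment and that the endpoint path is `/api/v1/executor`, as described in the Mesos executor HTTP API documentation; the example values are hypothetical, and an SSL-enabled agent would need `https` instead of `http`.

```python
import json


def build_subscribe_call(env):
    """Build the JSON body of a v1 executor SUBSCRIBE call from the executor's environment."""
    return {
        "type": "SUBSCRIBE",
        "framework_id": {"value": env["MESOS_FRAMEWORK_ID"]},
        "executor_id": {"value": env["MESOS_EXECUTOR_ID"]},
        "subscribe": {
            # Empty on first start; a reconnecting executor would list
            # tasks and status updates the agent has not acknowledged yet.
            "unacknowledged_tasks": [],
            "unacknowledged_updates": [],
        },
    }


def executor_api_url(env):
    """Return the agent URL the SUBSCRIBE call would be POSTed to (plain HTTP assumed)."""
    return "http://" + env["MESOS_AGENT_ENDPOINT"] + "/api/v1/executor"


# Hypothetical values standing in for what the agent would set.
example_env = {
    "MESOS_FRAMEWORK_ID": "framework-0000",
    "MESOS_EXECUTOR_ID": "executor-1",
    "MESOS_AGENT_ENDPOINT": "10.0.0.5:5051",
}

print(executor_api_url(example_env))
print(json.dumps(build_subscribe_call(example_env), indent=2))
```

An executor that never sends this call within the agent's registration timeout would be expected to produce the `REASON_EXECUTOR_REGISTRATION_TIMEOUT` failure quoted below.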

On Thu, Mar 15, 2018 at 2:32 AM, 罗 辉 wrote:

> Hi guys:
>
> For more info, my framework app’s log and the master/agent logs are
> attached.
>
> My app fails as shown at the end of the log:
>
> The message of current task is :Executor did not register within 1mins
>
> Status update: task 1 is in state TASK_FAILED
>
> Aborting because task 1 is in unexpected state TASK_FAILED with reason
> 'REASON_EXECUTOR_REGISTRATION_TIMEOUT' from source 'SOURCE_AGENT' with
> message 'Executor did not register within 1mins'
>
>
>
> My thoughts on this failure:
>
> 1. I guess there should be a V1 executor class with a register method that
> registers the executor with the agent?
>
> 2. I studied the V0 executor implementation and tried to implement a V1
> executor, which I assumed would extend an executor interface and implement
> its abstract methods, including register, reregister, etc. However, I
> didn’t find a V1 executor interface in the Java API. Does that mean I am
> heading in the wrong direction?
>
>
>
> In short, any ideas about the REASON_EXECUTOR_REGISTRATION_TIMEOUT
> failure?
>
>
>
> San
>
>
>
> *From:* 罗 辉
> *Sent:* March 14, 2018 15:29
> *To:* user
> *Subject:* Re: Re: Status update: task 1 is in state TASK_ERROR
>
>
>
> Thanks Benjamin,
>
> I tried to understand the missing reservation metadata and looked up the
> relevant docs on resource reservation, but I didn't find much
> documentation about it.
>
> I solved this problem by adding a method like the one below to my scheduler:
>
>   // Wrap a single task in an ACCEPT call with one LAUNCH operation.
>   def launchTask(offer: Offer, task: TaskInfo): Call = {
>     Call.newBuilder()
>       .setFrameworkId(frameworkId)
>       .setType(Call.Type.ACCEPT)
>       .setAccept(
>         Call.Accept.newBuilder()
>           .addOfferIds(offer.getId)
>           .addOperations(
>             Offer.Operation.newBuilder()
>               .setType(Offer.Operation.Type.LAUNCH)
>               .setLaunch(
>                 Offer.Operation.Launch.newBuilder()
>                   .addTaskInfos(task))))
>       .build()
>   }
>
>
>
> After that I hit another problem: my task always stays in staging and
> terminates after 1 min due to a timeout. I think a scheduler app involves
> many small steps, including callbacks such as connect, register, get the
> offer list, accept an offer, etc. Is there a detailed programming guide for
> V1 framework development?
>
>
>
> Thank you.
>
>
>
>
>
> San
>
>
> --
>
> *From:* Benjamin Mahler
> *Sent:* March 10, 2018 9:00:55
> *To:* user
> *Subject:* Re: Re: Status update: task 1 is in state TASK_ERROR
>
>
>
> The message clarifies it, the task+executor have some unreserved
> resources:
>
> cpus(allocated: controller):6; mem(allocated: controller):8000
>
>
>
> But the resources offered were reserved:
>
> cpus(allocated: controller)(reservations: [(STATIC,controller)]):6;
> mem(allocated: controller)(reservations: [(STATIC,controller)]):8000; +
> disk + ports
>
>
>
> The scheduler needs to provide resources that are contained in the offer;
> in this case it needs to include the missing reservation metadata.
>
>
>
> On Thu, Mar 8, 2018 at 6:57 PM, 罗 辉  wrote:
>
> Yes, I modified my code as below:
>
>   def acknowledgeTaskMessage(taskStatus: TaskStatus): String = {
>     taskStatus.getMessage
>   }
>
>   def update(mesos: Mesos, status: TaskStatus) = {
>     val message = acknowledgeTaskMessage(status)
>     println("The message of current task is :" + message)
>     println("Status update: task " + status.getTaskId().getValue() +
>       " is in state " + status.getState().getValueDescriptor().getName())
>
>
> ..
>
>
>
> And I got the log below, starting at line 231 of the attached file:
>
> 231 Received an UPDATE event
> 232 The message of current task is :Total resources cpus(allocated:
> controller):6; mem(allocated: controller):8000 required by task and its
> executor is more than available cpus(allocated: controller)(reservations:
> [(STATIC,controller)]):6; mem(allocated: controller)(reservations:
> [(STATIC,controller)]):8000; disk(allocated: controller)(reservations:
> [(STATIC,controller)]):550264; ports(allocated:
> controller):[31000-32000]
> 233 Status update: task 1 is in state TASK_ERROR
>
>
>
>
>
> 罗辉
>
> Infrastructure
> --
>
> *From:* Benjamin Mahler
> *
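
Benjamin's point about the missing reservation metadata can be illustrated with a small sketch. It is not code from the thread: it models resources as plain dictionaries shaped like the v1 JSON `Resource` message, and `take_from_offer` is a hypothetical helper. The idea is to copy resources out of the offer and only shrink the amount, so `reservations` and `allocation_info` travel with the task unchanged.

```python
import copy


def take_from_offer(offered_resources, wanted):
    """Copy scalar resources out of an offer, keeping every metadata field.

    `wanted` maps a resource name (e.g. "cpus") to the amount the task needs.
    Raises ValueError if the offer cannot cover the request.
    """
    taken = []
    remaining = dict(wanted)
    for resource in offered_resources:
        need = remaining.get(resource["name"], 0)
        if resource.get("type") != "SCALAR" or need <= 0:
            continue
        amount = min(need, resource["scalar"]["value"])
        piece = copy.deepcopy(resource)  # keeps reservations and allocation_info
        piece["scalar"]["value"] = amount
        taken.append(piece)
        remaining[resource["name"]] = need - amount
    missing = {name: value for name, value in remaining.items() if value > 0}
    if missing:
        raise ValueError(f"offer does not cover: {missing}")
    return taken


# Offer shaped like the one in the error message: statically reserved for "controller".
static_reservation = [{"type": "STATIC", "role": "controller"}]
offer_resources = [
    {"name": "cpus", "type": "SCALAR", "scalar": {"value": 6},
     "allocation_info": {"role": "controller"}, "reservations": static_reservation},
    {"name": "mem", "type": "SCALAR", "scalar": {"value": 8000},
     "allocation_info": {"role": "controller"}, "reservations": static_reservation},
]

task_resources = take_from_offer(offer_resources, {"cpus": 6, "mem": 8000})
for resource in task_resources:
    print(resource["name"], resource["scalar"]["value"], resource["reservations"])
```

Building the task's resources from scratch with only a name and a value would drop the `reservations` field, which matches the mismatch between "required" and "available" resources in the TASK_ERROR message.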

Re: Troubleshooting Mesos SSL setup

2018-03-16 Thread Renan DelValle
Follow-up: we weren't able to get our wildcard certificate working, but we
did get it to work when we used a certificate for a single hostname.

Also, our hostname was too long (over 64 bytes).

Hope that helps someone else who runs into this issue.

-Renan

On Fri, Mar 16, 2018 at 10:36 AM, Renan DelValle 
wrote:

> Hi all,
>
> We're trying to set up Mesos with SSL. We've compiled Mesos with SSL
> support and deployed it to the right boxes.
>
> Unfortunately, after setting up all the correct environment variables,
> we get the following error:
>
> I0315 17:48:30.54186520 libevent_ssl_socket.cpp:1105] Could not
>> determine hostname of peer: Unknown error
>> I0315 17:48:30.54193720 libevent_ssl_socket.cpp:1120] Failed accept,
>> verification error: Cannot verify peer certificate: peer hostname unknown
>> * GnuTLS recv error (-110): The TLS connection was non-properly
>> terminated.
>> * Closing connection 0
>> curl: (56) GnuTLS recv error (-110): The TLS connection was non-properly
>> terminated.
>
>
> Any chance someone knows what these errors mean and how we can fix the
> underlying issue?
>
> Thanks!
>
> -Renan
>
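
The hostname-length problem mentioned in the follow-up can be caught before a certificate is issued. This is a minimal sketch, assuming the 64-byte limit Renan hit is the 64-character upper bound that RFC 5280 places on the X.509 Common Name field; the thread does not say where the limit came from, and the hostnames used here are hypothetical.

```python
# Upper bound on the X.509 Common Name (ub-common-name in RFC 5280).
MAX_COMMON_NAME_BYTES = 64


def common_name_problems(hostname):
    """Return a list of reasons why `hostname` may not fit in a certificate Common Name."""
    problems = []
    length = len(hostname.encode("utf-8"))
    if length > MAX_COMMON_NAME_BYTES:
        problems.append(
            f"{length} bytes long, over the {MAX_COMMON_NAME_BYTES}-byte Common Name limit"
        )
    return problems


for name in ["agent1.example.com", "agent-" + "x" * 60 + ".example.com"]:
    issues = common_name_problems(name)
    print(name, "->", issues if issues else "ok")
```

A check like this only covers the length limit; it says nothing about why the wildcard certificate failed.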


Troubleshooting Mesos SSL setup

2018-03-16 Thread Renan DelValle
Hi all,

We're trying to set up Mesos with SSL. We've compiled Mesos with SSL
support and deployed it to the right boxes.

Unfortunately, after setting up all the correct environment variables, we
get the following error:

I0315 17:48:30.54186520 libevent_ssl_socket.cpp:1105] Could not
> determine hostname of peer: Unknown error
> I0315 17:48:30.54193720 libevent_ssl_socket.cpp:1120] Failed accept,
> verification error: Cannot verify peer certificate: peer hostname unknown
> * GnuTLS recv error (-110): The TLS connection was non-properly terminated.
> * Closing connection 0
> curl: (56) GnuTLS recv error (-110): The TLS connection was non-properly
> terminated.


Any chance someone knows what these errors mean and how we can fix the
underlying issue?

Thanks!

-Renan


Re: Release checksum file distribution change

2018-03-16 Thread Harold Dost
Would it be prudent to provide both for a while before completely removing
the MD5 files?



Harold Dost | @hdost



On Mon, Mar 12, 2018, 10:56 Benjamin Bannier wrote:

> Hi,
>
> this is a heads-up that future Mesos release checksum files will be SHA512,
> e.g., `mesos-1.6.0.tar.gz.sha512`. The previously used MD5 checksum files
> will not be used anymore for future releases.
>
> Please update any dependent tooling you have on your side accordingly.
>
>
> Best,
>
> Benjamin
>
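
For tooling that previously compared MD5 sums, switching to the new files could look roughly like the sketch below. It is an illustration rather than an official procedure, and it assumes the `.sha512` file contains the hex digest somewhere in its text, possibly upper-case or split by whitespace; the exact layout of the published files is not stated in the announcement.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file without loading it into memory at once."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(archive_path, checksum_path):
    """Return True if the archive's SHA-512 digest appears in the checksum file.

    Whitespace and letter case are ignored so that both a plain
    "<digest>  <filename>" line and a grouped, multi-line digest are accepted.
    """
    with open(checksum_path, "r", encoding="utf-8") as handle:
        normalized = "".join(handle.read().split()).lower()
    return sha512_of_file(archive_path) in normalized
```

A call such as `matches_checksum_file("mesos-1.6.0.tar.gz", "mesos-1.6.0.tar.gz.sha512")` would then replace the old MD5 comparison, with both paths pointing at locally downloaded files.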