+1 to both A and B. Do we plan to eventually drop support for non-checkpointed frameworks (possibly in v2) and declare that all frameworks have to operate under this assumption?

On Mon, Oct 17, 2016 at 1:36 AM, Aaron Carey <aca...@ilm.com> wrote:

> +1 to A and B
>
> Aaron Carey
> Production Engineer - Cloud Pipeline
> Industrial Light & Magic
> London
> 020 3751 9150
>
>
> On 17 October 2016 at 00:38, Qian Zhang <zhq527...@gmail.com> wrote:
>
>> and requires operators to enable checkpointing on the slaves.
>>
>>
>> Just curious why operator needs to enable checkpointing on the slaves (I
>> do not see an agent flag for that), I think checkpointing should be
>> enabled in framework level rather than slave.
>>
>>
>> Thanks,
>> Qian Zhang
>>
>> On Sun, Oct 16, 2016 at 10:18 AM, Zameer Manji <zma...@apache.org> wrote:
>>
>>> +1 to A and B
>>>
>>> Aurora has enabled checkpointing for years and requires operators to
>>> enable checkpointing on the slaves.
>>>
>>> On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere
>>> <jo...@mesosphere.io> wrote:
>>>
>>> > I'm in favor of A & B. I find it provides a better "first experience"
>>> > to users.
>>> > From my experience you usually have to have an explicit reason to not
>>> > want to checkpoint. Most people assume the semantics provided by the
>>> > checkpoint behavior is default and it can be a frustrating experience
>>> > for them to find out that is not the case.
>>> >
>>> > --
>>> > *Joris Van Remoortere*
>>> > Mesosphere
>>> >
>>> > On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway <neil.con...@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi folks,
>>> >>
>>> >> I'd like input from individuals who currently use frameworks but do
>>> >> not enable checkpointing.
>>> >>
>>> >> Background: "checkpointing" is a parameter that can be enabled in
>>> >> FrameworkInfo; if enabled, the agent will write the framework pid,
>>> >> executor PIDs, and status updates to disk for any tasks started by
>>> >> that framework. This checkpointed information means that these tasks
>>> >> can survive an agent crash: if the agent exits (whether due to
>>> >> crashing or as part of an upgrade procedure), a restarted agent can
>>> >> use this information to reconnect to executors started by the previous
>>> >> instance of the agent. The downside is that checkpointing requires
>>> >> some additional disk I/O at the agent.
>>> >>
>>> >> Checkpointing is not currently the default, but in my experience it is
>>> >> often enabled for production frameworks. As part of the work on
>>> >> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>>> >> considering:
>>> >>
>>> >> (a) requiring that partition-aware frameworks must also enable
>>> >> checkpointing, and/or
>>> >> (b) enabling checkpointing by default
>>> >>
>>> >> If you have intentionally decided to disable checkpointing for your
>>> >> Mesos framework, I'd be curious to hear more about your use-case and
>>> >> why you haven't enabled it.
>>> >>
>>> >> Thanks!
>>> >>
>>> >> Neil
>>> >>
>>> >> --
>>> >> Zameer Manji
>>> >>
>>> >
>>>
>>
>

--
Cheers,
Zhitao Li