Re: Dedup mesos agent status updates at framework

2018-10-29 Thread Benjamin Mahler
The timeout behavior sounds like a dangerous scalability tripwire. Consider revisiting that approach.
On Sun, Oct 28, 2018 at 10:42 PM Varun Gupta wrote:
> Mesos Version: 1.6
>
> scheduler has 250k events in its queue: Master sends status updates
> to scheduler, and scheduler stores them

Re: Dedup mesos agent status updates at framework

2018-10-28 Thread Varun Gupta
Mesos Version: 1.6. The scheduler has 250k events in its queue: the master sends status updates to the scheduler, and the scheduler stores them in the queue. The scheduler processes them in FIFO order, and once an update is processed (which includes persisting it to the DB) it acks the update. These updates are processed asynchronously with a thread

Re: Dedup mesos agent status updates at framework

2018-10-28 Thread Benjamin Mahler
Which version of Mesos are you running?
> In framework, event updates grow up to 250k
What does this mean? The scheduler has 250k events in its queue?
> which leads to cascading effect on higher latency at Mesos Master (ack requests with 10s timeout)
Can you send us perf stacks of the master

Re: Dedup mesos agent status updates at framework

2018-10-28 Thread Varun Gupta
Hi Benjamin, In our batch workload use case, task churn is pretty high. We have seen 20-30k tasks launch within a 10-second window, with 100k+ tasks running. In the framework, event updates grow up to 250k, which leads to a cascading effect of higher latency at the Mesos master (ack requests with

Re: Dedup mesos agent status updates at framework

2018-10-28 Thread Benjamin Mahler
Hi Varun, What problem are you trying to solve precisely? There seems to be an implication that the duplicate acknowledgements are expensive. They should be low cost, so that's rather surprising. Do you have any data related to this? You can also tune the backoff rate on the agents, if the

Re: Dedup mesos agent status updates at framework

2018-10-28 Thread Varun Gupta
> Hi,
>
> Mesos agent will send status updates with exponential backoff until ack is
> received.
>
> If processing of events at framework and sending ack to Master is running
> slow then it builds a back pressure at framework due to duplicate updates
> for same status.
>
> Has someone explored the
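
For illustration only, here is a minimal sketch of what deduplicating at the framework could look like. It assumes that a retried update for the same status carries the same task ID and UUID as the original, so the pair can serve as a dedup key. `StatusUpdate` and `DedupingUpdateQueue` are hypothetical names, not part of any Mesos API, and `persist` and `ack` stand in for whatever the scheduler already does to store an update and acknowledge it to the master.

```python
import queue
import threading
from typing import Callable, NamedTuple, Set, Tuple


class StatusUpdate(NamedTuple):
    # Hypothetical stand-in for a task status update; only the fields
    # needed for deduplication are modelled here.
    task_id: str
    uuid: bytes
    state: str


class DedupingUpdateQueue:
    """FIFO queue that drops retried updates that are already waiting."""

    def __init__(self) -> None:
        self._queue: "queue.Queue[StatusUpdate]" = queue.Queue()
        self._pending: Set[Tuple[str, bytes]] = set()
        self._lock = threading.Lock()

    def offer(self, update: StatusUpdate) -> bool:
        """Enqueue the update; return False if the same update is pending."""
        key = (update.task_id, update.uuid)
        with self._lock:
            if key in self._pending:
                return False  # duplicate retry, nothing new to process
            self._pending.add(key)
        self._queue.put(update)
        return True

    def process_one(
        self,
        persist: Callable[[StatusUpdate], None],
        ack: Callable[[StatusUpdate], None],
    ) -> None:
        """Take the oldest update, persist it, then acknowledge it."""
        update = self._queue.get()
        try:
            persist(update)
            ack(update)
        finally:
            # Forget the key even on failure: the agent keeps retrying
            # until it sees an ack, so the update will be offered again.
            with self._lock:
                self._pending.discard((update.task_id, update.uuid))

    def __len__(self) -> int:
        return self._queue.qsize()
```

With this shape, the queue length is bounded by the number of distinct unacknowledged updates instead of growing with every retry. A retry that arrives after the ack has been sent is simply processed and acked again, which should be harmless if duplicate acknowledgements are as cheap as suggested in the replies.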