I am using my own framework. Sometimes a task gets a TASK_LOST status with a 
message that the slave was lost during a health check.

I have also found that sometimes a Mesos agent is launched but the master 
doesn't show it. On the agent side I can see that it found the master and 
connected. After I restart the agent, it starts working.

-Kiril


> On Dec 16, 2016, at 21:58, Zameer Manji <[email protected]> wrote:
> 
> Hey,
> 
> Could you elaborate on what you mean by "delays and health check problems"? 
> Are you using your own framework or an existing one? How are you launching 
> the tasks?
> 
> Could you share logs from Mesos that show timeouts to ZK?
> 
> For reference, I operate a large Mesos cluster and I have never encountered 
> problems when running 1k tasks concurrently so I think sharing data would 
> help everyone debug this problem.
> 
> On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov <[email protected]> wrote:
> Hi,
> 
> Has anybody tried running Mesos on AWS instances? Can you give me any 
> recommendations?
> 
> I am developing an elastic Mesos cluster (scaling AWS instances on demand). 
> Currently I have 3 master instances. I run about 1000 tasks simultaneously. I 
> see delays and health check problems.
> 
> ~400 tasks fit in one m4.10xlarge instance (160 GB RAM, 40 CPUs).
> 
> For now I have increased the timeout in the ZooKeeper cluster. What can I do 
> to reduce the timeouts?
> 
> Also, how can I increase performance? The main bottleneck is that I have a 
> large number of tasks running simultaneously for an hour, after which I shut 
> them down or restart them (depending on how well they perform).
> 
> -Kiril
> 
> -- 
> Zameer Manji
