I have my own framework. Sometimes I get a TASK_LOST status with a "slave lost during health check" message.
Also, I found that sometimes a Mesos agent is launched but the master doesn’t show it. On the agent side I see that it found the master and connected. After an agent restart, it starts working.

-Kiril

> On Dec 16, 2016, at 21:58, Zameer Manji <[email protected]> wrote:
>
> Hey,
>
> Could you detail on what you mean by "delays and health check problems"? Are
> you using your own framework or an existing one? How are you launching the
> tasks?
>
> Could you share logs from Mesos that show timeouts to ZK?
>
> For reference, I operate a large Mesos cluster and I have never encountered
> problems when running 1k tasks concurrently so I think sharing data would
> help everyone debug this problem.
>
> On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov <[email protected]> wrote:
> Hi,
>
> Does any body try to run Mesos on AWS instances? Can you give me
> recommendations.
>
> I am developing elastic (scale aws instances on demand) Mesos cluster.
> Currently I have 3 master instances. I run about 1000 tasks simultaneously. I
> see delays and health check problems.
>
> ~400 tasks fits in one m4.10xlarge instance. (160GB RAM, 40 CPU).
>
> At the moment I increase time out in ZooKeeper cluster. What can I do to
> decrease timeouts?
>
> Also how can I increase performance? The main bottleneck is what I have the
> big amount of tasks(run simultaneously) for an hour after I shutdown them or
> restart (depends how good them perform).
>
> -Kiril
>
> --
> Zameer Manji

