Alex, unfortunately I lost my previous logs. But I plan to run heavy performance tests of the system, so if I find something I'll come back.
Actually, I have the old setup and can do a separate test over the holidays.

Thanks,
-Kiril

> On Dec 21, 2016, at 22:57, Alex Rukletsov <[email protected]> wrote:
>
> Kiril—
>
> from what you described it does not sound like the problem is the Linux
> distribution. It may be your AWS configuration. However, if a combination of
> health checks and a heavily loaded agent leads to agent termination — I would
> like to investigate this issue. Please come back—with logs!—if you see the
> issue again.
>
> On Tue, Dec 20, 2016 at 3:46 PM, Kiril Menshikov <[email protected]> wrote:
> Hey,
>
> Sorry for the delayed response. I reinstalled my AWS infrastructure. Now I
> install everything on Red Hat Linux; before, I used Amazon Linux.
>
> I tested with a single master (m4.large). Everything works perfectly. I am
> not sure if it was Amazon Linux or my old configuration.
>
> Thanks,
> -Kirils
>
> On 18 December 2016 at 14:03, Guillermo Rodriguez <[email protected]> wrote:
> Hi,
> I run my Mesos cluster in AWS, between 40 and 100 m4.2xlarge instances at
> any time, with between 200 and 1500 jobs at any time. Slaves run as spot
> instances.
>
> So, the only moment I get a TASK_LOST is when I lose a spot instance due to
> being outbid.
>
> I guess you may also lose instances due to an AWS autoscaler scale-in
> procedure: for example, if it decides the cluster is underutilised it can
> kill any instance in your cluster, not necessarily the least used one.
> That's the reason we decided to develop our customised autoscaler that
> detects and kills specific instances based on our own rules.
>
> So, are you using spot fleets or spot instances? Have you set up your
> scale-in procedures correctly?
>
> Also, if you are running fine-grained tiny jobs (400 jobs on a 10xlarge
> means 0.1 CPUs and 400MB RAM each), I recommend you avoid an m4.10xlarge
> instance and run xlarge instances instead. Same price, and if you lose one
> you just lose 1/10th of your jobs.
>
> Luck!
>
> From: "haosdent" <[email protected]>
> Sent: Saturday, December 17, 2016 6:12 PM
> To: "user" <[email protected]>
> Subject: Re: Mesos on AWS
>
> > sometimes Mesos agent is launched but master doesn’t show them.
> It sounds like the Master could not connect to your Agents. Would you mind
> pasting your Mesos Master log? Is there any information in it showing that
> Mesos agents are disconnected?
>
> On Sat, Dec 17, 2016 at 4:08 AM, Kiril Menshikov <[email protected]> wrote:
> I have my own framework. Sometimes I get TASK_LOST status with the message
> "slave lost during health check".
>
> Also I found that sometimes a Mesos agent is launched but the master doesn’t
> show it. From the agent I see that it found the master and connected. After
> an agent restart it starts working.
>
> -Kiril
>
>> On Dec 16, 2016, at 21:58, Zameer Manji <[email protected]> wrote:
>>
>> Hey,
>>
>> Could you detail what you mean by "delays and health check problems"? Are
>> you using your own framework or an existing one? How are you launching the
>> tasks?
>>
>> Could you share logs from Mesos that show timeouts to ZK?
>>
>> For reference, I operate a large Mesos cluster and I have never encountered
>> problems when running 1k tasks concurrently, so I think sharing data would
>> help everyone debug this problem.
>>
>> On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov <[email protected]> wrote:
>> Hi,
>>
>> Has anybody tried to run Mesos on AWS instances? Can you give me
>> recommendations?
>>
>> I am developing an elastic (scaling AWS instances on demand) Mesos cluster.
>> Currently I have 3 master instances. I run about 1000 tasks simultaneously,
>> and I see delays and health check problems.
>>
>> ~400 tasks fit in one m4.10xlarge instance (160GB RAM, 40 CPUs).
>>
>> At the moment I am increasing the timeouts in the ZooKeeper cluster. What
>> can I do to decrease the timeouts?
>>
>> Also, how can I increase performance? The main bottleneck is that I have a
>> large number of tasks (running simultaneously) for an hour, after which I
>> shut them down or restart them (depending on how well they perform).
>>
>> -Kiril
>>
>> --
>> Zameer Manji
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>
> --
> Thanks,
> -Kiril
> Phone +37126409291
> Riga, Latvia
> Skype perimetr122
>
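For the "agent is running but the master doesn't show it" symptom discussed above, one quick check is the master's /slaves HTTP endpoint, which lists the agents the master knows about and whether it currently considers them active. A minimal sketch in Python, assuming the master's HTTP port is the default 5050 and using a hypothetical hostname:

    import json
    import urllib.request

    # Hypothetical master address; point this at one of your masters
    # (ideally the current leader).
    MASTER_URL = "http://mesos-master-1.example.com:5050"

    with urllib.request.urlopen(MASTER_URL + "/slaves") as resp:
        state = json.load(resp)

    for agent in state.get("slaves", []):
        # "active" should be false for agents the master considers disconnected.
        status = "active" if agent.get("active") else "disconnected"
        print(agent.get("id"), agent.get("hostname"), status)

If an agent that logs a successful registration never shows up here, the master log around that registration attempt is the next thing to paste, as haosdent suggested.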
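On the TASK_LOST-during-health-check side: if the checks themselves add load to an already busy agent (the combination Alex mentions), one thing a custom framework can do is relax the health-check timings on the tasks it launches. Below is a rough sketch of the relevant fields; the names follow the Mesos HealthCheck protobuf, but the values and the check command are purely illustrative, not recommendations:

    # Illustrative health-check settings a custom framework might put into
    # the "health_check" field of a TaskInfo; tune the numbers to your workload.
    health_check = {
        "type": "COMMAND",
        "command": {"value": "curl -fsS http://localhost:8080/health"},  # hypothetical check
        "delay_seconds": 15,          # wait before the first check runs
        "interval_seconds": 30,       # check less often on a loaded agent
        "timeout_seconds": 20,        # allow slower responses under load
        "grace_period_seconds": 120,  # ignore failures while the task warms up
        "consecutive_failures": 5,    # tolerate transient failures before killing the task
    }

If it is the agent itself that keeps being marked lost under load, the master-side settings (for example --slave_ping_timeout and --max_slave_ping_timeouts in the 1.x flag names, plus --zk_session_timeout) are the other knobs worth reviewing before raising ZooKeeper's own timeouts further.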

