Alex, unfortunately I lost my previous logs. But I plan to run heavy performance tests of the system, so if I find something I'll come back.
Actually, I have the old setup and can do a separate test over the holidays.

Thanks,
-Kiril

> On Dec 21, 2016, at 22:57, Alex Rukletsov <[email protected]> wrote:
>
> Kiril—
>
> from what you described it does not sound like the problem is the Linux
> distribution. It may be your AWS configuration. However, if a combination of
> health checks and a heavily loaded agent leads to agent termination — I would
> like to investigate this issue. Please come back—with logs!—if you see the
> issue again.
>
> On Tue, Dec 20, 2016 at 3:46 PM, Kiril Menshikov <[email protected]> wrote:
> Hey,
>
> Sorry for the delayed response. I reinstalled my AWS infrastructure. Now I
> install everything on Red Hat Linux; before, I used Amazon Linux.
>
> I tested with a single master (m4.large). Everything works perfectly. I am
> not sure if it was Amazon Linux or my old configuration.
>
> Thanks,
> -Kirils
>
> On 18 December 2016 at 14:03, Guillermo Rodriguez <[email protected]> wrote:
> Hi,
> I run my Mesos cluster in AWS, between 40 and 100 m4.2xlarge instances at
> any time, with between 200 and 1500 jobs at any time. Slaves run as spot
> instances.
>
> So, the only moment I get a TASK_LOST is when I lose a spot instance due to
> being outbid.
>
> I guess you may also lose instances due to an AWS autoscaler scale-in
> procedure: for example, if it decides the cluster is underutilised it can
> kill any instance in your cluster, not necessarily the least used one.
> That's the reason we decided to develop our customised autoscaler that
> detects and kills specific instances based on our own rules.
>
> So, are you using spot fleets or spot instances? Have you set up your
> scale-in procedures correctly?
>
> Also, if you are running fine-grained tiny jobs (400 jobs on a 10xlarge
> means 0.1 CPUs and 400MB RAM each), I recommend you avoid an m4.10xlarge
> instance and run xlarge instances instead. Same price, and if you lose one
> you just lose 1/10th of your jobs.
>
> Luck!
>
> From: "haosdent" <[email protected]>
> Sent: Saturday, December 17, 2016 6:12 PM
> To: "user" <[email protected]>
> Subject: Re: Mesos on AWS
>
> > sometimes Mesos agent is launched but master doesn’t show them.
> It sounds like the Master could not connect to your Agents. Would you mind
> pasting your Mesos Master log? Is there any information in it showing that
> Mesos agents are disconnected?
>
> On Sat, Dec 17, 2016 at 4:08 AM, Kiril Menshikov <[email protected]> wrote:
> I have my own framework. Sometimes I get TASK_LOST status with the message
> "slave lost during health check".
>
> Also I found that sometimes a Mesos agent is launched but the master doesn’t
> show it. From the agent I see that it found the master and connected. After
> an agent restart it starts working.
>
> -Kiril
>
>> On Dec 16, 2016, at 21:58, Zameer Manji <[email protected]> wrote:
>>
>> Hey,
>>
>> Could you detail what you mean by "delays and health check problems"? Are
>> you using your own framework or an existing one? How are you launching the
>> tasks?
>>
>> Could you share logs from Mesos that show timeouts to ZK?
>>
>> For reference, I operate a large Mesos cluster and I have never encountered
>> problems when running 1k tasks concurrently, so I think sharing data would
>> help everyone debug this problem.
>>
>> On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov <[email protected]> wrote:
>> Hi,
>>
>> Has anybody tried to run Mesos on AWS instances? Can you give me
>> recommendations?
>>
>> I am developing an elastic (scaling AWS instances on demand) Mesos cluster.
>> Currently I have 3 master instances. I run about 1000 tasks simultaneously,
>> and I see delays and health check problems.
>>
>> ~400 tasks fit in one m4.10xlarge instance (160GB RAM, 40 CPUs).
>>
>> At the moment I am increasing the timeouts in the ZooKeeper cluster. What
>> can I do to decrease the timeouts?
>>
>> Also, how can I increase performance? The main bottleneck is that I have a
>> large number of tasks (running simultaneously) for an hour, after which I
>> shut them down or restart them (depending on how well they perform).
>>
>> -Kiril
>>
>> --
>> Zameer Manji
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>
> --
> Thanks,
> -Kiril
> Phone +37126409291
> Riga, Latvia
> Skype perimetr122
>
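For the "agent is running but the master doesn't show it" symptom discussed above, one quick check is the master's /slaves HTTP endpoint, which lists the agents the master knows about and whether it currently considers them active. A minimal sketch in Python, assuming the master's HTTP port is the default 5050 and using a hypothetical hostname:

    import json
    import urllib.request

    # Hypothetical master address; point this at one of your masters
    # (ideally the current leader).
    MASTER_URL = "http://mesos-master-1.example.com:5050"

    with urllib.request.urlopen(MASTER_URL + "/slaves") as resp:
        state = json.load(resp)

    for agent in state.get("slaves", []):
        # "active" should be false for agents the master considers disconnected.
        status = "active" if agent.get("active") else "disconnected"
        print(agent.get("id"), agent.get("hostname"), status)

If an agent that logs a successful registration never shows up here, the master log around that registration attempt is the next thing to paste, as haosdent suggested.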
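On the TASK_LOST-during-health-check side: if the checks themselves add load to an already busy agent (the combination Alex mentions), one thing a custom framework can do is relax the health-check timings on the tasks it launches. Below is a rough sketch of the relevant fields; the names follow the Mesos HealthCheck protobuf, but the values and the check command are purely illustrative, not recommendations:

    # Illustrative health-check settings a custom framework might put into
    # the "health_check" field of a TaskInfo; tune the numbers to your workload.
    health_check = {
        "type": "COMMAND",
        "command": {"value": "curl -fsS http://localhost:8080/health"},  # hypothetical check
        "delay_seconds": 15,          # wait before the first check runs
        "interval_seconds": 30,       # check less often on a loaded agent
        "timeout_seconds": 20,        # allow slower responses under load
        "grace_period_seconds": 120,  # ignore failures while the task warms up
        "consecutive_failures": 5,    # tolerate transient failures before killing the task
    }

If it is the agent itself that keeps being marked lost under load, the master-side settings (for example --slave_ping_timeout and --max_slave_ping_timeouts in the 1.x flag names, plus --zk_session_timeout) are the other knobs worth reviewing before raising ZooKeeper's own timeouts further.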

