Re: Question regarding docker containerizer

2016-02-25 Thread Jojy Varghese
Hi Pradeep,

The relevant code, if you are interested, is at:

https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L3561
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L3672


-jojy
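
A minimal sketch of the check Jie suggests below — assuming Python 3, an unauthenticated master, and an illustrative address (mesos-master.example.com:5050 is a placeholder, not something from this thread):

import json
import urllib.request

MASTER = "http://mesos-master.example.com:5050"  # illustrative address

with urllib.request.urlopen(MASTER + "/state.json") as response:
    state = json.loads(response.read().decode("utf-8"))

for framework in state.get("frameworks", []):
    for task in framework.get("tasks", []):
        # An empty executor_id means the scheduler sent a TaskInfo without an
        # ExecutorInfo, so the Docker image runs as a task (wrapped internally
        # by mesos-docker-executor); otherwise it runs as an executor.
        kind = "task" if task.get("executor_id", "") == "" else "executor"
        print(task["id"], "->", kind)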


> On Feb 25, 2016, at 4:10 PM, Pradeep Chhetri wrote:
> 
> Hi Jie,
> 
> I see executor_id as empty. Does this mean that it is running as a task and 
> there is no executor for it? However, in the ps output, I see that mesos-slave 
> is spawning a mesos-docker-executor, which in turn spawns the docker 
> command. Here are the details of one of the Marathon tasks:
> 
> {
>   "id": "logs_role_kibana.3b024b6e-dbf3-11e5-bc67-56847afe9799",
>   "name": "kibana.role.logs",
>   "framework_id": "20150904-093718-2198675372-5050-22379-",
>   "executor_id": "",
>   "slave_id": "a4f30c35-eee7-4110-a82f-cab5522c9b1b-S4",
>   "state": "TASK_RUNNING",
>   "resources": {
>     "cpus": 0.5,
>     "disk": 0,
>     "mem": 1024,
>     "ports": "[31050-31050]"
>   },
>   "statuses": [
>     {
>       "state": "TASK_RUNNING",
>       "timestamp": 1456427344.63661,
>       "labels": [
>         {
>           "key": "Docker.NetworkSettings.IPAddress",
>           "value": "172.17.0.11"
>         }
>       ],
>       "container_status": {
>         "network_infos": [
>           {
>             "ip_address": "172.17.0.11",
>             "ip_addresses": [
>               {
>                 "ip_address": "172.17.0.11"
>               }
>             ]
>           }
>         ]
>       }
>     }
>   ],
>   "container": {
>     "type": "DOCKER",
>     "docker": {
>       "image": "kibana:4.3.1",
>       "network": "BRIDGE",
>       "port_mappings": [
>         {
>           "host_port": 31050,
>           "container_port": 5601,
>           "protocol": "tcp"
>         }
>       ],
>       "privileged": false,
>       "parameters": [
>         {
>           "key": "publish-all",
>           "value": "true"
>         }
>       ],
>       "force_pull_image": true
>     }
>   }
> },
> 
> 
> 
> On Thu, Feb 25, 2016 at 11:59 PM, Jie Yu wrote:
> You can check out the state.json endpoint on the master.
> 
> On Thu, Feb 25, 2016 at 3:53 PM, Pradeep Chhetri wrote:
> Hello Jie,
> 
> Thank you for the quick reply. Sorry for asking a silly question. How can I 
> look up the TaskInfo of a running container? Can I see the TaskInfo details 
> in the Mesos master UI for a task?
> 
> On Thu, Feb 25, 2016 at 10:52 PM, Jie Yu wrote:
> You can take a look at the TaskInfo. If the TaskInfo does not have 
> ExecutorInfo set, then it's a task. Otherwise, Mesos will launch the executor 
> and send the task to the executor.
> 
> - Jie
> 
> On Thu, Feb 25, 2016 at 2:50 PM, Pradeep Chhetri wrote:
> Hello,
> 
> From the Docker containerizer documentation 
> (http://mesos.apache.org/documentation/latest/docker-containerizer/):
> 
> "Users can either launch a Docker image as a Task, or as an Executor."
> 
> How can I identify whether a Docker container started by, let's say, Marathon is 
> running as a task or as an executor?
> 
> 
> Thank you.
> 
> -Pradeep
> 
> 
> 
> 
> 
> 
> -- 
> Pradeep Chhetri
> 
> In the world of Linux, who needs Windows and Gates...
> 
> 
> 
> 
> -- 
> Pradeep Chhetri
> 
> In the world of Linux, who needs Windows and Gates...



Re: Mesos and Zookeeper TCP keepalive

2015-11-12 Thread Jojy Varghese
Sorry for confusing you. I meant that you could maybe change your 
“max_slave_ping_timeouts” / “slave_ping_timeout” values and re-enable snapshots.

-Jojy
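
For context, the window the master allows before removing an unresponsive agent is the product of those two flags. A small sketch of the arithmetic in Python (15secs and 5 are, as far as I know, the stock defaults — verify against your version; the tuned numbers are purely illustrative, not a recommendation):

def removal_window(slave_ping_timeout_secs, max_slave_ping_timeouts):
    # Seconds of silence the master tolerates before marking the agent lost.
    return slave_ping_timeout_secs * max_slave_ping_timeouts

print(removal_window(15, 5))    # defaults: 75 seconds
print(removal_window(30, 10))   # illustrative tuned values: 300 seconds of
                                # headroom to ride out a snapshot/quiesce pause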

> On Nov 12, 2015, at 3:30 PM, tommy xiao <xia...@gmail.com> wrote:
> 
> Hi Jojy
> 
> What do you mean by keeping the “snapshot/backup”? Could you please give some docs 
> for reference?
> 
> 2015-11-13 1:59 GMT+08:00 Jojy Varghese <j...@mesosphere.io>:
> Hi Jeremy
>  Good to hear that you have a solution. Was curious about the correlation 
> between snapshot creation and timeouts. Wondering if you can change 
> “max_slave_ping_timeouts” / "slave_ping_timeout" as Joris suggested and keep 
> the “snapshot/backup” also.
> 
> thanks
> Jojy
> 
> 
> > On Nov 11, 2015, at 6:04 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> >
> > Hi Joris, all,
> >
> > We are still at the default timeout values for those that you linked. In 
> > the meantime, since the community pushed us to look at other things besides 
> > evading firewall timeouts, we have disabled snapshot/backups on the VMs and 
> > this has resolved the issue for the past 24 hours on the control group that 
> > we have disabled, which has been the best behavior that we have ever seen. 
> > There was a very close correlation between snapshot creation and 
> > mesos-slave process restart (within minutes) that got us to this point. 
> > Apparently, the snapshot creation and quiesce of the filesystem cause 
> > enough disruption to trigger the default timeouts within mesos.
> >
> > We are fine with this solution because Mesos has enabled us to have a more 
> > heterogeneous fleet of servers and backups aren't needed on these hosts. 
> > Mesos for the win, there.
> >
> > Thanks to everyone that has contributed on this thread! It was a fun 
> > exercise for me, in the code. It was also useful to hear feedback from the 
> > list on places to look, eventually pushing me to a solution.
> > -Jeremy
> >
> > From: Joris Van Remoortere <jo...@mesosphere.io>
> > Sent: Wednesday, November 11, 2015 12:56 AM
> > To: user@mesos.apache.org
> > Subject: Re: Mesos and Zookeeper TCP keepalive
> >
> > Hi Jeremy,
> >
> > Can you read the description of these parameters on the master, and 
> > possibly share your values for these flags?
> >
> >
> > It seems from the re-registration attempt on the agent, that the master has 
> > already treated the agent as "failed", and so will tell it to shut down on 
> > any re-registration attempt.
> >
> > I'm curious if there is a conflict (or too narrow of a time gap) of 
> > timeouts in your environment to allow re-registration by the agent after 
> > the agent notices it needs to re-establish the connection.
> >
> > —
> > Joris Van Remoortere
> > Mesosphere
> >
> > On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> > Hi Tommy, Erik, all,
> >
> > You are correct in your assumption that I'm trying to solve for a one hour 
> > session expire time on a firewall. For some more background info, our 
> > master cluster is in datacenter X, the slaves in X will stay "up" for days 
> > and days. The slaves in a different datacenter, Y, connected to that master 
> > cluster will stay "up" for about a few days and restart. The master cluster 
> > is healthy, with a stable leader for months (no flapping), same for the ZK 
> > "leader". There are about 35 slaves in datacenter Y. Maybe the firewall 
> > session timer is a red herring because the slave restart is seemingly 
> > random (the slave with the highest uptime is 6 days, but a handful only 
> > have uptime of a day)
> >
> > I've started debugging this awhile ago, and the gist of the logs is here: 
> > https://gist.github.com/jolexa/1a80e26a4b017846d083 I've posted this back 
> > in October seeking help and Benjamin suggested network issues in both 
> > directions, so I thought firewall.
> >
> > Thanks for any hints,
> > Jeremy
> >
> > From: tommy xiao <xia...@gmail.com>
> > Sent: Tuesday, November 10, 2015 3:07 AM
> >
> > To: user@mesos.apache.org
> > Subject: Re: Mesos and Zookeeper TCP keepalive
> >

Re: Mesos and Zookeeper TCP keepalive

2015-11-12 Thread Jojy Varghese
Hi Jeremy
 Good to hear that you have a solution. Was curious about the correlation 
between snapshot creation and timeouts. Wondering if you can change 
“max_slave_ping_timeouts” / "slave_ping_timeout" as Joris suggested and keep 
the “snapshot/backup” also.

thanks
Jojy


> On Nov 11, 2015, at 6:04 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> 
> Hi Joris, all,
> 
> We are still at the default timeout values for those that you linked. In the 
> meantime, since the community pushed us to look at other things besides 
> evading firewall timeouts, we have disabled snapshot/backups on the VMs and 
> this has resolved the issue for the past 24 hours on the control group that 
> we have disabled, which has been the best behavior that we have ever seen. 
> There was a very close correlation between snapshot creation and mesos-slave 
> process restart (within minutes) that got us to this point. Apparently, the 
> snapshot creation and quiesce of the filesystem cause enough disruption to 
> trigger the default timeouts within mesos.
> 
> We are fine with this solution because Mesos has enabled us to have a more 
> heterogeneous fleet of servers and backups aren't needed on these hosts. 
> Mesos for the win, there.
> 
> Thanks to everyone that has contributed on this thread! It was a fun exercise 
> for me, in the code. It was also useful to hear feedback from the list on 
> places to look, eventually pushing me to a solution.
> -Jeremy
> 
> From: Joris Van Remoortere <jo...@mesosphere.io>
> Sent: Wednesday, November 11, 2015 12:56 AM
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>  
> Hi Jeremy,
> 
> Can you read the description of these parameters on the master, and possibly 
> share your values for these flags?
> 
> 
> It seems from the re-registration attempt on the agent, that the master has 
> already treated the agent as "failed", and so will tell it to shut down on 
> any re-registration attempt.
> 
> I'm curious if there is a conflict (or too narrow of a time gap) of timeouts 
> in your environment to allow re-registration by the agent after the agent 
> notices it needs to re-establish the connection.
> 
> — 
> Joris Van Remoortere
> Mesosphere
> 
> On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> Hi Tommy, Erik, all,
> 
> You are correct in your assumption that I'm trying to solve for a one hour 
> session expire time on a firewall. For some more background info, our master 
> cluster is in datacenter X, the slaves in X will stay "up" for days and days. 
> The slaves in a different datacenter, Y, connected to that master cluster 
> will stay "up" for about a few days and restart. The master cluster is 
> healthy, with a stable leader for months (no flapping), same for the ZK 
> "leader". There are about 35 slaves in datacenter Y. Maybe the firewall 
> session timer is a red herring because the slave restart is seemingly random 
> (the slave with the highest uptime is 6 days, but a handful only have uptime 
> of a day)
> 
> I've started debugging this awhile ago, and the gist of the logs is here: 
> https://gist.github.com/jolexa/1a80e26a4b017846d083 I've posted this back in 
> October seeking help and Benjamin suggested network issues in both 
> directions, so I thought firewall. 
> 
> Thanks for any hints,
> Jeremy
> 
> From: tommy xiao <xia...@gmail.com>
> Sent: Tuesday, November 10, 2015 3:07 AM
> 
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>  
> Same here, same question as Erik. Could you please provide more background 
> info? Thanks.
> 
> 2015-11-10 15:56 GMT+08:00 Erik Weathers <eweath...@groupon.com>:
> It would really help if you (Jeremy) explained the *actual* problem you are 
> facing.  I'm *guessing* that it's a firewall timing out the sessions because 
> there isn't activity on them for whatever the timeout of the firewall is?   
> It seems likely to be unreasonably short, given that mesos has constant 
> activity between master and 
> slave/agent/whatever-it-is-being-called-nowadays-but-not-really-yet-maybe-someday-for-reals.
> 
> - Erik
> 
> On Mon, Nov 9, 2015 at 10:00 PM, Jojy Varghese <j...@mesosphere.io> wrote:
> Hi Jeremy
>  It’s great that you are making progress, but I doubt this is what you 
> intend to achieve, since network failures are a valid state in distributed 
> systems. If you think there is a special case you are trying to solve, I 
> suggest proposing a design document for review.
>   For ZK client code, I would suggest asking the zookeeper mailing list.
> 
> thanks
> -Jojy
> 
>> On Nov 9

Re: Mesos and Zookeeper TCP keepalive

2015-11-09 Thread Jojy Varghese
Hi Jeremy
 It’s great that you are making progress, but I doubt this is what you intend 
to achieve, since network failures are a valid state in distributed systems. If 
you think there is a special case you are trying to solve, I suggest proposing 
a design document for review.
  For ZK client code, I would suggest asking the zookeeper mailing list.

thanks
-Jojy

> On Nov 9, 2015, at 7:56 PM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> 
> Alright, great, I'm making some progress,
> 
> I did a simple copy/paste modification and recompiled mesos. The keepalive 
> timer is set from slave to master so this is an improvement for me. I didn't 
> test the other direction yet - 
> https://gist.github.com/jolexa/ee9e152aa7045c558e02 - I'd like to file an 
> enhancement request for this since it seems like an improvement for other 
> people as well, after some real world testing
> 
> I'm having a harder time figuring out the ZK client code. I started by 
> modifying build/3rdparty/zookeeper-3.4.5/src/c/zookeeper.c, but either a) my 
> change wasn't correct or b) I'm modifying the wrong file, since I just assumed 
> it uses the C client. Is this the correct place?
> 
> Thanks much,
> Jeremy
> 
> 
> From: Jojy Varghese <j...@mesosphere.io>
> Sent: Monday, November 9, 2015 2:09 PM
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>  
> Hi Jeremy
>  The “network” code is at “3rdparty/libprocess/include/process/network.hpp”, 
> “3rdparty/libprocess/src/poll_socket.hpp/cpp”. 
> 
> thanks
> jojy
> 
> 
>> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> 
>> Hi all,
>> 
>> Jojy, That is correct, but more specifically a keepalive timer from slave to 
>> master and slave to zookeeper. Can you send a link to the portion of the 
>> code that builds the socket/connection? Is there any reason to not set the 
>> SO_KEEPALIVE option in your opinion?
>> 
>> haosdent, I'm not looking for keepalive between ZK quorum members, like the 
>> ZOOKEEPER JIRA is referencing.
>> 
>> Thanks,
>> Jeremy
>> 
>> 
>> From: Jojy Varghese <j...@mesosphere.io>
>> Sent: Sunday, November 8, 2015 8:37 PM
>> To: user@mesos.apache.org
>> Subject: Re: Mesos and Zookeeper TCP keepalive
>>  
>> Hi Jeremy
>>   Are you trying to establish a keepalive timer between the Mesos master and 
>> the Mesos slave? If so, I don’t believe it’s possible today, as the SO_KEEPALIVE 
>> option is not set on an accepting socket. 
>> 
>> -Jojy
>> 
>>> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
>>> 
>>> I think the keepalive option should be set in ZooKeeper, not in Mesos. See this 
>>> related issue in ZooKeeper: 
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>>> 
>>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>>> Hello all,
>>> 
>>> We have been fighting some network/session disconnection issues between 
>>> datacenters, and I'm curious if there is any way to enable TCP keepalive on 
>>> the zookeeper/mesos sockets? If there was a way, then the sysctl TCP kernel 
>>> settings would be used. I believe keepalive has to be enabled by the 
>>> software which is opening the connection. (That is my understanding, anyway.)
>>> 
>>> Here is what I see via netstat --timers -tn:
>>> tcp        0      0 172.18.1.1:55842       10.10.1.1:2181        ESTABLISHED off (0.00/0/0)
>>> tcp        0      0 172.18.1.1:49702       10.10.1.1:5050        ESTABLISHED off (0.00/0/0)
>>> 
>>> 
>>> Where 172 is the mesos-slave network and 10 is the mesos-master network. 
>>> The "off" keyword means that keepalive's are not being sent.
>>> 
>>> I've trolled through JIRA, git, etc and cannot easily determine if this is 
>>> expected behavior or should be an enhancement request. Any ideas?
>>> 
>>> Thanks much!
>>> -Jeremy
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards,
>>> Haosdent Huang



Re: Mesos and Zookeeper TCP keepalive

2015-11-09 Thread Jojy Varghese
Hi Jeremy
 The “network” code is at “3rdparty/libprocess/include/process/network.hpp”, 
“3rdparty/libprocess/src/poll_socket.hpp/cpp”. 

thanks
jojy


> On Nov 9, 2015, at 6:54 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
> 
> Hi all,
> 
> Jojy, That is correct, but more specifically a keepalive timer from slave to 
> master and slave to zookeeper. Can you send a link to the portion of the code 
> that builds the socket/connection? Is there any reason to not set the 
> SO_KEEPALIVE option in your opinion?
> 
> haosdent, I'm not looking for keepalive between ZK quorum members, like the 
> ZOOKEEPER JIRA is referencing.
> 
> Thanks,
> Jeremy
> 
> 
> From: Jojy Varghese <j...@mesosphere.io>
> Sent: Sunday, November 8, 2015 8:37 PM
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>  
> Hi Jeremy
>   Are you trying to establish a keepalive timer between the Mesos master and 
> the Mesos slave? If so, I don’t believe it’s possible today, as the SO_KEEPALIVE option 
> is not set on an accepting socket. 
> 
> -Jojy
> 
>> On Nov 8, 2015, at 8:43 AM, haosdent <haosd...@gmail.com> wrote:
>> 
>> I think the keepalive option should be set in ZooKeeper, not in Mesos. See this 
>> related issue in ZooKeeper: 
>> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
>> 
>> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa <jol...@spscommerce.com> wrote:
>> Hello all,
>> 
>> We have been fighting some network/session disconnection issues between 
>> datacenters, and I'm curious if there is any way to enable TCP keepalive on 
>> the zookeeper/mesos sockets? If there was a way, then the sysctl TCP kernel 
>> settings would be used. I believe keepalive has to be enabled by the 
>> software which is opening the connection. (That is my understanding, anyway.)
>> 
>> Here is what I see via netstat --timers -tn:
>> tcp        0      0 172.18.1.1:55842       10.10.1.1:2181        ESTABLISHED off (0.00/0/0)
>> tcp        0      0 172.18.1.1:49702       10.10.1.1:5050        ESTABLISHED off (0.00/0/0)
>> 
>> 
>> Where 172 is the mesos-slave network and 10 is the mesos-master network. The 
>> "off" keyword means that keepalive's are not being sent.
>> 
>> I've trolled through JIRA, git, etc and cannot easily determine if this is 
>> expected behavior or should be an enhancement request. Any ideas?
>> 
>> Thanks much!
>> -Jeremy
>> 
>> 
>> 
>> 
>> -- 
>> Best Regards,
>> Haosdent Huang



Re: Mesos and Zookeeper TCP keepalive

2015-11-08 Thread Jojy Varghese
Hi Jeremy
  Are you trying to establish a keepalive timer between the Mesos master and the Mesos 
slave? If so, I don’t believe it’s possible today, as the SO_KEEPALIVE option is not 
set on an accepting socket. 

-Jojy
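
For anyone curious what opting in looks like at the socket level, here is an illustrative Python sketch; it is not the Mesos/libprocess C++ code, it assumes Linux, and the per-socket TCP_KEEP* overrides are optional (without them the kernel's net.ipv4.tcp_keepalive_* sysctl defaults apply):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Opt in to TCP keepalive; without this, netstat --timers reports "off".
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Optional Linux-specific per-socket overrides of the kernel defaults.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 600)  # idle seconds before first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # drop after 5 failed probes

sock.connect(("10.10.1.1", 5050))  # master address taken from the netstat output quoted below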

> On Nov 8, 2015, at 8:43 AM, haosdent  wrote:
> 
> I think the keepalive option should be set in ZooKeeper, not in Mesos. See this 
> related issue in ZooKeeper: 
> https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14724085=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14724085
> 
> 
> On Sun, Nov 8, 2015 at 4:47 AM, Jeremy Olexa wrote:
> Hello all,
> 
> 
> We have been fighting some network/session disconnection issues between 
> datacenters, and I'm curious if there is any way to enable TCP keepalive on the 
> zookeeper/mesos sockets? If there was a way, then the sysctl TCP kernel 
> settings would be used. I believe keepalive has to be enabled by the software 
> which is opening the connection. (That is my understanding, anyway.)
> 
> 
> Here is what I see via netstat --timers -tn:
> 
> tcp        0      0 172.18.1.1:55842       10.10.1.1:2181        ESTABLISHED off (0.00/0/0)
> tcp        0      0 172.18.1.1:49702       10.10.1.1:5050        ESTABLISHED off (0.00/0/0)
> 
> 
> Where 172 is the mesos-slave network and 10 is the mesos-master network. The 
> "off" keyword means that keepalive's are not being sent.
> 
> 
> I've trolled through JIRA, git, etc and cannot easily determine if this is 
> expected behavior or should be an enhancement request. Any ideas?
> 
> 
> Thanks much!
> 
> -Jeremy
> 
> 
> 
> 
> 
> -- 
> Best Regards,
> Haosdent Huang



Re: Welcome Kapil as Mesos committer and PMC member!

2015-11-05 Thread Jojy Varghese
Congratulations Kapil.

> On Nov 5, 2015, at 7:29 AM, Marco Massenzio  wrote:
> 
> Awesome stuff!
> Congratulations, Kapil - totally deserved!
> 
> On Thursday, November 5, 2015, Vinod Kone wrote:
> welcome kapil!
> 
> On Thu, Nov 5, 2015 at 6:49 AM,  wrote:
> Congrats Dr. Arya!
> 
> > On Nov 5, 2015, at 02:02, Till Toenshoff wrote:
> >
> > I'm happy to announce that Kapil Arya has been voted a Mesos committer and 
> > PMC member!
> >
> > Welcome Kapil, and thanks for all of your great contributions to the 
> > project so far!
> >
> > Looking forward to lots more of your contributions!
> >
> > Thanks
> > Till
> 
> 
> 
> -- 
> --
> Marco Massenzio
> Distributed Systems Engineer
> http://codetrips.com 



Re: Can't start docker container when SSL_ENABLED is on.

2015-10-30 Thread Jojy Varghese
Hi Xiaodong
  This might be because the executor inherits the SSL environment variables of the 
slave and thus expects an SSL key password to launch. Could you please add the 
part of the slave logs that says “Flags at startup” so that we can have more 
information?

thanks
Jojy
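
To make the hypothesis concrete, here is an illustrative Python sketch of the general idea — launching a child process with the inherited SSL_* variables scrubbed. It is only a sketch, not the actual Mesos code, and the command line shown is hypothetical:

import os
import subprocess

def launch_without_ssl_env(argv):
    # Copy the parent environment, dropping anything that starts with SSL_
    # (e.g. SSL_ENABLED, SSL_KEY_FILE, SSL_CERT_FILE set at slave startup).
    child_env = {k: v for k, v in os.environ.items() if not k.startswith("SSL_")}
    return subprocess.Popen(argv, env=child_env)

# Hypothetical usage:
# launch_without_ssl_env(["mesos-docker-executor", "--docker=/usr/bin/docker"])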


> On Oct 29, 2015, at 8:55 PM, Xiaodong Zhang  wrote:
> 
> Thanks a lot! @haosdent
> 
> From: haosdent
> Reply-To: "user@mesos.apache.org"
> Date: Friday, October 30, 2015, 11:45 AM
> To: user
> Subject: Re: Can't start docker container when SSL_ENABLED is on.
> 
> Hi @Xiaodong, I'm interested in your problem, but these days I don't have 
> enough time to try to reproduce it. I think I can try to dig into your 
> problem this Sunday and give you feedback.
> 
> On Fri, Oct 30, 2015 at 11:30 AM, Xiaodong Zhang wrote:
>> Anybody know about this?
>> 
>> From: Xiaodong Zhang
>> Reply-To: "user@mesos.apache.org"
>> Date: Thursday, October 29, 2015, 7:38 PM
>> 
>> To: "user@mesos.apache.org"
>> Subject: Re: Can't start docker container when SSL_ENABLED is on.
>> 
>> I think it is easy to reproduce this error.
>> 
>> Start master with env: 
>> 
>> SSL_SUPPORT_DOWNGRADE
>> SSL_ENABLED
>> SSL_KEY_FILE
>> SSL_CERT_FILE
>> 
>> Start slave with env:
>> 
>> SSL_ENABLED
>> SSL_KEY_FILE
>> SSL_CERT_FILE
>> LIBPROCESS_ADVERTISE_IP
>> 
>> 
>> Then run a docker task via marathon.
>> 
>> From: Xiaodong Zhang
>> Date: Thursday, October 29, 2015, 3:09 PM
>> To: "user@mesos.apache.org"
>> Subject: Re: Can't start docker container when SSL_ENABLED is on.
>> 
>> So now, Mesos tasks work well but Docker tasks don’t. 
>> 
>> From: Xiaodong Zhang
>> Reply-To: "user@mesos.apache.org"
>> Date: Thursday, October 29, 2015, 2:08 PM
>> To: "user@mesos.apache.org"
>> Subject: Re: Can't start docker container when SSL_ENABLED is on.
>> 
>> I ran a task via Marathon:
>> 
>> {
>>   "id": "basic-0",
>>   "cmd": "while [ true ] ; do echo 'Hello Marathon' ; sleep 5 ; done",
>>   "cpus": 0.1,
>>   "mem": 10.0,
>>   "instances": 1
>> }
>> 
>> It works well.
>> 
>> [screenshot attachment omitted]
>> 
>> The Docker task can pull the image but can’t run, as I mentioned.
>> 
>> My Docker version is 1.5.0.
>> 
>> From: Tim Chen
>> Reply-To: "user@mesos.apache.org"
>> Date: Thursday, October 29, 2015, 1:48 PM
>> To: "user@mesos.apache.org"
>> Subject: Re: Can't start docker container when SSL_ENABLED is on.
>> 
>> Does running a task without a Docker container (i.e., with the Mesos containerizer) work 
>> with SSL in your environment?
>> 
>> Tim
>> 
>> On Wed, Oct 28, 2015 at 10:19 PM, Xiaodong Zhang wrote:
>>> Thanks a lot. I find the log file in slave.
>>> 
>>> One of the task:
>>> 
>>> Stdout:
>>> 
>>> --container="mesos-20151029-043755-3549436724-5050-5674-S0.e2c2580f-8082-4f17-b0cc-4e32e040d444"
>>>  --docker="/home/ubuntu/luna/bin/docker" --help="false" 
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" 
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" 
>>> --sandbox_directory="/tmp/mesos/slaves/20151029-043755-3549436724-5050-5674-S0/frameworks/20151029-043755-3549436724-5050-5674-/executors/e4a3bed5-64e6-4970-8bb1-df6404656a48.e3a20f3b-7dfb-11e5-b57b-0247b493b22f/runs/e2c2580f-8082-4f17-b0cc-4e32e040d444"
>>>  --stop_timeout="0ns"
>>> --container="mesos-20151029-043755-3549436724-5050-5674-S0.e2c2580f-8082-4f17-b0cc-4e32e040d444"
>>>  --docker="/home/ubuntu/luna/bin/docker" --help="false" 
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" 
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" 
>>> --sandbox_directory="/tmp/mesos/slaves/20151029-043755-3549436724-5050-5674-S0/frameworks/20151029-043755-3549436724-5050-5674-/executors/e4a3bed5-64e6-4970-8bb1-df6404656a48.e3a20f3b-7dfb-11e5-b57b-0247b493b22f/runs/e2c2580f-8082-4f17-b0cc-4e32e040d444"
>>>  --stop_timeout="0ns"
>>> Shutting down
>>> 
>>> Stderr:
>>> 
>>> I1029 05:14:06.529364 27862 fetcher.cpp:414] Fetcher Info: 
>>> 

Re: Can't start docker container when SSL_ENABLED is on.

2015-10-30 Thread Jojy Varghese
Thanks Xiaodong. 

Based on the hypothesis that the container process being launched with SSL_ENABLED in 
its environment is the problem, I have created a patch: 
https://reviews.apache.org/r/39818/. 
This might be a quick and dirty way to test the hypothesis. Would it be 
possible for you to test again after applying the patch?

-Jojy



> On Oct 30, 2015, at 8:29 AM, Xiaodong Zhang <xdzh...@alauda.io> wrote:
> 
> Thanks @Jojy
> 
> 
> 
> Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" 
> --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" 
> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" 
> --cgroups_limit_swap="false" --cgroups_root="mesos" 
> --container_disk_watch_interval="15secs" --containerizers="docker,mesos" 
> --credential="/etc/mesos-slave-auth" --default_role="*" 
> --disk_watch_interval="1mins" --docker="/usr/bin/docker" 
> --docker_kill_orphans="true" --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock"  --docker_stop_timeout="0ns" 
> --enforce_container_disk_quota="false" --executor_registration_timeout="1hrs" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --initialize_driver_logging="true" 
> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" 
> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" 
> --master="zk://172.31.43.77:2181,172.31.44.2:2181,172.31.36.91:2181/mesos" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" 
> --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" 
> --strict="true" --switch_user="true" --version="false" --work_dir="/tmp/mesos"
> 
> From: Jojy Varghese <j...@mesosphere.io>
> Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
> Date: Friday, October 30, 2015, 11:17 PM
> To: "user@mesos.apache.org" <user@mesos.apache.org>
> Subject: Re: Can't start docker container when SSL_ENABLED is on.
> 
> Hi Xiaodong
>   This might be because the executor inherits the SSL environment variables 
> of the slave and thus expects an SSL key password to launch. Could you please add 
> the part of the slave logs that says “Flags at startup” so that we can have 
> more information?
> 
> thanks
> Jojy
> 
> 
>> On Oct 29, 2015, at 8:55 PM, Xiaodong Zhang <xdzh...@alauda.io> wrote:
>> 
>> Thanks a lot! @haosdent
>> 
>> From: haosdent <haosd...@gmail.com>
>> Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
>> Date: Friday, October 30, 2015, 11:45 AM
>> To: user <user@mesos.apache.org>
>> Subject: Re: Can't start docker container when SSL_ENABLED is on.
>> 
>> Hi @Xiaodong, I'm interested in your problem, but these days I don't have 
>> enough time to try to reproduce it. I think I can try to dig into your 
>> problem this Sunday and give you feedback.
>> 
>> On Fri, Oct 30, 2015 at 11:30 AM, Xiaodong Zhang <xdzh...@alauda.io> wrote:
>>> Anybody know about this?
>>> 
>>> From: Xiaodong Zhang <xdzh...@alauda.io>
>>> Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
>>> Date: Thursday, October 29, 2015, 7:38 PM
>>> 
>>> To: "user@mesos.apache.org" 
>>> <user@mesos.apache