[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2020-01-29 Thread Dalton Matos Coelho Barreto (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026115#comment-17026115
 ] 

Dalton Matos Coelho Barreto commented on MESOS-10068:
-

Thanks for your availability [~greggomann], even with very limited time. I 
appreciate it.

 

I will try to organize myself so I can dedicate some time to the project. 
Then, when (if) I have a better understanding of this part of the code, I will 
reach out to you and we can talk in more detail about this issue.

Thanks.

> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal 
> state
> ---
>
> Key: MESOS-10068
> URL: https://issues.apache.org/jira/browse/MESOS-10068
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.3, 1.8.2, 1.9.1
>Reporter: Dalton Matos Coelho Barreto
>Priority: Major
> Attachments: master-full-logs.log
>
>
> Hello,
>  
> Looking at the documentation of the master {{/api/v1}} endpoint, the 
> {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are 
> supported for this endpoint, but when a new agent joins the cluster an 
> {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped, the {{AGENT_REMOVED}} event is 
> not received by clients subscribed to the master API.
>  
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}} and {{1.9.1}}, all 
> using the docker image {{mesos/mesos-centos}}.
> The only time I saw an {{AGENT_REMOVED}} event was when a new agent joined the 
> cluster but the master couldn't communicate with it. In that specific 
> test a firewall was blocking port {{5051}} on the slave, so nobody was 
> able to talk to the slave on port {{5051}}.
>  
> h2. Here are the steps to reproduce the problem
>  * Start a new mesos master;
>  * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
>  ** 
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1{noformat}
>  * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
>  * Stop this slave;
>  * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the 
> field {{active=false}};
>  * Wait for the mesos master to stop listing this slave, that is, until 
> {{/slaves?slave_id=AGENT_ID}} returns an empty response.
> Even after the empty response, the event never reaches the subscriber.
>  
> The mesos master logs show this:
> {noformat}
> I1213 15:03:10.338935 13 master.cpp:1297] Agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964) disconnected
> I1213 15:03:10.339089 13 master.cpp:3399] Disconnecting agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
> I1213 15:03:10.339207 13 master.cpp:3418] Deactivating agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.726670 15 process.cpp:1917] Failed to send 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to connect to 172.18.0.51:5051: No route to host{noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1{noformat}
>  
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>  
> I will attach the full master logs also.
>  
> Do you think this could be a bug?
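
For anyone scripting the last two reproduction steps, here is a minimal Python sketch that classifies an agent from an already-fetched {{/slaves}} response. It assumes the response is a JSON object with a {{slaves}} list whose entries carry {{id}} and {{active}} fields, and it treats "empty response" as the agent no longer being listed; both are assumptions about the payload, and {{agent_state}} is a hypothetical helper, not part of Mesos.

```python
import json


def agent_state(slaves_response: dict, agent_id: str) -> str:
    """Classify an agent as 'active', 'inactive' or 'absent'.

    Assumes the payload looks like {"slaves": [{"id": ..., "active": ...}]}.
    """
    for slave in slaves_response.get("slaves", []):
        if slave.get("id") == agent_id:
            return "active" if slave.get("active") else "inactive"
    # The master no longer lists the agent at all.
    return "absent"


if __name__ == "__main__":
    # Hand-written sample payload, not captured from a real master.
    sample = json.loads(
        '{"slaves": [{"id": "2cd23025-c09d-401b-8f26-9265eda8f800-S1",'
        ' "active": false}]}'
    )
    print(agent_state(sample, "2cd23025-c09d-401b-8f26-9265eda8f800-S1"))
```

Polling until the helper returns 'absent' marks the point after which the report expects {{AGENT_REMOVED}} on the subscriber stream.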



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2020-01-28 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025386#comment-17025386
 ] 

Greg Mann commented on MESOS-10068:
---

[~daltonmatos] regarding this ticket, yea I think it makes sense to close this 
one and mention it in MESOS-10089.

Time is tight over here, but I'd be happy to mentor you a bit in the codebase 
:) Would you like to start by addressing MESOS-10089? If so, we could do an 
intro call to get started. Feel free to find me on Mesos slack if you're on 
there.



[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2020-01-24 Thread Dalton Matos Coelho Barreto (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023162#comment-17023162
 ] 

Dalton Matos Coelho Barreto commented on MESOS-10068:
-

Hello [~greggomann],

Thanks for taking the time to answer this ticket.

About the time to dedicate to fixing this bug, I understand. In fact, I would like 
to ask if you (and [~bmahler] or any others) are willing to mentor a new 
developer into the world of the mesos project codebase. I studied the code some 
time ago (because of the ticket MESOS-8517) but didn't manage to contribute 
any code at that time.

 

About the new ticket you created to fix what I reported here, do you think it's 
better to close this ticket and mention it on the other (MESOS-10089)?

 

I'm already watching MESOS-9556 so if I have any new suggestion or the ticket 
has any new information I will post there.

 

Thanks.



[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2020-01-23 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022346#comment-17022346
 ] 

Greg Mann commented on MESOS-10068:
---

Yea we should definitely be sending AGENT_REMOVED when agents are marked gone, 
sounds like a bug to me. I created a ticket to track this: MESOS-10089

Regarding the unreachable agents, we may want to have an AGENT_UNREACHABLE 
event to indicate this.

[~daltonmatos], we have a ticket here to track the design of the full agent 
state diagram: MESOS-9556
That would be a great place to continue discussion, feel free to ping us there. 
Unfortunately, I'm not sure when we might find time to work on that, but it's 
definitely something we've been wanting to do for a while now.



[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2020-01-22 Thread Dalton Matos Coelho Barreto (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021050#comment-17021050
 ] 

Dalton Matos Coelho Barreto commented on MESOS-10068:
-

Hello [~bmahler],

First of all, thanks for taking some time to respond here.

About the inconsistency in the API events: does the Mesos project have any 
open discussion about what the events related to the agent 
lifecycle should be? Is there anything I could do to help discuss this?

What I will probably do to receive this AGENT_REMOVED is write a dummy 
framework which connects to mesos and stays suppressed forever. This way I can 
receive this event without having to keep declining offers sent by the master.
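
A rough Python sketch of the two scheduler calls such a dummy framework would send is below. It only builds the JSON bodies; it assumes they are POSTed to {{/api/v1/scheduler}} with {{Content-Type: application/json}}, and that the SUPPRESS call is sent after SUBSCRIBED arrives, using the framework id from that event and the {{Mesos-Stream-Id}} header from the SUBSCRIBE response. The function names and the "agent-watcher" framework name are made up for illustration.

```python
import json


def subscribe_call(user: str, name: str) -> dict:
    """Body of the initial SUBSCRIBE call for a brand-new framework."""
    return {
        "type": "SUBSCRIBE",
        "subscribe": {"framework_info": {"user": user, "name": name}},
    }


def suppress_call(framework_id: str) -> dict:
    """Body of a SUPPRESS call, asking the master to stop sending offers."""
    return {"type": "SUPPRESS", "framework_id": {"value": framework_id}}


if __name__ == "__main__":
    # Hypothetical values; the framework id normally comes from SUBSCRIBED.
    print(json.dumps(subscribe_call("root", "agent-watcher")))
    print(json.dumps(suppress_call("example-framework-id")))
```

Note that, as far as I can tell from the scheduler stream, agent loss shows up there as a FAILURE event rather than as AGENT_REMOVED.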

Let's see what [~greggomann] has to say about this.

Thanks,



[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2020-01-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020545#comment-17020545
 ] 

Benjamin Mahler commented on MESOS-10068:
-

The first thing to note is that we don't yet have a formalized agent 
lifecycle in the API. We have AgentAdded / AgentRemoved, but internally there is 
also the notion of disconnecting, becoming unreachable, and getting transitioned 
to gone. So the API and the internals are a bit mismatched here and, more 
broadly than this particular ticket, we would need to make them consistent to 
have events that make sense.

[~daltonmatos] It looks like the reason you're seeing no AGENT_REMOVED is that 
the agent became unreachable, and we don't send it in that case. The first 
case goes through a different path: we were never able to communicate with 
the agent, but we don't know that, and the agent retries its registration. Upon 
seeing this, we remove the previous version of that agent and try to register 
the new one. You may see this repeating itself over and over.

[~greggomann] looks like we don't send AGENT_REMOVED when an agent is marked as 
gone? Seems like a bug due to {{__removeSlave}} being used for both marking 
unreachable and gone?





[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2019-12-13 Thread Dalton Matos Coelho Barreto (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995763#comment-16995763
 ] 

Dalton Matos Coelho Barreto commented on MESOS-10068:
-

I also attached to the {{/api/v1/scheduler}} endpoint, and as soon as the agent 
is removed from the master's internal state we see this in the master logs:
{noformat}
I1213 17:10:05.783211 8 master.cpp:2087] Notifying framework f8da63ce-ad54-4a9b-b08f-0514c37abb6a-0002 (Example HTTP Framework) of lost agent f8da63ce-ad54-4a9b-b08f-0514c37abb6a-S0 (6355547f0b3c)
{noformat}
And a new event is delivered to the framework:
{noformat}
{"type":"FAILURE","failure":{"agent_id":{"value":"f8da63ce-ad54-4a9b-b08f-0514c37abb6a-S0"}}}
{noformat}
Which makes me think that the {{AGENT_REMOVED}} event is indeed not being 
delivered to {{/api/v1}} subscribers.
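
For reference, here is a small Python sketch of how the streamed events could be decoded and checked. It assumes the stream uses RecordIO framing (the record length in ASCII, a newline, then that many bytes of JSON); {{decode_recordio}} and {{lost_agent_id}} are hypothetical helper names. FAILURE events that carry an {{executor_id}} are skipped, on the assumption that those report an executor termination rather than a lost agent.

```python
import json
from typing import Iterable, Iterator, Optional


def decode_recordio(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield JSON events from a RecordIO-framed byte stream.

    Chunks may split a record anywhere, so bytes are buffered until a
    full "<length>\\n<body>" record is available.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while True:
            newline = buffer.find(b"\n")
            if newline == -1:
                break  # length prefix not complete yet
            length = int(buffer[:newline])
            start = newline + 1
            end = start + length
            if len(buffer) < end:
                break  # record body not complete yet
            yield json.loads(buffer[start:end])
            buffer = buffer[end:]


def lost_agent_id(event: dict) -> Optional[str]:
    """Return the agent id from an agent-level FAILURE event, else None."""
    if event.get("type") != "FAILURE":
        return None
    failure = event.get("failure", {})
    if "executor_id" in failure:
        return None  # assumed to be an executor failure, not agent loss
    return failure.get("agent_id", {}).get("value")
```

The same decoder should work for the {{/api/v1}} operator stream, where one would look for {{"type": "AGENT_REMOVED"}} instead.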



[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state

2019-12-13 Thread Dalton Matos Coelho Barreto (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995758#comment-16995758
 ] 

Dalton Matos Coelho Barreto commented on MESOS-10068:
-

I also ran this same test using Marathon configured with 
{{failover_timeout}}=30, and as soon as Mesos marks Marathon as a Completed 
Framework, the {{FRAMEWORK_REMOVED}} event gets delivered.
