[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026115#comment-17026115 ]

Dalton Matos Coelho Barreto commented on MESOS-10068:
-----------------------------------------------------

Thanks for your availability, [~greggomann], even with very limited time. I appreciate it.

I will try to organize myself so I can dedicate some time to the project, and then when (if) I have a better understanding of this part of the code I will reach out to you and we can talk about this issue in more detail.

Thanks.

> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-10068
>                 URL: https://issues.apache.org/jira/browse/MESOS-10068
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.3, 1.8.2, 1.9.1
>            Reporter: Dalton Matos Coelho Barreto
>            Priority: Major
>         Attachments: master-full-logs.log
>
>
> Hello,
>
> Looking at the documentation of the master {{/api/v1}} endpoint, the {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are supported for this endpoint, but when a new agent joins the cluster an {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped, the {{AGENT_REMOVED}} event is not received by clients subscribed to the master API.
>
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}} and {{1.9.1}}, all using the Docker image {{mesos/mesos-centos}}.
> The only way I saw an {{AGENT_REMOVED}} event was when a new agent joined the cluster but the master couldn't communicate with this agent. In this specific test there was a firewall blocking port {{5051}} on the slave, that is, nobody was able to talk to the slave on port {{5051}}.
>
> h2. Here are the steps to reproduce the problem
> * Start a new Mesos master;
> * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
> **
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1{noformat}
> * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
> * Stop this slave;
> * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the field {{active=false}};
> * Wait for the Mesos master to stop listing this slave, that is, until {{/slaves?slave_id=AGENT_ID}} returns an empty response.
>
> Even after the empty response, the event never reaches the subscriber.
>
> The Mesos master logs show this:
> {noformat}
> I1213 15:03:10.338935 13 master.cpp:1297] Agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964) disconnected
> I1213 15:03:10.339089 13 master.cpp:3399] Disconnecting agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
> I1213 15:03:10.339207 13 master.cpp:3418] Deactivating agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.726670 15 process.cpp:1917] Failed to send 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to connect to 172.18.0.51:5051: No route to host{noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 {noformat}
>
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>
> I will also attach the full master logs.
>
> Do you think this could be a bug?

--
This message was sent by Atlassian Jira (v8.3.4#803005)
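The subscription step in the reproduction above can also be consumed programmatically. The sketch below is an illustration, not code from the ticket: it assumes the master streams events in RecordIO framing (each record preceded by its length in bytes and a newline), and the function names `iter_recordio` and `agent_events` are made up for this example. It only looks at the event `type` field, so it makes no assumption about the rest of the event body.

```python
import json
from typing import Iterable, Iterator


def iter_recordio(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield complete records from a RecordIO-framed byte stream.

    Assumed framing: b"<length>\n<payload>", where <length> is the payload
    size in bytes written as ASCII decimal digits.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while True:
            newline = buffer.find(b"\n")
            if newline == -1:
                break  # the length prefix has not fully arrived yet
            length = int(buffer[:newline])
            start = newline + 1
            end = start + length
            if len(buffer) < end:
                break  # the payload has not fully arrived yet
            yield buffer[start:end]
            buffer = buffer[end:]


def agent_events(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Decode each record as JSON and keep only agent lifecycle events."""
    for record in iter_recordio(chunks):
        event = json.loads(record)
        if event.get("type") in ("AGENT_ADDED", "AGENT_REMOVED"):
            yield event
```

Here `chunks` stands for whatever byte chunks an HTTP client yields while reading the response to the same `SUBSCRIBE` request the `curl` command sends. With the behavior described in this ticket, such a loop would be expected to print `AGENT_ADDED` but never `AGENT_REMOVED` after the agent is stopped.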
[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025386#comment-17025386 ]

Greg Mann commented on MESOS-10068:
-----------------------------------

[~daltonmatos] regarding this ticket, yea I think it makes sense to close this one and mention it in MESOS-10089.

Time is tight over here, but I'd be happy to mentor you a bit in the codebase :) Would you like to start by addressing MESOS-10089? If so, we could do an intro call to get started. Feel free to find me on Mesos slack if you're on there.
[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023162#comment-17023162 ]

Dalton Matos Coelho Barreto commented on MESOS-10068:
-----------------------------------------------------

Hello [~greggomann],

Thanks for taking the time to answer this ticket. About the time needed to fix this bug, I understand. In fact, I would like to ask if you (and [~bmahler] or any others) are willing to mentor a new developer into the world of the Mesos project codebase. I studied the code some time ago (because of the ticket MESOS-8517) but didn't manage to contribute any code at that time.

About the new ticket you created to fix what I reported here, do you think it's better to close this ticket and mention it in the other one (MESOS-10089)?

I'm already watching MESOS-9556, so if I have any new suggestion or the ticket has any new information I will post there.

Thanks.
[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022346#comment-17022346 ]

Greg Mann commented on MESOS-10068:
-----------------------------------

Yea we should definitely be sending AGENT_REMOVED when agents are marked gone, sounds like a bug to me. I created a ticket to track this: MESOS-10089

Regarding the unreachable agents, we may want to have an AGENT_UNREACHABLE event to indicate this.

[~daltonmatos], we have a ticket here to track the design of the full agent state diagram: MESOS-9556

That would be a great place to continue discussion, feel free to ping us there. Unfortunately, I'm not sure when we might find time to work on that, but it's definitely something we've been wanting to do for a while now.
[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021050#comment-17021050 ]

Dalton Matos Coelho Barreto commented on MESOS-10068:
-----------------------------------------------------

Hello [~bmahler],

First of all, thanks for taking some time to respond here.

About the inconsistency in the API events: does the Mesos project have any open discussion about what the events related to the agent lifecycle should be? Is there anything I could do to help discuss this?

What I will probably do to receive this AGENT_REMOVED is to write a dummy framework which connects to Mesos and stays suppressed forever. This way I can receive this event and don't need to keep declining offers sent by the master.

Let's see what [~greggomann] has to say about this.

Thanks,
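At the payload level, the dummy-framework idea above might look roughly like the following. This is a sketch only, assuming the v1 scheduler HTTP API at {{/api/v1/scheduler}} with its `SUBSCRIBE` and `SUPPRESS` calls; the helper names and the example `user`/`name` values are hypothetical, and a real cluster may require further `framework_info` fields (roles, capabilities) that are omitted here.

```python
from typing import Dict


def subscribe_call(user: str, name: str) -> Dict:
    """Build the body of a scheduler SUBSCRIBE call (minimal framework_info)."""
    return {
        "type": "SUBSCRIBE",
        "subscribe": {
            # Hypothetical minimal description of the dummy framework.
            "framework_info": {"user": user, "name": name},
        },
    }


def suppress_call(framework_id: str) -> Dict:
    """Build the body of a SUPPRESS call, asking the master to stop sending offers."""
    return {
        "framework_id": {"value": framework_id},
        "type": "SUPPRESS",
    }
```

The framework ID is the one returned in the `SUBSCRIBED` event, and follow-up calls such as `SUPPRESS` need to carry the `Mesos-Stream-Id` header from the `SUBSCRIBE` response. Note that a framework subscribed this way learns about a lost agent through scheduler events, not through the operator API's `AGENT_REMOVED`.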
[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020545#comment-17020545 ]

Benjamin Mahler commented on MESOS-10068:
-----------------------------------------

The first thing to comment on is that we don't yet have a formalized agent lifecycle in the API: we have AgentAdded / AgentRemoved, but internally there is also the notion of disconnecting, becoming unreachable, and getting transitioned to gone. So the API and internals are at a bit of a mismatch here and, more broadly than this particular ticket, we would need to make them consistent to have events that make sense.

[~daltonmatos] It looks like the reason you're seeing no AGENT_REMOVED is that the agent became unreachable, and we don't send it in that case. The first case goes through a different path where we never were able to communicate with the agent, but we don't know that and the agent retries its registration; upon seeing this we remove the previous version of that agent and try to register the new one. You may see this repeating itself over and over.

[~greggomann] looks like we don't send AGENT_REMOVED when an agent is marked as gone? Seems like a bug due to {{__removeSlave}} being used for both marking unreachable and gone?
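The mismatch described above can be summarized as a small lookup table. The sketch below only restates what this thread reports; it is not Mesos code, the transition names are hypothetical labels, and the "marked gone" entry reflects a suspicion raised in the comment, not a confirmed behavior.

```python
from enum import Enum
from typing import List


class AgentTransition(Enum):
    # Hypothetical labels for the internal transitions mentioned in the thread.
    REGISTERED = "registered"
    REPLACED_ON_REREGISTRATION = "replaced_on_reregistration"
    MARKED_UNREACHABLE = "marked_unreachable"
    MARKED_GONE = "marked_gone"


# Operator API event reported for each transition (None means no event seen).
OBSERVED_EVENT = {
    AgentTransition.REGISTERED: "AGENT_ADDED",
    # Firewall case: the agent retries registration, the old entry is removed.
    AgentTransition.REPLACED_ON_REREGISTRATION: "AGENT_REMOVED",
    # Stopped agent: it becomes unreachable and no event is sent.
    AgentTransition.MARKED_UNREACHABLE: None,
    # Suspected in the thread to be missing as well.
    AgentTransition.MARKED_GONE: None,
}


def silent_transitions() -> List[AgentTransition]:
    """Return the transitions a /api/v1 subscriber would not be told about."""
    return [t for t, event in OBSERVED_EVENT.items() if event is None]
```

Seen this way, a subscriber relying only on `AGENT_ADDED` and `AGENT_REMOVED` has no signal for two of the four transitions, which is the inconsistency the comment points out.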
[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995763#comment-16995763 ]

Dalton Matos Coelho Barreto commented on MESOS-10068:
-----------------------------------------------------

I also attached to the {{/api/v1/scheduler}} endpoint, and as soon as the agent is removed from the master's internal state we see this in the master logs:
{noformat}
I1213 17:10:05.783211 8 master.cpp:2087] Notifying framework f8da63ce-ad54-4a9b-b08f-0514c37abb6a-0002 (Example HTTP Framework) of lost agent f8da63ce-ad54-4a9b-b08f-0514c37abb6a-S0 (6355 547f0b3c)
{noformat}
And a new event is delivered to the framework:
{noformat}
{"type":"FAILURE","failure":{"agent_id":{"value":"f8da63ce-ad54-4a9b-b08f-0514c37abb6a-S0"}}}
{noformat}
Which makes me feel that the {{AGENT_REMOVED}} event is indeed not being delivered to {{/api/v1}} subscribers.
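A framework could pick the lost agent's ID out of that `FAILURE` event along the following lines. This is a minimal sketch: the function name `removed_agent_id` is hypothetical, and it assumes that a `FAILURE` event carrying an `executor_id` reports an executor termination instead of an agent removal, so such events are skipped.

```python
import json
from typing import Optional


def removed_agent_id(raw_event: str) -> Optional[str]:
    """Return the agent ID from a scheduler FAILURE event, or None otherwise."""
    event = json.loads(raw_event)
    if event.get("type") != "FAILURE":
        return None
    failure = event.get("failure", {})
    # Assumption: an executor_id means an executor failed, not that the agent was lost.
    if "executor_id" in failure:
        return None
    return failure.get("agent_id", {}).get("value")


# The event quoted in the comment above.
event = '{"type":"FAILURE","failure":{"agent_id":{"value":"f8da63ce-ad54-4a9b-b08f-0514c37abb6a-S0"}}}'
print(removed_agent_id(event))
```

For the event quoted in the comment this yields `f8da63ce-ad54-4a9b-b08f-0514c37abb6a-S0`, the same agent the master log reports as lost.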
[jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995758#comment-16995758 ]

Dalton Matos Coelho Barreto commented on MESOS-10068:
-----------------------------------------------------

I also ran this same test using Marathon configured with {{failover_timeout}}=30, and as soon as Mesos marks Marathon as a Completed Framework, the {{FRAMEWORK_REMOVED}} event gets delivered.