Hi, one more question: during the 14 minutes after the agent shutdown, the
Mesos GUI still shows that Cassandra consumes 99% of all resources on the agent
that went down. Even if it was originally a Cassandra bug or misconfiguration
that led to this situation, isn’t it still a bug in Mesos that it shows that
kind of consumption on an agent that doesn’t exist any more?
Jaana
From: Jaana Miettinen [mailto:jaa...@kolumbus.fi]
Sent: 5 November 2016 16:43
To: 'user@mesos.apache.org'
Subject: RE: framework failover
Hi, thanks for your quick reply. Please see my answers below.
* Are you running your frameworks via Marathon?
yes
* How are you terminating the Mesos Agent?
So far I have just been issuing the Linux ‘halt’ command from the agent’s
command line or terminating the agent instance from the cloud management
console. So I actually want to simulate the case where the whole host on which
my framework is running goes down.
* Implies that the master does not remove the agent immediately, meaning you
killed the agent, but did not kill the tasks.
During this time, the master is waiting for the agent to come back online. If
the agent doesn't come back during some (configurable) timeout, it will notify
the frameworks about the loss of an agent.
It sounds like you are talking about the timer
‘ALLOCATION_HOLD_OFF_RECOVERY_TIMEOUT’, which is hardcoded to 10 minutes in
Mesos 0.28.0.
But now we are reaching the most interesting question in our discussion. You
wrote: "If the agent doesn't come back during some (configurable) timeout, it
will notify the frameworks about the loss of an agent."
How could this happen if the framework was running only on the agent that went
down? Or do you mean that the frameworks running on other agents would get the
information about the loss of the agent?
* Also, it's a little odd that your frameworks will disconnect upon the agent
process dying. You may want to investigate your framework dependencies. A
framework should definitely not depend on the agent process (frameworks depend
on the master though).
To me it looks very natural that the frameworks disconnect when the agent host
shuts down. And if Cassandra weren’t there consuming all the resources, the
other frameworks would re-register and continue running their tasks on the
other agents. Wouldn’t this be the correct procedure?
Hopefully I answered your questions clearly enough. Anyway, please let me know
which configurable timer you were talking about!
And thanks a lot,
Jaana
BTW, if ALLOCATION_HOLD_OFF_RECOVERY_TIMEOUT were the correct guess, then I
should see "Triggered allocator recovery: waiting for " in my log file
mesos.master.INFO. But it’s not there.
  // Setup recovery timer.
  delay(ALLOCATION_HOLD_OFF_RECOVERY_TIMEOUT, self(), ::resume);

  // NOTE: `quotaRoleSorter` is updated implicitly in `setQuota()`.
  foreachpair (const string& role, const Quota& quota, quotas) {
    setQuota(role, quota);
  }

  LOG(INFO) << "Triggered allocator recovery: waiting for "
            << expectedAgentCount.get() << " slaves to reconnect or "
            << ALLOCATION_HOLD_OFF_RECOVERY_TIMEOUT << " to pass";
}
From: Joseph Wu [mailto:jos...@mesosphere.io]
Sent: 4 November 2016 20:03
To: user
Subject: Re: framework failover
A couple questions/notes:
What do you mean by:
the system will deploy the framework on a new node within less than three
minutes.
Are you running your frameworks via Marathon?
How are you terminating the Mesos Agent? If you send a `kill -SIGUSR1`, the
agent will immediately kill all of its tasks and un-register with the master.
If you kill the agent with some other signal, the agent will simply stop, but
tasks will continue to run.
According to the mesos GUI page cassandra holds 99-100 % of the resources on
the terminated slave during that 14 minutes.
^ Implies that the master does not remove the agent immediately, meaning you
killed the agent, but did not kill the tasks.
During this time, the master is waiting for the agent to come back online. If
the agent doesn't come back during some (configurable) timeout, it will notify
the frameworks about the loss of an agent.
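
For illustration only, here is a minimal sketch of how such a timeout adds up, assuming the timeout meant here is the master's agent health check: the master pings each agent and gives up after a number of consecutive missed pings. In Mesos 0.28 the corresponding master flags are `--slave_ping_timeout` (default 15secs) and `--max_slave_ping_timeouts` (default 5). The helper `agentRemovalDelay` is a hypothetical name, not a Mesos function.

```cpp
#include <chrono>

// Hypothetical helper, not part of Mesos: approximate time the master waits
// before giving up on an unreachable agent, assuming it does so after
// `maxPingTimeouts` consecutive pings that each time out after `pingTimeout`.
constexpr std::chrono::seconds agentRemovalDelay(
    std::chrono::seconds pingTimeout, int maxPingTimeouts)
{
  return pingTimeout * maxPingTimeouts;
}

// Assumed defaults: --slave_ping_timeout=15secs, --max_slave_ping_timeouts=5.
constexpr std::chrono::seconds kDefaultRemovalDelay =
    agentRemovalDelay(std::chrono::seconds(15), 5);
```

Under these assumed defaults the delay is on the order of a minute, so a 14-minute gap would point to different settings or to a different mechanism.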
Also, it's a little odd that your frameworks will disconnect upon the agent
process dying. You may want to investigate your framework dependencies. A
framework should definitely not depend on the agent process (frameworks depend
on the master though).
On Fri, Nov 4, 2016 at 10:32 AM, Jaana Miettinen wrote:
Hi, would you help me find out how framework failover happens in Mesos
0.28.0?
In my Mesos environment I have the following frameworks:
etcd-mesos
cassandra-mesos 0.2.0-1
eremetic