Thank you. Running the slave under a tool that automatically restarts it relieves the problem a lot.
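For anyone following along, the kind of "tool that monitors the pid" being discussed can be as simple as a small supervision loop (in practice people use monit, runit, or systemd). Here is a minimal Python sketch; the mesos-slave command line in the comment is just a placeholder, not a tested invocation:

```python
import subprocess
import time

def supervise(cmd, backoff=0.1, max_runs=None):
    """Run `cmd`, restarting it whenever it exits.

    A crude stand-in for monit/runit/systemd-style supervision of a
    mesos-slave process. `max_runs` bounds the number of launches
    (None means supervise forever). Returns the launch count.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        proc = subprocess.Popen(cmd)
        proc.wait()          # block until the process exits (crash, kill, ...)
        runs += 1
        time.sleep(backoff)  # brief pause so a crash loop doesn't spin the CPU
    return runs

# Placeholder usage; real deployments would use a proper init system:
# supervise(["mesos-slave", "--master=zk://host:2181/mesos"])
```

A real supervisor would also add exponential backoff and logging, but the core idea is exactly this restart-on-exit loop.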
On Fri, Jan 24, 2014 at 8:30 AM, Benjamin Mahler <[email protected]> wrote:

> On Wed, Jan 22, 2014 at 10:27 PM, coocood <[email protected]> wrote:
>
>> I want to run a Redis server cluster on Mesos, but I have some problems.
>>
>> The first problem is the storage path. Since this is a storage service, I
>> need to set the storage path outside the sandbox, so that the next run of
>> the service will find the data and the data will not get garbage
>> collected. The scheduler will keep track of the storage services and
>> their state, and pass the storage path to the executor to create or
>> restart a storage service.
>>
>> Does this solution have any problems?
>
> This works at the current time, given that the sandbox is not chrooted. In
> the future there may be a different story for persistent storage, but for
> now you can write outside your sandbox so long as you have the necessary
> privileges. However, you will need to make sure you can tolerate the disk
> failing, through a replication / backup strategy, which I imagine you've
> done with your Redis setup.
>
> In the future, to better deal with persistence requirements, we may expose
> raw disk as a resource that can have reservations applied to it like other
> resources. This would allow you to reserve disk resources for the
> particular role that needs them. Much of this is still up in the air.
>
>> Another problem: when the slave is deactivated due to a network
>> partition, or because the slave process has been gone for more than 75
>> seconds, then when the slave connects again it will be asked to shut
>> itself down along with all its tasks. So all the running storage services
>> will be shut down, and you have to start the slave again and restart all
>> the storage services. If a service takes a long time to restart, it will
>> be unavailable for a while.
>
> A few questions here:
>
> How are you running your slaves? Typically slaves are run under a tool
> that monitors the pid and restarts the slave when it exits. This ensures
> your slave is restarted automatically.
>
> We cannot distinguish between a partition and other classes of failure
> (such as machine failure), so the question is: why treat partitions any
> differently than the machine failing? How would your framework react when
> one of the machines running Redis fails? Could the same strategy be
> applied to network partitions?
>
>> Is there any way to solve this problem? For example, instead of simply
>> shutting down the deactivated slave and all its tasks, let the framework
>> decide how to handle the re-registration of a deactivated slave. I read
>> the code; it says "We disallow deactivated slaves from re-registering, we
>> don't allow the slave to re-register, as we've already informed
>> frameworks that the tasks were lost."
>
> The 75 second time could be made configurable, depending on the specific
> length of network partition you want to consider acceptable.
>
> As for re-registering deactivated slaves, we've opted not to allow this
> given the resulting complexity that would be exposed to frameworks. With
> your suggestion, we might inform the framework that a task is LOST, and
> subsequently inform it that the task is RUNNING.
>
> Tasks can go LOST for many reasons outside of network partitions, so a
> reaction is typically required on the LOST signal. At that point, when you
> later receive RUNNING, you've already reacted to the LOST signal, so what
> do you do? I think the semantics here would be fairly tricky.
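The LOST-then-RUNNING ambiguity described above can be made concrete with a toy bookkeeping sketch. This is not the Mesos scheduler API; the class and task-id naming are hypothetical, purely to illustrate why a late RUNNING update after LOST leaves the framework holding two copies of the task:

```python
class ToyScheduler:
    """Hypothetical framework bookkeeping (not the Mesos API) showing why
    a TASK_RUNNING update arriving after TASK_LOST is awkward: by then the
    scheduler has already reacted by launching a replacement."""

    def __init__(self):
        self.running = set()       # task ids believed to be running
        self.replacements = {}     # lost task id -> replacement task id

    def on_status(self, task_id, state):
        if state == "TASK_LOST":
            self.running.discard(task_id)
            # Typical reaction: immediately launch a replacement elsewhere.
            replacement = task_id + "-replacement"
            self.replacements[task_id] = replacement
            self.running.add(replacement)
        elif state == "TASK_RUNNING":
            # If this task was earlier declared LOST, both the original and
            # its replacement are now recorded as running; the framework must
            # pick one to kill -- exactly the tricky semantics Ben describes.
            self.running.add(task_id)
```

Running the partition scenario (`LOST` for a task, then a late `RUNNING`) ends with two live copies, which is the complexity Mesos avoids by refusing to let deactivated slaves re-register.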

