As someone who has been there and back again (reusing task IDs and realizing it's a terrible idea), I'd put some advice in the docs + mesos.proto to compose task IDs from GUIDs, and add that it's dangerous to reuse them.
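For example, a minimal C++ sketch, assuming a framework linked against libmesos and the bundled stout headers (makeTaskId and the naming scheme are illustrative, not an existing Mesos helper):

    // Compose a TaskID from a stable logical name plus a fresh UUID, so
    // no two launches ever share an ID, yet IDs stay greppable by prefix.
    #include <string>

    #include <mesos/mesos.hpp>  // generated TaskID protobuf
    #include <stout/uuid.hpp>   // UUID::random(), bundled with Mesos

    mesos::TaskID makeTaskId(const std::string& logicalName)
    {
      mesos::TaskID id;
      // e.g. "monitoring-service.6ba7b810-9dad-11d1-80b4-00c04fd430c8"
      id.set_value(logicalName + "." + UUID::random().toString());
      return id;
    }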
I would advocate for a mechanism to prevent the usage of non-unique IDs for executors, tasks, and frameworks, but I feel that's a more complex and larger problem.

On Mon, Feb 22, 2016 at 1:19 PM, Vinod Kone <[email protected]> wrote:
> I would vote for updating the comments in mesos.proto to warn users not to reuse task IDs, for now.
>
> On Sun, Feb 21, 2016 at 9:05 PM, Klaus Ma <[email protected]> wrote:
>>
>> Yes, it's dangerous to reuse a TaskID; there's a JIRA (MESOS-3070) where the master crashes on failover if it encounters a duplicated TaskID.
>>
>> Here's the case of MESOS-3070:
>> T1: launch task (t1) on agent (agent_1)
>> T2: master failover
>> T3: launch another task (t1) on agent (agent_2) before agent_1 re-registers
>> T4: agent_1 re-registers; the master crashes on a `CHECK` when adding task (t1) back to its state
>>
>> Is there any special case where a framework has to reuse a TaskID? If there is no such case, I think we should ask frameworks to avoid reusing TaskIDs.
>>
>> ----
>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>> Platform OpenSource Technology, STG, IBM GCG
>> +86-10-8245 4084 | [email protected] | http://k82.me
>>
>> On Mon, Feb 22, 2016 at 12:24 PM, Erik Weathers <[email protected]> wrote:
>>>
>>> tl;dr: Reusing TaskIDs clashes with the mesos-agent recovery feature.
>>>
>>> Adam Bordelon wrote:
>>> > Reusing taskIds may work if you're guaranteed to never be running two
>>> > instances of the same taskId simultaneously
>>>
>>> I've encountered another scenario where reusing TaskIDs is dangerous, even if you meet the guarantee of never running 2 task instances with the same TaskID simultaneously.
>>>
>>> Scenario leading to a problem:
>>>
>>> Say you have a task with ID "T1", which terminates for some reason, so its terminal status update gets recorded into the agent's current "run" in the task's updates file:
>>>
>>> MESOS_WORK_DIR/meta/slaves/latest/frameworks/FRAMEWORK_ID/executors/EXECUTOR_ID/runs/latest/tasks/T1/task.updates
>>>
>>> Then say a new task is launched with the same ID of T1, and it gets scheduled under the same executor on the same agent host. In that case, the task will reuse the same work_dir path, and thus inherit the already recorded "terminal status update" in its task.updates file. So the updates file has a stream of updates that might look like this:
>>>
>>> TASK_RUNNING
>>> TASK_FINISHED
>>> TASK_RUNNING
>>>
>>> Say you subsequently restart the mesos-slave/agent, expecting all tasks to survive the restart via the recovery process. Unfortunately, T1 is terminated, because the task recovery logic [1] looks at the current run's tasks' task.updates files, searching for tasks with "terminal status updates", and then terminates any such tasks. So even though T1 was actually running just fine, it gets terminated because at some point in its previous incarnation a "terminal status update" was recorded.
>>>
>>> Leads to inconsistent state
>>>
>>> Compounding the problem, this termination is done without informing the executor, so the process underlying the task continues to run even though Mesos thinks it's gone. That is really bad, since it leaves the host in a different state than Mesos believes it to be in. E.g., if the task held a port resource, Mesos incorrectly thinks the port is now free, so a framework might try to launch a task/executor that uses the port, but that will fail because the lingering process is still bound to the port.
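To make that failure mode concrete, here is an illustrative C++ sketch of the decision the recovery path effectively makes (the real logic lives in src/slave/slave.cpp [1] and operates on Mesos-internal types; wouldTerminateOnRecovery and isTerminal are made-up names):

    #include <vector>

    #include <mesos/mesos.hpp>  // mesos::TaskState enum

    bool isTerminal(mesos::TaskState state)
    {
      return state == mesos::TASK_FINISHED ||
             state == mesos::TASK_FAILED   ||
             state == mesos::TASK_KILLED   ||
             state == mesos::TASK_LOST;
    }

    // Replaying {TASK_RUNNING, TASK_FINISHED, TASK_RUNNING} returns true:
    // the recovered task T1 is terminated despite actually being alive,
    // because ANY terminal update in the stream counts, even when a later
    // TASK_RUNNING (from an ID-reusing second run) follows it.
    bool wouldTerminateOnRecovery(const std::vector<mesos::TaskState>& updates)
    {
      for (mesos::TaskState state : updates) {
        if (isTerminal(state)) {
          return true;  // first terminal update wins; later updates ignored
        }
      }
      return false;
    }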
>>> Change recovery code or just update comments in mesos.proto?
>>>
>>> Perhaps this behavior could be considered a "bug", and the recovery logic that processes task status updates could be modified to ignore "terminal status updates" when a subsequent TASK_RUNNING update exists in the task.updates file. If that sounds like a desirable change, I'm happy to file a JIRA issue for it and work on the fix myself.
>>>
>>> If we think the recovery logic is fine as it is, then we should update these comments [2] in mesos.proto, since they are incorrect given the behavior I just described:
>>>
>>>> A framework generated ID to distinguish a task. The ID must remain
>>>> unique while the task is active. However, a framework can reuse an
>>>> ID _only_ if a previous task with the same ID has reached a
>>>> terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.).
>>>
>>> Conclusion
>>>
>>> It is indeed dangerous to reuse a TaskID for separate task runs, even if they are guaranteed not to run concurrently.
>>>
>>> - Erik
>>>
>>> P.S. I encountered this problem while trying to use mesos-agent recovery with the storm-mesos framework [3]. Notably, this framework sets the TaskID to "<agenthostname>-<stormworkerport>" for the storm worker tasks, so when a storm worker dies and is reborn on that host, the TaskID gets reused. But then the task doesn't survive an agent restart (even though the worker *process* does survive, putting us in an inconsistent state!).
>>>
>>> P.P.S. Being able to enable verbose logging in the mesos-slave/agent with the GLOG_v=3 environment variable is *super* convenient! It would have taken me *way* longer to figure this out if the verbose logging didn't exist.
>>>
>>> P.P.P.S. To debug this, I wrote a tool [4] to decode length-prefixed protobuf [5] files, such as task.updates. Here's an example of invoking the tool (notably, it has the same syntax as "protoc --decode", but handles the length-prefix headers):
>>>
>>>     cat task.updates | \
>>>       protoc-decode-lenprefix \
>>>         --decode mesos.internal.StatusUpdateRecord \
>>>         -I MESOS_CODE/src -I MESOS_CODE/include \
>>>         MESOS_CODE/src/messages/messages.proto
>>>
>>> [1] https://github.com/apache/mesos/blob/0.27.0/src/slave/slave.cpp#L5701-L5708
>>> [2] https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66
>>> [3] https://github.com/mesos/storm
>>> [4] https://github.com/erikdw/protoc-decode-lenprefix
>>> [5] http://eli.thegreenplace.net/2011/08/02/length-prefix-framing-for-protocol-buffers
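A sketch of the change Erik proposes above, again illustrative only (it reuses isTerminal() from the previous sketch and is not the actual agent code): decide from the latest recorded update instead of terminating on the first terminal one.

    #include <vector>

    #include <mesos/mesos.hpp>

    // Proposed behavior: a subsequent TASK_RUNNING overrides an earlier
    // terminal update, so a task that reused an ID still survives recovery.
    bool wouldTerminateOnRecoveryFixed(
        const std::vector<mesos::TaskState>& updates)
    {
      // {TASK_RUNNING, TASK_FINISHED, TASK_RUNNING} -> false: T1 survives.
      return !updates.empty() && isTerminal(updates.back());
    }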
>>> On Sat, Jul 11, 2015 at 11:45 AM, CCAAT <[email protected]> wrote:
>>>>
>>>> I'd be most curious to see a working example of this idea, prefixes and all, for sleeping (long-term sleeping) nodes (slaves and masters).
>>>>
>>>> Anybody, do post what you have done or are doing with these TaskID-reuse and reservation experiments. Many are probably interested, for a variety of reasons, including but not limited to security, auditing, and node diversification.... My interests are in self-modifying code, which can be achieved while the nodes sleep, for some very interesting applications.
>>>>
>>>> James
>>>>
>>>> On 07/11/2015 06:01 AM, Adam Bordelon wrote:
>>>>>
>>>>> Reusing taskIds may work if you're guaranteed to never be running two instances of the same taskId simultaneously, but I could imagine a particularly dangerous scenario where a master and slave experience a network partition, so the master declares the slave lost (and therefore its tasks lost), and then the framework scheduler launches a new task with the same taskId. However, the task is still running on the original slave. When the slave reregisters and claims it is running that taskId, or that that taskId has completed, the Mesos master may have a difficult time reconciling which instance of the task is on which node and in which status, since it expects only one instance to exist at a time.
>>>>> You may be better off using a fixed taskId prefix and appending an incrementing instance/trial number, so that each run gets a unique ID.
>>>>> Also note that taskIds only need to be unique within a single frameworkId, so don't worry about conflicting with other frameworks.
>>>>> TL;DR: I wouldn't recommend it.
>>>>>
>>>>> On Fri, Jul 10, 2015 at 10:20 AM, Antonio Fernández <[email protected]> wrote:
>>>>>
>>>>>     Sounds risky. Every task should have its own unique ID; collisions could happen, along with unexpected issues.
>>>>>
>>>>>     I think it will be as hard to track whether you can start a task again as it is to have a mechanism for knowing its ID.
>>>>>
>>>>>> On 10 Jul 2015, at 19:14, Jie Yu <[email protected]> wrote:
>>>>>>
>>>>>> Reusing task IDs is definitely not encouraged. As far as I know, much of the Mesos code assumes task IDs are unique, so I probably wouldn't risk it.
>>>>>>
>>>>>> On Fri, Jul 10, 2015 at 10:06 AM, Sargun Dhillon <[email protected]> wrote:
>>>>>>
>>>>>> Is reusing task IDs good behaviour? Let's say that I have some singleton task - I'll call it a monitoring service. It's always going to be the same process, doing the same thing, and there will only ever be one around (per instance of a framework). Reading the protobuf doc, I learned this:
>>>>>>
>>>>>> /**
>>>>>>  * A framework generated ID to distinguish a task. The ID must remain
>>>>>>  * unique while the task is active. However, a framework can reuse an
>>>>>>  * ID _only_ if a previous task with the same ID has reached a
>>>>>>  * terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.).
>>>>>>  */
>>>>>> message TaskID {
>>>>>>   required string value = 1;
>>>>>> }
>>>>>> ---
>>>>>> Which makes me think that it's reasonable to just give this task the same TaskID, and that every time I bring it from a terminal status back to running, I can reuse the same ID. This also gives me the benefit of being able to more easily locate the task for a given framework, and I'm able to exploit Mesos for some weak guarantees that there won't be multiple of these running (don't worry, they lock in ZooKeeper, and concurrent runs don't do anything; they just fail).
>>>>>>
>>>>>> Opinions?
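Adam's fixed-prefix-plus-trial-number suggestion above, as a minimal C++ sketch (nextTaskId is hypothetical; note the trial counter would have to be persisted, or recovered via task reconciliation, across scheduler failovers to stay unique):

    #include <string>

    #include <mesos/mesos.hpp>

    // Unique per run: a stable, human-readable prefix plus an
    // incrementing trial number, e.g. "myhost-31000.0", "myhost-31000.1".
    mesos::TaskID nextTaskId(const std::string& prefix, int& trial)
    {
      mesos::TaskID id;
      id.set_value(prefix + "." + std::to_string(trial++));
      return id;
    }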

