Thanks for the responses. Filed a ticket for this:
https://issues.apache.org/jira/browse/MESOS-4737

- Erik

On Mon, Feb 22, 2016 at 1:23 PM, Sargun Dhillon <[email protected]> wrote:
> As someone who has been there and back again (reusing task IDs, and
> realizing it's a terrible idea), I'd put some advice in the docs and
> mesos.proto to compose task IDs from GUIDs, and note that it's
> dangerous to reuse them.
>
> I would advocate for a mechanism to prevent the use of non-unique
> IDs for executors, tasks, and frameworks, but I feel that's a more
> complex and larger problem.
>
> On Mon, Feb 22, 2016 at 1:19 PM, Vinod Kone <[email protected]> wrote:
>> I would vote for updating the comments in mesos.proto to warn users
>> not to reuse task IDs, for now.
>>
>> On Sun, Feb 21, 2016 at 9:05 PM, Klaus Ma <[email protected]> wrote:
>>> Yes, it's dangerous to reuse a TaskID; there's a JIRA (MESOS-3070)
>>> where the master will crash during failover if it sees a duplicated
>>> TaskID.
>>>
>>> Here's the case from MESOS-3070:
>>> T1: launch task (t1) on agent (agent_1)
>>> T2: master failover
>>> T3: launch another task (t1) on agent (agent_2) before agent_1
>>>     re-registers
>>> T4: agent_1 re-registers; the master crashes on a `CHECK` when
>>>     adding task (t1) back to its state
>>>
>>> Is there any special case where a framework has to reuse a TaskID?
>>> If not, I think we should ask frameworks to avoid reusing TaskIDs.
>>>
>>> ----
>>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>>> Platform OpenSource Technology, STG, IBM GCG
>>> +86-10-8245 4084 | [email protected] | http://k82.me
>>>
>>> On Mon, Feb 22, 2016 at 12:24 PM, Erik Weathers <[email protected]> wrote:
>>>> tl;dr: Reusing TaskIDs clashes with the mesos-agent recovery feature.
>>>>
>>>> Adam Bordelon wrote:
>>>> > Reusing taskIds may work if you're guaranteed to never be running
>>>> > two instances of the same taskId simultaneously
>>>>
>>>> I've encountered another scenario where reusing TaskIDs is
>>>> dangerous, even if you meet the guarantee of never running two task
>>>> instances with the same TaskID simultaneously.
>>>>
>>>> Scenario leading to a problem:
>>>>
>>>> Say you have a task with ID "T1", which terminates for some reason,
>>>> so its terminal status update gets recorded into the agent's current
>>>> "run" in the task's updates file:
>>>>
>>>> MESOS_WORK_DIR/meta/slaves/latest/frameworks/FRAMEWORK_ID/executors/EXECUTOR_ID/runs/latest/tasks/T1/task.updates
>>>>
>>>> Then say a new task is launched with the same ID of T1, and it gets
>>>> scheduled under the same executor on the same agent host. In that
>>>> case, the task reuses the same work_dir path, and thus inherits the
>>>> already recorded "terminal status update" in its task.updates file.
>>>> So the updates file holds a stream of updates that might look like
>>>> this:
>>>>
>>>> TASK_RUNNING
>>>> TASK_FINISHED
>>>> TASK_RUNNING
>>>>
>>>> Say you subsequently restart the mesos-slave/agent, expecting all
>>>> tasks to survive the restart via the recovery process.
>>>> Unfortunately, T1 is terminated, because the task recovery logic [1]
>>>> looks at the current run's tasks' task.updates files, searching for
>>>> tasks with "terminal status updates", and then terminates any such
>>>> tasks. So, even though T1 was actually running just fine, it gets
>>>> terminated because at some point in its previous incarnation it had
>>>> a "terminal status update" recorded.
>>>>
>>>> Leads to inconsistent state
>>>>
>>>> Compounding the problem, this termination is done without informing
>>>> the executor, so the process underlying the task continues to run
>>>> even though Mesos thinks it's gone. That is really bad, since it
>>>> leaves the host in a different state than Mesos believes exists.
>>>> E.g., if the task held a port resource, Mesos incorrectly thinks the
>>>> port is now free, so a framework might try to launch a task/executor
>>>> that uses the port, but it will fail because the process cannot bind
>>>> to the port.
>>>>
>>>> Change recovery code or just update comments in mesos.proto?
>>>>
>>>> Perhaps this behavior could be considered a "bug", and the recovery
>>>> logic that processes task status updates could be modified to ignore
>>>> "terminal status updates" if there is a subsequent TASK_RUNNING
>>>> update in the task.updates file. If that sounds like a desirable
>>>> change, I'm happy to file a JIRA issue for it and work on the fix
>>>> myself.
>>>>
>>>> If we think the recovery logic is fine as it is, then we should
>>>> update these comments [2] in mesos.proto, since they are incorrect
>>>> given the behavior I just encountered:
>>>>
>>>> > A framework generated ID to distinguish a task. The ID must remain
>>>> > unique while the task is active. However, a framework can reuse an
>>>> > ID _only_ if a previous task with the same ID has reached a
>>>> > terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.).
>>>>
>>>> Conclusion
>>>>
>>>> It is dangerous indeed to reuse a TaskID for separate task runs,
>>>> even if they are guaranteed not to run concurrently.
>>>>
>>>> - Erik
>>>>
>>>> P.S.: I encountered this problem while trying to use mesos-agent
>>>> recovery with the storm-mesos framework [3]. Notably, this framework
>>>> sets the TaskID to "<agenthostname>-<stormworkerport>" for the storm
>>>> worker tasks, so when a storm worker dies and is reborn on a host,
>>>> the TaskID gets reused. But then the task doesn't survive an agent
>>>> restart (even though the worker *process* does survive, putting us
>>>> in an inconsistent state!).
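The recovery change proposed above can be illustrated with a small sketch. This is a hypothetical Python illustration of the decision rule only, not the actual C++ recovery code in slave.cpp: instead of terminating a task when *any* terminal update appears in the replayed task.updates stream, terminate it only when the *latest* update is terminal.

```python
# Hypothetical sketch (not Mesos source code) contrasting the current
# recovery rule with the proposed one, using status-update names as
# plain strings.

TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

def is_terminated_current(updates):
    """Current behavior: any terminal update ever recorded terminates the task."""
    return any(u in TERMINAL_STATES for u in updates)

def is_terminated_proposed(updates):
    """Proposed behavior: only the most recent update decides."""
    return bool(updates) and updates[-1] in TERMINAL_STATES

# The replayed stream from the scenario above (TaskID reused under the
# same executor on the same agent):
stream = ["TASK_RUNNING", "TASK_FINISHED", "TASK_RUNNING"]

print(is_terminated_current(stream))   # True  -> task wrongly terminated
print(is_terminated_proposed(stream))  # False -> task survives recovery
```

Under the proposed rule, the stale TASK_FINISHED in the middle of the stream no longer kills a task whose latest recorded state is TASK_RUNNING.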
>>>>
>>>> P.P.S.: Being able to enable verbose logging in mesos-slave/agent
>>>> with the GLOG_v=3 environment variable is *super* convenient! It
>>>> would have taken me *way* longer to figure this out if the verbose
>>>> logging didn't exist.
>>>>
>>>> P.P.P.S.: To debug this, I wrote a tool [4] to decode
>>>> length-prefixed protobuf [5] files, such as task.updates. Here's an
>>>> example of invoking the tool (notably, it has the same syntax as
>>>> "protoc --decode", but handles the length-prefix headers):
>>>>
>>>> cat task.updates | \
>>>>   protoc-decode-lenprefix \
>>>>     --decode mesos.internal.StatusUpdateRecord \
>>>>     -I MESOS_CODE/src -I MESOS_CODE/include \
>>>>     MESOS_CODE/src/messages/messages.proto
>>>>
>>>> [1] https://github.com/apache/mesos/blob/0.27.0/src/slave/slave.cpp#L5701-L5708
>>>> [2] https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66
>>>> [3] https://github.com/mesos/storm
>>>> [4] https://github.com/erikdw/protoc-decode-lenprefix
>>>> [5] http://eli.thegreenplace.net/2011/08/02/length-prefix-framing-for-protocol-buffers
>>>>
>>>> On Sat, Jul 11, 2015 at 11:45 AM, CCAAT <[email protected]> wrote:
>>>>> I'd be most curious to see a working example of this idea, prefixes
>>>>> and all, for sleeping (long-term sleeping) nodes (slaves and
>>>>> masters).
>>>>>
>>>>> Anybody, do post what you have done or are doing with these
>>>>> task-ID reuse and reservation experiments. Many are probably
>>>>> interested, for a variety of reasons including but not limited to
>>>>> security, auditing, and node diversification.... My interests are
>>>>> in self-modifying code, which can be run while the nodes sleep, for
>>>>> some very interesting applications.
>>>>>
>>>>> James
>>>>>
>>>>> On 07/11/2015 06:01 AM, Adam Bordelon wrote:
>>>>>> Reusing taskIds may work if you're guaranteed to never be running
>>>>>> two instances of the same taskId simultaneously, but I could
>>>>>> imagine a particularly dangerous scenario where a master and slave
>>>>>> experience a network partition, so the master declares the slave
>>>>>> lost, and therefore its tasks lost, and then the framework
>>>>>> scheduler launches a new task with the same taskId. However, the
>>>>>> task is still running on the original slave. When the slave
>>>>>> reregisters and claims it is running that taskId, or that that
>>>>>> taskId has completed, the Mesos master may have a difficult time
>>>>>> reconciling which instance of the task is on which node and in
>>>>>> which status, since it expects only one instance to exist at a
>>>>>> time.
>>>>>> You may be better off using a fixed taskId prefix and appending an
>>>>>> incrementing instance/trial number, so that each run gets a unique
>>>>>> ID. Also note that taskIds only need to be unique within a single
>>>>>> frameworkId, so don't worry about conflicting with other
>>>>>> frameworks.
>>>>>> TL;DR: I wouldn't recommend it.
>>>>>>
>>>>>> On Fri, Jul 10, 2015 at 10:20 AM, Antonio Fernández
>>>>>> <[email protected]> wrote:
>>>>>>> Sounds risky. Every task should have its own unique ID;
>>>>>>> collisions could happen, along with unexpected issues.
>>>>>>>
>>>>>>> I think it will be as hard to monitor when you can start a task
>>>>>>> again as to have a mechanism for knowing its ID.
>>>>>>>
>>>>>>> On 10 Jul 2015, at 19:14, Jie Yu <[email protected]> wrote:
>>>>>>>> Reusing Task IDs is definitely not encouraged. As far as I know,
>>>>>>>> much of the Mesos code assumes Task IDs are unique. So I
>>>>>>>> probably wouldn't risk it.
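Both remedies suggested in the thread, Sargun's GUID-composed task IDs (near the top of the thread) and Adam's fixed prefix with an incrementing instance number, can be sketched as below. The helper names and the prefix format are illustrative, not anything Mesos prescribes; Mesos only requires the TaskID string to be unique within a framework.

```python
import itertools
import uuid

# Option 1: GUID-based IDs, per Sargun's suggestion -- collisions are
# practically impossible, even across scheduler restarts.
def guid_task_id(prefix):
    return "%s-%s" % (prefix, uuid.uuid4())

# Option 2: fixed prefix plus an incrementing instance number, per
# Adam's suggestion -- each rerun of the "same" logical task gets a
# fresh ID while staying easy to locate by prefix.
_counter = itertools.count()

def sequential_task_id(prefix):
    return "%s-%d" % (prefix, next(_counter))

# E.g., the storm-mesos "<agenthostname>-<stormworkerport>" scheme from
# Erik's P.S. would become unique per run:
a = sequential_task_id("host1-31000")
b = sequential_task_id("host1-31000")
assert a != b  # reruns of the same logical task never collide
```

Either scheme avoids the agent-recovery and master-failover hazards described above, because a rerun never shares a work_dir path or a master-side key with its previous incarnation.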
>>>>>>>>
>>>>>>>> On Fri, Jul 10, 2015 at 10:06 AM, Sargun Dhillon <[email protected]> wrote:
>>>>>>>>> Is reusing Task IDs good behaviour? Let's say that I have some
>>>>>>>>> singleton task -- I'll call it a monitoring service. It's
>>>>>>>>> always going to be the same process, doing the same thing, and
>>>>>>>>> there will only ever be one around (per instance of a
>>>>>>>>> framework). Reading the protobuf doc, I learned this:
>>>>>>>>>
>>>>>>>>> /**
>>>>>>>>>  * A framework generated ID to distinguish a task. The ID must
>>>>>>>>>  * remain unique while the task is active. However, a framework
>>>>>>>>>  * can reuse an ID _only_ if a previous task with the same ID
>>>>>>>>>  * has reached a terminal state (e.g., TASK_FINISHED,
>>>>>>>>>  * TASK_LOST, TASK_KILLED, etc.).
>>>>>>>>>  */
>>>>>>>>> message TaskID {
>>>>>>>>>   required string value = 1;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Which makes me think that it's reasonable to just give this
>>>>>>>>> task the same TaskID, and that every time I bring it from a
>>>>>>>>> terminal status back to running, I can reuse the same ID. This
>>>>>>>>> also gives me the benefit of being able to more easily locate
>>>>>>>>> the task for a given framework, and I'm able to exploit Mesos
>>>>>>>>> for some weak guarantees that there won't be multiples of these
>>>>>>>>> running (don't worry, they lock in ZooKeeper, and concurrent
>>>>>>>>> runs don't do anything; they just fail).
>>>>>>>>>
>>>>>>>>> Opinions?
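For reference, the length-prefix framing that Erik's P.P.P.S. decoder tool handles can be read along these lines. This is a minimal Python sketch that assumes a little-endian 32-bit byte count before each record; the exact on-disk framing Mesos uses may differ, so treat this as an illustration of the framing idea from link [5], not as a drop-in task.updates parser.

```python
import struct

# Sketch of length-prefix framing: each record (e.g., a serialized
# protobuf message) is preceded by a little-endian uint32 byte count.
# (Assumed header format -- Mesos's actual framing may differ.)

def write_records(records):
    """Frame a list of byte strings into one length-prefixed blob."""
    return b"".join(struct.pack("<I", len(r)) + r for r in records)

def read_records(data):
    """Split a length-prefixed blob back into its records."""
    records = []
    offset = 0
    while offset + 4 <= len(data):
        (size,) = struct.unpack_from("<I", data, offset)
        offset += 4
        records.append(data[offset:offset + size])
        offset += size
    return records

framed = write_records([b"TASK_RUNNING", b"TASK_FINISHED"])
assert read_records(framed) == [b"TASK_RUNNING", b"TASK_FINISHED"]
```

A real decoder like protoc-decode-lenprefix would additionally parse each recovered record as a protobuf message (e.g., mesos.internal.StatusUpdateRecord) before printing it.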

