I would vote for updating the comments in mesos.proto to warn users not to re-use task IDs, for now.
On Sun, Feb 21, 2016 at 9:05 PM, Klaus Ma <[email protected]> wrote:

> Yes, it's dangerous to reuse a TaskID; there's a JIRA (MESOS-3070) where
> the master crashes on failover when it encounters a duplicated TaskID.
>
> Here's the scenario from *MESOS-3070*:
> T1: launch task (t1) on agent (agent_1)
> T2: master failover
> T3: launch another task (t1) on agent (agent_2) before agent_1
> re-registers
> T4: agent_1 re-registers; the master crashes on a `CHECK` when
> adding task (t1) back to its state
>
> Is there any special case where a framework has to re-use a TaskID? If there's
> no such case, I think we should ask frameworks to avoid reusing TaskIDs.
>
> ----
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform OpenSource Technology, STG, IBM GCG
> +86-10-8245 4084 | [email protected] | http://k82.me
>
> On Mon, Feb 22, 2016 at 12:24 PM, Erik Weathers <[email protected]> wrote:
>
>> tl;dr: *Reusing TaskIDs clashes with the mesos-agent recovery feature.*
>>
>> Adam Bordelon wrote:
>> > Reusing taskIds may work if you're guaranteed to never be running two
>> > instances of the same taskId simultaneously
>>
>> I've encountered another scenario where reusing TaskIDs is dangerous,
>> even if you meet the guarantee of never running two task instances with
>> the same TaskID simultaneously.
>>
>> *Scenario leading to a problem:*
>>
>> Say you have a task with ID "T1", which terminates for some reason, so
>> its terminal status update gets recorded into the agent's current "run" in
>> the task's updates file:
>>
>> MESOS_WORK_DIR/meta/slaves/latest/frameworks/FRAMEWORK_ID/executors/EXECUTOR_ID/runs/latest/tasks/T1/task.updates
>>
>> Then say a new task is launched with the same ID of T1, and it gets
>> scheduled under the same executor on the same agent host. In that case, the
>> task will reuse the same work_dir path, and thus inherit the already
>> recorded "terminal status update" in its task.updates file.
>> So the updates
>> file has a stream of updates that might look like this:
>>
>> - TASK_RUNNING
>> - TASK_FINISHED
>> - TASK_RUNNING
>>
>> Say you subsequently restart the mesos-slave/agent, expecting all tasks
>> to survive the restart via the recovery process. Unfortunately, T1 is
>> terminated, because the task recovery logic
>> <https://github.com/apache/mesos/blob/0.27.0/src/slave/slave.cpp#L5701-L5708> [1]
>> looks at the current run's tasks' task.updates files, searching for tasks
>> with "terminal status updates", and then terminates any such tasks. So,
>> even though T1 was actually running just fine, it gets terminated because
>> at some point in its previous incarnation it had a "terminal status update"
>> recorded.
>>
>> *Leads to inconsistent state*
>>
>> Compounding the problem, this termination is done without informing the
>> executor, so the process underlying the task continues to run even
>> though Mesos thinks it's gone. That is really bad, since it leaves the
>> host in a different state than Mesos believes exists. E.g., if the task had
>> a port resource, then Mesos incorrectly thinks the port is now free, so a
>> framework might try to launch a task/executor that uses the port, but it
>> will fail because the process cannot bind to the port.
>>
>> *Change recovery code or just update comments in mesos.proto?*
>>
>> Perhaps this behavior could be considered a "bug", and the recovery logic
>> that processes task status updates could be modified to ignore "terminal
>> status updates" if there is a subsequent TASK_RUNNING update in the
>> task.updates file. If that sounds like a desirable change, I'm happy to
>> file a JIRA issue for it and work on the fix myself.
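The difference between the current recovery behavior and the proposed fix can be sketched as follows. This is a hypothetical Python illustration, not the actual C++ code in slave.cpp; the task.updates stream is modeled as a plain list of state names:

```python
# Hypothetical sketch of the agent's recovery decision for one task,
# modeled on the behavior described above (not the real Mesos C++ code).

TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

def is_terminated_current_behavior(updates):
    """Current behavior: any terminal update in task.updates kills the task."""
    return any(state in TERMINAL_STATES for state in updates)

def is_terminated_proposed_fix(updates):
    """Proposed fix: a terminal update is ignored if a subsequent
    TASK_RUNNING update follows it, i.e. only the latest update decides."""
    return bool(updates) and updates[-1] in TERMINAL_STATES

# The stream recorded for a reused TaskID, as in the scenario above:
stream = ["TASK_RUNNING", "TASK_FINISHED", "TASK_RUNNING"]

print(is_terminated_current_behavior(stream))  # True  -> task wrongly killed
print(is_terminated_proposed_fix(stream))      # False -> task survives recovery
```

With unique TaskIDs, the two functions always agree; they only diverge on exactly the reused-TaskID stream shown in the scenario.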
>>
>> If we think the recovery logic is fine as it is, then we should update
>> these comments
>> <https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66> [2]
>> in mesos.proto, since they are incorrect given the behavior I just
>> encountered:
>>
>>> A framework generated ID to distinguish a task. The ID must remain
>>> unique while the task is active. However, a framework can reuse an
>>> ID _only_ if a previous task with the same ID has reached a
>>> terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.).
>>
>> *Conclusion*
>>
>> It is indeed dangerous to reuse a TaskID for separate task runs, even if
>> they are guaranteed not to run concurrently.
>>
>> - Erik
>>
>> P.S. I encountered this problem while trying to use mesos-agent recovery
>> with the storm-mesos framework <https://github.com/mesos/storm> [3].
>> Notably, this framework sets the TaskID to
>> "<agenthostname>-<stormworkerport>" for the storm worker tasks, so when a
>> storm worker dies and is reborn on that host, the TaskID gets reused. But
>> then the task doesn't survive an agent restart (even though the worker
>> *process* does survive, putting us in an inconsistent state!).
>>
>> P.P.S. Being able to enable verbose logging in the mesos-slave/agent with
>> the GLOG_v=3 environment variable is *super* convenient! It would have
>> taken me *way* longer to figure this out if the verbose logging didn't exist.
>>
>> P.P.P.S. To debug this, I wrote a tool
>> <https://github.com/erikdw/protoc-decode-lenprefix> [4] to decode
>> length-prefixed protobuf
>> <http://eli.thegreenplace.net/2011/08/02/length-prefix-framing-for-protocol-buffers> [5]
>> files, such as task.updates.
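For reference, length-prefix framing itself is simple to sketch. Below is a minimal Python illustration of one common scheme, a base-128 varint length prefix as described in [5]; the exact prefix format used by Mesos's checkpoint files may differ, and the opaque byte payloads here stand in for serialized StatusUpdateRecord protobufs:

```python
# Minimal sketch of varint length-prefix framing (see [5]): each record in
# the stream is preceded by its byte length encoded as a base-128 varint.
# Payloads are opaque bytes standing in for serialized protobuf messages.

def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_records(data):
    """Split a length-prefixed stream back into individual payloads."""
    records, pos = [], 0
    while pos < len(data):
        # Decode the varint length prefix.
        length, shift = 0, 0
        while True:
            byte = data[pos]
            pos += 1
            length |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        records.append(data[pos:pos + length])
        pos += length
    return records

payloads = [b"update-1", b"update-2" * 30]  # second payload is > 127 bytes
stream = b"".join(encode_varint(len(p)) + p for p in payloads)
assert decode_records(stream) == payloads
```

The second payload is longer than 127 bytes on purpose, so the round trip exercises a multi-byte varint prefix.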
>>
>> Here's an example of invoking the tool (notably, it has the same syntax
>> as "protoc --decode", but handles the length-prefix headers):
>>
>> cat task.updates | \
>>   protoc-decode-lenprefix \
>>     --decode mesos.internal.StatusUpdateRecord \
>>     -I MESOS_CODE/src -I MESOS_CODE/include \
>>     MESOS_CODE/src/messages/messages.proto
>>
>> [1] https://github.com/apache/mesos/blob/0.27.0/src/slave/slave.cpp#L5701-L5708
>> [2] https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66
>> [3] https://github.com/mesos/storm
>> [4] https://github.com/erikdw/protoc-decode-lenprefix
>> [5] http://eli.thegreenplace.net/2011/08/02/length-prefix-framing-for-protocol-buffers
>>
>> On Sat, Jul 11, 2015 at 11:45 AM, CCAAT <[email protected]> wrote:
>>
>>> I'd be most curious to see a working example of this idea, prefixes
>>> and all, for sleeping (long-term sleeping) nodes (slaves and masters).
>>>
>>> Anybody, do post what you have done or are doing with this TaskID reuse
>>> and reservation experimentation. Many are probably interested, for a
>>> variety of reasons including but not limited to security, auditing, and
>>> node diversification. My interests are in self-modifying code, which can
>>> be achieved while the nodes sleep, for some very interesting applications.
>>>
>>> James
>>>
>>> On 07/11/2015 06:01 AM, Adam Bordelon wrote:
>>>
>>>> Reusing taskIds may work if you're guaranteed to never be running two
>>>> instances of the same taskId simultaneously, but I could imagine a
>>>> particularly dangerous scenario where a master and slave experience a
>>>> network partition, so the master declares the slave lost and therefore
>>>> its tasks lost, and then the framework scheduler launches a new task
>>>> with the same taskId. However, the task is still running on the original
>>>> slave.
>>>> When the slave reregisters and claims it is running that taskId,
>>>> or that that taskId has completed, the Mesos master may have a difficult
>>>> time reconciling which instance of the task is on which node and in
>>>> which status, since it expects only one instance to exist at a time.
>>>> You may be better off using a fixed taskId prefix and appending an
>>>> incrementing instance/trial number, so that each run gets a unique ID.
>>>> Also note that taskIds only need to be unique within a single
>>>> frameworkId, so don't worry about conflicting with other frameworks.
>>>> TL;DR: I wouldn't recommend it.
>>>>
>>>> On Fri, Jul 10, 2015 at 10:20 AM, Antonio Fernández
>>>> <[email protected]> wrote:
>>>>
>>>> Sounds risky. Every task should have its own unique ID; collisions
>>>> could happen, along with unexpected issues.
>>>>
>>>> I think it will be as hard to monitor whether you can start a
>>>> task again as to have a mechanism to know its ID.
>>>>
>>>> On 10 Jul 2015, at 19:14, Jie Yu <[email protected]> wrote:
>>>>
>>>>> Re-using Task IDs is definitely not encouraged. As far as I know,
>>>>> much of the Mesos code assumes Task IDs are unique. So I probably
>>>>> wouldn't risk it.
>>>>>
>>>>> On Fri, Jul 10, 2015 at 10:06 AM, Sargun Dhillon <[email protected]> wrote:
>>>>>
>>>>> Is reusing Task IDs good behaviour? Let's say that I have some
>>>>> singleton task - I'll call it a monitoring service. It's always going
>>>>> to be the same process, doing the same thing, and there will only ever
>>>>> be one around (per instance of a framework). Reading the protobuf doc,
>>>>> I learned this:
>>>>>
>>>>> /**
>>>>>  * A framework generated ID to distinguish a task. The ID must remain
>>>>>  * unique while the task is active.
>>>>> However, a framework can reuse an
>>>>>  * ID _only_ if a previous task with the same ID has reached a
>>>>>  * terminal state (e.g., TASK_FINISHED, TASK_LOST,
>>>>>  * TASK_KILLED, etc.).
>>>>>  */
>>>>> message TaskID {
>>>>>   required string value = 1;
>>>>> }
>>>>> ---
>>>>> Which makes me think that it's reasonable to just give this task the
>>>>> same TaskID, and that every time I bring it from a terminal status to
>>>>> running once more, I can reuse the same ID. This also gives me the
>>>>> benefit of being able to more easily locate the task for a given
>>>>> framework, and I'm able to exploit Mesos for some weak guarantees
>>>>> that there won't be multiple of these running (don't worry, they lock
>>>>> in ZooKeeper, and concurrent runs don't do anything; they just fail).
>>>>>
>>>>> Opinions?
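Adam's suggestion above, a fixed TaskID prefix plus an incrementing instance number, can be sketched like this. This is a hypothetical Python illustration; a real scheduler would persist the counter (e.g., in ZooKeeper) so it survives scheduler failover, rather than keeping it in memory:

```python
# Sketch of Adam's suggestion: keep a stable, searchable prefix but make
# every run's TaskID unique by appending an incrementing instance number.
# In-memory counter for illustration only; a real scheduler would persist
# it so that IDs stay unique across scheduler restarts.
import itertools

class TaskIdGenerator:
    def __init__(self, prefix):
        self.prefix = prefix
        self.counter = itertools.count(1)

    def next_id(self):
        return "%s-%d" % (self.prefix, next(self.counter))

# E.g., for the storm-mesos case, where the old scheme was
# "<agenthostname>-<stormworkerport>", the prefix stays searchable
# while each worker rebirth still gets a fresh TaskID:
gen = TaskIdGenerator("host1-31000")
print(gen.next_id())  # host1-31000-1
print(gen.next_id())  # host1-31000-2  -- each run gets a unique ID
```

This keeps the "easily locate the task" property Sargun wanted (filter on the prefix) without ever reusing a full TaskID.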

