Yes, it's dangerous to reuse a TaskID; there's a JIRA (MESOS-3070) showing that the Master will crash after a Master failover when a TaskID is duplicated.
Here's the case of *MESOS-3070*:

T1: launch task (t1) on agent (agent_1)
T2: master failover
T3: launch another task (t1) on agent (agent_2) before agent_1 re-registers
T4: agent_1 re-registers; the master crashes on a `CHECK` when adding task (t1) back to the master

Is there any special case where a framework has to reuse a TaskID? If there is no such case, I think we should ask frameworks to avoid reusing TaskIDs.

----
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

On Mon, Feb 22, 2016 at 12:24 PM, Erik Weathers <eweath...@groupon.com> wrote:

> tldr; *Reusing TaskIDs clashes with the mesos-agent recovery feature.*
>
> Adam Bordelon wrote:
> > Reusing taskIds may work if you're guaranteed to never be running two
> > instances of the same taskId simultaneously
>
> I've encountered another scenario where reusing TaskIDs is dangerous, even
> if you meet the guarantee of never running two task instances with the same
> TaskID simultaneously.
>
> *Scenario leading to a problem:*
>
> Say you have a task with ID "T1" which terminates for some reason, so its
> terminal status update gets recorded into the agent's current "run" in the
> task's updates file:
>
>     MESOS_WORK_DIR/meta/slaves/latest/frameworks/FRAMEWORK_ID/executors/EXECUTOR_ID/runs/latest/tasks/T1/task.updates
>
> Then say a new task is launched with the same ID of T1, and it gets
> scheduled under the same Executor on the same agent host. In that case, the
> task reuses the same work_dir path, and thus inherits the already
> recorded "terminal status update" in its task.updates file. So the updates
> file has a stream of updates that might look like this:
>
> - TASK_RUNNING
> - TASK_FINISHED
> - TASK_RUNNING
>
> Say you subsequently restart the mesos-slave/agent, expecting all tasks to
> survive the restart via the recovery process.
> Unfortunately, T1 is terminated, because the task recovery logic
> <https://github.com/apache/mesos/blob/0.27.0/src/slave/slave.cpp#L5701-L5708> [1]
> looks at the current run's tasks' task.updates files, searching for tasks
> with "terminal status updates", and then terminates any such tasks. So,
> even though T1 was actually running just fine, it gets terminated because
> at some point in its previous incarnation a "terminal status update" was
> recorded.
>
> *Leads to inconsistent state*
>
> Compounding the problem, this termination is done without informing the
> Executor, so the process underlying the task continues to run even
> though Mesos thinks it's gone. That is really bad, since it leaves the
> host in a different state than Mesos believes exists. For example, if the
> task held a port resource, Mesos incorrectly thinks the port is now free,
> so a framework might try to launch a task/executor that uses the port, but
> it will fail because the process cannot bind to the port.
>
> *Change recovery code or just update comments in mesos.proto?*
>
> Perhaps this behavior could be considered a "bug", and the recovery logic
> that processes task status updates could be modified to ignore "terminal
> status updates" if there is a subsequent TASK_RUNNING update in the
> task.updates file. If that sounds like a desirable change, I'm happy to
> file a JIRA issue for it and work on the fix myself.
>
> If we think the recovery logic is fine as it is, then we should update
> these comments
> <https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66> [2]
> in mesos.proto, since they are incorrect given the behavior I just
> encountered:
>
>> A framework generated ID to distinguish a task. The ID must remain
>> unique while the task is active. However, a framework can reuse an
>> ID _only_ if a previous task with the same ID has reached a
>> terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.).
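The recovery check described above, and the proposed fix, can be sketched as a simplified model (Python here purely for illustration; the real logic is C++ in slave.cpp, and the function names below are invented for this sketch):

```python
# Simplified model of the agent recovery decision for one task's
# status-update stream. This is a sketch of the behavior described in
# the thread, not Mesos's actual implementation.

TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

def is_terminated_current_behavior(updates):
    """Current behavior: any terminal update anywhere in the stream
    causes recovery to treat the task as terminated."""
    return any(u in TERMINAL_STATES for u in updates)

def is_terminated_proposed_fix(updates):
    """Proposed fix: a terminal update is ignored if a subsequent
    TASK_RUNNING was recorded, i.e. only the latest update decides."""
    return bool(updates) and updates[-1] in TERMINAL_STATES

# The task.updates stream from the scenario above: reusing the TaskID
# appended a second TASK_RUNNING after the old terminal update.
stream = ["TASK_RUNNING", "TASK_FINISHED", "TASK_RUNNING"]

print(is_terminated_current_behavior(stream))  # True  -> task wrongly killed
print(is_terminated_proposed_fix(stream))      # False -> task survives recovery
```

The difference between the two functions is exactly the bug-or-feature question raised above: whether history or only the most recent state should decide a task's fate during recovery.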
> *Conclusion*
>
> It is indeed dangerous to reuse a TaskID for separate task runs, even if
> they are guaranteed not to run concurrently.
>
> - Erik
>
> P.S. I encountered this problem while trying to use mesos-agent recovery
> with the storm-mesos framework <https://github.com/mesos/storm> [3].
> Notably, this framework sets the TaskID to
> "<agenthostname>-<stormworkerport>" for the storm worker tasks, so when a
> storm worker dies and is reborn on that host, the TaskID gets reused. But
> then the task doesn't survive an agent restart (even though the worker
> *process* does survive, putting us in an inconsistent state!).
>
> P.P.S. Being able to enable verbose logging in mesos-slave/agent with the
> GLOG_v=3 environment variable is *super* convenient! It would have taken me
> *way* longer to figure this out if the verbose logging didn't exist.
>
> P.P.P.S. To debug this, I wrote a tool
> <https://github.com/erikdw/protoc-decode-lenprefix> [4] to decode
> length-prefixed protobuf
> <http://eli.thegreenplace.net/2011/08/02/length-prefix-framing-for-protocol-buffers> [5]
> files, such as task.updates.
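The length-prefix framing that such a tool must handle can be illustrated with a small reader. The 4-byte little-endian prefix below is an assumption made for this sketch; the exact encoding Mesos uses in task.updates is what protoc-decode-lenprefix deals with, and may differ:

```python
import io
import struct

def read_length_prefixed(stream):
    """Yield raw record payloads from a length-prefixed byte stream.
    Assumption for this sketch: each record is preceded by a 4-byte
    little-endian unsigned length. The payloads would then be fed to a
    protobuf parser (e.g. for StatusUpdateRecord messages)."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # clean end of stream (or truncated trailing header)
        (length,) = struct.unpack("<I", header)
        payload = stream.read(length)
        if len(payload) < length:
            return  # truncated record; stop rather than yield garbage
        yield payload

# Two framed records: len=5 + b"hello", then len=3 + b"abc".
buf = io.BytesIO(struct.pack("<I", 5) + b"hello" + struct.pack("<I", 3) + b"abc")
print(list(read_length_prefixed(buf)))  # [b'hello', b'abc']
```

This framing is needed because protobuf messages are not self-delimiting: a parser cannot tell where one serialized message ends and the next begins without an external length marker.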
> Here's an example of invoking the tool (notably, it has the same syntax as
> "protoc --decode", but handles the length-prefix headers):
>
>     cat task.updates | \
>       protoc-decode-lenprefix \
>         --decode mesos.internal.StatusUpdateRecord \
>         -I MESOS_CODE/src -I MESOS_CODE/include \
>         MESOS_CODE/src/messages/messages.proto
>
> [1] https://github.com/apache/mesos/blob/0.27.0/src/slave/slave.cpp#L5701-L5708
> [2] https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66
> [3] https://github.com/mesos/storm
> [4] https://github.com/erikdw/protoc-decode-lenprefix
> [5] http://eli.thegreenplace.net/2011/08/02/length-prefix-framing-for-protocol-buffers
>
> On Sat, Jul 11, 2015 at 11:45 AM, CCAAT <cc...@tampabay.rr.com> wrote:
>
>> I'd be most curious to see a working example of this idea, prefixes
>> and all, for sleeping (long-term sleeping) nodes (slaves and masters).
>>
>> Anybody, do post what you have done / are doing with these TaskID-reuse
>> and reservation experiments. Probably many are interested, for a variety
>> of reasons including but not limited to security, auditing, and node
>> diversification interests.... My interests are in self-modifying
>> code, which can be achieved while the nodes sleep, for some very
>> interesting applications.
>>
>> James
>>
>> On 07/11/2015 06:01 AM, Adam Bordelon wrote:
>>
>>> Reusing taskIds may work if you're guaranteed to never be running two
>>> instances of the same taskId simultaneously, but I could imagine a
>>> particularly dangerous scenario where a master and slave experience a
>>> network partition: the master declares the slave lost, and therefore
>>> its tasks lost, and then the framework scheduler launches a new task
>>> with the same taskId. However, the task is still running on the original
>>> slave.
>>> When the slave reregisters and claims it is running that taskId,
>>> or that that taskId has completed, the Mesos master may have a difficult
>>> time reconciling which instance of the task is on which node and in
>>> which status, since it expects only one instance to exist at a time.
>>> You may be better off using a fixed taskId prefix and appending an
>>> incrementing instance/trial number so that each run gets a unique ID.
>>> Also note that taskIds only need to be unique within a single
>>> frameworkId, so don't worry about conflicting with other frameworks.
>>> TL;DR: I wouldn't recommend it.
>>>
>>> On Fri, Jul 10, 2015 at 10:20 AM, Antonio Fernández
>>> <antonio.fernan...@bq.com> wrote:
>>>
>>> Sounds risky. Every task should have its own unique ID; collisions
>>> could happen, causing unexpected issues.
>>>
>>> I think it will be as hard to monitor whether you can start a task
>>> again as to build a mechanism to know its ID.
>>>
>>>> On 10 Jul 2015, at 19:14, Jie Yu <yujie....@gmail.com> wrote:
>>>>
>>>> Reusing Task IDs is definitely not encouraged. As far as I know,
>>>> much of the Mesos code assumes a Task ID is unique. So I probably
>>>> wouldn't risk it.
>>>>
>>>> On Fri, Jul 10, 2015 at 10:06 AM, Sargun Dhillon <sar...@sargun.me> wrote:
>>>>
>>>> Is reusing Task IDs good behaviour? Let's say that I have some
>>>> singleton task - I'll call it a monitoring service. It's always
>>>> going to be the same process, doing the same thing, and there will
>>>> only ever be one around (per instance of a framework). Reading the
>>>> protobuf doc, I learned this:
>>>>
>>>> /**
>>>>  * A framework generated ID to distinguish a task. The ID must remain
>>>>  * unique while the task is active.
>>>>  * However, a framework can reuse an
>>>>  * ID _only_ if a previous task with the same ID has reached a
>>>>  * terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.).
>>>>  */
>>>> message TaskID {
>>>>   required string value = 1;
>>>> }
>>>> ---
>>>> Which makes me think that it's reasonable to just give this task the
>>>> same taskID, and that every time I bring it from a terminal status
>>>> back to running, I can reuse the same ID. This also gives me the
>>>> benefit of being able to more easily locate the task for a given
>>>> framework, and I'm able to exploit Mesos for some weak guarantees
>>>> that there won't be multiples of this task running (don't worry, they
>>>> lock in Zookeeper, and concurrent runs don't do anything, they just
>>>> fail).
>>>>
>>>> Opinions?
>>>
>>> ^^We love trees. Please don't print this e-mail unless necessary.
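Adam's recommendation from this thread — a fixed taskId prefix plus an incrementing instance/trial number so that each run of a logical task gets a unique ID — can be sketched like this (the prefix format is illustrative, not anything Mesos mandates):

```python
import itertools

def task_id_generator(prefix):
    """Yield unique task IDs sharing a fixed prefix: prefix-0, prefix-1, ...
    Each restart of the logical task gets a fresh TaskID, so agent
    recovery and master reconciliation never see two runs with the same
    ID, while the prefix still makes the task easy to locate."""
    for n in itertools.count():
        yield f"{prefix}-{n}"

# Example: a per-host storm-worker style prefix (illustrative naming).
ids = task_id_generator("monitor-host1-31000")
print(next(ids))  # monitor-host1-31000-0
print(next(ids))  # monitor-host1-31000-1
```

Since TaskIDs only need to be unique within a single frameworkId, the counter only has to be managed by the one framework scheduler that owns these tasks (e.g. persisted alongside its other state so it survives scheduler restarts).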