As someone who has been there and back again (reusing task IDs and realizing it's a terrible idea), I'd put some advice in the docs + mesos.proto to compose task IDs from GUIDs, and add that it's dangerous to reuse them.
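For example, a minimal C++ sketch, assuming a framework linked against libmesos and the bundled stout headers (makeTaskId and the naming scheme are illustrative, not an existing Mesos helper):

    // Compose a TaskID from a stable logical name plus a fresh UUID, so
    // no two launches ever share an ID, yet IDs stay greppable by prefix.
    #include <string>

    #include <mesos/mesos.hpp>  // generated TaskID protobuf
    #include <stout/uuid.hpp>   // UUID::random(), bundled with Mesos

    mesos::TaskID makeTaskId(const std::string& logicalName)
    {
      mesos::TaskID id;
      // e.g. "monitoring-service.6ba7b810-9dad-11d1-80b4-00c04fd430c8"
      id.set_value(logicalName + "." + UUID::random().toString());
      return id;
    }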
I would advocate for a mechanism to prevent the usage of non-unique IDs for executors, tasks, and frameworks, but I feel that's a more complex and larger problem.

On Mon, Feb 22, 2016 at 1:19 PM, Vinod Kone <[email protected]> wrote:
> I would vote for updating the comments in mesos.proto to warn users not to reuse task IDs, for now.
>
> On Sun, Feb 21, 2016 at 9:05 PM, Klaus Ma <[email protected]> wrote:
>>
>> Yes, it's dangerous to reuse a TaskID; there's a JIRA (MESOS-3070) where the master crashes on failover if it encounters a duplicated TaskID.
>>
>> Here's the case of MESOS-3070:
>> T1: launch task (t1) on agent (agent_1)
>> T2: master failover
>> T3: launch another task (t1) on agent (agent_2) before agent_1 re-registers
>> T4: agent_1 re-registers; the master crashes on a `CHECK` when adding task (t1) back to its state
>>
>> Is there any special case where a framework has to reuse a TaskID? If there is no such case, I think we should ask frameworks to avoid reusing TaskIDs.
>>
>> ----
>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>> Platform OpenSource Technology, STG, IBM GCG
>> +86-10-8245 4084 | [email protected] | http://k82.me
>>
>> On Mon, Feb 22, 2016 at 12:24 PM, Erik Weathers <[email protected]> wrote:
>>>
>>> tl;dr: Reusing TaskIDs clashes with the mesos-agent recovery feature.
>>>
>>> Adam Bordelon wrote:
>>> > Reusing taskIds may work if you're guaranteed to never be running two
>>> > instances of the same taskId simultaneously
>>>
>>> I've encountered another scenario where reusing TaskIDs is dangerous, even if you meet the guarantee of never running 2 task instances with the same TaskID simultaneously.
>>>
>>> Scenario leading to a problem:
>>>
>>> Say you have a task with ID "T1", which terminates for some reason, so its terminal status update gets recorded into the agent's current "run" in the task's updates file:
>>>
>>> MESOS_WORK_DIR/meta/slaves/latest/frameworks/FRAMEWORK_ID/executors/EXECUTOR_ID/runs/latest/tasks/T1/task.updates
>>>
>>> Then say a new task is launched with the same ID of T1, and it gets scheduled under the same executor on the same agent host. In that case, the task will reuse the same work_dir path, and thus inherit the already recorded "terminal status update" in its task.updates file. So the updates file has a stream of updates that might look like this:
>>>
>>> TASK_RUNNING
>>> TASK_FINISHED
>>> TASK_RUNNING
>>>
>>> Say you subsequently restart the mesos-slave/agent, expecting all tasks to survive the restart via the recovery process. Unfortunately, T1 is terminated, because the task recovery logic [1] looks at the current run's tasks' task.updates files, searching for tasks with "terminal status updates", and then terminates any such tasks. So even though T1 was actually running just fine, it gets terminated because at some point in its previous incarnation a "terminal status update" was recorded.
>>>
>>> Leads to inconsistent state
>>>
>>> Compounding the problem, this termination is done without informing the executor, so the process underlying the task continues to run even though Mesos thinks it's gone. That is really bad, since it leaves the host in a different state than Mesos believes it to be in. E.g., if the task held a port resource, Mesos incorrectly thinks the port is now free, so a framework might try to launch a task/executor that uses the port, but that will fail because the lingering process is still bound to the port.
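To make that failure mode concrete, here is an illustrative C++ sketch of the decision the recovery path effectively makes (the real logic lives in src/slave/slave.cpp [1] and operates on Mesos-internal types; wouldTerminateOnRecovery and isTerminal are made-up names):

    #include <vector>

    #include <mesos/mesos.hpp>  // mesos::TaskState enum

    bool isTerminal(mesos::TaskState state)
    {
      return state == mesos::TASK_FINISHED ||
             state == mesos::TASK_FAILED   ||
             state == mesos::TASK_KILLED   ||
             state == mesos::TASK_LOST;
    }

    // Replaying {TASK_RUNNING, TASK_FINISHED, TASK_RUNNING} returns true:
    // the recovered task T1 is terminated despite actually being alive,
    // because ANY terminal update in the stream counts, even when a later
    // TASK_RUNNING (from an ID-reusing second run) follows it.
    bool wouldTerminateOnRecovery(const std::vector<mesos::TaskState>& updates)
    {
      for (mesos::TaskState state : updates) {
        if (isTerminal(state)) {
          return true;  // first terminal update wins; later updates ignored
        }
      }
      return false;
    }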
>>> Change recovery code or just update comments in mesos.proto?
>>>
>>> Perhaps this behavior could be considered a "bug", and the recovery logic that processes task status updates could be modified to ignore "terminal status updates" when a subsequent TASK_RUNNING update exists in the task.updates file. If that sounds like a desirable change, I'm happy to file a JIRA issue for it and work on the fix myself.
>>>
>>> If we think the recovery logic is fine as it is, then we should update these comments [2] in mesos.proto, since they are incorrect given the behavior I just described:
>>>
>>>> A framework generated ID to distinguish a task. The ID must remain
>>>> unique while the task is active. However, a framework can reuse an
>>>> ID _only_ if a previous task with the same ID has reached a
>>>> terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.).
>>>
>>> Conclusion
>>>
>>> It is indeed dangerous to reuse a TaskID for separate task runs, even if they are guaranteed not to run concurrently.
>>>
>>> - Erik
>>>
>>> P.S. I encountered this problem while trying to use mesos-agent recovery with the storm-mesos framework [3]. Notably, this framework sets the TaskID to "<agenthostname>-<stormworkerport>" for the storm worker tasks, so when a storm worker dies and is reborn on that host, the TaskID gets reused. But then the task doesn't survive an agent restart (even though the worker *process* does survive, putting us in an inconsistent state!).
>>>
>>> P.P.S. Being able to enable verbose logging in the mesos-slave/agent with the GLOG_v=3 environment variable is *super* convenient! It would have taken me *way* longer to figure this out if the verbose logging didn't exist.
>>>
>>> P.P.P.S. To debug this, I wrote a tool [4] to decode length-prefixed protobuf [5] files, such as task.updates. Here's an example of invoking the tool (notably, it has the same syntax as "protoc --decode", but handles the length-prefix headers):
>>>
>>>     cat task.updates | \
>>>       protoc-decode-lenprefix \
>>>         --decode mesos.internal.StatusUpdateRecord \
>>>         -I MESOS_CODE/src -I MESOS_CODE/include \
>>>         MESOS_CODE/src/messages/messages.proto
>>>
>>> [1] https://github.com/apache/mesos/blob/0.27.0/src/slave/slave.cpp#L5701-L5708
>>> [2] https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66
>>> [3] https://github.com/mesos/storm
>>> [4] https://github.com/erikdw/protoc-decode-lenprefix
>>> [5] http://eli.thegreenplace.net/2011/08/02/length-prefix-framing-for-protocol-buffers
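A sketch of the change Erik proposes above, again illustrative only (it reuses isTerminal() from the previous sketch and is not the actual agent code): decide from the latest recorded update instead of terminating on the first terminal one.

    #include <vector>

    #include <mesos/mesos.hpp>

    // Proposed behavior: a subsequent TASK_RUNNING overrides an earlier
    // terminal update, so a task that reused an ID still survives recovery.
    bool wouldTerminateOnRecoveryFixed(
        const std::vector<mesos::TaskState>& updates)
    {
      // {TASK_RUNNING, TASK_FINISHED, TASK_RUNNING} -> false: T1 survives.
      return !updates.empty() && isTerminal(updates.back());
    }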
>>> On Sat, Jul 11, 2015 at 11:45 AM, CCAAT <[email protected]> wrote:
>>>>
>>>> I'd be most curious to see a working example of this idea, prefixes and all, for sleeping (long-term sleeping) nodes (slaves and masters).
>>>>
>>>> Anybody, do post what you have done or are doing with these TaskID-reuse and reservation experiments. Many are probably interested, for a variety of reasons, including but not limited to security, auditing, and node diversification.... My interests are in self-modifying code, which can be achieved while the nodes sleep, for some very interesting applications.
>>>>
>>>> James
>>>>
>>>> On 07/11/2015 06:01 AM, Adam Bordelon wrote:
>>>>>
>>>>> Reusing taskIds may work if you're guaranteed to never be running two instances of the same taskId simultaneously, but I could imagine a particularly dangerous scenario where a master and slave experience a network partition, so the master declares the slave lost (and therefore its tasks lost), and then the framework scheduler launches a new task with the same taskId. However, the task is still running on the original slave. When the slave reregisters and claims it is running that taskId, or that that taskId has completed, the Mesos master may have a difficult time reconciling which instance of the task is on which node and in which status, since it expects only one instance to exist at a time.
>>>>> You may be better off using a fixed taskId prefix and appending an incrementing instance/trial number, so that each run gets a unique ID.
>>>>> Also note that taskIds only need to be unique within a single frameworkId, so don't worry about conflicting with other frameworks.
>>>>> TL;DR: I wouldn't recommend it.
>>>>>
>>>>> On Fri, Jul 10, 2015 at 10:20 AM, Antonio Fernández <[email protected]> wrote:
>>>>>
>>>>>     Sounds risky. Every task should have its own unique ID; collisions could happen, along with unexpected issues.
>>>>>
>>>>>     I think it will be as hard to track whether you can start a task again as it is to have a mechanism for knowing its ID.
>>>>>
>>>>>> On 10 Jul 2015, at 19:14, Jie Yu <[email protected]> wrote:
>>>>>>
>>>>>> Reusing task IDs is definitely not encouraged. As far as I know, much of the Mesos code assumes task IDs are unique, so I probably wouldn't risk it.
>>>>>>
>>>>>> On Fri, Jul 10, 2015 at 10:06 AM, Sargun Dhillon <[email protected]> wrote:
>>>>>>
>>>>>> Is reusing task IDs good behaviour? Let's say that I have some singleton task - I'll call it a monitoring service. It's always going to be the same process, doing the same thing, and there will only ever be one around (per instance of a framework). Reading the protobuf doc, I learned this:
>>>>>>
>>>>>> /**
>>>>>>  * A framework generated ID to distinguish a task. The ID must remain
>>>>>>  * unique while the task is active. However, a framework can reuse an
>>>>>>  * ID _only_ if a previous task with the same ID has reached a
>>>>>>  * terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.).
>>>>>>  */
>>>>>> message TaskID {
>>>>>>   required string value = 1;
>>>>>> }
>>>>>> ---
>>>>>> Which makes me think that it's reasonable to just give this task the same TaskID, and that every time I bring it from a terminal status back to running, I can reuse the same ID. This also gives me the benefit of being able to more easily locate the task for a given framework, and I'm able to exploit Mesos for some weak guarantees that there won't be multiple of these running (don't worry, they lock in ZooKeeper, and concurrent runs don't do anything; they just fail).
>>>>>>
>>>>>> Opinions?
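Adam's fixed-prefix-plus-trial-number suggestion above, as a minimal C++ sketch (nextTaskId is hypothetical; note the trial counter would have to be persisted, or recovered via task reconciliation, across scheduler failovers to stay unique):

    #include <string>

    #include <mesos/mesos.hpp>

    // Unique per run: a stable, human-readable prefix plus an
    // incrementing trial number, e.g. "myhost-31000.0", "myhost-31000.1".
    mesos::TaskID nextTaskId(const std::string& prefix, int& trial)
    {
      mesos::TaskID id;
      id.set_value(prefix + "." + std::to_string(trial++));
      return id;
    }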

