Thanks for the responses. Filed a ticket for this:
https://issues.apache.org/jira/browse/MESOS-4737

- Erik

On Mon, Feb 22, 2016 at 1:23 PM, Sargun Dhillon <[email protected]> wrote:
> As someone who has been there and back again (reusing task IDs, and
> realizing it's a terrible idea), I'd put some advice in the docs and
> mesos.proto to compose task IDs from GUIDs, and note that it's
> dangerous to reuse them.
>
> I would advocate for a mechanism to prevent the use of non-unique
> IDs for executors, tasks, and frameworks, but I feel that's a more
> complex and larger problem.
>
> On Mon, Feb 22, 2016 at 1:19 PM, Vinod Kone <[email protected]> wrote:
>> I would vote for updating the comments in mesos.proto to warn users
>> not to reuse task IDs, for now.
>>
>> On Sun, Feb 21, 2016 at 9:05 PM, Klaus Ma <[email protected]> wrote:
>>> Yes, it's dangerous to reuse a TaskID; there's a JIRA (MESOS-3070)
>>> where the master will crash during failover if it sees a duplicated
>>> TaskID.
>>>
>>> Here's the case from MESOS-3070:
>>> T1: launch task (t1) on agent (agent_1)
>>> T2: master failover
>>> T3: launch another task (t1) on agent (agent_2) before agent_1
>>>     re-registers
>>> T4: agent_1 re-registers; the master crashes on a `CHECK` when
>>>     adding task (t1) back to its state
>>>
>>> Is there any special case where a framework has to reuse a TaskID?
>>> If not, I think we should ask frameworks to avoid reusing TaskIDs.
>>>
>>> ----
>>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>>> Platform OpenSource Technology, STG, IBM GCG
>>> +86-10-8245 4084 | [email protected] | http://k82.me
>>>
>>> On Mon, Feb 22, 2016 at 12:24 PM, Erik Weathers <[email protected]> wrote:
>>>> tl;dr: Reusing TaskIDs clashes with the mesos-agent recovery feature.
>>>>
>>>> Adam Bordelon wrote:
>>>> > Reusing taskIds may work if you're guaranteed to never be running
>>>> > two instances of the same taskId simultaneously
>>>>
>>>> I've encountered another scenario where reusing TaskIDs is
>>>> dangerous, even if you meet the guarantee of never running two task
>>>> instances with the same TaskID simultaneously.
>>>>
>>>> Scenario leading to a problem:
>>>>
>>>> Say you have a task with ID "T1", which terminates for some reason,
>>>> so its terminal status update gets recorded into the agent's current
>>>> "run" in the task's updates file:
>>>>
>>>> MESOS_WORK_DIR/meta/slaves/latest/frameworks/FRAMEWORK_ID/executors/EXECUTOR_ID/runs/latest/tasks/T1/task.updates
>>>>
>>>> Then say a new task is launched with the same ID of T1, and it gets
>>>> scheduled under the same executor on the same agent host. In that
>>>> case, the task reuses the same work_dir path, and thus inherits the
>>>> already recorded "terminal status update" in its task.updates file.
>>>> So the updates file holds a stream of updates that might look like
>>>> this:
>>>>
>>>> TASK_RUNNING
>>>> TASK_FINISHED
>>>> TASK_RUNNING
>>>>
>>>> Say you subsequently restart the mesos-slave/agent, expecting all
>>>> tasks to survive the restart via the recovery process.
>>>> Unfortunately, T1 is terminated, because the task recovery logic [1]
>>>> looks at the current run's tasks' task.updates files, searching for
>>>> tasks with "terminal status updates", and then terminates any such
>>>> tasks. So, even though T1 was actually running just fine, it gets
>>>> terminated because at some point in its previous incarnation it had
>>>> a "terminal status update" recorded.
>>>>
>>>> Leads to inconsistent state
>>>>
>>>> Compounding the problem, this termination is done without informing
>>>> the executor, so the process underlying the task continues to run
>>>> even though Mesos thinks it's gone. That is really bad, since it
>>>> leaves the host in a different state than Mesos believes exists.
>>>> E.g., if the task held a port resource, Mesos incorrectly thinks the
>>>> port is now free, so a framework might try to launch a task/executor
>>>> that uses the port, but it will fail because the process cannot bind
>>>> to the port.
>>>>
>>>> Change recovery code or just update comments in mesos.proto?
>>>>
>>>> Perhaps this behavior could be considered a "bug", and the recovery
>>>> logic that processes task status updates could be modified to ignore
>>>> "terminal status updates" if there is a subsequent TASK_RUNNING
>>>> update in the task.updates file. If that sounds like a desirable
>>>> change, I'm happy to file a JIRA issue for it and work on the fix
>>>> myself.
>>>>
>>>> If we think the recovery logic is fine as it is, then we should
>>>> update these comments [2] in mesos.proto, since they are incorrect
>>>> given the behavior I just encountered:
>>>>
>>>> > A framework generated ID to distinguish a task. The ID must remain
>>>> > unique while the task is active. However, a framework can reuse an
>>>> > ID _only_ if a previous task with the same ID has reached a
>>>> > terminal state (e.g., TASK_FINISHED, TASK_LOST, TASK_KILLED, etc.).
>>>>
>>>> Conclusion
>>>>
>>>> It is dangerous indeed to reuse a TaskID for separate task runs,
>>>> even if they are guaranteed not to run concurrently.
>>>>
>>>> - Erik
>>>>
>>>> P.S.: I encountered this problem while trying to use mesos-agent
>>>> recovery with the storm-mesos framework [3]. Notably, this framework
>>>> sets the TaskID to "<agenthostname>-<stormworkerport>" for the storm
>>>> worker tasks, so when a storm worker dies and is reborn on a host,
>>>> the TaskID gets reused. But then the task doesn't survive an agent
>>>> restart (even though the worker *process* does survive, putting us
>>>> in an inconsistent state!).
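The recovery change proposed above can be illustrated with a small sketch. This is a hypothetical Python illustration of the decision rule only, not the actual C++ recovery code in slave.cpp: instead of terminating a task when *any* terminal update appears in the replayed task.updates stream, terminate it only when the *latest* update is terminal.

```python
# Hypothetical sketch (not Mesos source code) contrasting the current
# recovery rule with the proposed one, using status-update names as
# plain strings.

TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

def is_terminated_current(updates):
    """Current behavior: any terminal update ever recorded terminates the task."""
    return any(u in TERMINAL_STATES for u in updates)

def is_terminated_proposed(updates):
    """Proposed behavior: only the most recent update decides."""
    return bool(updates) and updates[-1] in TERMINAL_STATES

# The replayed stream from the scenario above (TaskID reused under the
# same executor on the same agent):
stream = ["TASK_RUNNING", "TASK_FINISHED", "TASK_RUNNING"]

print(is_terminated_current(stream))   # True  -> task wrongly terminated
print(is_terminated_proposed(stream))  # False -> task survives recovery
```

Under the proposed rule, the stale TASK_FINISHED in the middle of the stream no longer kills a task whose latest recorded state is TASK_RUNNING.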
>>>>
>>>> P.P.S.: Being able to enable verbose logging in mesos-slave/agent
>>>> with the GLOG_v=3 environment variable is *super* convenient! It
>>>> would have taken me *way* longer to figure this out if the verbose
>>>> logging didn't exist.
>>>>
>>>> P.P.P.S.: To debug this, I wrote a tool [4] to decode
>>>> length-prefixed protobuf [5] files, such as task.updates. Here's an
>>>> example of invoking the tool (notably, it has the same syntax as
>>>> "protoc --decode", but handles the length-prefix headers):
>>>>
>>>> cat task.updates | \
>>>>   protoc-decode-lenprefix \
>>>>     --decode mesos.internal.StatusUpdateRecord \
>>>>     -I MESOS_CODE/src -I MESOS_CODE/include \
>>>>     MESOS_CODE/src/messages/messages.proto
>>>>
>>>> [1] https://github.com/apache/mesos/blob/0.27.0/src/slave/slave.cpp#L5701-L5708
>>>> [2] https://github.com/apache/mesos/blob/0.27.0/include/mesos/mesos.proto#L63-L66
>>>> [3] https://github.com/mesos/storm
>>>> [4] https://github.com/erikdw/protoc-decode-lenprefix
>>>> [5] http://eli.thegreenplace.net/2011/08/02/length-prefix-framing-for-protocol-buffers
>>>>
>>>> On Sat, Jul 11, 2015 at 11:45 AM, CCAAT <[email protected]> wrote:
>>>>> I'd be most curious to see a working example of this idea, prefixes
>>>>> and all, for sleeping (long-term sleeping) nodes (slaves and
>>>>> masters).
>>>>>
>>>>> Anybody, do post what you have done or are doing with these
>>>>> task-ID reuse and reservation experiments. Many are probably
>>>>> interested, for a variety of reasons including but not limited to
>>>>> security, auditing, and node diversification.... My interests are
>>>>> in self-modifying code, which can be run while the nodes sleep, for
>>>>> some very interesting applications.
>>>>>
>>>>> James
>>>>>
>>>>> On 07/11/2015 06:01 AM, Adam Bordelon wrote:
>>>>>> Reusing taskIds may work if you're guaranteed to never be running
>>>>>> two instances of the same taskId simultaneously, but I could
>>>>>> imagine a particularly dangerous scenario where a master and slave
>>>>>> experience a network partition, so the master declares the slave
>>>>>> lost, and therefore its tasks lost, and then the framework
>>>>>> scheduler launches a new task with the same taskId. However, the
>>>>>> task is still running on the original slave. When the slave
>>>>>> reregisters and claims it is running that taskId, or that that
>>>>>> taskId has completed, the Mesos master may have a difficult time
>>>>>> reconciling which instance of the task is on which node and in
>>>>>> which status, since it expects only one instance to exist at a
>>>>>> time.
>>>>>> You may be better off using a fixed taskId prefix and appending an
>>>>>> incrementing instance/trial number, so that each run gets a unique
>>>>>> ID. Also note that taskIds only need to be unique within a single
>>>>>> frameworkId, so don't worry about conflicting with other
>>>>>> frameworks.
>>>>>> TL;DR: I wouldn't recommend it.
>>>>>>
>>>>>> On Fri, Jul 10, 2015 at 10:20 AM, Antonio Fernández
>>>>>> <[email protected]> wrote:
>>>>>>> Sounds risky. Every task should have its own unique ID;
>>>>>>> collisions could happen, along with unexpected issues.
>>>>>>>
>>>>>>> I think it will be as hard to monitor when you can start a task
>>>>>>> again as to have a mechanism for knowing its ID.
>>>>>>>
>>>>>>> On 10 Jul 2015, at 19:14, Jie Yu <[email protected]> wrote:
>>>>>>>> Reusing Task IDs is definitely not encouraged. As far as I know,
>>>>>>>> much of the Mesos code assumes Task IDs are unique. So I
>>>>>>>> probably wouldn't risk it.
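Both remedies suggested in the thread, Sargun's GUID-composed task IDs (near the top of the thread) and Adam's fixed prefix with an incrementing instance number, can be sketched as below. The helper names and the prefix format are illustrative, not anything Mesos prescribes; Mesos only requires the TaskID string to be unique within a framework.

```python
import itertools
import uuid

# Option 1: GUID-based IDs, per Sargun's suggestion -- collisions are
# practically impossible, even across scheduler restarts.
def guid_task_id(prefix):
    return "%s-%s" % (prefix, uuid.uuid4())

# Option 2: fixed prefix plus an incrementing instance number, per
# Adam's suggestion -- each rerun of the "same" logical task gets a
# fresh ID while staying easy to locate by prefix.
_counter = itertools.count()

def sequential_task_id(prefix):
    return "%s-%d" % (prefix, next(_counter))

# E.g., the storm-mesos "<agenthostname>-<stormworkerport>" scheme from
# Erik's P.S. would become unique per run:
a = sequential_task_id("host1-31000")
b = sequential_task_id("host1-31000")
assert a != b  # reruns of the same logical task never collide
```

Either scheme avoids the agent-recovery and master-failover hazards described above, because a rerun never shares a work_dir path or a master-side key with its previous incarnation.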
>>>>>>>>
>>>>>>>> On Fri, Jul 10, 2015 at 10:06 AM, Sargun Dhillon <[email protected]> wrote:
>>>>>>>>> Is reusing Task IDs good behaviour? Let's say that I have some
>>>>>>>>> singleton task -- I'll call it a monitoring service. It's
>>>>>>>>> always going to be the same process, doing the same thing, and
>>>>>>>>> there will only ever be one around (per instance of a
>>>>>>>>> framework). Reading the protobuf doc, I learned this:
>>>>>>>>>
>>>>>>>>> /**
>>>>>>>>>  * A framework generated ID to distinguish a task. The ID must
>>>>>>>>>  * remain unique while the task is active. However, a framework
>>>>>>>>>  * can reuse an ID _only_ if a previous task with the same ID
>>>>>>>>>  * has reached a terminal state (e.g., TASK_FINISHED,
>>>>>>>>>  * TASK_LOST, TASK_KILLED, etc.).
>>>>>>>>>  */
>>>>>>>>> message TaskID {
>>>>>>>>>   required string value = 1;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Which makes me think that it's reasonable to just give this
>>>>>>>>> task the same TaskID, and that every time I bring it from a
>>>>>>>>> terminal status back to running, I can reuse the same ID. This
>>>>>>>>> also gives me the benefit of being able to more easily locate
>>>>>>>>> the task for a given framework, and I'm able to exploit Mesos
>>>>>>>>> for some weak guarantees that there won't be multiples of these
>>>>>>>>> running (don't worry, they lock in ZooKeeper, and concurrent
>>>>>>>>> runs don't do anything; they just fail).
>>>>>>>>>
>>>>>>>>> Opinions?
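For reference, the length-prefix framing that Erik's P.P.P.S. decoder tool handles can be read along these lines. This is a minimal Python sketch that assumes a little-endian 32-bit byte count before each record; the exact on-disk framing Mesos uses may differ, so treat this as an illustration of the framing idea from link [5], not as a drop-in task.updates parser.

```python
import struct

# Sketch of length-prefix framing: each record (e.g., a serialized
# protobuf message) is preceded by a little-endian uint32 byte count.
# (Assumed header format -- Mesos's actual framing may differ.)

def write_records(records):
    """Frame a list of byte strings into one length-prefixed blob."""
    return b"".join(struct.pack("<I", len(r)) + r for r in records)

def read_records(data):
    """Split a length-prefixed blob back into its records."""
    records = []
    offset = 0
    while offset + 4 <= len(data):
        (size,) = struct.unpack_from("<I", data, offset)
        offset += 4
        records.append(data[offset:offset + size])
        offset += size
    return records

framed = write_records([b"TASK_RUNNING", b"TASK_FINISHED"])
assert read_records(framed) == [b"TASK_RUNNING", b"TASK_FINISHED"]
```

A real decoder like protoc-decode-lenprefix would additionally parse each recovered record as a protobuf message (e.g., mesos.internal.StatusUpdateRecord) before printing it.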

