Hi Justin,

I'm sorry that you've been having difficulty with your cluster. Do you have access to master/agent logs around the time that these tasks went missing from the Mesos UI? It would be great to have a look at those if possible.

I would still recommend against setting the agent work_dir to '/tmp/mesos' for a long-running cluster scenario - this location is really only suitable for local, short-term testing purposes. We currently have a patch in flight to update our docs to clarify this point. Even though the work_dir appeared to be intact when you checked it, it's possible that some of the agent's checkpoint data had been deleted. Could you try changing the work_dir for your agents to see if that helps?

Cheers,
Greg

On Wed, Apr 6, 2016 at 11:27 AM, Justin Ryan <[email protected]> wrote:

> Thanks Rik – Interesting theory, I considered that it might have some
> connection to the removal of sandbox files.
>
> Sooo this morning I had all of my kafka brokers disappear again, and
> checked this on a node that is definitely still running kafka. All of
> /tmp/mesos, including what appear to be the sandbox and logs of the running
> process, are still there, and the “running” count this time is actually
> higher than I’d expect. I had 9 kafka brokers and 3 flume processes
> running, and the running count currently says 15.
>
> From: <[email protected]> on behalf of Rik <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, April 5, 2016 at 3:19 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Disappearing tasks
>
> FWIW, the only time I've seen this happen here is when someone
> accidentally clears the work dir (default=/tmp/mesos), which I personally
> would advise to put somewhere else where rogue people or processes are less
> likely to throw things away accidentally. Could it be that? Although...
> tasks were 'lost' at that point, so it differs slightly (same general
> outcome, not entirely the same symptoms).
>
> On Tue, Apr 5, 2016 at 11:35 PM, Justin Ryan <[email protected]> wrote:
>
>> An interesting fact I left out, the count of “Running” tasks remains
>> intact, while absolutely no history remains in the dashboard.
>>
>> From: Justin Ryan <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, April 5, 2016 at 12:29 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Disappearing tasks
>>
>> Hiya folks!
>>
>> I’ve spent the past few weeks prototyping a new data cluster with Mesos,
>> Kafka, and Flume delivering data to HDFS which we plan to interact with via
>> Spark. In the prototype environment, I had a fairly high volume of test
>> data flowing for some weeks with little to no major issues except for
>> learning about tuning Kafka and Flume.
>>
>> I’m launching kafka with the github.com/mesos/kafka project, and flume
>> is run via marathon.
>>
>> Yesterday morning, I came in and my flume jobs had disappeared from the
>> task list in Mesos, though I found the actual processes still running when
>> I searched the cluster ’ps’ output. Later in the day, I had the same
>> happen to my kafka brokers. In some cases, the only way I’ve found to
>> recover from this is to shut everything down and clear the zookeeper data,
>> which would be fairly drastic if it happened in production, and
>> particularly if we had many tasks / frameworks that were fine, but one or
>> two disappeared.
>>
>> I’d appreciate any help sorting through this, I’m using latest Mesos and
>> CDH5 installed via community Chef cookbooks.
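
As a rough illustration of the work_dir advice in this thread, here is a minimal sketch of a preflight check that flags an agent work_dir located under a temporary directory. The function name, the list of "volatile" roots, and the example path /var/lib/mesos are assumptions made for illustration, not anything Mesos itself provides; the check only inspects the path string and does not touch the filesystem.

```python
from pathlib import PurePosixPath

# Assumption: directories that system cleanup jobs or people may clear.
# Adjust this list for your own distribution and policies.
VOLATILE_ROOTS = ("/tmp", "/var/tmp")


def is_volatile_work_dir(work_dir: str) -> bool:
    """Return True if work_dir is a volatile root or lies beneath one."""
    path = PurePosixPath(work_dir)
    for root in VOLATILE_ROOTS:
        root_path = PurePosixPath(root)
        # Compare whole path components, so "/tmpfoo" does not match "/tmp".
        if path == root_path or root_path in path.parents:
            return True
    return False


if __name__ == "__main__":
    # "/var/lib/mesos" is only an example of a non-temporary location.
    for candidate in ("/tmp/mesos", "/var/lib/mesos"):
        if is_volatile_work_dir(candidate):
            print(f"{candidate}: under a temporary directory, avoid for long-running agents")
        else:
            print(f"{candidate}: not under a known temporary directory")
```

A check like this could run in provisioning (for example, before a cookbook writes the agent configuration) so that a work_dir under /tmp is caught before checkpoint data can be lost to cleanup.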

