Hi Justin,
I'm sorry that you've been having difficulty with your cluster. Do you have
access to master/agent logs around the time that these tasks went missing
from the Mesos UI? It would be great to have a look at those if possible.

I would still recommend against setting the agent work_dir to '/tmp/mesos'
for a long-running cluster scenario - this location is really only suitable
for local, short-term testing purposes. We currently have a patch in flight
to update our docs to clarify this point. Even though the work_dir appeared
to be intact when you checked it, it's possible that some of the agent's
checkpoint data had been deleted. Could you try changing the work_dir for
your agents to see if that helps?
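
A minimal sketch of how the configured work_dir on each agent could be checked for this problem, assuming that '/tmp' and '/var/tmp' are the locations most likely to be purged by tmp cleaners or reboots. The function name and the list of volatile roots are illustrative assumptions, not part of Mesos:

```python
import os

# Assumption: directories that tmp cleaners or reboots commonly purge.
# Adjust this tuple to match the cleanup policy on your hosts.
VOLATILE_ROOTS = ("/tmp", "/var/tmp")


def is_volatile_work_dir(work_dir, volatile_roots=VOLATILE_ROOTS):
    """Return True if work_dir is one of, or lies inside, a volatile root.

    Hypothetical helper: it only compares path strings and does not
    inspect the filesystem or the agent's checkpoint data.
    """
    path = os.path.normpath(os.path.abspath(work_dir))
    for root in volatile_roots:
        root = os.path.normpath(root)
        # Match the root itself or any path strictly below it.
        if path == root or path.startswith(root + os.sep):
            return True
    return False


if __name__ == "__main__":
    # '/var/lib/mesos' is only an example of a non-temporary location.
    for candidate in ("/tmp/mesos", "/var/lib/mesos"):
        print(candidate, "volatile:", is_volatile_work_dir(candidate))
```

This only flags a risky location. It cannot tell whether checkpoint data has already been deleted, so the agent logs are still the best evidence.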

Cheers,
Greg


On Wed, Apr 6, 2016 at 11:27 AM, Justin Ryan <[email protected]> wrote:

> Thanks Rik – Interesting theory, I considered that it might have some
> connection to the removal of sandbox files.
>
> Sooo this morning I had all of my kafka brokers disappear again, and
> checked this on a node that is definitely still running kafka.  All of
> /tmp/mesos, including what appear to be the sandbox and logs of the running
> process, are still there, and the “running” count this time is actually
> higher than I’d expect.  I had 9 kafka brokers and 3 flume processes
> running, and the running count currently says 15.
>
> From: <[email protected]> on behalf of Rik <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, April 5, 2016 at 3:19 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Disappearing tasks
>
> FWIW, the only time I've seen this happen here is when someone
> accidentally clears the work dir (default=/tmp/mesos), which I personally
> would advise to put somewhere else where rogue people or processes are less
> likely to throw things away accidentally. Could it be that? Although...
> tasks were 'lost' at that point, so it differs slightly (same general
> outcome, not entirely the same symptoms).
>
> On Tue, Apr 5, 2016 at 11:35 PM, Justin Ryan <[email protected]> wrote:
>
>> An interesting fact I left out: the count of “Running” tasks remains
>> intact, while absolutely no history remains in the dashboard.
>>
>>
>>
>> From: Justin Ryan <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, April 5, 2016 at 12:29 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Disappearing tasks
>>
>> Hiya folks!
>>
>> I’ve spent the past few weeks prototyping a new data cluster with Mesos,
>> Kafka, and Flume delivering data to HDFS which we plan to interact with via
>> Spark.  In the prototype environment, I had a fairly high volume of test
>> data flowing for some weeks with little to no major issues except for
>> learning about tuning Kafka and Flume.
>>
>> I’m launching kafka with the github.com/mesos/kafka project, and flume
>> is run via marathon.
>>
>> Yesterday morning, I came in and my flume jobs had disappeared from the
>> task list in Mesos, though I found the actual processes still running when
>> I searched the cluster ‘ps’ output.  Later in the day, I had the same
>> happen to my kafka brokers.  In some cases, the only way I’ve found to
>> recover from this is to shut everything down and clear the zookeeper data,
>> which would be fairly drastic if it happened in production, and
>> particularly if we had many tasks / frameworks that were fine, but one or
>> two disappeared.
>>
>> I’d appreciate any help sorting through this. I’m using the latest Mesos and
>> CDH5 installed via community Chef cookbooks.
>>
>>
