Hiya, coming back to this thread after having to focus on some other things 
(and facing some issues I brought up in another thread).

I reconfigured this cluster with work_dir set to /var/mesos, and I'm logging the 
output of 'mesos ps' (from the Python mesos.cli package) in a loop to try to 
catch the next occurrence.
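
For reference, the loop is roughly the following (a minimal sketch; the 
one-minute interval and the log path are only illustrative, and it assumes the 
'mesos' command from the mesos.cli package is on the PATH):

    import subprocess
    import time
    from datetime import datetime

    # Where to append the timestamped 'mesos ps' snapshots (illustrative path).
    LOG_PATH = "/var/log/mesos-ps-watch.log"
    INTERVAL_SECONDS = 60  # illustrative polling interval

    while True:
        try:
            # Shell out to the 'mesos ps' command from the mesos.cli package.
            output = subprocess.check_output(["mesos", "ps"],
                                             stderr=subprocess.STDOUT)
        except subprocess.CalledProcessError as err:
            output = err.output  # keep error output too; it may be the interesting bit
        if isinstance(output, bytes):
            output = output.decode("utf-8", "replace")
        with open(LOG_PATH, "a") as log:
            log.write("=== %s ===\n%s\n" % (datetime.utcnow().isoformat(), output))
        time.sleep(INTERVAL_SECONDS)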

Still, what seems most interesting to me is that the “Running” count remembers 
the lost processes.  Even now, having launched 3 new instances of flume from 
marathon, the running count is 6.  The Killed count shows recently killed 
tasks, but it was at 0 earlier, when I had 3 processes running that mesos had 
lost.


From: Greg Mann <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, April 6, 2016 at 4:24 PM
To: user <[email protected]>
Subject: Re: Disappearing tasks

Hi Justin,
I'm sorry that you've been having difficulty with your cluster. Do you have 
access to master/agent logs around the time that these tasks went missing from 
the Mesos UI? It would be great to have a look at those if possible.

I would still recommend against setting the agent work_dir to '/tmp/mesos' for 
a long-running cluster scenario - this location is really only suitable for 
local, short-term testing purposes. We currently have a patch in flight to 
update our docs to clarify this point. Even though the work_dir appeared to be 
intact when you checked it, it's possible that some of the agent's checkpoint 
data had been deleted. Could you try changing the work_dir for your agents to 
see if that helps?
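
If it's useful, here is a rough sketch of how you might check whether an 
agent's checkpoint data is still present under its work_dir (this assumes the 
meta/slaves/latest layout the agent uses for recovery, and the /var/lib/mesos 
path below is just an example):

    import os

    WORK_DIR = "/var/lib/mesos"  # example only; substitute your agents' work_dir setting
    latest = os.path.join(WORK_DIR, "meta", "slaves", "latest")

    if not os.path.exists(latest):
        # The recovery symlink is gone, so the agent has nothing to recover from.
        print("No checkpoint data found at %s" % latest)
    else:
        frameworks = os.path.join(os.path.realpath(latest), "frameworks")
        if os.path.isdir(frameworks):
            for framework_id in sorted(os.listdir(frameworks)):
                print("checkpointed framework: %s" % framework_id)
        else:
            print("latest symlink exists, but no framework checkpoints are present")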

Cheers,
Greg


On Wed, Apr 6, 2016 at 11:27 AM, Justin Ryan <[email protected]> wrote:
Thanks Rik – interesting theory; I had considered that it might have some 
connection to the removal of sandbox files.

Sooo this morning I had all of my kafka brokers disappear again, and I checked 
this on a node that is definitely still running kafka.  All of /tmp/mesos, 
including what appear to be the sandbox and logs of the running process, is 
still there, and the “running” count this time is actually higher than I’d 
expect: I had 9 kafka brokers and 3 flume processes running, and the running 
count currently says 15.

From: <[email protected]<mailto:[email protected]>> on behalf of 
Rik <[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Date: Tuesday, April 5, 2016 at 3:19 PM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: Disappearing tasks

FWIW, the only time I've seen this happen here is when someone accidentally 
clears the work dir (default=/tmp/mesos), which I would personally advise 
putting somewhere else, where rogue people or processes are less likely to 
throw things away accidentally. Could it be that? Although... tasks were 'lost' 
at that point, so it differs slightly (same general outcome, not entirely the 
same symptoms).

On Tue, Apr 5, 2016 at 11:35 PM, Justin Ryan <[email protected]> wrote:
An interesting fact I left out: the count of “Running” tasks remains intact, 
while absolutely no history remains in the dashboard.



From: Justin Ryan <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, April 5, 2016 at 12:29 PM
To: "[email protected]" <[email protected]>
Subject: Disappearing tasks

Hiya folks!

I’ve spent the past few weeks prototyping a new data cluster with Mesos, Kafka, 
and Flume delivering data to HDFS, which we plan to interact with via Spark.  In 
the prototype environment, I had a fairly high volume of test data flowing for 
some weeks with few major issues, apart from learning how to tune Kafka and 
Flume.

I’m launching kafka with the github.com/mesos/kafka project, and flume is run 
via marathon.

Yesterday morning I came in and my flume jobs had disappeared from the task 
list in Mesos, though I found the actual processes still running when I 
searched the cluster’s ’ps’ output.  Later in the day, the same thing happened 
to my kafka brokers.  In some cases, the only way I’ve found to recover from 
this is to shut everything down and clear the zookeeper data, which would be 
fairly drastic if it happened in production, particularly if we had many tasks 
/ frameworks that were fine but one or two disappeared.

I’d appreciate any help sorting through this.  I’m using the latest Mesos and 
CDH5, installed via community Chef cookbooks.


