Re: tasks won't run on mesos when using fine grained

2015-06-16 Thread Gary Ogden
On the master node, I see this printed over and over in the
mesos-master.WARNING log file:
W0615 06:06:51.211262  8672 hierarchical_allocator_process.hpp:589] Using
the default value of 'refuse_seconds' to create the refused resources
filter because the input value is negative

Here's what I see in the master INFO file:
I0616 12:10:55.040024  8674 http.cpp:478] HTTP request for
'/master/state.json'
I0616 12:10:55.425833  8669 master.cpp:3843] Sending 1 offers to framework
20150511-140547-189138442-5051-8667-0831 (Savings) at
scheduler-5a5e99d4-5e16-4a48-94d5-86f751615a04@10.6.71.203:47979
I0616 12:10:55.438303  8669 master.cpp:3843] Sending 1 offers to framework
20150304-134212-222692874-5051-2300-0054
(chronos-2.3.2_mesos-0.20.1-SNAPSHOT) at
scheduler-c8f2acc2-d16e-44d5-b54f-7f88d3ab39a2@10.6.70.11:57549
I0616 12:10:55.441295  8669 master.cpp:3843] Sending 1 offers to framework
20150511-140547-189138442-5051-8667-0838 (Savings) at
scheduler-8b4389df-109e-49f5-8064-dd263fbec9fe@10.6.71.202:53346
I0616 12:10:55.442204  8669 master.cpp:2344] Processing reply for offers: [
20150511-140547-189138442-5051-8667-O9282037 ] on slave
20150511-140547-189138442-5051-8667-S4 at slave(1)@10.6.71.203:5151
(secasdb01-2) for framework 20150511-140547-189138442-5051-8667-0831
(Savings) at
scheduler-5a5e99d4-5e16-4a48-94d5-86f751615a04@10.6.71.203:47979
I0616 12:10:55.443111  8669 master.cpp:2344] Processing reply for offers: [
20150511-140547-189138442-5051-8667-O9282038 ] on slave
20150304-134111-205915658-5051-1595-S0 at slave(1)@10.6.71.206:5151
(secasdb01-3) for framework 20150304-134212-222692874-5051-2300-0054
(chronos-2.3.2_mesos-0.20.1-SNAPSHOT) at
scheduler-c8f2acc2-d16e-44d5-b54f-7f88d3ab39a2@10.6.70.11:57549
I0616 12:10:55.444875  8671 hierarchical_allocator_process.hpp:563]
Recovered mem(*):5305; disk(*):4744; ports(*):[25001-3] (total
allocatable: mem(*):5305; disk(*):4744; ports(*):[25001-3]) on slave
20150511-140547-189138442-5051-8667-S4 from framework
20150511-140547-189138442-5051-8667-0831
I0616 12:10:55.445121  8669 master.cpp:2344] Processing reply for offers: [
20150511-140547-189138442-5051-8667-O9282039 ] on slave
20150511-140547-189138442-5051-8667-S2 at slave(1)@10.6.71.202:5151
(secasdb01-1) for framework 20150511-140547-189138442-5051-8667-0838
(Savings) at
scheduler-8b4389df-109e-49f5-8064-dd263fbec9fe@10.6.71.202:53346
I0616 12:10:55.445971  8670 hierarchical_allocator_process.hpp:563]
Recovered mem(*):6329; disk(*):5000; ports(*):[25001-3] (total
allocatable: mem(*):6329; disk(*):5000; ports(*):[25001-3]) on slave
20150304-134111-205915658-5051-1595-S0 from framework
20150304-134212-222692874-5051-2300-0054
I0616 12:10:55.446185  8674 hierarchical_allocator_process.hpp:563]
Recovered mem(*):4672; disk(*):4488; ports(*):[25001-25667, 25669-3]
(total allocatable: mem(*):4672; disk(*):4488; ports(*):[25001-25667,
25669-3]) on slave 20150511-140547-189138442-5051-8667-S2 from
framework 20150511-140547-189138442-5051-8667-0838

There are two Savings jobs and one Weather job, and they're all hung right now
(all started from Chronos).


Here's what the frameworks tab looks like in mesos:
ID              | Host             | User  | Name                                | Active Tasks | CPUs | Mem    | Max Share | Registered   | Re-Registered
…5051-8667-0840 | secasdb01-1      | mesos | Weather                             | 0            | 0    | 0 B    | 0%        | 4 hours ago  | -
…5051-8667-0838 | secasdb01-1      | mesos | Savings                             | 0            | 0    | 0 B    | 0%        | 4 hours ago  | -
…5051-8667-0831 | secasdb01-2      | mesos | Savings                             | 0            | 0    | 0 B    | 0%        | 7 hours ago  | -
…5051-8667-0804 | secasdb01-1      | mesos | AlertConsumer                       | 1            | 3    | 1.0 GB | 50%       | 20 hours ago | -
…5051-2300-0090 | intMesosMaster02 | mesos | marathon                            | 1            | 0.5  | 128 MB | 8.333%    | a month ago  | a month ago
…5051-2300-0054 | intMesosMaster01 | root  | chronos-2.3.2_mesos-0.20.1-SNAPSHOT | 3            | 2.5  | 3.0 GB | 41.667%   | a month ago  | a month ago

Framework links:
http://intmesosmaster01:5051/#/frameworks/20150511-140547-189138442-5051-8667-0840
http://intmesosmaster01:5051/#/frameworks/20150511-140547-189138442-5051-8667-0838
http://intmesosmaster01:5051/#/frameworks/20150511-140547-189138442-5051-8667-0831
http://intmesosmaster01:5051/#/frameworks/20150511-140547-189138442-5051-8667-0804
http://intmesosmaster01:5051/#/frameworks/20150304-134212-222692874-5051-2300-0090
http://intmesosmaster01:5051/#/frameworks/20150304-134212-222692874-5051-2300-0054
It seems the Chronos framework has reserved all the remaining CPU in the
cluster but not given it to the jobs that need it (Savings and Weather).
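
One way to check that suspicion is to pull the master's /master/state.json (the INFO log above already shows HTTP requests for it) and sum the CPUs allocated to each framework. Here is a minimal sketch; the field names follow the Mesos state.json schema, and the sample data is illustrative, not captured from this cluster:

```python
# Sketch: given a dict shaped like Mesos's /master/state.json response,
# report the CPUs currently allocated to each framework. In a live
# cluster you would fetch the JSON with urllib from the master's port.
def cpu_by_framework(state):
    # Each framework entry carries a "resources" dict with its current
    # allocation (cpus, mem, disk).
    return {fw["name"]: fw["resources"]["cpus"] for fw in state["frameworks"]}

# Illustrative sample mirroring the frameworks-tab numbers above.
sample_state = {
    "frameworks": [
        {"name": "chronos-2.3.2_mesos-0.20.1-SNAPSHOT", "resources": {"cpus": 2.5}},
        {"name": "marathon", "resources": {"cpus": 0.5}},
        {"name": "Savings", "resources": {"cpus": 0.0}},
    ]
}

usage = cpu_by_framework(sample_state)
# A framework stuck at 0 CPUs here matches the "all zeros" symptom
# visible in the frameworks tab.
print(usage)
```

If Chronos's allocation plus the other frameworks' adds up to the whole cluster, no offer is left large enough for the hung jobs.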

AlertConsumer is a marathon job that's always running and is working fine.

On 16 June 2015 at 04:32, Akhil Das ak...@sigmoidanalytics.com wrote:

 Did you look inside all logs? Mesos logs and executor logs?

 Thanks
 Best Regards

 On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden gog...@gmail.com wrote:

 My Mesos cluster has 1.5 CPU and 17GB free.  If I set:

 conf.set("spark.mesos.coarse", "true");
 conf.set("spark.cores.max", "1");

 in the SparkConf object, the job will run in the mesos cluster fine.

 But if I comment out those settings above so that it defaults to fine
 grained, the task never finishes. It just shows as 0 for everything in the
 mesos frameworks (# of tasks, cpu, memory are all 0).

Re: tasks won't run on mesos when using fine grained

2015-06-16 Thread Akhil Das
Did you look inside all logs? Mesos logs and executor logs?

Thanks
Best Regards

On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden gog...@gmail.com wrote:

 My Mesos cluster has 1.5 CPU and 17GB free.  If I set:

 conf.set("spark.mesos.coarse", "true");
 conf.set("spark.cores.max", "1");

 in the SparkConf object, the job will run in the mesos cluster fine.

 But if I comment out those settings above so that it defaults to fine
 grained, the task never finishes. It just shows as 0 for everything in the
 mesos frameworks (# of tasks, cpu, memory are all 0).  There's nothing in
 the log files anywhere as to what's going on.

 Thanks






tasks won't run on mesos when using fine grained

2015-06-15 Thread Gary Ogden
My Mesos cluster has 1.5 CPU and 17GB free.  If I set:

conf.set("spark.mesos.coarse", "true");
conf.set("spark.cores.max", "1");

in the SparkConf object, the job will run in the mesos cluster fine.

But if I comment out those settings above so that it defaults to fine
grained, the task never finishes. It just shows as 0 for everything in the
mesos frameworks (# of tasks, cpu, memory are all 0).  There's nothing in
the log files anywhere as to what's going on.

Thanks
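
A plausible reason coarse-grained works while fine-grained hangs on a 1.5-CPU cluster: in Spark 1.x fine-grained mode, the Mesos executor itself reserves CPU (commonly 1 core; tunable in later releases via spark.mesos.mesosExecutor.cores) and each task additionally needs spark.task.cpus (default 1), so a single offer must cover both. A sketch of that arithmetic, with the specific core counts as illustrative assumptions rather than values read from this cluster:

```python
# Sketch: can one Mesos offer host a fine-grained Spark executor plus
# one task? Defaults below assume 1 core for the executor and 1 core
# per task (spark.task.cpus), which are common Spark 1.x defaults.
def can_launch_task(offered_cpus, executor_cpus=1.0, task_cpus=1.0):
    """True if a single offer covers the executor overhead and one task."""
    return offered_cpus >= executor_cpus + task_cpus

print(can_launch_task(1.5))  # 1.5 free CPUs < 1 + 1 needed -> False
print(can_launch_task(2.0))  # a 2-CPU offer would be enough -> True
```

Under these assumptions, 1.5 free CPUs can never satisfy a fine-grained launch, so the framework sits at 0 tasks forever, while coarse-grained mode with spark.cores.max=1 fits comfortably.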