As an addendum, I see a large number of the following warnings in the Mesos slave INFO logs:
W1211 05:44:37.057456 14205 monitor.cpp:186] Failed to collect resource usage for executor '201312061449-1315739402-5050-23513-0' of framework '201312061449-1315739402-5050-23513-0026': Future discarded
W1211 05:44:42.057998 14207 monitor.cpp:186] Failed to collect resource usage for executor '201312061449-1315739402-5050-23513-0' of framework '201312061449-1315739402-5050-23513-0026': Future discarded

On Tue, Dec 10, 2013 at 6:27 PM, Gary Malouf <[email protected]> wrote:

> Hi guys,
>
> For reference, we are on a master build of spark from November 19 and
> Mesos 0.13.
>
> Periodically, we run into an issue where one of our Mesos slaves takes
> some tasks from a Spark query and according to the Mesos ui they are stuck
> in 'STAGING'. This ends up blocking the query from running and blocks
> future queries until we stop and restart the slave in question.
>
> Has anyone else seen and/or resolved this type of issue?
>
> Thanks,
>
> Gary
>
