Hi All,
need your advice:
in some very rare cases we see the following error in the log:
Initial job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient resources

and in the Spark UI there are idle workers and the application is in the WAITING state.

In the master's /json endpoint I see:

"cores" : 280,
  "coresused" : 0,
  "memory" : 2006561,
  "memoryused" : 0,
  "activeapps" : [ {
    "starttime" : 1483534808858,
    "id" : "app-20170104130008-0181",
    "name" : "our name",
    "cores" : -1,
    "user" : "spark",
    "memoryperslave" : 31744,
    "submitdate" : "Wed Jan 04 13:00:08 UTC 2017",
    "state" : "WAITING",
    "duration" : 6568575
  } ],


When I kill the application and restart it, everything works fine,
i.e. it's not an issue of workers not being properly connected;
the workers are there and usually work fine.

Is there some way to handle this? Maybe a timeout on the WAITING
state, so that the application exits automatically, because currently
it can stay in "WAITING" indefinitely...

I've thought of implementing a periodic check (by polling the master's
/json REST endpoint) that kills an application once it has been WAITING
for more than 10-15 minutes; a rough sketch of that idea is below.
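Something along these lines is what I have in mind. It is only a sketch under a few assumptions: the master UI address (spark-master:8080 here) is hypothetical, the "duration" field in /json is taken to be milliseconds, and the /app/kill/ handler used by the master UI's kill links would need to be verified (and enabled) on our Spark version.

#!/usr/bin/env python
# Watchdog sketch: poll the standalone master's /json endpoint and kill
# applications that have been WAITING longer than a threshold.
# Assumptions: MASTER_UI points at the master web UI (default port 8080),
# "duration" is reported in milliseconds, and the master UI exposes the
# /app/kill/ handler behind its kill links (to be confirmed per version).

import json
import time
import urllib.parse
import urllib.request

MASTER_UI = "http://spark-master:8080"   # adjust to your master UI address
MAX_WAITING_MS = 15 * 60 * 1000          # kill after 15 minutes in WAITING
POLL_INTERVAL_S = 60

def waiting_apps():
    # Fetch cluster status and return apps stuck in WAITING too long.
    with urllib.request.urlopen(MASTER_UI + "/json") as resp:
        status = json.load(resp)
    return [app for app in status.get("activeapps", [])
            if app.get("state") == "WAITING"
            and app.get("duration", 0) > MAX_WAITING_MS]

def kill_app(app_id):
    # The master UI's kill link posts to /app/kill/ -- treat this as an
    # assumption and confirm it works on your cluster before relying on it.
    data = urllib.parse.urlencode({"id": app_id, "terminate": "true"}).encode()
    urllib.request.urlopen(MASTER_UI + "/app/kill/", data=data)

if __name__ == "__main__":
    while True:
        for app in waiting_apps():
            print("killing stuck app %s (%s)" % (app["id"], app["name"]))
            kill_app(app["id"])
        time.sleep(POLL_INTERVAL_S)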

Any advice would be appreciated,

thanks in advance

Igor
