I am cross-posting this to mesos-users, hoping someone has come across this issue and can help me resolve it. There are several open JIRA issues with similar symptoms.
All of a sudden I am having problems with the Marathon UI getting stuck at 'loading', and endpoints like http://m01.local:8081/v2/info are not responding (http://m01.local:8081/ping as well).

I have now scaled the test cluster down to one node, running only mesos-master, zookeeper and marathon, and I clean the /var/lib/zookeeper and /var/lib/mesos directories between tests. I have also removed many of the configuration options I had, such as SSL.

The only version I am able to run is marathon-1.7.216-9e2a9b579. Both marathon-1.8.222-86475ddac and marathon-1.10.17-c427ce965 show the problem described above.

I have been comparing the marathon 1.7 and marathon 1.8 logs, and this is what I have noticed: quite a few log statements are missing between 'All services up and running. (mesosphere.marathon.MarathonApp:main' and 'akka://marathon/deadLetters' in the 1.8 log.

Has anyone seen something similar?

[@mesos-master]# rpm -qa | grep java
python-javapackages-3.4.1-11.el7.noarch
tzdata-java-2020a-1.el7.noarch
java-1.8.0-openjdk-headless-1.8.0.252.b09-2.el7_8.x86_64
javapackages-tools-3.4.1-11.el7.noarch

[@mesos-master]# uname -a
Linux m01.local 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[@mesos-master]# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)

marathon 1.8 (unresponsive)
===========================
Jun 7 17:40:59 m01 marathon: [2020-06-07 17:40:59,696] INFO All services up and running. (mesosphere.marathon.MarathonApp:main)
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,833] INFO initiate task reconciliation (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-9)
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,854] INFO Requesting task reconciliation with the Mesos master (mesosphere.marathon.SchedulerActions:scheduler-actions-thread-0)
Jun 7 17:41:13 m01 mesos-master[11203]: I0607 17:41:13.858621 11227 master.cpp:8846] Performing implicit task state reconciliation for framework f5d67e06-6600-4fb9-94dc-a878be2563be-0000 (marathon) at [email protected]:36941
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,864] INFO task reconciliation has finished (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-4)
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,879] INFO Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from Actor[akka://marathon/user/MarathonScheduler/$a#1746491390] to Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters encountered. If this is not an expected behavior, then [Actor[akka://marathon/deadLetters]] may have terminated unexpectedly, This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. (akka.actor.DeadLetterActorRef:marathon-akka.actor.default-dispatcher-7)
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,910] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-7)
Jun 7 17:41:13 m01 mesos-master[11203]: I0607 17:41:13.914615 11228 master.cpp:8889] Performing explicit task state reconciliation for 1 tasks of framework f5d67e06-6600-4fb9-94dc-a878be2563be-0000 (marathon) at [email protected]:36941
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,924] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-13)
Jun 7 17:41:28 m01 marathon: [2020-06-07 17:41:28,939] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-4)
Jun 7 17:41:28 m01 mesos-master[11203]: I0607 17:41:28.946494 11229 master.cpp:8889] Performing explicit task state reconciliation for 1 tasks of framework f5d67e06-6600-4fb9-94dc-a878be2563be-0000 (marathon) at [email protected]:36941
Jun 7 17:41:28 m01 marathon: [2020-06-07 17:41:28,950] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-14)

marathon 1.7 (ok)
=================
Jun 7 17:37:02 m01 marathon: [2020-06-07 17:37:02,681] INFO All services up and running. (mesosphere.marathon.MarathonApp:main)
Jun 7 17:37:06 m01 marathon: [2020-06-07 17:37:06,222] INFO Received TimedCheck (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-8)
Jun 7 17:37:06 m01 marathon: [2020-06-07 17:37:06,228] INFO => revive offers NOW, canceling any scheduled revives (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-8)
Jun 7 17:37:06 m01 mesos-master[10661]: I0607 17:37:06.232568 10690 master.cpp:5521] Processing REVIVE call for framework f2318310-8c7b-438c-9a9d-48fdf1cd0406-0000 (marathon) at [email protected]:40447
Jun 7 17:37:06 m01 mesos-master[10661]: I0607 17:37:06.232730 10690 hierarchical.cpp:1788] Unsuppressed offers and cleared filters for roles { * } of framework f2318310-8c7b-438c-9a9d-48fdf1cd0406-0000
Jun 7 17:37:06 m01 marathon: [2020-06-07 17:37:06,235] INFO 2 further revives still needed. Repeating reviveOffers according to --revive_offers_repetitions 3 (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-8)
Jun 7 17:37:06 m01 marathon: [2020-06-07 17:37:06,238] INFO => Schedule next revive at 2020-06-07T15:37:11.228Z in 4990 milliseconds, adhering to --min_revive_offers_interval 5000 (ms) (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-8)
Jun 7 17:37:11 m01 marathon: [2020-06-07 17:37:11,240] INFO Received TimedCheck (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-5)
Jun 7 17:37:11 m01 mesos-master[10661]: I0607 17:37:11.246363 10685 master.cpp:5521] Processing REVIVE call for framework f2318310-8c7b-438c-9a9d-48fdf1cd0406-0000 (marathon) at [email protected]:40447
Jun 7 17:37:11 m01 mesos-master[10661]: I0607 17:37:11.246500 10685 hierarchical.cpp:1788] Unsuppressed offers and cleared filters for roles { * } of framework f2318310-8c7b-438c-9a9d-48fdf1cd0406-0000
Jun 7 17:37:11 m01 marathon: [2020-06-07 17:37:11,240] INFO => revive offers NOW, canceling any scheduled revives (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-5)
Jun 7 17:37:11 m01 marathon: [2020-06-07 17:37:11,241] INFO 1 further revives still needed. Repeating reviveOffers according to --revive_offers_repetitions 3 (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-5)
Jun 7 17:37:11 m01 marathon: [2020-06-07 17:37:11,241] INFO => Schedule next revive at 2020-06-07T15:37:16.240Z in 4999 milliseconds, adhering to --min_revive_offers_interval 5000 (ms) (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-5)
Jun 7 17:37:16 m01 marathon: [2020-06-07 17:37:16,261] INFO Received TimedCheck (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-8)
Jun 7 17:37:16 m01 mesos-master[10661]: I0607 17:37:16.265516 10689 master.cpp:5521] Processing REVIVE call for framework f2318310-8c7b-438c-9a9d-48fdf1cd0406-0000 (marathon) at [email protected]:40447
Jun 7 17:37:16 m01 mesos-master[10661]: I0607 17:37:16.265655 10689 hierarchical.cpp:1788] Unsuppressed offers and cleared filters for roles { * } of framework f2318310-8c7b-438c-9a9d-48fdf1cd0406-0000
Jun 7 17:37:16 m01 marathon: [2020-06-07 17:37:16,261] INFO => revive offers NOW, canceling any scheduled revives (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-8)
Jun 7 17:37:16 m01 marathon: [2020-06-07 17:37:16,409] INFO initiate task reconciliation (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-5)
Jun 7 17:37:16 m01 marathon: [2020-06-07 17:37:16,437] INFO Requesting task reconciliation with the Mesos master (mesosphere.marathon.SchedulerActions:scheduler-actions-thread-0)
Jun 7 17:37:16 m01 mesos-master[10661]: I0607 17:37:16.441344 10686 master.cpp:8846] Performing implicit task state reconciliation for framework f2318310-8c7b-438c-9a9d-48fdf1cd0406-0000 (marathon) at [email protected]:40447
Jun 7 17:37:16 m01 marathon: [2020-06-07 17:37:16,444] INFO task reconciliation has finished (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-2)
Jun 7 17:37:16 m01 marathon: [2020-06-07 17:37:16,459] INFO Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from Actor[akka://marathon/user/MarathonScheduler/$a#-463341905] to Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters encountered. If this is not an expected behavior, then [Actor[akka://marathon/deadLetters]] may have terminated unexpectedly, This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. (akka.actor.DeadLetterActorRef:marathon-akka.actor.default-dispatcher-8)
Jun 7 17:37:16 m01 marathon: [2020-06-07 17:37:16,502] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-5)
Jun 7 17:37:16 m01 mesos-master[10661]: I0607 17:37:16.506299 10687 master.cpp:8889] Performing explicit task state reconciliation for 1 tasks of framework f2318310-8c7b-438c-9a9d-48fdf1cd0406-0000 (marathon) at [email protected]:40447
Jun 7 17:37:16 m01 marathon: [2020-06-07 17:37:16,513] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-14)
Jun 7 17:37:31 m01 marathon: [2020-06-07 17:37:31,012] INFO Killing overdue instances: (mesosphere.marathon.core.task.jobs.impl.OverdueInstancesActor$Support:scala-execution-context-global-54)
Jun 7 17:37:31 m01 marathon: [2020-06-07 17:37:31,018] INFO Kill and forget following instances for reason Overdue: (mesosphere.marathon.core.task.termination.impl.KillServ

