One thing puzzled me yesterday when reading your post: have you tried $HOST instead of $HOSTNAME in the Marathon configuration? When I played around with Mesos, I remember using HOST to get the host's IP address rather than the host's name. It could be that the hostname itself cannot be resolved to the right IP address. I struggled to find proper documentation to back that up, though; the only place where I found HOST being used is the recipes section of the Marathon docs [1].
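To make the suggestion concrete, here is a minimal sketch of an entrypoint wrapper, not a verified fix. It assumes that Marathon exports a HOST variable (the Mesos agent's host name, often its IP address) into the task environment. The function names pick_rpc_address and appmaster_command are made up for illustration, and the script only prints the command line so it can be checked before anything is started.

```shell
#!/bin/sh
# Sketch only. Assumption: Marathon sets HOST in the task environment.
# If HOST is unset or empty, fall back to the container's own host name,
# which is what the current entrypoint passes via $HOSTNAME.
pick_rpc_address() {
    printf '%s\n' "${HOST:-${HOSTNAME:-$(hostname)}}"
}

# Hypothetical helper: print the mesos-appmaster.sh command line instead of
# running it. The -D options are the ones from the original entrypoint.
appmaster_command() {
    printf '%s' "/opt/flink/bin/mesos-appmaster.sh"
    printf ' %s' \
        "-Djobmanager.rpc.address=$(pick_rpc_address)" \
        "-Dmesos.resourcemanager.framework.user=flink" \
        "-Dmesos.master=10.0.18.246:5050" \
        "-Dmesos.resourcemanager.tasks.cpus=6"
    printf '\n'
}

appmaster_command
```

If the printed address is still the container ID (something like 311dcf7fd77c), HOST is not being injected and the change would have no effect.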
Matthias

[1] https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks

On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jve...@strava.com> wrote:

> Another update: Looking more carefully in my appmaster log, I see the
> following
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Registering as new framework.
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - --------------------------------------------------------------------------------
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos Info:
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Master URL: 10.0.18.246:5050
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Framework Info:
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - ID: (none)
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Name: flink-test
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Failover Timeout (secs): 604800.0
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Role: *
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Capabilities: (none)
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Principal: (none)
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Host: 311dcf7fd77c
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Web UI: http://311dcf7fd77c:8081
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - --------------------------------------------------------------------------------
>
> which is picking up the mesos.master and
> mesos.resourcemanager.framework.name params I am passing to
> mesos-appmaster.sh
>
> In my Mesos dashboard I can see the framework has been created with the
> right name, but has no associated agents/tasks to it. So at least Flink has
> been able to connect to the Mesos master to create the framework
>
> Later in the mesos-appmaster log is when I see the Mesos connection errors:
>
> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager - Starting the slot manager.
> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor - State change (StoppedState -> StoppedState) with data ()
> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator - State change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.mesos.scheduler.ConnectionMonitor - Connecting to Mesos...
> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor - State change (StoppedState -> ConnectingState) with data ()
> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos resource manager started.
> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG org.apache.flink.mesos.scheduler.LaunchCoordinator - State change (Suspended -> Suspended) with data GatherData(List(),List())
> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN org.apache.flink.mesos.scheduler.ConnectionMonitor - Unable to connect to Mesos; still trying...
> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager - Trigger heartbeat request.
>
> So why was the appmaster able to connect to the Mesos master to create the
> framework, but unable to connect later to do whatever it does afterwards?
>
> One possible issue I see is that the framework is registered with its web UI
> at http://311dcf7fd77c:8081, which cannot be resolved from the Mesos master.
> 311dcf7fd77c is the result of running hostname in the Docker container, and
> the Mesos master cannot resolve that name. I could try to replace the Docker
> container hostname with the Docker host hostname, but the host port that
> gets mapped to 8081 on the container is a random port that I cannot know
> beforehand. Does the Mesos master try to reach Flink using that Web UI setting?
> Could this be the issue causing my connection problem, or is this a red
> herring and the problem is a different one?
>
> Thanks,
>
> Javier Vegas
>
> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jve...@strava.com> wrote:
>
>> Thanks, Matthias!
>>
>> There are lots of apps deployed to the Mesos cluster, the task manager
>> itself is deployed to Mesos via Marathon. In the Mesos log I can see the
>> Job manager agent starting, but no error messages related to it. As you
>> say, TaskManagers don't even have the chance to get confused about
>> variables, since the Job Manager can not connect to the Mesos master to
>> tell it to start the Task Managers.
>>
>> Thanks,
>>
>> Javier
>>
>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <matth...@ververica.com>
>> wrote:
>>
>>> Hi Javier,
>>> I don't see anything that's configured in the wrong way based on the
>>> jobmanager logs you've provided. Have you been able to deploy other
>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>> anything? The variable resolution on the TaskManager side is a valid
>>> concern shared by Roman since it's easy to run into such an issue. But the
>>> JobManager logs indicate that the JobManager is not able to contact the
>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>> not coming up.
>>>
>>> Best,
>>> Matthias
>>>
>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> No additional ports need to be open as far as I know.
>>>>
>>>> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>>>>
>>>> Please also make sure that the following gets executed before
>>>> mesos-appmaster.sh:
>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>> (as per the documentation you linked)
>>>>
>>>> Regards,
>>>> Roman
>>>>
>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jve...@strava.com> wrote:
>>>> >
>>>> > I am trying to start Flink 1.13.2 on Mesos following the instructions in
>>>> > https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>> > and using Marathon to deploy a Docker image with both the Flink and my
>>>> > binaries.
>>>> >
>>>> > My entrypoint for the Docker image is:
>>>> >
>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>> >     -Djobmanager.rpc.address=$HOSTNAME \
>>>> >     -Dmesos.resourcemanager.framework.user=flink \
>>>> >     -Dmesos.master=10.0.18.246:5050 \
>>>> >     -Dmesos.resourcemanager.tasks.cpus=6
>>>> >
>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>> >
>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker executor on 10.0.20.177
>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>> > WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
>>>> > WARNING: An illegal reflective access operation has occurred
>>>> > WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance()
>>>> > WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
>>>> > WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
>>>> > WARNING: All illegal access operations will be denied in a future release
>>>> > I0927 16:50:43.622053 237 sched.cpp:232] Version: 1.7.3
>>>> > I0927 16:50:43.624439 328 sched.cpp:336] New master detected at master@10.0.18.246:5050
>>>> > I0927 16:50:43.624779 328 sched.cpp:356] No credentials provided. Attempting to register without authentication
>>>> >
>>>> > where the "New master detected" line is promising.
>>>> >
>>>> > However, on the Flink UI I see only the jobmanager started, and there
>>>> > are no task managers. Getting into the Docker container, I see this in the
>>>> > log:
>>>> >
>>>> > WARN org.apache.flink.mesos.scheduler.ConnectionMonitor - Unable to connect to Mesos; still trying...
>>>> >
>>>> > I have verified that from the container I can access the Mesos
>>>> > container 10.0.18.246:5050
>>>> >
>>>> > Does any other port besides the web UI port 5050 need to be open for
>>>> > mesos-appmaster to connect with the Mesos master?
>>>> >
>>>> > In the appmaster log (attached) I see one exception, and I don't know
>>>> > if it is related to the Mesos connection problem:
>>>> >
>>>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
>>>> >     at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>> >     at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>> >     at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>> >     at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>> >     at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>> >     at org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>> >     at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>> >     at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>> >     at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>> >     at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>> >     at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>> >     at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>> >     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> >     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>>> >     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>> >     at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>>>> >     at org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>> >     at org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>> >     at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>> >
>>>> > I am not trying (yet) to run in high availability mode, so I am not
>>>> > sure if I need to have HADOOP_HOME set or not, but I don't see anything
>>>> > about HADOOP_HOME in the Flink docs.
>>>> >
>>>> > Any tips on how I can fix my Docker+Marathon+Mesos environment so
>>>> > Flink can connect to my Mesos master?
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Javier Vegas