How can I enable forwarding in a mesos container
If I run it as root I am getting:

sysctl: error setting key 'net.ipv4.ip_forward': Read-only file system

I am looking for something like the Docker equivalent:

docker run --sysctl net.ipv4.ip_forward=1 someimage
Problems getting the new mvp csi working
I have been looking forward to the Mesos update offering this MVP CSI support, mainly to finally be able to use Ceph. Unfortunately I am still not able to get a simple RBD image attached to a container. I am able to use csilvm by adding the volume as shown in [2], but cephcsi keeps failing. It looks like the secrets are not being sent to the driver: it keeps complaining about 'stage secrets cannot be nil or empty'[1], even though the config[3] does have staging secrets. I have also tried using a secrets plugin, doing something like "username": { "secret": "secretpassword" }. Any hints on what I am doing wrong are very welcome!

[1]
I1221 21:54:36.932030 10356 utils.go:132] ID: 14 Req-ID: 0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 GRPC call: /csi.v1.Node/NodeStageVolume
I1221 21:54:36.932302 10356 utils.go:133] ID: 14 Req-ID: 0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 GRPC request: {"staging_target_path":"/var/lib/mesos/csi/rbd.csi.ceph.io/default/mounts/0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1/staging","volume_capability":{"AccessType":{"Block":{}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"ceph","pool":"app"},"volume_id":"0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1"}
E1221 21:54:36.932316 10356 utils.go:136] ID: 14 Req-ID: 0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 GRPC error: rpc error: code = InvalidArgument desc = stage secrets cannot be nil or empty
I1221 21:54:36.976159 10356 utils.go:132] ID: 15 Req-ID: 0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 GRPC call: /csi.v1.Node/NodeUnstageVolume
I1221 21:54:36.976308 10356 utils.go:133] ID: 15 Req-ID: 0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 GRPC request: {"staging_target_path":"/var/lib/mesos/csi/rbd.csi.ceph.io/default/mounts/0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1/staging","volume_id":"0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1"}
I1221 21:54:36.976465 10356 nodeserver.go:666] ID: 15 Req-ID: 0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 failed to find image metadata: missing stash: open /var/lib/mesos/csi/rbd.csi.ceph.io/default/mounts/0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1/staging/image-meta.json: no such file or directory
I1221 21:54:36.976537 10356 utils.go:138] ID: 15 Req-ID: 0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 GRPC response: {}

[3]
"volumes": [
  {
    "containerPath": "xxx",
    "mode": "rw",
    "external": {
      "provider": "csi",
      "name": "0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1",
      "options": {
        "pluginName": "rbd.csi.ceph.io",
        "capability": {
          "accessType": "block",
          "accessMode": "SINGLE_NODE_WRITER",
          "fsType": ""
        },
        "volumeContext": { "clusterID": "ceph", "pool": "app" },
        "nodeStageSecret": {
          "username": "userID",
          "password": "asdfasdfasdfasdfasdfasdf"
        }
      }
    }
  }
]

[2]
"volumes": [
  {
    "containerPath": "xxx",
    "mode": "rw",
    "external": {
      "provider": "csi",
      "name": "LVtestman1",
      "options": {
        "pluginName": "lvm.csi.mesosphere.io",
        "capability": {
          "accessType": "mount",
          "accessMode": "SINGLE_NODE_WRITER",
          "fsType": "xfs"
        }
      }
    }
  }
]
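Not part of the original post, but perhaps useful for comparison: judging by the ceph-csi documentation, the RBD driver expects its credentials under the secret keys userID/userKey, so a NodeStageVolume request it accepts would look roughly like this (values are placeholders; whether Mesos maps nodeStageSecret into these keys is exactly the open question here):

```json
{
  "volume_id": "0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1",
  "staging_target_path": "/var/lib/mesos/csi/rbd.csi.ceph.io/default/mounts/<volume-id>/staging",
  "volume_capability": { "block": {}, "access_mode": { "mode": 1 } },
  "secrets": { "userID": "app", "userKey": "<cephx-key>" }
}
```

Note that the failing request in [1] carries no "secrets" field at all, which matches the 'stage secrets cannot be nil or empty' error.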
new mvp how to use block csi lvm
If I use the csilvm driver, I am able to use a published volume with an xfs filesystem via this task[1]. However, when I try to add the volume as a block device, the task[3] fails to deploy, even though the log[4] seems ok and shows a mount and unmount. Should I change more than just accessType and fsType? The stderr of the task mentions 'Failed to prepare mounts: Failed to mount' and 'directory does not exist'. The driver itself seems ok, because it responds to a publish request from the command line[2].

[2]
csc -e unix:///tmp/mesos-csi-v5WtjJ/endpoint.sock node publish --cap SINGLE_NODE_WRITER,block --target-path /mnt/testman1 'LVtestman1'

Log:
...
[VGtest]2020/12/18 17:50:18 lvm.go:893: stderr:
[VGtest]2020/12/18 17:50:18 server.go:1069: Volume path is /dev/VGtest/LVtestman1
[VGtest]2020/12/18 17:50:18 server.go:1071: Target path is /mnt/testman1
[VGtest]2020/12/18 17:50:18 server.go:1074: Mounting readonly: false
[VGtest]2020/12/18 17:50:18 server.go:1094: Attempting to publish volume /dev/VGtest/LVtestman1 as BLOCK_DEVICE to /mnt/testman1
[VGtest]2020/12/18 17:50:18 server.go:1095: Determining mount info at /mnt/testman1
[VGtest]2020/12/18 17:50:18 server.go:1104: Mount info at /mnt/testman1:
[VGtest]2020/12/18 17:50:18 server.go:1135: Creating Mount Target /mnt/testman1
[VGtest]2020/12/18 17:50:18 server.go:1143: Nothing mounted at targetPath /mnt/testman1 yet
[VGtest]2020/12/18 17:50:18 server.go:1148: Performing bind mount of /dev/VGtest/LVtestman1 -> /mnt/testman1
[VGtest]2020/12/18 17:50:18 logging.go:30: Served /csi.v1.Node/NodePublishVolume: resp=

[1]
{
  "id": "/app5",
  "instances": 1,
  "cpus": 1,
  "mem": 32,
  "cmd": "echo $(date +'%m%d %H%M%S'): $HOSTNAME >> xxx/file ; sleep 3600",
  "acceptedResourceRoles": ["*"],
  "constraints": [["hostname", "CLUSTER", "m01.local"]],
  "backoffSeconds": 10,
  "networks": [ { "mode": "host" } ],
  "container": {
    "type": "MESOS",
    "volumes": [
      {
        "containerPath": "xxx",
        "mode": "rw",
        "external": {
          "provider": "csi",
          "name": "LVtestman1",
          "options": {
            "pluginName": "lvm.csi.mesosphere.io",
            "capability": {
              "accessType": "mount",
              "accessMode": "SINGLE_NODE_WRITER",
              "fsType": "xfs"
            }
          }
        }
      }
    ]
  }
}

[3]
{
  "id": "/app4",
  "instances": 1,
  "cpus": 1,
  "mem": 32,
  "cmd": "echo $(date +'%m%d %H%M%S'): $HOSTNAME >> file ; sleep 3600",
  "acceptedResourceRoles": ["*"],
  "constraints": [["hostname", "CLUSTER", "m01.local"]],
  "backoffSeconds": 10,
  "networks": [ { "mode": "host" } ],
  "container": {
    "type": "MESOS",
    "volumes": [
      {
        "containerPath": "xxx",
        "mode": "rw",
        "external": {
          "provider": "csi",
          "name": "LVtestman1",
          "options": {
            "pluginName": "lvm.csi.mesosphere.io",
            "capability": {
              "accessType": "block",
              "accessMode": "SINGLE_NODE_WRITER",
              "fsType": ""
            }
          }
        }
      }
    ]
  }
}

[4]
[VGtest]2020/12/18 17:53:22 server.go:1329: Determining mount info at /var/lib/mesos/csi/lvm.csi.mesosphere.io/default/mounts/LVtestman1/target
[VGtest]2020/12/18 17:53:22 server.go:1337: Mount info at /var/lib/mesos/csi/lvm.csi.mesosphere.io/default/mounts/LVtestman1/target: &{root:/dm-2 path:/var/lib/mesos/csi/lvm.csi.mesosphere.io/default/mounts/LVtestman1/target fstype:devtmpfs mountopts:[rw nosuid] mountsource:devtmpfs}
[VGtest]2020/12/18 17:53:22 server.go:1346: Unmounting /var/lib/mesos/csi/lvm.csi.mesosphere.io/default/mounts/LVtestman1/target
[VGtest]2020/12/18 17:53:22 server.go:1361: Deleting Mount Target /var/lib/mesos/csi/lvm.csi.mesosphere.io/default/mounts/LVtestman1/target
[VGtest]2020/12/18 17:53:22 logging.go:30: Served /csi.v1.Node/NodeUnpublishVolume: resp=
csi specification handles volumeid(?)
I hope nobody minds me putting this here, since the csi mailing list is invitation-only, and Jie Yu seems to be everywhere ;) I am having some problems understanding how the cephcsi plugin works. I am using csc[1] from the rexray people, who I believe have quite some history with the development of storage provisioning. I am able to use it fine with several plugins like csilvm and csinfs and can publish some volumes. However, with the cephcsi plugin I seem to have to use different arguments for the StageVolume/PublishVolume calls than for the CreateVolume call. When I create a volume with 'csc -e unix:///tmp/csiceph.sock controller create-volume' I can supply a volume name 'app-test2'. But when I want to stage/publish this volume with 'csc -e unix:///tmp/csiceph.sock node stage' I have to supply a volume ID '0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1'.

Question: is this indeed correct according to the csi specification? It just looks weird to me, especially since other plugins do not behave like this. Or is this new? I do not know why, but I get the impression that the cephcsi plugin is implementing fixes for things that need to be fixed in kubernetes. I honestly do not know why a csi plugin is trying to generate random image names, volume ids etc. If something needs to be randomized, the orchestrator is responsible for that, is it not?

[1] https://github.com/rexray/gocsi
[2] https://github.com/ceph/ceph-csi/issues/1802
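For what it is worth, this name-versus-id split is how I read the CSI specification itself: CreateVolume takes a caller-chosen name and returns a plugin-generated volume_id, and the node RPCs only ever take that volume_id. A rough csc session (socket path and id from above; exact flags may differ per csc version):

```shell
# CreateVolume: you pass a *name*; the plugin returns a volume with an *id*
csc -e unix:///tmp/csiceph.sock controller create-volume app-test2

# Node staging: you pass the returned *id*, never the name
csc -e unix:///tmp/csiceph.sock node stage \
    --staging-target-path /mnt/staging \
    '0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1'
```

So per the spec a plugin is free to generate an opaque id, as cephcsi does; plugins like csilvm appear to simply reuse the name as the id, which is why they feel different.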
csi volume
When I create this task[2], I am getting the error message: "There was a problem with your configuration general: App creation unsuccessful. Check your app settings and try again." I have the csi managed plugin running and can mount with the command line csc[1]. What should I look at to fix this?

[1]
# csc -e unix:///tmp/mesos-csi-Xxe00V/endpoint.sock node publish --cap SINGLE_NODE_WRITER,mount,nfs --target-path /mnt/test --vol-context 'server=192.168.10.58,share=/test' 192.168.10.58/test

[2]
{
  "id": "app1",
  "instances": 1,
  "cpus": 1,
  "mem": 32,
  "cmd": "echo yes > xxx/file && sleep 3600",
  "container": {
    "type": "MESOS",
    "volumes": [
      {
        "containerPath": "xxx",
        "mode": "rw",
        "external": {
          "provider": "csi",
          "name": "no-need",
          "options": {
            "pluginName": "nfs.csi.k8s.io",
            "capability": {
              "accessType": "mount",
              "accessMode": "MULTI_NODE_MULTI_WRITER",
              "fsType": "nfs"
            },
            "volumeContext": {
              "server": "192.168.10.58",
              "share": "/mnt/test"
            }
          }
        }
      }
    ]
  }
}
virtual memory task on mesos ~40GB while on docker ~3G
When I launch a task via docker with:

docker run --memory 2G --memory-swappiness 0 -v /dev/log:/dev/log -it --network host marathon:1.11.24

the task uses ~400MB resident and ~2.8GB virtual. When I launch the same task on Mesos, it uses ~900MB resident and ~47GB virtual. Is this difference normal? Or should I configure other settings in Mesos? On the mesos agent I have cgroups_limit_swap=true and I have added cgroups/mem to the isolation flags.
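One way to see why the two numbers can diverge so far: virtual size (VmSize) counts reserved address space, while resident (VmRSS) counts pages actually backed by RAM. The JVM and glibc (per-thread malloc arenas, JIT code cache) can reserve large mappings up front, so a big virtual size with a modest resident size is not by itself a memory leak. A minimal check, runnable on any Linux host:

```shell
# VmSize = reserved address space, VmRSS = pages actually resident.
# Compare the two for any pid, e.g. this shell itself:
grep -E 'VmSize|VmRSS' /proc/self/status
```

Comparing /proc/<pid>/smaps of the Mesos-launched JVM against the Docker one would show which mappings differ; note that the cgroup memory limit applies to resident memory, not virtual size.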
marathon plugin interface for mesos 1.11
I wanted to test csi in mesos 1.11, but noticed that a plugin I use with marathon no longer loads. It has this in its "build.sbt" file:

libraryDependencies += "mesosphere.marathon" %% "plugin-interface" % "1.6.325" % "provided"

I assume this needs to be changed to a newer marathon plugin-interface version? Where can I find which versions are available?
RE: hostname in task
Hi James,

Sorry to bring this up again, but marathon is constantly logging errors[1] because it uses the host name from the host networking, instead of its own task name marathon.xxx.xxx.xxx.mesos as a host name, for which there is a certificate. Do you have an example of setting the hostname via the mesos json? Because I have no idea how to interpret the github link.

[1]
Dec 8 11:03:39 c02 marathon: ERROR Connection to leader refused.
    akka.stream.ConnectionException: Hostname verification failed! Expected session to be for c03
Dec 8 11:03:40 c03 marathon: ERROR Connection to leader refused.
    akka.stream.ConnectionException: Hostname verification failed! Expected session to be for c03
Dec 8 11:03:54 c02 marathon: ERROR Connection to leader refused.
    akka.stream.ConnectionException: Hostname verification failed! Expected session to be for c03
Dec 8 11:03:55 c03 marathon: ERROR Connection to leader refused.
    akka.stream.ConnectionException: Hostname verification failed! Expected session to be for c03
Dec 8 11:04:09 c02 marathon: ERROR Connection to leader refused.
    akka.stream.ConnectionException: Hostname verification failed! Expected session to be for c03
Dec 8 11:04:10 c03 marathon: ERROR Connection to leader refused.

-Original Message-
Subject: Re: hostname in task

> I read you can add a hostname option to the container in this
> issue[0], however I still have the uuid. Is this available in mesos 1.8?

Yep.

> Can I somewhere read all these options? Like here[1]

The Mesos API is defined in the ContainerInfo protobuf, but I'm not sure how marathon maps that:
https://github.com/apache/mesos/blob/master/include/mesos/v1/mesos.proto#L3395

> [@ cni]# cat 2f261fa8-4985-4614-b712-f0785ca6ce04/hosts
> 127.0.0.1 localhost
> 192.168.123.32 2f261fa8-4985-4614-b712-f0785ca6ce04
>
> [0] https://reviews.apache.org/r/55191/
> [1] http://mesosphere.github.io/marathon/api-console/index.html
>
> Using mesos 1.8 and:
>
> "container": {
>   "type": "MESOS",
>   "hostname": "test.example.com",
>   "docker": {
>     "image": "test",
>     "credential": null,
>     "forcePullImage": true
>   },
>   "volumes": [
>     {
>       "mode": "RW",
>       "containerPath": "/dev/log",
>       "hostPath": "/dev/log"
>     }
>   ]
> },
Marathon shutdown after master connection lost
I hope nobody minds that I am crossposting this to mesos, since there is not much activity on the marathon mailing list. Is there an option to keep marathon running, having it try to reconnect to the mesos-master after it loses the connection? Currently I am running sort of a test cluster with only 1 zookeeper and 1 mesos-master. This is quite ok for now; the only annoying thing is that when I update the mesos-master, my marathon tasks are gone, and I have to manually start one, which in turn restarts the failed ones. It would be nice if they could try to reconnect to the mesos-master for 15-30 min and then go down.
Package mesos-1.11.0-2.0.1.el7.x86_64.rpm is not signed
Package mesos-1.11.0-2.0.1.el7.x86_64.rpm is not signed
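Not from the original post, but the stock rpm/yum switches cover both inspecting the signature and working around the check (package filename from the subject):

```shell
# Show whether the package is signed, and with which key
rpm -K mesos-1.11.0-2.0.1.el7.x86_64.rpm

# Or install while skipping the GPG check for this one transaction
yum install --nogpgcheck mesos-1.11.0-2.0.1.el7.x86_64.rpm
```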
RE: Suddenly all tasks gone, framework at completed, cannot start framework
Is there a way to change this failover_timeout after the framework is running? Via the api or so? I see it changes when the leader changes.

-Original Message-
To: user
Cc: cf.natali; janiszt
Subject: RE: Suddenly all tasks gone, framework at completed, cannot start framework

Thanks Tomek, Charles, I increased my MARATHON_FAILOVER_TIMEOUT from a day to a week. I can hardly believe something happened yesterday that made everything go down today. However, I have recently been testing with JAVA_OPTS to prevent OOMs from the marathon tasks.
Changing logging timestamp
I have a default remote syslog setup on centos; all applications and servers log the same timestamp (zone), except mesos and marathon tasks. I assume UTC times are sent by them. How can I set this back to the host's default?
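One thing worth checking first (an assumption, not a confirmed cause): if the task environment carries no TZ variable, glibc falls back to UTC, so simply passing TZ through to the tasks may be enough. The effect is easy to see:

```shell
# With TZ=UTC glibc formats times in UTC; setting TZ switches the same
# clock reading to the host's zone.
TZ=UTC date
TZ=Europe/Amsterdam date
```

In marathon that would mean adding TZ to the app's env block.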
RE: Paid help for getting csi ceph working
Hi Vinod, thanks for the link. I had a look at the design document too; it looks promising and also clears up some questions I had. Can't wait to give it a try. Good luck with this!

-Original Message-
To: user
Subject: Re: Paid help for getting csi ceph working

SERP is not available yet. We are currently working on an alternative way to get external storage into Mesos instead of using SLRP. Please watch the progress here: https://issues.apache.org/jira/browse/MESOS-10141 . MVP support will land in the upcoming release of Mesos.

On Mon, Sep 7, 2020 at 2:08 PM Marc Roos wrote:

Is there anyone interested in giving some paid help to get me up and running with an slrp with ceph? I assume this serp is still not available?
Paid help for getting csi ceph working
Is there anyone interested in giving some paid help to get me up and running with an slrp with ceph? I assume this serp is still not available?
slrp csi ceph rbd static volume possible with mesos 1.10
I would like to map a ceph rbd device to a task as a static/pre-existing volume. Is there any guide on how to do this?
recommended ceph csi plugin?
Is there a recommended csi ceph plugin? I found this one[1], but I think it is only usable with kubernetes, since it requires secrets to be stored in some kubernetes property.

[1] https://github.com/ceph/ceph-csi
RE: marathon (or java) container constantly oom
Thanks for this, I will stop trying then. What I noticed (but I am not sure about this) is that memory consumption starts getting worse after I have accessed the web interface; then the usage climbs more rapidly.

-Original Message-
To: user
Cc: marathon-framework
Subject: Re: marathon (or java) container constantly oom

Hi, it is a known issue with Marathon: https://jira.d2iq.com/browse/MARATHON-8180 AFAIK it hasn't been fixed yet. You can tune GC or increase memory limits, but the memory usage will grow indefinitely with a higher number of tasks.

Regards, Tomas

On Wed, 26 Aug 2020 at 11:11, Marc Roos wrote:

Recently I enabled the cpu and memory isolators on my test cluster, and since then I have been seeing the marathon containers (when becoming leader) increase memory usage from ~400MB until they oom at 850MB (checking via systemd-cgtop). Now I am testing with these settings from this page[1]:

JAVA_OPTS "-Xshare:off -XX:+UseSerialGC -XX:+TieredCompilation -XX:TieredStopAtLevel=1 -Xint -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler"
LD_PRELOAD "/usr/lib64/libjemalloc.so.1"

Is someone able to share an efficient config? Or is it not possible to get marathon running below 1GB? At the moment I have only ~10 tasks.

[1] https://stackoverflow.com/questions/53451103/java-using-much-more-memory-than-heap-size-or-size-correctly-docker-memory-limi
marathon (or java) container constantly oom
Recently I enabled the cpu and memory isolators on my test cluster, and since then I have been seeing the marathon containers (when becoming leader) increase memory usage from ~400MB until they oom at 850MB (checking via systemd-cgtop). Now I am testing with these settings from this page[1]:

JAVA_OPTS "-Xshare:off -XX:+UseSerialGC -XX:+TieredCompilation -XX:TieredStopAtLevel=1 -Xint -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler"
LD_PRELOAD "/usr/lib64/libjemalloc.so.1"

Is someone able to share an efficient config? Or is it not possible to get marathon running below 1GB? At the moment I have only ~10 tasks.

[1] https://stackoverflow.com/questions/53451103/java-using-much-more-memory-than-heap-size-or-size-correctly-docker-memory-limi
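For comparison, a more conservative starting point than the flags above would be to cap the heap and metaspace explicitly, so the JVM cannot outgrow the cgroup limit. The values here are guesses to tune against a ~10-task setup, not a recommendation:

```shell
# Illustrative caps only; the right numbers depend on task count.
export JAVA_OPTS="-Xmx512m -Xms256m -XX:MaxMetaspaceSize=128m -XX:+UseSerialGC"
export LD_PRELOAD="/usr/lib64/libjemalloc.so.1"
```

With explicit caps, hitting the limit shows up as a java.lang.OutOfMemoryError in the marathon log rather than an opaque cgroup OOM kill, which is easier to diagnose.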
FW: How to configure a pre-existing slrp volume/disk
On this dcos manual[1] it is only documented how to use a profile from an slrp. Does anyone know how to change this to a pre-existing (lvm) volume? (A mesos example is also welcome ;)

cat > app2.json <<EOF
{
  "cmd": "... >> data/foo && cat data/foo && sleep 5000",
  "container": {
    "docker": { "image": "alpine" },
    "type": "MESOS",
    "volumes": [
      {
        "containerPath": "data",
        "mode": "RW",
        "persistent": {
          "size": 100,
          "profileName": "fast",
          "type": "mount"
        }
      }
    ]
  },
  "cpus": 0.1,
  "id": "/app-persistent-stable-good-profile-2",
  "instances": 1,
  "mem": 128,
  "residency": {
    "taskLostBehavior": "WAIT_FOREVER",
    "relaunchEscalationTimeoutSeconds": 3600
  },
  "unreachableStrategy": "disabled",
  "upgradeStrategy": {
    "maximumOverCapacity": 0,
    "minimumHealthCapacity": 0
  }
}
EOF

[1] https://docs.d2iq.com/mesosphere/dcos/services/storage/1.0.0/tutorials/manage-local-disks/
RE: Suddenly all tasks gone, framework at completed, cannot start framework
Thanks Tomek, Charles, I increased my MARATHON_FAILOVER_TIMEOUT from a day to a week. I can hardly believe something happened yesterday that made everything go down today. However, I have recently been testing with JAVA_OPTS to prevent OOMs from the marathon tasks.

-Original Message-
From: Tomek Janiszewski [mailto:jani...@gmail.com]
Sent: Tuesday, 25 August 2020 16:55
To: user
Subject: Re: Suddenly all tasks gone, framework at completed, cannot start framework

See: https://stackoverflow.com/a/42544023/1387612

On Tue, 25 Aug 2020 at 15:07, Marc Roos wrote:

Today all my tasks are down and the framework marathon is at completed. Any idea how this can happen?

sched.cpp:520] Successfully authenticated with master master@192.168.10.151:5050
I0825 13:03:27.961248 108 sched.cpp:1188] Got error 'Framework has been removed'
RE: Suddenly all tasks gone, framework at completed, cannot start framework -
I assume this happened because something went wrong with zookeeper, and it restarted loading the wrong configuration file, without the quorum=1, because I was testing with different zookeeper rpms (the mesos rpm conf is not in the standard location).

Question: Is it by design that all tasks are terminated when zookeeper is gone? Is there some timeout setting that allows tasks to keep running for a day without zookeeper?

-Original Message-
To: user
Subject: Suddenly all tasks gone, framework at completed, cannot start framework

Today all my tasks are down and the framework marathon is at completed. Any idea how this can happen?

sched.cpp:520] Successfully authenticated with master master@192.168.10.151:5050
I0825 13:03:27.961248 108 sched.cpp:1188] Got error 'Framework has been removed'
Suddenly all tasks gone, framework at completed, cannot start framework
Today all my tasks are down and the framework marathon is at completed. Any idea how this can happen?

sched.cpp:520] Successfully authenticated with master master@192.168.10.151:5050
I0825 13:03:27.961248 108 sched.cpp:1188] Got error 'Framework has been removed'
mesosphere csilvm doesn't have socket after startup
I am not sure if the csi standard requires CSI_ENDPOINT to be set in any case, but:

- csilvm does not work without specifically setting -unix-addr-env CSI_ENDPOINT. So either document this or make it the default.
- I could not test with csc on the csilvm master branch, only with this csi 1.2(?) update. So merge this pull request finally, so other people do not waste time trying to get master working.

E0822 19:00:42.980020 18830 provider.cpp:541] Failed to recover resource provider with type 'org.apache.mesos.rp.local.storage' and name 'local_vg': Timed out waiting for endpoint 'unix:///tmp/mesos-csi-Q16CR1/endpoint.sock'
E0822 19:00:42.980559 18828 container_daemon.cpp:150] Failed to launch container 'org-apache-mesos-rp-local-storage-local_vg--io-mesosphere-csi-lvm-csilvm--CONTROLLER_SERVICE-NODE_SERVICE': Timed out waiting for endpoint 'unix:///tmp/mesos-csi-Q16CR1/endpoint.sock'
E0822 19:00:42.980654 18828 service_manager.cpp:751] Container daemon for 'org-apache-mesos-rp-local-storage-local_vg--io-mesosphere-csi-lvm-csilvm--CONTROLLER_SERVICE-NODE_SERVICE' failed: Timed out waiting for endpoint 'unix:///tmp/mesos-csi-Q16CR1/endpoint.sock'
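To spell out the workaround from the first bullet as a sketch (flag names as used above; the socket path is the one Mesos generates and is shown here only as an illustration):

```shell
# csilvm only serves on the agent-chosen socket when told which
# environment variable holds the endpoint
export CSI_ENDPOINT=unix:///tmp/mesos-csi-Q16CR1/endpoint.sock
./csilvm -unix-addr-env CSI_ENDPOINT -volume-group VGtest
```

In an slrp setup the agent sets CSI_ENDPOINT itself, so in practice only the -unix-addr-env argument needs to be added to the plugin's command in the resource provider config.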
csi drivers endpoint errors, maybe update slrp page with info on how to configure these csi endpoints
E0815 18:43:38.774154 1073 service_manager.cpp:751] Container daemon for 'org-apache-mesos-rp-local-storage-local_blockdevices--nfs-csi-k8s-io-csi_blockdevices--CONTROLLER_SERVICE-NODE_SERVICE' failed: Timed out waiting for endpoint 'unix:///tmp/mesos-csi-iJusqh/endpoint.sock'
E0815 18:43:38.780150 1070 provider.cpp:541] Failed to recover resource provider with type 'org.apache.mesos.rp.local.storage' and name 'local_nfs': Timed out waiting for endpoint 'unix:///tmp/mesos-csi-bik1bT/endpoint.sock'
E0815 18:43:38.780278 1075 container_daemon.cpp:150] Failed to launch container 'org-apache-mesos-rp-local-storage-local_nfs--nfs-csi-k8s-io-csilvm--CONTROLLER_SERVICE-NODE_SERVICE': Timed out waiting for endpoint 'unix:///tmp/mesos-csi-bik1bT/endpoint.sock'
E0815 18:43:38.780364 1075 service_manager.cpp:751] Container daemon for 'org-apache-mesos-rp-local-storage-local_nfs--nfs-csi-k8s-io-csilvm--CONTROLLER_SERVICE-NODE_SERVICE' failed: Timed out waiting for endpoint 'unix:///tmp/mesos-csi-bik1bT/endpoint.sock'
E0815 18:43:38.783254 1076 provider.cpp:541] Failed to recover resource provider with type 'org.apache.mesos.rp.local.storage' and name 'local_vg': Timed out waiting for endpoint 'unix:///tmp/mesos-csi-ugRVyg/endpoint.sock'
E0815 18:43:38.783386 1075 container_daemon.cpp:150] Failed to launch container 'org-apache-mesos-rp-local-storage-local_vg--io-mesosphere-csi-lvm-csilvm--CONTROLLER_SERVICE-NODE_SERVICE': Timed out waiting for endpoint 'unix:///tmp/mesos-csi-ugRVyg/endpoint.sock'
E0815 18:43:38.783461 1075 service_manager.cpp:751] Container daemon for 'org-apache-mesos-rp-local-storage-local_vg--io-mesosphere-csi-lvm-csilvm--CONTROLLER_SERVICE-NODE_SERVICE' failed: Timed out waiting for endpoint 'unix:///tmp/mesos-csi-ugRVyg/endpoint.sock'

[1] http://mesos.apache.org/documentation/latest/csi/
cni chaining, bandwidth plugin
You should reconsider supporting cni 0.3.0, so people can use this cni bandwidth plugin[1] [1] https://github.com/containernetworking/plugins/tree/master/plugins/meta/bandwidth
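For context, the bandwidth plugin only works as a chained plugin, which requires the conflist format introduced with CNI 0.3.0. A sketch of such a network config (names, subnet and rates are illustrative; per the plugin README the rates are in bits per second and the bursts in bits):

```json
{
  "cniVersion": "0.3.1",
  "name": "cni-apps",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "ipam": { "type": "host-local", "subnet": "10.22.0.0/16" }
    },
    {
      "type": "bandwidth",
      "ingressRate": 10000000,
      "ingressBurst": 1000000,
      "egressRate": 10000000,
      "egressBurst": 1000000
    }
  ]
}
```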
RE: How to test if slrp is working correctly
No one able to help? ;)

-Original Message-
To: user
Subject: How to test if slrp is working correctly

I am testing with slrp and csi drivers after watching this video[1] from mesosphere. I would like to know how I can verify that the slrp is properly configured and working.

1. Can I use an api endpoint to query controller/list-volumes or do a controller/create-volume? I found this csc tool that can use a socket, however it does not work with some csi drivers (only with csinfs)[2].

After I disabled the endpoint authentication, the slrp does seem to launch these csi drivers. I have processes like this:

  793   790 0 Aug15 ? 00:00:00 ./csi-blockdevices
15298 15292 0 Aug15 ? 00:01:00 ./test-csi-plugin --available_capacity=2GB --work_dir=workdir
16292 16283 0 Aug15 ? 00:00:05 ./csilvm -unix-addr=unix:///run/csilvm.sock -volume-group VGtest
17639 17636 0 Aug15 ? 00:00:08 ./csinfs --endpoint unix://run/csinfs.sock --nodeid test --alsologtostderr --log_dir /tmp

[1] https://www.youtube.com/watch?v=zhALmyC3Om4
[2]
[root@m01 resource-providers]# csc --endpoint unix:///run/csinfs.sock identity plugin-info
"nfs.csi.k8s.io" "2.0.0"
[root@m01 resource-providers]# csc --endpoint unix:///run/csilvm.sock identity plugin-info
unknown service csi.v1.Identity
[root@m01 resource-providers]# csc --endpoint unix:///run/csiblock.sock identity plugin-info
unknown service csi.v1.Identity
How to test if slrp is working correctly
I am testing with slrp and csi drivers after watching this video[1] from mesosphere. I would like to know how I can verify that the slrp is properly configured and working.

1. Can I use an api endpoint to query controller/list-volumes or do a controller/create-volume? I found this csc tool that can use a socket, however it does not work with some csi drivers (only with csinfs)[2].

After I disabled the endpoint authentication, the slrp does seem to launch these csi drivers. I have processes like this:

  793   790 0 Aug15 ? 00:00:00 ./csi-blockdevices
15298 15292 0 Aug15 ? 00:01:00 ./test-csi-plugin --available_capacity=2GB --work_dir=workdir
16292 16283 0 Aug15 ? 00:00:05 ./csilvm -unix-addr=unix:///run/csilvm.sock -volume-group VGtest
17639 17636 0 Aug15 ? 00:00:08 ./csinfs --endpoint unix://run/csinfs.sock --nodeid test --alsologtostderr --log_dir /tmp

[1] https://www.youtube.com/watch?v=zhALmyC3Om4
[2]
[root@m01 resource-providers]# csc --endpoint unix:///run/csinfs.sock identity plugin-info
"nfs.csi.k8s.io" "2.0.0"
[root@m01 resource-providers]# csc --endpoint unix:///run/csilvm.sock identity plugin-info
unknown service csi.v1.Identity
[root@m01 resource-providers]# csc --endpoint unix:///run/csiblock.sock identity plugin-info
unknown service csi.v1.Identity
RE: mesos csi test plugin slrp 401 Unauthorized
If I disable authenticate_http_readwrite and authenticate_http_readonly, my test slrp's are indeed loaded and I see tasks running. Launching these tasks as described on the manual page via curl[1] also fails: the task is not running, but I can see that the curl command's json is being put in the resource-providers dir. So please, some info on how to get this working with authenticate_http_readwrite and authenticate_http_readonly enabled.

[1]
curl --user xxx:xxx -X POST -H 'Content-Type: application/json' http://m01.local:5051/api/v1 -d '{"type":"ADD_RESOURCE_PROVIDER_CONFIG","add_resource_provider_config":{ "info":

-Original Message-
To: user
Subject: mesos csi test plugin slrp 401 Unauthorized

I am testing with this[1] and getting:

Failed to recover resource provider with type 'org.apache.mesos.rp.local.storage' and name 'test_slrp': Failed to get containers: Unexpected response '401 Unauthorized' (401 Unauthorized.)

Is this because I have authentication on, and the standalone container cannot launch? How do I resolve this?

[1] http://mesos.apache.org/documentation/latest/csi/
A more practical guide on how to configure and get csi working (preferably with ceph)
Can anyone point me to a more practical guide on how to configure and get csi working (preferably with ceph)?
test-csi-plugin should work?
This option has no effect when using the HTTP scheduler/executor APIs. By default, this option is true. (default: true)

--log_dir=VALUE         Location to put log files. By default, nothing is written to disk. Does not affect logging to stderr. If specified, the log file will appear in the Mesos WebUI. NOTE: 3rd party log messages (e.g. ZooKeeper) are only written to stderr!
--logbufsecs=VALUE      Maximum number of seconds that logs may be buffered for. By default, logs are flushed immediately. (default: 0)
--logging_level=VALUE   Log message at or above this level. Possible values: `INFO`, `WARNING`, `ERROR`. If `--quiet` is specified, this will only affect the logs written to `--log_dir`, if specified. (default: INFO)
--[no-]quiet            Disable logging to stderr. (default: false)
--volume_metadata=VALUE The static properties to add to the contextual information of each volume. The metadata are specified as a semicolon-delimited list of prop=value pairs. (Example: 'prop1=value1;prop2=value2')
--volumes=VALUE         Creates preprovisioned volumes upon start-up. The volumes are specified as a semicolon-delimited list of name:capacity pairs. If a volume with the same name already exists, the pair will be ignored. (Example: 'volume1:1GB;volume2:2GB')
--work_dir=VALUE        Path to the work directory of the plugin. (default: )

*** Error in `/usr/libexec/cni/test-csi-plugin': free(): invalid pointer: 0x7f5e1ea25a10 ***
=== Backtrace: ===
/lib64/libc.so.6(+0x81299)[0x7f5e18dcc299]
/usr/libexec/cni/test-csi-plugin(_ZN9__gnu_cxx13new_allocatorIPNSt8__detail15_Hash_node_baseEE10deallocateEPS3_m+0x20)[0x5631f93bc1b0]
/usr/libexec/cni/test-csi-plugin(_ZNSt10_HashtableISsSt4pairIKSsSsESaIS2_ENSt8__detail10_Select1stESt8equal_toISsESt4hashISsENS4_18_Mod_range_hashingENS4_20_Default_ranged_hashENS4_20_Prime_rehash_policyENS4_17_Hashtable_traitsILb1ELb0ELb121_M_deallocate_bucketsEPPNS4_15_Hash_node_baseEm+0x58)[0x5631f93b2772]
/usr/libexec/cni/test-csi-plugin(_ZNSt10_HashtableISsSt4pairIKSsSsESaIS2_ENSt8__detail10_Select1stESt8equal_toISsESt4hashISsENS4_18_Mod_range_hashingENS4_20_Default_ranged_hashENS4_20_Prime_rehash_policyENS4_17_Hashtable_traitsILb1ELb0ELb1D2Ev+0x36)[0x5631f93a597c]
/usr/libexec/cni/test-csi-plugin(_ZNSt13unordered_mapISsSsSt4hashISsESt8equal_toISsESaISt4pairIKSsSsEEED1Ev+0x18)[0x5631f9399eb0]
/usr/libexec/cni/test-csi-plugin(_ZN7hashmapISsSsSt4hashISsESt8equal_toISsEED1Ev+0x18)[0x5631f9399eca]
/lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f5e18d8505a]
/usr/local/lib/libmesos-1.10.0.so(+0x22b34f3)[0x7f5e1be074f3]
=== Memory map: ===
5631f9315000-5631f9442000 r-xp fd:00 507586 /usr/libexec/cni/test-csi-plugin
5631f9642000-5631f9646000 r--p 0012d000 fd:00 507586 /usr/libexec/cni/test-csi-plugin
5631f9646000-5631f9647000 rw-p 00131000 fd:00 507586 /usr/libexec/cni/test-csi-plugin
5631fb041000-5631fb0a4000 rw-p 00:00 0 [heap]
7f5e0c00-7f5e0c021000 rw-p 00:00 0
7f5e0c021000-7f5e1000 ---p 00:00 0
7f5e130ea000-7f5e1314a000 r-xp fd:00 16872768 /usr/lib64/libpcre.so.1.2.0
7f5e1314a000-7f5e1334a000 ---p 0006 fd:00 16872768 /usr/lib64/libpcre.so.1.2.0
7f5e1334a000-7f5e1334b000 r--p 0006 fd:00 16872768 /usr/lib64/libpcre.so.1.2.0
7f5e1334b000-7f5e1334c000 rw-p 00061000 fd:00 16872768 /usr/lib64/libpcre.so.1.2.0
mesos csi test plugin slrp 401 Unauthorized
I am testing with this[1] and getting:

Failed to recover resource provider with type 'org.apache.mesos.rp.local.storage' and name 'test_slrp': Failed to get containers: Unexpected response '401 Unauthorized' (401 Unauthorized.)

Is this because I have authentication on, and the standalone container cannot launch? How do I resolve this?

[1] http://mesos.apache.org/documentation/latest/csi/
RE: srv port lookups on tasks with cni networks
"container": {
  "type": "MESOS",
  "portMappings": [
    {"hostPort": 0, "name": "https", "protocol": "tcp", "networkNames": ["cni-apps"]},
    {"hostPort": 0, "name": "metrics", "protocol": "tcp", "networkNames": ["cni-apps"]}
  ],

-Original Message-
To: user
Subject: srv port lookups on tasks with cni networks

How can I assign random ports to a cni network and read these back via srv? What is the equivalent of portDefinitions at network/host for network/container?

-Original Message-
To: user
Subject: health check not working after changing host network

If I change a task from:

"networks": [ { "mode": "host" } ],
"portDefinitions": [
  {"port": 0, "name": "health", "protocol": "tcp"},
  {"port": 0, "name": "metrics", "protocol": "tcp"}
],

to:

"networks": [ { "mode": "container", "name": "cni-storage" } ],
"portDefinitions": [
  {"port": 0, "name": "health", "protocol": "tcp"},
  {"port": 0, "name": "metrics", "protocol": "tcp"}
],

I am getting this error:

W0804 23:18:12.942282 3421440 health_checker.cpp:273] HTTP health check for task 'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1' failed: curl exited with status 7: curl: (7) Failed connect to 127.0.0.1:0; Connection refused
I0804 23:18:12.942337 3421440 health_checker.cpp:299] Ignoring failure of HTTP health check for task 'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1': still in grace period

But when I disable the health check and enter the network namespace of the running task, this localhost check is working:

[@ c59ea592-322f-4bfc-8981-21215904da58]# curl http://localhost:52684/test
200 OK Service ready.

What am I doing wrong?
Are cni networks launched in the sequence in which they are configured?
I was wondering whether CNI networks are always applied in sequence. I am seeing the same order of eth0, eth1, etc. But is it true that the 2nd network is only created once the first has been successfully completed/attached?
RE: Asymmetric route possible between agent and container?
I was just giving this setting a try, but my test task does not want to launch. Do I always have to set this in combination with domain_socket_location? I am getting "Failed to synchronize with agent (it's probably exited)". This should not affect the way CNI networks are provisioned, e.g. via DHCP?

-Original Message-
To: user
Subject: Re: Asymmetric route possible between agent and container?

I think you could try the flag `http_executor_domain_sockets`, which was introduced in Mesos 1.10.0.

--http_executor_domain_sockets: If true, the agent will provide a unix domain socket that the executor can use to connect to the agent, instead of relying on a TCP connection.

Regards,
Qian Zhang

On Sat, Aug 8, 2020 at 4:59 PM Marc Roos wrote:

"it is imperative that the Agent IP is reachable from the container IP and vice versa."[1] Has anyone tested whether this can be an asymmetric route when you have multiple networks?

[1] http://mesos.apache.org/documentation/latest/cni/
Asymmetric route possible between agent and container?
"it is imperative that the Agent IP is reachable from the container IP and vice versa."[1] Has anyone tested whether this can be an asymmetric route when you have multiple networks?

[1] http://mesos.apache.org/documentation/latest/cni/
Error "networkNames must be a single item list when hostPort is specified and more than 1 container network is defined"
I am getting this error message when launching a task with portMappings and two container networks. What is the proper way to configure this?

general: networkNames must be a single item list when hostPort is specified and more than 1 container network is defined

"networks": [
  { "mode": "container", "name": "cni-storage" },
  { "mode": "container", "name": "cni-apps-public", "labels": {"vendorid": "ext-testing"} }
],
"container": {
  "type": "MESOS",
  "portMappings": [
    {"hostPort": 0, "name": "health", "protocol": "tcp"},
    {"hostPort": 0, "name": "metrics", "protocol": "tcp"}
  ]
}

[1] https://jira.d2iq.com/browse/MARATHON-8760
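A sketch of what the error message itself seems to ask for (untested; pinning both ports to cni-storage is an assumption — each mapping just has to name exactly one of the defined networks via networkNames):

```json
{
  "networks": [
    { "mode": "container", "name": "cni-storage" },
    { "mode": "container", "name": "cni-apps-public", "labels": {"vendorid": "ext-testing"} }
  ],
  "container": {
    "type": "MESOS",
    "portMappings": [
      {"hostPort": 0, "name": "health", "protocol": "tcp", "networkNames": ["cni-storage"]},
      {"hostPort": 0, "name": "metrics", "protocol": "tcp", "networkNames": ["cni-storage"]}
    ]
  }
}
```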
srv port lookups on tasks with cni networks
How can I assign random ports to a task on a CNI network and read them back via SRV lookups? What is the equivalent of portDefinitions at network/host for network/container?

-Original Message-
To: user
Subject: health check not working after changing host network

If I change a task from:

"networks": [ { "mode": "host" } ],
"portDefinitions": [ {"port": 0, "name": "health", "protocol": "tcp"}, {"port": 0, "name": "metrics", "protocol": "tcp"} ],

to:

"networks": [ { "mode": "container", "name": "cni-storage" } ],
"portDefinitions": [ {"port": 0, "name": "health", "protocol": "tcp"}, {"port": 0, "name": "metrics", "protocol": "tcp"} ],

I am getting this error:

W0804 23:18:12.942282 3421440 health_checker.cpp:273] HTTP health check for task 'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1' failed: curl exited with status 7: curl: (7) Failed connect to 127.0.0.1:0; Connection refused
I0804 23:18:12.942337 3421440 health_checker.cpp:299] Ignoring failure of HTTP health check for task 'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1': still in grace period

But when I disable the health check and enter the network namespace of the running task, this localhost check works:

[@ c59ea592-322f-4bfc-8981-21215904da58]# curl http://localhost:52684/test
200 OK Service ready.

What am I doing wrong?
health check not working after changing host network
If I change a task from:

"networks": [ { "mode": "host" } ],
"portDefinitions": [ {"port": 0, "name": "health", "protocol": "tcp"}, {"port": 0, "name": "metrics", "protocol": "tcp"} ],

to:

"networks": [ { "mode": "container", "name": "cni-storage" } ],
"portDefinitions": [ {"port": 0, "name": "health", "protocol": "tcp"}, {"port": 0, "name": "metrics", "protocol": "tcp"} ],

I am getting this error:

W0804 23:18:12.942282 3421440 health_checker.cpp:273] HTTP health check for task 'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1' failed: curl exited with status 7: curl: (7) Failed connect to 127.0.0.1:0; Connection refused
I0804 23:18:12.942337 3421440 health_checker.cpp:299] Ignoring failure of HTTP health check for task 'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1': still in grace period

But when I disable the health check and enter the network namespace of the running task, this localhost check works:

[@ c59ea592-322f-4bfc-8981-21215904da58]# curl http://localhost:52684/test
200 OK Service ready.

What am I doing wrong?
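One hedged guess, based on Marathon's container-networking model: portDefinitions only applies to host-mode apps (curl hitting 127.0.0.1:0 suggests no port was ever resolved), while container-mode apps declare their ports as portMappings under container, which a health check then references via portIndex. A minimal sketch — the containerPort value 52684 is taken from the curl test above, and MESOS_HTTP plus the /test path are assumptions:

```json
{
  "networks": [ { "mode": "container", "name": "cni-storage" } ],
  "container": {
    "type": "MESOS",
    "portMappings": [
      {"containerPort": 52684, "hostPort": 0, "name": "health", "protocol": "tcp"},
      {"containerPort": 0, "hostPort": 0, "name": "metrics", "protocol": "tcp"}
    ]
  },
  "healthChecks": [
    { "protocol": "MESOS_HTTP", "path": "/test", "portIndex": 0 }
  ]
}
```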
Ceph support planned?
Is native Ceph support in the planning? Libvirt supports Ceph via librbd[1]. What is currently the best practice for using Ceph storage?

[1] https://docs.ceph.com/docs/master/rbd/libvirt/
mesos master default drop acl
Currently I am running a testing environment with some default ACLs I found[1]. I have configured mesos-credentials, and as far as I know everything (agents, the Marathon framework) is authenticating. So I thought about converting the ACLs to default drop/deny. However, I see there are quite a few options. Is it advisable to set them all to deny? Is there an example of how to set the URL for GetEndpoint?

[2] https://github.com/apache/mesos/blob/master/include/mesos/authorizer/acls.proto
http://mesos.apache.org/documentation/latest/configuration/master/

[1] { "run_tasks": [ { "principals": { "type": "ANY" }, "users": { "type": "ANY" } } ], "register_frameworks": [ { "principals": { "type": "ANY" }, "roles": { "type": "ANY" } } ] }
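A hedged sketch of a default-deny setup, based on the fields in acls.proto: setting "permissive" to false makes every request not matched by an ACL be denied, and the GetEndpoint ACL takes the URL path in a "paths" entity. The 'marathon' principal and the listed paths here are placeholders, not values from this thread:

```json
{
  "permissive": false,
  "register_frameworks": [
    { "principals": { "values": ["marathon"] }, "roles": { "type": "ANY" } }
  ],
  "run_tasks": [
    { "principals": { "values": ["marathon"] }, "users": { "type": "ANY" } }
  ],
  "get_endpoints": [
    { "principals": { "type": "ANY" }, "paths": { "values": ["/metrics/snapshot", "/flags"] } }
  ]
}
```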
Fyi: nginx+ srv lookups also now available in basic nginx
For anyone who is interested: I was surprised that nginx does not offer SRV lookups in their free version. I found a module that offered this; however, it did not work because of syntax differences in SRV lookups on Mesos. I adapted this module to force sending a whole SRV domain, and tests look promising. You can find the module here for now; remember that if you have groups in Mesos, use service=_https._synapse.dev._tcp.marathon.mesos. (note the '.' at the end)

https://github.com/f1-outsourcing/ngx_upstream_resolveMK

I have asked the Alpine Linux guys to add this to their repository, so we do not need to go through this compiling hassle every time.
RE: random string in task groups hostname
2nd, I have the impression that SRV records are not correctly implemented: should ._tcp not be at the front (right after the service) instead of in the middle? Or do I have something incorrect in my Mesos configuration that makes these groups act as part of the task name?

[@]$ dig +short _sip._udp.sip.voice.google.com SRV
20 1 5060 sip-anycast-2.voice.google.com.
10 1 5060 sip-anycast-1.voice.google.com.

bash-5.0# dig +short @192.168.10.14 _tcp.server.temp.test.marathon.mesos
bash-5.0# dig +short @192.168.10.14 server._tcp.temp.test.marathon.mesos
bash-5.0# dig +short @192.168.10.14 _server._tcp.temp.test.marathon.mesos

-Original Message-
To: user
Subject: random string in task groups hostname

I cannot remember seeing this before. I wondered whether this is common and whether it should be like this. In SRV lookups I am getting a random string in the group. Why is test appended with '-grxx9-s0'?

[@~]$ dig +short @192.168.10.14 server.temp.test.marathon.mesos
192.168.10.151
[@~]$ dig +short @192.168.10.14 _server.temp.test._tcp.marathon.mesos SRV
0 1 31682 server.temp.test-grxx9-s0.marathon.mesos.
random string in task groups hostname
I cannot remember seeing this before. I wondered whether this is common and whether it should be like this. In SRV lookups I am getting a random string in the group. Why is test appended with '-grxx9-s0'?

[@~]$ dig +short @192.168.10.14 server.temp.test.marathon.mesos
192.168.10.151
[@~]$ dig +short @192.168.10.14 _server.temp.test._tcp.marathon.mesos SRV
0 1 31682 server.temp.test-grxx9-s0.marathon.mesos.
RE: getting correct metrics port from SRV records.
Oops ;)

[@test2 image-synapse]$ dig +short @192.168.10.14 _metrics._synapse.dev._tcp.marathon.mesos SRV
0 1 31032 synapse.dev-nppzf-s0.marathon.mesos.

-Original Message-
To: user
Subject: getting correct metrics port from SRV records.

Is there a way to identify the correct port via DNS? I have created a task with two ports[1], but a DNS SRV query does not show anything other than the port numbers. How can I identify the correct port? The mesos-master tasks endpoint[3] shows the port names; is there a way to get these from DNS?

[1] "networks": [ { "mode": "host"} ], "portDefinitions": [{"port": 0, "name": "https", "protocol": "tcp"}, {"port": 0, "name": "metrics", "protocol": "tcp"}]

[2] [@test2 image-synapse]$ dig +short @192.168.10.14 _synapse.dev._tcp.marathon.mesos SRV
0 1 31031 synapse.dev-nppzf-s0.marathon.mesos.
0 1 31032 synapse.dev-nppzf-s0.marathon.mesos.

[3] mesos-master /tasks/ "discovery": { "visibility": "FRAMEWORK", "name": "synapse.dev", "ports": { "ports": [ { "number": 31031, "name": "https", "protocol": "tcp" }, { "number": 31032, "name": "metrics", "protocol": "tcp" } ] } },
getting correct metrics port from SRV records.
Is there a way to identify the correct port via DNS? I have created a task with two ports[1], but a DNS SRV query does not show anything other than the port numbers. How can I identify the correct port? The mesos-master tasks endpoint[3] shows the port names; is there a way to get these from DNS?

[1] "networks": [ { "mode": "host"} ], "portDefinitions": [{"port": 0, "name": "https", "protocol": "tcp"}, {"port": 0, "name": "metrics", "protocol": "tcp"}]

[2] [@test2 image-synapse]$ dig +short @192.168.10.14 _synapse.dev._tcp.marathon.mesos SRV
0 1 31031 synapse.dev-nppzf-s0.marathon.mesos.
0 1 31032 synapse.dev-nppzf-s0.marathon.mesos.

[3] mesos-master /tasks/ "discovery": { "visibility": "FRAMEWORK", "name": "synapse.dev", "ports": { "ports": [ { "number": 31031, "name": "https", "protocol": "tcp" }, { "number": 31032, "name": "metrics", "protocol": "tcp" } ] } },
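For anyone hitting the same question: per the follow-up in this thread, mesos-dns exposes each named port as its own SRV record of the form _&lt;port-name&gt;._&lt;task-name&gt;._tcp.&lt;domain&gt;. A small sketch of building that query name (the names are taken from the examples above; the dig call itself of course needs a reachable mesos-dns resolver):

```shell
# Build the per-port SRV name that mesos-dns publishes for a named port.
app="synapse.dev"      # task name as registered in mesos-dns
port_name="metrics"    # port name from portDefinitions
srv="_${port_name}._${app}._tcp.marathon.mesos"
echo "$srv"            # prints: _metrics._synapse.dev._tcp.marathon.mesos
# Then query it against your mesos-dns resolver, e.g.:
# dig +short @192.168.10.14 "$srv" SRV
```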
RE: fyi: mesos-dns is not registering all ip addresses
Hi Alex,

My config.json is quite similar, but has "IPSources": ["netinfo", "mesos", "host"]. You will only run into this issue when you have multihomed tasks, with two or more network adapters (eth0, eth1, etc.).

-Original Message-
From: Alex Evonosky [mailto:alex.evono...@gmail.com]
Sent: Monday 27 July 2020 14:36
To: user@mesos.apache.org
Subject: Re: fyi: mesos-dns is not registering all ip addresses

Thank you. We have been running mesos-dns for years now without any issues. The docker apps spin up on Marathon and automatically get picked up by mesos-dns. This is our config.json:

{ "zk": "zk://10.10.10.51:2181,10.10.10.52:2181,10.10.10.53:2181/mesos", "masters": ["10.10.10.51:5050", "10.10.10.52:5050", "10.10.10.53:5050"], "refreshSeconds": 3, "ttl": 3, "domain": "mesos", "port": 53, "resolvers": ["10.10.10.88", "10.10.10.86"], "timeout": 3, "httpon": true, "dnson": true, "httpport": 8123, "externalon": true, "listener": "0.0.0.0", "SOAMname": "ns1.mesos", "SOARname": "root.ns1.mesos", "SOARefresh": 5, "SOARetry": 600, "SOAExpire": 86400, "SOAMinttl": 5, "IPSources": ["mesos", "host"] }

We just have our main DNS resolvers carry a zone "mesos.marathon" and forward those requests to this cluster.

On Mon, Jul 27, 2020 at 3:56 AM Marc Roos wrote:

I am not sure whether mesos-dns is discontinued, but for those still using it: in some cases it does not register all of a task's IP addresses. The default[2] works, but if you have this setup[1] it will only register one IP address (192.168.122.140) and not the 2nd.
I filed an issue a year ago or so[3]

[3] https://github.com/mesosphere/mesos-dns/issues/54145
https://issues.apache.org/jira/browse/MESOS-10164

[1] "network_infos": [ { "ip_addresses": [ { "protocol": "IPv4", "ip_address": "192.168.122.140" } ] }, { "ip_addresses": [ { "protocol": "IPv4", "ip_address": "192.168.10.17" } ], } ]

[2] "network_infos": [ { "ip_addresses": [ { "protocol": "IPv4", "ip_address": "12.0.1.2" }, { "protocol": "IPv6", "ip_address": "fd01:b::1:8000:2" } ], } ]
fyi: mesos-dns is not registering all ip addresses
I am not sure whether mesos-dns is discontinued, but for those still using it: in some cases it does not register all of a task's IP addresses. The default[2] works, but if you have this setup[1] it will only register one IP address (192.168.122.140) and not the 2nd. I filed an issue a year ago or so[3]

[3] https://github.com/mesosphere/mesos-dns/issues/54145
https://issues.apache.org/jira/browse/MESOS-10164

[1] "network_infos": [ { "ip_addresses": [ { "protocol": "IPv4", "ip_address": "192.168.122.140" } ] }, { "ip_addresses": [ { "protocol": "IPv4", "ip_address": "192.168.10.17" } ], } ]

[2] "network_infos": [ { "ip_addresses": [ { "protocol": "IPv4", "ip_address": "12.0.1.2" }, { "protocol": "IPv6", "ip_address": "fd01:b::1:8000:2" } ], } ]
Mesos syslog logging to error level instead of info?
I have my Mesos test cluster on again, and mesos-master logs are ending up in the wrong logs. I think Mesos is not logging to the correct levels/facility (using mesos-1.10.0-2.0.1.el7.x86_64). E.g. I am getting these INFO messages at level error:

Jul 24 12:25:16 m01 mesos-master[28922]: I0724 12:25:16.854624 28955 master.cpp:8889] Performing explicit task state reconciliation for 1 tasks of framework 43d5a67d-8c4e-496e-a108-5cfeb10b8967- (marathon) at scheduler-a9897343-98ee-4c31-a715-1b5e96e296bb@192.168.10.22:41009
Jul 24 12:25:20 m01 mesos-master[28922]: I0724 12:25:20.557858 28957 authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
Jul 24 12:25:24 m01 mesos-master[28922]: I0724 12:25:24.738281 28957 authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
Jul 24 12:25:26 m01 mesos-master[28922]: I0724 12:25:26.547469 28958 authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
Jul 24 12:25:26 m01 mesos-master[28922]: I0724 12:25:26.554080 28961 http.cpp:1436] HTTP GET for /master/state?jsonp=angular.callbacks._fmv from 192.168.10.219:49885 with User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
Jul 24 12:25:26 m01 mesos-master[28922]: I0724 12:25:26.556207 28956 http.cpp:1453] HTTP GET for /master/state?jsonp=angular.callbacks._fmv from 192.168.10.219:49885: '200 OK' after 2.46784ms
Jul 24 12:25:26 m01 mesos-master[28922]: I0724 12:25:26.582295 28955 http.cpp:1436] HTTP GET for /master/maintenance/schedule?jsonp=angular.callbacks._fmw from 192.168.10.219:63372 with User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
Jul 24 12:25:30 m01 mesos-master[28922]: I0724 12:25:30.635844 28955 authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
Jul 24 12:25:31 m01 mesos-master[28922]: I0724 12:25:31.874604 28955 master.cpp:8889] Performing explicit task state reconciliation for 1 tasks of framework 43d5a67d-8c4e-496e-a108-5cfeb10b8967- (marathon) at scheduler-a9897343-98ee-4c31-a715-1b5e96e296bb@192.168.10.22:41009
Jul 24 12:25:34 m01 mesos-master[28922]: I0724 12:25:34.816028 28958 authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
Jul 24 12:25:36 m01 mesos-master[28922]: I0724 12:25:36.625381 28955 authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint '/metrics/snapshot'
Jul 24 12:25:36 m01 mesos-master[28922]: I0724 12:25:36.632581 28956 http.cpp:1436] HTTP GET for /master/state?jsonp=angular.callbacks._fn0 from 192.168.10.219:49885 with User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
Jul 24 12:25:36 m01 mesos-master[28922]: I0724 12:25:36.634801 28959 http.cpp:1453] HTTP GET for /master/state?jsonp=angular.callbacks._fn0 from 192.168.10.219:49885: '200 OK' after 2.55488ms
Jul 24 12:25:36 m01 mesos-master[28922]: I0724 12:25:36.687845 28958 http.cpp:1436] HTTP GET for /master/maintenance/schedule?jsonp=angular.callbacks._fn1 from 192.168.10.219:63372 with User-Agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
RE: Advice on alternative for marathon framework
Thanks Tomek, I have it running and am giving it a try.

-Original Message-
To: user
Subject: Re: Advice on alternative for marathon framework

You can try https://github.com/HubSpot/Singularity

Aurora was moved to the attic <https://attic.apache.org/>

Wed, 15 Jul 2020 at 16:29, Marc Roos wrote:

I have been having problems[1] getting Marathon to run since March (I can only run 1.7), and the only emails I receive from d2iq are requests to rate their support. I wonder whether Marathon is still the best framework to use with Mesos. I have Aurora running, but it looks to have fewer options. What I like about the Marathon framework is of course the web interface and some plugins that allowed me to use capabilities. I know I should/could launch applications directly in Mesos via the command line, but I am just starting with Mesos and prefer to have a GUI for now. Can anyone advise on a good alternative to Marathon?

[1] https://jira.d2iq.com/browse/MARATHON-8729
https://github.com/mesosphere/marathon/issues/7136
Advice on alternative for marathon framework
I have been having problems[1] getting Marathon to run since March (I can only run 1.7), and the only emails I receive from d2iq are requests to rate their support. I wonder whether Marathon is still the best framework to use with Mesos. I have Aurora running, but it looks to have fewer options. What I like about the Marathon framework is of course the web interface and some plugins that allowed me to use capabilities. I know I should/could launch applications directly in Mesos via the command line, but I am just starting with Mesos and prefer to have a GUI for now. Can anyone advise on a good alternative to Marathon?

[1] https://jira.d2iq.com/browse/MARATHON-8729
https://github.com/mesosphere/marathon/issues/7136
problems running marathon >=1.8 on mesos
I am cross-posting this to mesos-users, hoping someone has come across this issue and can help me resolve it. There are several JIRA issues open with similar symptoms. All of a sudden I am having problems with the Marathon UI getting stuck at 'loading', and endpoints like http://m01.local:8081/v2/info are not responding (http://m01.local:8081/ping). I have now downgraded the test cluster to one node, running only mesos-master, zookeeper and marathon, cleaning /var/lib/zookeeper and the /var/lib/mesos directories between tests. I have also removed many of the configuration options I had, like ssl etc. I am only able to get marathon-1.7.216-9e2a9b579 to run; marathon-1.8.222-86475ddac and marathon-1.10.17-c427ce965 have the above-mentioned errors/problem. I have been comparing the marathon 1.7 and marathon 1.8 logs, and this is what I noticed: there are quite a few log statements missing between 'All services up and running. (mesosphere.marathon.MarathonApp:main' and 'akka://marathon/deadLetters' in the 1.8 log. Has anyone had something similar?

[@mesos-master]# rpm -qa | grep java
python-javapackages-3.4.1-11.el7.noarch
tzdata-java-2020a-1.el7.noarch
java-1.8.0-openjdk-headless-1.8.0.252.b09-2.el7_8.x86_64
javapackages-tools-3.4.1-11.el7.noarch
[@mesos-master]# uname -a
Linux m01.local 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[@mesos-master]# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)

marathon 1.8 (unresponsive)
===
Jun 7 17:40:59 m01 marathon: [2020-06-07 17:40:59,696] INFO All services up and running.
(mesosphere.marathon.MarathonApp:main)
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,833] INFO initiate task reconciliation (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-9)
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,854] INFO Requesting task reconciliation with the Mesos master (mesosphere.marathon.SchedulerActions:scheduler-actions-thread-0)
Jun 7 17:41:13 m01 mesos-master[11203]: I0607 17:41:13.858621 11227 master.cpp:8846] Performing implicit task state reconciliation for framework f5d67e06-6600-4fb9-94dc-a878be2563be- (marathon) at scheduler-6d98d1e0-a7d2-4517-a0ce-5819a36414c9@192.168.10.151:36941
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,864] INFO task reconciliation has finished (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-4)
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,879] INFO Message [mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from Actor[akka://marathon/user/MarathonScheduler/$a#1746491390] to Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters encountered. If this is not an expected behavior, then [Actor[akka://marathon/deadLetters]] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
(akka.actor.DeadLetterActorRef:marathon-akka.actor.default-dispatcher-7)
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,910] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-7)
Jun 7 17:41:13 m01 mesos-master[11203]: I0607 17:41:13.914615 11228 master.cpp:8889] Performing explicit task state reconciliation for 1 tasks of framework f5d67e06-6600-4fb9-94dc-a878be2563be- (marathon) at scheduler-6d98d1e0-a7d2-4517-a0ce-5819a36414c9@192.168.10.151:36941
Jun 7 17:41:13 m01 marathon: [2020-06-07 17:41:13,924] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-13)
Jun 7 17:41:28 m01 marathon: [2020-06-07 17:41:28,939] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-4)
Jun 7 17:41:28 m01 mesos-master[11203]: I0607 17:41:28.946494 11229 master.cpp:8889] Performing explicit task state reconciliation for 1 tasks of framework f5d67e06-6600-4fb9-94dc-a878be2563be- (marathon) at scheduler-6d98d1e0-a7d2-4517-a0ce-5819a36414c9@192.168.10.151:36941
Jun 7 17:41:28 m01 marathon: [2020-06-07 17:41:28,950] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-14)

marathon 1.7 (ok)
=
Jun 7 17:37:02 m01 marathon: [2020-06-07 17:37:02,681] INFO All services up and running. (mesosphere.marathon.MarathonApp:main)
Jun 7 17:37:06 m01 marathon: [2020-06-07 17:37:06,222] INFO Received TimedCheck (mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.actor.default-dispatcher-8)
Jun 7 17:37:06 m01 marathon: [2020-06-07 17:37:06,228] INFO => revive offers NOW, canceling
RE: No offers are being made -- how to debug Mesos?
Have you already set these to debug?

[@ ]# cat /etc/mesos-master/logging_level
WARNING
[@ ]# cat /etc/mesos-slave/logging_level
WARNING

-Original Message-
From: Benjamin Wulff [mailto:benjamin.wulff...@ieee.org]
Sent: Saturday 6 June 2020 13:36
To: user@mesos.apache.org
Subject: No offers are being made -- how to debug Mesos?

Hi all,

I'm in the process of setting up my first Mesos cluster with 1x master and 3x slaves on CentOS 8. So far I have set up Zookeeper and mesos-master on the master and mesos-slave on one of the compute nodes. Mesos-master communicates with ZK and becomes leader. Then I started mesos-slave on the compute node and can see in the log that it registers at the master with the correct resources reported. The agent and its resources are also displayed in the web UI of the master. So is the framework that I want to use. The crux is that no tasks I schedule in the framework are executed, and I suppose this is because the framework never receives an offer. I can see in the web UI that no offers are made and that all resources remain idle. Now, I'm new to Mesos and I don't really have an idea how to debug my setup at this point. There is a page called 'Debugging with the new CLI' in the documentation, but it only explains how to configure the CLI command. Any directions on how to debug my situation in general, or on how to use the CLI for debugging, would be highly welcome! :)

Thanks and best regards,
Ben
RE: Subject: [VOTE] Release Apache Mesos 1.10.0 (rc1)
* ability for an executor to communicate with an agent via Unix domain socket instead of TCP

I think this will solve my problem with tasks running on a different IP, which I was handling via a local route. But somehow this route was not being used by Mesos, while pings to the network namespace were OK.

-Original Message-
From: Qian Zhang [mailto:zhq527...@gmail.com]
Sent: Thursday 28 May 2020 2:57
To: user
Cc: dev
Subject: Re: Subject: [VOTE] Release Apache Mesos 1.10.0 (rc1)

+1 (binding)

Regards,
Qian Zhang

On Thu, May 28, 2020 at 12:56 AM Benjamin Mahler wrote:

+1 (binding)

On Mon, May 18, 2020 at 4:36 PM Andrei Sekretenko wrote:

Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.10.0.

1.10.0 includes the following major improvements:
* support for resource bursting (setting task resource limits separately from requests) on Linux
* ability for an executor to communicate with an agent via Unix domain socket instead of TCP
* ability for operators to modify reservations via the RESERVE_RESOURCES master API call
* performance improvements of V1 operator API read-only calls bringing them on par with V0 HTTP endpoints
* ability for a scheduler to expect that effects of calls sent through the same connection will not be reordered/interleaved by master

NOTE: 1.10.0 includes a breaking change for custom authorizer modules. Now, `ObjectApprover`s may be stored by Mesos indefinitely and must be kept up-to-date by an authorizer throughout their lifetime. This allowed for several bugfixes and performance improvements.
The CHANGELOG for the release is available at: https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.10.0-rc1 The candidate for Mesos 1.10.0 release is available at: https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz The tag to be voted on is 1.10.0-rc1: https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.10.0-rc1 The SHA512 checksum of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz.sha512 The signature of the tarball can be found at: https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz.asc The PGP key used to sign the release is here: https://dist.apache.org/repos/dist/release/mesos/KEYS The JAR is in a staging repository here: https://repository.apache.org/content/repositories/orgapachemesos-1259 Please vote on releasing this package as Apache Mesos 1.10.0! The vote is open until Fri, May 21, 19:00 CEST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Mesos 1.10.0 [ ] -1 Do not release this package because ... Thanks, Andrei Sekretenko
RE: Found no roles suitable for revive repetition.
Hi Benjamin,

Do you have the subscribe email address for the Marathon mailing list?

Thanks,
Marc

-Original Message-
From: Benjamin Mahler [mailto:bmah...@apache.org]
Sent: 18 March 2020 18:32
To: user
Subject: Re: Found no roles suitable for revive repetition.

Hi Marc, can you contact the Marathon mailing list or Slack channel? Also, if there is a question here or some more context, please include that so they know what you need help with.

On Wed, Mar 18, 2020 at 9:46 AM Marc Roos wrote:

Marathon is stuck on 'loading applications'

Mar 18 14:43:48 m01 marathon: [2020-03-18 14:43:48,646] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-30)
Mar 18 14:43:53 m01 marathon: [2020-03-18 14:43:53,321] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-9)
Mar 18 14:43:58 m01 marathon: [2020-03-18 14:43:58,324] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-6)
Mar 18 14:44:03 m01 marathon: [2020-03-18 14:44:03,321] INFO Found no roles suitable for revive repetition. (mesosphe
registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime.
I am getting these warnings; this was reported on Jira a long time ago already. How do I fix them?

der mesosphere.marathon.api.v2.PodsResource will be ignored. (org.glassfish.jersey.internal.inject.Providers:MarathonHttpService STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,785] WARN A provider mesosphere.marathon.api.v2.AppsResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider mesosphere.marathon.api.v2.AppsResource will be ignored. (org.glassfish.jersey.internal.inject.Providers:MarathonHttpService STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,787] WARN A provider mesosphere.marathon.api.v2.DeploymentsResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider mesosphere.marathon.api.v2.DeploymentsResource will be ignored. (org.glassfish.jersey.internal.inject.Providers:MarathonHttpService STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,789] WARN A provider mesosphere.marathon.api.v2.TasksResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider mesosphere.marathon.api.v2.TasksResource will be ignored. (org.glassfish.jersey.internal.inject.Providers:MarathonHttpService STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,792] WARN A provider mesosphere.marathon.api.v2.QueueResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider mesosphere.marathon.api.v2.QueueResource will be ignored.
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService STARTING) Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,798] WARN A provider mesosphere.marathon.api.v2.InfoResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider mesosphere.marathon.api.v2.InfoResource will be ignored. (org.glassfish.jersey.internal.inject.Providers:MarathonHttpService STARTING) Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,800] WARN A provider mesosphere.marathon.api.v2.LeaderResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider mesosphere.marathon.api.v2.LeaderResource will be ignored. (org.glassfish.jersey.internal.inject.Providers:MarathonHttpService STARTING) Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,803] WARN A provider mesosphere.marathon.api.v2.PluginsResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider mesosphere.marathon.api.v2.PluginsResource will be ignored. (org.glassfish.jersey.internal.inject.Providers:MarathonHttpService STARTING) Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,805] WARN A provider mesosphere.marathon.api.SystemResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider mesosphere.marathon.api.SystemResource will be ignored. (org.glassfish.jersey.internal.inject.Providers:MarathonHttpService STARTING) Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,805] WARN A provider mesosphere.marathon.api.v2.GroupsResource registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. 
Due to constraint configuration problems the provider mesosphere.marathon.api.v2.GroupsResource will be ignored. (org.glassfish.jersey.internal.inject.Providers:MarathonHttpService STARTING)
Found no roles suitable for revive repetition.
Marathon is stuck on 'loading applications'.

Mar 18 14:43:48 m01 marathon: [2020-03-18 14:43:48,646] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-30)
Mar 18 14:43:53 m01 marathon: [2020-03-18 14:43:53,321] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-9)
Mar 18 14:43:58 m01 marathon: [2020-03-18 14:43:58,324] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-6)
Mar 18 14:44:03 m01 marathon: [2020-03-18 14:44:03,321] INFO Found no roles suitable for revive repetition. (mesosphe…
Failed to send 'mesos.internal.FrameworkErrorMessage'
I am getting these on a test setup where marathon and the mesos-master are running on the same node and iptables is not even configured.

W0222 23:03:48.829741 1112 process.cpp:1917] Failed to send 'mesos.internal.FrameworkErrorMessage' to '192.168.10.151:35530', connect: Failed connect, connection error: Connection refused
W0222 23:03:48.831212 1112 process.cpp:1917] Failed to send 'mesos.internal.FrameworkErrorMessage' to '192.168.10.151:35530', connect: Failed to connect to 192.168.10.151:35530: Connection refused
W0222 23:05:41.584399 1112 process.cpp:1917] Failed to send 'mesos.internal.FrameworkErrorMessage' to '192.168.10.151:42877', connect: Failed connect, connection error: Connection refused
W0222 23:05:41.584664 1112 process.cpp:1917] Failed to send 'mesos.internal.FrameworkErrorMessage' to '192.168.10.151:42877', connect: Failed to connect to 192.168.10.151:42877: Connection refused

Marathon logs these:

Feb 22 23:30:54 m01 marathon: [2020-02-22 23:30:54,471] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-5)
Feb 22 23:30:59 m01 marathon: [2020-02-22 23:30:59,482] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-12)
Feb 22 23:31:04 m01 marathon: [2020-02-22 23:31:04,342] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-2)
Feb 22 23:31:04 m01 marathon: [2020-02-22 23:31:04,347] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-130)
Feb 22 23:31:04 m01 marathon: [2020-02-22 23:31:04,471] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-9)
RE: cni iptables best practice
What if I pay someone on your team privately, someone who maybe wants to do a bit of work at weekends? Maybe you can propose this to the members of your team that have worked on this in the past?

-Original Message-
Sent: 05 February 2020 16:51
To: user
Cc: zhq527725; support
Subject: Re: cni iptables best practice

Hi Marc, CNI 0.3 support is not on Mesosphere's near term roadmap given our other priorities. But if there's anyone in the community willing to work with you to develop it, as the Apache Mesos project, we'll be happy to accept the contribution (of course assuming it adheres to the project's quality standards).

On Wed, Feb 5, 2020 at 8:57 AM Marc Roos wrote:
Is this possible? I would like to start using mesos in production, to be honest.

-Original Message-
Sent: 30 January 2020 18:46
To: Qian Zhang
Cc: user; supp...@mesosphere.com
Subject: RE: cni iptables best practice

What about when I fund this? How much would it cost? Otherwise I need to spend time/money on making a custom cni plugin that does not even operate via standards.

PS. I do not see the point of getting some external programmer that first needs to acquire specific knowledge of this subject.

-Original Message-
Cc: user
Subject: Re: cni iptables best practice

I do not think we plan to do it in short term. Regards, Qian Zhang

On Tue, Jan 28, 2020 at 1:54 AM Marc Roos wrote:
Hi Qian, Any idea on when this cni 0.3 is going to be implemented? I saw the issue priority is Major; can't remember if it was always like this. But it looks promising. Regards, Marc

-Original Message-
Sent: 14 December 2019 09:46
To: user
Subject: RE: cni iptables best practice

Yes, yes, I know, disaster. I wondered how, or even if, people are using iptables with tasks. Even in an internal environment it could be nice to use, no?
RE: cni iptables best practice
Is this possible? I would like to start using mesos in production, to be honest.

-Original Message-
Sent: 30 January 2020 18:46
To: Qian Zhang
Cc: user; supp...@mesosphere.com
Subject: RE: cni iptables best practice

What about when I fund this? How much would it cost? Otherwise I need to spend time/money on making a custom cni plugin that does not even operate via standards.

PS. I do not see the point of getting some external programmer that first needs to acquire specific knowledge of this subject.

-Original Message-
Cc: user
Subject: Re: cni iptables best practice

I do not think we plan to do it in short term. Regards, Qian Zhang

On Tue, Jan 28, 2020 at 1:54 AM Marc Roos wrote:
Hi Qian, Any idea on when this cni 0.3 is going to be implemented? I saw the issue priority is Major; can't remember if it was always like this. But it looks promising. Regards, Marc

-Original Message-
Sent: 14 December 2019 09:46
To: user
Subject: RE: cni iptables best practice

Yes, yes, I know, disaster. I wondered how, or even if, people are using iptables with tasks. Even in an internal environment it could be nice to use, no?
Kill task, but not restarted
The instance was not showing in the Marathon GUI, so I killed the task with kill -KILL, assuming it would be restarted, yet it was not. I think it has to do with these messages. Why do I even have these, when I can just ping the hosts?

W0202 14:46:51.215673 359364 process.cpp:1480] Failed to link to '192.168.122.253:35071', connect: Failed connect: connection closed
W0202 14:46:51.217136 359364 process.cpp:1480] Failed to link to '192.168.122.95:41400', connect: Failed connect: connection closed
W0202 14:46:51.217594 359364 process.cpp:1480] Failed to link to '192.168.122.94:41974', connect: Failed connect: connection closed
W0202 14:46:51.218037 359364 process.cpp:1480] Failed to link to '192.168.122.13:33447', connect: Failed connect: connection closed

[@mesos]# ping -c 2 192.168.122.95
PING 192.168.122.95 (192.168.122.95) 56(84) bytes of data.
64 bytes from 192.168.122.95: icmp_seq=1 ttl=64 time=0.062 ms
64 bytes from 192.168.122.95: icmp_seq=2 ttl=64 time=0.051 ms
[@mesos]# ping -c 2 192.168.122.94
PING 192.168.122.94 (192.168.122.94) 56(84) bytes of data.
64 bytes from 192.168.122.94: icmp_seq=1 ttl=64 time=0.053 ms
64 bytes from 192.168.122.94: icmp_seq=2 ttl=64 time=0.045 ms
[@mesos]# ping -c 2 192.168.122.13
PING 192.168.122.13 (192.168.122.13) 56(84) bytes of data.
64 bytes from 192.168.122.13: icmp_seq=1 ttl=64 time=0.069 ms
64 bytes from 192.168.122.13: icmp_seq=2 ttl=64 time=0.051 ms
RE: cni iptables best practice
What about when I fund this? How much would it cost? Otherwise I need to spend time/money on making a custom cni plugin that does not even operate via standards.

PS. I do not see the point of getting some external programmer that first needs to acquire specific knowledge of this subject.

-Original Message-
Cc: user
Subject: Re: cni iptables best practice

I do not think we plan to do it in short term. Regards, Qian Zhang

On Tue, Jan 28, 2020 at 1:54 AM Marc Roos wrote:
Hi Qian, Any idea on when this cni 0.3 is going to be implemented? I saw the issue priority is Major; can't remember if it was always like this. But it looks promising. Regards, Marc

-Original Message-
Sent: 14 December 2019 09:46
To: user
Subject: RE: cni iptables best practice

Yes, yes, I know, disaster. I wondered how, or even if, people are using iptables with tasks. Even in an internal environment it could be nice to use, no?
RE: cni iptables best practice
Hi Qian, Any idea on when this cni 0.3 is going to be implemented? I saw the issue priority is Major; can't remember if it was always like this. But it looks promising. Regards, Marc

-Original Message-
Sent: 14 December 2019 09:46
To: user
Subject: RE: cni iptables best practice

Yes, yes, I know, disaster. I wondered how, or even if, people are using iptables with tasks. Even in an internal environment it could be nice to use, no?

-Original Message-
To: user
Subject: Re: cni iptables best practice

You are right, we do not support CNI chaining plugins yet, and I think there is a ticket to track it: https://issues.apache.org/jira/browse/MESOS-7079. Regards, Qian Zhang

On Sat, Dec 14, 2019 at 7:08 AM Marc Roos wrote:
Is anyone applying iptables rules in their cni networking, and how? I wrote an iptables chaining plugin but cannot use it because cni 0.3.0 is still not supported in mesos 1.9. I wondered how this is done currently.
RE: cni iptables best practice
Yes, yes, I know, disaster. I wondered how, or even if, people are using iptables with tasks. Even in an internal environment it could be nice to use, no?

-Original Message-
To: user
Subject: Re: cni iptables best practice

You are right, we do not support CNI chaining plugins yet, and I think there is a ticket to track it: https://issues.apache.org/jira/browse/MESOS-7079. Regards, Qian Zhang

On Sat, Dec 14, 2019 at 7:08 AM Marc Roos wrote:
Is anyone applying iptables rules in their cni networking, and how? I wrote an iptables chaining plugin but cannot use it because cni 0.3.0 is still not supported in mesos 1.9. I wondered how this is done currently.
cni iptables best practice
Is anyone applying iptables rules in their cni networking, and how? I wrote an iptables chaining plugin but cannot use it because cni 0.3.0 is still not supported in mesos 1.9. I wondered how this is done currently.
Iptables
How do I set iptables rules inside a container? I am getting this, repeated for every invocation:

Fatal: can't open lock file /run/xtables.lock: Permission denied
Fatal: can't open lock file /run/xtables.lock: Permission denied
Fatal: can't open lock file /run/xtables.lock: Permission denied
(same line repeated six more times)
Degraded performance container vs vm (-80% !!!)
With mesos 1.9 I still have degraded performance; any help sorting this out would be nice. It also makes me wonder whether others have bothered testing this. I am still testing with mesos and thus have a mostly default setup. Previously when I opened this thread, there was questioning about resource differences between the vm and the container. This is not the case; the vm has 1 vcpu allocated. To rule out possible memory issues I try to use the dns cache by constantly requesting the same 2 domains in files-2.tst.

[0] marathon task resources
"cpus": 1,
"mem": 300,

[1] memory usage vm
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9671 named 20 0 289m 90m 2948 S 0.0 18.8 12:18.22 named

[2] testing vm
[@ test-dns]$ dnsperf -f inet -t 2 -s 192.168.10.10 -d files-2.tst -l 10
DNS Performance Testing Tool
Nominum Version 2.1.0.0
[Status] Command line: dnsperf -f inet -t 2 -s 192.168.10.10 -d files-2.tst -l 10
[Status] Sending queries (to 192.168.10.10)
[Status] Started at: Tue Oct 22 14:30:03 2019
[Status] Stopping after 10.00 seconds
[Status] Testing complete (time limit)
Statistics:
Queries sent: 116834
Queries completed: 116834 (100.00%)
Queries lost: 0 (0.00%)
Response codes: NOERROR 116834 (100.00%)
Average packet size: request 27, response 111
Run time (s): 10.011078
Queries per second: 11670.471452
Average Latency (s): 0.008367 (min 0.000778, max 0.020300)
Latency StdDev (s): 0.001285

[3] testing container
[marc@os0 test-dns]$ dnsperf -f inet -t 2 -s 192.168.10.13 -d files-2.tst -l 10
DNS Performance Testing Tool
Nominum Version 2.1.0.0
[Status] Command line: dnsperf -f inet -t 2 -s 192.168.10.13 -d files-2.tst -l 10
[Status] Sending queries (to 192.168.10.13)
[Status] Started at: Tue Oct 22 14:29:48 2019
[Status] Stopping after 10.00 seconds
[Timeout] Query timed out: msg id 3
[Timeout] Query timed out: msg id 9
[Timeout] Query timed out: msg id 10
[Timeout] Query timed out: msg id 11
...
[Timeout] Query timed out: msg id 21251
[Timeout] Query timed out: msg id 21253
[Timeout] Query timed out: msg id 21260
[Timeout] Query timed out: msg id 21328
[Timeout] Query timed out: msg id 21344
[Timeout] Query timed out: msg id 21365
[Timeout] Query timed out: msg id 21390
[Status] Testing complete (time limit)
Statistics:
Queries sent: 24770
Queries completed: 24275 (98.00%)
Queries lost: 495 (2.00%)
Response codes: NOERROR 24275 (100.00%)
Average packet size: request 27, response 111
Run time (s): 10.000185
Queries per second: 2427.455092
Average Latency (s): 0.000326 (min 0.000130, max 0.003435)
Latency StdDev (s): 0.000139
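For reference, the -80% in the subject follows directly from the two dnsperf throughput numbers above; a quick sketch checking the arithmetic (values copied from the output above):

```python
# Throughput drop, container vs vm, from the dnsperf results above.
vm_qps = 11670.471452        # [2] testing vm
container_qps = 2427.455092  # [3] testing container

drop = 1 - container_qps / vm_qps
print(f"throughput drop: {drop:.0%}")  # prints: throughput drop: 79%
```

So the container serves roughly a fifth of the vm's query rate, close to the claimed -80%.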
changing /etc/hosts in container
What are my options for adding a host entry to /etc/hosts in a container that is not running as root?
RE: Is chained cni networks supported in mesos 1.7
Hi Gilbert, How is it going with the chain implementation? Thanks, Marc

-Original Message-
From: Gilbert Song [mailto:gilb...@apache.org]
Sent: Wednesday 14 August 2019 22:24
To: user
Subject: Re: Is chained cni networks supported in mesos 1.7

Are you interested in implementing the CNI chain support? -Gilbert

On Wed, Jul 24, 2019 at 12:52 PM Marc Roos wrote:
Hmm, I guess I should not get my hopes up this will be there soon?
[0] https://issues.apache.org/jira/browse/MESOS-7178

-Original Message-
From: Jie Yu [mailto:yujie@gmail.com]
Sent: Wednesday 24 July 2019 21:35
To: user
Subject: Re: Is chained cni networks supported in mesos 1.7

No, not yet

On Wed, Jul 24, 2019 at 12:27 PM Marc Roos wrote:
This error message of course:
E0724 21:19:17.852210 1160 cni.cpp:330] Failed to parse CNI network configuration file '/etc/mesos-cni/93-chain.conflist': Protobuf parse failed: Missing required fields: typ

-Original Message-
Subject: Is chained cni networks supported in mesos 1.7

I am getting this error, while I do not have problems using it with cnitool:
cni.cpp:330] Failed to parse CNI network configuration file '/etc/mesos-cni/93-chain-routing-overwrite.conflist.bak': Protobuf parse failed: Missing required fields: type

[@ mesos-cni]# cat 93-chain.conflist
{
  "name": "test-chain",
  "plugins": [{
    "type": "bridge",
    "bridge": "test-chain0",
    "isGateway": false,
    "isDefaultGateway": false,
    "ipMasq": false,
    "ipam": {
      "type": "host-local",
      "subnet": "10.15.15.0/24"
    }
  }, {
    "type": "portmap",
    "capabilities": {"portMappings": true},
    "snat": false
  }]
}

[@ mesos-cni]# CNI_PATH="/usr/libexec/cni/" NETCONFPATH="/etc/mesos-cni" cnitool-0.5.2 add test-chain /var/run/netns/testing
{
  "ip4": {
    "ip": "10.15.15.2/24",
    "gateway": "10.15.15.1"
  },
  "dns": {}
RE: Mesos task example json
Thanks Benjamin, I will bookmark these.

-Original Message-
To: user@mesos.apache.org
Subject: Re: Mesos task example json

Hi Marc,

> You also know how/where to put the capabilities? I am struggling with that.

Have a look at the protobufs which define this API:

* `TaskInfo`, which is used with `mesos-execute`, is defined here: https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/include/mesos/mesos.proto#L2229-L2285
* capabilities are passed via a task's `container` field, https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/include/mesos/mesos.proto#L2239, which has a field `linux_info` whose structure is defined here: https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/include/mesos/mesos.proto#L3270-L3341
* in there you want to set `effective_capabilities` and/or `bounding_capabilities`; see the docs for their semantics and interaction with the agent configuration, e.g., https://mesos.apache.org/documentation/latest/isolators/linux-capabilities/#task-setup

Most of the public Mesos APIs are defined in files under https://github.com/apache/mesos/tree/558829eb24f4ad636348497075bbc0428a4794a4/include/mesos, either in protobuf form or as C++ header files. For questions like yours it often helps to work backwards from an interesting field to a structure (e.g., in this particular case: work out how `CapabilityInfo` is related to `TaskInfo`).

HTH, Benjamin
Don't understand how to use mesos capabilities
I don't understand how to use mesos capabilities as described here[0].

1. removed caps from ping with setcap 'cap_net_raw=-p' /usr/bin/ping
2. linux/capabilities is in the isolators
3. mesos-slave is running as root
4. did not set effective_capabilities nor bounding_capabilities on the agent
5. running kernel 3.10.0-957.27.2.el7.x86_64
6. the task json looks correctly configured (output from the tasks endpoint):

},
"container": {
  "type": "MESOS",
  "linux_info": {
    "effective_capabilities": {
      "capabilities": [ "NET_RAW" ]
    }
  }
}

Yet when I run the task with the command "capsh --print ; ping -c 2 localhost ; sleep 120" I get the capsh output below[1], yet ping refuses with "ping: socket: Operation not permitted".

[1]
Current: = cap_net_raw+eip cap_net_admin,cap_syslog+i
Bounding set =cap_net_admin,cap_net_raw,cap_syslog
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=99(nobody) gid=99(nobody) groups=99(nobody)
Current: = cap_net_raw+eip
Bounding set =cap_net_raw
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=99(nobody) gid=99(nobody) groups=99(nobody)

[0] http://mesos.apache.org/documentation/latest/isolators/linux-capabilities/
RE: Mesos task example json
Hi Qian, Thanks! Do you also know how/where to put the capabilities? I am struggling with that.

-Original Message-
To: user
Subject: Re: Mesos task example json

Hi Marc, Here is an example json that I use for testing:

{
  "name": "test",
  "task_id": {"value": "test"},
  "agent_id": {"value": ""},
  "resources": [
    {"name": "cpus", "type": "SCALAR", "scalar": {"value": 1}},
    {"name": "mem", "type": "SCALAR", "scalar": {"value": 128}}
  ],
  "command": {
    "value": "sleep 10"
  },
  "container": {
    "type": "MESOS",
    "mesos": {
      "image": {
        "type": "DOCKER",
        "docker": {
          "name": "busybox"
        }
      }
    }
  }
}

Regards, Qian Zhang

On Sat, Oct 12, 2019 at 6:26 AM Marc Roos wrote:
Is there some example json available with all options for use with 'mesos-execute --task='
Mesos task example json
Is there some example json available with all options for use with 'mesos-execute --task='
mesos 1.9 should have mesos task not?
[@~]# mesos help
Usage: mesos [OPTIONS]

Available commands:
help            dns             daemon.sh       agent
start-cluster.sh  master        start-agents.sh   start-masters.sh
start-slaves.sh   stop-agents.sh  stop-cluster.sh  stop-masters.sh
stop-slaves.sh    tail          cat             execute
init-wrapper      local         log             ps
resolve           scp

mesos-1.9.0-2.0.1.el7.x86_64
NET_ADMIN permission equivalent for mesos
I have a docker image that requires NET_ADMIN. I have found this[0] (for the docker containerizer?), but what is the syntax for the mesos containerizer?

[0]
{
  "cpus": 0.1,
  "mem": 50,
  "id": "/openvpn",
  "instances": 1,
  "container": {
    "docker": {
      "image": "docker-registry.marathon.mesos:5000/openvpn",
      "network": "BRIDGE",
      "forcePullImage": true,
      "parameters": [{"key": "cap-add", "value": "NET_ADMIN"}],
      "portMappings": [{"containerPort": 1194, "servicePort": 1194}]
    }
  },
  "dependencies": ["/mesos-dns", "/docker-registry"],
  "healthChecks": [{"protocol": "TCP"}]
}
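For the Mesos containerizer, capabilities are not passed as docker parameters but via the task's container.linux_info, per the linux-capabilities isolator docs (this assumes the linux/capabilities isolator is enabled on the agent); a sketch of the relevant task fragment:

```json
{
  "container": {
    "type": "MESOS",
    "linux_info": {
      "effective_capabilities": {
        "capabilities": ["NET_ADMIN"]
      }
    }
  }
}
```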
Kernel module restrictions on launched task?
Are there any restrictions on a launched task that could block access to ipsec in the kernel? I am getting this in the launched task:

Oct 8 16:05:19 c02 ipsec_starter[695921]: no netkey IPsec stack detected
Oct 8 16:05:19 c02 ipsec_starter[695921]: no KLIPS IPsec stack detected
Oct 8 16:05:19 c02 ipsec_starter[695921]: no known IPsec stack detected, ignoring!

while launching directly on the host seems to be ok.

Linux c04 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.7.1908 (Core)
mesos-1.9.0-2.0.1.el7.x86_64
RE: Task list node
Yes thanks, I managed to get them with this:

curl -s --user test:xxx --cacert /etc/pki/ca-trust/source/ca-test.crt -X GET https://m01.local:5050/state | jq '.frameworks[].tasks[] | select(.state=="TASK_RUNNING") | del(.statuses, .discovery, .container, .health_check) | "\(.name) \(.state) \(.slave_id)"'

-Original Message-
To: user
Subject: Re: Task list node

You can just mimic the UI behaviour: use the /state endpoint and filter it with jq.

On Tue, 1 Oct 2019 at 13:56 Marc Roos wrote:
Hmmm, if I do something like this[0] I get only 3 tasks, and the mesos gui on 5050 is showing all (I guess, at least more than three). Also if I grep the unfiltered json output for a task string, it does not find it.

[0]
curl -s --user test:xxx --cacert /etc/pki/ca-trust/source/ca-test.crt -X GET https://m01.local:5050/tasks | jq '.tasks[] | select(.state=="TASK_RUNNING")'

curl -s --user test:xxx --cacert /etc/pki/ca-trust/source/ca-test.crt -X GET https://m01.local:5050/master/tasks | jq '.tasks[] | select(.state=="TASK_RUNNING") | del(.statuses, .discovery, .health_check, .container) | "\(.name) \(.state) \(.slave_id)"'

-Original Message-
To: user
Subject: Re: Task list node

You can list them with the agent containers endpoint:
http://mesos.apache.org/documentation/latest/endpoints/slave/containers/
Or with the master tasks endpoint, filtering locally with jq:
http://mesos.apache.org/documentation/latest/endpoints/master/tasks/

On Thu, 26 Sep 2019 at 22:09 Marc Roos wrote:
What would be the easiest way to list running tasks on a node/agent/slave?
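The same /state filtering that the thread does with jq can be sketched in a few lines of Python (field names taken from the jq filter above; the inlined sample document is only an illustration, normally you would fetch /state from the master):

```python
import json

# Filter a master /state document for tasks running on a given agent.
# 'state' would normally come from an HTTP GET of https://master:5050/state.
state = json.loads("""
{"frameworks": [{"tasks": [
  {"name": "web", "state": "TASK_RUNNING",  "slave_id": "S0"},
  {"name": "db",  "state": "TASK_FINISHED", "slave_id": "S0"},
  {"name": "dns", "state": "TASK_RUNNING",  "slave_id": "S1"}
]}]}
""")

def running_tasks(state, slave_id):
    # Mirrors: .frameworks[].tasks[] | select(.state=="TASK_RUNNING")
    return [t["name"]
            for fw in state["frameworks"]
            for t in fw["tasks"]
            if t["state"] == "TASK_RUNNING" and t["slave_id"] == slave_id]

print(running_tasks(state, "S0"))  # prints ['web']
```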
RE: Task list node
Hmmm, if I do something like this[0] I get only 3 tasks, and the mesos gui on 5050 is showing all (I guess, at least more than three). Also if I grep the unfiltered json output for a task string, it does not find it.

[0]
curl -s --user test:xxx --cacert /etc/pki/ca-trust/source/ca-test.crt -X GET https://m01.local:5050/tasks | jq '.tasks[] | select(.state=="TASK_RUNNING")'

curl -s --user test:xxx --cacert /etc/pki/ca-trust/source/ca-test.crt -X GET https://m01.local:5050/master/tasks | jq '.tasks[] | select(.state=="TASK_RUNNING") | del(.statuses, .discovery, .health_check, .container) | "\(.name) \(.state) \(.slave_id)"'

-Original Message-
To: user
Subject: Re: Task list node

You can list them with the agent containers endpoint:
http://mesos.apache.org/documentation/latest/endpoints/slave/containers/
Or with the master tasks endpoint, filtering locally with jq:
http://mesos.apache.org/documentation/latest/endpoints/master/tasks/

On Thu, 26 Sep 2019 at 22:09 Marc Roos wrote:
What would be the easiest way to list running tasks on a node/agent/slave?
Maybe new feature/option for the health check
I have a few tasks that take a while before they get started. Sendmail, for example, is not too happy that you cannot set the hostname (in marathon) and then runs into a one-minute timeout. I think there is something similar when starting openldap. If I enable a regular health check there, it will fail the task before it has finished launching. Maybe it would be interesting to add an option for an initial delay, like this initDelay:

{
  "path": "/api/health",
  "portIndex": 0,
  "protocol": "MESOS_HTTP",
  "initDelay": 60,              <-- proposed option
  "gracePeriodSeconds": 300,
  "intervalSeconds": 60,
  "timeoutSeconds": 20,
  "maxConsecutiveFailures": 3
}
Problems with tasks and cni networking after upgrading from 1.8 to 1.9
It looks like my tasks that have dual networking, a gateway, and a cni_args-assigned ip address are no longer able to start on mesos 1.9. During deployment I am able to ping these assigned ip addresses. Why can't the executor reach the task then? I guess something has changed since 1.8 in how the executor connects with the container?

I0929 00:54:55.658519 469057 slave.cpp:2130] Got assigned task 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' for framework d5168fcd-51be-48c3-ba64-ade27ab23c4e-
I0929 00:54:55.664753 469057 slave.cpp:2504] Authorizing task 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' for framework d5168fcd-51be-48c3-ba64-ade27ab23c4e-
I0929 00:54:55.667817 469057 slave.cpp:2977] Launching task 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' for framework d5168fcd-51be-48c3-ba64-ade27ab23c4e-
I0929 00:54:55.668617 469057 paths.cpp:817] Creating sandbox '/var/lib/mesos/slaves/0c15c45f-310b-4fc2-8275-b0bfa1bdfdcb-S0/frameworks/d5168fcd-51be-48c3-ba64-ade27ab23c4e-/executors/demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3/runs/51e76e90-5244-4c67-98bf-cf30b78d6fe6' for user 'nobody'
I0929 00:54:55.669667 469057 paths.cpp:820] Creating sandbox '/var/lib/mesos/meta/slaves/0c15c45f-310b-4fc2-8275-b0bfa1bdfdcb-S0/frameworks/d5168fcd-51be-48c3-ba64-ade27ab23c4e-/executors/demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3/runs/51e76e90-5244-4c67-98bf-cf30b78d6fe6'
I0929 00:54:55.669903 469057 slave.cpp:10002] Launching executor 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e- with resources [{"allocation_info":{"role":"marathon"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"marathon"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory
'/var/lib/mesos/slaves/0c15c45f-310b-4fc2-8275-b0bfa1bdfdcb-S0/frameworks/d5168fcd-51be-48c3-ba64-ade27ab23c4e-/executors/demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3/runs/51e76e90-5244-4c67-98bf-cf30b78d6fe6'
I0929 00:54:55.670975 469057 slave.cpp:3209] Queued task 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' for executor 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e-
I0929 00:54:55.671329 469057 slave.cpp:3657] Launching container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 for executor 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e-
I0929 00:54:55.671910 469042 containerizer.cpp:1396] Starting container 51e76e90-5244-4c67-98bf-cf30b78d6fe6
I0929 00:54:55.672205 469042 containerizer.cpp:3323] Transitioning the state of container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 from STARTING to PROVISIONING after 162048ns
I0929 00:54:55.973661 469039 provisioner.cpp:551] Provisioning image rootfs '/var/lib/mesos/provisioner/containers/51e76e90-5244-4c67-98bf-cf30b78d6fe6/backends/copy/rootfses/292ee89d-1112-4b79-83e7-953571b3cd3b' for container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 using copy backend
I0929 00:54:57.279662 469057 containerizer.cpp:3323] Transitioning the state of container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 from PROVISIONING to PREPARING after 1.607405056secs
I0929 00:54:57.285080 469064 cpu.cpp:92] Updated 'cpu.shares' to 204 (cpus 0.2) for container 51e76e90-5244-4c67-98bf-cf30b78d6fe6
I0929 00:54:57.296098 469057 switchboard.cpp:316] Container logger module finished preparing container 51e76e90-5244-4c67-98bf-cf30b78d6fe6; IOSwitchboard server is not required
I0929 00:54:57.298223 469064 linux_launcher.cpp:492] Launching container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 and cloning with namespaces CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWNET
I0929 00:54:57.307914 469057 containerizer.cpp:2209] Checkpointing container's forked pid 469525 to '/var/lib/mesos/meta/slaves/0c15c45f-310b-4fc2-8275-b0bfa1bdfdcb-S0/frameworks/d5168fcd-51be-48c3-ba64-ade27ab23c4e-/executors/demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3/runs/51e76e90-5244-4c67-98bf-cf30b78d6fe6/pids/forked.pid'
I0929 00:54:57.308645 469057 containerizer.cpp:3323] Transitioning the state of container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 from PREPARING to ISOLATING after 29.022976ms
I0929 00:54:57.310788 469040 cni.cpp:974] Bind mounted '/proc/469525/ns/net' to '/run/mesos/isolators/network/cni/51e76e90-5244-4c67-98bf-cf30b78d6fe6/ns' for container 51e76e90-5244-4c67-98bf-cf30b78d6fe6
I0929 00:54:57.311151 469040 cni.cpp:1320] Invoking CNI plugin '/usr/libexec/cni/mesos' to attach container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 to network
How to clean up "Failed to find 'libprocess.pid' or 'http.marker'"
W0929 00:45:10.676910 468993 process.cpp:1055] Failed SSL connections will be downgraded to a non-SSL socket
W0929 00:45:10.901372 469057 state.cpp:657] Failed to find 'libprocess.pid' or 'http.marker' for container 8bf306d5-a10c-4787-9258-4198ea80bbec of executor
W0929 00:45:10.902492 469057 state.cpp:657] Failed to find 'libprocess.pid' or 'http.marker' for container 4b171278-bcd4-4014-abc4-c330912bd87f of executor
W0929 00:45:10.903398 469057 state.cpp:657] Failed to find 'libprocess.pid' or 'http.marker' for container 8dadcf4d-35ab-4cd7-b313-f4de1175282e of executor
W0929 00:45:10.904371 469057 state.cpp:657] Failed to find 'libprocess.pid' or 'http.marker' for container 586a55ff-89b7-4fd1-8bed-045300baee47 of executor
W0929 00:45:10.905293 469057 state.cpp:657] Failed to find 'libprocess.pid' or 'http.marker' for container ea295608-bf62-447d-aae6-a3596dae8a13 of executor
W0929 00:45:10.906186 469057 state.cpp:657] Failed to find 'libprocess.pid' or 'http.marker' for container 957f1879-2718-400b-9348-ec35c89f51a6 of executor
W0929 00:45:10.907114 469057 state.cpp:657] Failed to find 'libprocess.pid' or 'http.marker' for container 938dbd1c-1468-480e-9554-65c9e415d163 of executor
Task list node
What would be the easiest way to list running tasks on a node/agent/slave?
BUG: /tmp/mesos losing files; add /usr/lib/tmpfiles.d/mesos.conf
For the developers: /tmp on centos 6/7 (and probably more distros) is being cleaned automatically! Read this:
https://www.thegeekdiary.com/centos-rhel-67-why-the-files-in-tmp-directory-gets-deleted-periodically/
https://developers.redhat.com/blog/2016/09/20/managing-temporary-files-with-systemd-tmpfiles-on-rhel7/

Maybe add something like this file to your rpms:

cat << EOF >> /usr/lib/tmpfiles.d/mesos.conf
x /tmp/mesos/store/docker/
EOF

Now mesos fails often, even when being used with "forcePullImage": true. I am having this error:

Task id: chat_openfire.instance-5c6ba784-d7a7-11e9-b799-0050563001a1._app.11
State: TASK_FAILED
Message: Failed to launch container: Failed to read manifest from '/tmp/mesos/store/docker/layers/8c49e24d4aba93c77354143366e2427e0e2e7191cb85dbc1aa187e4e480021c1/json': No such file or directory

mesos-1.8.1-2.0.1.el7.x86_64

-Original Message-
From: Marc Roos
Sent: Monday 19 August 2019 21:47
To: user
Subject: "Failed to launch container" "No such file or directory"

Some temp folders gone? How to resolve this?
Failed to launch container: Failed to read manifest from '/tmp/mesos/store/docker/layers/8c49e24d4aba93c77354143366e2427e0e2e7191cb85dbc1aa187e4e480021c1/json': No such file or directory
RE: Please some help regression testing a task
No, it is not throttled. Besides, changing the runtime cgroups of the task to user.slice should have revealed some difference then, no?

[@~]# cat /sys/fs/cgroup/cpuacct/mesos/d0923b5a-5b96-41cc-b291-4effc0bfcbb9/cpu.stat
nr_periods 0
nr_throttled 0
throttled_time 0

-Original Message-
To: user
Subject: Re: Please some help regression testing a task

Can you check if the task is throttled? You can run the command `/proc//cgroup` to get the cgroups of the task, and then check the `cpu.stat` file under the task's CPU cgroups, e.g.:

$ cat /sys/fs/cgroup/cpuacct/mesos/bd5bc588-7565-4c7e-a5f0-d33850b2ec0a/cpu.stat
nr_periods 118
nr_throttled 37
throttled_time 633829202

If `nr_throttled` is greater than 0, then that means the task was throttled, which may affect its performance. Regards, Qian Zhang

On Sat, Aug 31, 2019 at 11:48 PM Marc Roos wrote:
mesos-1.8.1-2.0.1.el7.x86_64
CentOS Linux release 7.6.1810 (Core)

-Original Message-
To: user
Subject: Please some help regression testing a task

I have a task that under performs. I am unable to discover what is causing it. Could this be something mesos specific? Performance difference is 1k q/s vs 20k q/s.

1. If I manually run the task on the host, the performance is ok
> I think one could rule out network connectivity on/of the host and host issues

2. If I manually run a task in the same netns as the under performing task, the performance is ok.
ip netns exec bind bash
chroot 04a81d99-9b99-410d-bf83-d6d70ef2c7bb/
(changed only the config port to 54) named -u named
> I think we can rule out netns issues

3.
If I manually remove or change the cgroups of the mesos/marathon task, the performance is still bad echo 2932859 > /sys/fs/cgroup/memory/user.slice/tasks echo 2932859 > /sys/fs/cgroup/devices/user.slice/tasks echo 2932859 > /sys/fs/cgroup/cpu/user.slice/tasks echo 2932859 > /sys/fs/cgroup/cpuacct/user.slice/tasks echo 2932859 > /sys/fs/cgroup/pids/user.slice/tasks echo 2932859 > /sys/fs/cgroup/blkio/user.slice/tasks or echo 2932859 > /sys/fs/cgroup/memory/user.slice/tasks echo 2932859 > /sys/fs/cgroup/devices/user.slice/tasks echo 2932859 > /sys/fs/cgroup/cpu/user.slice/tasks echo 2932859 > /sys/fs/cgroup/cpuacct/user.slice/tasks echo 2932859 > /sys/fs/cgroup/pids/user.slice/tasks echo 2932859 > /sys/fs/cgroup/blkio/user.slice/tasks [@]# cat /proc/2936696/cgroup 11:hugetlb:/ 10:memory:/user.slice 9:devices:/user.slice 8:cpuacct,cpu:/user.slice 7:perf_event:/ 6:cpuset:/ 5:pids:/user.slice 4:freezer:/ 3:blkio:/user.slice 2:net_prio,net_cls:/ 1:name=systemd:/user.slice/user-0.slice/session-17385.scope [@]# cat /proc/2932859/cgroup 11:hugetlb:/ 10:memory:/user.slice 9:devices:/user.slice 8:cpuacct,cpu:/user.slice 7:perf_event:/ 6:cpuset:/ 5:pids:/user.slice 4:freezer:/ 3:blkio:/user.slice 2:net_prio,net_cls:/ 1:name=systemd:/mesos/812c481b-c0a4-444a-aafa-de98da9698e2
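The throttling check suggested in the thread can be scripted. A minimal sketch, run here against a sample cpu.stat file (on a real host the path would be /sys/fs/cgroup/cpuacct/mesos/<container-id>/cpu.stat):

```shell
# Sketch: decide from a cpu.stat file whether a task was CPU-throttled.
# A sample file with the values from the thread is used so the snippet
# runs anywhere.
stat_file=$(mktemp)
printf 'nr_periods 118\nnr_throttled 37\nthrottled_time 633829202\n' > "$stat_file"

# Pull out the nr_throttled counter and report.
throttled=$(awk '/^nr_throttled/ {print $2}' "$stat_file")
if [ "$throttled" -gt 0 ]; then
  echo "task was throttled $throttled times"   # prints: task was throttled 37 times
else
  echo "no throttling"
fi
```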
W0831 containerizer.cpp:2375] Ignoring update for unknown container
Why do I get this message? How to resolve this?

W0831 18:01:45.403295 2943686 containerizer.cpp:2375] Ignoring update for unknown container 48d9b77c-7348-4404-9845-211be74bad1d

mesos-1.8.1-2.0.1.el7.x86_64
RE: Please some help regression testing a task
mesos-1.8.1-2.0.1.el7.x86_64
CentOS Linux release 7.6.1810 (Core)

-----Original Message-----
To: user
Subject: Please some help regression testing a task

I have a task that underperforms. I am unable to discover what is causing it. Could this be something mesos specific? The performance difference is 1k q/s vs 20k q/s.

1. If I manually run the task on the host, the performance is ok.

> I think one could rule out network connectivity on/of the host and host issues.

2. If I manually run a task in the same netns as the underperforming task, the performance is ok.

ip netns exec bind bash
chroot 04a81d99-9b99-410d-bf83-d6d70ef2c7bb/
(changed only the config port to 54)
named -u named

> I think we can rule out netns issues.

3. If I manually remove or change the cgroups of the mesos/marathon task, the performance is still bad.

echo 2932859 > /sys/fs/cgroup/memory/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/devices/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/cpu/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/cpuacct/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/pids/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/blkio/user.slice/tasks

[@]# cat /proc/2936696/cgroup
11:hugetlb:/
10:memory:/user.slice
9:devices:/user.slice
8:cpuacct,cpu:/user.slice
7:perf_event:/
6:cpuset:/
5:pids:/user.slice
4:freezer:/
3:blkio:/user.slice
2:net_prio,net_cls:/
1:name=systemd:/user.slice/user-0.slice/session-17385.scope

[@]# cat /proc/2932859/cgroup
11:hugetlb:/
10:memory:/user.slice
9:devices:/user.slice
8:cpuacct,cpu:/user.slice
7:perf_event:/
6:cpuset:/
5:pids:/user.slice
4:freezer:/
3:blkio:/user.slice
2:net_prio,net_cls:/
1:name=systemd:/mesos/812c481b-c0a4-444a-aafa-de98da9698e2
RE: Large container image failing to start 'first' time
I only have these two messages:

mesos-slave.ERROR:E0828 12:51:46.146246 2663200 slave.cpp:6486] Container '680d3849-2b2a-4549-8842-8ef358599478' for executor 'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is being destroyed during provisioning
mesos-slave.INFO:E0828 12:51:46.146246 2663200 slave.cpp:6486] Container '680d3849-2b2a-4549-8842-8ef358599478' for executor 'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is being destroyed during provisioning
mesos-slave.INFO:W0828 12:51:46.650323 2663184 containerizer.cpp:2375] Ignoring update for unknown container 680d3849-2b2a-4549-8842-8ef358599478
mesos-slave.WARNING:E0828 12:51:46.146246 2663200 slave.cpp:6486] Container '680d3849-2b2a-4549-8842-8ef358599478' for executor 'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is being destroyed during provisioning
mesos-slave.WARNING:W0828 12:51:46.650323 2663184 containerizer.cpp:2375] Ignoring update for unknown container 680d3849-2b2a-4549-8842-8ef358599478

-----Original Message-----
From: Qian Zhang [mailto:zhq527...@gmail.com]
Sent: Wednesday 28 August 2019 15:07
To: Marc Roos
Cc: user
Subject: Re: Large container image failing to start 'first' time

Can you please send the full logs about this container (just grep 680d3849-2b2a-4549-8842-8ef358599478 in the agent log)? And is there anything left in the staging directory (`--docker_store_dir/staging/`) when this issue happens?

Regards,
Qian Zhang

On Wed, Aug 28, 2019 at 7:07 PM Marc Roos wrote:

I had this again.

E0828 12:51:46.146246 2663200 slave.cpp:6486] Container '680d3849-2b2a-4549-8842-8ef358599478' for executor 'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is being destroyed during provisioning

-----Original Message-----
From: Qian Zhang [mailto:zhq527...@gmail.com]
Sent: Tuesday 20 August 2019 1:12
To: user
Subject: Re: Large container image failing to start 'first' time

> Large container image failing to start 'first' time

Did you see any errors/warnings in agent logs when the container failed to start?

Regards,
Qian Zhang

On Mon, Aug 19, 2019 at 10:46 PM Marc Roos wrote:

I have a container image of around 800MB. I am not sure if that is a lot, but I have noticed it is probably too big for a default setup to get it to launch. I think the only reason it launches eventually is because data is cached and no timeout expires. The container will launch eventually when you constrain it to a host.

How can I trace where this timeout occurs? Are there options to specify timeouts?
Converting vm to task (performance degraded)
I am testing converting a nameserver vm to a task on mesos. If I query just one domain (so the result comes from cache) for 30 seconds, I can do around 450,000 queries on the vm, and only 17,000 on the task. When I look at top output on the host where the task is running, I see this task only using 17% cpu time (the vm allocates 100% cpu). I have launched the task with cpus: 1.

How/where/what should I check that causes this reduced performance? I think some configuration is limiting, because I can easily get 10k q/s on the vm while the task is only getting 1.8k q/s.

Is there a configuration guide on how to change a host's settings to optimize it for use with mesos?
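One thing worth ruling out is a hard CFS quota on the task's CPU cgroup (set when the agent runs with --cgroups_enable_cfs). A minimal sketch of the arithmetic, using sample values here; on a host the files live under /sys/fs/cgroup/cpu/mesos/<container-id>/:

```shell
# Sketch: compute the effective CPU limit of a cgroup from its CFS quota.
# Sample values stand in for the contents of cpu.cfs_quota_us and
# cpu.cfs_period_us (cpus: 1 typically means quota == period).
quota=100000    # cpu.cfs_quota_us (-1 means no hard limit)
period=100000   # cpu.cfs_period_us
if [ "$quota" -eq -1 ]; then
  echo "no hard CPU limit (shares only)"
else
  echo "hard limit: $(( quota / period )) cpu(s)"   # prints: hard limit: 1 cpu(s)
fi
```

If a hard limit of 1 cpu is in effect while the vm effectively had a full core plus burst, that alone could explain part of the gap.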
RE: Large container image failing to start 'first' time
I had this again.

E0828 12:51:46.146246 2663200 slave.cpp:6486] Container '680d3849-2b2a-4549-8842-8ef358599478' for executor 'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is being destroyed during provisioning

-----Original Message-----
From: Qian Zhang [mailto:zhq527...@gmail.com]
Sent: Tuesday 20 August 2019 1:12
To: user
Subject: Re: Large container image failing to start 'first' time

> Large container image failing to start 'first' time

Did you see any errors/warnings in agent logs when the container failed to start?

Regards,
Qian Zhang

On Mon, Aug 19, 2019 at 10:46 PM Marc Roos wrote:

I have a container image of around 800MB. I am not sure if that is a lot, but I have noticed it is probably too big for a default setup to get it to launch. I think the only reason it launches eventually is because data is cached and no timeout expires. The container will launch eventually when you constrain it to a host.

How can I trace where this timeout occurs? Are there options to specify timeouts?
Exporting socket from container to host
I was wondering if it is possible to export a socket of a container to the host, so I can then share it again with another container (without using pods; I would like to scale these applications independently of each other).
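One way to sketch this is a host_path volume: both apps mount the same host directory, one creates the unix socket there, the other connects to it. A minimal marathon app fragment, assuming the Mesos containerizer; the paths and mode are hypothetical placeholders:

```json
{
  "container": {
    "type": "MESOS",
    "volumes": [
      {
        "containerPath": "/var/run/app",
        "hostPath": "/var/run/shared-sockets",
        "mode": "RW"
      }
    ]
  }
}
```

Both apps would carry the same volume stanza; since each scales independently, this avoids pinning them into one pod, at the cost of requiring the two tasks to land on the same host (e.g. via a hostname constraint).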
W0823 14:20:30.101281 2663193 containerizer.cpp:2375] Ignoring update for unknown container
When scaling a task from 0 to 1 with two cni networks, one of them having a gateway, I have quite a lot of failures.

Step 1: deploying
Step 2: DHCPREQUEST and DHCPACK (fast)
Step 3: Right after DHCPACK, this error from the agent:

W0823 14:58:18.440388 2663180 containerizer.cpp:2375] Ignoring update for unknown container 2000cbfc-7ca8-4f61-bcfd-f43e248ba130

Step 4: Delayed

Why do I get the unknown container?
W0823 12:00:46. process.cpp:1453] Failed to link to '192.168.142.50:40746', connect: Failed connect: connection closed
When scaling the task from 0 to 1, it sometimes takes quite a while for it to become active, waiting maybe 10-20 seconds on the first waiting reported by marathon.

Step 1: deploying (fast)
Step 2: sometimes fast / sometimes 10-20 seconds
Step 3: DHCPREQUEST and DHCPACK (fast)
Step 4: Right after DHCPACK, this error from the agent:

W0823 13:08:16.595113 2663211 process.cpp:1453] Failed to link to '192.168.142.53:41715', connect: Failed connect: connection closed

Step 5: waiting (fast)
Step 6: running (fast)

Sometimes the same task is instantly running, although having the same 'connection closed' error after the DHCPACK.
RE: Large container image failing to start 'first' time
I have found several, related to also having multiple networks; I will start a new thread.

-----Original Message-----
To: user
Subject: Re: Large container image failing to start 'first' time

> Large container image failing to start 'first' time

Did you see any errors/warnings in agent logs when the container failed to start?

Regards,
Qian Zhang

On Mon, Aug 19, 2019 at 10:46 PM Marc Roos wrote:

I have a container image of around 800MB. I am not sure if that is a lot, but I have noticed it is probably too big for a default setup to get it to launch. I think the only reason it launches eventually is because data is cached and no timeout expires. The container will launch eventually when you constrain it to a host.

How can I trace where this timeout occurs? Are there options to specify timeouts?
RE: "Failed to launch container" "No such file or directory" /tmp files are being cleaned
Hmmm, I have deleted the layers there; now it seems ok and the image is pulled from the nfs share.

1. Should the forced pull not be forcing to get the image regardless of what is in /tmp?
2. I think from centos7 /tmp is not only cleared just after a reboot, but also on a schedule. So maybe not such a good place to store docker layers?

-----Original Message-----
From: Marc Roos
Sent: Monday 19 August 2019 21:47
To: user
Subject: "Failed to launch container" "No such file or directory"

Some temp folders gone? How to resolve this?

Failed to launch container: Failed to read manifest from '/tmp/mesos/store/docker/layers/8c49e24d4aba93c77354143366e2427e0e2e7191cb85dbc1aa187e4e480021c1/json': No such file or directory
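The recovery used above (delete the half-cleaned layers so the next launch re-provisions) can be sketched like this, against a scratch store so the snippet runs anywhere; on a real agent the default store is /tmp/mesos/store/docker and the agent should be stopped first:

```shell
# Sketch: wipe a corrupted layer store so the next launch provisions the
# image again from the registry. The store layout (layers/<sha>/json) is
# mimicked with a dummy hash here.
store=$(mktemp -d)/store/docker
mkdir -p "$store/layers/deadbeef"
touch "$store/layers/deadbeef/json"

rm -rf "$store"          # on a real host: systemctl stop mesos-slave first
[ ! -d "$store" ] && echo "store cleared"
```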
"Failed to launch container" "No such file or directory"
Some temp folders gone? How to resolve this?

Failed to launch container: Failed to read manifest from '/tmp/mesos/store/docker/layers/8c49e24d4aba93c77354143366e2427e0e2e7191cb85dbc1aa187e4e480021c1/json': No such file or directory
Large container image failing to start 'first' time
I have a container image of around 800MB. I am not sure if that is a lot, but I have noticed it is probably too big for a default setup to get it to launch. I think the only reason it launches eventually is because data is cached and no timeout expires. The container will launch eventually when you constrain it to a host.

How can I trace where this timeout occurs? Are there options to specify timeouts?
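One timeout that can bite slow image provisioning is the agent's --executor_registration_timeout. A minimal sketch of raising it, assuming the RPM packaging where files under /etc/mesos-slave become agent flags (a scratch directory stands in here); the 10mins value is an illustrative assumption:

```shell
# Sketch: give the executor more time to register while a large image is
# still being provisioned; equivalent to starting the agent with
#   mesos-slave --executor_registration_timeout=10mins
flag_dir=$(mktemp -d)   # stands in for /etc/mesos-slave
echo "10mins" > "$flag_dir/executor_registration_timeout"
cat "$flag_dir/executor_registration_timeout"
```

Whether this particular timeout is the one expiring in your setup would need to be confirmed from the agent logs.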
RE: Provisioning containers with configuration file via sandbox mount or copy via entrypoint.sh
Hi Gilbert, thanks for the detailed reply; this secrets feature is very interesting.

> * Fetch via URI - you probably do not need your application entrypoint to fetch. Instead Mesos
> and marathon support fetching URIs to your container sandbox.
> http://mesos.apache.org/documentation/latest/fetcher/

This fetching is what I am doing now. I have containers with a default configuration file, but when I need updates I am fetching with something like this:

"fetch": [
  {
    "uri": "file:///mnt/docker-images/haproxy.cfg",
    "executable": false,
    "extract": false,
    "cache": false,
    "destPath": "haproxy.cfg"
  },
  {
    "uri": "file:///mnt/docker-images/.crt",
    "executable": false,
    "extract": false,
    "cache": false,
    "destPath": ".crt"
  }
],

But this file goes into the sandbox directory /mnt/sandbox; I just wonder why it can't go directly to the 'container rootfs'? This is what I now have to do in the entrypoint.sh:

if [ ! -z "${MESOS_SANDBOX}" ] && [ -f "${MESOS_SANDBOX}/haproxy.cfg" ]

-----Original Message-----
To: user
Subject: Re: Provisioning containers with configuration file via sandbox mount or copy via entrypoint.sh

It depends on how you want to manage the configuration files for your containers - dynamic or static.

* Dynamic
  * Fetch via URI - you probably do not need your application entrypoint to fetch. Instead Mesos and marathon support fetching URIs to your container sandbox. http://mesos.apache.org/documentation/latest/fetcher/
  * Pass into the container as a file based secret if it is sensitive. http://mesos.apache.org/documentation/latest/secrets/#file-based-secrets
  * Environment Variable.
* Static
  * Host_path volume - mounting a host path or file into your container. http://mesos.apache.org/documentation/latest/container-volume/#host_path-volume-source
  * Build it in your container image if those configurations are not expected to be changed.

> Furthermore this page[1] says the sandbox is considered read only, yet the stdout and stderr are located there???

I think the document <http://mesos.apache.org/documentation/latest/sandbox/#using-the-sandbox> means that the sandbox is not expected to be touched by any 3rd party software or people other than Mesos, the executor and the task/application.

-Gilbert

On Sun, Jul 21, 2019 at 3:22 AM Marc Roos wrote:

What would be the advised way to add a configuration file to a container, to be used at startup? I am now fetching the files and then create an entrypoint.sh that copies this from the sandbox. Creating these custom entrypoint.sh files is cumbersome. I thought about mounting the paths of the sandbox in the container, but don't have a good example to get this working[0]. Furthermore this page[1] says the sandbox is considered read only, yet the stdout and stderr are located there???

Is there a (security) advantage copying files from the sandbox at startup or just use a mount point?

[0] https://www.mail-archive.com/user@mesos.apache.org/msg10445.html
[1] http://mesos.apache.org/documentation/latest/sandbox/
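The entrypoint pattern discussed above, completed into a runnable sketch. The destination path and the haproxy.cfg name follow the thread; making the destination a parameter is my addition so the snippet can run anywhere, demoed here against a scratch sandbox:

```shell
# Sketch: copy a fetched config out of the sandbox before starting the
# service. On a real container the destination would be something like
# /etc/haproxy/haproxy.cfg (assumption).
copy_sandbox_config() {
  dest="$1"
  if [ -n "${MESOS_SANDBOX}" ] && [ -f "${MESOS_SANDBOX}/haproxy.cfg" ]; then
    cp "${MESOS_SANDBOX}/haproxy.cfg" "$dest"
  fi
}

# demo against a scratch sandbox
MESOS_SANDBOX=$(mktemp -d)
echo "global" > "${MESOS_SANDBOX}/haproxy.cfg"
copy_sandbox_config "${MESOS_SANDBOX}/copied.cfg"
[ -f "${MESOS_SANDBOX}/copied.cfg" ] && echo "config copied"
```

In a real entrypoint the function call would be followed by `exec "$@"` to hand off to the service.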
RE: Is chained cni networks supported in mesos 1.7
Hi Gilbert,

Yes indeed. I have written already a netfilter chain plugin[0] I wanted to use. But also the default tuning plugin of cni, which requires chaining, I would like to use.

-Marc

[0] https://github.com/f1-outsourcing/plugins/tree/hostrouteif/plugins/meta/firewallnetns

-----Original Message-----
To: user
Subject: Re: Is chained cni networks supported in mesos 1.7

Are you interested in implementing the CNI chain support?

-Gilbert

On Wed, Jul 24, 2019 at 12:52 PM Marc Roos wrote:

Hmm, I guess I should not get my hopes up this will be there soon?

[0] https://issues.apache.org/jira/browse/MESOS-7178

-----Original Message-----
From: Jie Yu [mailto:yujie@gmail.com]
Sent: Wednesday 24 July 2019 21:35
To: user
Subject: Re: Is chained cni networks supported in mesos 1.7

No, not yet

On Wed, Jul 24, 2019 at 12:27 PM Marc Roos wrote:

This error message of course:

E0724 21:19:17.852210 1160 cni.cpp:330] Failed to parse CNI network configuration file '/etc/mesos-cni/93-chain.conflist': Protobuf parse failed: Missing required fields: type

-----Original Message-----
Subject: Is chained cni networks supported in mesos 1.7

I am getting this error, while I do not have problems using it with cnitool.

cni.cpp:330] Failed to parse CNI network configuration file '/etc/mesos-cni/93-chain-routing-overwrite.conflist.bak': Protobuf parse failed: Missing required fields: type

[@ mesos-cni]# cat 93-chain.conflist
{
  "name": "test-chain",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "test-chain0",
      "isGateway": false,
      "isDefaultGateway": false,
      "ipMasq": false,
      "ipam": {
        "type": "host-local",
        "subnet": "10.15.15.0/24"
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true },
      "snat": false
    }
  ]
}

[@ mesos-cni]# CNI_PATH="/usr/libexec/cni/" NETCONFPATH="/etc/mesos-cni" cnitool-0.5.2 add test-chain /var/run/netns/testing
{
  "ip4": {
    "ip": "10.15.15.2/24",
    "gateway": "10.15.15.1"
  },
  "dns": {}
}
Should mesos 1.8 (and marathon 1.8) drain/migrate tasks or not?
I don’t get from this page http://mesos.apache.org/documentation/latest/maintenance/ if mesos should be 'moving' tasks to another node when it is marked as draining. I know DRAIN_AGENT is only for mesos 1.9. But what use is it to post a maintenance schedule, see the node being marked as draining, and nothing happens with the tasks?

On the marathon page they say "draining is not yet implemented", yet they refer to an issue that has been resolved. https://mesosphere.github.io/marathon/docs/maintenance-mode.html

On stackoverflow there is the same question, again referencing issues that have been resolved.
https://stackoverflow.com/questions/37194123/marathon-tasks-not-migrating-off-mesos-node-goes-into-draining-mode
https://jira.mesosphere.com/browse/MARATHON-3216
https://phabricator.mesosphere.com/D1069

-----Original Message-----
From: Vinod Kone [mailto:vinodk...@apache.org]
Sent: Thursday 8 August 2019 0:35
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 'type' to be present

Please read the "maintenance primitives" section in this doc http://mesos.apache.org/documentation/latest/maintenance/ and let us know if you have unanswered questions.

On Wed, Aug 7, 2019 at 4:59 PM Marc Roos wrote:

I seem to be able to add a maintenance schedule, and I also get a report on '{"down_machines":[{"hostname":"m02.local"}]}', but I do not see tasks migrate to other hosts. Or is this not the purpose of maintenance mode in 1.8? Just to make sure no new tasks will be launched on hosts scheduled for maintenance?

-----Original Message-----
From: Chun-Hung Hsiao [mailto:chhs...@apache.org]
Sent: Wednesday 7 August 2019 22:59
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 'type' to be present

Hi Marc. Agent draining is a Mesos 1.9 feature and is only available on the current Mesos master branch. Please see https://issues.apache.org/jira/browse/MESOS-9814.

Best,
Chun-Hung

On Wed, Aug 7, 2019 at 1:35 PM Marc Roos wrote:

Should this be working in mesos 1.8?

[@m01 ~]# curl --user test:x -X POST \
>  https://m01.local:5050/api/v1 \
>  --cacert /etc/pki/ca-trust/source/ca.crt \
>  -H 'Accept: application/json' \
>  -H 'content-type: application/json' -d '{
>  "type": "DRAIN_AGENT",
>  "drain_agent": {"agent_id": {
>  "value":"53336fcb-7756-4673-b9c7-177e04f34c3b-S1"
>  }}}'

Failed to validate master::Call: Expecting 'type' to be present
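For contrast with the 1.9-only DRAIN_AGENT call, the 1.8-era maintenance primitive is the v1 UPDATE_MAINTENANCE_SCHEDULE call. A sketch of the request body (hostname and start time are placeholders from this thread); here it is only built and checked, on a real cluster it would be POSTed to the master:

```shell
# Sketch: build the v1 maintenance-schedule request body. On a cluster:
#   curl -X POST http://<master>:5050/api/v1 \
#        -H 'Content-Type: application/json' -d @"$sched"
sched=$(mktemp)
cat > "$sched" << 'EOF'
{
  "type": "UPDATE_MAINTENANCE_SCHEDULE",
  "update_maintenance_schedule": {
    "schedule": {
      "windows": [
        {
          "machine_ids": [ { "hostname": "m02.local" } ],
          "unavailability": { "start": { "nanoseconds": 1565000000000000000 } }
        }
      ]
    }
  }
}
EOF
grep -q '"UPDATE_MAINTENANCE_SCHEDULE"' "$sched" && echo "request body ready"
```

Note that, matching the thread's observation, this only marks the window: frameworks decide whether to act on the inverse offers, so tasks are not force-migrated in 1.8.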
RE: Draining: Failed to validate master::Call: Expecting 'type' to be present
Now I am getting the draining state; I don’t know why I did not get this before.

{"draining_machines":[{"id":{"hostname":"m02.local"}}]}

But no tasks are migrating, nothing happens. After a while, I brought the agent down:

{"down_machines":[{"hostname":"m02.local"}]}

Tasks are still there. I assume this automatic draining is not related to the mesos 1.9 DRAIN_AGENT? And tasks should migrate to other nodes?

-----Original Message-----
To: user
Subject: RE: Draining: Failed to validate master::Call: Expecting 'type' to be present

I have scheduled a maintenance (from date now); how can I verify if the agent is indeed in 'draining' mode?

-----Original Message-----
From: Vinod Kone [mailto:vinodk...@apache.org]
Sent: Thursday 8 August 2019 0:35
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 'type' to be present

Please read the "maintenance primitives" section in this doc http://mesos.apache.org/documentation/latest/maintenance/ and let us know if you have unanswered questions.

On Wed, Aug 7, 2019 at 4:59 PM Marc Roos wrote:

I seem to be able to add a maintenance schedule, and I also get a report on '{"down_machines":[{"hostname":"m02.local"}]}', but I do not see tasks migrate to other hosts. Or is this not the purpose of maintenance mode in 1.8? Just to make sure no new tasks will be launched on hosts scheduled for maintenance?

-----Original Message-----
From: Chun-Hung Hsiao [mailto:chhs...@apache.org]
Sent: Wednesday 7 August 2019 22:59
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 'type' to be present

Hi Marc. Agent draining is a Mesos 1.9 feature and is only available on the current Mesos master branch. Please see https://issues.apache.org/jira/browse/MESOS-9814.

Best,
Chun-Hung

On Wed, Aug 7, 2019 at 1:35 PM Marc Roos wrote:

Should this be working in mesos 1.8?

[@m01 ~]# curl --user test:x -X POST \
>  https://m01.local:5050/api/v1 \
>  --cacert /etc/pki/ca-trust/source/ca.crt \
>  -H 'Accept: application/json' \
>  -H 'content-type: application/json' -d '{
>  "type": "DRAIN_AGENT",
>  "drain_agent": {"agent_id": {
>  "value":"53336fcb-7756-4673-b9c7-177e04f34c3b-S1"
>  }}}'

Failed to validate master::Call: Expecting 'type' to be present
RE: Draining: Failed to validate master::Call: Expecting 'type' to be present
I have scheduled a maintenance (from date now); how can I verify if the agent is indeed in 'draining' mode?

-----Original Message-----
From: Vinod Kone [mailto:vinodk...@apache.org]
Sent: Thursday 8 August 2019 0:35
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 'type' to be present

Please read the "maintenance primitives" section in this doc http://mesos.apache.org/documentation/latest/maintenance/ and let us know if you have unanswered questions.

On Wed, Aug 7, 2019 at 4:59 PM Marc Roos wrote:

I seem to be able to add a maintenance schedule, and I also get a report on '{"down_machines":[{"hostname":"m02.local"}]}', but I do not see tasks migrate to other hosts. Or is this not the purpose of maintenance mode in 1.8? Just to make sure no new tasks will be launched on hosts scheduled for maintenance?

-----Original Message-----
From: Chun-Hung Hsiao [mailto:chhs...@apache.org]
Sent: Wednesday 7 August 2019 22:59
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 'type' to be present

Hi Marc. Agent draining is a Mesos 1.9 feature and is only available on the current Mesos master branch. Please see https://issues.apache.org/jira/browse/MESOS-9814.

Best,
Chun-Hung

On Wed, Aug 7, 2019 at 1:35 PM Marc Roos wrote:

Should this be working in mesos 1.8?

[@m01 ~]# curl --user test:x -X POST \
>  https://m01.local:5050/api/v1 \
>  --cacert /etc/pki/ca-trust/source/ca.crt \
>  -H 'Accept: application/json' \
>  -H 'content-type: application/json' -d '{
>  "type": "DRAIN_AGENT",
>  "drain_agent": {"agent_id": {
>  "value":"53336fcb-7756-4673-b9c7-177e04f34c3b-S1"
>  }}}'

Failed to validate master::Call: Expecting 'type' to be present
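The draining state can be queried from the master with the v1 GET_MAINTENANCE_STATUS call; the response lists draining_machines and down_machines. A sketch that only builds the request body (the master address in the comment is a placeholder):

```shell
# Sketch: build the v1 request for maintenance status. On a cluster:
#   curl -X POST http://<master>:5050/api/v1 \
#        -H 'Content-Type: application/json' -d @"$status_req"
status_req=$(mktemp)
printf '{ "type": "GET_MAINTENANCE_STATUS" }\n' > "$status_req"
grep -q 'GET_MAINTENANCE_STATUS' "$status_req" && echo "request body ready"
```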