How can I enable forwarding in a mesos container

2020-12-31 Thread Marc Roos



If I run it as root, I get:

sysctl: error setting key 'net.ipv4.ip_forward': Read-only file system

Looking for something like this: docker run --sysctl 
net.ipv4.ip_forward=1 someimage
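I am not aware of a direct Mesos equivalent of docker's --sysctl flag. One hedged workaround, assuming the agent supports CNI plugin chaining (conflist files, CNI spec >= 0.3.0 — which, per the cni chaining thread further down, older Mesos may not), is to chain the standard containernetworking `tuning` meta-plugin, which can set namespaced sysctls such as net.ipv4.ip_forward. The bridge name, subnet, and file name below are illustrative assumptions:

```shell
# Hypothetical workaround: a CNI conflist chaining the "tuning" meta-plugin
# to set net.ipv4.ip_forward inside the container's network namespace.
# Bridge name, subnet, and paths are assumptions, not taken from this thread.
cat > mesos-bridge.conflist <<'EOF'
{
  "cniVersion": "0.3.0",
  "name": "mesos-bridge",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "m-cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": { "type": "host-local", "subnet": "172.30.0.0/16" }
    },
    {
      "type": "tuning",
      "sysctl": { "net.ipv4.ip_forward": "1" }
    }
  ]
}
EOF
python3 -m json.tool mesos-bridge.conflist > /dev/null && echo "conflist is valid JSON"
```

The conflist would go into the agent's --network_cni_config_dir, and the task would then join the "mesos-bridge" network.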




Problems getting the new mvp csi working

2020-12-21 Thread Marc Roos


I have been looking forward to the Mesos update offering this MVP 
CSI support, mainly to finally be able to use Ceph. But unfortunately I 
am still not able to get a simple RBD image attached to a container.


I am able to use csilvm by adding the volume like this[2], but 
cephcsi keeps failing. It looks like the secrets are not being sent to 
the driver; it keeps complaining about 'stage secrets cannot be nil or 
empty'[1], even though this config[3] has staging secrets. I have also 
tried using a secrets plugin, doing something like "username": { 
"secret": "secretpassword" }. Any hints on what I am doing wrong are 
very welcome!

[1]
I1221 21:54:36.932030   10356 utils.go:132] ID: 14 Req-ID: 
0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 
GRPC call: /csi.v1.Node/NodeStageVolume
I1221 21:54:36.932302   10356 utils.go:133] ID: 14 Req-ID: 
0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 
GRPC request: 
{"staging_target_path":"/var/lib/mesos/csi/rbd.csi.ceph.io/default/mounts/0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1/staging","volume_capability":{"AccessType":{"Block":{}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"ceph","pool":"app"},"volume_id":"0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1"}
E1221 21:54:36.932316   10356 utils.go:136] ID: 14 Req-ID: 
0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 
GRPC error: rpc error: code = InvalidArgument desc = stage secrets 
cannot be nil or empty
I1221 21:54:36.976159   10356 utils.go:132] ID: 15 Req-ID: 
0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 
GRPC call: /csi.v1.Node/NodeUnstageVolume
I1221 21:54:36.976308   10356 utils.go:133] ID: 15 Req-ID: 
0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 
GRPC request: 
{"staging_target_path":"/var/lib/mesos/csi/rbd.csi.ceph.io/default/mounts/0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1/staging","volume_id":"0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1"}
I1221 21:54:36.976465   10356 nodeserver.go:666] ID: 15 Req-ID: 
0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 
failed to find image metadata: missing stash: open 
/var/lib/mesos/csi/rbd.csi.ceph.io/default/mounts/0001-0004-ceph-00000016-7957e938-405a-11eb-bfd0-0050563001a1/staging/image-meta.json: no such file or directory
I1221 21:54:36.976537   10356 utils.go:138] ID: 15 Req-ID: 
0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1 
GRPC response: {}


[3]
"volumes": [
  {
"containerPath": "xxx",
"mode": "rw",
"external": {
  "provider": "csi",
  "name": 
"0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1",
  "options": { 
"pluginName": "rbd.csi.ceph.io",
"capability": {
  "accessType": "block",
  "accessMode": "SINGLE_NODE_WRITER",
  "fsType": ""
},
"volumeContext": {
  "clusterID": "ceph",
  "pool": "app" 
},
"nodeStageSecret": {
  "username": "userID",
  "password": "asdfasdfasdfasdfasdfasdf"
}
  }
}
  }
]
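One thing worth checking, as an assumption rather than a confirmed fix: ceph-csi's RBD driver conventionally expects node-stage secrets keyed userID and userKey (the Ceph client name without the client. prefix, plus its key), not username/password. If Mesos forwards nodeStageSecret verbatim, the options might need to look like this hypothetical sketch (all values are placeholders):

```shell
# Assumption: ceph-csi's NodeStageVolume expects secrets keyed
# "userID"/"userKey" rather than "username"/"password". Sketch of the
# volume options with those keys; the id and key values are placeholders.
cat > volume-options.json <<'EOF'
{
  "pluginName": "rbd.csi.ceph.io",
  "capability": {
    "accessType": "block",
    "accessMode": "SINGLE_NODE_WRITER",
    "fsType": ""
  },
  "volumeContext": { "clusterID": "ceph", "pool": "app" },
  "nodeStageSecret": {
    "userID": "mesos",
    "userKey": "placeholder-cephx-key=="
  }
}
EOF
python3 -m json.tool volume-options.json > /dev/null && echo OK
```

Note that renamed keys would not by themselves explain a nil secrets map: the GRPC request in [1] shows no secrets field at all, which points at the secrets not being forwarded in the first place.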



[2]
"volumes": [
  {
"containerPath": "xxx",
"mode": "rw",
"external": {
  "provider": "csi",
  "name": "LVtestman1",
  "options": { 
"pluginName": "lvm.csi.mesosphere.io",
"capability": {
  "accessType": "mount",
  "accessMode": "SINGLE_NODE_WRITER",
  "fsType": "xfs" 
}

  }
}
  }
]


new mvp how to use block csi lvm

2020-12-18 Thread Marc Roos



If I use the csilvm driver, I am able to use a published volume with 
this task[1] with an xfs filesystem. However, when I try to add the 
volume as a block device, the task[3] fails to deploy; the log[4] 
nevertheless seems OK and shows a mount and unmount. Should I change 
more than just accessType and fsType? The stderr of the task mentions 
'Failed to prepare mounts: Failed to mount' and 'directory does not 
exist'.

The driver itself seems fine, because it responds to a publish request 
from the command line[2] 


[2]
csc -e unix:///tmp/mesos-csi-v5WtjJ/endpoint.sock node publish --cap 
SINGLE_NODE_WRITER,block --target-path /mnt/testman1  'LVtestman1'

Log:
...
[VGtest]2020/12/18 17:50:18 lvm.go:893: stderr:
[VGtest]2020/12/18 17:50:18 server.go:1069: Volume path is 
/dev/VGtest/LVtestman1
[VGtest]2020/12/18 17:50:18 server.go:1071: Target path is /mnt/testman1
[VGtest]2020/12/18 17:50:18 server.go:1074: Mounting readonly: false
[VGtest]2020/12/18 17:50:18 server.go:1094: Attempting to publish volume 
/dev/VGtest/LVtestman1 as BLOCK_DEVICE to /mnt/testman1
[VGtest]2020/12/18 17:50:18 server.go:1095: Determining mount info at 
/mnt/testman1
[VGtest]2020/12/18 17:50:18 server.go:1104: Mount info at /mnt/testman1: 

[VGtest]2020/12/18 17:50:18 server.go:1135: Creating Mount Target  
/mnt/testman1
[VGtest]2020/12/18 17:50:18 server.go:1143: Nothing mounted at 
targetPath /mnt/testman1 yet
[VGtest]2020/12/18 17:50:18 server.go:1148: Performing bind mount of 
/dev/VGtest/LVtestman1 -> /mnt/testman1
[VGtest]2020/12/18 17:50:18 logging.go:30: Served 
/csi.v1.Node/NodePublishVolume: resp=
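As a small sanity check (an assumption about bind-mount semantics, not something from the thread): after a successful block-mode NodePublishVolume, the target path should be a block device node, not a directory. Using the target path from the csc example above:

```shell
# Hypothetical check: a block-mode publish should leave a block device node
# at the target ("block special file"); a mount-mode publish leaves a
# directory. Path taken from the csc example above; only meaningful on the
# agent host where the publish actually happened.
TARGET=/mnt/testman1
if [ -e "$TARGET" ]; then
  stat -c '%F' "$TARGET"
else
  echo "nothing published at $TARGET"
fi
```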



[1]
{
  "id": "/app5",
  "instances": 1,
  "cpus": 1,
  "mem": 32,
  "cmd": "echo $(date +'%m%d %H%M%S'): $HOSTNAME >> xxx/file ; sleep 
3600",
  "acceptedResourceRoles": ["*"],
  "constraints": [["hostname", "CLUSTER", "m01.local"]],
  "backoffSeconds": 10,
  "networks": [ { "mode": "host" } ],
  "container": {
"type": "MESOS",
"volumes": [
  {
"containerPath": "xxx",
"mode": "rw",
"external": {
  "provider": "csi",
  "name": "LVtestman1",
  "options": { 
"pluginName": "lvm.csi.mesosphere.io",
"capability": {
  "accessType": "mount",
  "accessMode": "SINGLE_NODE_WRITER",
  "fsType": "xfs"
}
  }
}
  }
]
  }
}

[3]
{
  "id": "/app4",
  "instances": 1,
  "cpus": 1,
  "mem": 32,
  "cmd": "echo $(date +'%m%d %H%M%S'): $HOSTNAME >> file ; sleep 3600",
  "acceptedResourceRoles": ["*"],
  "constraints": [["hostname", "CLUSTER", "m01.local"]],
  "backoffSeconds": 10,
  "networks": [ { "mode": "host" } ],
  "container": {
"type": "MESOS",
"volumes": [
  {
"containerPath": "xxx",
"mode": "rw",
"external": {
  "provider": "csi",
  "name": "LVtestman1",
  "options": { 
"pluginName": "lvm.csi.mesosphere.io",
"capability": {
  "accessType": "block",
  "accessMode": "SINGLE_NODE_WRITER",
  "fsType": ""
}
  }
}
  }
]
  }
}

[4]
[VGtest]2020/12/18 17:53:22 server.go:1329: Determining mount info at 
/var/lib/mesos/csi/lvm.csi.mesosphere.io/default/mounts/LVtestman1/target
[VGtest]2020/12/18 17:53:22 server.go:1337: Mount info at 
/var/lib/mesos/csi/lvm.csi.mesosphere.io/default/mounts/LVtestman1/target: &{root:/dm-2 path:/var/lib/mesos/csi/lvm.csi.mesosphere.io/default/mounts/LVtestman1/target fstype:devtmpfs mountopts:[rw nosuid] mountsource:devtmpfs}
[VGtest]2020/12/18 17:53:22 server.go:1346: Unmounting 
/var/lib/mesos/csi/lvm.csi.mesosphere.io/default/mounts/LVtestman1/target
[VGtest]2020/12/18 17:53:22 server.go:1361: Deleting Mount Target  
/var/lib/mesos/csi/lvm.csi.mesosphere.io/default/mounts/LVtestman1/target
[VGtest]2020/12/18 17:53:22 logging.go:30: Served 
/csi.v1.Node/NodeUnpublishVolume: resp=


csi specification handles volumeid(?)

2020-12-17 Thread Marc Roos


I hope nobody minds me putting this here, since the csi mailing list is 
invitation-only, and Jie Yu seems to be everywhere ;)

I am having some problems understanding how the cephcsi plugin works. I 
am using csc[1] from the rexray people, who I believe have quite some 
history with the development of storage provisioning. I seem to be able 
to use it fine with several plugins such as csilvm and csinfs, and 
publish some volumes. 

However, with the cephcsi plugin I seem to have to use different 
arguments for the StageVolume/PublishVolume calls than for the 
CreateVolume call. 
When I create a volume with 'csc -e unix:///tmp/csiceph.sock controller 
create-volume' I can supply a volume name, 'app-test2'. 
But when I want to stage/publish this volume with 'csc -e 
unix:///tmp/csiceph.sock node stage' I have to supply a volume ID, 
'0001-0004-ceph-0016-7957e938-405a-11eb-bfd0-0050563001a1'.

Question: is this indeed correct according to the CSI specification? It 
just looks weird to me, especially since other plugins do not behave 
like this. Or is this new?
 
I do not know why, but I get the impression that the cephcsi plugin is 
working around issues that should be fixed in Kubernetes. I honestly do 
not know why a CSI plugin is trying to generate random image names, 
volume IDs, etc. If something needs to be randomized, then the CO is 
responsible for that, is it not?


[1]
https://github.com/rexray/gocsi

[2]
https://github.com/ceph/ceph-csi/issues/1802
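For reference, this split is what the CSI spec prescribes: CreateVolume is keyed by a caller-chosen name, and its response carries a plugin-generated volume_id, which all later controller and node calls must use; plugins that accept the name at stage time simply happen to reuse the name as the id. A hedged sketch of the flow with csc (guarded so it only makes a live call where csc exists; exact csc flags may differ between gocsi versions):

```shell
# CSI's CreateVolume takes a user-facing name; the response carries the
# plugin-generated volume_id, which NodeStage/NodePublish then require.
# Guarded: only calls csc if it is installed (endpoint from the message).
ENDPOINT=unix:///tmp/csiceph.sock
if command -v csc >/dev/null 2>&1; then
  csc -e "$ENDPOINT" controller create-volume app-test2
  # the printed id (e.g. "0001-0004-ceph-0016-...") is what
  # 'csc node stage' must be given afterwards, not the name "app-test2"
else
  echo "csc not installed; skipping live call"
fi
```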


csi volume

2020-12-15 Thread Marc Roos


When I create this task[2], I am getting the error message: 

"There was a problem with your configuration
general: App creation unsuccessful. Check your app settings and try 
again."

I have the csi managed plugin running and can mount with the command 
line csc[1]. What should I look at to fix this?

[1]
# csc -e unix:///tmp/mesos-csi-Xxe00V/endpoint.sock  node publish --cap 
SINGLE_NODE_WRITER,mount,nfs --target-path /mnt/test --vol-context 
'server=192.168.10.58,share=/test' 192.168.10.58/test


[2]
{
  "id": "app1",
  "instances": 1,
  "cpus": 1,
  "mem": 32,
  "cmd": "echo yes > xxx/file && sleep 3600",
  "container": {
"type": "MESOS",
"volumes": [
  {
"containerPath": "xxx",
"mode": "rw",
"external": {
  "provider": "csi",
  "name": "no-need",
  "options": { 
"pluginName": "nfs.csi.k8s.io",
"capability": {
  "accessType": "mount",
  "accessMode": "MULTI_NODE_MULTI_WRITER",
  "fsType": "nfs"
},
"volumeContext": {
  "server": "192.168.10.58",
  "share": "/mnt/test"
}
  }
}
  }
]
  }
}


virtual memory task on mesos ~40GB while on docker ~3G

2020-12-13 Thread Marc Roos



When I launch a task via Docker with:
docker run --memory 2G --memory-swappiness 0 -v /dev/log:/dev/log -it 
--network host marathon:1.11.24

the task uses ~400MB resident and ~2.8GB virtual. 

When I launch the same task on Mesos, it uses ~900MB resident and ~47GB 
virtual.

Is this difference normal? Or should I configure other settings in 
Mesos?

On the mesos agent I have cgroups_limit_swap=true and to isolation I 
have added cgroups/mem





marathon plugin interface for mesos 1.11

2020-12-12 Thread Marc Roos


I wanted to test CSI in Mesos 1.11, but noticed that a plugin I use 
with Marathon does not load any more. It has this in its "build.sbt" 
file:

libraryDependencies += "mesosphere.marathon" %% "plugin-interface" % 
"1.6.325" % "provided"

I assume this needs to be changed to a newer Marathon plugin-interface 
version? Where can I find which versions are available?





RE: hostname in task

2020-12-08 Thread Marc Roos
 
Hi James,

Sorry to bring this up again, but Marathon is constantly logging errors 
because it uses the host name from the host networking, instead of its 
own task name, marathon.xxx.xxx.xxx.mesos, as a hostname for which 
there is a certificate.

Do you have an example of setting the hostname via the Mesos JSON? I 
have no idea how to interpret the GitHub link.


[1]
Dec  8 11:03:39 c02 marathon: ERROR Connection to leader refused.
Dec  8 11:03:39 c02 #011akka.stream.ConnectionException: Hostname 
verification failed! Expected session to be for c03
Dec  8 11:03:40 c03 marathon: ERROR Connection to leader refused.
Dec  8 11:03:40 c03 #011akka.stream.ConnectionException: Hostname 
verification failed! Expected session to be for c03
Dec  8 11:03:54 c02 marathon: ERROR Connection to leader refused.
Dec  8 11:03:54 c02 #011akka.stream.ConnectionException: Hostname 
verification failed! Expected session to be for c03
Dec  8 11:03:55 c03 marathon: ERROR Connection to leader refused.
Dec  8 11:03:55 c03 #011akka.stream.ConnectionException: Hostname 
verification failed! Expected session to be for c03
Dec  8 11:04:09 c02 marathon: ERROR Connection to leader refused.
Dec  8 11:04:09 c02 #011akka.stream.ConnectionException: Hostname 
verification failed! Expected session to be for c03
Dec  8 11:04:10 c03 marathon: ERROR Connection to leader refused.

-Original Message-
Subject: Re: hostname in task

> 
> 
> I read you can add a hostname option to the container in this 
> issue[0], however I still have the uuid. Is this available in mesos 1.8?

Yep.

> Can I
> somewhere read all these options? Like here[1]

The Mesos API is defined in the ContainerInfo protobuf, but I'm not 
sure how Marathon maps that:

https://github.com/apache/mesos/blob/master/include/mesos/v1/mesos.proto#L3395


> 
> 
> [@ cni]# cat 2f261fa8-4985-4614-b712-f0785ca6ce04/hosts
> 127.0.0.1 localhost
> 192.168.123.32 2f261fa8-4985-4614-b712-f0785ca6ce04
> 
> [0]
> https://reviews.apache.org/r/55191/
> [1]
> http://mesosphere.github.io/marathon/api-console/index.html
> 
> Using mesos 1.8
> And
> 
> "container": {
>"type": "MESOS",
>"hostname": "test.example.com",
>"docker": {
>"image": "test",
>"credential": null,
>"forcePullImage": true
>},
>   "volumes": [
>  {
>  "mode": "RW",
>  "containerPath": "/dev/log",
>  "hostPath": "/dev/log" 
>  }
>  ]
>  },





Marathon shutdown after master connection lost

2020-11-29 Thread Marc Roos




I hope nobody minds that I am crossposting this to mesos, since there is 
not much activity on the marathon mailing list. 

Is there an option to keep marathon running, having it try to reconnect 
to the mesos-master after it lost connection?

Currently I am running a sort of test cluster with only 1 ZooKeeper and 
1 mesos-master. This is quite OK for now; the only annoying thing is 
that when I update the mesos-master my Marathon tasks are gone, and I 
have to manually start one, which in turn restarts the failed ones.
It would be nice if they could try to reconnect to the mesos-master for 
15-30 min and only then go down. 




Package mesos-1.11.0-2.0.1.el7.x86_64.rpm is not signed

2020-11-28 Thread Marc Roos





Package mesos-1.11.0-2.0.1.el7.x86_64.rpm is not signed



RE: Suddenly all tasks gone, framework at completed, cannot start framework

2020-11-11 Thread Marc Roos


Is there a way to change this failover_timeout after the framework is 
running, via the API or so? I see it changes when the leader changes.

-Original Message-
To: user
Cc: cf.natali; janiszt
Subject: RE: Suddenly all tasks gone, framework at completed, cannot 
start framework


Thanks Tomek, Charles, I increased my MARATHON_FAILOVER_TIMEOUT from a 
day to a week. I almost cannot believe something happened yesterday that 
made everything go down today. However I have recently been testing with 
JAVA_OPTS to prevent oom's from the marathon tasks.





Changing logging timestamp

2020-09-20 Thread Marc Roos


I have a default remote syslog setup on CentOS; all applications and 
servers log with the same timestamp (zone), except Mesos and Marathon 
tasks. I assume UTC times are sent from them. How can I set this back 
to the host's default?
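A hedged guess at a fix: the Mesos daemons log via glog, which stamps in the process's local time, so a UTC timestamp usually means the daemon runs with a UTC timezone. Forcing TZ through a systemd drop-in might restore the host default; the unit name and the drop-in itself are assumptions, and the file is written to the current directory here for inspection rather than to /etc:

```shell
# Assumption: the daemons log in UTC because they run with a UTC timezone;
# forcing TZ in the unit environment may restore the host's zone.
# Written locally for inspection; would go under
# /etc/systemd/system/mesos-slave.service.d/ (unit name is an assumption).
mkdir -p mesos-slave.service.d
cat > mesos-slave.service.d/timezone.conf <<'EOF'
[Service]
Environment=TZ=:/etc/localtime
EOF
cat mesos-slave.service.d/timezone.conf
```

After copying the directory to /etc/systemd/system/, a systemctl daemon-reload and a restart of the unit would be needed.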






RE: Paid help for getting csi ceph working

2020-09-14 Thread Marc Roos
 
Hi Vinod, thanks for the link. I had a look at the design document as 
well; it looks promising and also clears up some questions I had. Can't 
wait to give it a try. Good luck with this!




-Original Message-
To: user
Subject: Re: Paid help for getting csi ceph working

SERP is not available yet.

We are currently working on an alternative way to get external storage 
into Mesos instead of using SLRP.  Please watch the progress here: 
https://issues.apache.org/jira/browse/MESOS-10141 . MVP support will 
land in the upcoming release of Mesos.

On Mon, Sep 7, 2020 at 2:08 PM Marc Roos  
wrote:




Is there anyone interested in giving some paid help to get me up and 
running with an SLRP with Ceph? I assume this SLRP is still not 
available?








Paid help for getting csi ceph working

2020-09-07 Thread Marc Roos



Is there anyone interested in giving some paid help to get me up and 
running with an SLRP with Ceph? I assume this SLRP is still not 
available?





slrp csi ceph rbd static volume possible with mesos 1.10

2020-08-26 Thread Marc Roos


I would like to map a ceph rbd device to a task as a static/pre-existing 
volume. Is there any guide on how to do this?


recommended ceph csi plugin?

2020-08-26 Thread Marc Roos



Is there a recommended Ceph CSI plugin? I found this one[1], but I 
think it is only usable with Kubernetes, since it requires secrets to 
be stored in a Kubernetes object.


[1]
https://github.com/ceph/ceph-csi


RE: marathon (or java) container constantly oom

2020-08-26 Thread Marc Roos
 

Thanks for this, I will stop trying then. What I noticed, though I am 
not sure about this, is that memory consumption starts getting worse 
after I have accessed the web interface; then the usage climbs more 
rapidly. 



-Original Message-
To: user
Cc: marathon-framework
Subject: Re: marathon (or java) container constantly oom

Hi,

it is a known issue with Marathon:

https://jira.d2iq.com/browse/MARATHON-8180


AFAIK it hasn't been fixed yet. You can tune GC or increase memory 
limits, but the memory usage will grow indefinitely with a higher number 
of tasks.

Regards,
Tomas

On Wed, 26 Aug 2020 at 11:11, Marc Roos  
wrote:



Recently I enabled the cpu and memory isolators on my test cluster. And 
since then I have been seeing the marathon containers (when becoming 
leader) increase memory usage from ~400MB until they oom at 850MB 
(checking via systemd-cgtop).

Now I am testing with these settings from this page[1]

JAVA_OPTS "-Xshare:off -XX:+UseSerialGC -XX:+TieredCompilation 
-XX:TieredStopAtLevel=1 -Xint -XX:+UnlockExperimentalVMOptions 
-XX:+UseJVMCICompiler"
LD_PRELOAD "/usr/lib64/libjemalloc.so.1"

Is someone able to share an efficient config? Or is it not possible to 
get marathon running below 1GB? At the moment I have only ~10 tasks.

[1]
https://stackoverflow.com/questions/53451103/java-using-much-more-memory-than-heap-size-or-size-correctly-docker-memory-limi







marathon (or java) container constantly oom

2020-08-26 Thread Marc Roos


Recently I enabled the cpu and memory isolators on my test cluster. And 
since then I have been seeing the marathon containers (when becoming 
leader) increase memory usage from ~400MB until they oom at 850MB 
(checking via systemd-cgtop).

Now I am testing with these settings from this page[1]

JAVA_OPTS "-Xshare:off -XX:+UseSerialGC -XX:+TieredCompilation 
-XX:TieredStopAtLevel=1 -Xint -XX:+UnlockExperimentalVMOptions 
-XX:+UseJVMCICompiler"
LD_PRELOAD "/usr/lib64/libjemalloc.so.1"

Is someone able to share an efficient config? Or is it not possible to 
get marathon running below 1GB? At the moment I have only ~10 tasks.

[1]
https://stackoverflow.com/questions/53451103/java-using-much-more-memory-than-heap-size-or-size-correctly-docker-memory-limi
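A JVM's footprint is the heap plus several off-heap pools (metaspace, code cache, thread stacks, direct buffers), so capping only -Xmx still leaves a lot uncounted against the cgroup limit. A hedged sketch of an env file that bounds the main pools; the numbers are illustrative assumptions, not a verified Marathon configuration:

```shell
# Illustrative JVM sizing for a small Marathon (values are assumptions,
# not a verified config): cap the heap and the main off-heap pools so the
# container total stays predictably below the cgroup limit.
cat > marathon-jvm.env <<'EOF'
JAVA_OPTS="-Xms256m -Xmx512m \
 -XX:MaxMetaspaceSize=128m \
 -XX:ReservedCodeCacheSize=64m \
 -XX:MaxDirectMemorySize=64m \
 -Xss512k"
EOF
cat marathon-jvm.env
```

Whether Marathon stays functional at these sizes depends on the task count; the linked Stack Overflow answer discusses how each pool contributes to the total.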




FW: How to configure a pre-existing slrp volume/disk

2020-08-25 Thread Marc Roos
 

On this DC/OS manual page[1] only the use of a profile from an SLRP is 
described. Anyone know how to change this to a pre-existing (lvm) 
volume? (A Mesos example is also welcome ;)


cat > app2.json <<EOF
{
  "cmd": "... >> data/foo && cat data/foo && sleep 5000",
  "container": {
"docker": {
  "image": "alpine"
},
"type": "MESOS",
"volumes": [
  {
"containerPath": "data",
"mode": "RW",
"persistent": {
  "size": 100,
  "profileName": "fast",
  "type": "mount"
}
  }
]
  },
  "cpus": 0.1,
  "id": "/app-persistent-stable-good-profile-2",
  "instances": 1,
  "mem": 128,
  "residency": {
"taskLostBehavior": "WAIT_FOREVER",
"relaunchEscalationTimeoutSeconds": 3600
  },
  "unreachableStrategy": "disabled",
  "upgradeStrategy": {
"maximumOverCapacity": 0,
"minimumHealthCapacity": 0
  }
}
EOF


[1]
https://docs.d2iq.com/mesosphere/dcos/services/storage/1.0.0/tutorials/manage-local-disks/



RE: Suddenly all tasks gone, framework at completed, cannot start framework

2020-08-25 Thread Marc Roos


Thanks Tomek, Charles, I increased my MARATHON_FAILOVER_TIMEOUT from a 
day to a week. I almost cannot believe something happened yesterday that 
made everything go down today. However I have recently been testing with 
JAVA_OPTS to prevent oom's from the marathon tasks.




-Original Message-
From: Tomek Janiszewski [mailto:jani...@gmail.com] 
Sent: dinsdag 25 augustus 2020 16:55
To: user
Subject: Re: Suddenly all tasks gone, framework at completed, cannot 
start framework

See: https://stackoverflow.com/a/42544023/1387612

wt., 25 sie 2020 o 15:07 Marc Roos  
napisał(a):




Today all my tasks are down and framework marathon is at completed. Any 
idea how this can happen?



ed.cpp:520] Successfully authenticated with master 
master@192.168.10.151:5050
I0825 13:03:27.961248   108 sched.cpp:1188] Got error 'Framework has 
been removed'





RE: Suddenly all tasks gone, framework at completed, cannot start framework -

2020-08-25 Thread Marc Roos




I assume this was because something happened with ZooKeeper, and it 
restarted loading the wrong configuration file, without the quorum=1 
setting, because I was testing with different ZooKeeper rpms (the mesos 
rpm conf is not in the standard location).

Question: Is it by design that all tasks are terminated when ZooKeeper 
is gone? Is there some timeout setting that allows tasks to run for a 
day without ZooKeeper?





-Original Message-
To: user
Subject: Suddenly all tasks gone, framework at completed, cannot start 
framework



Today all my tasks are down and framework marathon is at completed. Any 
idea how this can happen?



ed.cpp:520] Successfully authenticated with master 
master@192.168.10.151:5050
I0825 13:03:27.961248   108 sched.cpp:1188] Got error 'Framework has 
been removed'





Suddenly all tasks gone, framework at completed, cannot start framework

2020-08-25 Thread Marc Roos



Today all my tasks are down and framework marathon is at completed. Any 
idea how this can happen?



ed.cpp:520] Successfully authenticated with master 
master@192.168.10.151:5050
I0825 13:03:27.961248   108 sched.cpp:1188] Got error 'Framework has 
been removed'



mesosphere csilvm doesn't have socket after startup

2020-08-22 Thread Marc Roos


I am not sure if the CSI standard requires that CSI_ENDPOINT be set; in 
any case:

- csilvm does not work without explicitly setting -unix-addr-env 
CSI_ENDPOINT. So either document this or make it the default.

- I could not test with csc on the csilvm master branch, only with this 
CSI 1.2(?) update. So merge that pull request finally, so other people 
do not waste time trying to get master working.


E0822 19:00:42.980020 18830 provider.cpp:541] Failed to recover resource 
provider with type 'org.apache.mesos.rp.local.storage' and name 
'local_vg': Timed out waiting for endpoint 
'unix:///tmp/mesos-csi-Q16CR1/endpoint.sock'
E0822 19:00:42.980559 18828 container_daemon.cpp:150] Failed to launch 
container 
'org-apache-mesos-rp-local-storage-local_vg--io-mesosphere-csi-lvm-csilvm--CONTROLLER_SERVICE-NODE_SERVICE': Timed out waiting for endpoint 
'unix:///tmp/mesos-csi-Q16CR1/endpoint.sock'
E0822 19:00:42.980654 18828 service_manager.cpp:751] Container daemon 
for 
'org-apache-mesos-rp-local-storage-local_vg--io-mesosphere-csi-lvm-csilvm--CONTROLLER_SERVICE-NODE_SERVICE' failed: Timed out waiting for 
endpoint 'unix:///tmp/mesos-csi-Q16CR1/endpoint.sock'


csi drivers endpoint errors, maybe update slrp page with info on how to configure these csi endpoints

2020-08-22 Thread Marc Roos



E0815 18:43:38.774154  1073 service_manager.cpp:751] Container daemon 
for 
'org-apache-mesos-rp-local-storage-local_blockdevices--nfs-csi-k8s-io-csi_blockdevices--CONTROLLER_SERVICE-NODE_SERVICE' failed: Timed out 
waiting for endpoint 'unix:///tmp/mesos-csi-iJusqh/endpoint.sock'
E0815 18:43:38.780150  1070 provider.cpp:541] Failed to recover resource 
provider with type 'org.apache.mesos.rp.local.storage' and name 
'local_nfs': Timed out waiting for endpoint 
'unix:///tmp/mesos-csi-bik1bT/endpoint.sock'
E0815 18:43:38.780278  1075 container_daemon.cpp:150] Failed to launch 
container 
'org-apache-mesos-rp-local-storage-local_nfs--nfs-csi-k8s-io-csilvm--CONTROLLER_SERVICE-NODE_SERVICE': Timed out waiting for endpoint 
'unix:///tmp/mesos-csi-bik1bT/endpoint.sock'
E0815 18:43:38.780364  1075 service_manager.cpp:751] Container daemon 
for 
'org-apache-mesos-rp-local-storage-local_nfs--nfs-csi-k8s-io-csilvm--CONTROLLER_SERVICE-NODE_SERVICE' failed: Timed out waiting for 
endpoint 'unix:///tmp/mesos-csi-bik1bT/endpoint.sock'
E0815 18:43:38.783254  1076 provider.cpp:541] Failed to recover resource 
provider with type 'org.apache.mesos.rp.local.storage' and name 
'local_vg': Timed out waiting for endpoint 
'unix:///tmp/mesos-csi-ugRVyg/endpoint.sock'
E0815 18:43:38.783386  1075 container_daemon.cpp:150] Failed to launch 
container 
'org-apache-mesos-rp-local-storage-local_vg--io-mesosphere-csi-lvm-csilvm--CONTROLLER_SERVICE-NODE_SERVICE': Timed out waiting for endpoint 
'unix:///tmp/mesos-csi-ugRVyg/endpoint.sock'
E0815 18:43:38.783461  1075 service_manager.cpp:751] Container daemon 
for 
'org-apache-mesos-rp-local-storage-local_vg--io-mesosphere-csi-lvm-csilvm--CONTROLLER_SERVICE-NODE_SERVICE' failed: Timed out waiting for 
endpoint 'unix:///tmp/mesos-csi-ugRVyg/endpoint.sock'


[1]
http://mesos.apache.org/documentation/latest/csi/


cni chaining, bandwidth plugin

2020-08-21 Thread Marc Roos


You should reconsider supporting cni 0.3.0, so people can use this cni 
bandwidth plugin[1]

[1]
https://github.com/containernetworking/plugins/tree/master/plugins/meta/bandwidth
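For context on why the spec version matters: meta-plugins such as bandwidth only run as part of a chained conflist, which requires CNI spec 0.3.0 or later. A sketch of what such a chained config would look like if Mesos supported it; the bridge name, subnet, and rates (bits per second) are illustrative assumptions:

```shell
# Illustrative conflist chaining the containernetworking "bandwidth"
# meta-plugin (requires CNI spec >= 0.3.0, hence the request above);
# bridge name, subnet, and rates are assumptions.
cat > mesos-limited.conflist <<'EOF'
{
  "cniVersion": "0.3.0",
  "name": "mesos-limited",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "m-cni0",
      "ipam": { "type": "host-local", "subnet": "172.31.0.0/16" }
    },
    {
      "type": "bandwidth",
      "ingressRate": 10000000,
      "ingressBurst": 10000000,
      "egressRate": 10000000,
      "egressBurst": 10000000
    }
  ]
}
EOF
python3 -m json.tool mesos-limited.conflist > /dev/null && echo OK
```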






RE: How to test if slrp is working correctly

2020-08-20 Thread Marc Roos



No one able to help? ;)


-Original Message-
To: user
Subject: How to test if slrp is working correctly



I am testing with slrp and csi drivers after watching this video[1] of 
mesosphere. I would like to know how I can verify that the slrp is 
properly configured and working.

1. Can I use an API endpoint to query controller/list-volumes or do a 
controller/create-volume? I found this csc tool that can use a socket; 
however, it does not work with some CSI drivers (only with csinfs)[2]

After I disabled the endpoint authentication, the SLRPs do seem to 
launch these CSI drivers. I have processes like this:

   793   790  0 Aug15 ?  00:00:00 ./csi-blockdevices
 15298 15292  0 Aug15 ?  00:01:00 ./test-csi-plugin 
--available_capacity=2GB --work_dir=workdir
 16292 16283  0 Aug15 ?  00:00:05 ./csilvm 
-unix-addr=unix:///run/csilvm.sock -volume-group VGtest
 17639 17636  0 Aug15 ?  00:00:08 ./csinfs --endpoint 
unix://run/csinfs.sock --nodeid test --alsologtostderr --log_dir /tmp
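On the question of querying controller/list-volumes: the agent exposes each SLRP-launched plugin behind a per-plugin unix socket (the /tmp/mesos-csi-*/endpoint.sock paths visible in agent logs elsewhere in this archive), and csc can speak CSI to that socket directly. A hedged sketch, guarded so it only issues live calls where both csc and a socket exist:

```shell
# Hedged sketch: poke an SLRP-launched CSI plugin directly over its socket.
# The socket path pattern comes from agent logs in this archive; csc is the
# gocsi CLI. Only runs live calls if both are present on this host.
SOCK=$(ls /tmp/mesos-csi-*/endpoint.sock 2>/dev/null | head -n1)
if [ -n "$SOCK" ] && command -v csc >/dev/null 2>&1; then
  csc -e "unix://$SOCK" identity plugin-info
  csc -e "unix://$SOCK" controller list-volumes
else
  echo "no csc or no SLRP socket on this host; nothing to query"
fi
```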




[1]
https://www.youtube.com/watch?v=zhALmyC3Om4

[2]
[root@m01 resource-providers]# csc --endpoint unix:///run/csinfs.sock 
identity plugin-info
"nfs.csi.k8s.io" "2.0.0"

[root@m01 resource-providers]# csc --endpoint unix:///run/csilvm.sock 
identity plugin-info
unknown service csi.v1.Identity

[root@m01 resource-providers]# csc --endpoint unix:///run/csiblock.sock 
identity plugin-info
unknown service csi.v1.Identity




How to test if slrp is working correctly

2020-08-17 Thread Marc Roos



I am testing with slrp and csi drivers after watching this video[1] of 
mesosphere. I would like to know how I can verify that the slrp is 
properly configured and working.

1. Can I use an API endpoint to query controller/list-volumes or do a 
controller/create-volume? I found this csc tool that can use a socket; 
however, it does not work with some CSI drivers (only with csinfs)[2]

After I disabled the endpoint authentication, the SLRPs do seem to 
launch these CSI drivers. I have processes like this:

   793   790  0 Aug15 ?  00:00:00 ./csi-blockdevices
 15298 15292  0 Aug15 ?  00:01:00 ./test-csi-plugin 
--available_capacity=2GB --work_dir=workdir
 16292 16283  0 Aug15 ?  00:00:05 ./csilvm 
-unix-addr=unix:///run/csilvm.sock -volume-group VGtest
 17639 17636  0 Aug15 ?  00:00:08 ./csinfs --endpoint 
unix://run/csinfs.sock --nodeid test --alsologtostderr --log_dir /tmp




[1]
https://www.youtube.com/watch?v=zhALmyC3Om4

[2]
[root@m01 resource-providers]# csc --endpoint unix:///run/csinfs.sock 
identity plugin-info
"nfs.csi.k8s.io" "2.0.0"

[root@m01 resource-providers]# csc --endpoint unix:///run/csilvm.sock 
identity plugin-info
unknown service csi.v1.Identity

[root@m01 resource-providers]# csc --endpoint unix:///run/csiblock.sock 
identity plugin-info
unknown service csi.v1.Identity


RE: mesos csi test plugin slrp 401 Unauthorized

2020-08-15 Thread Marc Roos
 

If I disable authenticate_http_readwrite and authenticate_http_readonly, 
my test SLRPs are indeed loaded and I see tasks running. 

Launching these tasks as described on the manual page via curl[1] also 
fails. The task is not running, but I see that the curl command's JSON 
is being put in the resource-providers dir.

So please, some info on how to get this working with 
authenticate_http_readwrite and authenticate_http_readonly enabled.

[1]
curl --user xxx:xxx -X POST -H 'Content-Type: application/json' 
http://m01.local:5051/api/v1 -d 
'{"type":"ADD_RESOURCE_PROVIDER_CONFIG","add_resource_provider_config":{
"info":
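The payload above is cut off; for reference, a hedged reconstruction of what a complete ADD_RESOURCE_PROVIDER_CONFIG body looks like, from memory of the test-plugin example in the Mesos CSI documentation[1] — field names and values are assumptions to verify against that page:

```shell
# Hedged reconstruction of a full ADD_RESOURCE_PROVIDER_CONFIG payload for
# the test CSI plugin, from memory of the Mesos CSI docs; verify the field
# names against the documentation before use.
cat > add-rp.json <<'EOF'
{
  "type": "ADD_RESOURCE_PROVIDER_CONFIG",
  "add_resource_provider_config": {
    "info": {
      "type": "org.apache.mesos.rp.local.storage",
      "name": "test_slrp",
      "default_reservations": [ { "type": "DYNAMIC", "role": "test-role" } ],
      "storage": {
        "plugin": {
          "type": "org.apache.mesos.csi.test",
          "name": "slrp_test",
          "containers": [
            {
              "services": [ "CONTROLLER_SERVICE", "NODE_SERVICE" ],
              "command": {
                "value": "./test-csi-plugin --available_capacity=2GB --work_dir=workdir"
              }
            }
          ]
        }
      }
    }
  }
}
EOF
python3 -m json.tool add-rp.json > /dev/null && echo OK
# curl --user xxx:xxx -X POST -H 'Content-Type: application/json' \
#      http://m01.local:5051/api/v1 -d @add-rp.json
```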





-Original Message-
To: user
Subject: mesos csi test plugin slrp 401 Unauthorized


I am testing with this[1]:

Failed to recover resource provider with type 
'org.apache.mesos.rp.local.storage' and name 'test_slrp': Failed to get
containers: Unexpected response '401 Unauthorized' (401 Unauthorized.)

Is this because I am having authentication on, and the standalone 
container cannot launch? How to resolve this?


[1]
http://mesos.apache.org/documentation/latest/csi/




A more practical guide on how to configure and get csi working (preferably with ceph)

2020-08-14 Thread Marc Roos



Can anyone point me to a more practical guide on how to configure and 
get csi working (preferably with ceph)



test-csi-plugin should work?

2020-08-14 Thread Marc Roos




   This option has no effect when 
using the HTTP scheduler/executor APIs.
   By default, this option is true. 
(default: true)
  --log_dir=VALUE  Location to put log files.  By 
default, nothing is written to disk.
   Does not affect logging to 
stderr.
   If specified, the log file will 
appear in the Mesos WebUI.
   NOTE: 3rd party log messages 
(e.g. ZooKeeper) are
   only written to stderr!
  --logbufsecs=VALUE   Maximum number of seconds that 
logs may be buffered for.
   By default, logs are flushed 
immediately. (default: 0)
  --logging_level=VALUELog message at or above this 
level.
   Possible values: `INFO`, 
`WARNING`, `ERROR`.
   If `--quiet` is specified, this 
will only affect the logs
   written to `--log_dir`, if 
specified. (default: INFO)
  --[no-]quiet Disable logging to stderr. 
(default: false)
  --volume_metadata=VALUE  The static properties to add to 
the contextual information of each
   volume. The metadata are 
specified as a semicolon-delimited list of
   prop=value pairs. (Example: 
'prop1=value1;prop2=value2')
  --volumes=VALUE  Creates preprovisioned volumes 
upon start-up. The volumes are
   specified as a 
semicolon-delimited list of name:capacity pairs.
   If a volume with the same name 
already exists, the pair will be
   ignored. (Example: 
'volume1:1GB;volume2:2GB')
  --work_dir=VALUE Path to the work directory of the 
plugin. (default: )

*** Error in `/usr/libexec/cni/test-csi-plugin': free(): invalid 
pointer: 0x7f5e1ea25a10 ***
=== Backtrace: =
/lib64/libc.so.6(+0x81299)[0x7f5e18dcc299]
/usr/libexec/cni/test-csi-plugin(_ZN9__gnu_cxx13new_allocatorIPNSt8__det
ail15_Hash_node_baseEE10deallocateEPS3_m+0x20)[0x5631f93bc1b0]
/usr/libexec/cni/test-csi-plugin(_ZNSt10_HashtableISsSt4pairIKSsSsESaIS2
_ENSt8__detail10_Select1stESt8equal_toISsESt4hashISsENS4_18_Mod_range_ha
shingENS4_20_Default_ranged_hashENS4_20_Prime_rehash_policyENS4_17_Hasht
able_traitsILb1ELb0ELb121_M_deallocate_bucketsEPPNS4_15_Hash_node_ba
seEm+0x58)[0x5631f93b2772]
/usr/libexec/cni/test-csi-plugin(_ZNSt10_HashtableISsSt4pairIKSsSsESaIS2
_ENSt8__detail10_Select1stESt8equal_toISsESt4hashISsENS4_18_Mod_range_ha
shingENS4_20_Default_ranged_hashENS4_20_Prime_rehash_policyENS4_17_Hasht
able_traitsILb1ELb0ELb1D2Ev+0x36)[0x5631f93a597c]
/usr/libexec/cni/test-csi-plugin(_ZNSt13unordered_mapISsSsSt4hashISsESt8
equal_toISsESaISt4pairIKSsSsEEED1Ev+0x18)[0x5631f9399eb0]
/usr/libexec/cni/test-csi-plugin(_ZN7hashmapISsSsSt4hashISsESt8equal_toI
SsEED1Ev+0x18)[0x5631f9399eca]
/lib64/libc.so.6(__cxa_finalize+0x9a)[0x7f5e18d8505a]
/usr/local/lib/libmesos-1.10.0.so(+0x22b34f3)[0x7f5e1be074f3]
=== Memory map: 
5631f9315000-5631f9442000 r-xp  fd:00 507586 
/usr/libexec/cni/test-csi-plugin
5631f9642000-5631f9646000 r--p 0012d000 fd:00 507586 
/usr/libexec/cni/test-csi-plugin
5631f9646000-5631f9647000 rw-p 00131000 fd:00 507586 
/usr/libexec/cni/test-csi-plugin
5631fb041000-5631fb0a4000 rw-p  00:00 0  
[heap]
7f5e0c00-7f5e0c021000 rw-p  00:00 0
7f5e0c021000-7f5e1000 ---p  00:00 0
7f5e130ea000-7f5e1314a000 r-xp  fd:00 16872768   
/usr/lib64/libpcre.so.1.2.0
7f5e1314a000-7f5e1334a000 ---p 0006 fd:00 16872768   
/usr/lib64/libpcre.so.1.2.0
7f5e1334a000-7f5e1334b000 r--p 0006 fd:00 16872768   
/usr/lib64/libpcre.so.1.2.0
7f5e1334b000-7f5e1334c000 rw-p 00061000 fd:00 16872768   
/usr/lib64/libpcre.so.1.2.0


mesos csi test plugin slrp 401 Unauthorized

2020-08-14 Thread Marc Roos


I am testing with this [1] and I am getting: 

Failed to recover resource provider with type 
'org.apache.mesos.rp.local.storage' and name 'test_slrp': Failed to get 
containers: Unexpected response '401 Unauthorized' (401 Unauthorized.)

Is this because I have authentication enabled and the standalone 
container cannot launch? How can I resolve this?


[1]
http://mesos.apache.org/documentation/latest/csi/


RE: srv port lookups on tasks with cni networks

2020-08-12 Thread Marc Roos
 
"container": {
"type": "MESOS",
"portMappings": [
{"hostPort": 0, "name": "https",  "protocol": "tcp", 
"networkNames": ["cni-apps"]}, 
{"hostPort": 0, "name": "metrics",  "protocol": "tcp", 
"networkNames": ["cni-apps"]}
],



-Original Message-
To: user
Subject: srv port lookups on tasks with cni networks


How can I assign random ports on a cni network and read these back via 
SRV records? What is the equivalent of portDefinitions for network/host 
when using network/container?



-Original Message-
To: user
Subject: health check not working after changing host network


If I change a task from: 

  "networks": [ 
{ "mode": "host" }
  ],
  "portDefinitions": [
{"port": 0, "name": "health",  "protocol": "tcp"},
{"port": 0, "name": "metrics",  "protocol": "tcp"}
  ],

To: 

  "networks": [ 
{ "mode": "container", "name": "cni-storage" }
  ],
  "portDefinitions": [
{"port": 0, "name": "health",  "protocol": "tcp"},
{"port": 0, "name": "metrics",  "protocol": "tcp"}
  ],


I am getting this error:

W0804 23:18:12.942282 3421440 health_checker.cpp:273] HTTP health check 
for task 'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1' 
failed: curl exited with status 7: curl: (7) Failed connect to 
127.0.0.1:0; Connection refused
I0804 23:18:12.942337 3421440 health_checker.cpp:299] Ignoring failure 
of HTTP health check for task
'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1': still in 
grace period

But when I disable the health check and enter the network namespace of 
the running task, the localhost check works: [@ 
c59ea592-322f-4bfc-8981-21215904da58]# curl http://localhost:52684/test 
200 OK Service ready.


What am I doing wrong?














Are cni networks launched in the sequence in which they are configured?

2020-08-11 Thread Marc Roos


 
I was wondering if cni networks are always applied in sequence. I am 
seeing the same order of eth0, eth1, etc., but is it true that the 
second network is only created once the first has been successfully 
attached?




RE: Asymmetric route possible between agent and container?

2020-08-08 Thread Marc Roos
 
I was just giving this setting a try, but a test task does not want to 
launch. Do I always have to set this in combination with 
domain_socket_location? I am getting "Failed to synchronize with agent 
(it's probably exited)".
This should not affect the way cni networks are provisioned, e.g. via 
dhcp, should it?



-Original Message-
To: user
Subject: Re: Asymmetric route possible between agent and container?

I think you could try this flag `http_executor_domain_sockets` which 
was introduced in Mesos 1.10.0.


--http_executor_domain_sockets: If true, the agent will provide a 
unix domain socket that the executor can use to connect to the agent, 
instead of relying on a TCP connection.
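
As a sketch only (the flag names appear in this thread and the Mesos 
1.10 agent flags; the socket path and master address here are 
assumptions), enabling the domain socket together with an explicit 
location might look like:

```shell
# Hypothetical agent invocation -- enables the executor domain socket
# and pins its location (example path, not a Mesos default).
mesos-agent \
  --master=zk://192.168.10.151:2181/mesos \
  --work_dir=/var/lib/mesos \
  --http_executor_domain_sockets \
  --domain_socket_location=/var/run/mesos/executor.sock
```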



Regards,
Qian Zhang


On Sat, Aug 8, 2020 at 4:59 PM Marc Roos  
wrote:



"it is imperative that the Agent IP is reachable from the container 
IP 
and vice versa."

Anyone know/tested if this can be an asymmetric route when you are 
having multiple networks?

[1]
http://mesos.apache.org/documentation/latest/cni/






Asymmetric route possible between agent and container?

2020-08-08 Thread Marc Roos


"it is imperative that the Agent IP is reachable from the container IP 
and vice versa."

Anyone know/tested if this can be an asymmetric route when you are 
having multiple networks?

[1]
http://mesos.apache.org/documentation/latest/cni/



Error "networkNames must be a single item list when hostPort is specified and more than 1 container network is defined"

2020-08-05 Thread Marc Roos


I am getting this error message when launching a task with portMappings 
and two container networks. What is the proper way to configure this?

general: networkNames must be a single item list when hostPort is 
specified and more than 1 container network is defined

  "networks": [ 
{ "mode": "container", "name": "cni-storage" },
{ "mode": "container", "name": "cni-apps-public", "labels": 
{"vendorid": "ext-testing"}}
  ],
  "container": {
"type": "MESOS",
"portMappings": [
  {"hostPort": 0, "name": "health", "protocol": "tcp"},
  {"hostPort": 0, "name": "metrics", "protocol": "tcp"}
]
  }

[1]
https://jira.d2iq.com/browse/MARATHON-8760
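
The error above suggests that with more than one container network, each 
portMapping must be pinned to exactly one network. A sketch of a 
configuration that should satisfy that constraint (which network each 
port belongs on is an assumption):

```json
"networks": [
  { "mode": "container", "name": "cni-storage" },
  { "mode": "container", "name": "cni-apps-public", "labels": {"vendorid": "ext-testing"} }
],
"container": {
  "type": "MESOS",
  "portMappings": [
    {"hostPort": 0, "name": "health",  "protocol": "tcp", "networkNames": ["cni-storage"]},
    {"hostPort": 0, "name": "metrics", "protocol": "tcp", "networkNames": ["cni-apps-public"]}
  ]
}
```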


srv port lookups on tasks with cni networks

2020-08-05 Thread Marc Roos


How can I assign random ports on a cni network and read these back via 
SRV records? What is the equivalent of portDefinitions for network/host 
when using network/container?



-Original Message-
To: user
Subject: health check not working after changing host network


If I change a task from: 

  "networks": [ 
{ "mode": "host" }
  ],
  "portDefinitions": [
{"port": 0, "name": "health",  "protocol": "tcp"},
{"port": 0, "name": "metrics",  "protocol": "tcp"}
  ],

To: 

  "networks": [ 
{ "mode": "container", "name": "cni-storage" }
  ],
  "portDefinitions": [
{"port": 0, "name": "health",  "protocol": "tcp"},
{"port": 0, "name": "metrics",  "protocol": "tcp"}
  ],


I am getting this error:

W0804 23:18:12.942282 3421440 health_checker.cpp:273] HTTP health check 
for task 'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1' 
failed: curl exited with status 7: curl: (7) Failed connect to 
127.0.0.1:0; Connection refused
I0804 23:18:12.942337 3421440 health_checker.cpp:299] Ignoring failure 
of HTTP health check for task
'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1': still in 
grace period

But when I disable the health check and enter the network namespace of 
the running task, the localhost check works: [@ 
c59ea592-322f-4bfc-8981-21215904da58]# curl http://localhost:52684/test 
200 OK Service ready.


What am I doing wrong?












health check not working after changing host network

2020-08-04 Thread Marc Roos


If I change a task from: 

  "networks": [ 
{ "mode": "host" }
  ],
  "portDefinitions": [
{"port": 0, "name": "health",  "protocol": "tcp"},
{"port": 0, "name": "metrics",  "protocol": "tcp"}
  ],

To: 

  "networks": [ 
{ "mode": "container", "name": "cni-storage" }
  ],
  "portDefinitions": [
{"port": 0, "name": "health",  "protocol": "tcp"},
{"port": 0, "name": "metrics",  "protocol": "tcp"}
  ],


I am getting this error:

W0804 23:18:12.942282 3421440 health_checker.cpp:273] HTTP health check 
for task 'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1' 
failed: curl exited with status 7: curl: (7) Failed connect to 
127.0.0.1:0; Connection refused
I0804 23:18:12.942337 3421440 health_checker.cpp:299] Ignoring failure 
of HTTP health check for task 
'dev_.instance-78cc92f7-d697-11ea-b815-e41d2d0c3e20._app.1': still in 
grace period

But when I disable the health check and enter the network namespace of 
the running task, the localhost check works:
[@ c59ea592-322f-4bfc-8981-21215904da58]# curl 
http://localhost:52684/test
200 OK
Service ready.


What am I doing wrong?
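
One direction worth trying (a sketch, not a confirmed fix): with "mode": 
"container", Marathon expects the named ports under container.portMappings 
rather than top-level portDefinitions, and the health check can then 
reference a mapping by portIndex so it is performed against the 
container's address instead of 127.0.0.1:0. The /test path below is 
taken from the curl above; everything else is an assumption:

```json
"networks": [
  { "mode": "container", "name": "cni-storage" }
],
"container": {
  "type": "MESOS",
  "portMappings": [
    {"hostPort": 0, "containerPort": 0, "name": "health",  "protocol": "tcp"},
    {"hostPort": 0, "containerPort": 0, "name": "metrics", "protocol": "tcp"}
  ]
},
"healthChecks": [
  { "protocol": "MESOS_HTTP", "portIndex": 0, "path": "/test" }
]
```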










Ceph support planned?

2020-07-31 Thread Marc Roos



Is native ceph support in the planning? Libvirt supports ceph with 
librbd[1]. What is currently the best practice to use ceph storage?

[1]
https://docs.ceph.com/docs/master/rbd/libvirt/




mesos master default drop acl

2020-07-30 Thread Marc Roos



Currently I am running a testing environment with a default acl I 
found[1]. I have configured mesos credentials, and afaik everything 
(agents, the marathon framework) is authenticating. So I thought about 
converting the acl to default drop/deny. However, I see there are quite 
a few options.

Is it advisable to set them all to deny? Is there an example of how to 
set the url for GetEndpoint?

[2]
https://github.com/apache/mesos/blob/master/include/mesos/authorizer/acls.proto
http://mesos.apache.org/documentation/latest/configuration/master/

[1]
{
  "run_tasks": [
{
  "principals": {
"type": "ANY"
  },
  "users": {
"type": "ANY"
  }
}
  ],
  "register_frameworks": [
{
  "principals": {
"type": "ANY"
  },
  "roles": {
"type": "ANY"
  }
}
  ]
}
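
As a sketch of the default-deny direction (field names follow 
acls.proto; whether you really want everything locked down, and the 
"admin" principal, are assumptions): setting "permissive": false makes 
unmatched requests get denied, and GetEndpoint takes a list of endpoint 
paths:

```json
{
  "permissive": false,
  "get_endpoints": [
    {
      "principals": { "type": "SOME", "values": ["admin"] },
      "paths": { "type": "SOME", "values": ["/metrics/snapshot", "/logging/toggle"] }
    }
  ],
  "run_tasks": [
    { "principals": { "type": "ANY" }, "users": { "type": "ANY" } }
  ],
  "register_frameworks": [
    { "principals": { "type": "ANY" }, "roles": { "type": "ANY" } }
  ]
}
```

Note that the GetEndpoint ACL only covers the small set of endpoints 
listed in acls.proto, so check [2] before relying on it for everything.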


Fyi: nginx+ srv lookups also now available in basic nginx

2020-07-29 Thread Marc Roos


For anyone who is interested: I was surprised that nginx does not offer 
srv lookups in its free version. I found a module that offers this, but 
it did not work because of syntax differences in srv lookups on mesos. I 
adapted this module to force sending a whole srv domain, and tests look 
promising.

You can find the module here for now, but remember that if you have 
groups in mesos, use
 service=_https._synapse.dev._tcp.marathon.mesos. ('.' at the end)
https://github.com/f1-outsourcing/ngx_upstream_resolveMK

I have asked the alpine linux guys to add this to their repository, so 
we do not need to go through this compiling hassle every time.




RE: random string in task groups hostname

2020-07-28 Thread Marc Roos
 
Second, I have the impression that SRV records are not correctly 
implemented: should ._tcp not be at the front (directly after the 
service) instead of in the middle? Or do I have something incorrect in 
my mesos configuration that makes these groups act as part of the task 
name?

[@]$ dig +short _sip._udp.sip.voice.google.com SRV
20 1 5060 sip-anycast-2.voice.google.com.
10 1 5060 sip-anycast-1.voice.google.com.

bash-5.0# dig +short @192.168.10.14 _tcp.server.temp.test.marathon.mesos
bash-5.0# dig +short @192.168.10.14 server._tcp.temp.test.marathon.mesos
bash-5.0# dig +short @192.168.10.14 
_server._tcp.temp.test.marathon.mesos



-Original Message-
To: user
Subject: random string in task groups hostname



I cannot remember seeing this before and wondered if this is common and 
intended. In srv lookups I am getting a random string in the group. Why 
is test appended with '-grxx9-s0'?


[@~]$ dig +short @192.168.10.14 server.temp.test.marathon.mesos
192.168.10.151

[@~]$ dig +short @192.168.10.14 _server.temp.test._tcp.marathon.mesos
SRV
0 1 31682 server.temp.test-grxx9-s0.marathon.mesos.







random string in task groups hostname

2020-07-28 Thread Marc Roos



I cannot remember seeing this before and wondered if this is common and 
intended. In srv lookups I am getting a random string in the group. Why 
is test appended with '-grxx9-s0'?


[@~]$ dig +short @192.168.10.14 server.temp.test.marathon.mesos
192.168.10.151

[@~]$ dig +short @192.168.10.14 _server.temp.test._tcp.marathon.mesos 
SRV
0 1 31682 server.temp.test-grxx9-s0.marathon.mesos.





RE: getting correct metrics port from SRV records.

2020-07-27 Thread Marc Roos
 
Oops ;)

[@test2 image-synapse]$ dig +short @192.168.10.14 
_metrics._synapse.dev._tcp.marathon.mesos SRV
0 1 31032 synapse.dev-nppzf-s0.marathon.mesos.



-Original Message-
To: user
Subject: getting correct metrics port from SRV records.


Is there a way to identify the correct port via dns? I have created a 
task with two ports[1]. But a dns srv query does not show anything 
different than the port number. How can I identify the correct port? 
Mesos-master tasks endpoint[3] shows the port names, is there a way to 
get these from dns?


[1]
"networks": [ { "mode": "host"} ],
"portDefinitions": [{"port": 0, "name": "https",  "protocol": "tcp"},
{"port": 0, "name": "metrics",  "protocol": "tcp"}]


[2]
[@test2 image-synapse]$ dig +short @192.168.10.14 
_synapse.dev._tcp.marathon.mesos SRV
0 1 31031 synapse.dev-nppzf-s0.marathon.mesos.
0 1 31032 synapse.dev-nppzf-s0.marathon.mesos.


[3]
mesos-master /tasks/

  "discovery": {
"visibility": "FRAMEWORK",
"name": "synapse.dev",
"ports": {
  "ports": [
{
  "number": 31031,
  "name": "https",
  "protocol": "tcp"
},
{
  "number": 31032,
  "name": "metrics",
  "protocol": "tcp"
}
  ]
}
  },





getting correct metrics port from SRV records.

2020-07-27 Thread Marc Roos


Is there a way to identify the correct port via dns? I have created a 
task with two ports[1]. But a dns srv query does not show anything 
different than the port number. How can I identify the correct port? 
Mesos-master tasks endpoint[3] shows the port names, is there a way to 
get these from dns?


[1]
"networks": [ { "mode": "host"} ],
"portDefinitions": [{"port": 0, "name": "https",  "protocol": "tcp"}, 
{"port": 0, "name": "metrics",  "protocol": "tcp"}]


[2]
[@test2 image-synapse]$ dig +short @192.168.10.14 
_synapse.dev._tcp.marathon.mesos SRV
0 1 31031 synapse.dev-nppzf-s0.marathon.mesos.
0 1 31032 synapse.dev-nppzf-s0.marathon.mesos.


[3]
mesos-master /tasks/

  "discovery": {
"visibility": "FRAMEWORK",
"name": "synapse.dev",
"ports": {
  "ports": [
{
  "number": 31031,
  "name": "https",
  "protocol": "tcp"
},
{
  "number": 31032,
  "name": "metrics",
  "protocol": "tcp"
}
  ]
}
  },



RE: fyi: mesos-dns is not registering all ip addresses

2020-07-27 Thread Marc Roos


Hi Alex,

My config.json is quite similar, but has "IPSources": ["netinfo", 
"mesos", "host"].

You will only run into this issue when you have multihomed tasks, i.e. 
tasks with two or more network interfaces (eth0, eth1, etc.).
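
For reference, the only difference from the config.json below is the 
IPSources line; the sources are tried in order, so putting "netinfo" 
first should prefer the addresses from the task's NetworkInfo (this 
ordering effect is my understanding of mesos-dns, not something 
verified in this thread):

```json
{
  "IPSources": ["netinfo", "mesos", "host"]
}
```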




-Original Message-
From: Alex Evonosky [mailto:alex.evono...@gmail.com] 
Sent: maandag 27 juli 2020 14:36
To: user@mesos.apache.org
Subject: Re: fyi: mesos-dns is not registering all ip addresses

thank you.

We have been running mesos-dns for years now without any issues.  The 
docker apps spin up on marathon and automatically gets picked up by 
mesos-dns...

This is our config.json:


{
  "zk": "zk://10.10.10.51:2181,10.10.10.52:2181,10.10.10.53:2181/mesos",
  "masters": ["10.10.10.51:5050", "10.10.10.52:5050", 
"10.10.10.53:5050"],
  "refreshSeconds": 3,
  "ttl": 3,
  "domain": "mesos",
  "port": 53,
  "resolvers": ["10.10.10.88", "10.10.10.86"],
  "timeout": 3,
  "httpon": true,
  "dnson": true,
  "httpport": 8123,
  "externalon": true,
  "listener": "0.0.0.0",
  "SOAMname": "ns1.mesos",
  "SOARname": "root.ns1.mesos",
  "SOARefresh": 5,
  "SOARetry":   600,
  "SOAExpire":  86400,
  "SOAMinttl": 5,
  "IPSources":["mesos", "host"]
}




we just have our main DNS resolvers have a zone  "mesos.marathon" and 
forwards the request to this cluster...



On Mon, Jul 27, 2020 at 3:56 AM Marc Roos  
wrote:




I am not sure if mesos-dns is discontinued, but for those still 
using 
it: in some cases it does not register all task ip addresses.

The default[2] works, but if you have this setup[1] it will only 
register one ip address, 192.168.122.140, and not the second. I 
filed an issue a 
year ago or so[3].



[3]
https://github.com/mesosphere/mesos-dns/issues/54145
https://issues.apache.org/jira/browse/MESOS-10164

[1]
"network_infos": [
  {
"ip_addresses": [
  {
"protocol": "IPv4",
"ip_address": "192.168.122.140"
  }
]
  },
  {
"ip_addresses": [
  {
"protocol": "IPv4",
"ip_address": "192.168.10.17"
  }
],
  }
]


[2]
"network_infos": [
  {
"ip_addresses": [
  {
"protocol": "IPv4",
"ip_address": "12.0.1.2"
  },
  {
"protocol": "IPv6",
"ip_address": "fd01:b::1:8000:2"
  }
],
  }
]







fyi: mesos-dns is not registering all ip addresses

2020-07-27 Thread Marc Roos



I am not sure if mesos-dns is discontinued, but for those still using 
it: in some cases it does not register all task ip addresses.

The default[2] works, but if you have this setup[1] it will only 
register one ip address, 192.168.122.140, and not the second. I filed an 
issue a year ago or so[3].



[3]
https://github.com/mesosphere/mesos-dns/issues/54145
https://issues.apache.org/jira/browse/MESOS-10164

[1]
"network_infos": [
  {
"ip_addresses": [
  {
"protocol": "IPv4",
"ip_address": "192.168.122.140"
  }
]
  },
  {
"ip_addresses": [
  {
"protocol": "IPv4",
"ip_address": "192.168.10.17"
  }
],
  }
]


[2]
"network_infos": [
  {
"ip_addresses": [
  {
"protocol": "IPv4",
"ip_address": "12.0.1.2"
  },
  {
"protocol": "IPv6",
"ip_address": "fd01:b::1:8000:2"
  }
],
  }
]




Mesos syslog logging to error level instead of info?

2020-07-24 Thread Marc Roos
 

I have my mesos test cluster on again, and mesos-master log messages 
end up in the wrong logs. I think mesos is not logging at the correct 
level/facility (using mesos-1.10.0-2.0.1.el7.x86_64).

E.g., I am getting these at level error:

Jul 24 12:25:16 m01 mesos-master[28922]: I0724 12:25:16.854624 28955 
master.cpp:8889] Performing explicit task state reconciliation for 1 
tasks of framework 43d5a67d-8c4e-496e-a108-5cfeb10b8967- (marathon) 
at scheduler-a9897343-98ee-4c31-a715-1b5e96e296bb@192.168.10.22:41009
Jul 24 12:25:20 m01 mesos-master[28922]: I0724 12:25:20.557858 28957 
authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint 
'/metrics/snapshot'
Jul 24 12:25:24 m01 mesos-master[28922]: I0724 12:25:24.738281 28957 
authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint 
'/metrics/snapshot'
Jul 24 12:25:26 m01 mesos-master[28922]: I0724 12:25:26.547469 28958 
authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint 
'/metrics/snapshot'
Jul 24 12:25:26 m01 mesos-master[28922]: I0724 12:25:26.554080 28961 
http.cpp:1436] HTTP GET for /master/state?jsonp=angular.callbacks._fmv 
from 192.168.10.219:49885 with User-Agent='Mozilla/5.0 (Windows NT 6.1; 
Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
Jul 24 12:25:26 m01 mesos-master[28922]: I0724 12:25:26.556207 28956 
http.cpp:1453] HTTP GET for /master/state?jsonp=angular.callbacks._fmv 
from 192.168.10.219:49885: '200 OK' after 2.46784ms
Jul 24 12:25:26 m01 mesos-master[28922]: I0724 12:25:26.582295 28955 
http.cpp:1436] HTTP GET for 
/master/maintenance/schedule?jsonp=angular.callbacks._fmw from 
192.168.10.219:63372 with User-Agent='Mozilla/5.0 (Windows NT 6.1; 
Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
Jul 24 12:25:30 m01 mesos-master[28922]: I0724 12:25:30.635844 28955 
authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint 
'/metrics/snapshot'
Jul 24 12:25:31 m01 mesos-master[28922]: I0724 12:25:31.874604 28955 
master.cpp:8889] Performing explicit task state reconciliation for 1 
tasks of framework 43d5a67d-8c4e-496e-a108-5cfeb10b8967- (marathon) 
at scheduler-a9897343-98ee-4c31-a715-1b5e96e296bb@192.168.10.22:41009
Jul 24 12:25:34 m01 mesos-master[28922]: I0724 12:25:34.816028 28958 
authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint 
'/metrics/snapshot'
Jul 24 12:25:36 m01 mesos-master[28922]: I0724 12:25:36.625381 28955 
authorization.cpp:136] Authorizing principal 'ANY' to GET the endpoint 
'/metrics/snapshot'
Jul 24 12:25:36 m01 mesos-master[28922]: I0724 12:25:36.632581 28956 
http.cpp:1436] HTTP GET for /master/state?jsonp=angular.callbacks._fn0 
from 192.168.10.219:49885 with User-Agent='Mozilla/5.0 (Windows NT 6.1; 
Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
Jul 24 12:25:36 m01 mesos-master[28922]: I0724 12:25:36.634801 28959 
http.cpp:1453] HTTP GET for /master/state?jsonp=angular.callbacks._fn0 
from 192.168.10.219:49885: '200 OK' after 2.55488ms
Jul 24 12:25:36 m01 mesos-master[28922]: I0724 12:25:36.687845 28958 
http.cpp:1436] HTTP GET for 
/master/maintenance/schedule?jsonp=angular.callbacks._fn1 from 
192.168.10.219:63372 with User-Agent='Mozilla/5.0 (Windows NT 6.1; 
Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
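
These are glog INFO lines (the leading I in I0724 ...), so if they end 
up in an error-level log, one workaround sketch is to reroute them by 
message prefix before they reach the error file. This assumes rsyslog; 
the file path and the idea of filtering on the glog severity prefix are 
assumptions, not a documented Mesos fix:

```
# /etc/rsyslog.d/30-mesos.conf (hypothetical path) -- route glog
# INFO/WARNING lines from mesos-master into their own file.
if $programname == "mesos-master" and re_match($msg, "^ ?[IW][0-9]{4}") then {
    action(type="omfile" file="/var/log/mesos/mesos-master.log")
    stop
}
```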


RE: Advice on alternative for marathon framework

2020-07-15 Thread Marc Roos
 

Thanks Tomek, have it running, giving it a try.


-Original Message-
To: user
Subject: Re: Advice on alternative for marathon framework

You can try https://github.com/HubSpot/Singularity
Aurora was moved to attic <https://attic.apache.org/> 


śr., 15 lip 2020 o 16:29 Marc Roos  
napisał(a):




I am having problems[1] getting marathon to run since march (can 
only 
run 1.7) and the only emails I receive from d2iq is how to rate 
their 
support. I wonder if this Marathon is still best to be used with 
mesos. 
I have aurora running, but it looks to have less options.

What I like about the Marathon framework is of course the web 
interface 
and some plugins that allowed me to use capabilities. I know I 
should/could launch application directly in mesos via the command 
line. 
But I am just starting with the mesos and I prefer now to have a 
gui.

Can anyone advise on a good alternative to marathon?


[1]
https://jira.d2iq.com/browse/MARATHON-8729
https://github.com/mesosphere/marathon/issues/7136





Advice on alternative for marathon framework

2020-07-15 Thread Marc Roos



I am having problems[1] getting marathon to run since march (can only 
run 1.7) and the only emails I receive from d2iq is how to rate their 
support. I wonder if this Marathon is still best to be used with mesos. 
I have aurora running, but it looks to have less options.

What I like about the Marathon framework is of course the web interface 
and some plugins that allowed me to use capabilities. I know I 
should/could launch application directly in mesos via the command line. 
But I am just starting with the mesos and I prefer now to have a gui.

Can anyone advise on a good alternative to marathon?


[1]
https://jira.d2iq.com/browse/MARATHON-8729
https://github.com/mesosphere/marathon/issues/7136


problems running marathon >=1.8 on mesos

2020-06-07 Thread Marc Roos


I am cross-posting this to mesos-users, hoping someone has come across 
this issue and can help me resolve it. There are several JIRA issues 
open with similar symptoms.


All of a sudden I am having problems with the marathon ui getting stuck 
at 'loading', and endpoints like http://m01.local:8081/v2/info are not 
responding (http://m01.local:8081/ping). I have now downgraded the test 
cluster to one node, running only mesos-master, zookeeper and marathon, 
cleaning the /var/lib/zookeeper and /var/lib/mesos directories between 
tests. I have also removed many of the configuration options I had, 
like ssl etc.

I am only able to run marathon-1.7.216-9e2a9b579; 
marathon-1.8.222-86475ddac and marathon-1.10.17-c427ce965 both have the 
above-mentioned errors/problems.

I have been comparing the marathon 1.7 and marathon 1.8 logs, and this 
is what I have noticed: quite a few log statements are missing between 
'All services up and running. 
(mesosphere.marathon.MarathonApp:main' and 'akka://marathon/deadLetters' 
in the 1.8 log.

Has anyone had something similar?


[@mesos-master]# rpm -qa  | grep java
python-javapackages-3.4.1-11.el7.noarch
tzdata-java-2020a-1.el7.noarch
java-1.8.0-openjdk-headless-1.8.0.252.b09-2.el7_8.x86_64
javapackages-tools-3.4.1-11.el7.noarch

[@mesos-master]# uname -a
Linux m01.local 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 
UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[@mesos-master]# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)





marathon 1.8 (unresponsive)
===
Jun  7 17:40:59 m01 marathon: [2020-06-07 17:40:59,696] INFO  All 
services up and running. (mesosphere.marathon.MarathonApp:main)
Jun  7 17:41:13 m01 marathon: [2020-06-07 17:41:13,833] INFO  initiate 
task reconciliation 
(mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-
dispatcher-9)
Jun  7 17:41:13 m01 marathon: [2020-06-07 17:41:13,854] INFO  Requesting 
task reconciliation with the Mesos master 
(mesosphere.marathon.SchedulerActions:scheduler-actions-thread-0)
Jun  7 17:41:13 m01 mesos-master[11203]: I0607 17:41:13.858621 11227 
master.cpp:8846] Performing implicit task state reconciliation for 
framework f5d67e06-6600-4fb9-94dc-a878be2563be- (marathon) at 
scheduler-6d98d1e0-a7d2-4517-a0ce-5819a36414c9@192.168.10.151:36941
Jun  7 17:41:13 m01 marathon: [2020-06-07 17:41:13,864] INFO  task 
reconciliation has finished 
(mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-
dispatcher-4)

Jun  7 17:41:13 m01 marathon: [2020-06-07 17:41:13,879] INFO  Message 
[mesosphere.marathon.MarathonSchedulerActor$TasksReconciled$] from 
Actor[akka://marathon/user/MarathonScheduler/$a#1746491390] to 
Actor[akka://marathon/deadLetters] was not delivered. [1] dead letters 
encountered. If this is not an expected behavior, then 
[Actor[akka://marathon/deadLetters]] may have terminated unexpectedly, 
This logging can be turned off or adjusted with configuration settings 
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 
(akka.actor.DeadLetterActorRef:marathon-akka.actor.default-dispatcher-7)
Jun  7 17:41:13 m01 marathon: [2020-06-07 17:41:13,910] INFO  Prompting 
Mesos for a heartbeat via explicit task reconciliation 
(mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marath
on-akka.actor.default-dispatcher-7)
Jun  7 17:41:13 m01 mesos-master[11203]: I0607 17:41:13.914615 11228 
master.cpp:8889] Performing explicit task state reconciliation for 1 
tasks of framework f5d67e06-6600-4fb9-94dc-a878be2563be- (marathon) 
at scheduler-6d98d1e0-a7d2-4517-a0ce-5819a36414c9@192.168.10.151:36941
Jun  7 17:41:13 m01 marathon: [2020-06-07 17:41:13,924] INFO  Received 
fake heartbeat task-status update 
(mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-13)
Jun  7 17:41:28 m01 marathon: [2020-06-07 17:41:28,939] INFO  Prompting 
Mesos for a heartbeat via explicit task reconciliation 
(mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marath
on-akka.actor.default-dispatcher-4)
Jun  7 17:41:28 m01 mesos-master[11203]: I0607 17:41:28.946494 11229 
master.cpp:8889] Performing explicit task state reconciliation for 1 
tasks of framework f5d67e06-6600-4fb9-94dc-a878be2563be- (marathon) 
at scheduler-6d98d1e0-a7d2-4517-a0ce-5819a36414c9@192.168.10.151:36941
Jun  7 17:41:28 m01 marathon: [2020-06-07 17:41:28,950] INFO  Received 
fake heartbeat task-status update 
(mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-14)


marathon 1.7 (ok)
=
Jun  7 17:37:02 m01 marathon: [2020-06-07 17:37:02,681] INFO  All 
services up and running. (mesosphere.marathon.MarathonApp:main)
Jun  7 17:37:06 m01 marathon: [2020-06-07 17:37:06,222] INFO  Received 
TimedCheck 
(mesosphere.marathon.core.flow.impl.ReviveOffersActor:marathon-akka.acto
r.default-dispatcher-8)
Jun  7 17:37:06 m01 marathon: [2020-06-07 17:37:06,228] INFO  => revive 
offers NOW, canceling 

RE: No offers are being made -- how to debug Mesos?

2020-06-06 Thread Marc Roos
 

 
Have you already put these on debug?

[@ ]# cat /etc/mesos-master/logging_level
WARNING
[@ ]# cat /etc/mesos-slave/logging_level
WARNING




-Original Message-
From: Benjamin Wulff [mailto:benjamin.wulff...@ieee.org] 
Sent: zaterdag 6 juni 2020 13:36
To: user@mesos.apache.org
Subject: No offers are being made -- how to debug Mesos?

Hi all,

I’m in the process of setting up my first Mesos cluster with 1x master 
and 3x slaves on CentOS 8.

So far I have set up Zookeeper and Mesos-master on the master and 
Mesos-slave on one of the compute nodes. Mesos-master communicates with 
ZK and becomes leader. Then I started mesos-slave on the compute node 
and can see in the log that it registers at the master with the correct 
resources reported. The agent and its resources are also displayed in 
the web UI of the master, as is the framework that I want to use.

The crux is that no tasks I schedule in the framework are executed. And 
I suppose this is because the framework never receives an offer. I can 
see in the web UI that no offers are made and that all resources remain 
idle.

Now, I’m new to Mesos and I don’t really have an idea how to debug my 
setup at this point. 

There is a page called 'Debugging with the new CLI' in the 
documentation, but it only explains how to configure the CLI command. 
Any directions on how to debug my situation in general, or on how to 
use the CLI for debugging, would be highly welcome! :)

Thanks and best regards,
Ben





RE: Subject: [VOTE] Release Apache Mesos 1.10.0 (rc1)

2020-05-28 Thread Marc Roos
 
 * ability for an executor to communicate with an agent via Unix domain 
socket instead of TCP

I think this will solve my problem with tasks running on a different 
ip, which I was handling via a local route. Somehow that route was not 
being used by mesos, even though pings into the network namespace were 
ok. 
 




-Original Message-
From: Qian Zhang [mailto:zhq527...@gmail.com] 
Sent: donderdag 28 mei 2020 2:57
To: user
Cc: dev
Subject: Re: Subject: [VOTE] Release Apache Mesos 1.10.0 (rc1)

+1 (binding)


Regards,
Qian Zhang


On Thu, May 28, 2020 at 12:56 AM Benjamin Mahler  
wrote:


+1 (binding)


On Mon, May 18, 2020 at 4:36 PM Andrei Sekretenko 
 wrote:


Hi all,

Please vote on releasing the following candidate as Apache 
Mesos 1.10.0.

1.10.0 includes the following major improvements:



* support for resource bursting (setting task resource limits 
separately from requests) on Linux

* ability for an executor to communicate with an agent via 
Unix domain socket instead of TCP

* ability for operators to modify reservations via the 
RESERVE_RESOURCES master API call

* performance improvements of V1 operator API read-only calls 
bringing them on par with V0 HTTP endpoints

* ability for a scheduler to expect that effects of calls sent 
through the same connection will not be reordered/interleaved by master


NOTE: 1.10.0 includes a breaking change for custom authorizer 
modules.
Now, `ObjectApprover`s may be stored by Mesos indefinitely and 
must be kept up-to-date by an authorizer throughout their lifetime.

This allowed for several bugfixes and performance 
improvements.

The CHANGELOG for the release is available at:

https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.10.0-rc1




The candidate for Mesos 1.10.0 release is available at:

https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz

The tag to be voted on is 1.10.0-rc1:

https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.10.0-rc1

The SHA512 checksum of the tarball can be found at:

https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz.sha512

The signature of the tarball can be found at:

https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:

https://repository.apache.org/content/repositories/orgapachemesos-1259

Please vote on releasing this package as Apache Mesos 1.10.0!

The vote is open until Fri, May 21, 19:00 CEST  and passes if 
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.10.0
[ ] -1 Do not release this package because ...

Thanks,

Andrei Sekretenko




RE: Found no roles suitable for revive repetition.

2020-03-18 Thread Marc Roos


Hi Benjamin,

Do you have the subscribe email address for the marathon mailing list?

Thanks,
Marc

 

-Original Message-
From: Benjamin Mahler [mailto:bmah...@apache.org] 
Sent: 18 March 2020 18:32
To: user
Subject: Re: Found no roles suitable for revive repetition.

Hi Marc, can you contact the marathon mailing list or slack channel.

Also, if there is a question here or some more context, please include 
that so they know what you need help with.

On Wed, Mar 18, 2020 at 9:46 AM Marc Roos  
wrote:




Marathon is stuck on 'loading applications'


Mar 18 14:43:48 m01 marathon: [2020-03-18 14:43:48,646] INFO  Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-30)
Mar 18 14:43:53 m01 marathon: [2020-03-18 14:43:53,321] INFO  Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-9)
Mar 18 14:43:58 m01 marathon: [2020-03-18 14:43:58,324] INFO  Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-6)
Mar 18 14:44:03 m01 marathon: [2020-03-18 14:44:03,321] INFO  Found no roles suitable for revive repetition. (mesosphe





registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime.

2020-03-18 Thread Marc Roos


I am having these warnings; they were reported on Jira a long time ago. 
How do I fix them?



der mesosphere.marathon.api.v2.PodsResource will be ignored.  
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService 
STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,785] WARN  A provider 
mesosphere.marathon.api.v2.AppsResource registered in SERVER runtime 
does not implement any provider interfaces applicable in the SERVER 
runtime. Due to constraint configuration problems the provider 
mesosphere.marathon.api.v2.AppsResource will be ignored.  
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService 
STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,787] WARN  A provider 
mesosphere.marathon.api.v2.DeploymentsResource registered in SERVER 
runtime does not implement any provider interfaces applicable in the 
SERVER runtime. Due to constraint configuration problems the provider 
mesosphere.marathon.api.v2.DeploymentsResource will be ignored.  
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService 
STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,789] WARN  A provider 
mesosphere.marathon.api.v2.TasksResource registered in SERVER runtime 
does not implement any provider interfaces applicable in the SERVER 
runtime. Due to constraint configuration problems the provider 
mesosphere.marathon.api.v2.TasksResource will be ignored.  
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService 
STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,792] WARN  A provider 
mesosphere.marathon.api.v2.QueueResource registered in SERVER runtime 
does not implement any provider interfaces applicable in the SERVER 
runtime. Due to constraint configuration problems the provider 
mesosphere.marathon.api.v2.QueueResource will be ignored.  
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService 
STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,798] WARN  A provider 
mesosphere.marathon.api.v2.InfoResource registered in SERVER runtime 
does not implement any provider interfaces applicable in the SERVER 
runtime. Due to constraint configuration problems the provider 
mesosphere.marathon.api.v2.InfoResource will be ignored.  
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService 
STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,800] WARN  A provider 
mesosphere.marathon.api.v2.LeaderResource registered in SERVER runtime 
does not implement any provider interfaces applicable in the SERVER 
runtime. Due to constraint configuration problems the provider 
mesosphere.marathon.api.v2.LeaderResource will be ignored.  
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService 
STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,803] WARN  A provider 
mesosphere.marathon.api.v2.PluginsResource registered in SERVER runtime 
does not implement any provider interfaces applicable in the SERVER 
runtime. Due to constraint configuration problems the provider 
mesosphere.marathon.api.v2.PluginsResource will be ignored.  
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService 
STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,805] WARN  A provider 
mesosphere.marathon.api.SystemResource registered in SERVER runtime does 
not implement any provider interfaces applicable in the SERVER runtime. 
Due to constraint configuration problems the provider 
mesosphere.marathon.api.SystemResource will be ignored.  
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService 
STARTING)
Mar 18 16:38:21 m01 marathon: [2020-03-18 16:38:21,805] WARN  A provider 
mesosphere.marathon.api.v2.GroupsResource registered in SERVER runtime 
does not implement any provider interfaces applicable in the SERVER 
runtime. Due to constraint configuration problems the provider 
mesosphere.marathon.api.v2.GroupsResource will be ignored.  
(org.glassfish.jersey.internal.inject.Providers:MarathonHttpService 
STARTING)


Found no roles suitable for revive repetition.

2020-03-18 Thread Marc Roos



Marathon is stuck on 'loading applications'


Mar 18 14:43:48 m01 marathon: [2020-03-18 14:43:48,646] INFO  Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-30)
Mar 18 14:43:53 m01 marathon: [2020-03-18 14:43:53,321] INFO  Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-9)
Mar 18 14:43:58 m01 marathon: [2020-03-18 14:43:58,324] INFO  Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-6)
Mar 18 14:44:03 m01 marathon: [2020-03-18 14:44:03,321] INFO  Found no roles suitable for revive repetition. (mesosphe


Failed to send 'mesos.internal.FrameworkErrorMessage'

2020-02-22 Thread Marc Roos


I am getting these on a test setup, where marathon and mesos-master are 
running on the same node and iptables is not even configured.

W0222 23:03:48.829741  1112 process.cpp:1917] Failed to send 
'mesos.internal.FrameworkErrorMessage' to '192.168.10.151:35530', 
connect: Failed connect, connection error: Connection refused
W0222 23:03:48.831212  1112 process.cpp:1917] Failed to send 
'mesos.internal.FrameworkErrorMessage' to '192.168.10.151:35530', 
connect: Failed to connect to 192.168.10.151:35530: Connection refused
W0222 23:05:41.584399  1112 process.cpp:1917] Failed to send 
'mesos.internal.FrameworkErrorMessage' to '192.168.10.151:42877', 
connect: Failed connect, connection error: Connection refused
W0222 23:05:41.584664  1112 process.cpp:1917] Failed to send 
'mesos.internal.FrameworkErrorMessage' to '192.168.10.151:42877', 
connect: Failed to connect to 192.168.10.151:42877: Connection refused

Marathon logs these:
Feb 22 23:30:54 m01 marathon: [2020-02-22 23:30:54,471] INFO  Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-5)
Feb 22 23:30:59 m01 marathon: [2020-02-22 23:30:59,482] INFO  Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-12)
Feb 22 23:31:04 m01 marathon: [2020-02-22 23:31:04,342] INFO  Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-2)
Feb 22 23:31:04 m01 marathon: [2020-02-22 23:31:04,347] INFO  Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-130)
Feb 22 23:31:04 m01 marathon: [2020-02-22 23:31:04,471] INFO  Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-9)


RE: cni iptables best practice

2020-02-05 Thread Marc Roos


What if I pay someone on your team privately, maybe someone wants to do 
a bit of work at weekends? Maybe you can propose this to the members of 
your team who have worked on this in the past?

 

-Original Message-
Sent: 05 February 2020 16:51
To: user
Cc: zhq527725; support
Subject: Re: cni iptables best practice

Hi Marc,

CNI3 support is not on Mesosphere's near term roadmap given our other 
priorities. But if there's anyone in the community willing to work with 
you to develop it, as the Apache Mesos project, we'll be happy to accept 
the contribution (of course assuming it adheres to the project's quality 
standards).

On Wed, Feb 5, 2020 at 8:57 AM Marc Roos  
wrote:


 
Is this possible? I would like to start using mesos in production to be 
honest. 



-Original Message-
Sent: 30 January 2020 18:46
To: Qian Zhang
Cc: user; supp...@mesosphere.com
Subject: RE: cni iptables best practice


What about when I fund this? How much would it cost? Otherwise I need to 
spend time/money on making a custom cni plugin that is not even 
operating via standards.

PS. I do not see the point of getting some external programmer, 
that 
needs to acquire specific knowledge on this subject first. 



-Original Message-
Cc: user
Subject: Re: cni iptables best practice

I do not think we plan to do it in short term.


Regards,
Qian Zhang


On Tue, Jan 28, 2020 at 1:54 AM Marc Roos 
 
wrote:



 Hi Qian, 

Any idea on when this cni 0.3 is going to be implemented? I saw the 
issue priority is Major, can't remember if it was always like this. But 
looks promising.

Regards,
Marc




-Original Message-
Sent: 14 December 2019 09:46
To: user
Subject: RE: cni iptables best practice


Yes, yes I know, disaster. I wondered how, or even if, people are using 
iptables with tasks. Even in an internal environment it would be nice to 
use them, no? 









RE: cni iptables best practice

2020-02-05 Thread Marc Roos
 
Is this possible? I would like to start using mesos in production to be 
honest. 



-Original Message-
Sent: 30 January 2020 18:46
To: Qian Zhang
Cc: user; supp...@mesosphere.com
Subject: RE: cni iptables best practice

 
What about when I fund this? How much would it cost? Otherwise I need to 
spend time/money on making a custom cni plugin that is not even 
operating via standards.

PS. I do not see the point of getting some external programmer, that 
needs to acquire specific knowledge on this subject first. 



-Original Message-
Cc: user
Subject: Re: cni iptables best practice

I do not think we plan to do it in short term.


Regards,
Qian Zhang


On Tue, Jan 28, 2020 at 1:54 AM Marc Roos  
wrote:



 Hi Qian, 

Any idea on when this cni 0.3 is going to be implemented? I saw the 
issue priority is Major, can't remember if it was always like this. But 
looks promising.

Regards,
Marc




-Original Message-
Sent: 14 December 2019 09:46
To: user
Subject: RE: cni iptables best practice


Yes, yes I know, disaster. I wondered how, or even if, people are using 
iptables with tasks. Even in an internal environment it would be nice to 
use them, no? 






Kill task, but not restarted

2020-02-02 Thread Marc Roos




Because the instance was not showing in the marathon gui, I killed a 
task with kill -KILL, assuming it would be restarted, yet it was not.

I think it has to do with these messages. Why am I even getting these, 
when I can just ping the hosts?

W0202 14:46:51.215673 359364 process.cpp:1480] Failed to link to 
'192.168.122.253:35071', connect: Failed connect: connection closed
W0202 14:46:51.217136 359364 process.cpp:1480] Failed to link to 
'192.168.122.95:41400', connect: Failed connect: connection closed
W0202 14:46:51.217594 359364 process.cpp:1480] Failed to link to 
'192.168.122.94:41974', connect: Failed connect: connection closed
W0202 14:46:51.218037 359364 process.cpp:1480] Failed to link to 
'192.168.122.13:33447', connect: Failed connect: connection closed



[@mesos]# ping -c 2 192.168.122.95
PING 192.168.122.95 (192.168.122.95) 56(84) bytes of data.
64 bytes from 192.168.122.95: icmp_seq=1 ttl=64 time=0.062 ms
64 bytes from 192.168.122.95: icmp_seq=2 ttl=64 time=0.051 ms

[@mesos]# ping -c 2 192.168.122.94
PING 192.168.122.94 (192.168.122.94) 56(84) bytes of data.
64 bytes from 192.168.122.94: icmp_seq=1 ttl=64 time=0.053 ms
64 bytes from 192.168.122.94: icmp_seq=2 ttl=64 time=0.045 ms

[@mesos]# ping -c 2 192.168.122.13
PING 192.168.122.13 (192.168.122.13) 56(84) bytes of data.
64 bytes from 192.168.122.13: icmp_seq=1 ttl=64 time=0.069 ms
64 bytes from 192.168.122.13: icmp_seq=2 ttl=64 time=0.051 ms



RE: cni iptables best practice

2020-01-30 Thread Marc Roos
 
What about when I fund this? How much would it cost? Otherwise I need to 
spend time/money on making a custom cni plugin that is not even 
operating via standards.

PS. I do not see the point of getting some external programmer, that 
needs to acquire specific knowledge on this subject first. 



-Original Message-
Cc: user
Subject: Re: cni iptables best practice

I do not think we plan to do it in short term.


Regards,
Qian Zhang


On Tue, Jan 28, 2020 at 1:54 AM Marc Roos  
wrote:



 Hi Qian, 

Any idea on when this cni 0.3 is going to be implemented? I saw the 
issue priority is Major, can't remember if it was always like this. But 
looks promising.

Regards,
Marc




-Original Message-
Sent: 14 December 2019 09:46
To: user
Subject: RE: cni iptables best practice


Yes, yes I know, disaster. I wondered how, or even if, people are using 
iptables with tasks. Even in an internal environment it would be nice to 
use them, no? 





RE: cni iptables best practice

2020-01-27 Thread Marc Roos


 Hi Qian, 

Any idea on when this cni 0.3 is going to be implemented? I saw the 
issue priority is Major, can't remember if it was always like this. But 
looks promising.

Regards,
Marc




-Original Message-
Sent: 14 December 2019 09:46
To: user
Subject: RE: cni iptables best practice

 
Yes, yes I know, disaster. I wondered how, or even if, people are using 
iptables with tasks. Even in an internal environment it would be nice to 
use them, no? 



-Original Message-
To: user
Subject: Re: cni iptables best practice

You are right, we do not support CNI chaining plugin yet, and I think 
there is a ticket to trace it: 
https://issues.apache.org/jira/browse/MESOS-7079.


Regards,
Qian Zhang


On Sat, Dec 14, 2019 at 7:08 AM Marc Roos  
wrote:




Is anyone applying iptables rules in their cni networking, and how? I 
wrote an iptables chaining plugin but cannot use it because CNI 0.3.0 is 
still not supported in mesos 1.9. I wondered how this is done currently.














RE: cni iptables best practice

2019-12-14 Thread Marc Roos
 
Yes, yes I know, disaster. I wondered how, or even if, people are using 
iptables with tasks. Even in an internal environment it would be nice to 
use them, no? 



-Original Message-
To: user
Subject: Re: cni iptables best practice

You are right, we do not support CNI chaining plugin yet, and I think 
there is a ticket to trace it: 
https://issues.apache.org/jira/browse/MESOS-7079.


Regards,
Qian Zhang


On Sat, Dec 14, 2019 at 7:08 AM Marc Roos  
wrote:




Is anyone applying iptables rules in their cni networking, and how? I 
wrote an iptables chaining plugin but cannot use it because CNI 0.3.0 is 
still not supported in mesos 1.9. I wondered how this is done currently.












cni iptables best practice

2019-12-13 Thread Marc Roos



Is anyone applying iptables rules in their cni networking, and how? I 
wrote an iptables chaining plugin but cannot use it because CNI 0.3.0 is 
still not supported in mesos 1.9. I wondered how this is done currently.









Iptables

2019-11-03 Thread Marc Roos


How do I set iptables rules inside a container? I am getting these errors:

Fatal: can't open lock file /run/xtables.lock: Permission denied
Fatal: can't open lock file /run/xtables.lock: Permission denied
Fatal: can't open lock file /run/xtables.lock: Permission denied
Fatal: can't open lock file /run/xtables.lock: Permission denied
Fatal: can't open lock file /run/xtables.lock: Permission denied
Fatal: can't open lock file /run/xtables.lock: Permission denied
Fatal: can't open lock file /run/xtables.lock: Permission denied
Fatal: can't open lock file /run/xtables.lock: Permission denied
Fatal: can't open lock file /run/xtables.lock: Permission denied



Degraded performance container vs vm (-80% !!!)

2019-10-22 Thread Marc Roos


I still have degraded performance with mesos 1.9; any help sorting this 
out would be nice. It also makes me wonder whether others have bothered 
to test this. I am still testing with mesos and thus have a mostly 
default setup.

Previously when I opened this thread, there were questions about 
resource differences between the vm and the container. This is not the 
case; the vm has 1 vcpu allocated. To rule out memory issues I try to 
use the dns cache by constantly requesting the same 2 domains in 
files-2.tst.



[0] marathon task resources
  "cpus": 1,
  "mem": 300,


[1] memory usage vm
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 9671 named 20   0  289m  90m 2948 S  0.0 18.8  12:18.22 named


[2] testing vm
[@ test-dns]$ dnsperf -f inet -t 2 -s 192.168.10.10 -d files-2.tst -l 10
DNS Performance Testing Tool
Nominum Version 2.1.0.0

[Status] Command line: dnsperf -f inet -t 2 -s 192.168.10.10 -d 
files-2.tst -l 10
[Status] Sending queries (to 192.168.10.10)
[Status] Started at: Tue Oct 22 14:30:03 2019
[Status] Stopping after 10.00 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent: 116834
  Queries completed:116834 (100.00%)
  Queries lost: 0 (0.00%)

  Response codes:   NOERROR 116834 (100.00%)
  Average packet size:  request 27, response 111
  Run time (s): 10.011078
  Queries per second:   11670.471452

  Average Latency (s):  0.008367 (min 0.000778, max 0.020300)
  Latency StdDev (s):   0.001285


[3] testing container

[marc@os0 test-dns]$ dnsperf -f inet -t 2 -s 192.168.10.13 -d 
files-2.tst -l 10
DNS Performance Testing Tool
Nominum Version 2.1.0.0

[Status] Command line: dnsperf -f inet -t 2 -s 192.168.10.13 -d 
files-2.tst -l 10
[Status] Sending queries (to 192.168.10.13)
[Status] Started at: Tue Oct 22 14:29:48 2019
[Status] Stopping after 10.00 seconds
[Timeout] Query timed out: msg id 3
[Timeout] Query timed out: msg id 9
[Timeout] Query timed out: msg id 10
[Timeout] Query timed out: msg id 11
...
...
...
[Timeout] Query timed out: msg id 21251
[Timeout] Query timed out: msg id 21253
[Timeout] Query timed out: msg id 21260
[Timeout] Query timed out: msg id 21328
[Timeout] Query timed out: msg id 21344
[Timeout] Query timed out: msg id 21365
[Timeout] Query timed out: msg id 21390
[Status] Testing complete (time limit)

Statistics:

  Queries sent: 24770
  Queries completed:24275 (98.00%)
  Queries lost: 495 (2.00%)

  Response codes:   NOERROR 24275 (100.00%)
  Average packet size:  request 27, response 111
  Run time (s): 10.000185
  Queries per second:   2427.455092

  Average Latency (s):  0.000326 (min 0.000130, max 0.003435)
  Latency StdDev (s):   0.000139
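For reference, the "-80%" in the subject follows directly from the two throughput numbers quoted above:

```python
# Queries-per-second from the two dnsperf runs quoted above.
vm_qps = 11670.471452        # [2] testing vm
container_qps = 2427.455092  # [3] testing container

drop = 1 - container_qps / vm_qps
print(f"throughput drop: {drop:.0%}")  # roughly 79%, i.e. the ~-80% in the subject
```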


changing /etc/hosts in container

2019-10-21 Thread Marc Roos


What are my options for adding a host entry to /etc/hosts in a 
container not running as root?
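One workaround that does not require root, sketched below under assumptions: the container's libc is glibc, whose resolver honors the HOSTALIASES environment variable for unqualified host names, and the names used here are placeholders:

```python
# Sketch: add a host alias without touching /etc/hosts, via glibc's HOSTALIASES.
# Note: HOSTALIASES maps an alias to another *name* (not an IP), and only
# applies to unqualified lookups, so it is narrower than an /etc/hosts entry.
import os, tempfile

alias_file = os.path.join(tempfile.gettempdir(), "hostaliases")
with open(alias_file, "w") as f:
    f.write("myservice myservice.marathon.mesos\n")  # "alias canonical-name"

os.environ["HOSTALIASES"] = alias_file  # set before the process resolves names
```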







RE: Is chained cni networks supported in mesos 1.7

2019-10-21 Thread Marc Roos
Hi Gilbert,

How is it going with the chain implementation?

 Thanks,
Marc





-Original Message-
From: Gilbert Song [mailto:gilb...@apache.org] 
Sent: woensdag 14 augustus 2019 22:24
To: user
Subject: Re: Is chained cni networks supported in mesos 1.7

Are you interested in implementing the CNI chain support?

-Gilbert

On Wed, Jul 24, 2019 at 12:52 PM Marc Roos  
wrote:


 
Hmm, I guess I should not get my hopes up that this will be there soon?
[0]
https://issues.apache.org/jira/browse/MESOS-7178



-Original Message-
From: Jie Yu [mailto:yujie@gmail.com] 
Sent: woensdag 24 juli 2019 21:35
To: user
Subject: Re: Is chained cni networks supported in mesos 1.7

No, not yet

On Wed, Jul 24, 2019 at 12:27 PM Marc Roos 
 
wrote:







This error message of course:
E0724 21:19:17.852210  1160 cni.cpp:330] Failed to parse CNI network 
configuration file '/etc/mesos-cni/93-chain.conflist': Protobuf parse 
failed: Missing required fields: typ


-Original Message-
Subject: Is chained cni networks supported in mesos 1.7


I am getting this error, while I don not have problems 
using it 
with 
cnitool.

 cni.cpp:330] Failed to parse CNI network configuration 
file
'/etc/mesos-cni/93-chain-routing-overwrite.conflist.bak': 
Protobuf 
parse
failed: Missing required fields: type

[@ mesos-cni]# cat 93-chain.conflist
{
  "name": "test-chain",
  "plugins": [{
"type": "bridge",
"bridge": "test-chain0",
"isGateway": false,
"isDefaultGateway": false,
"ipMasq": false,
"ipam": {
"type": "host-local",
"subnet": "10.15.15.0/24"
}
},
{
  "type": "portmap",
  "capabilities": {"portMappings": true},
  "snat": false
}]
}


[@ mesos-cni]#  CNI_PATH="/usr/libexec/cni/"  
NETCONFPATH="/etc/mesos-cni" cnitool-0.5.2 add test-chain 
/var/run/netns/testing {
"ip4": {
"ip": "10.15.15.2/24",
"gateway": "10.15.15.1"
},
"dns": {}










RE: Mesos task example json

2019-10-14 Thread Marc Roos
 

Thanks Benjamin, I will bookmark these.




-Original Message-
To: user@mesos.apache.org
Subject: Re: Mesos task example json

Hi Marc,

> You also know how/where to put the capabilities? I am struggling with 
> that.


Have a look at the protobufs which define this API:

* `TaskInfo` which is used with `mesos-execute` is defined here, 
https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/include/mesos/mesos.proto#L2229-L2285,
* capabilities are passed via a task’s `container` field, 
https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/include/mesos/mesos.proto#L2239
 
which has a field `linux_info` whose structure is defined here, 
https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/include/mesos/mesos.proto#L3270-L3341
* in there you want to set `effective_capabilities` and/or 
`bounding_capabilities`; see the docs for their semantics and 
interaction with the agent configuration, e.g., 
https://mesos.apache.org/documentation/latest/isolators/linux-capabilities/#task-setup.

Most of the public Mesos APIs are defined in files under 
https://github.com/apache/mesos/tree/558829eb24f4ad636348497075bbc0428a4794a4/include/mesos,
 
either in protobuf form or as C++ header files. For questions like yours 
it often helps to work backwards from an interesting field to a 
structure (e.g., in this particular case: work out how `CapabilityInfo` 
is related to `TaskInfo`).


HTH,

Benjamin




Don't understand how to use mesos capabilities

2019-10-14 Thread Marc Roos




Don't understand how to use mesos capabilities as described here[0]


1. removed caps from ping with
setcap 'cap_net_raw=-p' /usr/bin/ping
2. linux/capabilities in the isolators, 
3. mesos-slave running as root, 
4. did not set effective_capabilities nor bounding_capabilities
5. Running kernel 3.10.0-957.27.2.el7.x86_64
6. Looks like the task json is correctly configured (output from tasks 
endpoint)
  },
  "container": {
"type": "MESOS",
"linux_info": {
  "effective_capabilities": {
"capabilities": [
  "NET_RAW"
]
  }
}

Yet when I run the task with the command "capsh --print ; ping -c 2 
localhost ; sleep 120", I get the capsh output below[1], yet ping fails 
with "ping: socket: Operation not permitted". 


[1]  
Current: = cap_net_raw+eip cap_net_admin,cap_syslog+i
Bounding set =cap_net_admin,cap_net_raw,cap_syslog
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=99(nobody)
gid=99(nobody)
groups=99(nobody)

Current: = cap_net_raw+eip
Bounding set =cap_net_raw
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=99(nobody)
gid=99(nobody)
groups=99(nobody)



[0]
http://mesos.apache.org/documentation/latest/isolators/linux-capabilities/
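When debugging cases like this, it can help to read the effective capability mask straight from the kernel inside the task. A minimal sketch (Linux-specific; the bit number is taken from linux/capability.h):

```python
# Sketch: decode the effective-capabilities bitmask of the current process
# from /proc/self/status (Linux only), e.g. to confirm CAP_NET_RAW is present.
CAP_NET_RAW = 13  # bit index, from <linux/capability.h>

def effective_caps() -> int:
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)  # hex bitmask
    raise RuntimeError("CapEff not found (non-Linux kernel?)")

has_net_raw = bool(effective_caps() & (1 << CAP_NET_RAW))
```

Running this in place of capsh inside the task would show whether the isolator actually granted cap_net_raw to the process's effective set.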




RE: Mesos task example json

2019-10-14 Thread Marc Roos
 
Hi Qian, 

Thanks! You also know how/where to put the capabilities? I am struggling 
with that.



-Original Message-
To: user
Subject: Re: Mesos task example json

Hi Marc,

Here is an example json that I use for testing:

{
  "name": "test",
  "task_id": {"value" : "test"},
  "agent_id": {"value" : ""},
  "resources": [
{"name": "cpus", "type": "SCALAR", "scalar": {"value": 1}},
{"name": "mem", "type": "SCALAR", "scalar": {"value": 128}}
  ],
  "command": {
"value": "sleep 10"
  },
  "container": {
"type": "MESOS",
"mesos": {
  "image": {
"type": "DOCKER",
"docker": {
  "name": "busybox"
}
  }
}
  }
}



Regards,
Qian Zhang


On Sat, Oct 12, 2019 at 6:26 AM Marc Roos  
wrote:




Is there some example json available with all options for use with 
'mesos-execute --task='






Mesos task example json

2019-10-11 Thread Marc Roos



Is there some example json available with all options for use with 
'mesos-execute --task='



mesos 1.9 should have mesos task not?

2019-10-11 Thread Marc Roos



[@~]# mesos help
Usage: mesos  [OPTIONS]

Available commands:
help
dns
daemon.sh
agent
start-cluster.sh
master
start-agents.sh
start-masters.sh
start-slaves.sh
stop-agents.sh
stop-cluster.sh
stop-masters.sh
stop-slaves.sh
tail
cat
execute
init-wrapper
local
log
ps
resolve
scp


mesos-1.9.0-2.0.1.el7.x86_64


NET_ADMIN permission equivalent for mesos

2019-10-10 Thread Marc Roos


I have a docker image that requires NET_ADMIN. I have found this[0] (for 
the docker containerizer?), but what is the syntax for the mesos 
containerizer?



[0]
{
  "cpus": 0.1,
  "mem": 50,
  "id": "/openvpn",
  "instances": 1,
  "container": {
"docker": {
  "image": "docker-registry.marathon.mesos:5000/openvpn",
  "network": "BRIDGE",
  "forcePullImage": true,
  "parameters": [{"key":"cap-add", "value":"NET_ADMIN"}],
  "portMappings": [{"containerPort": 1194, "servicePort": 1194}]
}
  },
  "dependencies": ["/mesos-dns", "/docker-registry"],
  "healthChecks": [{"protocol": "TCP"}]
}


Kernel module restrictions on launced task?

2019-10-08 Thread Marc Roos


Are there any restrictions on a launched task that could block access to 
ipsec in the kernel?

I am getting this in the launched task

Oct  8 16:05:19 c02 ipsec_starter[695921]: no netkey IPsec stack 
detected
Oct  8 16:05:19 c02 ipsec_starter[695921]: no KLIPS IPsec stack detected
Oct  8 16:05:19 c02 ipsec_starter[695921]: no known IPsec stack 
detected, ignoring!


While launching directly on the host seems to be ok.

Linux c04 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 
x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.7.1908 (Core)
mesos-1.9.0-2.0.1.el7.x86_64




RE: Task list node

2019-10-01 Thread Marc Roos
 
Yes, thanks, I managed to get them with this:

curl -s --user test:xxx
  --cacert /etc/pki/ca-trust/source/ca-test.crt
  -X GET https://m01.local:5050/state | jq '.frameworks[].tasks[] | 
select(.state=="TASK_RUNNING") | del(.statuses, .discovery, .container, 
.health_check) | "\(.name) \(.state) \(.slave_id)" '


-Original Message-
To: user
Subject: Re: Task list node

You can just mimic UI behaviour and use /state endpoint and filter it 
with jq.


Tue, 1 Oct 2019 at 13:56 Marc Roos  wrote:


 

Hmmm, if I do something like this[0] I get only 3 tasks, and the mesos 
gui on 5050 is showing all (I guess, at least more than three). Also if 
I grep the unfiltered json output for a task string, it does not find 
it.

[0]
curl -s --user test:xxx --cacert 
/etc/pki/ca-trust/source/ca-test.crt -X 
GET https://m01.local:5050/tasks  | jq '.tasks[] | 
select(.state=="TASK_RUNNING")' 

curl -s --user test:xxx --cacert 
/etc/pki/ca-trust/source/ca-test.crt -X 
GET https://m01.local:5050/master/tasks  | jq '.tasks[] | 
select(.state=="TASK_RUNNING") | del(.statuses, .discovery, 
.health_check, .container)  | "\(.name) \(.state) \(.slave_id)" '


-Original Message-
To: user
Subject: Re: Task list node

You can list them with the agent containers endpoint 
http://mesos.apache.org/documentation/latest/endpoints/slave/containers/
or with the master tasks endpoint, filtering them locally with jq 
http://mesos.apache.org/documentation/latest/endpoints/master/tasks/
    
Thu, 26 Sep 2019 at 22:09 Marc Roos  
wrote:



What would be the easiest way to list running tasks on a 
node/agent/slave?















RE: Task list node

2019-10-01 Thread Marc Roos
 

Hmmm, if I do something like this[0] I get only 3 tasks, and the mesos 
gui on 5050 is showing all (I guess, at least more than three). Also if 
I grep the unfiltered json output for a task string, it does not find it.

[0]
curl -s --user test:xxx --cacert /etc/pki/ca-trust/source/ca-test.crt -X 
GET https://m01.local:5050/tasks  | jq '.tasks[] | 
select(.state=="TASK_RUNNING")' 

curl -s --user test:xxx --cacert /etc/pki/ca-trust/source/ca-test.crt -X 
GET https://m01.local:5050/master/tasks  | jq '.tasks[] | 
select(.state=="TASK_RUNNING") | del(.statuses, .discovery, 
.health_check, .container)  | "\(.name) \(.state) \(.slave_id)" '


-Original Message-
To: user
Subject: Re: Task list node

You can list them with the agent containers endpoint 
http://mesos.apache.org/documentation/latest/endpoints/slave/containers/
or with the master tasks endpoint, filtering them locally with jq 
http://mesos.apache.org/documentation/latest/endpoints/master/tasks/

Thu, 26 Sep 2019 at 22:09 Marc Roos  
wrote:



What would be the easiest way to list running tasks on a 
node/agent/slave?












Maybe new feature/option for the health check

2019-09-30 Thread Marc Roos


I have a few tasks that take a while to start. Sendmail, e.g., is not 
too happy that you cannot set the hostname (in marathon) and then hits a 
timeout of 1 minute. I think there is something similar when starting 
openldap. If I enable a regular health check there, it will fail the 
task before it has finished launching. Maybe it would be interesting to 
add an option like this initDelay?


{
  "path": "/api/health",
  "portIndex": 0,
  "protocol": "MESOS_HTTP",
  "initDelay": 60,  < 
  "gracePeriodSeconds": 300,
  "intervalSeconds": 60,
  "timeoutSeconds": 20,
  "maxConsecutiveFailures": 3
}
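A minimal sketch of the proposed semantics (an assumption about intent: no health checking at all until initDelay seconds after launch, after which the existing gracePeriod/interval logic would apply):

```python
# Sketch of the suggested "initDelay": suppress health checks entirely for
# the first init_delay seconds after a task is launched.
def should_health_check(now: float, launched_at: float, init_delay: float) -> bool:
    return (now - launched_at) >= init_delay

assert not should_health_check(now=30.0, launched_at=0.0, init_delay=60.0)
assert should_health_check(now=61.0, launched_at=0.0, init_delay=60.0)
```

Unlike gracePeriodSeconds, which tolerates failing checks during the grace window, this would avoid sending any probes to a service that is still binding its ports.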



Problems with tasks and cni networking after upgrading from 1.8 to 1.9

2019-09-28 Thread Marc Roos


Looks like my tasks that have dual networking, a gateway and a 
cni_args-assigned ip address, are not able to start anymore on mesos 
1.9. During deployment I am able to ping these assigned ip addresses. 
Why can't the executor reach the task then? I guess something has 
changed in how the executor connects to the container since 1.8?



I0929 00:54:55.658519 469057 slave.cpp:2130] Got assigned task 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' for framework d5168fcd-51be-48c3-ba64-ade27ab23c4e-
I0929 00:54:55.664753 469057 slave.cpp:2504] Authorizing task 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' for framework d5168fcd-51be-48c3-ba64-ade27ab23c4e-
I0929 00:54:55.667817 469057 slave.cpp:2977] Launching task 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' for framework d5168fcd-51be-48c3-ba64-ade27ab23c4e-
I0929 00:54:55.668617 469057 paths.cpp:817] Creating sandbox '/var/lib/mesos/slaves/0c15c45f-310b-4fc2-8275-b0bfa1bdfdcb-S0/frameworks/d5168fcd-51be-48c3-ba64-ade27ab23c4e-/executors/demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3/runs/51e76e90-5244-4c67-98bf-cf30b78d6fe6' for user 'nobody'
I0929 00:54:55.669667 469057 paths.cpp:820] Creating sandbox '/var/lib/mesos/meta/slaves/0c15c45f-310b-4fc2-8275-b0bfa1bdfdcb-S0/frameworks/d5168fcd-51be-48c3-ba64-ade27ab23c4e-/executors/demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3/runs/51e76e90-5244-4c67-98bf-cf30b78d6fe6'
I0929 00:54:55.669903 469057 slave.cpp:10002] Launching executor 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e- with resources [{"allocation_info":{"role":"marathon"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"marathon"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory '/var/lib/mesos/slaves/0c15c45f-310b-4fc2-8275-b0bfa1bdfdcb-S0/frameworks/d5168fcd-51be-48c3-ba64-ade27ab23c4e-/executors/demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3/runs/51e76e90-5244-4c67-98bf-cf30b78d6fe6'
I0929 00:54:55.670975 469057 slave.cpp:3209] Queued task 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' for executor 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e-
I0929 00:54:55.671329 469057 slave.cpp:3657] Launching container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 for executor 'demo_server-storage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a
1._app.3' of framework d5168fcd-51be-48c3-ba64-ade27ab23c4e-
I0929 00:54:55.671910 469042 containerizer.cpp:1396] Starting container 
51e76e90-5244-4c67-98bf-cf30b78d6fe6
I0929 00:54:55.672205 469042 containerizer.cpp:3323] Transitioning the 
state of container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 from STARTING to 
PROVISIONING after 162048ns
I0929 00:54:55.973661 469039 provisioner.cpp:551] Provisioning image 
rootfs 
'/var/lib/mesos/provisioner/containers/51e76e90-5244-4c67-98bf-cf30b78d6
fe6/backends/copy/rootfses/292ee89d-1112-4b79-83e7-953571b3cd3b' for 
container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 using copy backend
I0929 00:54:57.279662 469057 containerizer.cpp:3323] Transitioning the 
state of container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 from 
PROVISIONING to PREPARING after 1.607405056secs
I0929 00:54:57.285080 469064 cpu.cpp:92] Updated 'cpu.shares' to 204 
(cpus 0.2) for container 51e76e90-5244-4c67-98bf-cf30b78d6fe6
I0929 00:54:57.296098 469057 switchboard.cpp:316] Container logger 
module finished preparing container 
51e76e90-5244-4c67-98bf-cf30b78d6fe6; IOSwitchboard server is not 
required
I0929 00:54:57.298223 469064 linux_launcher.cpp:492] Launching container 
51e76e90-5244-4c67-98bf-cf30b78d6fe6 and cloning with namespaces 
CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWNET
I0929 00:54:57.307914 469057 containerizer.cpp:2209] Checkpointing 
container's forked pid 469525 to 
'/var/lib/mesos/meta/slaves/0c15c45f-310b-4fc2-8275-b0bfa1bdfdcb-S0/fram
eworks/d5168fcd-51be-48c3-ba64-ade27ab23c4e-/executors/demo_server-s
torage-appsgw.instance-7239a528-e242-11e9-ac3c-0050563001a1._app.3/runs/
51e76e90-5244-4c67-98bf-cf30b78d6fe6/pids/forked.pid'
I0929 00:54:57.308645 469057 containerizer.cpp:3323] Transitioning the 
state of container 51e76e90-5244-4c67-98bf-cf30b78d6fe6 from PREPARING 
to ISOLATING after 29.022976ms
I0929 00:54:57.310788 469040 cni.cpp:974] Bind mounted 
'/proc/469525/ns/net' to 
'/run/mesos/isolators/network/cni/51e76e90-5244-4c67-98bf-cf30b78d6fe6/n
s' for container 51e76e90-5244-4c67-98bf-cf30b78d6fe6
I0929 00:54:57.311151 469040 cni.cpp:1320] Invoking CNI plugin 
'/usr/libexec/cni/mesos' to attach container 
51e76e90-5244-4c67-98bf-cf30b78d6fe6 to network 

How to clean up "Failed to find 'libprocess.pid' or 'http.marker'"

2019-09-28 Thread Marc Roos


W0929 00:45:10.676910 468993 process.cpp:1055] Failed SSL connections 
will be downgraded to a non-SSL socket
W0929 00:45:10.901372 469057 state.cpp:657] Failed to find 
'libprocess.pid' or 'http.marker' for container 
8bf306d5-a10c-4787-9258-4198ea80bbec of executor 
W0929 00:45:10.902492 469057 state.cpp:657] Failed to find 
'libprocess.pid' or 'http.marker' for container 
4b171278-bcd4-4014-abc4-c330912bd87f of executor 
W0929 00:45:10.903398 469057 state.cpp:657] Failed to find 
'libprocess.pid' or 'http.marker' for container 
8dadcf4d-35ab-4cd7-b313-f4de1175282e of executor 
W0929 00:45:10.904371 469057 state.cpp:657] Failed to find 
'libprocess.pid' or 'http.marker' for container 
586a55ff-89b7-4fd1-8bed-045300baee47 of executor 
W0929 00:45:10.905293 469057 state.cpp:657] Failed to find 
'libprocess.pid' or 'http.marker' for container 
ea295608-bf62-447d-aae6-a3596dae8a13 of executor 
W0929 00:45:10.906186 469057 state.cpp:657] Failed to find 
'libprocess.pid' or 'http.marker' for container 
957f1879-2718-400b-9348-ec35c89f51a6 of executor 
W0929 00:45:10.907114 469057 state.cpp:657] Failed to find 
'libprocess.pid' or 'http.marker' for container 
938dbd1c-1468-480e-9554-65c9e415d163 of executor 


Task list node

2019-09-26 Thread Marc Roos


What would be the easiest way to list running tasks on a 
node/agent/slave?









BUG /tmp/mesos losing files add /usr/lib/tmpfiles.d/mesos.conf

2019-09-15 Thread Marc Roos


For the developers: /tmp in centos6/7 (and probably more distros) is 
being cleaned automatically! Read this:
https://www.thegeekdiary.com/centos-rhel-67-why-the-files-in-tmp-directory-gets-deleted-periodically/
https://developers.redhat.com/blog/2016/09/20/managing-temporary-files-with-systemd-tmpfiles-on-rhel7/

Maybe add something like this file to your rpms:

cat << EOF >> /usr/lib/tmpfiles.d/mesos.conf
x /tmp/mesos/store/docker/
EOF

Now mesos often fails even when used with "forcePullImage": true

I am having this error:

Task id
chat_openfire.instance-5c6ba784-d7a7-11e9-b799-0050563001a1._app.11 
State
TASK_FAILED 
Message
Failed to launch container: Failed to read manifest from 
'/tmp/mesos/store/docker/layers/8c49e24d4aba93c77354143366e2427e0e2e7191
cb85dbc1aa187e4e480021c1/json': No such file or directory


mesos-1.8.1-2.0.1.el7.x86_64


-Original Message-
From: Marc Roos 
Sent: maandag 19 augustus 2019 21:47
To: user
Subject: "Failed to launch container" "No such file or directory"


Some temp folders gone? How to resolve this?

Failed to launch container: Failed to read manifest from
'/tmp/mesos/store/docker/layers/8c49e24d4aba93c77354143366e2427e0e2e7191
cb85dbc1aa187e4e480021c1/json': No such file or directory








RE: Please some help regression testing a task

2019-09-02 Thread Marc Roos
 
No it is not throttled; besides, changing the runtime cgroups of the 
task to user.slice should have revealed some difference then, shouldn't 
it?
[@~]# cat /sys/fs/cgroup/cpuacct/mesos/d0923b5a-5b96-41cc-b291-4effc0bfcbb9/cpu.stat
nr_periods 0
nr_throttled 0
throttled_time 0


-Original Message-
To: user
Subject: Re: Please some help regression testing a task

Can you check if the task is throttled? You can run `cat 
/proc/<pid>/cgroup` to get the cgroups of the task, and then check the 
`cpu.stat` file under the task's CPU cgroup, e.g.:


$ cat /sys/fs/cgroup/cpuacct/mesos/bd5bc588-7565-4c7e-a5f0-d33850b2ec0a/cpu.stat
nr_periods 118
nr_throttled 37
throttled_time 633829202



If `nr_throttled` is greater than 0, then that means the task was 
throttled which may affect its performance.
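
The throttling check described above can be scripted. A minimal sketch that 
parses the cpu.stat contents (the sample values are the ones quoted in this 
thread; in practice read the file under the task's cpuacct cgroup, e.g. 
/sys/fs/cgroup/cpuacct/mesos/<container-id>/cpu.stat):

```python
def parse_cpu_stat(text):
    # cpu.stat is "key value" pairs, one per line.
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

sample = "nr_periods 118\nnr_throttled 37\nthrottled_time 633829202\n"
stats = parse_cpu_stat(sample)

# nr_throttled > 0 means the cgroup hit its CFS quota at least once.
print(stats["nr_throttled"] > 0)  # True
```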


Regards,
Qian Zhang


On Sat, Aug 31, 2019 at 11:48 PM Marc Roos  
wrote:


 

mesos-1.8.1-2.0.1.el7.x86_64
CentOS Linux release 7.6.1810 (Core)



-Original Message-
To: user
Subject: Please some help regression testing a task


I have a task that underperforms. I am unable to discover what is 
causing it. Could this be something mesos-specific?
The performance difference is 1k q/s vs 20k q/s.


1. If manually I run the task on the host the performance is ok
> I think one could rule out network connectivity on/of the host 
and 
> host issues


2. If I manually run a task in the same netns as the underperforming 
task, the performance is ok.
  ip netns exec bind bash
  chroot 04a81d99-9b99-410d-bf83-d6d70ef2c7bb/
  (changed only the config port to 54)
  named -u named
> I think we can rule out netns issues


3. If I manually remove or change the cgroups of the mesos/marathon 
task, the performance is still bad

echo 2932859 > /sys/fs/cgroup/memory/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/devices/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/cpu/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/cpuacct/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/pids/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/blkio/user.slice/tasks

or

echo 2932859 > /sys/fs/cgroup/memory/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/devices/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/cpu/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/cpuacct/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/pids/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/blkio/user.slice/tasks


[@]# cat /proc/2936696/cgroup
11:hugetlb:/
10:memory:/user.slice
9:devices:/user.slice
8:cpuacct,cpu:/user.slice
7:perf_event:/
6:cpuset:/
5:pids:/user.slice
4:freezer:/
3:blkio:/user.slice
2:net_prio,net_cls:/
1:name=systemd:/user.slice/user-0.slice/session-17385.scope

[@]# cat /proc/2932859/cgroup
11:hugetlb:/
10:memory:/user.slice
9:devices:/user.slice
8:cpuacct,cpu:/user.slice
7:perf_event:/
6:cpuset:/
5:pids:/user.slice
4:freezer:/
3:blkio:/user.slice
2:net_prio,net_cls:/
1:name=systemd:/mesos/812c481b-c0a4-444a-aafa-de98da9698e2
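
The two /proc/<pid>/cgroup listings above can be compared programmatically to 
verify the migration took effect. A sketch that parses the 
"hierarchy-id:controllers:path" format and reports which controllers still 
point at different cgroups (the sample lines are shortened from the listings 
above):

```python
def parse_proc_cgroup(text):
    # Each line is "hierarchy-id:controllers:path".
    cgroups = {}
    for line in text.strip().splitlines():
        _, controllers, path = line.split(":", 2)
        cgroups[controllers] = path
    return cgroups

manual = parse_proc_cgroup(
    "10:memory:/user.slice\n"
    "1:name=systemd:/user.slice/user-0.slice/session-17385.scope\n")
task = parse_proc_cgroup(
    "10:memory:/user.slice\n"
    "1:name=systemd:/mesos/812c481b-c0a4-444a-aafa-de98da9698e2\n")

# Controllers whose cgroup paths differ between the two processes.
differs = {c for c in manual if manual[c] != task.get(c)}
print(differs)  # {'name=systemd'}
```

This matches the situation in the message: after echoing the pid into the 
user.slice tasks files, only the name=systemd hierarchy still differs.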








W0831 containerizer.cpp:2375] Ignoring update for unknown container

2019-08-31 Thread Marc Roos


Why do I get this message? How can I resolve it?

W0831 18:01:45.403295 2943686 containerizer.cpp:2375] Ignoring update 
for unknown container 48d9b77c-7348-4404-9845-211be74bad1d

mesos-1.8.1-2.0.1.el7.x86_64






RE: Please some help regression testing a task

2019-08-31 Thread Marc Roos
 

mesos-1.8.1-2.0.1.el7.x86_64
CentOS Linux release 7.6.1810 (Core)



-Original Message-
To: user
Subject: Please some help regression testing a task


I have a task that underperforms. I am unable to discover what is 
causing it. Could this be something mesos-specific?
The performance difference is 1k q/s vs 20k q/s.


1. If manually I run the task on the host the performance is ok
> I think one could rule out network connectivity on/of the host and 
> host issues


2. If I manually run a task in the same netns as the underperforming 
task, the performance is ok.
  ip netns exec bind bash
  chroot 04a81d99-9b99-410d-bf83-d6d70ef2c7bb/
  (changed only the config port to 54)
  named -u named
> I think we can rule out netns issues


3. If I manually remove or change the cgroups of the mesos/marathon 
task, the performance is still bad

echo 2932859 > /sys/fs/cgroup/memory/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/devices/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/cpu/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/cpuacct/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/pids/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/blkio/user.slice/tasks

or

echo 2932859 > /sys/fs/cgroup/memory/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/devices/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/cpu/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/cpuacct/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/pids/user.slice/tasks
echo 2932859 > /sys/fs/cgroup/blkio/user.slice/tasks


[@]# cat /proc/2936696/cgroup
11:hugetlb:/
10:memory:/user.slice
9:devices:/user.slice
8:cpuacct,cpu:/user.slice
7:perf_event:/
6:cpuset:/
5:pids:/user.slice
4:freezer:/
3:blkio:/user.slice
2:net_prio,net_cls:/
1:name=systemd:/user.slice/user-0.slice/session-17385.scope

[@]# cat /proc/2932859/cgroup
11:hugetlb:/
10:memory:/user.slice
9:devices:/user.slice
8:cpuacct,cpu:/user.slice
7:perf_event:/
6:cpuset:/
5:pids:/user.slice
4:freezer:/
3:blkio:/user.slice
2:net_prio,net_cls:/
1:name=systemd:/mesos/812c481b-c0a4-444a-aafa-de98da9698e2





RE: Large container image failing to start 'first' time

2019-08-30 Thread Marc Roos


I only have these two messages 


mesos-slave.ERROR:E0828 12:51:46.146246 2663200 slave.cpp:6486] 
Container '680d3849-2b2a-4549-8842-8ef358599478' for executor 
'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework 
d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is 
being destroyed during provisioning
mesos-slave.INFO:E0828 12:51:46.146246 2663200 slave.cpp:6486] Container 
'680d3849-2b2a-4549-8842-8ef358599478' for executor 
'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework 
d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is 
being destroyed during provisioning
mesos-slave.INFO:W0828 12:51:46.650323 2663184 containerizer.cpp:2375] 
Ignoring update for unknown container 
680d3849-2b2a-4549-8842-8ef358599478
mesos-slave.WARNING:E0828 12:51:46.146246 2663200 slave.cpp:6486] 
Container '680d3849-2b2a-4549-8842-8ef358599478' for executor 
'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework 
d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is 
being destroyed during provisioning
mesos-slave.WARNING:W0828 12:51:46.650323 2663184 
containerizer.cpp:2375] Ignoring update for unknown container 
680d3849-2b2a-4549-8842-8ef358599478 




-Original Message-
From: Qian Zhang [mailto:zhq527...@gmail.com] 
Sent: woensdag 28 augustus 2019 15:07
To: Marc Roos
Cc: user
Subject: Re: Large container image failing to start 'first' time

Can you please send the full logs about this container (just grep 
680d3849-2b2a-4549-8842-8ef358599478 in agent log)? And is there 
anything left in the staging directory (`--docker_store_dir/staging/`) 
when this issue happens?


Regards,
Qian Zhang


On Wed, Aug 28, 2019 at 7:07 PM Marc Roos  
wrote:


 I had this again.

E0828 12:51:46.146246 2663200 slave.cpp:6486] Container 
'680d3849-2b2a-4549-8842-8ef358599478' for executor 
'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of 
framework 
d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: 
Container is 
being destroyed during provisioning



-Original Message-
From: Qian Zhang [mailto:zhq527...@gmail.com] 
Sent: dinsdag 20 augustus 2019 1:12
To: user
Subject: Re: Large container image failing to start 'first' time

> 

 Large container image failing to start 'first' time
Did you see any errors/warnings in agent logs when the container 
failed to start?


Regards,
Qian Zhang


On Mon, Aug 19, 2019 at 10:46 PM Marc Roos 
 
wrote:



I have a container image of around 800MB. I am not sure if 
that is 
a 
lot. But I have noticed it is probably to big for a default 
setup 
to get 
it to launch. I think the only reason it launches 
eventually is 
because 
data is cached and no timeout expires. The container will 
launch 
eventually when you constrain it to a host.

How can I trace where this timeout occurs? Are there 
options to 
specify 
timeouts?















Converting vm to task (performance degraded)

2019-08-29 Thread Marc Roos


I am testing converting a nameserver vm to a task on mesos. If I query 
just one domain (so the result comes from cache) for 30 seconds, I can 
do around 450,000 queries on the vm and only 17,000 on the task. 
When I look at top output on the host where the task is running, I see 
the task using only 17% cpu time (the vm allocates 100% cpu). I have 
launched the task with cpus: 1.

What should I check for the cause of this reduced performance? I think 
some configuration is limiting it, because I can easily get 10k q/s on 
the vm while the task is only getting 1.8k q/s.

Is there a configuration guide on how to change a host's settings to 
optimize it for use with mesos?











RE: Large container image failing to start 'first' time

2019-08-28 Thread Marc Roos
 I had this again.

E0828 12:51:46.146246 2663200 slave.cpp:6486] Container 
'680d3849-2b2a-4549-8842-8ef358599478' for executor 
'ldap.instance-afee8840-c981-11e9-8333-0050563001a1._app.1' of framework 
d5168fcd-51be-48c3-ba64-ade27ab23c4e- failed to start: Container is 
being destroyed during provisioning



-Original Message-
From: Qian Zhang [mailto:zhq527...@gmail.com] 
Sent: dinsdag 20 augustus 2019 1:12
To: user
Subject: Re: Large container image failing to start 'first' time

> 

 Large container image failing to start 'first' time
Did you see any errors/warnings in agent logs when the container failed 
to start?


Regards,
Qian Zhang


On Mon, Aug 19, 2019 at 10:46 PM Marc Roos  
wrote:



I have a container image of around 800MB. I am not sure if that is 
a 
lot. But I have noticed it is probably to big for a default setup 
to get 
it to launch. I think the only reason it launches eventually is 
because 
data is cached and no timeout expires. The container will launch 
eventually when you constrain it to a host.

How can I trace where this timeout occurs? Are there options to 
specify 
timeouts?












Exporting socket from container to host

2019-08-25 Thread Marc Roos


I was wondering if it is possible to export a socket of a container to 
the host, so I can then share it again with another container. (Without 
using pods eg. (I like to scale these applications independent from each 
other))










W0823 14:20:30.101281 2663193 containerizer.cpp:2375] Ignoring update for unknown container

2019-08-23 Thread Marc Roos



When scaling a task from 0 to 1 with two cni networks one of them having 
a gateway, I have quite a lot of failures

Step 1: deploying
Step 2: DHCPREQUEST and DHCPACK (fast)
Step 3: Right after DHCPACK this error from the agent
W0823 14:58:18.440388 2663180 containerizer.cpp:2375] Ignoring 
update for unknown container 2000cbfc-7ca8-4f61-bcfd-f43e248ba130
Step 4: Delayed

Why do I get the unknow container?


W0823 12:00:46. process.cpp:1453] Failed to link to '192.168.142.50:40746', connect: Failed connect: connection closed

2019-08-23 Thread Marc Roos



When scaling the task from 0 to 1, it sometimes takes quite a while for 
it to become active, waiting maybe 10-20 seconds at the first 'waiting' 
reported by marathon. 

Step 1: deploying (fast)
Step 2: sometimes fast / sometimes 10-20 seconds
Step 3: DHCPREQUEST and DHCPACK (fast)
Step 4: Right after DHCPACK this error from the agent
W0823 13:08:16.595113 2663211 process.cpp:1453] Failed to link 
to '192.168.142.53:41715', connect: Failed connect: connection closed
Step 5: waiting (fast)
Step 6: running (fast)

Sometimes the same task is instantly running, although it has the same 
'connection closed' error after the DHCPACK. 




RE: Large container image failing to start 'first' time

2019-08-23 Thread Marc Roos
 
I have found several, related to also having multiple networks; I will 
start a new thread


-Original Message-
To: user
Subject: Re: Large container image failing to start 'first' time

> 

 Large container image failing to start 'first' time
Did you see any errors/warnings in agent logs when the container failed 
to start?


Regards,
Qian Zhang


On Mon, Aug 19, 2019 at 10:46 PM Marc Roos  
wrote:



I have a container image of around 800MB. I am not sure if that is 
a 
lot. But I have noticed it is probably to big for a default setup 
to get 
it to launch. I think the only reason it launches eventually is 
because 
data is cached and no timeout expires. The container will launch 
eventually when you constrain it to a host.

How can I trace where this timeout occurs? Are there options to 
specify 
timeouts?












RE: "Failed to launch container" "No such file or directory" /tmp files are being cleaned

2019-08-19 Thread Marc Roos
 
Hmmm, I have deleted the layers there; now it seems ok and the image is 
pulled from the nfs share.

1. shouldn't the forced pull fetch the image regardless of what is in 
/tmp?
2. I think since centos7, /tmp is cleared not just after a reboot but 
also on a schedule. So maybe not such a good place to store docker 
layers?








-Original Message-
From: Marc Roos 
Sent: maandag 19 augustus 2019 21:47
To: user
Subject: "Failed to launch container" "No such file or directory"


Some temp folders gone? How to resolve this?

Failed to launch container: Failed to read manifest from
'/tmp/mesos/store/docker/layers/8c49e24d4aba93c77354143366e2427e0e2e7191
cb85dbc1aa187e4e480021c1/json': No such file or directory








"Failed to launch container" "No such file or directory"

2019-08-19 Thread Marc Roos


Some temp folders gone? How to resolve this?

Failed to launch container: Failed to read manifest from 
'/tmp/mesos/store/docker/layers/8c49e24d4aba93c77354143366e2427e0e2e7191
cb85dbc1aa187e4e480021c1/json': No such file or directory






Large container image failing to start 'first' time

2019-08-19 Thread Marc Roos


I have a container image of around 800MB. I am not sure if that is a 
lot, but I have noticed it is probably too big for a default setup to 
get it to launch. I think the only reason it launches eventually is 
because data is cached and no timeout expires. The container will 
launch eventually when you constrain it to a host.

How can I trace where this timeout occurs? Are there options to specify 
timeouts?









RE: Provisioning containers with configuration file via sandbox mount or copy via entrypoint.sh

2019-08-14 Thread Marc Roos
 
Hi Gilbert, thanks for the detailed reply; this secrets feature is very 
interesting. 


>   *   Fetch via URI - you probably do not need your application 
entrypoint to fetch. Instead Mesos > and marathon supports fetching URIs 
to your container sandbox.
>   http://mesos.apache.org/documentation/latest/fetcher/

This fetching is what I am doing now. I have containers with a default 
configuration file. But when I need updates I am fetching with something 
like this. 

 "fetch": [
{ "uri": "file:///mnt/docker-images/haproxy.cfg",
  "executable": false,
  "extract": false,
  "cache": false,
  "destPath": "haproxy.cfg" },
{ "uri": "file:///mnt/docker-images/.crt",
  "executable": false,
  "extract": false,
  "cache": false,
  "destPath": ".crt" }
  ],

But this file goes into the sandbox directory /mnt/sandbox; I just 
wonder why it can't go directly into the container rootfs?

This is what I now have to do in the entrypoint.sh

if [ ! -z "${MESOS_SANDBOX}" ] && [ -f "${MESOS_SANDBOX}/haproxy.cfg" ]



-Original Message-
To: user
Subject: Re: Provisioning containers with configuration file via sandbox 
mount or copy via entrypoint.sh

It depends on how do you want to manage the configuration files for your 
containers - dynamic or static.

*   Dynamic

*   Fetch via URI - you probably do not need your application 
entrypoint to fetch. Instead Mesos and marathon supports fetching URIs 
to your container sandbox.
http://mesos.apache.org/documentation/latest/fetcher/

*   Pass into the container as a file based secret if it is 
sensitive.

http://mesos.apache.org/documentation/latest/secrets/#file-based-secrets

*   Environment Variable.

*   Static

*   Host_path volume - mounting a host path or file into your 
container.

http://mesos.apache.org/documentation/latest/container-volume/#host_path-volume-source

*   Build it in your container image if those configurations are 
not expected to be changed.

> Furthermore this page[1] says the sandbox is considered read only, yet 
the stdout and stderr are located there???
I think the document 
<http://mesos.apache.org/documentation/latest/sandbox/#using-the-sandbox>  
means that sandbox is not expected to be touched by any 3rd party software or 
people other than Mesos, executor and task/application.

-Gilbert

On Sun, Jul 21, 2019 at 3:22 AM Marc Roos  
wrote:




What would be the adviced way to add a configuration file to a 
container 
being used at startup. I am now fetching the files and then create 
an 
entrypoint.sh that copies this from the sandbox. 

Creating these custom entrypoints.sh is cumbersome. I thought about 

mounting the path's of the sandbox in the container but don't have 
good 
example to get this working[0]. Furthermore this page[1] says the 
sandbox is considered read only, yet the stdout and stderr are 
located 
there???

Is there a (security) advantage copying files from the sandbox at 
startup or just use a mount point?

[0]
https://www.mail-archive.com/user@mesos.apache.org/msg10445.html

[1]
http://mesos.apache.org/documentation/latest/sandbox/





RE: Is chained cni networks supported in mesos 1.7

2019-08-14 Thread Marc Roos
 
Hi Gilbert, 

Yes indeed. I have already written a netfilter chain plugin[0] that I 
wanted to use. I would also like to use CNI's default tuning plugin, 
which requires chaining.
-Marc

[0]
https://github.com/f1-outsourcing/plugins/tree/hostrouteif/plugins/meta/firewallnetns

-Original Message-
To: user
Subject: Re: Is chained cni networks supported in mesos 1.7

Are you interested in implementing the CNI chain support?

-Gilbert

On Wed, Jul 24, 2019 at 12:52 PM Marc Roos  
wrote:


 
Hmm, I guess I should not get my hopes up this will be there soon?
[0]
https://issues.apache.org/jira/browse/MESOS-7178



-Original Message-
From: Jie Yu [mailto:yujie@gmail.com] 
Sent: woensdag 24 juli 2019 21:35
To: user
Subject: Re: Is chained cni networks supported in mesos 1.7

No, not yet

On Wed, Jul 24, 2019 at 12:27 PM Marc Roos 
 
wrote:







This error message of course:
E0724 21:19:17.852210  1160 cni.cpp:330] Failed to parse CNI network 
configuration file '/etc/mesos-cni/93-chain.conflist': Protobuf parse 
failed: Missing required fields: type


-Original Message-
Subject: Is chained cni networks supported in mesos 1.7


I am getting this error, while I do not have problems using it with 
cnitool.

 cni.cpp:330] Failed to parse CNI network configuration file 
'/etc/mesos-cni/93-chain-routing-overwrite.conflist.bak': Protobuf 
parse failed: Missing required fields: type

[@ mesos-cni]# cat 93-chain.conflist
{
  "name": "test-chain",
  "plugins": [{
"type": "bridge",
"bridge": "test-chain0",
"isGateway": false,
"isDefaultGateway": false,
"ipMasq": false,
"ipam": {
"type": "host-local",
"subnet": "10.15.15.0/24"
}
},
{
  "type": "portmap",
  "capabilities": {"portMappings": true},
  "snat": false
}]
}


[@ mesos-cni]# CNI_PATH="/usr/libexec/cni/" NETCONFPATH="/etc/mesos-cni" \
cnitool-0.5.2 add test-chain /var/run/netns/testing
{
    "ip4": {
        "ip": "10.15.15.2/24",
        "gateway": "10.15.15.1"
    },
    "dns": {}
}

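One way to see why cnitool accepts the file while Mesos rejects it: a 
.conflist has no top-level "type" field, and that is exactly the field 
Mesos 1.7's protobuf parser requires, since chained-CNI support 
(MESOS-7178) is missing. A sketch that classifies a config by format 
(following the CNI config conventions; hypothetical helper, not a real 
Mesos check):

```python
import json

def classify_cni_config(text):
    conf = json.loads(text)
    if "type" in conf:
        return "plain"     # single-plugin .conf -- what Mesos 1.7 expects
    if "plugins" in conf:
        return "conflist"  # chained plugin list -- needs CNI chain support
    return "invalid"

plain = '{"name": "net", "type": "bridge"}'
chained = '{"name": "test-chain", "plugins": [{"type": "bridge"}]}'
print(classify_cni_config(plain), classify_cni_config(chained))
# plain conflist
```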









Should mesos 1.8 (and marathon 1.8) drain/migrate tasks or not?

2019-08-08 Thread Marc Roos


I don't get from this page 
http://mesos.apache.org/documentation/latest/maintenance/ whether mesos 
should be 'moving' tasks to another node when it is marked as draining. 
I know DRAIN_AGENT is only for mesos 1.9. But what use is it to post a 
maintenance schedule, see the node being marked as draining, and then 
have nothing happen with the tasks?


On the marathon page they say "draining is not yet implemented", yet 
they refer to an issue that has been resolved.
https://mesosphere.github.io/marathon/docs/maintenance-mode.html


On stackoverflow there is the same question, and again referencing issue 
that have been resolved.
https://stackoverflow.com/questions/37194123/marathon-tasks-not-migrating-off-mesos-node-goes-into-draining-mode
https://jira.mesosphere.com/browse/MARATHON-3216
https://phabricator.mesosphere.com/D1069
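
For reference, a sketch of the JSON body a maintenance schedule POST takes 
(schema per the Mesos maintenance documentation's /maintenance/schedule 
endpoint; note that posting it only marks machines for maintenance — in 1.8 
nothing moves tasks automatically, frameworks receive inverse offers and must 
drain themselves):

```python
import json
import time

def maintenance_window(hostnames, start_s, duration_s):
    """Build the body for POST /maintenance/schedule.

    start_s is epoch seconds, duration_s a duration in seconds; the
    endpoint expects both as nanoseconds.
    """
    return {"windows": [{
        "machine_ids": [{"hostname": h} for h in hostnames],
        "unavailability": {
            "start": {"nanoseconds": int(start_s * 1e9)},
            "duration": {"nanoseconds": int(duration_s * 1e9)},
        },
    }]}

body = maintenance_window(["m02.local"], time.time(), 3600)
print(json.dumps(body)[:60])
```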



-Original Message-
From: Vinod Kone [mailto:vinodk...@apache.org] 
Sent: donderdag 8 augustus 2019 0:35
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 'type' 
to be present

Please read the "maintenance primitives" section in this doc 
http://mesos.apache.org/documentation/latest/maintenance/ and let us 
know if you have unanswered questions.

On Wed, Aug 7, 2019 at 4:59 PM Marc Roos  
wrote:



 I seem to be able to add a maintenance schedule, and get also a 
report 
on '{"down_machines":[{"hostname":"m02.local"}]}' but I do not see 
tasks 
migrate to other hosts. Or is this not the purpose of maintenance 
mode 
in 1.8? Just to make sure no new tasks will be launched on hosts 
scheduled for maintenance?



-Original Message-
From: Chun-Hung Hsiao [mailto:chhs...@apache.org] 
Sent: woensdag 7 augustus 2019 22:59
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 
'type' 
to be present

Hi Marc.

Agent draining is a Mesos 1.9 feature and is only available on the 
current Mesos master branch.
Please see https://issues.apache.org/jira/browse/MESOS-9814.

Best,
Chun-Hung
    
On Wed, Aug 7, 2019 at 1:35 PM Marc Roos  

wrote:



Should this be working in mesos 1.8?

[@m01 ~]# curl --user test:x -X POST \
>   https://m01.local:5050/api/v1 \
>   --cacert /etc/pki/ca-trust/source/ca.crt \
>   -H 'Accept: application/json' \
>   -H 'content-type: application/json' -d '{
>   "type": "DRAIN_AGENT",
>   "drain_agent": {"agent_id": {
> "value":"53336fcb-7756-4673-b9c7-177e04f34c3b-S1"
>   }}}'

Failed to validate master::Call: Expecting 'type' to be 
present









RE: Draining: Failed to validate master::Call: Expecting 'type' to be present

2019-08-08 Thread Marc Roos
 
Now I am getting the draining state; I don't know why I did not get 
this before.
{"draining_machines":[{"id":{"hostname":"m02.local"}}]}

But no tasks are migrating, nothing happens

After a while, brought the agent down
 {"down_machines":[{"hostname":"m02.local"}]}

Tasks are still there.

I assume this maintenance-mode draining is not related to the mesos 1.9 
DRAIN_AGENT? And should tasks migrate to other nodes?






-Original Message-
To: user
Subject: RE: Draining: Failed to validate master::Call: Expecting 'type' 
to be present

I have scheduled a maintenance window (starting now); how can I verify 
that the agent is indeed in 'draining' mode?

 



-Original Message-
From: Vinod Kone [mailto:vinodk...@apache.org]
Sent: donderdag 8 augustus 2019 0:35
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 'type' 

to be present

Please read the "maintenance primitives" section in this doc 
http://mesos.apache.org/documentation/latest/maintenance/ and let us 
know if you have unanswered questions.

On Wed, Aug 7, 2019 at 4:59 PM Marc Roos 
wrote:



 I seem to be able to add a maintenance schedule, and get also a 
report 
on '{"down_machines":[{"hostname":"m02.local"}]}' but I do not see 
tasks 
migrate to other hosts. Or is this not the purpose of maintenance 
mode 
in 1.8? Just to make sure no new tasks will be launched on hosts 
scheduled for maintenance?



-Original Message-
From: Chun-Hung Hsiao [mailto:chhs...@apache.org] 
Sent: woensdag 7 augustus 2019 22:59
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 
'type' 
to be present

Hi Marc.

Agent draining is a Mesos 1.9 feature and is only available on the 
current Mesos master branch.
Please see https://issues.apache.org/jira/browse/MESOS-9814.

Best,
Chun-Hung

On Wed, Aug 7, 2019 at 1:35 PM Marc Roos  


wrote:



Should this be working in mesos 1.8?

[@m01 ~]# curl --user test:x -X POST \
>   https://m01.local:5050/api/v1 \
>   --cacert /etc/pki/ca-trust/source/ca.crt \
>   -H 'Accept: application/json' \
>   -H 'content-type: application/json' -d '{
>   "type": "DRAIN_AGENT",
>   "drain_agent": {"agent_id": {
> "value":"53336fcb-7756-4673-b9c7-177e04f34c3b-S1"
>   }}}'

Failed to validate master::Call: Expecting 'type' to be 
present











RE: Draining: Failed to validate master::Call: Expecting 'type' to be present

2019-08-08 Thread Marc Roos
I have scheduled a maintenance window (starting now); how can I verify 
that the agent is indeed in 'draining' mode?

 



-Original Message-
From: Vinod Kone [mailto:vinodk...@apache.org] 
Sent: donderdag 8 augustus 2019 0:35
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 'type' 
to be present

Please read the "maintenance primitives" section in this doc 
http://mesos.apache.org/documentation/latest/maintenance/ and let us 
know if you have unanswered questions.

On Wed, Aug 7, 2019 at 4:59 PM Marc Roos  
wrote:



 I seem to be able to add a maintenance schedule, and get also a 
report 
on '{"down_machines":[{"hostname":"m02.local"}]}' but I do not see 
tasks 
migrate to other hosts. Or is this not the purpose of maintenance 
mode 
in 1.8? Just to make sure no new tasks will be launched on hosts 
scheduled for maintenance?



-Original Message-
From: Chun-Hung Hsiao [mailto:chhs...@apache.org] 
Sent: woensdag 7 augustus 2019 22:59
To: user
Subject: Re: Draining: Failed to validate master::Call: Expecting 
'type' 
to be present

Hi Marc.

Agent draining is a Mesos 1.9 feature and is only available on the 
current Mesos master branch.
Please see https://issues.apache.org/jira/browse/MESOS-9814.

Best,
Chun-Hung

    On Wed, Aug 7, 2019 at 1:35 PM Marc Roos  

wrote:



Should this be working in mesos 1.8?

[@m01 ~]# curl --user test:x -X POST \
>   https://m01.local:5050/api/v1 \
>   --cacert /etc/pki/ca-trust/source/ca.crt \
>   -H 'Accept: application/json' \
>   -H 'content-type: application/json' -d '{
>   "type": "DRAIN_AGENT",
>   "drain_agent": {"agent_id": {
> "value":"53336fcb-7756-4673-b9c7-177e04f34c3b-S1"
>   }}}'

Failed to validate master::Call: Expecting 'type' to be 
present








