[jira] [Created] (MESOS-9785) Frameworks recovered from reregistered agents are not reported to master `/api/v1` subscribers.

2019-05-14 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-9785:
--

 Summary: Frameworks recovered from reregistered agents are not 
reported to master `/api/v1` subscribers.
 Key: MESOS-9785
 URL: https://issues.apache.org/jira/browse/MESOS-9785
 Project: Mesos
  Issue Type: Bug
  Components: HTTP API
Reporter: Chun-Hung Hsiao


Currently when an operator subscribes to the {{/api/v1}} master endpoint, it 
would receive a {{SUBSCRIBED}} event carrying information about all known 
frameworks, including registered ones and unregistered ones. If an unregistered 
framework reregisters later, a {{FRAMEWORK_UPDATED}} event would be sent to the 
operator.

However, if an operator subscribes to the {{/api/v1}} master endpoint after a 
master failover but before any frameworks or agents reregister, 
{{SUBSCRIBED}} contains no recovered framework information. When an agent 
with running tasks reregisters later, the unregistered frameworks of those tasks 
are recovered, but no {{FRAMEWORK_ADDED}} event is sent to the operator, so 
the operator receives {{TASK_ADDED}} events for those tasks with unknown 
framework IDs.
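
A minimal sketch of how a subscriber could notice this gap, assuming the event stream has already been decoded into Python dicts. The flat event shape ({{type}}, {{framework_id}}, {{framework_ids}}, {{task_id}}) is a simplification made up for this example and is not the real operator API schema; {{find_orphan_tasks}} is a hypothetical helper.

```python
def find_orphan_tasks(events):
    """Return IDs of tasks whose framework was never announced to the subscriber."""
    known_frameworks = set()
    orphans = []
    for event in events:
        kind = event["type"]
        if kind == "SUBSCRIBED":
            # Simplified assumption: the snapshot carries a flat list of IDs.
            known_frameworks.update(event.get("framework_ids", []))
        elif kind in ("FRAMEWORK_ADDED", "FRAMEWORK_UPDATED"):
            known_frameworks.add(event["framework_id"])
        elif kind == "TASK_ADDED":
            if event["framework_id"] not in known_frameworks:
                orphans.append(event["task_id"])
    return orphans


# The scenario from this ticket: subscribe right after failover (empty
# snapshot), then an agent reregisters with a running task.
events = [
    {"type": "SUBSCRIBED", "framework_ids": []},
    {"type": "TASK_ADDED", "task_id": "task-1", "framework_id": "fw-1"},
]
print(find_orphan_tasks(events))
```

With a {{FRAMEWORK_ADDED}} event for {{fw-1}} placed before the {{TASK_ADDED}} event, the helper would return an empty list, which is the behavior a subscriber would expect.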



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9677) RPM packages should be built with launcher sealing

2019-05-14 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9677:
--

  Resolution: Fixed
Assignee: Benno Evers
Target Version/s:   (was: 1.8.0)

> RPM packages should be built with launcher sealing
> --
>
> Key: MESOS-9677
> URL: https://issues.apache.org/jira/browse/MESOS-9677
> Project: Mesos
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.8.0
>Reporter: Benjamin Bannier
>Assignee: Benno Evers
>Priority: Major
>  Labels: integration, mesosphere, packaging, rpm
> Fix For: 1.8.1
>
>
> We should consider enabling launcher sealing in the Mesos RPM packages. Since 
> this feature is built conditionally, it is hard to write e.g., module code 
> against Mesos packages since required functions might be missing (e.g., 
> [https://github.com/dcos/dcos-mesos-modules/commit/8ce70e6cc789054831daa3058647e326b2b11bc9]
>  cannot be linked against the default RPM package anymore). The RPM's target 
> platform centos7 should include a recent enough kernel for this.





[jira] [Commented] (MESOS-9677) RPM packages should be built with launcher sealing

2019-05-14 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839611#comment-16839611
 ] 

Benno Evers commented on MESOS-9677:


master:
{noformat}
commit 7ff4263c371f8a51551199c694837cf371923137
Author: Benjamin Bannier 
Date:   Mon Apr 8 14:37:32 2019 +0200

Enabled launcher sealing for RPM packages.

We enable this flag since with it disabled certain public functions
are not available making it hard to e.g., write modules against this
version of Mesos.

While launcher sealing depends on a recent kernel, the platform we
build RPMs for already satisfies the requirements.

Review: https://reviews.apache.org/r/70295
{noformat}

1.8.x:
{noformat}
commit 3bc6082afe75390dc3b0abd58d6ce85827709b89 (origin/1.8.x, 1.8.x)
Author: Benjamin Bannier 
Date:   Mon Apr 8 14:37:32 2019 +0200

Enabled launcher sealing for RPM packages.

We enable this flag since with it disabled certain public functions
are not available making it hard to e.g., write modules against this
version of Mesos.

While launcher sealing depends on a recent kernel, the platform we
build RPMs for already satisfies the requirements.

Review: https://reviews.apache.org/r/70295
{noformat}

> RPM packages should be built with launcher sealing
> --
>
> Key: MESOS-9677
> URL: https://issues.apache.org/jira/browse/MESOS-9677
> Project: Mesos
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.8.0
>Reporter: Benjamin Bannier
>Priority: Major
>  Labels: integration, mesosphere, packaging, rpm
>
> We should consider enabling launcher sealing in the Mesos RPM packages. Since 
> this feature is built conditionally, it is hard to write e.g., module code 
> against Mesos packages since required functions might be missing (e.g., 
> [https://github.com/dcos/dcos-mesos-modules/commit/8ce70e6cc789054831daa3058647e326b2b11bc9]
>  cannot be linked against the default RPM package anymore). The RPM's target 
> platform centos7 should include a recent enough kernel for this.





[jira] [Commented] (MESOS-9329) CMake build on Fedora 28 fails due to libevent error

2019-05-14 Thread Alexander Rukletsov (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839483#comment-16839483
 ] 

Alexander Rukletsov commented on MESOS-9329:


Indeed, the autotools build uses a newer version of libevent, 
[2.0.22|https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/3rdparty/libevent-2.0.22-stable.tar.gz].
 We can't easily use it in the cmake build because newer versions do not 
support cmake, see MESOS-3529. Bottom line is: a cmake build on Linux with ssl 
and libevent enabled is currently not supported.

> CMake build on Fedora 28 fails due to libevent error
> 
>
> Key: MESOS-9329
> URL: https://issues.apache.org/jira/browse/MESOS-9329
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>
> Trying to build Mesos using cmake with the options 
> {noformat}
> cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_SSL=1 -DENABLE_LIBEVENT=1
> {noformat}
> fails due to the following:
> {noformat}
> [  1%] Building C object CMakeFiles/event_extra.dir/bufferevent_openssl.c.o
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
>  In function ‘bio_bufferevent_new’:
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:112:3:
>  error: dereferencing pointer to incomplete type ‘BIO’ {aka ‘struct bio_st’}
>   b->init = 0;
>^~
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:
>  At top level:
> /home/bevers/mesos/worktrees/master/build-cmake/3rdparty/libevent-2.1.5-beta/src/libevent-2.1.5-beta/bufferevent_openssl.c:234:1:
>  error: variable ‘methods_bufferevent’ has initializer but incomplete type
>  static BIO_METHOD methods_bufferevent = {
> [...]
> {noformat}
> Since the autotools build does not have issues when enabling libevent and 
> ssl, it seems most likely that the `libevent-2.1.5-beta` version used by 
> default in the cmake build is somehow connected to the error message.





[jira] [Commented] (MESOS-9749) mesos agent logging hangs upon systemd-journald restart

2019-05-14 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839419#comment-16839419
 ] 

Joseph Wu commented on MESOS-9749:
--

The agent ends up in a bad state because the stdout/err pipe gets filled, and 
therefore starts to block threads.  This can lead to unpredictable results 
(since we aren't sure which threads are blocked by IO).

If the logs are not written directly to journald, then you won't need a restart 
of the agent.  It should remain functional during the time journald is down.

Of course, restarting the agent is still an option.
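
To illustrate the "pipe gets filled" part in general terms, here is a small sketch using an ordinary OS pipe that nobody reads from. It is not the actual agent/journald connection, and {{fill_pipe}} is a hypothetical name. The write end is made non-blocking only so the demo can observe the full buffer instead of hanging; a blocking writer (like a logging thread) would stall at that point.

```python
import os


def fill_pipe():
    """Write to a pipe nobody reads from; return bytes accepted before it is full."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # a full pipe now raises instead of hanging
    written = 0
    try:
        while True:
            written += os.write(write_fd, b"x" * 4096)
    except BlockingIOError:
        pass  # buffer is full; a blocking writer would be stuck here
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


print(fill_pipe())
```

The number printed is the pipe's buffer capacity on the machine running it, which depends on the platform.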

> mesos agent logging hangs upon systemd-journald restart
> ---
>
> Key: MESOS-9749
> URL: https://issues.apache.org/jira/browse/MESOS-9749
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.2
> Environment: Running on centos 7.4.1708, systemd  219 (probably 
> heavily patched by centos)
> mesos-agent command:
> {code}
> /usr/sbin/mesos-slave \
>  
> --attributes='canary:canary-false;maintenance_group:group-6;network:10g;platform:centos;platform_major_version:7;rack_name:22.05;type:base;version:v2018-q-1'
>  \
>  --cgroups_enable_cfs \
>  --cgroups_hierarchy='/sys/fs/cgroup' \
>  --cgroups_net_cls_primary_handle='0xC370' \
>  --container_logger='org_apache_mesos_LogrotateContainerLogger' \
>  --containerizers='mesos' \
>  --credential='file:///etc/mesos-chef/slave-credential' \
>  
> --default_container_info='\{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"},\{"host_path":"var_tmp","container_path":"/var/tmp","mode":"RW"},\{"host_path":".","container_path":"/mnt/mesos/sandbox","mode":"RW"},\{"host_path":"/usr/share/mesos/geoip","container_path":"/mnt/mesos/geoip","mode":"RO"}]}'
>  \
>  --docker_registry='https://filer-docker-registry.prod.crto.in/' \
>  --docker_store_dir='/var/opt/mesos/store/docker' \
>  --enforce_container_disk_quota \
>  
> --executor_environment_variables='\{"PATH":"/bin:/usr/bin","CRITEO_DC":"par","CRITEO_ENV":"prod","CRITEO_GEOIP_PATH":"/mnt/mesos/geoip"}'
>  \
>  --executor_registration_timeout='5mins' \
>  --fetcher_cache_dir='/var/opt/mesos/cache' \
>  --fetcher_cache_size='2GB' \
>  --hooks='com_criteo_mesos_CommandHook' \
>  --image_providers='docker' \
>  --image_provisioner_backend='copy' \
>  
> --isolation='linux/capabilities,cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,filesystem/linux,docker/runtime,network/cni,disk/xfs,com_criteo_mesos_CommandIsolator'
>  \
>  --logging_level='INFO' \
>  
> --master='zk://mesos:xx...@mesos-master01-par.central.criteo.prod:2181,mesos-master02-par.central.criteo.prod:2181,mesos-master03-par.central.criteo.prod:2181/mesos'
>  \
>  --modules='file:///etc/mesos-chef/slave-modules.json' \
>  --port=5051 \
>  --recover='reconnect' \
>  --resources='file:///etc/mesos-chef/custom_resources.json' \
>  --strict \
>  --work_dir='/var/opt/mesos' \
>  --xfs_kill_containers \
>  --xfs_project_range='[5000-50]'
> {code}
>Reporter: Gregoire Seux
>Priority: Minor
>  Labels: foundations
>
> When the Mesos agent is launched through systemd, a restart of the 
> systemd-journald service makes agent logging hang (no more output). The 
> process itself seems to work fine (we can query state via HTTP, for instance).
> A restart of mesos-agent corrects the issue.
>  
>  





[jira] [Commented] (MESOS-9749) mesos agent logging hangs upon systemd-journald restart

2019-05-14 Thread Gregoire Seux (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839412#comment-16839412
 ] 

Gregoire Seux commented on MESOS-9749:
--

Thanks [~kaysoky] for your reply.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=771122 is very instructive 
indeed. However, it was closed with a note that it was fixed upstream (in 
systemd-217?).

On our deployment we are trying to introduce a relationship between mesos and 
journald to force a restart of mesos-slave if something restarts journald 
(https://github.com/criteo-forks/mesos_cookbook/pull/14).

The real issue, though, is not logging but the more general problem of the agent 
state after a journald restart (MESOS-9772).






[jira] [Commented] (MESOS-9749) mesos agent logging hangs upon systemd-journald restart

2019-05-14 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839408#comment-16839408
 ] 

Joseph Wu commented on MESOS-9749:
--

The default behavior of Mesos's logging is to write to stdout/stderr. When 
launching via systemd, this means you are writing to journald. And if journald 
is restarted, the pipe between the agent and journald would be broken. These 
sorts of broken pipes usually terminate the agent, but it seems to be different 
in systemd's case.
 See also: [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=771122]

There are a variety of ways to get around this, basically involving writing 
logs to some other location:

---
 
h2. Built-in solutions

Mesos lets you write stdout/stderr to disk instead.  If you specify the 
{{--log_dir}} flag, Mesos will leverage glog's log writing behavior, which has 
some form of log rotation built in.  But unfortunately, this does not seem to 
bound the size of logs on disk, so you'd end up writing a script or such to 
clean up logs.

Besides that, you may modify your service file to write to something besides 
journald, such as syslog, or a file.
https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Logging%20and%20Standard%20Input/Output
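
As a rough idea of the "script or such" for bounding {{--log_dir}} usage, here is a minimal sketch. {{prune_logs}} is a hypothetical helper and the byte budget is an assumption you would pick yourself. It skips symlinks, since glog keeps convenience symlinks pointing at the current log files.

```python
import os


def prune_logs(log_dir, max_total_bytes):
    """Delete the oldest regular files in log_dir until the total size fits the budget."""
    entries = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        # Only consider real files; leave symlinks and directories alone.
        if os.path.isfile(path) and not os.path.islink(path):
            stat = os.stat(path)
            entries.append((stat.st_mtime, stat.st_size, path))
    entries.sort()  # oldest modification time first
    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in entries:
        if total <= max_total_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

Something like {{prune_logs("/var/log/mesos", 1 << 30)}} could be run periodically (the path is only an example). Note that a budget smaller than the current log file would also delete the file still being written, so the budget should leave some headroom.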

h2. Other solutions

By the looks of your agent configuration, you are not averse to deploying 
modules ({{--modules='file:///etc/mesos-chef/slave-modules.json'}}).  In this 
case, you have some other options.

DC/OS uses a {{LogSink}} module (which is a Mesos Anonymous module implementing 
a glog module) to pipe logs to file, which are then rotated by another timer.
https://github.com/dcos/dcos-mesos-modules/tree/master/logsink

If the goal is to get logs into journald, across journald restarts, this is 
also possible with a {{LogSink}}.  This would entail using the journald C API, 
like {{sd_journal_send}}.  I believe this is capable of reconnecting after 
journald restarts.
https://www.freedesktop.org/software/systemd/man/sd_journal_print.html





[jira] [Commented] (MESOS-9749) mesos agent logging hangs upon systemd-journald restart

2019-05-14 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839372#comment-16839372
 ] 

Benjamin Mahler commented on MESOS-9749:


cc [~kaysoky]






[jira] [Commented] (MESOS-9750) Agent V1 GET_STATE response may report a complete executor's tasks as non-terminal after a graceful agent shutdown

2019-05-14 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839373#comment-16839373
 ] 

Joseph Wu commented on MESOS-9750:
--

Found one more code path where the agent's {{GET_STATE}} will return extraneous 
"launched_tasks".

This happens when a Framework or Master {{TEARDOWN}} call is used and the 
executor does not send a terminal status update in time.  This one does not 
require an agent restart/shutdown.
Also, this code path will result in an executor's checkpointed state looking 
identical to the agent shutdown case.  If the agent is restarted, the code in 
the above patch will be run to put the agent back into a consistent state.

Fix and test here: https://reviews.apache.org/r/70641/
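
For anyone who wants to spot the inconsistency from the outside, here is a minimal sketch that cross-checks a {{GET_STATE}} response already parsed into a Python dict. The nesting mirrors the text dump in the issue description; the exact JSON shape is an assumption, and {{stale_launched_tasks}} is a hypothetical helper. Matching on (framework ID, executor ID) is only a heuristic.

```python
def stale_launched_tasks(get_state):
    """Return task IDs listed as launched although their executor is completed."""
    completed = {
        (e["executor_info"]["framework_id"]["value"],
         e["executor_info"]["executor_id"]["value"])
        for e in get_state.get("get_executors", {}).get("completed_executors", [])
    }
    stale = []
    for task in get_state.get("get_tasks", {}).get("launched_tasks", []):
        key = (task["framework_id"]["value"],
               task.get("executor_id", {}).get("value"))
        if key in completed:
            stale.append(task["task_id"]["value"])
    return stale


# Placeholder IDs, shaped like the dump in the issue description.
sample = {
    "get_tasks": {"launched_tasks": [{
        "task_id": {"value": "task-1"},
        "framework_id": {"value": "fw-1"},
        "executor_id": {"value": "default"},
        "state": "TASK_RUNNING",
    }]},
    "get_executors": {"completed_executors": [{
        "executor_info": {
            "executor_id": {"value": "default"},
            "framework_id": {"value": "fw-1"},
        },
    }]},
}
print(stale_launched_tasks(sample))
```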

> Agent V1 GET_STATE response may report a complete executor's tasks as 
> non-terminal after a graceful agent shutdown
> --
>
> Key: MESOS-9750
> URL: https://issues.apache.org/jira/browse/MESOS-9750
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Affects Versions: 1.6.0, 1.7.0, 1.8.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations
>
> When the following steps occur:
> 1) A graceful shutdown is initiated on the agent (i.e. SIGUSR1 or 
> /master/machine/down).
> 2) The executor is sent a kill, and the agent counts down on 
> {{executor_shutdown_grace_period}}.
> 3) The executor exits, before all terminal status updates reach the agent. 
> This is more likely if {{executor_shutdown_grace_period}} passes.
> This results in a completed executor, with non-terminal tasks (according to 
> status updates).
> When the agent starts back up, the completed executor will be recovered and 
> show up correctly as a completed executor in {{/state}}.  However, if you 
> fetch the V1 {{GET_STATE}} result, there will be an entry in 
> {{launched_tasks}} even though nothing is running.
> {code}
> get_tasks {
>   launched_tasks {
>     name: "test-task"
>     task_id {
>       value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>     }
>     framework_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>     }
>     executor_id {
>       value: "default"
>     }
>     agent_id {
>       value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>     }
>     state: TASK_RUNNING
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     resources { ... }
>     statuses {
>       task_id {
>         value: "dff5a155-47f1-4a71-9b92-30ca059ab456"
>       }
>       state: TASK_RUNNING
>       agent_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-S0"
>       }
>       timestamp: 1556674758.2175469
>       executor_id {
>         value: "default"
>       }
>       source: SOURCE_EXECUTOR
>       uuid: "xPmn\234\236F&\235\\d\364\326\323\222\224"
>       container_status { ... }
>     }
>   }
> }
> get_executors {
>   completed_executors {
>     executor_info {
>       executor_id {
>         value: "default"
>       }
>       command {
>         value: ""
>       }
>       framework_id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>       }
>     }
>   }
> }
> get_frameworks {
>   completed_frameworks {
>     framework_info {
>       user: "user"
>       name: "default"
>       id {
>         value: "4b34a3aa-f651-44a9-9b72-58edeede94ef-"
>       }
>       checkpoint: true
>       hostname: "localhost"
>       principal: "test-principal"
>       capabilities {
>         type: MULTI_ROLE
>       }
>       capabilities {
>         type: RESERVATION_REFINEMENT
>       }
>       roles: "*"
>     }
>   }
> }
> {code}
> This happens because we combine executors and completed executors when 
> constructing the response.  The terminal task(s) with non-terminal updates 
> appear under completed executors.
> https://github.com/apache/mesos/blob/89c3dd95a421e14044bc91ceb1998ff4ae3883b4/src/slave/http.cpp#L1734-L1756





[jira] [Commented] (MESOS-9783) Centos 6 RPM build is broken on Apache CI

2019-05-14 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839310#comment-16839310
 ] 

Benno Evers commented on MESOS-9783:


master:
{noformat}
commit a9a2acabd03181865055b77cf81e7bb310b236d6
Author: Benno Evers 
Date:   Tue May 14 11:52:59 2019 +0200

Updated URL in CentOS 6 Dockerfile.

The link was pointing to a rpm package that was apparently
replaced on the upstream file server.

Review: https://reviews.apache.org/r/70639
{noformat}

1.8.x:
{noformat}
commit 5ca16bfeae19c193f4e67390543d08897a0f4ab8
Author: Benno Evers 
Date:   Tue May 14 11:52:59 2019 +0200

Updated URL in CentOS 6 Dockerfile.

The link was pointing to a rpm package that was apparently
replaced on the upstream file server.

Review: https://reviews.apache.org/r/70639
{noformat}

1.7.x:
{noformat}
commit cfc7e6e9905329460d182150a91317b3e0a75157
Author: Benno Evers 
Date:   Tue May 14 11:52:59 2019 +0200

Updated URL in CentOS 6 Dockerfile.

The link was pointing to a rpm package that was apparently
replaced on the upstream file server.

Review: https://reviews.apache.org/r/70639
{noformat}

> Centos 6 RPM build is broken on Apache CI
> -
>
> Key: MESOS-9783
> URL: https://issues.apache.org/jira/browse/MESOS-9783
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Major
>  Labels: foundations
>
> The centos 6 rpm build on the Apache CI on `build.apache.org` has been broken 
> since April 16, as it fails on the following step:
> {noformat}
> RUN  rpm -Uvh --replacepkgs \
>   
> http://yum.postgresql.org/9.5/redhat/rhel-6-x86_64/pgdg-centos95-9.5-2.noarch.rpm
> {noformat}
> The URL returns a 404 response because the package was removed from the 
> upstream fileserver.





[jira] [Created] (MESOS-9784) Server side SSL Certificate Validation

2019-05-14 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9784:
-

 Summary: Server side SSL Certificate Validation
 Key: MESOS-9784
 URL: https://issues.apache.org/jira/browse/MESOS-9784
 Project: Mesos
  Issue Type: Epic
Reporter: Vinod Kone
Assignee: Benno Evers








[jira] [Commented] (MESOS-9783) Centos 6 RPM build is broken on Apache CI

2019-05-14 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839282#comment-16839282
 ] 

Benno Evers commented on MESOS-9783:


https://reviews.apache.org/r/70639/diff/1#index_header






[jira] [Assigned] (MESOS-9783) Centos 6 RPM build is broken on Apache CI

2019-05-14 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-9783:
--

Assignee: Benno Evers






[jira] [Created] (MESOS-9783) Centos 6 RPM build is broken on Apache CI

2019-05-14 Thread Benno Evers (JIRA)
Benno Evers created MESOS-9783:
--

 Summary: Centos 6 RPM build is broken on Apache CI
 Key: MESOS-9783
 URL: https://issues.apache.org/jira/browse/MESOS-9783
 Project: Mesos
  Issue Type: Improvement
Reporter: Benno Evers


The centos 6 rpm build on the Apache CI on `build.apache.org` has been broken 
since April 16, as it fails on the following step:

{noformat}
RUN  rpm -Uvh --replacepkgs \
  
http://yum.postgresql.org/9.5/redhat/rhel-6-x86_64/pgdg-centos95-9.5-2.noarch.rpm
{noformat}

The URL returns a 404 response because the package was removed from the 
upstream fileserver.
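
A minimal sketch of a check that could catch this kind of breakage before the Docker build runs, assuming it is acceptable to issue a HEAD request against the package URL. {{url_status}} is a hypothetical helper; the {{opener}} parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def url_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; report the code instead.
        return error.code
```

Calling {{url_status}} on each URL referenced by a Dockerfile and failing early on anything other than 200 would point directly at the missing package instead of a failed {{rpm}} step.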


