[jira] [Commented] (MESOS-9580) Master sends inconsistent `UpdateFrameworkMessage` to agents.

2019-02-15 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770020#comment-16770020
 ] 

Chun-Hung Hsiao commented on MESOS-9580:


Some more comments about this issue.

The current behavior, although it creates such an inconsistency between the 
master and the agents, can handle the following scenario:
1. A framework reregistered with a new user.
2. The master broadcast the new framework info and all agents received it (the 
inconsistency occurred).
3. The master failed over.
4. Regardless of whether the framework or an agent registered first, the new 
master would always end up with the same framework info.

> Master sends inconsistent `UpdateFrameworkMessage` to agents.
> -
>
> Key: MESOS-9580
> URL: https://issues.apache.org/jira/browse/MESOS-9580
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Chun-Hung Hsiao
>Priority: Major
>  Labels: foundations
>
> If a framework reregisters with a new user, the master would ignore the user 
> update because of MESOS-703:
>  
> [https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/framework.cpp#L526-L529]
> However, it would send the framework info *coming from the framework* (i.e., 
> with the new user) to all agents:
>  
> [https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L2748-L2757]
>  
> [https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L3156-L3162]
> But, when an agent reregistered, the master would send the framework info 
> from its in-memory state:
>  
> [https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L7827-L7842]
> This would make the framework info inconsistent between the master and some 
> of its agents. Although it won't affect executor and task launches (as the 
> framework info would be injected into {{RunTask(Group)*Message}}), if there 
> is a master failover, a race between framework and agent reregistrations 
> could make the new master learn different framework info.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9581) Mesos package naming appears to be non-deterministic.

2019-02-15 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-9581:
-

 Summary: Mesos package naming appears to be non-deterministic.
 Key: MESOS-9581
 URL: https://issues.apache.org/jira/browse/MESOS-9581
 Project: Mesos
  Issue Type: Bug
  Components: build
Affects Versions: 1.7.1
Reporter: Till Toenshoff


Transcribed from Slack: 
https://mesos.slack.com/archives/C7N086PK2/p1550158266006900

It appears there are a number of RPM packages called 
“mesos-1.7.1-2.0.1.el7.x86_64.rpm” in the wild.

I’ve caught specimens with build dates of February 1st, 7th, and 13th. While 
that is somewhat troubling in itself, none of these packages is the one referred 
to in the Yum repository metadata (repos.mesosphere.com), which is a package 
built today, on the 14th, so I can’t install Mesos right now.

Could it be that your pipeline is creating a new package with the same version 
and release in every nightly build?

Repository metadata:
{noformat}
sqlite3 *primary.sqlite "select name, version, release, strftime('%d-%m-%Y 
%H:%M', datetime(time_build, 'unixepoch')) build_as_string, rpm_buildhost from 
packages where name = 'mesos' and version = '1.7.1';"
mesos|1.7.1|2.0.1|14-02-2019 12:30|ip-172-16-10-254.ec2.internal
{noformat}

Packages downloaded while investigating over the past few days:
{noformat}
Name : mesos
Version : 1.7.1
Release : 2.0.1
Architecture: x86_64
Install Date: (not installed)
Group : misc
Size : 298787793
License : Apache-2.0
Signature : RSA/SHA256, Fri 01 Feb 2019 11:38:47 PM UTC, Key ID df7d54cbe56151bf
Source RPM : mesos-1.7.1-2.0.1.src.rpm
Build Date : Fri 01 Feb 2019 11:15:17 PM UTC
Build Host : ip-172-16-10-11.ec2.internal
Relocations : / 
Packager : d...@mesos.apache.org
URL : https://mesos.apache.org/
Summary : Cluster resource manager with efficient resource isolation
Description :
[snip]

Name : mesos
Version : 1.7.1
Release : 2.0.1
Architecture: x86_64
Install Date: (not installed)
Group : misc
Size : 298791347
License : Apache-2.0
Signature : RSA/SHA256, Thu 07 Feb 2019 10:33:06 PM UTC, Key ID df7d54cbe56151bf
Source RPM : mesos-1.7.1-2.0.1.src.rpm
Build Date : Thu 07 Feb 2019 10:31:02 PM UTC
Build Host : ip-172-16-10-4.ec2.internal
Relocations : / 
Packager : d...@mesos.apache.org
URL : https://mesos.apache.org/
Summary : Cluster resource manager with efficient resource isolation
Description :
[snip]

Name : mesos
Version : 1.7.1
Release : 2.0.1
Architecture: x86_64
Install Date: (not installed)
Group : misc
Size : 298789309
License : Apache-2.0
Signature : RSA/SHA256, Wed Feb 13 04:35:02 2019, Key ID df7d54cbe56151bf
Source RPM : mesos-1.7.1-2.0.1.src.rpm
Build Date : Wed Feb 13 04:32:41 2019
Build Host : ip-172-16-10-83.ec2.internal
Relocations : / 
Packager : d...@mesos.apache.org
URL : https://mesos.apache.org/
Summary : Cluster resource manager with efficient resource isolation
Description :
 {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9580) Master sends inconsistent `UpdateFrameworkMessage` to agents.

2019-02-15 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-9580:
--

 Summary: Master sends inconsistent `UpdateFrameworkMessage` to 
agents.
 Key: MESOS-9580
 URL: https://issues.apache.org/jira/browse/MESOS-9580
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Chun-Hung Hsiao


If a framework reregisters with a new user, the master would ignore the user 
update because of MESOS-703:
[https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/framework.cpp#L526-L529]

However, it would send the framework info *coming from the framework* (i.e., 
with the new user) to all agents:
[https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L2748-L2757]
[https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L3156-L3162]

But, when an agent reregistered, the master would send the framework info from 
its in-memory state:
[https://github.com/apache/mesos/blob/f1dc50568dcc90cec7158205dca86a2398a42dcd/src/master/master.cpp#L7827-L7842]

This would make the framework info inconsistent between the master and some of 
its agents. Although it won't affect executor and task launches (as the 
framework info would be injected into {{RunTask(Group)*Message}}), if there is 
a master failover, a race between framework and agent reregistrations could 
make the new master learn different framework info.
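
To make the divergence concrete, here is a self-contained model (hypothetical 
{{Master}}/{{Agent}} structs and field names; not the actual Mesos code) of the 
two paths described above. After the framework reregisters with a new user, an 
agent that received the broadcast and an agent that reregistered afterwards 
disagree about the framework's user:

{code}
#include <iostream>
#include <string>
#include <vector>

struct FrameworkInfo
{
  std::string user;
  std::string name;
};

struct Agent
{
  FrameworkInfo known; // What this agent believes about the framework.
};

struct Master
{
  FrameworkInfo state;        // The master's in-memory framework info.
  std::vector<Agent*> agents;

  // Framework reregistration: the user change is ignored in the master's own
  // state (cf. MESOS-703), but the *submitted* info is broadcast to agents.
  void updateFramework(const FrameworkInfo& submitted)
  {
    FrameworkInfo accepted = submitted;
    accepted.user = state.user; // Ignore the user update.
    state = accepted;

    for (Agent* agent : agents) {
      agent->known = submitted; // The inconsistency: unfiltered info is sent.
    }
  }

  // Agent reregistration: the master's in-memory info is sent instead.
  void reregisterAgent(Agent* agent)
  {
    agent->known = state;
  }
};

int main()
{
  Agent a1, a2;
  Master master{{"alice", "framework"}, {&a1, &a2}};
  a1.known = a2.known = master.state;

  master.updateFramework({"bob", "framework"}); // Framework changes its user.
  master.reregisterAgent(&a2);                  // a2 reregisters later.

  std::cout << "master sees user: " << master.state.user << "\n"  // alice
            << "agent1 sees user: " << a1.known.user << "\n"      // bob
            << "agent2 sees user: " << a2.known.user << "\n";     // alice
  return 0;
}
{code}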



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9191) Docker command executor may get stuck in an infinite unkillable loop.

2019-02-15 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769892#comment-16769892
 ] 

Greg Mann commented on MESOS-9191:
--

Retargeted 1.6.3.

> Docker command executor may get stuck in an infinite unkillable loop.
> --
>
> Key: MESOS-9191
> URL: https://issues.apache.org/jira/browse/MESOS-9191
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Reporter: Gilbert Song
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer
>
> Due to the change from https://issues.apache.org/jira/browse/MESOS-8574, the 
> way the docker command executor discards the future of docker stop was 
> changed. If a new killTask() is invoked while an existing docker stop is 
> still pending, the pending future is discarded and the new stop is executed. 
> This is fine in most cases.
> However, docker stop could take a long time (depending on the grace period 
> and whether the application handles SIGTERM). If the framework retries 
> killTask more frequently than the grace period (which depends on the kill 
> policy API, environment variables, or agent flags), then the executor may be 
> stuck forever with unkillable tasks, because every time, before the docker 
> stop finishes, its future is discarded by the newly arriving killTask.
> We should consider reusing the grace period (i.e., letting it elapse) before 
> calling discard() on a pending docker stop future.
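
The suggested direction, as a minimal standalone sketch (hypothetical 
{{Executor}}/{{PendingStop}} types using std::chrono instead of libprocess 
futures; not the real docker executor): a retried killTask only discards the 
pending docker stop once that stop's grace period has elapsed.

{code}
#include <chrono>
#include <iostream>
#include <optional>

using Clock = std::chrono::steady_clock;

// Tracks the `docker stop` that is currently in flight.
struct PendingStop
{
  Clock::time_point started;
  std::chrono::seconds gracePeriod;
};

struct Executor
{
  std::optional<PendingStop> pendingStop;

  void killTask(std::chrono::seconds gracePeriod)
  {
    const Clock::time_point now = Clock::now();

    if (pendingStop) {
      if (now - pendingStop->started < pendingStop->gracePeriod) {
        // The previous `docker stop` still has time to escalate to SIGKILL;
        // discarding its future here is what causes the unkillable loop.
        std::cout << "kill already in progress; ignoring retried killTask\n";
        return;
      }

      // Grace period elapsed: in the real executor this is where the old
      // future would be discarded before issuing a new `docker stop`.
      std::cout << "grace period elapsed; discarding previous stop\n";
    }

    pendingStop = PendingStop{now, gracePeriod};
    std::cout << "issuing docker stop with a "
              << gracePeriod.count() << "s grace period\n";
  }
};

int main()
{
  Executor executor;
  executor.killTask(std::chrono::seconds(30)); // Starts the first stop.
  executor.killTask(std::chrono::seconds(30)); // Retry: ignored, not discarded.
  return 0;
}
{code}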



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9579) ExecutorHttpApiTest.HeartbeatCalls is flaky.

2019-02-15 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-9579:
-

 Summary: ExecutorHttpApiTest.HeartbeatCalls is flaky.
 Key: MESOS-9579
 URL: https://issues.apache.org/jira/browse/MESOS-9579
 Project: Mesos
  Issue Type: Bug
  Components: executor
Affects Versions: 1.8.0
 Environment: Centos 6
Reporter: Till Toenshoff


I just saw this failing on our internal CI:
{noformat}
21:42:35 [ RUN ] ExecutorHttpApiTest.HeartbeatCalls
21:42:35 I0215 21:42:35.917752 17173 executor.cpp:206] Version: 1.8.0
21:42:35 W0215 21:42:35.917771 17173 process.cpp:2829] Attempted to spawn 
already running process version@172.16.10.166:35439
21:42:35 I0215 21:42:35.918581 17174 executor.cpp:432] Connected with the agent
21:42:35 F0215 21:42:35.918857 17174 owned.hpp:112] Check failed: 'get()' Must 
be non NULL 
21:42:35 *** Check failure stack trace: ***
21:42:35 @ 0x7fb93ce1d1dd google::LogMessage::Fail()
21:42:35 @ 0x7fb93ce1ee7d google::LogMessage::SendToLog()
21:42:35 @ 0x7fb93ce1cdb3 google::LogMessage::Flush()
21:42:35 @ 0x7fb93ce1f879 google::LogMessageFatal::~LogMessageFatal()
21:42:35 @ 0x55e80a099f76 google::CheckNotNull<>()
21:42:35 @ 0x55e80a07dde4 
_ZNSt17_Function_handlerIFvvEZN5mesos8internal5tests39ExecutorHttpApiTest_HeartbeatCalls_Test8TestBodyEvEUlvE_E9_M_invokeERKSt9_Any_data
21:42:35 @ 0x7fb93baea260 process::AsyncExecutorProcess::execute<>()
21:42:35 @ 0x7fb93baf62cb 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEESG_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSL_FSI_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISA_EESt14default_deleteISW_EEOSE_S3_E_JSZ_SE_St12_PlaceholderILi1EEclEOS3_
21:42:36 @ 0x7fb93cd646b1 process::ProcessBase::consume()
21:42:36 @ 0x7fb93cd794ba process::ProcessManager::resume()
21:42:36 @ 0x7fb93cd7d486 
_ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
21:42:36 @ 0x7fb93d02a1af execute_native_thread_routine
21:42:36 @ 0x7fb939794aa1 start_thread
21:42:36 @ 0x7fb938b39c4d clone
21:42:36 The test binary has crashed OR the timeout has been exceeded!
21:42:36 ~/workspace/mesos/Mesos_CI-build/FLAG/SSL/label/mesos-ec2-centos-6
21:42:36 mkswap: /tmp/swapfile: warning: don't erase bootbits sectors
21:42:36 on whole disk. Use -f to force.
21:42:36 Setting up swapspace version 1, size = 8388604 KiB
21:42:36 no label, UUID=dda5aa26-dba6-4ac8-bc6c-41264f510694
21:42:36 gcc (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
21:42:36 Copyright (C) 2016 Free Software Foundation, Inc.
21:42:36 This is free software; see the source for copying conditions. There is 
NO
21:42:36 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR 
PURPOSE.
21:42:36 Docker version 1.7.1, build 786b29d
21:42:36 curl 7.61.1 (x86_64-redhat-linux-gnu) libcurl/7.61.1 OpenSSL/1.0.1e 
zlib/1.2.3 c-ares/1.14.0 libssh2/1.8.0 nghttp2/1.6.0
21:42:36 Release-Date: 2018-09-05{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8887) Unreachable tasks are not GC'ed when unreachable agent is GC'ed.

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769813#comment-16769813
 ] 

Vinod Kone commented on MESOS-8887:
---

Landed on master:

commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone 
Date:   Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone 
Date:   Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of
the master, but they were only being removed from the former when
an unreachable agent is GCed from the registry. This patch fixes it
so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone 
Date:   Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base.
No functional changes.

Review: https://reviews.apache.org/r/69907


Backported to 1.7.x

commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9
Author: Vinod Kone 
Date:   Fri Feb 15 14:33:00 2019 -0600

Added MESOS-8887 to the 1.7.2 CHANGELOG.

commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone 
Date:   Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone 
Date:   Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of
the master, but they were only being removed from the former when
an unreachable agent is GCed from the registry. This patch fixes it
so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone 
Date:   Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base.
No functional changes.

Review: https://reviews.apache.org/r/69907


> Unreachable tasks are not GC'ed when unreachable agent is GC'ed.
> 
>
> Key: MESOS-8887
> URL: https://issues.apache.org/jira/browse/MESOS-8887
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.4.3, 1.5.2, 1.6.1, 1.7.1
>Reporter: Gilbert Song
>Assignee: Vinod Kone
>Priority: Major
>  Labels: foundations, mesosphere, partition, registry
>
> Unreachable agents will be GC'ed by the master registry after the 
> `--registry_max_agent_age` duration or once `--registry_max_agent_count` is 
> exceeded. When the GC happens, the agent is removed from the master's 
> unreachable agent list, but its corresponding tasks are still in UNREACHABLE 
> state in the framework struct (though removed from `slaves.unreachableTasks`). 
> We should instead remove those tasks from everywhere, or transition them to a 
> terminal state, either TASK_LOST or TASK_GONE (further discussion is needed 
> to define the semantics).
> This improvement relates to how we want to couple task updates with agent GC; 
> right now they are somewhat decoupled.
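
The gist of the fix, as a simplified standalone model (hypothetical 
{{Master}}/{{Framework}} types and field names; not the actual Mesos code): 
when an unreachable agent is GC'ed, its tasks have to be dropped from the 
per-framework bookkeeping as well, not only from the master-wide map.

{code}
#include <map>
#include <set>
#include <string>

using SlaveID = std::string;
using TaskID = std::string;

struct Framework
{
  // Task ID -> the agent it was last seen on, for tasks marked unreachable.
  std::map<TaskID, SlaveID> unreachableTasks;
};

struct Master
{
  // Master-wide bookkeeping of unreachable agents and their tasks.
  std::map<SlaveID, std::set<TaskID>> unreachableAgents;
  std::map<std::string, Framework> frameworks;

  void gcUnreachableAgent(const SlaveID& slaveId)
  {
    // Previously only this master-wide map was cleaned up...
    unreachableAgents.erase(slaveId);

    // ...the fix also drops the agent's tasks from every framework struct.
    for (auto& entry : frameworks) {
      Framework& framework = entry.second;
      for (auto it = framework.unreachableTasks.begin();
           it != framework.unreachableTasks.end();) {
        if (it->second == slaveId) {
          it = framework.unreachableTasks.erase(it);
        } else {
          ++it;
        }
      }
    }
  }
};

int main()
{
  Master master;
  master.unreachableAgents["S1"] = {"task1"};
  master.frameworks["f1"].unreachableTasks["task1"] = "S1";

  master.gcUnreachableAgent("S1"); // Cleans up both structures.

  return master.frameworks["f1"].unreachableTasks.empty() ? 0 : 1;
}
{code}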



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8892) MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769822#comment-16769822
 ] 

Vinod Kone commented on MESOS-8892:
---

Observed this on the 1.6.x branch:

{code}
[ RUN  ] MasterSlaveReconciliationTest.ReconcileDroppedOperation
I0215 21:36:18.921594  4052 cluster.cpp:172] Creating default 'local' authorizer
I0215 21:36:18.922894  4057 master.cpp:465] Master 
21d3c979-83c3-4141-9a3a-635fd550d45a (ip-172-16-10-236.ec2.internal) started on 
172.16.10.236:36326
I0215 21:36:18.922915  4057 master.cpp:468] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator
="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwri
te="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/exYTvt/credentials" 
--filter_gpu_resources="true" --framework_s
orter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize=
"true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per
_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memo
ry" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --reg
istry_strict="false" --require_agent_domain="false" --role_sorter="drf" 
--root_submissions="true" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/exY
Tvt/master" --zk_session_timeout="10secs"
I0215 21:36:18.923121  4057 master.cpp:517] Master only allowing authenticated 
frameworks to register
I0215 21:36:18.923393  4057 master.cpp:523] Master only allowing authenticated 
agents to register
I0215 21:36:18.923408  4057 master.cpp:529] Master only allowing authenticated 
HTTP frameworks to register
I0215 21:36:18.923414  4057 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/exYTvt/credentials'
I0215 21:36:18.923651  4057 master.cpp:573] Using default 'crammd5' 
authenticator
I0215 21:36:18.923777  4057 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0215 21:36:18.923904  4057 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0215 21:36:18.924266  4057 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0215 21:36:18.924465  4057 master.cpp:654] Authorization enabled
I0215 21:36:18.924823  4056 hierarchical.cpp:179] Initialized hierarchical 
allocator process
I0215 21:36:18.927826  4058 whitelist_watcher.cpp:77] No whitelist given
I0215 21:36:18.928741  4054 master.cpp:2176] Elected as the leading master!
I0215 21:36:18.928759  4054 master.cpp:1711] Recovering from registrar
I0215 21:36:18.928800  4054 registrar.cpp:339] Recovering registrar
I0215 21:36:18.929002  4054 registrar.cpp:383] Successfully fetched the 
registry (0B) in 132096ns
I0215 21:36:18.929033  4054 registrar.cpp:487] Applied 1 operations in 7184ns; 
attempting to update the registry
I0215 21:36:18.929154  4058 registrar.cpp:544] Successfully updated the 
registry in 108032ns
I0215 21:36:18.929232  4058 registrar.cpp:416] Successfully recovered registrar
I0215 21:36:18.929361  4055 master.cpp:1825] Recovered 0 agents from the 
registry (176B); allowing 10mins for agents to reregister
I0215 21:36:18.929415  4055 hierarchical.cpp:217] Skipping recovery of 
hierarchical allocator: nothing to recover
W0215 21:36:18.931118  4052 process.cpp:2829] Attempted to spawn already 
running process files@172.16.10.236:36326
I0215 21:36:18.931596  4052 containerizer.cpp:300] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I0215 21:36:18.934453  4052 linux_launcher.cpp:147] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0215 21:36:18.934859  4052 provisioner.cpp:299] Using default backend 'aufs'
I0215 21:36:18.935410  4052 cluster.cpp:460] Creating default 'local' authorizer
I0215 21:36:18.936164  4060 slave.cpp:259] Mesos agent started on 
(230)@172.16.10.236:36326
W0215 21:36:18.936399  4052 process.cpp:2829] Attempted to spawn already 
running process version@172.16.10.236:36326
I0215 21:36:18.936187  4060 slave.cpp:260] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://"; 
--appc_store_dir="/tmp/exYTvt/GHfic5/store/appc" --authenticate
_http_exe

[jira] [Commented] (MESOS-8892) MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769824#comment-16769824
 ] 

Vinod Kone commented on MESOS-8892:
---

[~bbannier] Can we backport this test fix to the 1.6.x branch?

> MasterSlaveReconciliationTest.ReconcileDroppedOperation is flaky
> 
>
> Key: MESOS-8892
> URL: https://issues.apache.org/jira/browse/MESOS-8892
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.6.0
>Reporter: Greg Mann
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.7.0
>
> Attachments: 
> MasterSlaveReconciliationTest.ReconcileDroppedOperation.txt
>
>
> This was observed on a Debian 9 SSL/GRPC-enabled build. It appears that a 
> poorly-timed {{UpdateSlaveMessage}} leads to the operation reconciliation 
> occurring before the expectation for the {{ReconcileOperationsMessage}} is 
> registered:
> {code}
> I0508 00:11:09.700815 22498 master.cpp:4362] Processing ACCEPT call for 
> offers: [ f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-O0 ] on agent 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 
> (localhost) for framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (default) 
> at scheduler-b0f55e01-2f6f-42c8-8614-901036acfc31@127.0.0.1:36309
> I0508 00:11:09.700870 22498 master.cpp:3602] Authorizing principal 
> 'test-principal' to reserve resources 'cpus(allocated: 
> default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):2; 
> mem(allocated: default-role)(reservations: 
> [(DYNAMIC,default-role,test-principal)]):1024; disk(allocated: 
> default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; 
> ports(allocated: default-role)(reservations: 
> [(DYNAMIC,default-role,test-principal)]):[31000-32000]'
> I0508 00:11:09.701228 22493 master.cpp:4725] Applying RESERVE operation for 
> resources 
> [{"allocation_info":{"role":"default-role"},"name":"cpus","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":2.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"mem","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"disk","reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"default-role"},"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"reservations":[{"principal":"test-principal","role":"default-role","type":"DYNAMIC"}],"type":"RANGES"}]
>  from framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (default) at 
> scheduler-b0f55e01-2f6f-42c8-8614-901036acfc31@127.0.0.1:36309 to agent 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 
> (localhost)
> I0508 00:11:09.701498 22493 master.cpp:11265] Sending operation '' (uuid: 
> 81dffb62-6e75-4c6c-a97b-41c92c58d6a7) to agent 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 
> (localhost)
> I0508 00:11:09.701627 22494 slave.cpp:1564] Forwarding agent update 
> {"operations":{},"resource_version_uuid":{"value":"0HeA06ftS6m76SNoNZNPag=="},"slave_id":{"value":"f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0"},"update_oversubscribed_resources":true}
> I0508 00:11:09.701848 22494 master.cpp:7800] Received update of agent 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 at slave(212)@127.0.0.1:36309 
> (localhost) with total oversubscribed resources {}
> W0508 00:11:09.701905 22494 master.cpp:7974] Performing explicit 
> reconciliation with agent for known operation 
> 81dffb62-6e75-4c6c-a97b-41c92c58d6a7 since it was not present in original 
> reconciliation message from agent
> I0508 00:11:09.702085 22494 master.cpp:11015] Updating the state of operation 
> '' (uuid: 81dffb62-6e75-4c6c-a97b-41c92c58d6a7) for framework 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- (latest state: OPERATION_PENDING, 
> status update state: OPERATION_DROPPED)
> I0508 00:11:09.702239 22491 hierarchical.cpp:925] Updated allocation of 
> framework f850080d-9c7a-4ff7-8d4b-9e54aa0418cb- on agent 
> f850080d-9c7a-4ff7-8d4b-9e54aa0418cb-S0 from cpus(allocated: default-role):2; 
> mem(allocated: default-role):1024; disk(allocated: default-role):1024; 
> ports(allocated: default-role):[31000-32000] to disk(allocated: 
> default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; 
> cpus(allocated: default-role)(reservations: 
> [(DYNAMIC,default-role,test-principal)]):2; mem(allocated: 
> default-role)(reservations: [(DYNAMIC,default-role,test-principal)]):1024; 
> ports(allocated: default-role)(reservations: 
> [(DYNAMIC,default-role,

[jira] [Comment Edited] (MESOS-8887) Unreachable tasks are not GC'ed when unreachable agent is GC'ed.

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769813#comment-16769813
 ] 

Vinod Kone edited comment on MESOS-8887 at 2/15/19 10:38 PM:
-

Landed on master:
---
commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone 
Date:   Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone 
Date:   Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of
the master, but they were only being removed from the former when
an unreachable agent is GCed from the registry. This patch fixes it
so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone 
Date:   Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base.
No functional changes.

Review: https://reviews.apache.org/r/69907



---
Backported to 1.7.x
---
commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9
Author: Vinod Kone 
Date:   Fri Feb 15 14:33:00 2019 -0600

Added MESOS-8887 to the 1.7.2 CHANGELOG.

commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone 
Date:   Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone 
Date:   Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of
the master, but they were only being removed from the former when
an unreachable agent is GCed from the registry. This patch fixes it
so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone 
Date:   Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base.
No functional changes.

Review: https://reviews.apache.org/r/69907



was (Author: vinodkone):
Landed on master:

commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone 
Date:   Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone 
Date:   Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of
the master, but they were only being removed from the former when
an unreachable agent is GCed from the registry. This patch fixes it
so that the latter is also cleaned up.

Review: https://reviews.apache.org/r/69908

commit f0cd3b7b62807fe377b1b47bf1bf364b18c4a373
Author: Vinod Kone 
Date:   Sat Feb 2 09:51:09 2019 -0600

Fixed variable names in `Master::_doRegistryGC()`.

Substituted `slave` with `slaveId` to be consistent with the code base.
No functional changes.

Review: https://reviews.apache.org/r/69907


Backported to 1.7.x

commit 6fcf70167076bbe6fb10ca04876939fe0e3379d9
Author: Vinod Kone 
Date:   Fri Feb 15 14:33:00 2019 -0600

Added MESOS-8887 to the 1.7.2 CHANGELOG.

commit 1a506a4536a4b79dba6634d8dc627eaf2a55caba
Author: Vinod Kone 
Date:   Tue Feb 5 16:55:19 2019 -0600

Tested unreachable task behavior on agent GC.

Updated `PartitionTest, RegistryGcByCount` test. This test fails
without the previous patch.

Review: https://reviews.apache.org/r/69909

commit c72a4f909054e5efa75d9e5d8dde71b0083402c1
Author: Vinod Kone 
Date:   Sat Feb 2 10:01:56 2019 -0600

Removed unreachable tasks from `Master::Framework` on agent GC.

Unreachable tasks are stored in `Slaves` and `Framework` structs of
the master, but they were only being removed from the former when
an unreachable

[jira] [Comment Edited] (MESOS-9143) MasterQuotaTest.RemoveSingleQuota is flaky.

2019-02-15 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769794#comment-16769794
 ] 

Meng Zhu edited comment on MESOS-9143 at 2/15/19 10:15 PM:
---

{noformat}
commit 4380e5ba999b31782ef2fb32f51a1f225d28f5c5
Date:   Wed Feb 13 16:06:26 2019 -0800

Fixed a flaky test `MasterQuotaTest.RemoveSingleQuota`.

The test is flaky due to a race between metrics update
and metrics query.

This patch adds clock settle to ensure quota update and
removal are fully processed (including metrics updates) before
continuing with the metrics query.

Review: https://reviews.apache.org/r/69981
{noformat}
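
For reference, a minimal sketch of the clock-settling pattern the fix relies on 
(a hypothetical helper around libprocess's {{Clock}}; not the actual patch):

{code}
#include <process/clock.hpp>

// Hypothetical helper illustrating the pattern: pause the clock, perform the
// quota update/removal, then settle so that every pending libprocess event
// (including the metrics update) is processed before metrics are queried.
void settleBeforeMetricsQuery()
{
  process::Clock::pause();

  // ... issue the quota update / removal request here ...

  // Blocks until libprocess has no outstanding events left to process.
  process::Clock::settle();

  // ... now query and assert on /metrics/snapshot ...

  process::Clock::resume();
}
{code}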


was (Author: mzhu):
commit 4380e5ba999b31782ef2fb32f51a1f225d28f5c5
Date:   Wed Feb 13 16:06:26 2019 -0800

Fixed a flaky test `MasterQuotaTest.RemoveSingleQuota`.

The test is flaky due to a race between metrics update
and metrics query.

This patch adds clock settle to ensure quota update and
removal are fully processed (including metrics updates) before
continuing with the metrics query.

Review: https://reviews.apache.org/r/69981

> MasterQuotaTest.RemoveSingleQuota is flaky.
> ---
>
> Key: MESOS-9143
> URL: https://issues.apache.org/jira/browse/MESOS-9143
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Alexander Rukletsov
>Assignee: Meng Zhu
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere, resource-management
> Fix For: 1.8.0
>
> Attachments: RemoveSingleQuota-badrun.txt
>
>
> {noformat}
> ../../src/tests/master_quota_tests.cpp:493
> Value of: metrics.at(metricKey).isNone()
>   Actual: false
> Expected: true
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9490) Support accepting gzipped responses in libprocess

2019-02-15 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769771#comment-16769771
 ] 

Benjamin Mahler commented on MESOS-9490:


[~bennoe] Can you include the stack trace of the CHECK failure?

> Support accepting gzipped responses in libprocess
> -
>
> Key: MESOS-9490
> URL: https://issues.apache.org/jira/browse/MESOS-9490
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>  Labels: libprocess
>
> Currently all libprocess endpoints support serving gzipped responses when 
> the client requests this with an `Accept-Encoding: gzip` header.
> However, libprocess does not support receiving gzipped responses, failing 
> with a decode error in this case.
> For symmetry, we should try to support compression in this case as well.
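
For context, decoding a gzip response body on the client side mostly comes down 
to initializing zlib with the gzip window-bits flag. A minimal standalone sketch 
(plain zlib, with a hypothetical {{gunzip}} helper; not libprocess's actual 
decoder):

{code}
#include <stdexcept>
#include <string>

#include <zlib.h>

// Decompresses a gzip-encoded buffer, e.g. a response body received with
// `Content-Encoding: gzip`.
std::string gunzip(const std::string& compressed)
{
  z_stream stream{};

  // 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
  if (inflateInit2(&stream, 16 + MAX_WBITS) != Z_OK) {
    throw std::runtime_error("inflateInit2 failed");
  }

  stream.next_in =
    reinterpret_cast<Bytef*>(const_cast<char*>(compressed.data()));
  stream.avail_in = static_cast<uInt>(compressed.size());

  std::string result;
  char buffer[4096];
  int status = Z_OK;

  while (status != Z_STREAM_END) {
    stream.next_out = reinterpret_cast<Bytef*>(buffer);
    stream.avail_out = sizeof(buffer);

    status = inflate(&stream, Z_NO_FLUSH);
    if (status != Z_OK && status != Z_STREAM_END) {
      inflateEnd(&stream);
      throw std::runtime_error("inflate failed");
    }

    result.append(buffer, sizeof(buffer) - stream.avail_out);
  }

  inflateEnd(&stream);
  return result;
}
{code}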



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8750) Check failed: !slaves.registered.contains(task->slave_id)

2019-02-15 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769727#comment-16769727
 ] 

Vinod Kone commented on MESOS-8750:
---

[~megha.sharma] [~xujyan] Why was this not backported to older versions?

> Check failed: !slaves.registered.contains(task->slave_id)
> -
>
> Key: MESOS-8750
> URL: https://issues.apache.org/jira/browse/MESOS-8750
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.6.0
>Reporter: Megha Sharma
>Assignee: Megha Sharma
>Priority: Critical
> Fix For: 1.6.0
>
>
> It appears that in certain circumstances an unreachable task doesn't get 
> cleaned up from {{framework.unreachableTasks}} when the respective agent 
> reregisters, leading to this check failure later when the framework is being 
> removed. When an agent goes unreachable, the master adds that agent's tasks 
> to {{framework.unreachableTasks}}. When such an agent reregisters, the master 
> removes the tasks it reports during reregistration from this data structure, 
> but there could be tasks that the agent doesn't know about (e.g. if the 
> runTask message for them got dropped), and such tasks will not get removed 
> from unreachableTasks.
> {noformat}
> F0310 13:30:58.856665 62740 master.cpp:9671] Check failed: 
> !slaves.registered.contains(task->slave_id()) Unreachable task  of 
> framework 4f57975b-05dd-4118-8674-5b29a86c6a6c-0850 was found on registered 
> agent 683c4a92-b5a0-490c-998a-6113fc86d37a-S1428
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9578) Document per framework minimal allocatable resources in framework development guides

2019-02-15 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769467#comment-16769467
 ] 

Benjamin Mahler commented on MESOS-9578:


It would also be nice to document this in the multi-scheduler scalability 
guidelines.

> Document per framework minimal allocatable resources in framework development 
> guides
> 
>
> Key: MESOS-9578
> URL: https://issues.apache.org/jira/browse/MESOS-9578
> Project: Mesos
>  Issue Type: Task
>  Components: documentation
>Reporter: Benjamin Bannier
>Priority: Blocker
>
> With MESOS-9523 we introduced fields into {{FrameworkInfo}} to give 
> frameworks a way to express their resource requirements. We should document 
> this feature in the framework development guide(s).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9578) Document per framework minimal allocatable resources in framework development guides

2019-02-15 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-9578:
---

 Summary: Document per framework minimal allocatable resources in 
framework development guides
 Key: MESOS-9578
 URL: https://issues.apache.org/jira/browse/MESOS-9578
 Project: Mesos
  Issue Type: Task
  Components: documentation
Reporter: Benjamin Bannier


With MESOS-9523 we introduced fields into {{FrameworkInfo}} to give frameworks 
a way to express their resource requirements. We should document this feature 
in the framework development guide(s).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9490) Support accepting gzipped responses in libprocess

2019-02-15 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769234#comment-16769234
 ] 

Benno Evers edited comment on MESOS-9490 at 2/15/19 11:57 AM:
--

[~bmahler], the full code which originally hit this issue is pasted in the 
linked issue; a more minimal version looks like this:
{noformat}
TEST_F(MasterLoadTest, DISABLED_AcceptEncoding) {
  Try<Owned<cluster::Master>> master = StartMaster();

  Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL);
  Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}};

  auto response = process::http::get(
      master.get()->pid,
      "/state",
      None(),
      authHeaders + acceptGzipHeaders);

  AWAIT_READY(response);
}
{noformat}

If I remember correctly, running this test leads to a segfault due to some 
internal CHECK failure.


was (Author: bennoe):
[~bmahler], the full code which originally hit this issue is pasted in the 
linked issue; a more minimal version looks like this:
{noformat}
TEST_F(MasterLoadTest, DISABLED_AcceptEncoding) {
  Try<Owned<cluster::Master>> master = StartMaster();

  Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL);
  Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}};

  auto response = process::http::get(
      master.get()->pid,
      "/state",
      None(),
      authHeaders + acceptGzipHeaders);

  AWAIT_READY(response);
}
{noformat}

> Support accepting gzipped responses in libprocess
> -
>
> Key: MESOS-9490
> URL: https://issues.apache.org/jira/browse/MESOS-9490
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>  Labels: libprocess
>
> Currently all libprocess endpoints support serving gzipped responses when 
> the client requests this with an `Accept-Encoding: gzip` header.
> However, libprocess does not support receiving gzipped responses, failing 
> with a decode error in this case.
> For symmetry, we should try to support compression in this case as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9490) Support accepting gzipped responses in libprocess

2019-02-15 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769234#comment-16769234
 ] 

Benno Evers commented on MESOS-9490:


[~bmahler], the full code which originally hit this issue is pasted in the 
linked issue; a more minimal version looks like this:
{noformat}
TEST_F(MasterLoadTest, DISABLED_AcceptEncoding) {
  Try<Owned<cluster::Master>> master = StartMaster();

  Headers authHeaders = createBasicAuthHeaders(DEFAULT_CREDENTIAL);
  Headers acceptGzipHeaders = {{"Accept-Encoding", "gzip"}};

  auto response = process::http::get(
      master.get()->pid,
      "/state",
      None(),
      authHeaders + acceptGzipHeaders);

  AWAIT_READY(response);
}
{noformat}

> Support accepting gzipped responses in libprocess
> -
>
> Key: MESOS-9490
> URL: https://issues.apache.org/jira/browse/MESOS-9490
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>  Labels: libprocess
>
> Currently all libprocess endpoints support serving gzipped responses when 
> the client requests this with an `Accept-Encoding: gzip` header.
> However, libprocess does not support receiving gzipped responses, failing 
> with a decode error in this case.
> For symmetry, we should try to support compression in this case as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9575) Mesos Web UI can't display relative timestamps in the future

2019-02-15 Thread Armand Grillet (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769093#comment-16769093
 ] 

Armand Grillet commented on MESOS-9575:
---

relative-date has not been updated since 2011: 
https://github.com/azer/relative-date
We could use https://github.com/moment/moment/, which can [parse UNIX 
timestamps|https://momentjs.com/docs/#/parsing/unix-timestamp/] and return 
relative dates in both the past and the future:
{noformat}
moment("20111031", "MMDD").fromNow(); // 7 years ago
moment().startOf('day').fromNow();// 10 hours ago
moment().endOf('day').fromNow();  // in 14 hours
{noformat}

> Mesos Web UI can't display relative timestamps in the future
> 
>
> Key: MESOS-9575
> URL: https://issues.apache.org/jira/browse/MESOS-9575
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>
> The `relativeDate()` function used by the Mesos WebUI 
> (https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=src/webui/assets/libs/relative-date.js;hb=HEAD)
>  is only able to handle dates in the past. All dates in the future are 
> rendered as "just now".
> This can be especially confusing when posting maintenance windows, where 
> usually both dates are in the future.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)