[jira] [Created] (MESOS-9450) MasterAuthorizationTest.SlaveRemovedDropped is flaky.

2018-12-03 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-9450:
-

 Summary: MasterAuthorizationTest.SlaveRemovedDropped is flaky.
 Key: MESOS-9450
 URL: https://issues.apache.org/jira/browse/MESOS-9450
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 1.8.0
 Environment: Debian 9, autotools, libevent + SSL
Reporter: Till Toenshoff


{noformat}
23:50:59  [ RUN  ] MasterAuthorizationTest.SlaveRemovedDropped
23:50:59  I1203 23:50:59.123471  1137 master.cpp:414] Master 
1f14ff95-e61f-4410-a724-dfec18eb52b0 (localhost) started on 127.0.0.1:33161
23:50:59  I1203 23:50:59.123558  1137 master.cpp:417] Flags at startup: 
--acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/0p45nb/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/0p45nb/master" --zk_session_timeout="10secs"
23:50:59  W1203 23:50:59.123672  1137 master.cpp:420] 
23:50:59  **
23:50:59  Master bound to loopback interface! Cannot communicate with remote 
schedulers or agents. You might want to set '--ip' flag to a routable IP 
address.
23:50:59  **
23:50:59  I1203 23:50:59.123688  1137 master.cpp:466] Master only allowing 
authenticated frameworks to register
23:50:59  I1203 23:50:59.123695  1137 master.cpp:472] Master only allowing 
authenticated agents to register
23:50:59  I1203 23:50:59.123702  1137 master.cpp:478] Master only allowing 
authenticated HTTP frameworks to register
23:50:59  I1203 23:50:59.123708  1137 credentials.hpp:37] Loading credentials 
for authentication from '/tmp/0p45nb/credentials'
23:50:59  I1203 23:50:59.123761  1137 master.cpp:522] Using default 'crammd5' 
authenticator
23:50:59  I1203 23:50:59.123819  1137 http.cpp:1017] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readonly'
23:50:59  I1203 23:50:59.123875  1137 http.cpp:1017] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readwrite'
23:50:59  I1203 23:50:59.123903  1137 http.cpp:1017] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-scheduler'
23:50:59  I1203 23:50:59.123939  1137 master.cpp:603] Authorization enabled
23:50:59  I1203 23:50:59.124068  1133 hierarchical.cpp:175] Initialized 
hierarchical allocator process
23:50:59  I1203 23:50:59.124094  1138 whitelist_watcher.cpp:77] No whitelist 
given
23:50:59  I1203 23:50:59.124608  1137 master.cpp:2089] Elected as the leading 
master!
23:50:59  I1203 23:50:59.124625  1137 master.cpp:1644] Recovering from registrar
23:50:59  I1203 23:50:59.124652  1136 registrar.cpp:339] Recovering registrar
23:50:59  I1203 23:50:59.124763  1136 registrar.cpp:383] Successfully fetched 
the registry (0B) in 97024ns
23:50:59  I1203 23:50:59.124807  1136 registrar.cpp:487] Applied 1 operations 
in 6279ns; attempting to update the registry
23:50:59  I1203 23:50:59.124967  1136 registrar.cpp:544] Successfully updated 
the registry in 143104ns
23:50:59  I1203 23:50:59.125001  1136 registrar.cpp:416] Successfully recovered 
registrar
23:50:59  I1203 23:50:59.125172  1137 master.cpp:1758] Recovered 0 agents from 
the registry (125B); allowing 10mins for agents to reregister
23:50:59  I1203 23:50:59.125355  1138 hierarchical.cpp:215] Skipping recovery 
of hierarchical allocator: nothing to recover
23:50:59  W1203 23:50:59.126682  1117 process.cpp:2829] Attempted to spawn 
already running process files@127.0.0.1:33161
23:50:59  I1203 23:50:59.126904  1117 cluster.cpp:485] Creating default 'local' 
authorizer
23:50:59  I1203 23:50:59.127399  1131 slave.cpp:268] Mesos 

[jira] [Commented] (MESOS-4646) PortMappingIsolatorTests get kernel stuck.

2018-12-03 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708001#comment-16708001
 ] 

Till Toenshoff commented on MESOS-4646:
---

We should try this on a more recent kernel -- [~ipronin] suggested using a 4.9 
or 4.14 -- I'll give that a spin as soon as possible.

> PortMappingIsolatorTests get kernel stuck.
> --
>
> Key: MESOS-4646
> URL: https://issues.apache.org/jira/browse/MESOS-4646
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.8.0
> Environment: Linux Kernel 3.19.9-49-generic,
> libnl-3.2.27
>Reporter: Till Toenshoff
>Priority: Major
>  Labels: flaky, flaky-test
>
> {noformat}
> $ sudo ./bin/mesos-tests.sh --gtest_filter="*PortMappingIsolatorTest*"
> Source directory: /home/till/scratchpad/mesos
> Build directory: /home/till/scratchpad/mesos/build
> -
> We cannot run any cgroups tests that require mounting
> hierarchies because you have the following hierarchies mounted:
> /sys/fs/cgroup/blkio, /sys/fs/cgroup/cpu, /sys/fs/cgroup/cpuacct, 
> /sys/fs/cgroup/cpuset, /sys/fs/cgroup/devices, /sys/fs/cgroup/freezer, 
> /sys/fs/cgroup/hugetlb, /sys/fs/cgroup/memory, /sys/fs/cgroup/net_cls, 
> /sys/fs/cgroup/net_prio, /sys/fs/cgroup/perf_event, /sys/fs/cgroup/systemd
> We'll disable the CgroupsNoHierarchyTest test fixture for now.
> -
> WARNING: perf not found for kernel 3.19.0-49
>   You may need to install the following packages for this specific kernel:
> linux-tools-3.19.0-49-generic
> linux-cloud-tools-3.19.0-49-generic
>   You may also want to install one of the following packages to keep up to 
> date:
> linux-tools-generic-lts-
> linux-cloud-tools-generic-lts-
> -
> No 'perf' command found so no 'perf' tests will be run
> -
> WARNING: perf not found for kernel 3.19.0-49
>   You may need to install the following packages for this specific kernel:
> linux-tools-3.19.0-49-generic
> linux-cloud-tools-3.19.0-49-generic
>   You may also want to install one of the following packages to keep up to 
> date:
> linux-tools-generic-lts-
> linux-cloud-tools-generic-lts-
> -
> The 'perf' command wasn't found so tests using it
> to sample the 'cycles' hardware event will not be run.
> -
> /bin/nc
> /usr/local/bin/curl
> Note: Google Test filter = 
> 

[jira] [Commented] (MESOS-4646) PortMappingIsolatorTests get kernel stuck.

2018-12-03 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707993#comment-16707993
 ] 

Till Toenshoff commented on MESOS-4646:
---

[~ipronin] do you have any cycles for looking into this?


[jira] [Assigned] (MESOS-8045) Update Mesos executables output if there is a typo

2018-12-03 Thread Benno Evers (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-8045:
--

Resolution: Fixed
  Assignee: Benno Evers

This was resolved by MESOS-8728; now we only print the full help string when 
the "--help" option is specified.

> Update Mesos executables output if there is a typo
> --
>
> Key: MESOS-8045
> URL: https://issues.apache.org/jira/browse/MESOS-8045
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Armand Grillet
>Assignee: Benno Evers
>Priority: Minor
>
> Current output if a user makes a typo while using one of the Mesos 
> executables:
> {code}
> build (master) $ ./bin/mesos-master.sh --ip=127.0.0.1 --workdir=/tmp
> Failed to load unknown flag 'workdir'
> Usage: mesos-master [options]
>   --acls=VALUE
>        The value could be a JSON-formatted string of ACLs
>        or a file path containing the JSON-formatted ACLs used
>        for authorization. Path could be of the form `file:///path/to/file`
>        or `/path/to/file`.
>        Note that if the flag `--authorizers` is provided with a value
>        different than `local`, the ACLs contents
>        will be ignored.
>        See the ACLs protobuf in acls.proto for the expected format.
>        Example:
>        {
>          "register_frameworks": [
>            {
>              "principals": { "type": "ANY" },
>              "roles": { "values": ["a"] }
>            }
>          ],
>          "run_tasks": [
>            {
>              "principals": { "values": ["a", "b"] },
>              "users": { "values": ["c"] }
>            }
>          ],
>          "teardown_frameworks": [
>            {
>              "principals": { "values": ["a", "b"] },
>              "framework_principals": { "values": ["c"] }
>            }
>          ],
>          "set_quotas": [
>            {
>              "principals": { "values": ["a"] },
>              "roles": { "values": ["a", "b"] }
>            }
>          ],
>          "remove_quotas": [
>            {
>

[jira] [Commented] (MESOS-3938) Consider allowing setting quotas for the default '*' role.

2018-12-03 Thread Meng Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707949#comment-16707949
 ] 

Meng Zhu commented on MESOS-3938:
-

Closing this unless there are new use cases.

> Consider allowing setting quotas for the default '*' role.
> --
>
> Key: MESOS-3938
> URL: https://issues.apache.org/jira/browse/MESOS-3938
> Project: Mesos
>  Issue Type: Task
>Reporter: Alexander Rukletsov
>Priority: Major
>
> Investigate use cases and implications of the possibility to set quota for 
> the '*' role. For example, having quota for '*' set can effectively reduce 
> the scope of the quota capacity heuristic.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2018-12-03 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707765#comment-16707765
 ] 

Vinod Kone edited comment on MESOS-7971 at 12/3/18 8:50 PM:


Saw this again.

{noformat}
06:14:51 [ RUN  ] 
PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
06:14:51 I1203 06:14:50.630549 19784 cluster.cpp:173] Creating default 'local' 
authorizer
06:14:51 I1203 06:14:50.633529 19796 master.cpp:413] Master 
f1ffe054-ad44-45d4-9f39-84b048e1a359 (c16130e94783) started on 172.17.0.3:44340
06:14:51 I1203 06:14:50.633581 19796 master.cpp:416] Flags at startup: 
--acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1000secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/4vMyjy/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --roles="role1" 
--root_submissions="true" --version="false" 
--webui_dir="/tmp/SRC/build/mesos-1.8.0/_inst/share/mesos/webui" 
--work_dir="/tmp/4vMyjy/master" --zk_session_timeout="10secs"
06:14:51 I1203 06:14:50.634217 19796 master.cpp:465] Master only allowing 
authenticated frameworks to register
06:14:51 I1203 06:14:50.634236 19796 master.cpp:471] Master only allowing 
authenticated agents to register
06:14:51 I1203 06:14:50.634253 19796 master.cpp:477] Master only allowing 
authenticated HTTP frameworks to register
06:14:51 I1203 06:14:50.634270 19796 credentials.hpp:37] Loading credentials 
for authentication from '/tmp/4vMyjy/credentials'
06:14:51 I1203 06:14:50.634608 19796 master.cpp:521] Using default 'crammd5' 
authenticator
06:14:51 I1203 06:14:50.634840 19796 http.cpp:1042] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readonly'
06:14:51 I1203 06:14:50.635052 19796 http.cpp:1042] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readwrite'
06:14:51 I1203 06:14:50.635200 19796 http.cpp:1042] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-scheduler'
06:14:51 I1203 06:14:50.635373 19796 master.cpp:602] Authorization enabled
06:14:51 W1203 06:14:50.635457 19796 master.cpp:665] The '--roles' flag is 
deprecated. This flag will be removed in the future. See the Mesos 0.27 upgrade 
notes for more information
06:14:51 I1203 06:14:50.635991 19800 whitelist_watcher.cpp:77] No whitelist 
given
06:14:51 I1203 06:14:50.636032 19793 hierarchical.cpp:175] Initialized 
hierarchical allocator process
06:14:51 I1203 06:14:50.638939 19796 master.cpp:2105] Elected as the leading 
master!
06:14:51 I1203 06:14:50.638975 19796 master.cpp:1660] Recovering from registrar
06:14:51 I1203 06:14:50.639200 19792 registrar.cpp:339] Recovering registrar
06:14:51 I1203 06:14:50.639927 19792 registrar.cpp:383] Successfully fetched 
the registry (0B) in 672768ns
06:14:51 I1203 06:14:50.640069 19792 registrar.cpp:487] Applied 1 operations in 
48006ns; attempting to update the registry
06:14:51 I1203 06:14:50.640718 19792 registrar.cpp:544] Successfully updated 
the registry in 582912ns
06:14:51 I1203 06:14:50.640852 19792 registrar.cpp:416] Successfully recovered 
registrar
06:14:51 I1203 06:14:50.641299 19800 master.cpp:1774] Recovered 0 agents from 
the registry (135B); allowing 10mins for agents to reregister
06:14:51 I1203 06:14:50.641340 19799 hierarchical.cpp:215] Skipping recovery of 
hierarchical allocator: nothing to recover
06:14:51 W1203 06:14:50.647153 19784 process.cpp:2829] Attempted to spawn 
already running process files@172.17.0.3:44340
06:14:51 I1203 06:14:50.648453 19784 containerizer.cpp:305] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
06:14:51 W1203 06:14:50.649060 19784 backend.cpp:76] Failed to create 'aufs' 
backend: AufsBackend requires root privileges
06:14:51 W1203 06:14:50.649088 19784 backend.cpp:76] Failed 


[jira] [Commented] (MESOS-8983) SlaveRecoveryTest/0.PingTimeoutDuringRecovery flaky

2018-12-03 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707749#comment-16707749
 ] 

Vinod Kone commented on MESOS-8983:
---

This is happening on ASF CI.

{code}
15:49:24 3: [ RUN  ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery
15:49:24 3: I1203 15:49:24.425719 24686 cluster.cpp:173] Creating default 'local' authorizer
15:49:24 3: I1203 15:49:24.430784 24687 master.cpp:413] Master 620b2018-c90f-4b11-bbe3-8fa1c90f204d (5a45e7f918b2) started on 172.17.0.3:42912
15:49:24 3: I1203 15:49:24.430824 24687 master.cpp:416] Flags at startup: --acls="" --agent_ping_timeout="1secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/PNxXC7/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="2" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/PNxXC7/master" --zk_session_timeout="10secs"
15:49:24 3: I1203 15:49:24.431120 24687 master.cpp:465] Master only allowing authenticated frameworks to register
15:49:24 3: I1203 15:49:24.431131 24687 master.cpp:471] Master only allowing authenticated agents to register
15:49:24 3: I1203 15:49:24.431139 24687 master.cpp:477] Master only allowing authenticated HTTP frameworks to register
15:49:24 3: I1203 15:49:24.431149 24687 credentials.hpp:37] Loading credentials for authentication from '/tmp/PNxXC7/credentials'
15:49:24 3: I1203 15:49:24.431355 24687 master.cpp:521] Using default 'crammd5' authenticator
15:49:24 3: I1203 15:49:24.431514 24687 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
15:49:24 3: I1203 15:49:24.431659 24687 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
15:49:24 3: I1203 15:49:24.431778 24687 http.cpp:1042] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
15:49:24 3: I1203 15:49:24.431896 24687 master.cpp:602] Authorization enabled
15:49:24 3: I1203 15:49:24.432276 24688 hierarchical.cpp:175] Initialized hierarchical allocator process
15:49:24 3: I1203 15:49:24.432498 24688 whitelist_watcher.cpp:77] No whitelist given
15:49:24 3: I1203 15:49:24.444337 24690 master.cpp:2105] Elected as the leading master!
15:49:24 3: I1203 15:49:24.444366 24690 master.cpp:1660] Recovering from registrar
15:49:24 3: I1203 15:49:24.445142 24687 registrar.cpp:339] Recovering registrar
15:49:24 3: I1203 15:49:24.445669 24687 registrar.cpp:383] Successfully fetched the registry (0B) in 472064ns
15:49:24 3: I1203 15:49:24.445785 24687 registrar.cpp:487] Applied 1 operations in 40517ns; attempting to update the registry
15:49:24 3: I1203 15:49:24.446497 24687 registrar.cpp:544] Successfully updated the registry in 660992ns
15:49:24 3: I1203 15:49:24.453212 24687 registrar.cpp:416] Successfully recovered registrar
15:49:24 3: I1203 15:49:24.453722 24692 master.cpp:1774] Recovered 0 agents from the registry (135B); allowing 10mins for agents to reregister
15:49:24 3: I1203 15:49:24.453984 24692 hierarchical.cpp:215] Skipping recovery of hierarchical allocator: nothing to recover
15:49:24 3: I1203 15:49:24.468710 24686 containerizer.cpp:305] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
15:49:24 3: W1203 15:49:24.481513 24686 backend.cpp:76] Failed to create 'aufs' backend: AufsBackend requires root privileges
15:49:24 3: W1203 15:49:24.481549 24686 backend.cpp:76] Failed to create 'bind' backend: BindBackend requires root privileges
15:49:24 3: I1203 15:49:24.481591 24686 provisioner.cpp:298] Using default backend 'copy'
15:49:24 3: W1203 15:49:24.498661 24686 process.cpp:2829] Attempted to spawn already running process

[jira] [Commented] (MESOS-9448) Semantics of RECONCILE_OPERATIONS framework API call are incorrect

2018-12-03 Thread Gastón Kleiman (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707713#comment-16707713
 ] 

Gastón Kleiman commented on MESOS-9448:
---

These are the intended semantics for {{RECONCILE_OPERATIONS}}: we decided to 
follow a request/response pattern instead of an event-based pattern like 
{{RECONCILE}}.

{{send()}} is a {{void}} method, so we had to add the {{call()}} method in 
order to use this API call. We should update the description of {{send()}} in 
{{scheduler.hpp}} and {{scheduler.cpp}} to make it clear that it cannot be 
used to send {{RECONCILE_OPERATIONS}} requests.

> Semantics of RECONCILE_OPERATIONS framework API call are incorrect
> --
>
> Key: MESOS-9448
> URL: https://issues.apache.org/jira/browse/MESOS-9448
> Project: Mesos
>  Issue Type: Bug
>  Components: framework, HTTP API, master
>Reporter: Benjamin Bannier
>Priority: Major
>
> The typical pattern in the framework HTTP API is that frameworks send calls 
> to which the master responds with {{Accepted}} responses and which trigger 
> events. The only exception by design is the {{SUBSCRIBE}} call, to which 
> the master responds with an {{Ok}} response containing the assigned framework 
> ID. This is even codified in {{src/scheduler.cpp:646ff}},
> {code}
> if (response->code == process::http::Status::OK) {
>   // Only SUBSCRIBE call should get a "200 OK" response.
>   CHECK_EQ(Call::SUBSCRIBE, call.type());
> {code}
> Currently, the handling of {{RECONCILE_OPERATIONS}} calls does not follow 
> this pattern. Instead of sending events, the master immediately responds with 
> an {{Ok}} and a list of operations. This leads, e.g., to assertion failures 
> in the above hard check whenever one uses {{Scheduler::send}} instead of 
> {{Scheduler::call}}. One can reproduce this by modifying the existing tests 
> in {{src/operation_reconciliation_tests.cpp}},
> {code}
> mesos.send({createCallReconcileOperations(frameworkId, {operation})}); // ADD THIS.
> const Future result =
>   mesos.call({createCallReconcileOperations(frameworkId, {operation})});
> {code}





[jira] [Commented] (MESOS-9022) Race condition in task updates could cause missing event in streaming

2018-12-03 Thread Evelyn Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707638#comment-16707638
 ] 

Evelyn Liu commented on MESOS-9022:
---

Thanks [~bennoe] [~vinodkone]!

> Race condition in task updates could cause missing event in streaming
> -
>
> Key: MESOS-9022
> URL: https://issues.apache.org/jira/browse/MESOS-9022
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, master
>Affects Versions: 1.6.0
>Reporter: Evelyn Liu
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: events, foundations, mesos, mesosphere, race-condition, 
> streaming
> Fix For: 1.7.0
>
>
> Master sends update event of {{TASK_STARTING}} when task's latest state is 
> already {{TASK_FAILED}}. Then when it handles the update of {{TASK_FAILED}}, 
> {{sendSubscribersUpdate}} is set to {{false}} because of 
> [this|https://github.com/apache/mesos/blob/1.6.x/src/master/master.cpp#L10805].
>  The subscriber would not receive update event of {{TASK_FAILED}}.
> This happened when a task failed very fast. Is there a race condition while 
> handling task updates?
> {{*master log:*}}
> {code:java}
> I0622 13:08:29.189771 84079 master.cpp:8345] Status update TASK_STARTING 
> (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.189801 84079 master.cpp:8402] Forwarding status update 
> TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.190004 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_STARTING, 
> status update state: TASK_STARTING)
>  I0622 13:08:29.603857 84079 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status eb091093-d303-4e82-b69f-e2ba1011ba76 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.615643 84079 master.cpp:8345] Status update TASK_STARTING 
> (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.615669 84079 master.cpp:8402] Forwarding status update 
> TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.615783 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status 
> update state: TASK_STARTING)
>  I0622 13:08:29.620837 84079 master.cpp:8345] Status update TASK_FAILED 
> (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.620853 84079 master.cpp:8402] Forwarding status update 
> TASK_FAILED (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.620923 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status 
> update state: TASK_FAILED)
>  I0622 13:08:29.630455 84079 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status eb091093-d303-4e82-b69f-e2ba1011ba76 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.673051 84095 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status ac34f1e9-eaa4-4765-82ac-7398c2e6c835 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9338) Add asynchronous DNS facilities to libprocess.

2018-12-03 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707502#comment-16707502
 ] 

Chun-Hung Hsiao edited comment on MESOS-9338 at 12/3/18 4:44 PM:
-

C-ares is implicitly bundled in the gRPC bundle. If we are going to bundle 
c-ares we should compile gRPC against our c-ares bundle.


was (Author: chhsia0):
C-ares is implicitly bundled in the gRPC bundle. If we are going to bundle 
c-ares we should compile gRPC against the our c-ares bundle.

> Add asynchronous DNS facilities to libprocess.
> --
>
> Key: MESOS-9338
> URL: https://issues.apache.org/jira/browse/MESOS-9338
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> This would enable non-blocking DNS queries. One use case is during TLS peer 
> certificate verification, we need to perform a reverse DNS lookup to get the 
> peer's hostname. This blocks the event loop thread!
> Some options:
> (1) Linux provides {{getaddrinfo_a}}, however I don't see an equivalent one 
> for {{getnameinfo}}:
> http://man7.org/linux/man-pages/man3/getaddrinfo_a.3.html
> (2) A popular library is c-ares (MIT license):
> https://c-ares.haxx.se/
> (3) ADNS (GPLv3):
> https://www.gnu.org/software/adns/
> (4) c-ares has a list of other libraries:
> https://c-ares.haxx.se/otherlibs.html





[jira] [Commented] (MESOS-9338) Add asynchronous DNS facilities to libprocess.

2018-12-03 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707502#comment-16707502
 ] 

Chun-Hung Hsiao commented on MESOS-9338:


C-ares is implicitly bundled in the gRPC bundle. If we are going to bundle 
c-ares we should compile gRPC against our c-ares bundle.
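
Until an async resolver such as c-ares is wired in, the usual stopgap for the 
blocking {{getnameinfo}} problem the ticket describes is to push the lookup 
onto a worker thread. A minimal sketch of that offloading pattern (illustrative 
only, not libprocess code; the function name is invented):

```python
# Illustrative sketch (not libprocess code): offload the blocking
# reverse-DNS call to a worker thread so the event loop is never stalled.
import socket
from concurrent.futures import ThreadPoolExecutor

_resolver_pool = ThreadPoolExecutor(max_workers=4)

def reverse_lookup_async(ip, port=0):
    """Run the blocking getnameinfo() on a worker thread.

    Returns a Future; the caller attaches a callback or polls instead of
    blocking the event-loop thread on the lookup.
    """
    return _resolver_pool.submit(socket.getnameinfo, (ip, port), 0)

# The event loop keeps running; the result is collected later.
future = reverse_lookup_async("127.0.0.1")
host, _service = future.result(timeout=10)
```

This only hides the blocking call behind a thread pool; a true async resolver 
like c-ares avoids the extra threads entirely.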

> Add asynchronous DNS facilities to libprocess.
> --
>
> Key: MESOS-9338
> URL: https://issues.apache.org/jira/browse/MESOS-9338
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: foundations
>
> This would enable non-blocking DNS queries. One use case is during TLS peer 
> certificate verification, we need to perform a reverse DNS lookup to get the 
> peer's hostname. This blocks the event loop thread!
> Some options:
> (1) Linux provides {{getaddrinfo_a}}, however I don't see an equivalent one 
> for {{getnameinfo}}:
> http://man7.org/linux/man-pages/man3/getaddrinfo_a.3.html
> (2) A popular library is c-ares (MIT license):
> https://c-ares.haxx.se/
> (3) ADNS (GPLv3):
> https://www.gnu.org/software/adns/
> (4) c-ares has a list of other libraries:
> https://c-ares.haxx.se/otherlibs.html





[jira] [Commented] (MESOS-9022) Race condition in task updates could cause missing event in streaming

2018-12-03 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707413#comment-16707413
 ] 

Benno Evers commented on MESOS-9022:


Confirmed, this is caused by the same underlying problem as MESOS-9000 and 
should be solved by https://reviews.apache.org/r/67575/ .
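
The dropped-update pattern from the description can be modeled in a few lines. 
This is a deliberately simplified sketch of the suspected guard logic, not the 
actual master code:

```python
# Simplified model (not the actual master code) of the guard described in
# the issue: suppressing the subscriber event whenever the incoming status
# matches the already-recorded "latest" state drops the terminal update.
class Task:
    def __init__(self):
        self.latest_state = None
        self.subscriber_events = []

    def update(self, state):
        # Suspect pattern: skip the subscriber update if the latest state
        # already reflects this status (it was advanced out-of-band while
        # an earlier TASK_STARTING update was still being handled).
        send_subscribers_update = state != self.latest_state
        self.latest_state = state
        if send_subscribers_update:
            self.subscriber_events.append(state)

task = Task()
task.update("TASK_STARTING")       # subscribers see TASK_STARTING
task.latest_state = "TASK_FAILED"  # task failed fast: state advanced early
task.update("TASK_FAILED")         # guard fires: subscribers never see it
```

In this model the subscriber stream ends at TASK_STARTING even though the task 
has already failed, matching the log excerpt below.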

> Race condition in task updates could cause missing event in streaming
> -
>
> Key: MESOS-9022
> URL: https://issues.apache.org/jira/browse/MESOS-9022
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, master
>Affects Versions: 1.6.0
>Reporter: Evelyn Liu
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: events, foundations, mesos, mesosphere, race-condition, 
> streaming
>
> Master sends update event of {{TASK_STARTING}} when task's latest state is 
> already {{TASK_FAILED}}. Then when it handles the update of {{TASK_FAILED}}, 
> {{sendSubscribersUpdate}} is set to {{false}} because of 
> [this|https://github.com/apache/mesos/blob/1.6.x/src/master/master.cpp#L10805].
>  The subscriber would not receive update event of {{TASK_FAILED}}.
> This happened when a task failed very fast. Is there a race condition while 
> handling task updates?
> {{*master log:*}}
> {code:java}
> I0622 13:08:29.189771 84079 master.cpp:8345] Status update TASK_STARTING 
> (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.189801 84079 master.cpp:8402] Forwarding status update 
> TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.190004 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_STARTING, 
> status update state: TASK_STARTING)
>  I0622 13:08:29.603857 84079 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status eb091093-d303-4e82-b69f-e2ba1011ba76 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.615643 84079 master.cpp:8345] Status update TASK_STARTING 
> (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.615669 84079 master.cpp:8402] Forwarding status update 
> TASK_STARTING (Status UUID: eb091093-d303-4e82-b69f-e2ba1011ba76) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.615783 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status 
> update state: TASK_STARTING)
>  I0622 13:08:29.620837 84079 master.cpp:8345] Status update TASK_FAILED 
> (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- from agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.620853 84079 master.cpp:8402] Forwarding status update 
> TASK_FAILED (Status UUID: ac34f1e9-eaa4-4765-82ac-7398c2e6c835) for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e-
>  I0622 13:08:29.620923 84079 master.cpp:10843] Updating the state of task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (latest state: TASK_FAILED, status 
> update state: TASK_FAILED)
>  I0622 13:08:29.630455 84079 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status eb091093-d303-4e82-b69f-e2ba1011ba76 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587
>  I0622 13:08:29.673051 84095 master.cpp:6195] Processing ACKNOWLEDGE call for 
> status ac34f1e9-eaa4-4765-82ac-7398c2e6c835 for task 
> f839055c-7a40-4e6c-9f53-22030f388c8c of framework 
> 4591ea8b-4adb-4acf-bb29-b70817663c4e- (Aurora) on agent 
> d2f1c7c2-668d-46e5-829b-ce614cca79ae-S1587{code}
>  





[jira] [Commented] (MESOS-9318) Consider providing better operation status updates while an RP is recovering

2018-12-03 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707362#comment-16707362
 ] 

Benjamin Bannier commented on MESOS-9318:
-

The flow for a possible fix could be:
* master sees a reconciliation request for an operation on some resource 
provider on a registered agent
* master forwards the reconciliation request to the agent
* agent forwards it to its resource provider manager
* resource provider manager either sends a {{ReconcileOperations}} event to the 
registered resource provider, or responds with an {{OPERATION_UNREACHABLE}} for 
a resource provider which is not subscribed. It could also respond with some 
status for resource providers marked gone, see MESOS-8403.
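
The dispatch step at the end of that flow could be sketched roughly as follows. 
The names are illustrative, not the actual resource provider manager API, and 
the status for gone providers is deliberately left open (see MESOS-8403):

```python
# Hypothetical sketch of the resource provider manager's dispatch step;
# names are illustrative, not the actual Mesos API.
def reconcile_dispatch(provider_state):
    """Map a resource provider's state to the reconciliation outcome."""
    if provider_state == "SUBSCRIBED":
        return "SEND_RECONCILE_OPERATIONS_EVENT"
    if provider_state == "GONE":
        # Placeholder; MESOS-8403 tracks what status this should be.
        return "STATUS_FOR_GONE_PROVIDER"
    return "OPERATION_UNREACHABLE"
```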

> Consider providing better operation status updates while an RP is recovering
> 
>
> Key: MESOS-9318
> URL: https://issues.apache.org/jira/browse/MESOS-9318
> Project: Mesos
>  Issue Type: Task
>Affects Versions: 1.6.0, 1.7.0
>Reporter: Gastón Kleiman
>Priority: Major
>  Labels: mesosphere, operation-feedback
>
> Consider the following scenario:
> 1. A framework accepts an offer with an operation affecting SLRP resources.
> 2. The master forwards it to the corresponding agent.
> 3. The agent forwards it to the corresponding RP.
> 4. The agent and the master fail over.
> 5. The master recovers.
> 6. The agent recovers while the RP is still recovering, so it doesn't include 
> the pending operation on the {{RegisterMessage}}.
> 7. A framework performs an explicit operation status reconciliation.
> In this case the master will currently respond with {{OPERATION_UNKNOWN}}, 
> but it should be possible to respond with a more fine-grained and useful 
> state, such as {{OPERATION_RECOVERING}}.





[jira] [Commented] (MESOS-9157) cannot pull docker image from dockerhub

2018-12-03 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707338#comment-16707338
 ] 

Andrei Budnik commented on MESOS-9157:
--

[~MichaelBowie] Can you please provide stderr and stdout logs of a failed 
Docker container from its sandbox?

> cannot pull docker image from dockerhub
> ---
>
> Key: MESOS-9157
> URL: https://issues.apache.org/jira/browse/MESOS-9157
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.6.1
>Reporter: Michael Bowie
>Priority: Blocker
>  Labels: containerization
>
> I am not able to pull docker images from docker hub through marathon/mesos. 
> I get one of two errors:
>  * `Aug 15 10:11:02 michael-b-dcos-agent-1 dockerd[5974]: 
> time="2018-08-15T10:11:02.770309104-04:00" level=error msg="Not continuing 
> with pull after error: context canceled"`
>  * `Failed to run docker -H ... Error: No such object: 
> mesos-d2f333a8-fef2-48fb-8b99-28c52c327790`
> However, I can manually ssh into one of the agents and successfully pull the 
> image from the command line. 
> Any pointers in the right direction?
> Thank you!
> Similar Issues:
> https://github.com/mesosphere/marathon/issues/3869





[jira] [Comment Edited] (MESOS-9157) cannot pull docker image from dockerhub

2018-12-03 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707338#comment-16707338
 ] 

Andrei Budnik edited comment on MESOS-9157 at 12/3/18 3:11 PM:
---

[~MichaelBowie] Can you please provide stderr and stdout logs of a failed 
Docker container from its sandbox?

It would also be great if you could provide the Mesos agent logs covering the 
failing Docker task. 


was (Author: abudnik):
[~MichaelBowie] Can you please provide stderr and stdout logs of a failed 
Docker container from its sandbox?

> cannot pull docker image from dockerhub
> ---
>
> Key: MESOS-9157
> URL: https://issues.apache.org/jira/browse/MESOS-9157
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.6.1
>Reporter: Michael Bowie
>Priority: Blocker
>  Labels: containerization
>
> I am not able to pull docker images from docker hub through marathon/mesos. 
> I get one of two errors:
>  * `Aug 15 10:11:02 michael-b-dcos-agent-1 dockerd[5974]: 
> time="2018-08-15T10:11:02.770309104-04:00" level=error msg="Not continuing 
> with pull after error: context canceled"`
>  * `Failed to run docker -H ... Error: No such object: 
> mesos-d2f333a8-fef2-48fb-8b99-28c52c327790`
> However, I can manually ssh into one of the agents and successfully pull the 
> image from the command line. 
> Any pointers in the right direction?
> Thank you!
> Similar Issues:
> https://github.com/mesosphere/marathon/issues/3869





[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2018-12-03 Thread James DeFelice (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707318#comment-16707318
 ] 

James DeFelice commented on MESOS-9223:
---

MESOS-8380 addresses UI changes. The UI should not be the only place to easily 
observe/troubleshoot errors; ideally there would also be an API that exposes them.

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough. If, 
> e.g., an RP container fails to come up during node start, a warning is 
> logged, but an operator still needs to detect the degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not always have 
> enough context for this decision. It would be better if the provider either 
> enforced a restart by failing over the whole agent, or retried the 
> operation (optionally: up to some maximum number of retries).
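
The retry-then-failover policy suggested in the description can be sketched as 
below (illustrative only, not agent code; the function name is invented):

```python
# Illustrative sketch (not agent code) of the suggested policy: retry
# the standalone container launch up to a cap, then escalate by failing
# over the agent instead of leaving the provider silently degraded.
def launch_with_retries(launch, max_retries=3):
    """Call launch() until it succeeds or retries are exhausted.

    Returns True on success; raises RuntimeError to signal that the
    whole agent should be restarted once retries are used up.
    """
    for _attempt in range(1 + max_retries):
        if launch():
            return True
    raise RuntimeError("RP container launch kept failing; fail over agent")
```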





[jira] [Created] (MESOS-9449) Support HTTP when pull UCR image using docker registry v2 API

2018-12-03 Thread haoyuan ge (JIRA)
haoyuan ge created MESOS-9449:
-

 Summary: Support HTTP when pull UCR image using docker registry v2 
API
 Key: MESOS-9449
 URL: https://issues.apache.org/jira/browse/MESOS-9449
 Project: Mesos
  Issue Type: Improvement
  Components: agent, containerization
Reporter: haoyuan ge


Many customers use Harbor as a docker registry in their private clouds, and 
most of them expose the registry API over HTTP instead of HTTPS. However, SSL 
is currently assumed when fetching images/layers for a Mesos container, so the 
Mesos agent reports an error when the registry only supports HTTP: 

Failed to launch container: Failed to perform 'curl': curl: (60) SSL 
certificate problem: unable to get local issuer certificate
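
A fix would presumably let the fetcher choose the scheme when building docker 
registry v2 URLs instead of always assuming HTTPS. A hypothetical sketch (the 
function and parameter names are invented, not the actual Mesos fetcher API):

```python
# Hypothetical sketch: make the URL scheme configurable when building
# docker registry v2 manifest URLs. Names here are illustrative only.
def manifest_url(registry, image, tag="latest", use_http=False):
    """Build a docker registry v2 manifest URL for the given image."""
    scheme = "http" if use_http else "https"
    return f"{scheme}://{registry}/v2/{image}/manifests/{tag}"
```

With {{use_http=True}} the fetcher would contact a plain-HTTP Harbor instance 
without tripping over certificate verification.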

 





[jira] [Created] (MESOS-9448) Semantics of RECONCILE_OPERATIONS framework API call are incorrect

2018-12-03 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-9448:
---

 Summary: Semantics of RECONCILE_OPERATIONS framework API call are 
incorrect
 Key: MESOS-9448
 URL: https://issues.apache.org/jira/browse/MESOS-9448
 Project: Mesos
  Issue Type: Bug
  Components: framework, HTTP API, master
Reporter: Benjamin Bannier


The typical pattern in the framework HTTP API is that frameworks send calls to 
which the master responds with {{Accepted}} responses and which trigger events. 
The only designed exception to this is the {{SUBSCRIBE}} call, to which the 
master responds with an {{Ok}} response containing the assigned framework ID. 
This is even codified in {{src/scheduler.cpp:646ff}},
{code}
if (response->code == process::http::Status::OK) {
  // Only SUBSCRIBE call should get a "200 OK" response.
  CHECK_EQ(Call::SUBSCRIBE, call.type())
{code}


