[jira] [Commented] (MESOS-8983) SlaveRecoveryTest/0.PingTimeoutDuringRecovery is flaky

2019-08-28 Thread Vinod Kone (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918041#comment-16918041
 ] 

Vinod Kone commented on MESOS-8983:
-----------------------------------

Seen this again when testing 1.9.0-RC2.

{code}
13:32:33 3: [ RUN  ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery
13:32:33 3: I0828 18:32:33.580678 20801 cluster.cpp:177] Creating default 
'local' authorizer
13:32:33 3: I0828 18:32:33.587858 20824 master.cpp:440] Master 
3de64da7-619c-4652-9d33-3fe2ca2a3d5f (b766865f9da3) started on 172.17.0.2:42011
13:32:33 3: I0828 18:32:33.587904 20824 master.cpp:443] Flags at startup: 
--acls="" --agent_ping_timeout="1secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/sIRhDp/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="2" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_operator_event_stream_subscribers="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/sIRhDp/master" --zk_session_timeout="10secs"
13:32:33 3: I0828 18:32:33.588558 20824 master.cpp:492] Master only allowing 
authenticated frameworks to register
13:32:33 3: I0828 18:32:33.588574 20824 master.cpp:498] Master only allowing 
authenticated agents to register
13:32:33 3: I0828 18:32:33.588587 20824 master.cpp:504] Master only allowing 
authenticated HTTP frameworks to register
13:32:33 3: I0828 18:32:33.588599 20824 credentials.hpp:37] Loading credentials 
for authentication from '/tmp/sIRhDp/credentials'
13:32:33 3: I0828 18:32:33.588999 20824 master.cpp:548] Using default 'crammd5' 
authenticator
13:32:33 3: I0828 18:32:33.589262 20824 http.cpp:975] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readonly'
13:32:33 3: I0828 18:32:33.589529 20824 http.cpp:975] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-readwrite'
13:32:33 3: I0828 18:32:33.589697 20824 http.cpp:975] Creating default 'basic' 
HTTP authenticator for realm 'mesos-master-scheduler'
13:32:33 3: I0828 18:32:33.589866 20824 master.cpp:629] Authorization enabled
13:32:33 3: I0828 18:32:33.590817 20823 whitelist_watcher.cpp:77] No whitelist 
given
13:32:33 3: I0828 18:32:33.594827 20816 master.cpp:2170] Elected as the leading 
master!
13:32:33 3: I0828 18:32:33.594887 20816 master.cpp:1666] Recovering from 
registrar
13:32:33 3: I0828 18:32:33.595124 20808 hierarchical.cpp:474] Initialized 
hierarchical allocator process
13:32:33 3: I0828 18:32:33.595382 20808 registrar.cpp:339] Recovering registrar
13:32:33 3: I0828 18:32:33.596575 20808 registrar.cpp:383] Successfully fetched 
the registry (0B) in 1.14688ms
13:32:33 3: I0828 18:32:33.596779 20808 registrar.cpp:487] Applied 1 operations 
in 63194ns; attempting to update the registry
13:32:33 3: I0828 18:32:33.597638 20819 registrar.cpp:544] Successfully updated 
the registry in 788224ns
13:32:33 3: I0828 18:32:33.597805 20819 registrar.cpp:416] Successfully 
recovered registrar
13:32:33 3: I0828 18:32:33.598423 20819 master.cpp:1819] Recovered 0 agents 
from the registry (144B); allowing 10mins for agents to reregister
13:32:33 3: I0828 18:32:33.598599 20813 hierarchical.cpp:513] Skipping recovery 
of hierarchical allocator: nothing to recover
13:32:33 3: I0828 18:32:33.614511 20801 containerizer.cpp:318] Using isolation 
{ environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
13:32:33 3: W0828 18:32:33.615756 20801 backend.cpp:76] Failed to create 
'overlay' backend: OverlayBackend requires root privileges
13:32:33 3: W0828 18:32:33.615855 20801 backend.cpp:76] Failed to create 'aufs' 
backend: AufsBackend requires root privileges
13:32:33 3: W0828 18:32:33.615934 20801 backend.cpp:76] Failed to create 'bind' 
backend: BindBackend requires root privileges
13:32:33 3: I0828 18:32:33.616178 20801 provisioner

[jira] [Commented] (MESOS-8983) SlaveRecoveryTest/0.PingTimeoutDuringRecovery is flaky

2019-04-16 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16819281#comment-16819281
 ] 

Andrei Budnik commented on MESOS-8983:
--------------------------------------

This test fails pretty often on ARM.

> SlaveRecoveryTest/0.PingTimeoutDuringRecovery is flaky
> ------------------------------------------------------
>
>                 Key: MESOS-8983
>                 URL: https://issues.apache.org/jira/browse/MESOS-8983
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.7.0, 1.8.0
>            Reporter: Alexander Rojas
>            Assignee: Joseph Wu
>            Priority: Major
>              Labels: flaky-test, foundations
>
> During an unrelated change in a PR, the Apache build bot reported the
> following error:
> {noformat}
> @   7FF71117D888  
> std::invoke<,process::Future
>  >,process::ProcessBase *>
> @   7FF71119257B  
> lambda::internal::Partial<,process::Future
>  >,std::_Ph<1> 
> >::invoke_expand<,std::tuple
>  >,std::_Ph<1> >,st
> @   7FF7110C08BA  ) @   7FF7110F058C  
> std::_Invoker_functor::_Call,process::Future
>  >,std::_Ph<1> >,process::ProcessBase *>
> @   7FF711183EBC  
> std::invoke,process::Future
>  >,std::_Ph<1> >,process::ProcessBase *>
> @   7FF7110C9F21  
> ),process::Future
>  >,std::_Ph<1> >,process::ProcessBase *
> @   7FF711236416  process::ProcessBase 
> *)>::CallableFn,process::Future
>  >,std::_Ph<1> > >::operator(
> @   7FF712C1A25D  process::ProcessBase *)>::operator(
> @   7FF712ACB2F9  process::ProcessBase::consume
> @   7FF712C738CA  process::DispatchEvent::consume
> @   7FF70ECE7B07  process::ProcessBase::serve
> @   7FF712AD93B0  process::ProcessManager::resume
> @   7FF712C07371   ?? 
> @   7FF712B2B130  
> std::_Invoker_functor::_Call< >
> @   7FF712B8B8E0  
> std::invoke< >
> @   7FF712B4076C  
> std::_LaunchPad
>  >,std::default_delete > 
> > > >::_Execute<0>
> @   7FF712C5A60A  
> std::_LaunchPad
>  >,std::default_delete > 
> > > >::_Run
> @   7FF712C45E78  
> std::_LaunchPad
>  >,std::default_delete > 
> > > >::_Go
> @   7FF712C2C3CD  std::_Pad::_Call_func
> @   7FFF9BE53428  _register_onexit_function
> @   7FFF9BE53071  _register_onexit_function
> @   7FFFB6391FE4  BaseThreadInitThunk
> @   7FFFB69FF061  RtlUserThreadStart
> ll containerizers
> I0606 10:25:26.680230 18356 slave.cpp:7158] Recovering executors
> I0606 10:25:26.680230 18356 slave.cpp:7182] Sending reconnect request to 
> executor '3f11d255-bb7b-4e99-967b-055fef95b595' of framework 
> 62cf792a-dc69-4e3c-b54f-d83f98fb9451- at executor(1)@192.10.1.5:55652
> I0606 10:25:26.688225 22560 slave.cpp:4984] Received re-registration message 
> from executor '3f11d255-bb7b-4e99-967b-055fef95b595' of framework 
> 62cf792a-dc69-4e3c-b54f-d83f98fb9451-
> I0606 10:25:26.691216 22888 slave.cpp:5901] No pings from master received 
> within 75secs
> F0606 10:25:26.692219 22888 slave.cpp:1249] Check failed: state == 
> DISCONNECTED || state == RUNNING || state == TERMINATING RECOVERING
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)