[jira] [Commented] (MESOS-9936) Slave recovery is very slow with large persistent local volumes (Marathon app)

2019-08-13 Thread Frédéric Comte (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906328#comment-16906328
 ] 

Frédéric Comte commented on MESOS-9936:
---

I am on CoreOS; I don't know how to do that.
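For reference, a minimal sketch of capturing the stack traces Vinod asked for (assumptions: gdb is available, e.g. from a toolbox container on CoreOS, and you have ptrace permission on the mesos-agent process; paths are illustrative):

```shell
# Write a gdb command file that dumps every thread's backtrace and detaches.
cat > /tmp/mesos-agent-bt.gdb <<'EOF'
set pagination off
thread apply all bt
detach
quit
EOF

# During one of the long recovery periods, attach to the agent (hypothetical
# invocation; not run here since no agent is running):
#   gdb -p "$(pidof mesos-agent)" -batch -x /tmp/mesos-agent-bt.gdb \
#     > /tmp/mesos-agent-bt.txt 2>&1
```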

> Slave recovery is very slow with large persistent local volumes (Marathon app)
> --
>
> Key: MESOS-9936
> URL: https://issues.apache.org/jira/browse/MESOS-9936
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.8.1
>Reporter: Frédéric Comte
>Priority: Major
>
> I run some applications with local persistent volumes.
> After an unplanned shutdown of nodes running these applications, I see that
> the Mesos agent recovery process takes a very long time (more than 8 hours).
> The time depends on the amount of data in those volumes.
> What does Mesos do during this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 docker.cpp:890] Recovering Docker containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] Recovering Mesos containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 linux_launcher.cpp:286] Recovering Linux launcher
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 containerizer.cpp:1127] Recovering isolators
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 containerizer.cpp:1166] Recovering provisioner
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 composing.cpp:339] Finished recovering all containerizers
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 status_update_manager_process.hpp:314] Recovering operation status update manager
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 slave.cpp:7729] Recovering executors
> {code}
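Reading the timestamps in the quoted log, nearly all of the time falls between "Recovering provisioner" (07:40:44) and "Finished recovering all containerizers" (14:42:10). A quick sketch of computing that gap (assumes GNU `date` for the `-d` timestamp parsing):

```shell
# The last fast step logs at 07:40:44; containerizer recovery finishes at 14:42:10.
start='2019-07-08 07:40:44'
end='2019-07-08 14:42:10'
elapsed=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
printf '%dh %dm %ds\n' $((elapsed / 3600)) $((elapsed % 3600 / 60)) $((elapsed % 60))
# Prints: 7h 1m 26s, i.e. essentially the whole recovery is spent inside
# containerizer recovery, consistent with the reported multi-hour stalls.
```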



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-13 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906473#comment-16906473
 ] 

Vinod Kone commented on MESOS-9545:
---

[~greggomann] Let's backport this to older releases.

> Marking an unreachable agent as gone should transition the tasks to terminal 
> state
> --
>
> Key: MESOS-9545
> URL: https://issues.apache.org/jira/browse/MESOS-9545
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
> Fix For: 1.9.0
>
>
> If an unreachable agent is marked as gone, currently master just marks that 
> agent in the registry but doesn't do anything about its tasks. So the tasks 
> are in UNREACHABLE state in the master forever, until the master fails over. 
> This is not great UX. We should transition these to terminal state instead.
> This fix should also include a test to verify.





[jira] [Assigned] (MESOS-9937) 53598228fe should be backported to 1.7.x

2019-08-13 Thread Benjamin Mahler (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9937:
--

Assignee: Greg Mann
Priority: Blocker  (was: Major)
Target Version/s: 1.7.3

Marking as a blocker for the next 1.7.x release. Greg, please reassign if 
someone else can pick this up.
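For anyone picking this up, the usual backport flow is a `git cherry-pick -x` onto the release branch. A toy demonstration on a throwaway repository (branch names and commits are illustrative, not the real Mesos history):

```shell
# A fix lands on the development branch, then is cherry-picked onto the
# 1.7.x release branch with -x so the message records the original hash.
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email 'dev@example.com'
git config user.name 'dev'
echo 'base' > file.txt
git add file.txt
git commit -qm 'Initial commit.'
git branch 1.7.x                 # the release branch forks here

echo 'fix' >> file.txt           # the fix lands on the development branch only
git commit -qam 'Fixed the slow memory growth.'
fix_sha="$(git rev-parse HEAD)"

git checkout -q 1.7.x
git cherry-pick -x "$fix_sha"    # backport; appends "(cherry picked from commit ...)"
git log -1 --pretty=%B
```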

> 53598228fe should be backported to 1.7.x
> 
>
> Key: MESOS-9937
> URL: https://issues.apache.org/jira/browse/MESOS-9937
> Project: Mesos
>  Issue Type: Bug
>Reporter: longfei
>Assignee: Greg Mann
>Priority: Blocker
>
> Commit 53598228fe on the master branch should be backported to 1.7.x. 
>  





[jira] [Commented] (MESOS-9936) Slave recovery is very slow with large persistent local volumes (Marathon app)

2019-08-13 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906276#comment-16906276
 ] 

Vinod Kone commented on MESOS-9936:
---

[~Fcomte] That's pretty weird and unexpected. Can you share a gdb stack trace 
taken during one of these long recovery periods?

> Slave recovery is very slow with large persistent local volumes (Marathon app)
> --
>
> Key: MESOS-9936
> URL: https://issues.apache.org/jira/browse/MESOS-9936
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.8.1
>Reporter: Frédéric Comte
>Priority: Major
>
> I run some applications with local persistent volumes.
> After an unplanned shutdown of nodes running these applications, I see that
> the Mesos agent recovery process takes a very long time (more than 8 hours).
> The time depends on the amount of data in those volumes.
> What does Mesos do during this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 docker.cpp:890] Recovering Docker containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] Recovering Mesos containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 linux_launcher.cpp:286] Recovering Linux launcher
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 containerizer.cpp:1127] Recovering isolators
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 containerizer.cpp:1166] Recovering provisioner
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 composing.cpp:339] Finished recovering all containerizers
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 status_update_manager_process.hpp:314] Recovering operation status update manager
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 slave.cpp:7729] Recovering executors
> {code}





[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-13 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906587#comment-16906587
 ] 

Greg Mann commented on MESOS-9545:
--

[~vinodkone] thanks for the ping. I have these backports in progress but got 
distracted; I will make this happen this week.

> Marking an unreachable agent as gone should transition the tasks to terminal 
> state
> --
>
> Key: MESOS-9545
> URL: https://issues.apache.org/jira/browse/MESOS-9545
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations
> Fix For: 1.9.0
>
>
> If an unreachable agent is marked as gone, currently master just marks that 
> agent in the registry but doesn't do anything about its tasks. So the tasks 
> are in UNREACHABLE state in the master forever, until the master fails over. 
> This is not great UX. We should transition these to terminal state instead.
> This fix should also include a test to verify.





[jira] [Commented] (MESOS-9669) Deprecate v0 quota calls.

2019-08-13 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906539#comment-16906539
 ] 

Benjamin Mahler commented on MESOS-9669:


The new quota documentation from MESOS-9427 hides the /quota endpoint.

We can mark it as deprecated with comments in the code as well as in the help 
string before closing this.
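For context, a sketch of the two call styles (hedged: the exact v1 call name and `QuotaConfig` fields are taken from the MESOS-8068 work and may differ by release; `master:5050` and the role/values are placeholders):

```shell
# The deprecated v0 interface is a plain HTTP endpoint on the master:
#   curl -s "http://master:5050/quota"
#
# The v1 replacement goes through the operator API. Build a request body:
cat > /tmp/update_quota.json <<'EOF'
{
  "type": "UPDATE_QUOTA",
  "update_quota": {
    "force": false,
    "quota_configs": [
      {
        "role": "role1",
        "guarantees": {"cpus": {"value": 1.0}, "mem": {"value": 1024.0}}
      }
    ]
  }
}
EOF
# Then, against a live master (not run here):
#   curl -s -X POST "http://master:5050/api/v1" \
#     -H 'Content-Type: application/json' -d @/tmp/update_quota.json
```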

> Deprecate v0 quota calls.
> -
>
> Key: MESOS-9669
> URL: https://issues.apache.org/jira/browse/MESOS-9669
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Meng Zhu
>Priority: Major
>  Labels: mesosphere, resource-management
>
> Once we introduce the new quota APIs in MESOS-8068, we should deprecate the 
> `/quota` endpoint. We should mark this as deprecated and hide it in our 
> documentation.





[jira] [Assigned] (MESOS-9669) Deprecate v0 quota calls.

2019-08-13 Thread Benjamin Mahler (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9669:
--

Assignee: Benjamin Mahler

> Deprecate v0 quota calls.
> -
>
> Key: MESOS-9669
> URL: https://issues.apache.org/jira/browse/MESOS-9669
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Meng Zhu
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: mesosphere, resource-management
>
> Once we introduce the new quota APIs in MESOS-8068, we should deprecate the 
> `/quota` endpoint. We should mark this as deprecated and hide it in our 
> documentation.





[jira] [Commented] (MESOS-9938) Standalone container documentation

2019-08-13 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906579#comment-16906579
 ] 

Greg Mann commented on MESOS-9938:
--

Review here: https://reviews.apache.org/r/65112/

> Standalone container documentation
> --
>
> Key: MESOS-9938
> URL: https://issues.apache.org/jira/browse/MESOS-9938
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
>
> We should add documentation for standalone containers.





[jira] [Assigned] (MESOS-9758) Take ports out of the GET_ROLES endpoints.

2019-08-13 Thread Benjamin Mahler (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-9758:
--

Assignee: Benjamin Mahler

> Take ports out of the GET_ROLES endpoints.
> --
>
> Key: MESOS-9758
> URL: https://issues.apache.org/jira/browse/MESOS-9758
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Meng Zhu
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: resource-management
>
> It does not make sense to combine ports across agents.





[jira] [Commented] (MESOS-9937) 53598228fe should be backported to 1.7.x

2019-08-13 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906559#comment-16906559
 ] 

Greg Mann commented on MESOS-9937:
--

[~carlone] good timing! I was already planning to backport that commit as part 
of backporting MESOS-9545, which I previously overlooked. It should happen in 
the next couple of days.

> 53598228fe should be backported to 1.7.x
> 
>
> Key: MESOS-9937
> URL: https://issues.apache.org/jira/browse/MESOS-9937
> Project: Mesos
>  Issue Type: Bug
>Reporter: longfei
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: foundations
>
> Commit 53598228fe on the master branch should be backported to 1.7.x. 
>  





[jira] [Assigned] (MESOS-9938) Standalone container documentation

2019-08-13 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9938:


Assignee: Joseph Wu

> Standalone container documentation
> --
>
> Key: MESOS-9938
> URL: https://issues.apache.org/jira/browse/MESOS-9938
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
>
> We should add documentation for standalone containers.





[jira] [Created] (MESOS-9938) Standalone container documentation

2019-08-13 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9938:


 Summary: Standalone container documentation
 Key: MESOS-9938
 URL: https://issues.apache.org/jira/browse/MESOS-9938
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Greg Mann


We should add documentation for standalone containers.





[jira] [Created] (MESOS-9939) PersistentVolumeEndpointsTest.DynamicReservation is flaky.

2019-08-13 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-9939:
--

 Summary: PersistentVolumeEndpointsTest.DynamicReservation is flaky.
 Key: MESOS-9939
 URL: https://issues.apache.org/jira/browse/MESOS-9939
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Mahler


{noformat}
[ RUN  ] PersistentVolumeEndpointsTest.DynamicReservation
I0813 20:55:33.670486 32445 cluster.cpp:177] Creating default 'local' authorizer
I0813 20:55:33.674396 32457 master.cpp:440] Master 
87e437ee-0796-49fd-bfab-e7866bb7a81d (6c6cd7a3b2c1) started on 172.17.0.2:36761
I0813 20:55:33.674434 32457 master.cpp:443] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1000secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/9zz3CO/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_operator_event_stream_subscribers="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --roles="role1" 
--root_submissions="true" --version="false" 
--webui_dir="/tmp/SRC/build/mesos-1.9.0/_inst/share/mesos/webui" 
--work_dir="/tmp/9zz3CO/master" --zk_session_timeout="10secs"
I0813 20:55:33.674772 32457 master.cpp:492] Master only allowing authenticated 
frameworks to register
I0813 20:55:33.674784 32457 master.cpp:498] Master only allowing authenticated 
agents to register
I0813 20:55:33.674793 32457 master.cpp:504] Master only allowing authenticated 
HTTP frameworks to register
I0813 20:55:33.674800 32457 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/9zz3CO/credentials'
I0813 20:55:33.675024 32457 master.cpp:548] Using default 'crammd5' 
authenticator
I0813 20:55:33.675189 32457 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0813 20:55:33.675369 32457 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0813 20:55:33.675529 32457 http.cpp:975] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0813 20:55:33.675685 32457 master.cpp:629] Authorization enabled
W0813 20:55:33.675709 32457 master.cpp:692] The '--roles' flag is deprecated. 
This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
more information
I0813 20:55:33.676091 32460 whitelist_watcher.cpp:77] No whitelist given
I0813 20:55:33.676143 32455 hierarchical.cpp:241] Initialized hierarchical 
allocator process
I0813 20:55:33.678655 32452 master.cpp:2168] Elected as the leading master!
I0813 20:55:33.678683 32452 master.cpp:1664] Recovering from registrar
I0813 20:55:33.678833 32454 registrar.cpp:339] Recovering registrar
I0813 20:55:33.679450 32454 registrar.cpp:383] Successfully fetched the 
registry (0B) in 576us
I0813 20:55:33.679579 32454 registrar.cpp:487] Applied 1 operations in 46310ns; 
attempting to update the registry
I0813 20:55:33.680164 32454 registrar.cpp:544] Successfully updated the 
registry in 525824ns
I0813 20:55:33.680292 32454 registrar.cpp:416] Successfully recovered registrar
I0813 20:55:33.680759 32447 master.cpp:1817] Recovered 0 agents from the 
registry (143B); allowing 10mins for agents to reregister
I0813 20:55:33.680793 32459 hierarchical.cpp:280] Skipping recovery of 
hierarchical allocator: nothing to recover
W0813 20:55:33.687850 32445 process.cpp:2877] Attempted to spawn already 
running process files@172.17.0.2:36761
I0813 20:55:33.689188 32445 containerizer.cpp:318] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
W0813 20:55:33.689808 32445 backend.cpp:76] Failed to create 'overlay' backend: 
OverlayBackend requires root privileges
W0813 20:55:33.689841 32445 backend.cpp:76] Failed to create 'aufs' backend: 
AufsBackend requires root privileges
W0813 20:55:33.689865 32445 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root 

[jira] [Comment Edited] (MESOS-9560) ContentType/AgentAPITest.MarkResourceProviderGone/1 is flaky

2019-08-13 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905216#comment-16905216
 ] 

Benjamin Bannier edited comment on MESOS-9560 at 8/13/19 9:29 AM:
--

Reviews:
[https://reviews.apache.org/r/71272/]
[https://reviews.apache.org/r/71277/]


was (Author: bbannier):
Review: https://reviews.apache.org/r/71272/

> ContentType/AgentAPITest.MarkResourceProviderGone/1 is flaky
> 
>
> Key: MESOS-9560
> URL: https://issues.apache.org/jira/browse/MESOS-9560
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Critical
>  Labels: flaky, flaky-test, mesosphere, storage, test
> Fix For: 1.9.0
>
> Attachments: consoleText.txt
>
>
> We observed a segfault in 
> {{ContentType/AgentAPITest.MarkResourceProviderGone/1}} on test teardown.
> {noformat}
> I0131 23:55:59.378453  6798 slave.cpp:923] Agent terminating
> I0131 23:55:59.378813 31143 master.cpp:1269] Agent 
> a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 
> (ip-172-16-10-236.ec2.internal) disconnected
> I0131 23:55:59.378831 31143 master.cpp:3272] Disconnecting agent 
> a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 
> (ip-172-16-10-236.ec2.internal)
> I0131 23:55:59.378846 31143 master.cpp:3291] Deactivating agent 
> a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 
> (ip-172-16-10-236.ec2.internal)
> I0131 23:55:59.378891 31143 hierarchical.cpp:793] Agent 
> a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 deactivated
> F0131 23:55:59.378891 31149 logging.cpp:67] RAW: Pure virtual method called
> @ 0x7f633aaaebdd  google::LogMessage::Fail()
> @ 0x7f633aab6281  google::RawLog__()
> @ 0x7f6339821262  __cxa_pure_virtual
> @ 0x55671cacc113  
> testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith()
> @ 0x55671b532e78  
> mesos::internal::tests::resource_provider::MockResourceProvider<>::disconnected()
> @ 0x7f633978f6b0  process::AsyncExecutorProcess::execute<>()
> @ 0x7f633979f218  
> _ZN5cpp176invokeIZN7process8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEES9_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSE_FSB_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteISP_EEOS7_PNS1_11ProcessBaseEE_JSS_S7_SV_EEEDTclcl7forwardISB_Efp_Espcl7forwardIT0_Efp0_EEEOSB_DpOSX_
> @ 0x7f633a9f5d01  process::ProcessBase::consume()
> @ 0x7f633aa1a08a  process::ProcessManager::resume()
> @ 0x7f633aa1db06  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7f633acc9f80  execute_native_thread_routine
> @ 0x7f6337142e25  start_thread
> @ 0x7f6336241bad  __clone
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9936) Slave recovery is very slow with large persistent local volumes (Marathon app)

2019-08-13 Thread Frédéric Comte (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906181#comment-16906181
 ] 

Frédéric Comte commented on MESOS-9936:
---

I am using DC/OS v1.13.3, so Mesos is 1.8.1.

> Slave recovery is very slow with large persistent local volumes (Marathon app)
> --
>
> Key: MESOS-9936
> URL: https://issues.apache.org/jira/browse/MESOS-9936
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Frédéric Comte
>Priority: Major
>
> I run some applications with local persistent volumes.
> After an unplanned shutdown of nodes running these applications, I see that
> the Mesos agent recovery process takes a very long time (more than 8 hours).
> The time depends on the amount of data in those volumes.
> What does Mesos do during this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 docker.cpp:890] Recovering Docker containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] Recovering Mesos containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 linux_launcher.cpp:286] Recovering Linux launcher
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 containerizer.cpp:1127] Recovering isolators
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 containerizer.cpp:1166] Recovering provisioner
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 composing.cpp:339] Finished recovering all containerizers
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 status_update_manager_process.hpp:314] Recovering operation status update manager
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 slave.cpp:7729] Recovering executors
> {code}





[jira] [Created] (MESOS-9937) 53598228fe should be backported to 1.7.x

2019-08-13 Thread longfei (JIRA)
longfei created MESOS-9937:
--

 Summary: 53598228fe should be backported to 1.7.x
 Key: MESOS-9937
 URL: https://issues.apache.org/jira/browse/MESOS-9937
 Project: Mesos
  Issue Type: Bug
Reporter: longfei


Commit 53598228fe on the master branch should be backported to 1.7.x. 

 





[jira] [Commented] (MESOS-9937) 53598228fe should be backported to 1.7.x

2019-08-13 Thread longfei (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906197#comment-16906197
 ] 

longfei commented on MESOS-9937:


Hi [~greggomann], would you please backport commit 53598228fe to 1.7.x?

 

> 53598228fe should be backported to 1.7.x
> 
>
> Key: MESOS-9937
> URL: https://issues.apache.org/jira/browse/MESOS-9937
> Project: Mesos
>  Issue Type: Bug
>Reporter: longfei
>Priority: Major
>
> Commit 53598228fe on the master branch should be backported to 1.7.x. 
>  





[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.

2019-08-13 Thread longfei (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906210#comment-16906210
 ] 

longfei commented on MESOS-9852:


Yes. It's another memory-leak issue, which has been fixed in commit 53598228fe 
but not backported to 1.7.x.

I started a new ticket MESOS-9937 to track it.

> Slow memory growth in master due to deferred deletion of offer filters and 
> timers.
> --
>
> Key: MESOS-9852
> URL: https://issues.apache.org/jira/browse/MESOS-9852
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: resource-management
> Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0
>
> Attachments: _tmp_libprocess.Do1MrG_profile (1).dump, 
> _tmp_libprocess.Do1MrG_profile (1).svg, _tmp_libprocess.Do1MrG_profile 
> 24hours.dump, _tmp_libprocess.Do1MrG_profile 24hours.svg, screenshot-1.png, 
> statistics
>
>
> The allocator does not keep a handle to the offer filter timer, which means 
> it cannot remove the timer overhead (in this case memory) when removing the 
> offer filter earlier (e.g. due to revive):
> https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352
> In addition, the offer filter is allocated on the heap but not deleted until 
> the timer fires (which might take forever!):
> https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321
> https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413
> https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249
> We'll need to try to backport this to all active release branches.





[jira] [Commented] (MESOS-8808) CSI documentation has a broken link to a non-existent page.

2019-08-13 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906128#comment-16906128
 ] 

Benjamin Bannier commented on MESOS-8808:
-

[~joseph], is there anything we can help with to get 
[https://reviews.apache.org/r/65112/] over the finish line?

> CSI documentation has a broken link to a non-existent page.
> ---
>
> Key: MESOS-8808
> URL: https://issues.apache.org/jira/browse/MESOS-8808
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation, storage
>Affects Versions: 1.5.0
>Reporter: Gastón Kleiman
>Priority: Major
>  Labels: csi, documentation, mesosphere
>
> There's a broken link to a non-existent {{resource-provider.md}} document 
> here: https://mesos.apache.org/documentation/latest/csi/#resource-providers





[jira] [Comment Edited] (MESOS-8808) CSI documentation has a broken link to a non-existent page.

2019-08-13 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906128#comment-16906128
 ] 

Benjamin Bannier edited comment on MESOS-8808 at 8/13/19 12:08 PM:
---

[~kaysoky], is there anything we can help with to get 
[https://reviews.apache.org/r/65112/] over the finish line?


was (Author: bbannier):
[~joseph], is there anything we can help with to get 
[https://reviews.apache.org/r/65112/] over the finish line?

> CSI documentation has a broken link to a non-existent page.
> ---
>
> Key: MESOS-8808
> URL: https://issues.apache.org/jira/browse/MESOS-8808
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation, storage
>Affects Versions: 1.5.0
>Reporter: Gastón Kleiman
>Priority: Major
>  Labels: csi, documentation, mesosphere
>
> There's a broken link to a non-existent {{resource-provider.md}} document 
> here: https://mesos.apache.org/documentation/latest/csi/#resource-providers





[jira] [Created] (MESOS-9936) Slave recovery is very slow with large persistent local volumes (Marathon app)

2019-08-13 Thread Frédéric Comte (JIRA)
Frédéric Comte created MESOS-9936:
-

 Summary: Slave recovery is very slow with large persistent local volumes (Marathon app)
 Key: MESOS-9936
 URL: https://issues.apache.org/jira/browse/MESOS-9936
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Frédéric Comte


I run some applications with local persistent volumes.

After an unplanned shutdown of nodes running these applications, I see that 
the Mesos agent recovery process takes a very long time (more than 8 hours).

The time depends on the amount of data in those volumes.

What does Mesos do during this process?
{code:java}
Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 docker.cpp:890] Recovering Docker containers
Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] Recovering Mesos containers
Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 linux_launcher.cpp:286] Recovering Linux launcher
Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 containerizer.cpp:1127] Recovering isolators
Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 containerizer.cpp:1166] Recovering provisioner
Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 composing.cpp:339] Finished recovering all containerizers
Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 status_update_manager_process.hpp:314] Recovering operation status update manager
Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 slave.cpp:7729] Recovering executors
{code}





[jira] [Commented] (MESOS-9936) Slave recovery is very slow with large persistent local volumes (Marathon app)

2019-08-13 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906165#comment-16906165
 ] 

Andrei Budnik commented on MESOS-9936:
--

[~Fcomte], what version of Mesos are you using?

> Slave recovery is very slow with large persistent local volumes (Marathon app)
> --
>
> Key: MESOS-9936
> URL: https://issues.apache.org/jira/browse/MESOS-9936
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Frédéric Comte
>Priority: Major
>
> I run some applications with local persistent volumes.
> After an unplanned shutdown of nodes running these applications, I see that
> the Mesos agent recovery process takes a very long time (more than 8 hours).
> The time depends on the amount of data in those volumes.
> What does Mesos do during this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 docker.cpp:890] Recovering Docker containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] Recovering Mesos containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 linux_launcher.cpp:286] Recovering Linux launcher
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 containerizer.cpp:1127] Recovering isolators
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 containerizer.cpp:1166] Recovering provisioner
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 composing.cpp:339] Finished recovering all containerizers
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 status_update_manager_process.hpp:314] Recovering operation status update manager
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 slave.cpp:7729] Recovering executors
> {code}


