[jira] [Commented] (MESOS-9936) Slave recovery is very slow with high local volume persistant ( marathon app )
[ https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906328#comment-16906328 ]

Frédéric Comte commented on MESOS-9936:
---------------------------------------

I am on CoreOS; I don't know how I can do that.

> Slave recovery is very slow with high local volume persistant ( marathon app )
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-9936
>                 URL: https://issues.apache.org/jira/browse/MESOS-9936
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.8.1
>            Reporter: Frédéric Comte
>            Priority: Major
>
> I run some local persistent applications. After an unplanned shutdown of the nodes running these applications, I see that the Mesos recovery process takes a very long time (more than 8 hours). The time depends on the amount of data in those volumes. What does Mesos do in this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 docker.cpp:890] Recovering Docker containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] Recovering Mesos containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 linux_launcher.cpp:286] Recovering Linux launcher
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 containerizer.cpp:1127] Recovering isolators
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 containerizer.cpp:1166] Recovering provisioner
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 composing.cpp:339] Finished recovering all containerizers
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 status_update_manager_process.hpp:314] Recovering operation status update manager
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 slave.cpp:7729] Recovering executors
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
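For reference, one hedged way to capture the requested stack trace on CoreOS Container Linux: the host image ships no debugger, but the `toolbox` utility starts a Fedora-based container in which gdb can be installed. The helper below is an illustrative sketch (process name and output path are assumptions, and whether gdb inside toolbox can see host PIDs depends on your toolbox configuration), not an official procedure.

```shell
# On the CoreOS host:
#   toolbox             # enter a Fedora-based debugging container
#   dnf install -y gdb  # install gdb inside it
#
# Then dump every thread's stack of the running agent. gdb's batch mode
# attaches, runs the given command, and detaches without killing the process.
dump_backtraces() {
  local pid
  pid=$(pgrep -o -f "$1")   # oldest process whose command line matches
  if [ -z "$pid" ]; then
    echo "no process matching '$1'" >&2
    return 1
  fi
  gdb --batch -p "$pid" -ex 'thread apply all bt'
}

# Usage (output path is an example):
#   dump_backtraces mesos-agent > /tmp/mesos-agent-bt.txt
```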
[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state
[ https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906473#comment-16906473 ]

Vinod Kone commented on MESOS-9545:
-----------------------------------

[~greggomann] Let's backport this to older releases.

> Marking an unreachable agent as gone should transition the tasks to terminal state
> ----------------------------------------------------------------------------------
>
>                 Key: MESOS-9545
>                 URL: https://issues.apache.org/jira/browse/MESOS-9545
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Vinod Kone
>            Assignee: Greg Mann
>            Priority: Major
>              Labels: foundations
>             Fix For: 1.9.0
>
> If an unreachable agent is marked as gone, the master currently just marks that agent in the registry and does nothing about its tasks. So the tasks stay in the UNREACHABLE state in the master forever, until the master fails over. This is not great UX; we should transition these tasks to a terminal state instead. The fix should also include a test to verify the new behavior.
[jira] [Assigned] (MESOS-9937) 53598228fe should be backported to 1.7.x
[ https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-9937:
--------------------------------------

            Assignee: Greg Mann
            Priority: Blocker  (was: Major)
    Target Version/s: 1.7.3

Marking as a blocker for the next 1.7.x release. Greg, please reassign if someone else can pick this up.

> 53598228fe should be backported to 1.7.x
> ----------------------------------------
>
>                 Key: MESOS-9937
>                 URL: https://issues.apache.org/jira/browse/MESOS-9937
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: longfei
>            Assignee: Greg Mann
>            Priority: Blocker
>
> Commit 53598228fe on the master branch should be backported to 1.7.x.
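For readers unfamiliar with the mechanics, a backport of a single commit boils down to a `git cherry-pick -x` onto the release branch. The self-contained demo below uses a throwaway repository with illustrative branch and commit names; it is not the project's actual release workflow, just the core operation.

```shell
set -e
# Throwaway repo: a development branch plus a "1.7.x" release branch that
# forked before the fix landed.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name Dev
echo base > file.txt
git add file.txt && git commit -qm 'initial commit'
git branch 1.7.x                    # release branch forks here

echo fix >> file.txt
git commit -qam 'fix memory leak'   # the fix lands on the development branch
fix_sha=$(git rev-parse HEAD)

# Carry the fix to the release branch. `-x` appends a
# "(cherry picked from commit <sha>)" line for provenance.
git checkout -q 1.7.x
git cherry-pick -x "$fix_sha"
git log -1 --format=%B
```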
[jira] [Commented] (MESOS-9936) Slave recovery is very slow with high local volume persistant ( marathon app )
[ https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906276#comment-16906276 ]

Vinod Kone commented on MESOS-9936:
-----------------------------------

[~Fcomte] That's pretty weird and unexpected. Can you share a gdb stack trace captured during one of these long recovery periods?

> Slave recovery is very slow with high local volume persistant ( marathon app )
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-9936
>                 URL: https://issues.apache.org/jira/browse/MESOS-9936
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.8.1
>            Reporter: Frédéric Comte
>            Priority: Major
>
> I run some local persistent applications. After an unplanned shutdown of the nodes running these applications, I see that the Mesos recovery process takes a very long time (more than 8 hours). The time depends on the amount of data in those volumes. What does Mesos do in this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 docker.cpp:890] Recovering Docker containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] Recovering Mesos containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 linux_launcher.cpp:286] Recovering Linux launcher
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 containerizer.cpp:1127] Recovering isolators
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 containerizer.cpp:1166] Recovering provisioner
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 composing.cpp:339] Finished recovering all containerizers
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 status_update_manager_process.hpp:314] Recovering operation status update manager
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 slave.cpp:7729] Recovering executors
> {code}
[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state
[ https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906587#comment-16906587 ]

Greg Mann commented on MESOS-9545:
----------------------------------

[~vinodkone] thanks for the ping. I have these backports in progress but got distracted; I will make this happen this week.

> Marking an unreachable agent as gone should transition the tasks to terminal state
> ----------------------------------------------------------------------------------
>
>                 Key: MESOS-9545
>                 URL: https://issues.apache.org/jira/browse/MESOS-9545
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Vinod Kone
>            Assignee: Greg Mann
>            Priority: Major
>              Labels: foundations
>             Fix For: 1.9.0
>
> If an unreachable agent is marked as gone, the master currently just marks that agent in the registry and does nothing about its tasks. So the tasks stay in the UNREACHABLE state in the master forever, until the master fails over. This is not great UX; we should transition these tasks to a terminal state instead. The fix should also include a test to verify the new behavior.
[jira] [Commented] (MESOS-9669) Deprecate v0 quota calls.
[ https://issues.apache.org/jira/browse/MESOS-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906539#comment-16906539 ]

Benjamin Mahler commented on MESOS-9669:
----------------------------------------

The new quota documentation from MESOS-9427 hides the /quota endpoint. We can mark it as deprecated with comments in the code as well as in the help string before closing this.

> Deprecate v0 quota calls.
> -------------------------
>
>                 Key: MESOS-9669
>                 URL: https://issues.apache.org/jira/browse/MESOS-9669
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Meng Zhu
>            Priority: Major
>              Labels: mesosphere, resource-management
>
> Once we introduce the new quota APIs in MESOS-8068, we should deprecate the `/quota` endpoint. We should mark it as deprecated and hide it in our documentation.
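As an operator-facing sketch of what this deprecation means, the calls below contrast the bespoke v0 endpoint with a v1 operator-API request. The master address is a placeholder, and the `GET_QUOTA` call shape reflects my reading of the v1 operator API rather than this ticket; verify it against the documentation for your Mesos version.

```shell
# Placeholder master address; override via MESOS_MASTER.
MASTER="${MESOS_MASTER:-http://127.0.0.1:5050}"
payload='{"type": "GET_QUOTA"}'

# v0 (deprecated): bespoke /quota endpoint.
curl -s "$MASTER/quota" || true

# v1: the same information through the operator API, a single POST
# endpoint dispatching on the "type" field of a JSON body.
curl -s -X POST "$MASTER/api/v1" \
  -H 'Content-Type: application/json' \
  -d "$payload" || true
```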
[jira] [Assigned] (MESOS-9669) Deprecate v0 quota calls.
[ https://issues.apache.org/jira/browse/MESOS-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-9669:
--------------------------------------

            Assignee: Benjamin Mahler

> Deprecate v0 quota calls.
> -------------------------
>
>                 Key: MESOS-9669
>                 URL: https://issues.apache.org/jira/browse/MESOS-9669
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Meng Zhu
>            Assignee: Benjamin Mahler
>            Priority: Major
>              Labels: mesosphere, resource-management
>
> Once we introduce the new quota APIs in MESOS-8068, we should deprecate the `/quota` endpoint. We should mark it as deprecated and hide it in our documentation.
[jira] [Commented] (MESOS-9938) Standalone container documentation
[ https://issues.apache.org/jira/browse/MESOS-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906579#comment-16906579 ]

Greg Mann commented on MESOS-9938:
----------------------------------

Review here: https://reviews.apache.org/r/65112/

> Standalone container documentation
> ----------------------------------
>
>                 Key: MESOS-9938
>                 URL: https://issues.apache.org/jira/browse/MESOS-9938
>             Project: Mesos
>          Issue Type: Documentation
>          Components: documentation
>            Reporter: Greg Mann
>            Assignee: Joseph Wu
>            Priority: Major
>              Labels: foundations, mesosphere
>
> We should add documentation for standalone containers.
[jira] [Assigned] (MESOS-9758) Take ports out of the GET_ROLES endpoints.
[ https://issues.apache.org/jira/browse/MESOS-9758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-9758:
--------------------------------------

            Assignee: Benjamin Mahler

> Take ports out of the GET_ROLES endpoints.
> ------------------------------------------
>
>                 Key: MESOS-9758
>                 URL: https://issues.apache.org/jira/browse/MESOS-9758
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Meng Zhu
>            Assignee: Benjamin Mahler
>            Priority: Major
>              Labels: resource-management
>
> It does not make sense to combine ports across agents.
[jira] [Commented] (MESOS-9937) 53598228fe should be backported to 1.7.x
[ https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906559#comment-16906559 ]

Greg Mann commented on MESOS-9937:
----------------------------------

[~carlone] good timing! I was already planning to backport that commit as part of backporting MESOS-9545, which I had previously overlooked. It should happen in the next couple of days.

> 53598228fe should be backported to 1.7.x
> ----------------------------------------
>
>                 Key: MESOS-9937
>                 URL: https://issues.apache.org/jira/browse/MESOS-9937
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: longfei
>            Assignee: Greg Mann
>            Priority: Blocker
>              Labels: foundations
>
> Commit 53598228fe on the master branch should be backported to 1.7.x.
[jira] [Assigned] (MESOS-9938) Standalone container documentation
[ https://issues.apache.org/jira/browse/MESOS-9938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann reassigned MESOS-9938:
--------------------------------

            Assignee: Joseph Wu

> Standalone container documentation
> ----------------------------------
>
>                 Key: MESOS-9938
>                 URL: https://issues.apache.org/jira/browse/MESOS-9938
>             Project: Mesos
>          Issue Type: Documentation
>          Components: documentation
>            Reporter: Greg Mann
>            Assignee: Joseph Wu
>            Priority: Major
>              Labels: foundations, mesosphere
>
> We should add documentation for standalone containers.
[jira] [Created] (MESOS-9938) Standalone container documentation
Greg Mann created MESOS-9938:
--------------------------------

             Summary: Standalone container documentation
                 Key: MESOS-9938
                 URL: https://issues.apache.org/jira/browse/MESOS-9938
             Project: Mesos
          Issue Type: Documentation
          Components: documentation
            Reporter: Greg Mann

We should add documentation for standalone containers.
[jira] [Created] (MESOS-9939) PersistentVolumeEndpointsTest.DynamicReservation is flaky.
Benjamin Mahler created MESOS-9939:
--------------------------------------

             Summary: PersistentVolumeEndpointsTest.DynamicReservation is flaky.
                 Key: MESOS-9939
                 URL: https://issues.apache.org/jira/browse/MESOS-9939
             Project: Mesos
          Issue Type: Bug
            Reporter: Benjamin Mahler

{noformat}
[ RUN      ] PersistentVolumeEndpointsTest.DynamicReservation
I0813 20:55:33.670486 32445 cluster.cpp:177] Creating default 'local' authorizer
I0813 20:55:33.674396 32457 master.cpp:440] Master 87e437ee-0796-49fd-bfab-e7866bb7a81d (6c6cd7a3b2c1) started on 172.17.0.2:36761
I0813 20:55:33.674434 32457 master.cpp:443] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1000secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/9zz3CO/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --roles="role1" --root_submissions="true" --version="false" --webui_dir="/tmp/SRC/build/mesos-1.9.0/_inst/share/mesos/webui" --work_dir="/tmp/9zz3CO/master" --zk_session_timeout="10secs"
I0813 20:55:33.674772 32457 master.cpp:492] Master only allowing authenticated frameworks to register
I0813 20:55:33.674784 32457 master.cpp:498] Master only allowing authenticated agents to register
I0813 20:55:33.674793 32457 master.cpp:504] Master only allowing authenticated HTTP frameworks to register
I0813 20:55:33.674800 32457 credentials.hpp:37] Loading credentials for authentication from '/tmp/9zz3CO/credentials'
I0813 20:55:33.675024 32457 master.cpp:548] Using default 'crammd5' authenticator
I0813 20:55:33.675189 32457 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
I0813 20:55:33.675369 32457 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
I0813 20:55:33.675529 32457 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
I0813 20:55:33.675685 32457 master.cpp:629] Authorization enabled
W0813 20:55:33.675709 32457 master.cpp:692] The '--roles' flag is deprecated. This flag will be removed in the future. See the Mesos 0.27 upgrade notes for more information
I0813 20:55:33.676091 32460 whitelist_watcher.cpp:77] No whitelist given
I0813 20:55:33.676143 32455 hierarchical.cpp:241] Initialized hierarchical allocator process
I0813 20:55:33.678655 32452 master.cpp:2168] Elected as the leading master!
I0813 20:55:33.678683 32452 master.cpp:1664] Recovering from registrar
I0813 20:55:33.678833 32454 registrar.cpp:339] Recovering registrar
I0813 20:55:33.679450 32454 registrar.cpp:383] Successfully fetched the registry (0B) in 576us
I0813 20:55:33.679579 32454 registrar.cpp:487] Applied 1 operations in 46310ns; attempting to update the registry
I0813 20:55:33.680164 32454 registrar.cpp:544] Successfully updated the registry in 525824ns
I0813 20:55:33.680292 32454 registrar.cpp:416] Successfully recovered registrar
I0813 20:55:33.680759 32447 master.cpp:1817] Recovered 0 agents from the registry (143B); allowing 10mins for agents to reregister
I0813 20:55:33.680793 32459 hierarchical.cpp:280] Skipping recovery of hierarchical allocator: nothing to recover
W0813 20:55:33.687850 32445 process.cpp:2877] Attempted to spawn already running process files@172.17.0.2:36761
I0813 20:55:33.689188 32445 containerizer.cpp:318] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
W0813 20:55:33.689808 32445 backend.cpp:76] Failed to create 'overlay' backend: OverlayBackend requires root privileges
W0813 20:55:33.689841 32445 backend.cpp:76] Failed to create 'aufs' backend: AufsBackend requires root privileges
W0813 20:55:33.689865 32445 backend.cpp:76] Failed to create 'bind' backend: BindBackend requires root
[jira] [Comment Edited] (MESOS-9560) ContentType/AgentAPITest.MarkResourceProviderGone/1 is flaky
[ https://issues.apache.org/jira/browse/MESOS-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905216#comment-16905216 ]

Benjamin Bannier edited comment on MESOS-9560 at 8/13/19 9:29 AM:
------------------------------------------------------------------

Reviews:
[https://reviews.apache.org/r/71272/]
[https://reviews.apache.org/r/71277/]

was (Author: bbannier):
Review: https://reviews.apache.org/r/71272/

> ContentType/AgentAPITest.MarkResourceProviderGone/1 is flaky
> ------------------------------------------------------------
>
>                 Key: MESOS-9560
>                 URL: https://issues.apache.org/jira/browse/MESOS-9560
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Critical
>              Labels: flaky, flaky-test, mesosphere, storage, test
>             Fix For: 1.9.0
>
>         Attachments: consoleText.txt
>
> We observed a segfault in {{ContentType/AgentAPITest.MarkResourceProviderGone/1}} on test teardown.
> {noformat}
> I0131 23:55:59.378453  6798 slave.cpp:923] Agent terminating
> I0131 23:55:59.378813 31143 master.cpp:1269] Agent a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 (ip-172-16-10-236.ec2.internal) disconnected
> I0131 23:55:59.378831 31143 master.cpp:3272] Disconnecting agent a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 (ip-172-16-10-236.ec2.internal)
> I0131 23:55:59.378846 31143 master.cpp:3291] Deactivating agent a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 at slave(1112)@172.16.10.236:43229 (ip-172-16-10-236.ec2.internal)
> I0131 23:55:59.378891 31143 hierarchical.cpp:793] Agent a27bcaba-70cc-4ec3-9786-38f9512c61fd-S0 deactivated
> F0131 23:55:59.378891 31149 logging.cpp:67] RAW: Pure virtual method called
>     @     0x7f633aaaebdd  google::LogMessage::Fail()
>     @     0x7f633aab6281  google::RawLog__()
>     @     0x7f6339821262  __cxa_pure_virtual
>     @     0x55671cacc113  testing::internal::UntypedFunctionMockerBase::UntypedInvokeWith()
>     @     0x55671b532e78  mesos::internal::tests::resource_provider::MockResourceProvider<>::disconnected()
>     @     0x7f633978f6b0  process::AsyncExecutorProcess::execute<>()
>     @     0x7f633979f218  _ZN5cpp176invokeIZN7process8dispatchI7NothingNS1_20AsyncExecutorProcessERKSt8functionIFvvEES9_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSE_FSB_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS3_EESt14default_deleteISP_EEOS7_PNS1_11ProcessBaseEE_JSS_S7_SV_EEEDTclcl7forwardISB_Efp_Espcl7forwardIT0_Efp0_EEEOSB_DpOSX_
>     @     0x7f633a9f5d01  process::ProcessBase::consume()
>     @     0x7f633aa1a08a  process::ProcessManager::resume()
>     @     0x7f633aa1db06  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>     @     0x7f633acc9f80  execute_native_thread_routine
>     @     0x7f6337142e25  start_thread
>     @     0x7f6336241bad  __clone
> {noformat}
[jira] [Commented] (MESOS-9936) Slave recovery is very slow with high local volume persistant ( marathon app )
[ https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906181#comment-16906181 ]

Frédéric Comte commented on MESOS-9936:
---------------------------------------

I am using DC/OS v1.13.3, so Mesos is 1.8.1.

> Slave recovery is very slow with high local volume persistant ( marathon app )
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-9936
>                 URL: https://issues.apache.org/jira/browse/MESOS-9936
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Frédéric Comte
>            Priority: Major
>
> I run some local persistent applications. After an unplanned shutdown of the nodes running these applications, I see that the Mesos recovery process takes a very long time (more than 8 hours). The time depends on the amount of data in those volumes. What does Mesos do in this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 docker.cpp:890] Recovering Docker containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] Recovering Mesos containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 linux_launcher.cpp:286] Recovering Linux launcher
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 containerizer.cpp:1127] Recovering isolators
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 containerizer.cpp:1166] Recovering provisioner
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 composing.cpp:339] Finished recovering all containerizers
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 status_update_manager_process.hpp:314] Recovering operation status update manager
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 slave.cpp:7729] Recovering executors
> {code}
[jira] [Created] (MESOS-9937) 53598228fe should be backported to 1.7.x
longfei created MESOS-9937:
------------------------------

             Summary: 53598228fe should be backported to 1.7.x
                 Key: MESOS-9937
                 URL: https://issues.apache.org/jira/browse/MESOS-9937
             Project: Mesos
          Issue Type: Bug
            Reporter: longfei

Commit 53598228fe on the master branch should be backported to 1.7.x.
[jira] [Commented] (MESOS-9937) 53598228fe should be backported to 1.7.x
[ https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906197#comment-16906197 ]

longfei commented on MESOS-9937:
--------------------------------

Hi [~greggomann], would you please backport commit 53598228fe to 1.7.x?

> 53598228fe should be backported to 1.7.x
> ----------------------------------------
>
>                 Key: MESOS-9937
>                 URL: https://issues.apache.org/jira/browse/MESOS-9937
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: longfei
>            Priority: Major
>
> Commit 53598228fe on the master branch should be backported to 1.7.x.
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906210#comment-16906210 ]

longfei commented on MESOS-9852:
--------------------------------

Yes. It's another memory-leak issue, which was fixed in commit 53598228fe but not backported to 1.7.x. I started a new ticket, MESOS-9937, to track it.

> Slow memory growth in master due to deferred deletion of offer filters and timers.
> ----------------------------------------------------------------------------------
>
>                 Key: MESOS-9852
>                 URL: https://issues.apache.org/jira/browse/MESOS-9852
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Critical
>              Labels: resource-management
>             Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0
>
>         Attachments: _tmp_libprocess.Do1MrG_profile (1).dump, _tmp_libprocess.Do1MrG_profile (1).svg, _tmp_libprocess.Do1MrG_profile 24hours.dump, _tmp_libprocess.Do1MrG_profile 24hours.svg, screenshot-1.png, statistics
>
> The allocator does not keep a handle to the offer filter timer, which means it cannot remove the timer overhead (in this case memory) when removing the offer filter early (e.g. due to a revive):
> https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352
> In addition, the offer filter is allocated on the heap but not deleted until the timer fires (which might take forever!):
> https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321
> https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413
> https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249
> We'll need to try to backport this to all active release branches.
[jira] [Commented] (MESOS-8808) CSI documentation has a broken link to a non-existent page.
[ https://issues.apache.org/jira/browse/MESOS-8808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906128#comment-16906128 ]

Benjamin Bannier commented on MESOS-8808:
-----------------------------------------

[~joseph], is there anything we can help with to get [https://reviews.apache.org/r/65112/] over the finish line?

> CSI documentation has a broken link to a non-existent page.
> -----------------------------------------------------------
>
>                 Key: MESOS-8808
>                 URL: https://issues.apache.org/jira/browse/MESOS-8808
>             Project: Mesos
>          Issue Type: Bug
>          Components: documentation, storage
>    Affects Versions: 1.5.0
>            Reporter: Gastón Kleiman
>            Priority: Major
>              Labels: csi, documentation, mesosphere
>
> There's a broken link to a non-existent {{resource-provider.md}} document here: https://mesos.apache.org/documentation/latest/csi/#resource-providers
[jira] [Comment Edited] (MESOS-8808) CSI documentation has a broken link to a non-existent page.
[ https://issues.apache.org/jira/browse/MESOS-8808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906128#comment-16906128 ]

Benjamin Bannier edited comment on MESOS-8808 at 8/13/19 12:08 PM:
-------------------------------------------------------------------

[~kaysoky], is there anything we can help with to get [https://reviews.apache.org/r/65112/] over the finish line?

was (Author: bbannier):
[~joseph], is there anything we can help with to get [https://reviews.apache.org/r/65112/] over the finish line?

> CSI documentation has a broken link to a non-existent page.
> -----------------------------------------------------------
>
>                 Key: MESOS-8808
>                 URL: https://issues.apache.org/jira/browse/MESOS-8808
>             Project: Mesos
>          Issue Type: Bug
>          Components: documentation, storage
>    Affects Versions: 1.5.0
>            Reporter: Gastón Kleiman
>            Priority: Major
>              Labels: csi, documentation, mesosphere
>
> There's a broken link to a non-existent {{resource-provider.md}} document here: https://mesos.apache.org/documentation/latest/csi/#resource-providers
[jira] [Created] (MESOS-9936) Slave recovery is very slow with high local volume persistant ( marathon app )
Frédéric Comte created MESOS-9936:
-------------------------------------

             Summary: Slave recovery is very slow with high local volume persistant ( marathon app )
                 Key: MESOS-9936
                 URL: https://issues.apache.org/jira/browse/MESOS-9936
             Project: Mesos
          Issue Type: Bug
          Components: agent
            Reporter: Frédéric Comte

I run some local persistent applications. After an unplanned shutdown of the nodes running these applications, I see that the Mesos recovery process takes a very long time (more than 8 hours). The time depends on the amount of data in those volumes. What does Mesos do in this process?

{code:java}
Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 docker.cpp:890] Recovering Docker containers
Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] Recovering Mesos containers
Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 linux_launcher.cpp:286] Recovering Linux launcher
Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 containerizer.cpp:1127] Recovering isolators
Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 containerizer.cpp:1166] Recovering provisioner
Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 composing.cpp:339] Finished recovering all containerizers
Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 status_update_manager_process.hpp:314] Recovering operation status update manager
Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 slave.cpp:7729] Recovering executors
{code}
[jira] [Commented] (MESOS-9936) Slave recovery is very slow with high local volume persistant ( marathon app )
[ https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906165#comment-16906165 ]

Andrei Budnik commented on MESOS-9936:
--------------------------------------

[~Fcomte] what version of Mesos are you using?

> Slave recovery is very slow with high local volume persistant ( marathon app )
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-9936
>                 URL: https://issues.apache.org/jira/browse/MESOS-9936
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Frédéric Comte
>            Priority: Major
>
> I run some local persistent applications. After an unplanned shutdown of the nodes running these applications, I see that the Mesos recovery process takes a very long time (more than 8 hours). The time depends on the amount of data in those volumes. What does Mesos do in this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 docker.cpp:890] Recovering Docker containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] Recovering Mesos containers
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 linux_launcher.cpp:286] Recovering Linux launcher
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 containerizer.cpp:1127] Recovering isolators
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 containerizer.cpp:1166] Recovering provisioner
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 composing.cpp:339] Finished recovering all containerizers
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 status_update_manager_process.hpp:314] Recovering operation status update manager
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 slave.cpp:7729] Recovering executors
> {code}