[jira] [Commented] (MESOS-5828) Modularize Network in replicated_log
[ https://issues.apache.org/jira/browse/MESOS-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823644#comment-16823644 ]

longfei commented on MESOS-5828:
--------------------------------

Actually, I have made some modifications for MESOS-1806 (based on Jay Guo's work), and the etcd contender/detector works fine on my test Mesos cluster. I'll submit a patch if needed. The problem is that the replicated log only works when ZooKeeper is present; I think some abstraction would make it more flexible and elegant. Anyway, I'll try zetcd. Thanks a lot!

> Modularize Network in replicated_log
> ------------------------------------
>
>                 Key: MESOS-5828
>                 URL: https://issues.apache.org/jira/browse/MESOS-5828
>             Project: Mesos
>          Issue Type: Bug
>          Components: replicated log
>            Reporter: Jay Guo
>            Assignee: Jay Guo
>            Priority: Major
>
> Currently the replicated log relies on ZooKeeper for coordinator election. This
> is done through the network abstraction _ZookeeperNetwork_. We need to modularize
> this part in order to enable the replicated log when using Master
> contender/detector modules.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
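The abstraction the ticket asks for can be sketched in a few lines. This is an illustrative Python sketch, not Mesos's actual C++ interface; the class and method names here are hypothetical:

```python
class Network:
    """Hypothetical interface: the replicated log asks an abstract
    Network for its peers instead of talking to ZooKeeper directly."""

    def peers(self):
        """Return the current set of replica addresses."""
        raise NotImplementedError


class StaticNetwork(Network):
    """Trivial stand-in for what a ZookeeperNetwork (or an etcd-backed
    module) would discover dynamically from the coordination service."""

    def __init__(self, peers):
        self._peers = set(peers)

    def peers(self):
        return set(self._peers)


# Coordinator election would depend only on the Network interface,
# so swapping ZooKeeper for etcd becomes a module choice.
network = StaticNetwork({"10.0.0.1:5050", "10.0.0.2:5050"})
assert len(network.peers()) == 2
```

With that seam in place, an etcd contender/detector module would provide its own `Network` implementation without touching the replicated log itself.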
[jira] [Commented] (MESOS-9619) Mesos Master Crashes with Launch Group when using Port Resources
[ https://issues.apache.org/jira/browse/MESOS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823580#comment-16823580 ]

Greg Mann commented on MESOS-9619:
----------------------------------

Backports forthcoming.

> Mesos Master Crashes with Launch Group when using Port Resources
> ----------------------------------------------------------------
>
>                 Key: MESOS-9619
>                 URL: https://issues.apache.org/jira/browse/MESOS-9619
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation
>    Affects Versions: 1.4.3, 1.7.1
>         Environment: Testing in both Mesos 1.4.3 and Mesos 1.7.1
>            Reporter: Nimi Wariboko Jr.
>            Assignee: Greg Mann
>            Priority: Critical
>              Labels: foundations, master, mesosphere
>         Attachments: mesos-master.log, mesos-master.snippet.log
>
> Original Issue:
> [https://lists.apache.org/thread.html/979c8799d128ad0c436b53f2788568212f97ccf324933524f1b4d189@%3Cuser.mesos.apache.org%3E]
> When the ports resource is removed, Mesos functions normally (I'm able to
> launch the task as many times as I want, while with the ports resource it
> always fails).
> Attached is a snippet of the mesos master log from OFFER to crash.
[jira] [Commented] (MESOS-9619) Mesos Master Crashes with Launch Group when using Port Resources
[ https://issues.apache.org/jira/browse/MESOS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823576#comment-16823576 ]

Greg Mann commented on MESOS-9619:
----------------------------------

On the master branch:

{code}
commit cbae57b7e790b8b46c79052975406d603e7d175a
Author: Greg Mann
Date:   Fri Apr 19 00:34:15 2019 -0700

    Enabled construction of `ResourceQuantities` from `Resources`.

    This patch adds a new static method which enables the construction
    of `ResourceQuantities` from `Resources`. Namely, this permits the
    inclusion of sets and ranges in the input resources used to
    construct `ResourceQuantities`.

    Review: https://reviews.apache.org/r/70507
{code}

{code}
commit f8ffdb7bbf3ff58e1e7a411cdd66767519d9a7ad
Author: Greg Mann
Date:   Sat Apr 20 11:48:39 2019 -0700

    Ensured that task groups do not specify overlapping ranges or sets.

    This patch adds validation to the master to ensure that task groups
    do not include resources with overlapping set- or range-valued
    resources, as this can crash the allocator.

    Review: https://reviews.apache.org/r/70472/
{code}

{code}
commit 45c9788618e7123f408a1dffcf6772a1285cd2e5
Author: Greg Mann
Date:   Mon Apr 22 11:08:32 2019 -0700

    Added unit test for a master validation helper function.

    Review: https://reviews.apache.org/r/70517
{code}
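The validation added in commit f8ffdb7 guards against task groups whose range-valued resources (e.g. ports) overlap. Mesos's actual check is C++ in the master; this hypothetical Python sketch just shows the idea:

```python
def ranges_overlap(ranges):
    """Return True if any two inclusive [begin, end] ranges overlap.

    After sorting by start, only adjacent ranges can be the closest
    pair, so one linear pass is enough.
    """
    ordered = sorted(ranges)
    return any(prev_end >= cur_begin
               for (_, prev_end), (cur_begin, _) in zip(ordered, ordered[1:]))


# Two tasks in a task group both claiming port 8085 would trip the
# check that keeps overlapping ranges away from the allocator.
assert ranges_overlap([(8080, 8090), (8085, 8100)])
assert not ranges_overlap([(8080, 8084), (8085, 8100)])
```

Rejecting such task groups at validation time lets the master return an error to the framework instead of crashing later in the allocator.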
[jira] [Created] (MESOS-9735) Migrate master metrics to PushGauge
Greg Mann created MESOS-9735:
--------------------------------

             Summary: Migrate master metrics to PushGauge
                 Key: MESOS-9735
                 URL: https://issues.apache.org/jira/browse/MESOS-9735
             Project: Mesos
          Issue Type: Task
    Affects Versions: 1.8.0
            Reporter: Greg Mann

We should migrate all metrics in the master actor to use {{PushGauge}}s instead of {{PullGauge}}s. If there are any cases that would be very cumbersome to handle with a {{PushGauge}} (e.g. uptime), we should file JIRA tickets for the design/development of metric types that can handle those cases.
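The difference between the two gauge styles is what motivates the migration: a pull gauge evaluates a callback at scrape time, which typically means a dispatch into the (possibly busy) actor, while a push gauge simply stores a value that the actor updates inline as state changes. A rough Python sketch of the push style (names are illustrative, not libprocess's actual API):

```python
class PushGauge:
    """Push-style gauge: the owning actor updates the stored value as
    state changes, so reading the metric never blocks on the actor."""

    def __init__(self):
        self.value = 0

    def __iadd__(self, amount):
        self.value += amount
        return self

    def __isub__(self, amount):
        self.value -= amount
        return self


# The master would bump the gauge at the points where it already
# mutates the corresponding state...
tasks_running = PushGauge()
tasks_running += 3   # three tasks launched
tasks_running -= 1   # one task finished

# ...and the metrics endpoint just reads the stored value.
assert tasks_running.value == 2
```

A metric like uptime fits this model poorly because its value changes continuously rather than at discrete state transitions, which is why such cases may need a dedicated metric type.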
[jira] [Commented] (MESOS-5828) Modularize Network in replicated_log
[ https://issues.apache.org/jira/browse/MESOS-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823434#comment-16823434 ]

Joseph Wu commented on MESOS-5828:
----------------------------------

Progress on this has been paused for a while (although the bulk of the patches are still usable).

In the meantime, you can try using zetcd, which basically exposes a ZK API for etcd: https://github.com/etcd-io/zetcd

See this thread too: https://issues.apache.org/jira/browse/MESOS-1806?focusedCommentId=15895593&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15895593
[jira] [Commented] (MESOS-9726) No running tasks in marathon after restart non-leader mesos-master node
[ https://issues.apache.org/jira/browse/MESOS-9726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823423#comment-16823423 ]

Joseph Wu commented on MESOS-9726:
----------------------------------

Logs of all three mesos masters would be helpful. Also some quick questions to get us on the same page:
* What method did you use to determine the active Mesos leader? (The Marathon leader is not the Mesos leader.)
* When you "restart" anything, does this refer to stopping/starting a service, or to rebooting an entire node?

> No running tasks in marathon after restart non-leader mesos-master node
> -----------------------------------------------------------------------
>
>                 Key: MESOS-9726
>                 URL: https://issues.apache.org/jira/browse/MESOS-9726
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.4.1
>            Reporter: Alexandr
>            Priority: Minor
>              Labels: beginner
>
> Good day!
> I have a problem with my Mesos cluster:
> I have 3 mesos-master nodes (Mesos version 1.4.1) with a Marathon cluster
> v1.5.11 (let's call them Mas1, Mas2, Mas3) and 3 mesos-slave nodes (let's
> call them Slv1, Slv2, Slv3) with running apps on them. I also have a
> Zookeeper cluster on nodes Mas1, Mas2, Mas3.
> The Mesos leader was master-node "Mas1".
> After I restarted master-node "Mas3", it rejoined the Mesos cluster and
> everything looked fine, but a moment later I opened Marathon and all running
> tasks from my mesos-slave nodes had become "unknown" with no instances.
> So I checked:
> 1. My mesos agents - everything was ok, 3 agents running.
> 2. That all services were running and all clusters (Mesos\Marathon\Zookeeper)
> were fine.
> 3. Decided to restart all mesos-slave services on the slave nodes - on
> slave-node Slv3, 1 of 3 instances launched for all applications; then I
> restarted all marathon services. After that, all tasks switched to status
> "Waiting"\"Delayed".
> 4. Checked mesos-master and slave logs - no errors or information about any
> problems on the cluster, only information about killing and launching new
> tasks on the slave node.
> 5. Decided to stop-start the mesos-master service to force re-election of a
> Mesos leader. After that, the leader became master-node "Mas2" and all tasks
> in Marathon started to run instances like normal.
> Logs will be uploaded later. I wonder how this could happen.
> {code:java}
> Apr 10 18:32:18 Mas1 docker[*]: I0410 15:32:18.342406 12 http.cpp:**] HTTP GET for /master/state from :*** with User-Agent='Go-http-client/1.1'
> Apr 10 18:32:18 Mas1 docker[*]: I0410 15:32:18.391480  9 http.cpp:1185] HTTP GET for /master/state from *:40686 with User-Agent='Go-http-client/1.1'
> Apr 10 18:32:18 Mas1 docker[27956]: E0410 15:32:18.567880 14 process.cpp:2577] Failed to shutdown socket with fd 44, address :5050: Transport endpoint is not connected
> Apr 10 18:32:18 Mas1 docker[27956]: E0410 15:32:18.596901 14 process.cpp:2577] Failed to shutdown socket with fd 46, address :5050: Transport endpoint is not connected
> {code}
[jira] [Comment Edited] (MESOS-9619) Mesos Master Crashes with Launch Group when using Port Resources
[ https://issues.apache.org/jira/browse/MESOS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816794#comment-16816794 ]

Greg Mann edited comment on MESOS-9619 at 4/22/19 6:18 PM:
-----------------------------------------------------------

Reviews here:
[https://reviews.apache.org/r/70507/]
[https://reviews.apache.org/r/70472/]
[https://reviews.apache.org/r/70517/]
[https://reviews.apache.org/r/70509/]

was (Author: greggomann):
Reviews here:
[https://reviews.apache.org/r/70507/]
[https://reviews.apache.org/r/70472/]
[https://reviews.apache.org/r/70509/]
[jira] [Commented] (MESOS-9726) No running tasks in marathon after restart non-leader mesos-master node
[ https://issues.apache.org/jira/browse/MESOS-9726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823060#comment-16823060 ]

Alexandr commented on MESOS-9726:
---------------------------------

Good day, [~abudnik]! Can someone help me with this case?