[jira] [Commented] (MESOS-5828) Modularize Network in replicated_log

2019-04-22 Thread longfei (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823644#comment-16823644
 ] 

longfei commented on MESOS-5828:


Actually, I have made some modifications to MESOS-1806 (based on Jay Guo's 
work), and it (the etcd contender/detector) works fine on my test Mesos cluster. 

I can submit a patch if needed.

 

The problem is that the replicated log only works when ZK is present. I think 
some abstraction would make it more flexible and elegant.

Anyway, I'll try zetcd. Thanks a lot!

> Modularize Network in replicated_log
> 
>
> Key: MESOS-5828
> URL: https://issues.apache.org/jira/browse/MESOS-5828
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Jay Guo
>Assignee: Jay Guo
>Priority: Major
>
> Currently replicated_log relies on Zookeeper for coordinator election. This 
> is done through network abstraction _ZookeeperNetwork_. We need to modularize 
> this part in order to enable replicated_log when using Master 
> contender/detector modules.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9619) Mesos Master Crashes with Launch Group when using Port Resources

2019-04-22 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823580#comment-16823580
 ] 

Greg Mann commented on MESOS-9619:
--

Backports forthcoming

> Mesos Master Crashes with Launch Group when using Port Resources
> 
>
> Key: MESOS-9619
> URL: https://issues.apache.org/jira/browse/MESOS-9619
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 1.4.3, 1.7.1
> Environment:  
> Testing in both Mesos 1.4.3 and Mesos 1.7.1
>Reporter: Nimi Wariboko Jr.
>Assignee: Greg Mann
>Priority: Critical
>  Labels: foundations, master, mesosphere
> Attachments: mesos-master.log, mesos-master.snippet.log
>
>
> Original Issue: 
> [https://lists.apache.org/thread.html/979c8799d128ad0c436b53f2788568212f97ccf324933524f1b4d189@%3Cuser.mesos.apache.org%3E]
>  When the ports resource is removed, Mesos functions normally (I'm able to 
> launch the task as many times as I want, whereas with the ports resource it 
> fails every time).
> Attached is a snippet of the mesos master log from OFFER to crash.





[jira] [Commented] (MESOS-9619) Mesos Master Crashes with Launch Group when using Port Resources

2019-04-22 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823576#comment-16823576
 ] 

Greg Mann commented on MESOS-9619:
--

On master branch:
{code}
commit cbae57b7e790b8b46c79052975406d603e7d175a
Author: Greg Mann 
Date:   Fri Apr 19 00:34:15 2019 -0700

Enabled construction of `ResourceQuantities` from `Resources`.

This patch adds a new static method which enables the
construction of `ResourceQuantities` from `Resources`.
Namely, this permits the inclusion of sets and ranges in the
input resources used to construct `ResourceQuantities`.

Review: https://reviews.apache.org/r/70507
{code}
{code}
commit f8ffdb7bbf3ff58e1e7a411cdd66767519d9a7ad
Author: Greg Mann 
Date:   Sat Apr 20 11:48:39 2019 -0700

Ensured that task groups do not specify overlapping ranges or sets.

This patch adds validation to the master to ensure that task
groups do not include resources with overlapping set- or
range-valued resources, as this can crash the allocator.

Review: https://reviews.apache.org/r/70472/
{code}
{code}
commit 45c9788618e7123f408a1dffcf6772a1285cd2e5
Author: Greg Mann 
Date:   Mon Apr 22 11:08:32 2019 -0700

Added unit test for a master validation helper function.

Review: https://reviews.apache.org/r/70517
{code}
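
The overlap check described in the second commit above can be illustrated with a small standalone sketch. This is not the code from the review; `PortRange` and `hasOverlap` are hypothetical names, and the sketch covers only range-valued resources (sets would need an analogous duplicate check). It assumes the ranges requested by all tasks in a group have already been collected into one list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical closed interval [begin, end], e.g. ports [31000, 31005].
// This is an illustration type, not the Mesos `Value::Range` message.
struct PortRange {
  std::uint64_t begin;
  std::uint64_t end;
};

// Returns true if any two of the given ranges share at least one value.
// The argument is taken by value because it is sorted in place.
bool hasOverlap(std::vector<PortRange> ranges) {
  std::sort(ranges.begin(), ranges.end(),
            [](const PortRange& a, const PortRange& b) {
              return a.begin < b.begin;
            });

  for (std::size_t i = 1; i < ranges.size(); ++i) {
    // Ranges are ordered by `begin`, and every earlier pair was found
    // disjoint, so the previous range has the largest `end` seen so far.
    if (ranges[i].begin <= ranges[i - 1].end) {
      return true;
    }
  }
  return false;
}
```

Under these assumptions, a validator would reject a task group for which `hasOverlap` returns true, rather than passing it on to the allocator.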



[jira] [Created] (MESOS-9735) Migrate master metrics to PushGauge

2019-04-22 Thread Greg Mann (JIRA)
Greg Mann created MESOS-9735:


 Summary: Migrate master metrics to PushGauge
 Key: MESOS-9735
 URL: https://issues.apache.org/jira/browse/MESOS-9735
 Project: Mesos
  Issue Type: Task
Affects Versions: 1.8.0
Reporter: Greg Mann


We should migrate all metrics in the master actor to use {{PushGauges}} instead 
of {{PullGauges}}. If there are any cases that would be very cumbersome to 
handle with the {{PushGauge}} (e.g., uptime), we should file JIRA tickets for 
the design/development of metric types that can handle those cases.
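
To make the distinction concrete, here is a minimal standalone sketch of the two gauge styles. The classes `SketchPullGauge` and `SketchPushGauge` are hypothetical and are not the libprocess metrics API; they only illustrate the general idea that a pull-style gauge computes its value through a callback on every read, while a push-style gauge stores a value that the owner updates whenever state changes.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical pull-style gauge: each read invokes a callback supplied by
// the owner, so the reader depends on the owner to produce the value.
class SketchPullGauge {
public:
  explicit SketchPullGauge(std::function<double()> f) : f_(std::move(f)) {}

  double value() const { return f_(); }

private:
  std::function<double()> f_;
};

// Hypothetical push-style gauge: the owner writes the value when its state
// changes, and a read is just an atomic load with no callback involved.
class SketchPushGauge {
public:
  void set(std::int64_t v) { value_.store(v); }
  void increment() { value_.fetch_add(1); }
  void decrement() { value_.fetch_sub(1); }

  double value() const { return static_cast<double>(value_.load()); }

private:
  std::atomic<std::int64_t> value_{0};
};
```

In this sketch, a value such as uptime is awkward for the push style because it changes continuously and there is no discrete event at which to call `set`; the pull style handles it by computing "now minus start time" on each read.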





[jira] [Commented] (MESOS-5828) Modularize Network in replicated_log

2019-04-22 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823434#comment-16823434
 ] 

Joseph Wu commented on MESOS-5828:
--

Progress on this has been paused for a while (although the bulk of the patches 
are still usable).

In the meantime, you can try using zetcd, which basically exposes a ZK API for 
etcd:
https://github.com/etcd-io/zetcd

See this thread too: 
https://issues.apache.org/jira/browse/MESOS-1806?focusedCommentId=15895593=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15895593



[jira] [Commented] (MESOS-9726) No running tasks in marathon after restart non-leader mesos-master node

2019-04-22 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823423#comment-16823423
 ] 

Joseph Wu commented on MESOS-9726:
--

Logs of all three mesos masters would be helpful.

Also some quick questions to get us on the same page:
* What method did you use to determine the active Mesos leader?  (Marathon 
leader is not the Mesos leader.)
* When you "restart" anything, is this referring to stopping/starting a 
service?  Or rebooting an entire node?

> No running tasks in marathon after restart non-leader mesos-master node
> ---
>
> Key: MESOS-9726
> URL: https://issues.apache.org/jira/browse/MESOS-9726
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Alexandr
>Priority: Minor
>  Labels: beginner
>
> Good day!
>  I have a problem with my Mesos cluster.
>  I have 3 mesos-master nodes (Mesos version 1.4.1) with a Marathon cluster, 
> v1.5.11 (let's call them Mas1, Mas2, Mas3), and 3 mesos-slave nodes (let's 
> call them Slv1, Slv2, Slv3) with apps running on them. I also have a 
> Zookeeper cluster on nodes Mas1, Mas2, Mas3.
>  The Mesos leader was master node "Mas1".
>  After I restarted master node "Mas3", it rejoined the Mesos cluster and 
> everything seemed fine, but a moment later I opened Marathon and all running 
> tasks from my mesos-slave nodes had become "unknown" with no instances. 
>  So I checked:
>  1. My Mesos agents: everything was OK, 3 agents running. 
>  2. That all services were running and all clusters (Mesos\Marathon\Zookeeper) 
> were fine.
>  3. Decided to restart all mesos-slave services on the slave nodes. On slave 
> node Slv3, 1 of 3 instances launched for all applications. I then restarted 
> all Marathon services, after which all tasks switched to status 
> "Waiting"\"Delayed".
>  4. Checked the mesos-master and slave logs: no errors or information about 
> any problems on the cluster, only information about killing and launching new 
> tasks on the slave node.
>  5. Decided to stop and start the mesos-master service to force re-election 
> of a Mesos leader. 
>  After that, master node "Mas2" became the leader and all tasks in Marathon 
> started running instances as normal. 
> Logs will be uploaded later. I wonder how this could happen.
> {code:java}
> Apr 10 18:32:18 Mas1 docker[*]: I0410 15:32:18.342406 12 http.cpp:**] HTTP GET for /master/state from :*** with User-Agent='Go-http-client/1.1'
> Apr 10 18:32:18 Mas1 docker[*]: I0410 15:32:18.391480 9 http.cpp:1185] HTTP GET for /master/state from *:40686 with User-Agent='Go-http-client/1.1'
> Apr 10 18:32:18 Mas1 docker[27956]: E0410 15:32:18.567880 14 process.cpp:2577] Failed to shutdown socket with fd 44, address :5050: Transport endpoint is not connected
> Apr 10 18:32:18 Mas1 docker[27956]: E0410 15:32:18.596901 14 process.cpp:2577] Failed to shutdown socket with fd 46, address :5050: Transport endpoint is not connected
> {code}





[jira] [Comment Edited] (MESOS-9619) Mesos Master Crashes with Launch Group when using Port Resources

2019-04-22 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816794#comment-16816794
 ] 

Greg Mann edited comment on MESOS-9619 at 4/22/19 6:18 PM:
---

Reviews here:
[https://reviews.apache.org/r/70507/]
[https://reviews.apache.org/r/70472/]
[https://reviews.apache.org/r/70517/]
[https://reviews.apache.org/r/70509/]


was (Author: greggomann):
Reviews here:
[https://reviews.apache.org/r/70507/]
[https://reviews.apache.org/r/70472/]
[https://reviews.apache.org/r/70509/]



[jira] [Commented] (MESOS-9726) No running tasks in marathon after restart non-leader mesos-master node

2019-04-22 Thread Alexandr (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823060#comment-16823060
 ] 

Alexandr commented on MESOS-9726:
-

Good day, [~abudnik]!
Can someone help me with this case?
