[jira] [Commented] (LIVY-718) Support multi-active high availability in Livy

2019-12-31 Thread Yiheng Wang (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006078#comment-17006078
 ] 

Yiheng Wang commented on LIVY-718:
--

[~bikassaha] Compared to the designated server solution, I think the stateless
server solution gains availability by sacrificing scalability. In this
context, one concern is memory. We observed that when the number of running
sessions grows to 400~500, the Livy server process consumes about 2 GB of
memory. Another concern is that Livy uses long-lived connections between the
server and the Spark drivers. Say there are M servers and N sessions. In the
designated server solution, there are N connections. In the stateless
solution, there are M x N connections. I'm afraid this may bring a lot of
overhead in RPC communication (e.g. serialization, routing).
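
The connection counts above can be made concrete with a small sketch. This is
only back-of-the-envelope arithmetic, not Livy code: the function names and the
example cluster size (3 servers, 500 sessions) are hypothetical, and the
per-session memory figure assumes server memory grows roughly linearly with
the number of sessions, which the comment does not state.

```python
def designated_connections(num_servers: int, num_sessions: int) -> int:
    """Designated server: each session is owned by one server, so N connections."""
    return num_sessions


def stateless_connections(num_servers: int, num_sessions: int) -> int:
    """Stateless servers: every server connects to every driver, so M x N."""
    return num_servers * num_sessions


def memory_per_session_mb(total_memory_mb: float, num_sessions: int) -> float:
    """Average server memory per session, assuming roughly linear growth."""
    return total_memory_mb / num_sessions


if __name__ == "__main__":
    servers, sessions = 3, 500  # hypothetical cluster size
    print(designated_connections(servers, sessions))        # 500
    print(stateless_connections(servers, sessions))         # 1500
    # About 2 GB observed at 400~500 sessions; 2048 MB over 500 sessions.
    print(round(memory_per_session_mb(2048, sessions), 1))  # 4.1
```

With these assumed numbers, adding a server leaves the designated count at 500
but adds another 500 connections in the stateless design.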

> Support multi-active high availability in Livy
> --
>
> Key: LIVY-718
> URL: https://issues.apache.org/jira/browse/LIVY-718
> Project: Livy
> Issue Type: Epic
> Components: RSC, Server
> Reporter: Yiheng Wang
> Priority: Major
>
> In this JIRA we want to discuss how to implement multi-active high
> availability in Livy.
> Currently, Livy only supports single-node recovery. This is not sufficient in
> some production environments. In our scenario, the Livy server serves many
> notebook and JDBC services. We want to make the Livy service more
> fault-tolerant and scalable.
> There are already some proposals in the community for high availability, but
> they are either incomplete or cover only active-standby high availability. So
> we propose a multi-active high availability design to achieve the following
> goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active
> servers.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (LIVY-718) Support multi-active high availability in Livy

2019-12-31 Thread Yiheng Wang (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006069#comment-17006069
 ] 

Yiheng Wang edited comment on LIVY-718 at 12/31/19 12:28 PM:
-

bq. When a server fails, its sessions become unavailable until other servers 
are designated to handle them. This was not acceptable behavior, at least for 
clusters that I worked with in my previous job.

[~meisam] Currently, Livy only supports single-node failure recovery. Do you
use Livy in that cluster? If so, how do you handle the downtime?


was (Author: yihengw):
bq. When a server fails, its sessions become unavailable until other servers 
are designated to handle them. This was not acceptable behavior, at least for 
clusters that I worked with in my previous job.

[~meisam] Currently, Livy only supports single-node failure recovery. Do you
use Livy in that cluster? If so, would you like to share your solution?



[jira] [Commented] (LIVY-718) Support multi-active high availability in Livy

2019-12-31 Thread Yiheng Wang (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006069#comment-17006069
 ] 

Yiheng Wang commented on LIVY-718:
--

bq. When a server fails, its sessions become unavailable until other servers 
are designated to handle them. This was not acceptable behavior, at least for 
clusters that I worked with in my previous job.

[~meisam] Currently, Livy only supports single-node failure recovery. Do you
use Livy in that cluster? If so, would you like to share your solution?
