[jira] [Commented] (LIVY-718) Support multi-active high availability in Livy

2019-12-30 Thread Saisai Shao (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005895#comment-17005895
 ] 

Saisai Shao commented on LIVY-718:
--

IIUC, the current JDBC part maintains a lot of results/metadata/information on the 
Livy server. I think the key point mentioned by [~bikassaha] is to make the Livy 
Server stateless. That would be an ideal solution, but it currently requires a 
large amount of work.

> Support multi-active high availability in Livy
> --
>
> Key: LIVY-718
> URL: https://issues.apache.org/jira/browse/LIVY-718
> Project: Livy
>  Issue Type: Epic
>  Components: RSC, Server
>Reporter: Yiheng Wang
>Priority: Major
>
> In this JIRA we want to discuss how to implement multi-active high 
> availability in Livy.
> Currently, Livy only supports single-node recovery. This is not sufficient in 
> some production environments. In our scenario, the Livy server serves many 
> notebook and JDBC services. We want to make the Livy service more fault-tolerant 
> and scalable.
> There are already some proposals in the community for high availability, but 
> they are either incomplete or cover only active-standby high availability. So we 
> propose a multi-active high availability design to achieve the following 
> goals:
> # One or more servers will serve client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active 
> servers.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (LIVY-718) Support multi-active high availability in Livy

2019-12-30 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005512#comment-17005512
 ] 

Marco Gaido edited comment on LIVY-718 at 12/30/19 6:29 PM:


bq. IIRC JDBC had a REST and an RPC mode. The RPC mode might not be HA without 
a fat client but perhaps the REST mode could. Does Hive JDBC support HA on the 
Hive Thrift server? Then maybe the hive JDBC client now supports server side 
transitions. If not, then we may have the caveat that HA won't work for such 
connections. I am not super familiar with the JDBC client.

Actually, JDBC has either an HTTP or a binary mode, but that only concerns the 
communication between the Livy server and the client; it is unrelated to how 
the Livy server interacts with Spark (i.e. via RPC).


was (Author: mgaido):
> IIRC JDBC had a REST and an RPC mode. The RPC mode might not be HA without a 
> fat client but perhaps the REST mode could. Does Hive JDBC support HA on the 
> Hive Thrift server? Then maybe the hive JDBC client now supports server side 
> transitions. If not, then we may have the caveat that HA won't work for such 
> connections. I am not super familiar with the JDBC client.

Actually, JDBC has either HTTP or binary mode, but it relates only to the 
communication between the Livy server and the client and it is unrelated to how 
the Livy server interacts with Spark (ie. via RPC).



[jira] [Commented] (LIVY-718) Support multi-active high availability in Livy

2019-12-30 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005512#comment-17005512
 ] 

Marco Gaido commented on LIVY-718:
--

> IIRC JDBC had a REST and an RPC mode. The RPC mode might not be HA without a 
> fat client but perhaps the REST mode could. Does Hive JDBC support HA on the 
> Hive Thrift server? Then maybe the hive JDBC client now supports server side 
> transitions. If not, then we may have the caveat that HA won't work for such 
> connections. I am not super familiar with the JDBC client.

Actually, JDBC has either an HTTP or a binary mode, but that only concerns the 
communication between the Livy server and the client; it is unrelated to how 
the Livy server interacts with Spark (i.e. via RPC).



[jira] [Commented] (LIVY-718) Support multi-active high availability in Livy

2019-12-30 Thread Bikas Saha (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005439#comment-17005439
 ] 

Bikas Saha commented on LIVY-718:
-

There could be in-memory state in the Livy server, but could that be re-created 
from the state in the Spark driver with an initial sync operation?

If not, then what additional metadata could be stored in the Spark driver to 
make it happen?

The ideal situation would be (keeping in mind Meisam's observations):
 # Any Livy client can hit any Livy server and continue from where it was. The 
first time a Livy server is hit for a session, it may take some time to hydrate 
the state in case that was not done in the background.
 ## Note that this can happen even without any Livy server failure, e.g. when a 
load balancer is running in front of the Livy servers and sticky sessions are 
not working or there is too much hot-spotting.
 # A Livy server can (with some extra sync operation if needed) service any 
session from that session's Spark driver. The only information it needs is how 
to connect to that Spark driver, which could be stored in a reliable state 
store (e.g. even in a YARN application tag for YARN clusters); a sketch of this 
idea follows below.

If we can achieve the above then the system could be much simpler to operate 
and work with.
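
For the second point, here is a minimal sketch of what storing/recovering the driver 
connection info through a YARN application tag could look like. This is only an 
illustration under assumptions: the "livy-rpc:<host>:<port>" tag format and the 
helper below are hypothetical, not existing Livy code, and the tag would have to be 
attached at submit time (e.g. via spark.yarn.tags; note YARN may lower-case and 
length-limit tags).

{code:scala}
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.records.ApplicationId
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object DriverEndpointRecovery {
  // Hypothetical tag format "livy-rpc:<host>:<port>", attached at submit time,
  // e.g. spark.yarn.tags=livy-rpc:driverhost:10001 (YARN may lower-case tags).
  private val TagPrefix = "livy-rpc:"

  // Any Livy server can recover the driver endpoint from the YARN application
  // report, without that state living in a particular server's memory.
  def lookupDriverEndpoint(appId: ApplicationId): Option[(String, Int)] = {
    val yarn = YarnClient.createYarnClient()
    yarn.init(new YarnConfiguration())
    yarn.start()
    try {
      val report = yarn.getApplicationReport(appId)
      report.getApplicationTags.asScala
        .find(_.startsWith(TagPrefix))
        .map { tag =>
          val Array(host, port) = tag.stripPrefix(TagPrefix).split(":", 2)
          (host, port.toInt)
        }
    } finally {
      yarn.stop()
    }
  }
}
{code}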

IIRC JDBC had a REST and an RPC mode. The RPC mode might not be HA without a 
fat client but perhaps the REST mode could. Does Hive JDBC support HA on the 
Hive Thrift server? Then maybe the hive JDBC client now supports server side 
transitions. If not, then we may have the caveat that HA won't work for such 
connections. I am not super familiar with the JDBC client.

 Any locking or serialization could be achieved in Livy code running inside the 
Spark driver, since it's a single point where every request eventually has to 
come and get serialized.

Good point on monitoring - if communicating with YARN is the main overhead 
(like Meisam observed), then perhaps the number of Livy sessions being monitored 
is not that important, assuming a single YARN query can return all Livy 
session-related Spark applications (using tags etc.), which is then iterated 
over to determine the status of Livy sessions with running jobs. Until 
monitoring starts consuming significant cycles on the server, maybe it's not a 
problem.
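
To make the monitoring point concrete, here is a rough sketch of a single bulk YARN 
query that is then iterated to derive per-session status. The "livy-session-<id>" 
tag and the filtering are assumptions for illustration; older YarnClient versions 
expose no tag filter on getApplications, so the filter is applied on the returned 
reports.

{code:scala}
import java.util.EnumSet
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.records.YarnApplicationState
import org.apache.hadoop.yarn.client.api.YarnClient

object SessionMonitor {
  // One bulk query for all SPARK applications in the interesting states, then a
  // client-side filter on a hypothetical per-session tag.
  def pollLivySessions(yarn: YarnClient): Map[String, YarnApplicationState] = {
    val reports = yarn.getApplications(
      Set("SPARK").asJava,
      EnumSet.of(YarnApplicationState.ACCEPTED, YarnApplicationState.RUNNING)).asScala
    reports.flatMap { r =>
      r.getApplicationTags.asScala
        .find(_.startsWith("livy-session-"))
        .map(tag => tag -> r.getYarnApplicationState)
    }.toMap
  }
}
{code}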

If the Livy server does not use the integral identifier semantically (and 
neither do clients), then it is better in the long term, IMO, to change it now 
rather than live with its limitations (even more so with HA) going forward. I am 
not sure why integral identifiers were chosen in the first place (in case there 
is some important use case for them that I am missing). It would be a breaking 
change for clients, but it's a mechanical change that may be less effort compared 
to a semantic change. And the benefit is huge - getting HA - which may justify 
that cost.

 



[jira] [Issue Comment Deleted] (LIVY-718) Support multi-active high availability in Livy

2019-12-30 Thread Bikas Saha (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bikas Saha updated LIVY-718:

Comment: was deleted

(was: There could be in-memory state in the livy server but could that be 
re-created from the state in the Spark driver with an initial sync operation?

If not, then what additional metadata could be stored in the Spark driver to 
make it happen?

The ideal situation would be (keeping in mind Meisam's observations)
 # Any livy client can hit any livy server and continue from where it was. The 
first time a livy server is hit for a session it may take some time to hydrate 
the state in case it was not done in the background.
 ## Note that this can happen even without any livy server failure in cases 
where a load balancer is running in front of the livy server and sticky 
sessions are not working or there is too much hot-spotting.
 # A livy server can (with some extra sync operation if needed) service any 
session from that session's Spark driver. The only information it needs is how 
to connect to that Spark driver. That could be stored in a 
reliable state store (e.g. even in a YARN application tag for YARN clusters)

If we can achieve the above then the system could be much simpler to operate 
and work with.

IIRC JDBC had a REST and an RPC mode. The RPC mode might not be HA without a 
fat client but perhaps the REST mode could. Does Hive JDBC support HA on the 
Hive Thrift server? Then maybe the hive JDBC client now supports server side 
transitions. If not, then we may have the caveat that HA won't work for such 
connections. I am not super familiar with the JDBC client.

 )



[jira] [Commented] (LIVY-718) Support multi-active high availability in Livy

2019-12-30 Thread Bikas Saha (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005434#comment-17005434
 ] 

Bikas Saha commented on LIVY-718:
-

There could be in-memory state in the livy server but could that be re-created 
from the state in the Spark driver with an initial sync operation?

If not, then what additional metadata could be stored in the Spark driver to 
make it happen?

The ideal situation would be (keeping in mind Meisam's observations)
 # Any livy client can hit any livy server and continue from where it was. The 
first time a livy server is hit for a session it may take some time to hydrate 
the state in case it was not done in the background.
 ## Note that this can happen even without any livy server failure in cases 
where a load balancer is running in front of the livy server and sticky 
sessions are not working or there is too much hot-spotting.
 # A livy server can (with some extra sync operation if needed) service any 
session from that session's Spark driver. The only information it needs is how 
to connect to that Spark driver. That could be stored in a 
reliable state store (e.g. even in a YARN application tag for YARN clusters)

If we can achieve the above then the system could be much simpler to operate 
and work with.

IIRC JDBC had a REST and an RPC mode. The RPC mode might not be HA without a 
fat client but perhaps the REST mode could. Does Hive JDBC support HA on the 
Hive Thrift server? Then maybe the hive JDBC client now supports server side 
transitions. If not, then we may have the caveat that HA won't work for such 
connections. I am not super familiar with the JDBC client.

 



[jira] [Created] (LIVY-737) Livy Server support use hadoop native method in local filesystem

2019-12-30 Thread Zhefeng Wang (Jira)
Zhefeng Wang created LIVY-737:
-

 Summary: Livy Server support use hadoop native method in local 
filesystem
 Key: LIVY-737
 URL: https://issues.apache.org/jira/browse/LIVY-737
 Project: Livy
  Issue Type: New Feature
  Components: Server
Reporter: Zhefeng Wang
Assignee: Zhefeng Wang


When the local filesystem is chosen to store sessionMetadata and nextId, the 
Livy server does not go through Hadoop's native local-filesystem methods; 
instead, operations such as rename fall back to shelling out to external commands 
(fork/exec), which takes much longer than the native path.

This hurts Livy server throughput in high-concurrency scenarios, because 
sessionStore.nextId is synchronized and holds the SessionManager object lock 
while the slow filesystem call runs.

 
{code:java}
"qtp20084184-405665" #405665 prio=5 os_prio=0 tid=0x7fa9dca37000 nid=0xf55d runnable [0x7faa1590]
   java.lang.Thread.State: RUNNABLE
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:486)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:815)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:798)
	at org.apache.hadoop.fs.FileUtil.readLink(FileUtil.java:160)
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileLinkStatusInternal(RawLocalFileSystem.java:835)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatus(RawLocalFileSystem.java:795)
	at org.apache.hadoop.fs.DelegateToFileSystem.getFileLinkStatus(DelegateToFileSystem.java:130)
	at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:705)
	at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:236)
	at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:678)
	at org.apache.hadoop.fs.FileContext.rename(FileContext.java:958)
	at org.apache.livy.server.recovery.FileSystemStateStore$$anonfun$set$1.apply(FileSystemStateStore.scala:92)
	at org.apache.livy.server.recovery.FileSystemStateStore$$anonfun$set$1.apply(FileSystemStateStore.scala:88)
	at org.apache.livy.Utils$.usingResource(Utils.scala:103)
	at org.apache.livy.server.recovery.FileSystemStateStore.set(FileSystemStateStore.scala:88)
	at org.apache.livy.server.recovery.SessionStore.saveNextSessionId(SessionStore.scala:50)
	at org.apache.livy.sessions.SessionManager.nextId(SessionManager.scala:93)
	- locked <0x804f1228> (a org.apache.livy.sessions.BatchSessionManager)
	at org.apache.livy.server.batch.BatchSessionServlet.createSession(BatchSessionServlet.scala:68)
	at org.apache.livy.server.batch.BatchSessionServlet.createSession(BatchSessionServlet.scala:40)
	at org.apache.livy.server.SessionServlet$$anonfun$17.apply(SessionServlet.scala:161)
	at org.scalatra.ScalatraBase$class.org$scalatra$ScalatraBase$$liftAction(ScalatraBase.scala:270)
{code}
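
To illustrate why the synchronized nextId matters, here is a simplified sketch (not 
Livy's actual code) of the call path from the stack trace: every concurrent 
createSession call queues behind the lock while the slow state-store write runs.

{code:scala}
import java.util.concurrent.Executors

// Simplified stand-ins for SessionStore/StateStore; the real classes live in
// org.apache.livy.server.recovery.
trait StateStore {
  def set(key: String, value: AnyRef): Unit  // may fork/exec via RawLocalFileSystem
}

class SessionManager(store: StateStore) {
  private var nextSessionId = 0

  def nextId(): Int = synchronized {         // lock held for the whole store write
    val id = nextSessionId
    store.set("nextId", Integer.valueOf(id + 1))
    nextSessionId = id + 1
    id
  }
}

object Demo extends App {
  val slowStore = new StateStore {
    def set(key: String, value: AnyRef): Unit = Thread.sleep(100) // stand-in for the slow rename
  }
  val manager = new SessionManager(slowStore)
  val pool = Executors.newFixedThreadPool(8)
  // 8 concurrent "createSession" calls still run one at a time (~800 ms total).
  (1 to 8).foreach { _ =>
    pool.submit(new Runnable { def run(): Unit = println(manager.nextId()) })
  }
  pool.shutdown()
}
{code}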
 





[jira] [Commented] (LIVY-718) Support multi-active high availability in Livy

2019-12-30 Thread Yiheng Wang (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005203#comment-17005203
 ] 

Yiheng Wang commented on LIVY-718:
--

Thanks for your comments [~bikassaha] and [~meisam]. I have summarized the 
discussion points and put my comments below (please point it out if I missed 
something).

h4. Designated Server - Is it because there are issues with multiple servers 
handling multiple clients to the same session?
The issues include:
1. The Livy server needs to monitor Spark sessions. If we remove the designated 
server, each server may need to monitor all sessions, which is wasteful and may 
lead to inconsistency among servers.
2. Besides session data, the Livy server also keeps other data in memory, such 
as the application log and the last active time. Such information has a high 
update rate, so it is not well suited to a state-store backend like ZooKeeper.
3. If multiple servers serve one session, we need some kind of lock mechanism 
to handle concurrent state-change requests (e.g. stop session); a sketch of one 
option follows below.
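
As mentioned in point 3, here is a minimal sketch of one possible per-session lock, 
assuming ZooKeeper/Curator is already part of the deployment; the connect string 
and znode path are placeholders, not part of the current design.

{code:scala}
import java.util.concurrent.TimeUnit
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.locks.InterProcessMutex
import org.apache.curator.retry.ExponentialBackoffRetry

object SessionStateLock {
  private val client = CuratorFrameworkFactory.newClient(
    "zk1:2181,zk2:2181", new ExponentialBackoffRetry(1000, 3))
  client.start()

  // Serialize state-change requests (e.g. stop session) issued for the same
  // session from different Livy servers behind a per-session ZooKeeper lock.
  def withSessionLock[T](sessionId: String)(body: => T): T = {
    val lock = new InterProcessMutex(client, s"/livy/locks/$sessionId")
    if (!lock.acquire(30, TimeUnit.SECONDS)) {
      throw new IllegalStateException(s"Could not lock session $sessionId")
    }
    try body finally lock.release()
  }
}
{code}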

h4. Session id - I would strongly suggest deprecating the integral session id
I agree that the incremental integral session ID is not necessary for Livy. For 
changing it to a UUID, my biggest concern is compatibility with the earlier 
API (the session-id data type may need to change from Int to String). It's 
quite a big move, so we chose a conservative approach in the design.

h4. Dependency on ZK
ZK is introduced to resolve the above two problems (server status change 
notification and unique id generation). If we don't need the designated server 
and the incremental session id, I think we can remove ZK.
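
For context, the unique-id part that ZK currently covers could be as small as a 
persistent-sequential znode; a rough sketch with Curator follows (the znode path 
is hypothetical).

{code:scala}
import org.apache.curator.framework.CuratorFramework
import org.apache.zookeeper.CreateMode

object SessionIdGenerator {
  // Each create() of a persistent-sequential znode returns a path with a
  // monotonically increasing suffix, usable as a cluster-wide session id.
  def nextSessionId(client: CuratorFramework): Int = {
    val path = client.create()
      .creatingParentsIfNeeded()
      .withMode(CreateMode.PERSISTENT_SEQUENTIAL)
      .forPath("/livy/sessions/id-")
    path.substring(path.lastIndexOf('-') + 1).toInt
  }
}
{code}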

h4. Service discovery, ease of use of the API, and the number of ports that 
need to be opened in the firewall for Livy HA can become a security concern
I think the point here is a single URL for the Livy cluster. It depends on the 
first question. If we can remove the designated server, we just need to put a 
load balancer in front of all the servers.

If we keep the designated-server design, we can use an HTTP load balancer that 
is aware of 307 responses. Currently, we use a 307 response to route the 
request to the designated server on the client side; such a response can also be 
handled automatically by some load balancers.
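
For illustration, a minimal sketch of how a client (or a redirect-aware proxy) 
could replay a request when a non-designated server answers 307 with the 
designated server in the Location header; this only shows the mechanism and is 
not Livy client code.

{code:scala}
import java.net.{HttpURLConnection, URL}

object DesignatedServerClient {
  // POST to any Livy server; if it is not the designated server for the
  // session, it answers 307 + Location and we replay the request there.
  def postFollowing307(url: String, body: Array[Byte], maxHops: Int = 3): Int = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setInstanceFollowRedirects(false)      // handle 307 ourselves
    conn.setDoOutput(true)
    conn.getOutputStream.write(body)
    val code = conn.getResponseCode
    if (code == 307 && maxHops > 0) {
      postFollowing307(conn.getHeaderField("Location"), body, maxHops - 1)
    } else {
      code
    }
  }
}
{code}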

I think the key point of the discussion is the designated server. Please let me 
know your suggestions on the issues I listed, and let's see if we can improve 
the design proposal.

