[jira] [Closed] (LIVY-712) EMR 5.23/5.27 - Livy does not recognise that Spark job failed

2019-11-28 Thread Michal Sankot (Jira)


 [ https://issues.apache.org/jira/browse/LIVY-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michal Sankot closed LIVY-712.
--
Resolution: Workaround

The issue appears to be external: specifically, a problem in the AWS EMR 
customizations of the Hadoop libraries.

> EMR 5.23/5.27 - Livy does not recognise that Spark job failed
> -
>
> Key: LIVY-712
> URL: https://issues.apache.org/jira/browse/LIVY-712
> Project: Livy
>  Issue Type: Bug
>  Components: API
>Affects Versions: 0.5.0, 0.6.0
> Environment: AWS EMR 5.23/5.27, Scala
>Reporter: Michal Sankot
>Priority: Major
>  Labels: EMR, api, spark
>
> We've upgraded from AWS EMR 5.13 to 5.23 (Livy 0.4.0 -> 0.5.0, Spark 2.3.0 -> 
> 2.4.0), and an issue appeared: when an exception is thrown during Spark job 
> execution, Spark shuts down as if there were no problem and the job appears 
> as Completed in EMR. As a result, we're not notified when the system crashes. 
> The same problem appears in EMR 5.27 (Livy 0.6.0, Spark 2.4.4).
> Is it something with Spark? Or a known issue with Livy?
> In the Livy logs I see that spark-submit exits with error code 1
> {quote}{{05:34:59 WARN BatchSession$: spark-submit exited with code 1}}
> {quote}
>  and yet the Livy API states that the batch state is
> {quote}{{"state": "success"}}
> {quote}
> How can it be made to work again?
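As a workaround for the mismatch described above, one can cross-check the state Livy reports against the spark-submit exit code that appears in the batch log. A minimal sketch, assuming a Livy server at `http://localhost:8998` (the `GET /batches/{batchId}` endpoint is part of the Livy REST API; the log-parsing regex is an assumption based on the warning line quoted above):

```python
# Cross-check Livy's reported batch state against the spark-submit
# exit code visible in the batch log.
import json
import re
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumed Livy server address

# Matches the warning format quoted in the bug report.
EXIT_CODE_RE = re.compile(r"spark-submit exited with code (\d+)")

def exit_code_from_log(log_lines):
    """Return the spark-submit exit code found in the log, or None."""
    for line in log_lines:
        m = EXIT_CODE_RE.search(line)
        if m:
            return int(m.group(1))
    return None

def batch_really_succeeded(state, log_lines):
    """Treat a batch as successful only if Livy reports success AND
    no non-zero spark-submit exit code shows up in its log."""
    code = exit_code_from_log(log_lines)
    return state == "success" and (code is None or code == 0)

def fetch_batch(batch_id):
    """Fetch a batch's state and log lines from the Livy REST API."""
    with urllib.request.urlopen(f"{LIVY_URL}/batches/{batch_id}") as resp:
        body = json.load(resp)
    return body.get("state"), body.get("log", [])
```

With this check in place, a batch whose log contains "spark-submit exited with code 1" is treated as failed even when the API reports "success".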



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (LIVY-712) EMR 5.23/5.27 - Livy does not recognise that Spark job failed

2019-11-28 Thread Michal Sankot (Jira)


[ https://issues.apache.org/jira/browse/LIVY-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984437#comment-16984437 ]

Michal Sankot commented on LIVY-712:


It seems the issue was present in EMR 5.23/5.27 (Hadoop libraries 
2.8.5-amzn-3/2.8.5-amzn-4) and is no longer present in EMR 5.28 (Hadoop 
libraries 2.8.5-amzn-5).

I'm therefore closing the issue.

> EMR 5.23/5.27 - Livy does not recognise that Spark job failed
> -
>
> Key: LIVY-712
> URL: https://issues.apache.org/jira/browse/LIVY-712
> Project: Livy
>  Issue Type: Bug
>  Components: API
>Affects Versions: 0.5.0, 0.6.0
> Environment: AWS EMR 5.23/5.27, Scala
>Reporter: Michal Sankot
>Priority: Major
>  Labels: EMR, api, spark
>
> We've upgraded from AWS EMR 5.13 to 5.23 (Livy 0.4.0 -> 0.5.0, Spark 2.3.0 -> 
> 2.4.0), and an issue appeared: when an exception is thrown during Spark job 
> execution, Spark shuts down as if there were no problem and the job appears 
> as Completed in EMR. As a result, we're not notified when the system crashes. 
> The same problem appears in EMR 5.27 (Livy 0.6.0, Spark 2.4.4).
> Is it something with Spark? Or a known issue with Livy?
> In the Livy logs I see that spark-submit exits with error code 1
> {quote}{{05:34:59 WARN BatchSession$: spark-submit exited with code 1}}
> {quote}
>  and yet the Livy API states that the batch state is
> {quote}{{"state": "success"}}
> {quote}
> How can it be made to work again?





[jira] [Created] (LIVY-718) Support multi-active high availability in Livy

2019-11-28 Thread Yiheng Wang (Jira)
Yiheng Wang created LIVY-718:


 Summary: Support multi-active high availability in Livy
 Key: LIVY-718
 URL: https://issues.apache.org/jira/browse/LIVY-718
 Project: Livy
  Issue Type: Epic
  Components: RSC, Server
Reporter: Yiheng Wang


In this JIRA we want to discuss how to implement multi-active high availability 
in Livy.

Currently, Livy only supports single node recovery. This is not sufficient in 
some production environments. In our scenario, the Livy server serves many 
notebook and JDBC services. We want to make Livy service more fault-tolerant 
and scalable.

There are already some proposals for high availability in the community, but 
they are either incomplete or cover only active-standby high availability. We 
therefore propose a multi-active high-availability design with the following 
goals:
# One or more servers serve client requests at the same time.
# Sessions are allocated among the different servers.
# When one node crashes, the affected sessions are moved to other active 
servers.

Here's our design document, please review and comment:
https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
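To make goals #2 and #3 concrete, session allocation among servers is often done with a consistent-hash ring, so that removing a crashed server reassigns only the sessions it owned. This is a hypothetical sketch for illustration, not taken from the linked design document; the class and server names are invented:

```python
# Hypothetical consistent-hash ring for allocating Livy sessions
# among servers; removing a server moves only its own sessions.
import bisect
import hashlib

class SessionRing:
    def __init__(self, servers, vnodes=64):
        # Each server gets `vnodes` points on the ring to even out load.
        self._ring = []  # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, session_id):
        """Server responsible for a session: the first ring point at or
        after the session's hash, wrapping around the ring."""
        h = self._hash(str(session_id))
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]

    def remove(self, server):
        """Drop a crashed server; its sessions fall to the next point."""
        self._ring = [(h, s) for h, s in self._ring if s != server]
```

The key property for goal #3: after `remove()`, sessions whose owner was unaffected keep the same owner, so only the crashed server's sessions need recovery.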
 



