[jira] [Updated] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections

2018-04-24 Thread Petar Petrov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petar Petrov updated SPARK-23182:
-
Affects Version/s: 2.2.2

> Allow enabling of TCP keep alive for master RPC connections
> -----------------------------------------------------------
>
>                 Key: SPARK-23182
>                 URL: https://issues.apache.org/jira/browse/SPARK-23182
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.2, 2.4.0
>            Reporter: Petar Petrov
>            Priority: Major
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines 
> disappear without closing their TCP connections to the master, so the number 
> of established connections keeps growing until new workers cannot connect 
> because of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep alive for the RPC connections 
> to the master, but it is not currently possible to do so via configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections

2018-02-05 Thread Petar Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352870#comment-16352870
 ] 

Petar Petrov edited comment on SPARK-23182 at 2/5/18 8:14 PM:
--

We run a cluster of ~1000 cores in GCE using preemptible VMs for executors / 
workers and a standard (non-preemptible) master VM. That cluster processes tons 
of jobs 24/7.

It processes about 2 jobs / day and does not stop. Over time many workers 
join and get dissociated from the cluster. GCE evicts preemptible VMs without a 
graceful shutdown.

GCE does support setting a shutdown script on preemptible VMs, but it's not 
always invoked (from https://cloud.google.com/compute/docs/shutdownscript):
{noformat}
Compute Engine only executes shutdown scripts on a best-effort basis and does 
not guarantee that the shutdown script will be run in all cases.{noformat}
When a worker joins the cluster and is then stopped without the executor being 
shut down gracefully, the master keeps the connection open (although inactive) 
indefinitely. After some time the master fails with "Too many open files" and 
cannot accept connections anymore. Hence the need to enable TCP keep alive: 
with it enabled, the master's OS probes the other side of an idle connection 
and closes the connection if the peer does not respond.
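
How quickly the master's OS gives up on a vanished worker depends on the kernel's keepalive timers, so enabling the option alone may not be enough. The sketch below is only an illustration of that arithmetic, not part of Spark: the class and function names are made up for this note, the first set of values is the stock Linux defaults (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes), and the "tuned" values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeepAliveSettings:
    """Kernel keepalive timers in seconds (names are illustrative only)."""
    idle_seconds: int      # quiet time before the first probe is sent
    interval_seconds: int  # gap between consecutive unanswered probes
    probe_count: int       # unanswered probes before the connection is dropped


def worst_case_detection_seconds(settings: KeepAliveSettings) -> int:
    """Approximate upper bound on how long a dead peer keeps a socket open."""
    return settings.idle_seconds + settings.interval_seconds * settings.probe_count


# Stock Linux values: tcp_keepalive_time=7200, tcp_keepalive_intvl=75,
# tcp_keepalive_probes=9.
LINUX_DEFAULTS = KeepAliveSettings(idle_seconds=7200, interval_seconds=75, probe_count=9)

# Hypothetical tighter tuning for a cluster with frequent preemptions.
TUNED = KeepAliveSettings(idle_seconds=60, interval_seconds=10, probe_count=6)

print(worst_case_detection_seconds(LINUX_DEFAULTS))  # 7875
print(worst_case_detection_seconds(TUNED))           # 120
```

With the stock Linux timers a dead connection can therefore linger for a little over two hours, which may still matter on a master that sees many preemptions per hour.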

 












[jira] [Commented] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections

2018-02-05 Thread Petar Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352870#comment-16352870
 ] 

Petar Petrov commented on SPARK-23182:
--

We run a cluster of ~1000 cores in GCE using preemptible VMs for executors / 
workers and a standard (non-preemptible) master VM. That cluster processes tons 
of jobs 24/7.

It processes about 2 jobs / day and does not stop. Over time many workers 
join and get dissociated from the cluster. GCE evicts VMs without a graceful 
shutdown.

GCE does support setting a shutdown script on preemptible VMs, but it's not 
always invoked (from https://cloud.google.com/compute/docs/shutdownscript):

{noformat}
Compute Engine only executes shutdown scripts on a best-effort basis and does 
not guarantee that the shutdown script will be run in all cases.{noformat}
When a worker joins the cluster and is then stopped without the executor being 
shut down gracefully, the master keeps the connection open (although inactive) 
indefinitely. After some time the master fails with "Too many open files" and 
cannot accept connections anymore. Hence the need to enable TCP keep alive: 
with it enabled, the master's OS probes the other side of an idle connection 
and closes the connection if the peer does not respond.

 







[jira] [Updated] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections

2018-02-05 Thread Petar Petrov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petar Petrov updated SPARK-23182:
-
Affects Version/s: 2.4.0  (was: 2.2.0)







[jira] [Updated] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections

2018-02-05 Thread Petar Petrov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petar Petrov updated SPARK-23182:
-
Priority: Major  (was: Minor)







[jira] [Created] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections

2018-01-22 Thread Petar Petrov (JIRA)
Petar Petrov created SPARK-23182:


             Summary: Allow enabling of TCP keep alive for master RPC connections
                 Key: SPARK-23182
                 URL: https://issues.apache.org/jira/browse/SPARK-23182
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Petar Petrov


We rely heavily on preemptible worker machines in GCP/GCE. These machines 
disappear without closing their TCP connections to the master, so the number 
of established connections keeps growing until new workers cannot connect 
because of "Too many open files" on the master.

To solve the problem we need to enable TCP keep alive for the RPC connections 
to the master, but it is not currently possible to do so via configuration.
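
As a rough illustration of what "via configuration" could mean at the socket level, the sketch below turns SO_KEEPALIVE on only when a flag is set. It is not Spark code and does not reflect how Spark's RPC layer is built: the key name `example.rpc.enableTcpKeepAlive` and the helper `apply_keepalive` are hypothetical, and only the standard-library socket calls are real.

```python
import socket

# Hypothetical configuration key, invented for this sketch.
KEEPALIVE_KEY = "example.rpc.enableTcpKeepAlive"


def apply_keepalive(sock: socket.socket, conf: dict) -> bool:
    """Enable SO_KEEPALIVE on sock when the flag in conf is "true".

    Returns True if keepalive was switched on, False if the socket
    was left untouched.
    """
    enabled = str(conf.get(KEEPALIVE_KEY, "false")).strip().lower() == "true"
    if enabled:
        # Ask the OS to probe idle connections and drop unresponsive peers.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return enabled


# Demo on an unconnected TCP socket; a server would call this on each
# connection returned by accept().
demo = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    apply_keepalive(demo, {KEEPALIVE_KEY: "true"})
    print(demo.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
finally:
    demo.close()
```

Defaulting the flag to off keeps existing deployments unchanged, which is one plausible reason to expose this as an opt-in setting.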


