[jira] [Updated] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections
[ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Petar Petrov updated SPARK-23182:
-
    Affects Version/s: 2.2.2

> Allow enabling of TCP keep alive for master RPC connections
> ---
>
> Key: SPARK-23182
> URL: https://issues.apache.org/jira/browse/SPARK-23182
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.2, 2.4.0
> Reporter: Petar Petrov
> Priority: Major
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines
> disappear without closing the TCP connections to the master which increases
> the number of established connections and new workers can not connect because
> of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep alive for the RPC connections
> to the master but it's not possible to do so via configuration.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections
[ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352870#comment-16352870 ]

Petar Petrov edited comment on SPARK-23182 at 2/5/18 8:14 PM:
--

We run a cluster of ~1000 cores in GCE using preemptible VMs for executors / workers and a standard (non-preemptible) master VM. That cluster processes tons of jobs 24/7. It processes about 2 jobs / day and does not stop.

With time many workers join and get dissociated from the cluster. GCE evicts preemptible VMs without a graceful shutdown.

GCE does support setting a shutdown script on preemptible VMs, but it's not always invoked (from https://cloud.google.com/compute/docs/shutdownscript):

{noformat}
Compute Engine only executes shutdown scripts on a best-effort basis and does not guarantee that the shutdown script will be run in all cases.
{noformat}

When a worker joins the cluster and is stopped without the executor gracefully stopped, the master keeps the connection open (although inactive) infinitely long. After some time the master errors with "Too many open files" and can not accept connections anymore.

Thus the need to enable TCP keep alive. It guarantees that when the worker is stopped, the master's OS will check the other side and close the connection if it's not responding.

was (Author: pesho82):

We run a cluster of ~1000 cores in GCE using preemptible VMs for executors / workers and a standard (non-preemptible) master VM. That cluster processes tons of jobs 24/7. It processes about 2 jobs / day and does not stop.

With time many workers join and get dissociated from the cluster. GCE evicts preemptible VMs without a graceful shutdown.

GCE does support setting a shutdown script on preemptible VMs, but it's not always invoked (from https://cloud.google.com/compute/docs/shutdownscript):

{noformat}
Compute Engine only executes shutdown scripts on a best-effort basis and does not guarantee that the shutdown script will be run in all cases.
{noformat}

When a worker joins the cluster and is stopped without the executor gracefully stopped, the master keeps the connection open (although inactive) infinitely long. After some time the master errors with "Too many open files" and can not accept connections anymore.

Thus the need to enable TCP keep alive. It guarantees that when the worker is stopped, the master's OS will check the other side and close the connection if it's not responding.

> Allow enabling of TCP keep alive for master RPC connections
> ---
>
> Key: SPARK-23182
> URL: https://issues.apache.org/jira/browse/SPARK-23182
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Petar Petrov
> Priority: Major
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines
> disappear without closing the TCP connections to the master which increases
> the number of established connections and new workers can not connect because
> of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep alive for the RPC connections
> to the master but it's not possible to do so via configuration.
[jira] [Comment Edited] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections
[ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352870#comment-16352870 ]

Petar Petrov edited comment on SPARK-23182 at 2/5/18 8:14 PM:
--

We run a cluster of ~1000 cores in GCE using preemptible VMs for executors / workers and a standard (non-preemptible) master VM. That cluster processes tons of jobs 24/7. It processes about 2 jobs / day and does not stop.

With time many workers join and get dissociated from the cluster. GCE evicts preemptible VMs without a graceful shutdown.

GCE does support setting a shutdown script on preemptible VMs, but it's not always invoked (from https://cloud.google.com/compute/docs/shutdownscript):

{noformat}
Compute Engine only executes shutdown scripts on a best-effort basis and does not guarantee that the shutdown script will be run in all cases.
{noformat}

When a worker joins the cluster and is stopped without the executor gracefully stopped, the master keeps the connection open (although inactive) infinitely long. After some time the master errors with "Too many open files" and can not accept connections anymore.

Thus the need to enable TCP keep alive. It guarantees that when the worker is stopped, the master's OS will check the other side and close the connection if it's not responding.

was (Author: pesho82):

We run a cluster of ~1000 cores in GCE using preemptible VMs for executors / workers and a standard (non-preemptible) master VM. That cluster processes tons of jobs 24/7. It processes about 2 jobs / day and does not stop.

With time many workers join and get dissociated from the cluster. GCE evicts VMs without a graceful shutdown.

GCE does support setting a shutdown script on preemptible VMs, but it's not always invoked (from https://cloud.google.com/compute/docs/shutdownscript):

{noformat}
Compute Engine only executes shutdown scripts on a best-effort basis and does not guarantee that the shutdown script will be run in all cases.
{noformat}

When a worker joins the cluster and is stopped without the executor gracefully stopped, the master keeps the connection open (although inactive) infinitely long. After some time the master errors with "Too many open files" and can not accept connections anymore.

Thus the need to enable TCP keep alive. It guarantees that when the worker is stopped, the master's OS will check the other side and close the connection if it's not responding.

> Allow enabling of TCP keep alive for master RPC connections
> ---
>
> Key: SPARK-23182
> URL: https://issues.apache.org/jira/browse/SPARK-23182
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Petar Petrov
> Priority: Major
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines
> disappear without closing the TCP connections to the master which increases
> the number of established connections and new workers can not connect because
> of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep alive for the RPC connections
> to the master but it's not possible to do so via configuration.
[jira] [Commented] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections
[ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352870#comment-16352870 ]

Petar Petrov commented on SPARK-23182:
--

We run a cluster of ~1000 cores in GCE using preemptible VMs for executors / workers and a standard (non-preemptible) master VM. That cluster processes tons of jobs 24/7. It processes about 2 jobs / day and does not stop.

With time many workers join and get dissociated from the cluster. GCE evicts VMs without a graceful shutdown.

GCE does support setting a shutdown script on preemptible VMs, but it's not always invoked (from https://cloud.google.com/compute/docs/shutdownscript):

{noformat}
Compute Engine only executes shutdown scripts on a best-effort basis and does not guarantee that the shutdown script will be run in all cases.
{noformat}

When a worker joins the cluster and is stopped without the executor gracefully stopped, the master keeps the connection open (although inactive) infinitely long. After some time the master errors with "Too many open files" and can not accept connections anymore.

Thus the need to enable TCP keep alive. It guarantees that when the worker is stopped, the master's OS will check the other side and close the connection if it's not responding.

> Allow enabling of TCP keep alive for master RPC connections
> ---
>
> Key: SPARK-23182
> URL: https://issues.apache.org/jira/browse/SPARK-23182
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Petar Petrov
> Priority: Major
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines
> disappear without closing the TCP connections to the master which increases
> the number of established connections and new workers can not connect because
> of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep alive for the RPC connections
> to the master but it's not possible to do so via configuration.
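To illustrate the mechanism the comment above relies on, here is a minimal socket-level sketch in Python. It is not Spark's RPC code and says nothing about how Spark would expose such a setting; the function names and the timing values (60 s idle, 10 s interval, 5 probes) are assumptions chosen for illustration. The three tuning options use their Linux names and are skipped on platforms that do not define them.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keep-alive for sock and tune probe timing where the OS allows."""
    # SO_KEEPALIVE asks the OS to probe an idle peer and drop the
    # connection when the peer stops answering.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux option names; platforms that lack one simply keep their default.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


def keepalive_enabled(sock):
    """Report whether SO_KEEPALIVE is currently set on sock."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

With these hypothetical values, a peer that vanishes without closing its connection would be detected after roughly idle_s + interval_s * probes = 110 seconds of silence, at which point the OS closes the socket and its file descriptor can be released.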
[jira] [Updated] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections
[ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Petar Petrov updated SPARK-23182:
-
    Affects Version/s: (was: 2.2.0)
                       2.4.0

> Allow enabling of TCP keep alive for master RPC connections
> ---
>
> Key: SPARK-23182
> URL: https://issues.apache.org/jira/browse/SPARK-23182
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Petar Petrov
> Priority: Major
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines
> disappear without closing the TCP connections to the master which increases
> the number of established connections and new workers can not connect because
> of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep alive for the RPC connections
> to the master but it's not possible to do so via configuration.
[jira] [Updated] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections
[ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Petar Petrov updated SPARK-23182:
-
    Priority: Major (was: Minor)

> Allow enabling of TCP keep alive for master RPC connections
> ---
>
> Key: SPARK-23182
> URL: https://issues.apache.org/jira/browse/SPARK-23182
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Petar Petrov
> Priority: Major
>
> We rely heavily on preemptible worker machines in GCP/GCE. These machines
> disappear without closing the TCP connections to the master which increases
> the number of established connections and new workers can not connect because
> of "Too many open files" on the master.
> To solve the problem we need to enable TCP keep alive for the RPC connections
> to the master but it's not possible to do so via configuration.
[jira] [Created] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections
Petar Petrov created SPARK-23182:

Summary: Allow enabling of TCP keep alive for master RPC connections
Key: SPARK-23182
URL: https://issues.apache.org/jira/browse/SPARK-23182
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.2.0
Reporter: Petar Petrov

We rely heavily on preemptible worker machines in GCP/GCE. These machines disappear without closing the TCP connections to the master which increases the number of established connections and new workers can not connect because of "Too many open files" on the master.

To solve the problem we need to enable TCP keep alive for the RPC connections to the master but it's not possible to do so via configuration.
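The "Too many open files" error described above is raised when a process reaches its limit on open file descriptors, and every established TCP connection holds one. The sketch below, in Python and for Unix-like systems only, reads that limit for the current process and estimates how long it would take leaked connections to exhaust it. The helper names and the numbers in the usage note are hypothetical and are not measurements from the reporter's cluster.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file-descriptor limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def days_until_exhaustion(soft_limit, open_now, leaked_per_day):
    """Estimate days until descriptors run out if leaked ones are never closed."""
    if leaked_per_day <= 0:
        return float("inf")
    # Headroom is the number of descriptors still available today.
    headroom = max(soft_limit - open_now, 0)
    return headroom / leaked_per_day
```

For example, with an assumed soft limit of 1024, 24 descriptors already in use, and 100 connections leaked per day, `days_until_exhaustion(1024, 24, 100)` gives 10.0 days. Raising the limit only postpones the failure; closing dead connections removes the leak.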