[jira] [Updated] (SPARK-30821) Executor pods with multiple containers will not be rescheduled unless all containers fail

2020-02-13 Thread Kevin Hogeland (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Hogeland updated SPARK-30821:
---
Description: Since the restart policy of launched pods is Never, additional 
handling is required for pods that may have sidecar containers. The executor 
should be considered failed if any containers have terminated and have a 
non-zero exit code, but Spark currently only checks the pod phase. The pod 
phase will remain "running" as long as _any_ containers are still running. Kubernetes 
sidecar support in 1.18/1.19 does not address this situation, as sidecar 
containers are excluded from pod phase calculation.  (was: Since the restart 
policy of launched pods is Never, additional handling is required for pods that 
may have sidecar containers that need to restart on failure. Kubernetes sidecar 
support in 1.18/1.19 does _not_ address this situation (unlike 
[SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
 as sidecar containers are excluded from pod phase calculation.

The pod snapshot should be considered "PodFailed" if the restart policy is 
Never and any container has a non-zero exit code.

(This is arguably a duplicate of SPARK-28887, but that issue is specifically 
for when the executor process fails))
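The proposed check can be sketched as follows. This is an illustrative Python model, not Spark's actual executor-lifecycle code; the field names mirror the Kubernetes pod status structure (`spec.restartPolicy`, `status.containerStatuses`, `state.terminated.exitCode`).

```python
def is_pod_failed(pod):
    """Treat a pod snapshot as failed if restartPolicy is Never and any
    container has terminated with a non-zero exit code, instead of relying
    on pod phase alone."""
    if pod["spec"].get("restartPolicy") != "Never":
        return False
    for status in pod["status"].get("containerStatuses", []):
        terminated = status.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            return True
    return False

# A pod whose sidecar died with exit code 1 while the executor keeps running:
# the phase stays "Running", but this rule marks the pod failed.
pod = {
    "spec": {"restartPolicy": "Never"},
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "executor", "state": {"running": {}}},
            {"name": "sidecar", "state": {"terminated": {"exitCode": 1}}},
        ],
    },
}
```

Under this rule the example pod is treated as failed even though its phase is still Running, so the executor would be rescheduled.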

> Executor pods with multiple containers will not be rescheduled unless all 
> containers fail
> -
>
> Key: SPARK-30821
> URL: https://issues.apache.org/jira/browse/SPARK-30821
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Kevin Hogeland
>Priority: Major
>
> Since the restart policy of launched pods is Never, additional handling is 
> required for pods that may have sidecar containers. The executor should be 
> considered failed if any containers have terminated and have a non-zero exit 
> code, but Spark currently only checks the pod phase. The pod phase will 
> remain "running" as long as _any_ containers are still running. Kubernetes sidecar 
> support in 1.18/1.19 does not address this situation, as sidecar containers 
> are excluded from pod phase calculation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30821) Executor pods with multiple containers will not be rescheduled unless all containers fail

2020-02-13 Thread Kevin Hogeland (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Hogeland updated SPARK-30821:
---
Summary: Executor pods with multiple containers will not be rescheduled 
unless all containers fail  (was: Sidecar containers in executor/driver may 
fail silently)

> Executor pods with multiple containers will not be rescheduled unless all 
> containers fail
> -
>
> Key: SPARK-30821
> URL: https://issues.apache.org/jira/browse/SPARK-30821
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Kevin Hogeland
>Priority: Major
>
> Since the restart policy of launched pods is Never, additional handling is 
> required for pods that may have sidecar containers that need to restart on 
> failure. Kubernetes sidecar support in 1.18/1.19 does _not_ address this 
> situation (unlike 
> [SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
>  as sidecar containers are excluded from pod phase calculation.
> The pod snapshot should be considered "PodFailed" if the restart policy is 
> Never and any container has a non-zero exit code.
> (This is arguably a duplicate of SPARK-28887, but that issue is specifically 
> for when the executor process fails)






[jira] [Updated] (SPARK-30821) Sidecar containers in executor/driver may fail silently

2020-02-13 Thread Kevin Hogeland (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Hogeland updated SPARK-30821:
---
Description: 
Since the restart policy of launched pods is Never, additional handling is 
required for pods that may have sidecar containers that need to restart on 
failure. Kubernetes sidecar support in 1.18/1.19 does _not_ address this 
situation (unlike 
[SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
 as sidecar containers are excluded from pod phase calculation.

The pod snapshot should be considered "PodFailed" if the restart policy is 
Never and any container has a non-zero exit code.

(This is arguably a duplicate of SPARK-28887, but that issue is specifically 
for when the executor process fails)

  was:
Since the restart policy of launched pods is Never, additional handling is 
required for pods that may have sidecar containers that need to restart on 
failure. Kubernetes sidecar support in 1.18/1.19 does _not_ address this 
situation (unlike 
[SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
 as sidecar containers are excluded from pod phase calculation.

The pod snapshot should be considered "PodFailed" if the restart policy is 
Never and any container has a non-zero exit code.


> Sidecar containers in executor/driver may fail silently
> ---
>
> Key: SPARK-30821
> URL: https://issues.apache.org/jira/browse/SPARK-30821
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Kevin Hogeland
>Priority: Major
>
> Since the restart policy of launched pods is Never, additional handling is 
> required for pods that may have sidecar containers that need to restart on 
> failure. Kubernetes sidecar support in 1.18/1.19 does _not_ address this 
> situation (unlike 
> [SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
>  as sidecar containers are excluded from pod phase calculation.
> The pod snapshot should be considered "PodFailed" if the restart policy is 
> Never and any container has a non-zero exit code.
> (This is arguably a duplicate of SPARK-28887, but that issue is specifically 
> for when the executor process fails)






[jira] [Created] (SPARK-30821) Sidecar containers in executor/driver may fail silently

2020-02-13 Thread Kevin Hogeland (Jira)
Kevin Hogeland created SPARK-30821:
--

 Summary: Sidecar containers in executor/driver may fail silently
 Key: SPARK-30821
 URL: https://issues.apache.org/jira/browse/SPARK-30821
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Kevin Hogeland


Since the restart policy of launched pods is Never, additional handling is 
required for pods that may have sidecar containers that need to restart on 
failure. Kubernetes sidecar support in 1.18/1.19 does _not_ address this 
situation (unlike 
[SPARK-28887|https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28887]),
 as sidecar containers are excluded from pod phase calculation.

The pod snapshot should be considered "PodFailed" if the restart policy is 
Never and any container has a non-zero exit code.






[jira] [Updated] (SPARK-30055) Allow configurable restart policy of driver and executor pods

2019-11-26 Thread Kevin Hogeland (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Hogeland updated SPARK-30055:
---
Description: 
The current Kubernetes scheduler hard-codes the restart policy for all pods to 
be "Never". To restart a failed application, all pods have to be deleted and 
rescheduled, which is very slow and clears any caches the processes may have 
built. Spark should allow a configurable restart policy for both drivers and 
executors for immediate restart of crashed/killed drivers/executors as long as 
the pods are not evicted. (This is not about eviction resilience, that's 
described in this issue: SPARK-23980)

Also, as far as I can tell, there's no reason the executors should be set to 
never restart. Should that be configurable or should it just be changed to 
Always?
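The requested behavior could look something like the sketch below. The configuration key `spark.kubernetes.executor.restartPolicy` does not exist in Spark; it is a made-up name used only to illustrate validating a user-supplied restart policy against the values Kubernetes accepts.

```python
# Hypothetical sketch: resolve a (made-up) restart-policy conf key,
# defaulting to today's hard-coded behavior of "Never".
VALID_POLICIES = {"Never", "OnFailure", "Always"}

def executor_restart_policy(conf):
    """Return the restartPolicy to write into the executor pod spec."""
    policy = conf.get("spark.kubernetes.executor.restartPolicy", "Never")
    if policy not in VALID_POLICIES:
        raise ValueError("invalid restartPolicy: " + policy)
    return policy
```

With no key set this preserves the current behavior, while a user could opt into OnFailure or Always to restart crashed containers in place.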

 

  was:
The current Kubernetes scheduler hard-codes the restart policy for all pods to 
be "Never". To restart a failed application, all pods have to be deleted and 
rescheduled, which is very slow and clears any caches the processes may have 
built. Spark should allow a configurable restart policy for both drivers and 
executors for immediate restart of crashed/killed drivers/executors as long as 
the pods are not evicted. (This is not about eviction resilience, that's 
described in this issue: 
[SPARK-23980|https://issues.apache.org/jira/browse/SPARK-23980])

Also, as far as I can tell, there's no reason the executors should be set to 
never restart. Should that be configurable or should it just be changed to 
OnFailure?

 


> Allow configurable restart policy of driver and executor pods
> -
>
> Key: SPARK-30055
> URL: https://issues.apache.org/jira/browse/SPARK-30055
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.4
>Reporter: Kevin Hogeland
>Priority: Major
>
> The current Kubernetes scheduler hard-codes the restart policy for all pods 
> to be "Never". To restart a failed application, all pods have to be deleted 
> and rescheduled, which is very slow and clears any caches the processes may 
> have built. Spark should allow a configurable restart policy for both drivers 
> and executors for immediate restart of crashed/killed drivers/executors as 
> long as the pods are not evicted. (This is not about eviction resilience, 
> that's described in this issue: SPARK-23980)
> Also, as far as I can tell, there's no reason the executors should be set to 
> never restart. Should that be configurable or should it just be changed to 
> Always?
>  






[jira] [Created] (SPARK-30055) Allow configurable restart policy of driver and executor pods

2019-11-26 Thread Kevin Hogeland (Jira)
Kevin Hogeland created SPARK-30055:
--

 Summary: Allow configurable restart policy of driver and executor 
pods
 Key: SPARK-30055
 URL: https://issues.apache.org/jira/browse/SPARK-30055
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.4
Reporter: Kevin Hogeland


The current Kubernetes scheduler hard-codes the restart policy for all pods to 
be "Never". To restart a failed application, all pods have to be deleted and 
rescheduled, which is very slow and clears any caches the processes may have 
built. Spark should allow a configurable restart policy for both drivers and 
executors for immediate restart of crashed/killed drivers/executors as long as 
the pods are not evicted. (This is not about eviction resilience, that's 
described in this issue: 
[SPARK-23980|https://issues.apache.org/jira/browse/SPARK-23980])

Also, as far as I can tell, there's no reason the executors should be set to 
never restart. Should that be configurable or should it just be changed to 
OnFailure?

 






[jira] [Comment Edited] (SPARK-24105) Spark 2.3.0 on kubernetes

2019-03-20 Thread Kevin Hogeland (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797719#comment-16797719
 ] 

Kevin Hogeland edited comment on SPARK-24105 at 3/21/19 1:07 AM:
-

[~vanzin] Why was this marked "Won't Fix"? This is a major issue.
 * There is a limited amount of resources (constrained either by a 
ResourceQuota or by the size of the cluster)
 * Drivers are scheduled before executors due to the 2-layer scheduling design
 * Drivers consume from the same pool of resources as executors, making it 
possible to consume all available resources
 * If no driver can schedule an executor, all drivers are stalled indefinitely 
(even if they timeout and crash)

Starting too many drivers at the same time _will_ cause a deadlock. Any spiky 
workload is very likely to trigger this eventually. For example, if a large 
amount of Spark jobs are scheduled daily/hourly. We've been able to reproduce 
this easily in testing.
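The bullets above can be condensed into a toy model. This is an illustration of the shared-pool deadlock, not the Kubernetes scheduler: each job is assumed to need one capacity unit for its driver and one for a single executor, and all drivers are placed before any executor.

```python
def schedule(capacity, num_jobs):
    """Return (drivers_placed, executors_placed) for a shared capacity pool."""
    free = capacity
    drivers = min(num_jobs, free)   # all drivers are scheduled first
    free -= drivers
    executors = min(drivers, free)  # executors compete for what remains
    return drivers, executors
```

With capacity 4 and 4 simultaneous jobs, every unit is consumed by a driver and no executor can ever start, which is the stall described above; with 2 jobs, both drivers get an executor.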


was (Author: hogeland):
[~vanzin] Why was this marked "Won't Fix"? This is a _major_ issue.
 * There is a limited amount of resources (constrained either by a 
ResourceQuota or by the size of the cluster)
 * Drivers are scheduled before executors due to the 2-layer scheduling design
 * Drivers consume from the same pool of resources as executors, making it 
possible to consume all available resources
 * If no driver can schedule an executor, all drivers are stalled indefinitely 
(even if they timeout and crash)

Starting too many drivers at the same time _will_ cause a deadlock. Any spiky 
workload is very likely to trigger this eventually. For example, if a large 
amount of Spark jobs are scheduled daily/hourly. We've been able to reproduce 
this easily in testing.

> Spark 2.3.0 on kubernetes
> -
>
> Key: SPARK-24105
> URL: https://issues.apache.org/jira/browse/SPARK-24105
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> Right now it's only possible to define node selector configurations through 
> spark.kubernetes.node.selector.[labelKey]. This gets used for both driver 
> & executor pods. Without the capability to isolate driver & executor pods, 
> the cluster can run into a livelock scenario where, if there are a lot of 
> spark submits, the driver pods fill up the cluster capacity, with no room 
> for executor pods to do any work.
>  
> To avoid this deadlock, it's required to support node selector (in future 
> affinity/anti-affinity) configuration by driver & executor.
>  






[jira] [Comment Edited] (SPARK-24105) Spark 2.3.0 on kubernetes

2019-03-20 Thread Kevin Hogeland (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797719#comment-16797719
 ] 

Kevin Hogeland edited comment on SPARK-24105 at 3/21/19 1:09 AM:
-

[~vanzin] Why was this marked "Won't Fix"? This is a major issue.
 * There are a limited amount of resources (constrained either by a 
ResourceQuota or by the size of the cluster)
 * Drivers are scheduled before executors due to the 2-layer scheduling design
 * Drivers consume from the same pool of resources as executors
 * Starting too many drivers at once will make it impossible for any driver to 
schedule an executor
 * If no driver can schedule an executor, all drivers are stalled indefinitely 
(even if they timeout and crash)

Starting too many drivers at the same time _will_ cause a deadlock. Any spiky 
workload is very likely to trigger this eventually. For example, if a large 
amount of Spark jobs are scheduled daily/hourly. We've been able to reproduce 
this easily in testing.


was (Author: hogeland):
[~vanzin] Why was this marked "Won't Fix"? This is a major issue.
 * There are a limited amount of resources (constrained either by a 
ResourceQuota or by the size of the cluster)
 * Drivers are scheduled before executors due to the 2-layer scheduling design
 * Drivers consume from the same pool of resources as executors, making it 
possible to consume all available resources
 * If no driver can schedule an executor, all drivers are stalled indefinitely 
(even if they timeout and crash)

Starting too many drivers at the same time _will_ cause a deadlock. Any spiky 
workload is very likely to trigger this eventually. For example, if a large 
amount of Spark jobs are scheduled daily/hourly. We've been able to reproduce 
this easily in testing.

> Spark 2.3.0 on kubernetes
> -
>
> Key: SPARK-24105
> URL: https://issues.apache.org/jira/browse/SPARK-24105
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> Right now it's only possible to define node selector configurations through 
> spark.kubernetes.node.selector.[labelKey]. This gets used for both driver 
> & executor pods. Without the capability to isolate driver & executor pods, 
> the cluster can run into a livelock scenario where, if there are a lot of 
> spark submits, the driver pods fill up the cluster capacity, with no room 
> for executor pods to do any work.
>  
> To avoid this deadlock, it's required to support node selector (in future 
> affinity/anti-affinity) configuration by driver & executor.
>  






[jira] [Comment Edited] (SPARK-24105) Spark 2.3.0 on kubernetes

2019-03-20 Thread Kevin Hogeland (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797719#comment-16797719
 ] 

Kevin Hogeland edited comment on SPARK-24105 at 3/21/19 1:08 AM:
-

[~vanzin] Why was this marked "Won't Fix"? This is a major issue.
 * There are a limited amount of resources (constrained either by a 
ResourceQuota or by the size of the cluster)
 * Drivers are scheduled before executors due to the 2-layer scheduling design
 * Drivers consume from the same pool of resources as executors, making it 
possible to consume all available resources
 * If no driver can schedule an executor, all drivers are stalled indefinitely 
(even if they timeout and crash)

Starting too many drivers at the same time _will_ cause a deadlock. Any spiky 
workload is very likely to trigger this eventually. For example, if a large 
amount of Spark jobs are scheduled daily/hourly. We've been able to reproduce 
this easily in testing.


was (Author: hogeland):
[~vanzin] Why was this marked "Won't Fix"? This is a major issue.
 * There is a limited amount of resources (constrained either by a 
ResourceQuota or by the size of the cluster)
 * Drivers are scheduled before executors due to the 2-layer scheduling design
 * Drivers consume from the same pool of resources as executors, making it 
possible to consume all available resources
 * If no driver can schedule an executor, all drivers are stalled indefinitely 
(even if they timeout and crash)

Starting too many drivers at the same time _will_ cause a deadlock. Any spiky 
workload is very likely to trigger this eventually. For example, if a large 
amount of Spark jobs are scheduled daily/hourly. We've been able to reproduce 
this easily in testing.

> Spark 2.3.0 on kubernetes
> -
>
> Key: SPARK-24105
> URL: https://issues.apache.org/jira/browse/SPARK-24105
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> Right now it's only possible to define node selector configurations through 
> spark.kubernetes.node.selector.[labelKey]. This gets used for both driver 
> & executor pods. Without the capability to isolate driver & executor pods, 
> the cluster can run into a livelock scenario where, if there are a lot of 
> spark submits, the driver pods fill up the cluster capacity, with no room 
> for executor pods to do any work.
>  
> To avoid this deadlock, it's required to support node selector (in future 
> affinity/anti-affinity) configuration by driver & executor.
>  






[jira] [Commented] (SPARK-24105) Spark 2.3.0 on kubernetes

2019-03-20 Thread Kevin Hogeland (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797719#comment-16797719
 ] 

Kevin Hogeland commented on SPARK-24105:


[~vanzin] Why was this marked "Won't Fix"? This is a _major_ issue.
 * There is a limited amount of resources (constrained either by a 
ResourceQuota or by the size of the cluster)
 * Drivers are scheduled before executors due to the 2-layer scheduling design
 * Drivers consume from the same pool of resources as executors, making it 
possible to consume all available resources
 * If no driver can schedule an executor, all drivers are stalled indefinitely 
(even if they timeout and crash)

Starting too many drivers at the same time _will_ cause a deadlock. Any spiky 
workload is very likely to trigger this eventually. For example, if a large 
amount of Spark jobs are scheduled daily/hourly. We've been able to reproduce 
this easily in testing.

> Spark 2.3.0 on kubernetes
> -
>
> Key: SPARK-24105
> URL: https://issues.apache.org/jira/browse/SPARK-24105
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> Right now it's only possible to define node selector configurations through 
> spark.kubernetes.node.selector.[labelKey]. This gets used for both driver 
> & executor pods. Without the capability to isolate driver & executor pods, 
> the cluster can run into a livelock scenario where, if there are a lot of 
> spark submits, the driver pods fill up the cluster capacity, with no room 
> for executor pods to do any work.
>  
> To avoid this deadlock, it's required to support node selector (in future 
> affinity/anti-affinity) configuration by driver & executor.
>  






[jira] [Commented] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-11 Thread Kevin Hogeland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235632#comment-15235632
 ] 

Kevin Hogeland commented on SPARK-14437:


We can cherry pick the fix or use Akka for now.

> Spark using Netty RPC gets wrong address in some setups
> ---
>
> Key: SPARK-14437
> URL: https://issues.apache.org/jira/browse/SPARK-14437
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.0, 1.6.1
> Environment: AWS, Docker, Flannel
>Reporter: Kevin Hogeland
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Netty can't get the correct origin address in certain network setups. Spark 
> should handle this, as relying on Netty correctly reporting all addresses 
> leads to incompatible and unpredictable network states. We're currently using 
> Docker with Flannel on AWS. Container communication looks something like: 
> {{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) 
> -> Container 2 (4.5.6.1)}}
> If the client in that setup is Container 1 (1.2.3.1), Netty channels from 
> there to Container 2 will have a client address of 1.2.3.0.
> The {{RequestMessage}} object that is sent over the wire already contains a 
> {{senderAddress}} field that the sender can use to specify their address. In 
> {{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client 
> socket address when null. {{senderAddress}} in the messages sent from the 
> executors is currently always null, meaning all messages will have these 
> incorrect addresses (we've switched back to Akka as a temporary workaround 
> for this). The executor should send its address explicitly so that the driver 
> doesn't attempt to infer addresses based on possibly incorrect information 
> from Netty.
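The failure mode in the description can be modeled as below. The names are illustrative, not Spark's actual NettyRpcEnv code: when a message carries no senderAddress, the receiver falls back to the Netty socket's peer address, which behind NAT is the intermediate Docker host rather than the executor.

```python
def effective_sender(sender_address, socket_peer_address):
    """Address the driver records for the remote endpoint: the explicit
    senderAddress if present, otherwise the Netty socket's peer address."""
    return sender_address if sender_address is not None else socket_peer_address

# Executor in Container 1 (1.2.3.1) reaches the driver via Docker host A
# (1.2.3.0); with senderAddress unset, the driver records the host address,
# while sending it explicitly yields the executor's real address.
```

This is why the proposed fix has the executor always populate senderAddress instead of leaving the driver to infer it from the socket.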






[jira] [Commented] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-08 Thread Kevin Hogeland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233211#comment-15233211
 ] 

Kevin Hogeland commented on SPARK-14437:


Will do soon. Can we get this fix applied to 1.6?

> Spark using Netty RPC gets wrong address in some setups
> ---
>
> Key: SPARK-14437
> URL: https://issues.apache.org/jira/browse/SPARK-14437
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.0, 1.6.1
> Environment: AWS, Docker, Flannel
>Reporter: Kevin Hogeland
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Netty can't get the correct origin address in certain network setups. Spark 
> should handle this, as relying on Netty correctly reporting all addresses 
> leads to incompatible and unpredictable network states. We're currently using 
> Docker with Flannel on AWS. Container communication looks something like: 
> {{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) 
> -> Container 2 (4.5.6.1)}}
> If the client in that setup is Container 1 (1.2.3.1), Netty channels from 
> there to Container 2 will have a client address of 1.2.3.0.
> The {{RequestMessage}} object that is sent over the wire already contains a 
> {{senderAddress}} field that the sender can use to specify their address. In 
> {{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client 
> socket address when null. {{senderAddress}} in the messages sent from the 
> executors is currently always null, meaning all messages will have these 
> incorrect addresses (we've switched back to Akka as a temporary workaround 
> for this). The executor should send its address explicitly so that the driver 
> doesn't attempt to infer addresses based on possibly incorrect information 
> from Netty.






[jira] [Comment Edited] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-07 Thread Kevin Hogeland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231442#comment-15231442
 ] 

Kevin Hogeland edited comment on SPARK-14437 at 4/8/16 12:56 AM:
-

[~zsxwing] Can confirm that after applying this commit to 1.6.1, the driver is 
able to connect to the block manager. Thanks for the quick patch.

I also encountered this error when trying to run with this change on the latest 
2.0.0-SNAPSHOT, possibly unrelated but worth documenting here:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
stage 29.0 failed 4 times, most recent failure: Lost task 3.3 in stage 29.0 
(TID 24, ip-172-16-15-0.us-west-2.compute.internal): 
java.lang.RuntimeException: Stream '/jars/' was not found.
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:223)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
{code}


was (Author: hogeland):
[~zsxwing] Can confirm that after applying this commit to 1.6.1, the driver is 
able to connect to the block manager. Thanks for the quick patch.

I also encountered this error when trying to run on the latest 2.0.0-SNAPSHOT, 
possibly unrelated but worth documenting here:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
stage 29.0 failed 4 times, most recent failure: Lost task 3.3 in stage 29.0 
(TID 24, ip-172-16-15-0.us-west-2.compute.internal): 
java.lang.RuntimeException: Stream '/jars/' was not found.
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:223)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 

[jira] [Commented] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-07 Thread Kevin Hogeland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231442#comment-15231442
 ] 

Kevin Hogeland commented on SPARK-14437:


[~zsxwing] Can confirm that after applying this commit to 1.6.1, the driver is 
able to connect to the block manager. Thanks for the quick patch.

I also encountered this error when trying to run on the latest 2.0.0-SNAPSHOT, 
possibly unrelated but worth documenting here:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in 
stage 29.0 failed 4 times, most recent failure: Lost task 3.3 in stage 29.0 
(TID 24, ip-172-16-15-0.us-west-2.compute.internal): 
java.lang.RuntimeException: Stream '/jars/' was not found.
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:223)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:121)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
{code}

> Spark using Netty RPC gets wrong address in some setups
> ---
>
> Key: SPARK-14437
> URL: https://issues.apache.org/jira/browse/SPARK-14437
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.0, 1.6.1
> Environment: AWS, Docker, Flannel
>Reporter: Kevin Hogeland
>
> Netty can't get the correct origin address in certain network setups. Spark 
> should handle this, as relying on Netty correctly reporting all addresses 
> leads to incompatible and unpredictable network states. We're currently using 
> Docker with Flannel on AWS. Container communication looks something like: 
> {{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) 
> -> Container 2 (4.5.6.1)}}
> If the client in that setup is Container 1 (1.2.3.4), Netty channels from 
> there to Container 2 will have a client address of 1.2.3.0.
> The {{RequestMessage}} object that is sent over the wire already contains a 
> {{senderAddress}} field that the sender can use to specify their address. In 
> {{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client 
> socket address when null. {{senderAddress}} in the messages sent from the 
> executors is currently always null, meaning all messages will have these 
> incorrect addresses (we've switched back to Akka as a temporary workaround 
> for this). The executor should send its address explicitly so that the driver 
> doesn't attempt to infer addresses based on possibly incorrect information 
> from Netty.
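The null-fallback described in the report above can be sketched as follows. This is a minimal, hypothetical illustration of the behavior attributed to {{NettyRpcEnv#internalReceive}}; the class and field names here are stand-ins, not Spark's actual code:

```java
import java.net.InetSocketAddress;

// Hypothetical sketch: when a message carries no senderAddress, the receiver
// falls back to the Netty channel's remote socket address. Behind Docker/
// Flannel NAT, that socket address is the Docker host's, not the container's.
public class SenderAddressFallback {

    // Stand-in for Spark's RequestMessage; the field name mirrors the report.
    static final class RequestMessage {
        final InetSocketAddress senderAddress;  // null when the executor omits it
        RequestMessage(InetSocketAddress senderAddress) {
            this.senderAddress = senderAddress;
        }
    }

    // Mirrors the null-replacement step described for NettyRpcEnv#internalReceive.
    static InetSocketAddress resolveSender(RequestMessage msg,
                                           InetSocketAddress channelRemote) {
        return msg.senderAddress != null ? msg.senderAddress : channelRemote;
    }

    public static void main(String[] args) {
        // Addresses taken from the topology in the report.
        InetSocketAddress hostA = new InetSocketAddress("1.2.3.0", 33203);      // Docker host A, as Netty sees the client
        InetSocketAddress container1 = new InetSocketAddress("1.2.3.1", 33203); // the executor's real address

        // senderAddress == null: the driver records the NAT'd host address.
        System.out.println(resolveSender(new RequestMessage(null), hostA).getHostString());
        // senderAddress set explicitly: the driver records the correct address.
        System.out.println(resolveSender(new RequestMessage(container1), hostA).getHostString());
    }
}
```

The proposed fix amounts to the executor always populating {{senderAddress}} so the second branch is taken and the channel's possibly NAT'd address is never consulted.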



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-07 Thread Kevin Hogeland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230733#comment-15230733
 ] 

Kevin Hogeland edited comment on SPARK-14437 at 4/7/16 6:09 PM:


Yes, {{BlockManager#doGetRemote}} requests the executor block manager address 
and attempts to fetch from it:

{code}
16/04/06 18:18:30 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 6778.0 
(TID 7351, ip-172-16-15-0.us-west-2.compute.internal): 
org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
locations. Most recent failure cause:
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
...
Caused by: java.io.IOException: Failed to connect to 
ip-172-16-14-0.us-west-2.compute.internal/172.16.14.0:33203
...
Caused by: java.net.ConnectException: Connection refused: 
ip-172-16-14-0.us-west-2.compute.internal/172.16.14.0:33203
{code}
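The failure mode in the log above ("Failed to fetch block from 1 locations. Most recent failure cause") suggests a fetch loop that tries each reported location and wraps only the last failure. A hypothetical sketch of that pattern follows; the names are illustrative, not Spark's actual {{BlockManager}} API:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.List;

// Hypothetical doGetRemote-style fetch: try each location the driver
// reported, remember the most recent failure, and give up only after all
// locations are exhausted. A stale or NAT'd address in the location list
// therefore surfaces as a wrapped ConnectException, as in the log above.
public class BlockFetchSketch {

    interface Fetcher {
        byte[] fetch(String location) throws IOException;
    }

    static byte[] fetchFromAny(List<String> locations, Fetcher fetcher) throws IOException {
        IOException lastFailure = null;
        for (String loc : locations) {
            try {
                return fetcher.fetch(loc);
            } catch (IOException e) {
                lastFailure = e;  // keep only the most recent cause
            }
        }
        throw new IOException("Failed to fetch block from " + locations.size()
                + " locations. Most recent failure cause", lastFailure);
    }

    public static void main(String[] args) throws IOException {
        // The first (NAT'd, unreachable) address fails; the second succeeds.
        byte[] data = fetchFromAny(List.of("172.16.14.0:33203", "172.16.15.0:33203"), loc -> {
            if (loc.startsWith("172.16.14.")) {
                throw new ConnectException("Connection refused: " + loc);
            }
            return new byte[] {42};
        });
        System.out.println(data.length);
    }
}
```

In the reported setup every location the driver hands out is wrong, so the loop exhausts all of them and the task fails with the wrapped connection error.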


was (Author: hogeland):
Yes, {{BlockManager#doGetRemote}} requests the executor block manager address 
from the driver and attempts to fetch from it:

{code}
16/04/06 18:18:30 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 6778.0 
(TID 7351, ip-172-16-15-0.us-west-2.compute.internal): 
org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
locations. Most recent failure cause:
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
...
Caused by: java.io.IOException: Failed to connect to 
ip-172-16-14-0.us-west-2.compute.internal/172.16.14.0:33203
...
Caused by: java.net.ConnectException: Connection refused: 
ip-172-16-14-0.us-west-2.compute.internal/172.16.14.0:33203
{code}







[jira] [Comment Edited] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-07 Thread Kevin Hogeland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230733#comment-15230733
 ] 

Kevin Hogeland edited comment on SPARK-14437 at 4/7/16 6:08 PM:


Yes, {{BlockManager#doGetRemote}} requests the executor block manager address 
from the driver and attempts to fetch from it:

{code}
16/04/06 18:18:30 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 6778.0 
(TID 7351, ip-172-16-15-0.us-west-2.compute.internal): 
org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
locations. Most recent failure cause:
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
...
Caused by: java.io.IOException: Failed to connect to 
ip-172-16-14-0.us-west-2.compute.internal/172.16.14.0:33203
...
Caused by: java.net.ConnectException: Connection refused: 
ip-172-16-14-0.us-west-2.compute.internal/172.16.14.0:33203
{code}


was (Author: hogeland):
Yes, {{BlockManager#doGetRemote}} requests the executor block manager address 
from the driver and attempts to fetch from it:

{code}
16/04/06 18:18:30 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 6778.0 
(TID 7351, ip-172-16-15-0.us-west-2.compute.internal): 
org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
locations. Most recent failure cause:
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
...
Caused by: java.io.IOException: Failed to connect to 
ip-172-16-14-0.us-west-2.compute.internal/172.16.14.0:33203
{code}







[jira] [Commented] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-07 Thread Kevin Hogeland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230733#comment-15230733
 ] 

Kevin Hogeland commented on SPARK-14437:


Yes, {{BlockManager#doGetRemote}} requests the executor block manager address 
from the driver and attempts to fetch from it:

{code}
16/04/06 18:18:30 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 6778.0 
(TID 7351, ip-172-16-15-0.us-west-2.compute.internal): 
org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
locations. Most recent failure cause:
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at 
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
...
Caused by: java.io.IOException: Failed to connect to 
ip-172-16-14-0.us-west-2.compute.internal/172.16.14.0:33203
{code}







[jira] [Updated] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-06 Thread Kevin Hogeland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Hogeland updated SPARK-14437:
---
Description: 
Netty can't get the correct origin address in certain network setups. Spark 
should handle this, as relying on Netty correctly reporting all addresses leads 
to incompatible and unpredictable network states. We're currently using Docker 
with Flannel on AWS. Container communication looks something like: {{Container 
1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) -> Container 
2 (4.5.6.1)}}

If the client in that setup is Container 1 (1.2.3.4), Netty channels from there 
to Container 2 will have a client address of 1.2.3.0.

The {{RequestMessage}} object that is sent over the wire already contains a 
{{senderAddress}} field that the sender can use to specify their address. In 
{{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client socket 
address when null. {{senderAddress}} in the messages sent from the executors is 
currently always null, meaning all messages will have these incorrect addresses 
(we've switched back to Akka as a temporary workaround for this). The executor 
should send its address explicitly so that the driver doesn't attempt to infer 
addresses based on possibly incorrect information from Netty.

  was:
Netty can't get the correct origin address in certain network setups. Spark 
should handle this, as relying on Netty correctly reporting all addresses leads 
to incompatible and unpredictable networking setups. We're currently using 
Docker with Flannel on AWS. Container communication looks something like: 
{{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) 
-> Container 2 (4.5.6.1)}}

If the client in that setup is Container 1 (1.2.3.4), Netty channels from there 
to Container 2 will have a client address of 1.2.3.0.

The {{RequestMessage}} object that is sent over the wire already contains a 
{{senderAddress}} field that the sender can use to specify their address. In 
{{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client socket 
address when null. {{senderAddress}} in the messages sent from the executors is 
currently always null, meaning all messages will have these incorrect addresses 
(we've switched back to Akka as a temporary workaround for this). The executor 
should send its address explicitly so that the driver doesn't attempt to infer 
addresses based on possibly incorrect information from Netty.








[jira] [Updated] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-06 Thread Kevin Hogeland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Hogeland updated SPARK-14437:
---
Description: 
Netty can't get the correct origin address in certain network setups. Spark 
should handle this, as relying on Netty correctly reporting all addresses leads 
to incompatible and unpredictable networking setups. We're currently using 
Docker with Flannel on AWS. Container communication looks something like: 
{{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) 
-> Container 2 (4.5.6.1)}}

If the client in that setup is Container 1 (1.2.3.4), Netty channels from there 
to Container 2 will have a client address of 1.2.3.0.

The {{RequestMessage}} object that is sent over the wire already contains a 
{{senderAddress}} field that the sender can use to specify their address. In 
{{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client socket 
address when null. {{senderAddress}} in the messages sent from the executors is 
currently always null, meaning all messages will have these incorrect addresses 
(we've switched back to Akka as a temporary workaround for this). The executor 
should send its address explicitly so that the driver doesn't attempt to infer 
addresses based on possibly incorrect information from Netty.

  was:
Netty can't get the correct origin address in certain network setups. Spark 
should handle this, as relying on Netty correctly reporting all addresses leads 
to incompatible and unpredictable networking setups. We're currently using 
Docker with Flannel on AWS. Container communication looks something like: 
{{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) 
-> Container 2 (4.5.6.1)}}

If the client in that setup is Container 1 (1.2.3.4), Netty channels from there 
to Container 2 will have a client address of 1.2.3.0.

The {{RequestMessage}} object that is sent over the wire already contains a 
{{senderAddress}} field that the sender can use to specify their address. In 
{{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client socket 
address when null. {{senderAddress}} in the messages sent from the executors is 
currently always null. The executor should send its address explicitly so that 
the driver doesn't attempt to infer addresses based on possibly incorrect 
information from Netty.








[jira] [Created] (SPARK-14437) Spark using Netty RPC gets wrong address in some setups

2016-04-06 Thread Kevin Hogeland (JIRA)
Kevin Hogeland created SPARK-14437:
--

 Summary: Spark using Netty RPC gets wrong address in some setups
 Key: SPARK-14437
 URL: https://issues.apache.org/jira/browse/SPARK-14437
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 1.6.1, 1.6.0
 Environment: AWS, Docker, Flannel
Reporter: Kevin Hogeland


Netty can't get the correct origin address in certain network setups. Spark 
should handle this, as relying on Netty correctly reporting all addresses leads 
to incompatible and unpredictable networking setups. We're currently using 
Docker with Flannel on AWS. Container communication looks something like: 
{{Container 1 (1.2.3.1) -> Docker host A (1.2.3.0) -> Docker host B (4.5.6.0) 
-> Container 2 (4.5.6.1)}}

If the client in that setup is Container 1 (1.2.3.4), Netty channels from there 
to Container 2 will have a client address of 1.2.3.0.

The {{RequestMessage}} object that is sent over the wire already contains a 
{{senderAddress}} field that the sender can use to specify their address. In 
{{NettyRpcEnv#internalReceive}}, this is replaced with the Netty client socket 
address when null. {{senderAddress}} in the messages sent from the executors is 
currently always null. The executor should send its address explicitly so that 
the driver doesn't attempt to infer addresses based on possibly incorrect 
information from Netty.


