Luca Canali created SPARK-26087:
-----------------------------------

             Summary: Spark on K8S jobs fail when returning large result sets 
in client mode under certain conditions
                 Key: SPARK-26087
                 URL: https://issues.apache.org/jira/browse/SPARK-26087
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 2.4.0
            Reporter: Luca Canali


*Short description*: Spark on K8S jobs fail when the following conditions are met: the driver runs in client mode on a host external to the K8S cluster, the task result size is greater than *spark.task.maxDirectResultSize*, and the executors' block managers in the K8S cluster are not routable from the driver.

*Background:* when the result of a task has to be sent back to the driver, three cases are possible: (1) the task fails if the result size is greater than *spark.driver.maxResultSize* (MAX_RESULT_SIZE); (2) the result is serialized directly back to the driver if its size is at most *spark.task.maxDirectResultSize*; (3) for result sizes between *spark.task.maxDirectResultSize* and MAX_RESULT_SIZE, the result is stored in the executor's block manager and only pointers to it are sent back to the driver.
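
For reference, a minimal self-contained sketch of this three-way decision, paraphrasing the logic in Spark's Executor.TaskRunner (the names below are illustrative, not the exact source):

{code:scala}
// Sketch of how an executor decides to ship a task result back to the driver.
sealed trait ResultPath
case object TaskFailsTooLarge extends ResultPath       // case (1): exceeds spark.driver.maxResultSize
case object DirectToDriver extends ResultPath          // case (2): fits within spark.task.maxDirectResultSize
case object PointerViaBlockManager extends ResultPath  // case (3): pointer sent, data left in the block manager

def resultPath(resultSize: Long,
               maxResultSize: Long,         // spark.driver.maxResultSize
               maxDirectResultSize: Long    // spark.task.maxDirectResultSize
              ): ResultPath =
  if (maxResultSize > 0 && resultSize > maxResultSize) TaskFailsTooLarge
  else if (resultSize > maxDirectResultSize) PointerViaBlockManager
  else DirectToDriver

// With the default 1 MB direct-result threshold, a 10 MB result takes the
// block-manager path, the one an external driver cannot reach in this setup:
assert(resultPath(10L << 20, 1L << 30, 1L << 20) == PointerViaBlockManager)
{code}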

For case 3 to complete successfully (i.e. for the driver, having received pointers into the block manager rather than the data itself, to fetch the result), the driver must be able to connect to the block manager address registered by the executor JVM. This currently fails when the Spark driver runs outside the K8S cluster, because the block manager addresses are private K8S pod addresses and therefore not routable from the driver, at least in our configuration; we think other setups may be affected as well.

How to *reproduce* the issue:
 * Use Spark on K8S with a driver located externally to the K8S cluster (client mode).
 * Start the Spark session with --conf *spark.task.maxDirectResultSize*=<testValue>; for example, <testValue> = 100 makes the problem appear for most tasks, even those returning small results (see the sketch after this list).
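
A hypothetical reproducer, run from a host outside the K8S cluster; the API server URL and container image are placeholders:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("k8s://https://<k8s-api-server>:<port>")              // placeholder
  .config("spark.kubernetes.container.image", "<spark-image>")  // placeholder
  .config("spark.task.maxDirectResultSize", "100")              // <testValue> = 100 bytes
  .getOrCreate()

// Each task's serialized result exceeds 100 bytes, so the executors store the
// results in their block managers and the driver fails when fetching the blocks:
spark.sparkContext.parallelize(1 to 100000, 4).collect()
{code}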

Existing *workaround*:
 * Set *spark.task.maxDirectResultSize* to the same value as *spark.driver.maxResultSize* when running in the configuration affected by this bug (K8S with the driver external to the cluster), as sketched below.
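
A sketch of the workaround, with illustrative values; note that in Spark 2.4 the effective direct-result threshold is additionally capped by spark.rpc.message.maxSize:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.driver.maxResultSize", "1g")
  .config("spark.task.maxDirectResultSize", "1g")  // same value: results always go directly to the driver
  .config("spark.rpc.message.maxSize", "1024")     // in MB; the direct-result threshold is also capped by this
  .getOrCreate()
{code}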

Additional *notes*:
 * The executors register with routable public IPs, but the executor block managers register with IPs private to the K8S cluster. This is why task assignment works, yet the driver cannot connect to the executor block manager to pull the final result (when the result is bigger than *spark.task.maxDirectResultSize*).
 * The Spark documentation correctly states that "spark executors must be able to connect to the Spark driver over a hostname and a port that is routable from the Spark executors"; this case, however, is about the Spark driver needing to connect to the executors' block managers.


