[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()
[ https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567358#comment-16567358 ] Erik Erlandson commented on SPARK-24817: I have been looking at the use cases for barrier-mode on the design doc. The primary story seems to be along the lines of using {{mapPartitions}} to: # write out any partitioned data (and sync) # execute some kind of ML logic (TF, etc) (possibly syncing on stages here?) # optionally move back into "normal" spark executions My mental model has been that the value proposition for Hydrogen is primarily a convergence argument: it is easier to not have to leave a Spark workflow and execute something like TF using some other toolchain. But OTOH, given that the Spark programmer has to write out the partitioned data and then invoke ML tooling like TF regardless, does the increase to convenience pay for the cost in complexity for absorbing new clustering & scheduling models into Spark, along with other consequences, for example SPARK-24615, compared to the "null hypothesis" of writing partition data, then using ML-specific clustering toolchains (kubeflow, for example), and consuming the resulting products in Spark afterward. > Implement BarrierTaskContext.barrier() > -- > > Key: SPARK-24817 > URL: https://issues.apache.org/jira/browse/SPARK-24817 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Implement BarrierTaskContext.barrier(), to support global sync between all > the tasks in a barrier stage. The global sync shall finish immediately once > all tasks in the same barrier stage reaches the same barrier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()
[ https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16567301#comment-16567301 ] Erik Erlandson commented on SPARK-24817: Thanks [~jiangxb] - I'd expect that design to work out-of-box on the k8s backend. ML-specific code seems like it will have needs that are harder to predict, by definition. If it can use IP addresses in the cluster space, it should work regardless. If it wants fqdn, then perhaps additional pod configurations will be required. > Implement BarrierTaskContext.barrier() > -- > > Key: SPARK-24817 > URL: https://issues.apache.org/jira/browse/SPARK-24817 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Implement BarrierTaskContext.barrier(), to support global sync between all > the tasks in a barrier stage. The global sync shall finish immediately once > all tasks in the same barrier stage reaches the same barrier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()
[ https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566278#comment-16566278 ] Jiang Xingbo commented on SPARK-24817: -- Actually the current implementation of _barrier_ function doesn't requires communications between executors, all executors just talk to a _BarrierCoordinator_ which is in the driver. But to allow launching ML workloads we do need to enable executors to communicate with each other directly, IIUC that shall be investigated under SPARK-24724 . Maybe [~mengxr] can provide more context here. > Implement BarrierTaskContext.barrier() > -- > > Key: SPARK-24817 > URL: https://issues.apache.org/jira/browse/SPARK-24817 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Implement BarrierTaskContext.barrier(), to support global sync between all > the tasks in a barrier stage. The global sync shall finish immediately once > all tasks in the same barrier stage reaches the same barrier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()
[ https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566159#comment-16566159 ] Erik Erlandson commented on SPARK-24817: I'm curious about what the {{barrier}} invocations inside {{mapPartitions}} closures imply about communications between executors, for example executors running on pods in a kube cluster. It is possible that whatever is allowing shuffle data to transfer between executors will also allow these {{barrier}} coordinations to work, but we had to create a headless service for executors to register properly with the driver pod, and if every executor pod needs something like that for barrier to work, it will be an impact for kube backend support. > Implement BarrierTaskContext.barrier() > -- > > Key: SPARK-24817 > URL: https://issues.apache.org/jira/browse/SPARK-24817 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Implement BarrierTaskContext.barrier(), to support global sync between all > the tasks in a barrier stage. The global sync shall finish immediately once > all tasks in the same barrier stage reaches the same barrier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24817) Implement BarrierTaskContext.barrier()
[ https://issues.apache.org/jira/browse/SPARK-24817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560096#comment-16560096 ] Apache Spark commented on SPARK-24817: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/21898 > Implement BarrierTaskContext.barrier() > -- > > Key: SPARK-24817 > URL: https://issues.apache.org/jira/browse/SPARK-24817 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Implement BarrierTaskContext.barrier(), to support global sync between all > the tasks in a barrier stage. The global sync shall finish immediately once > all tasks in the same barrier stage reaches the same barrier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org