[jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Oleg Zhurakousky (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159559#comment-14159559 ]

Oleg Zhurakousky commented on SPARK-3561:
-

Sandy, one other thing:
While I understand the reasoning for the changes to the title and the
description of the JIRA, it would probably be better to coordinate such changes
with the original submitter before making them in the future (similar to what
Patrick suggested in SPARK-3174). This would avoid discrepancies in the overall
message and intentions of the JIRA.
Anyway, I've edited both the title and the description, taking your edits into
consideration.

 Expose pluggable architecture to facilitate native integration with 
 third-party execution environments.
 ---

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark _integrates with external resource-managing platforms_ such 
 as Apache Hadoop YARN and Mesos to facilitate execution of a Spark DAG in the 
 distributed environments provided by those platforms. 
 However, this integration is tightly coupled to Spark's implementation, making 
 it rather difficult to introduce integration points with other 
 resource-managing platforms without constant modifications to Spark's core 
 (see comments below for more details). 
 In addition, Spark _does not provide any integration points to third-party 
 **DAG-like** and **DAG-capable** execution environments_ native to those 
 platforms, thus limiting access to some of their native features (e.g., 
 MR2/Tez stateless shuffle, YARN resource localization, YARN management and 
 monitoring, and more) as well as to specialization aspects of such execution 
 environments (open source and proprietary). As an example, the inability to 
 access such features is starting to affect Spark's viability in large-scale, 
 batch and/or ETL applications. 
 Introducing a pluggable architecture would solve both of the issues mentioned 
 above, ultimately benefiting Spark's technology and community by allowing it 
 to venture into co-existence and collaboration with a variety of existing Big 
 Data platforms as well as the ones yet to come to market.
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 as a non-public API (@DeveloperApi).
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method maps directly to the corresponding method in the current version 
 of SparkContext. The JobExecutionContext implementation will be selected by 
 SparkContext via the master URL, as in 
 _execution-context:foo.bar.MyJobExecutionContext_, with the default 
 implementation containing the existing code from SparkContext. This allows the 
 current (corresponding) methods of SparkContext to delegate to that 
 implementation, ensuring binary and source compatibility with older versions 
 of Spark. 
 An integrator will now have the option to provide a custom implementation of 
 JobExecutionContext, either by implementing it from scratch or by extending 
 DefaultExecutionContext.
 Please see the attached design doc and pull request for more details.
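
To make the proposal concrete, here is a minimal sketch of what such a trait
might look like. This is illustrative only: the per-method SparkContext
parameter, the exact parameter lists, and the package placement are assumptions
made for readability; the authoritative signatures are those in the attached
design doc and pull request.

{code:scala}
package org.apache.spark

import scala.reflect.ClassTag

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.InputFormat
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical sketch of the proposed non-public (@DeveloperApi) trait.
// Each operation mirrors the corresponding SparkContext method; passing the
// calling SparkContext explicitly is an assumption made for illustration.
trait JobExecutionContext {

  def hadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int): RDD[(K, V)]

  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      sc: SparkContext,
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration): RDD[(K, V)]

  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit
}
{code}

An integrator would then select a custom context through the master URL, as
described above:

{code:scala}
val conf = new SparkConf()
  .setAppName("spark-on-custom-context")
  .setMaster("execution-context:foo.bar.MyJobExecutionContext")
val sc = new SparkContext(conf) // delegates the four operations above
{code}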






[jira] [Commented] (SPARK-3561) Expose pluggable architecture to facilitate native integration with third-party execution environments.

2014-10-05 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159573#comment-14159573 ]

Sean Owen commented on SPARK-3561:
--

I'd be interested to see a more specific motivating use case. Is this about
using Tez, for example, and where does it help to stack Spark on Tez on YARN?
Or MR2, etc.? Spark Core and Tez overlap, to be sure, and I'm not sure how much
value it adds to run one on the other. Kind of like running Oracle on MySQL or
something. Whatever it is: might it not be more natural to integrate the
feature into Spark itself?

It would be great if this were all just a matter of one extra trait and
interface. In practice I suspect there are a number of hidden assumptions
throughout the code that may leak through attempts at this abstraction.

I am definitely asking rather than asserting, and am curious to see more
specifics about the upside.



