[jira] [Updated] (SPARK-25584) Document libsvm data source in doc site

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25584:
--
Component/s: ML

> Document libsvm data source in doc site
> ---
>
> Key: SPARK-25584
> URL: https://issues.apache.org/jira/browse/SPARK-25584
> Project: Spark
>  Issue Type: Story
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently, we only have Scala/Java API docs for image data source. It would 
> be nice to have some documentation in the doc site. So Python/R users can 
> also discover this feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25347) Document image data source in doc site

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25347:
--
Component/s: ML

> Document image data source in doc site
> --
>
> Key: SPARK-25347
> URL: https://issues.apache.org/jira/browse/SPARK-25347
> Project: Spark
>  Issue Type: Story
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently, we only have Scala/Java API docs for image data source. It would 
> be nice to have some documentation in the doc site. So Python/R users can 
> also discover this feature.






[jira] [Commented] (SPARK-25524) Spark datasource for image/libsvm user guide

2018-10-01 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634347#comment-16634347
 ] 

Xiangrui Meng commented on SPARK-25524:
---

Marked as duplicate and created SPARK-25584 for libsvm separately.

> Spark datasource for image/libsvm user guide
> 
>
> Key: SPARK-25524
> URL: https://issues.apache.org/jira/browse/SPARK-25524
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Add Spark datasource for image/libsvm user guide.






[jira] [Updated] (SPARK-25584) Document libsvm data source in doc site

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25584:
--
Description: Currently, we only have Scala/Java API docs for libsvm data 
source. It would be nice to have some documentation in the doc site. So 
Python/R users can also discover this feature.  (was: Currently, we only have 
Scala/Java API docs for image data source. It would be nice to have some 
documentation in the doc site. So Python/R users can also discover this 
feature.)

> Document libsvm data source in doc site
> ---
>
> Key: SPARK-25584
> URL: https://issues.apache.org/jira/browse/SPARK-25584
> Project: Spark
>  Issue Type: Story
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently, we only have Scala/Java API docs for libsvm data source. It would 
> be nice to have some documentation in the doc site. So Python/R users can 
> also discover this feature.






[jira] [Created] (SPARK-25584) Document libsvm data source in doc site

2018-10-01 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25584:
-

 Summary: Document libsvm data source in doc site
 Key: SPARK-25584
 URL: https://issues.apache.org/jira/browse/SPARK-25584
 Project: Spark
  Issue Type: Story
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Xiangrui Meng


Currently, we only have Scala/Java API docs for image data source. It would be 
nice to have some documentation in the doc site. So Python/R users can also 
discover this feature.






[jira] [Resolved] (SPARK-25524) Spark datasource for image/libsvm user guide

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25524.
---
Resolution: Duplicate

> Spark datasource for image/libsvm user guide
> 
>
> Key: SPARK-25524
> URL: https://issues.apache.org/jira/browse/SPARK-25524
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Add Spark datasource for image/libsvm user guide.






[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-10-01 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634335#comment-16634335
 ] 

Xiangrui Meng commented on SPARK-25378:
---

I don't think I'm the right person to decide here because I know little about 
how UTF8String is being used in Spark SQL. As a user, I do want to use 
spark-tensorflow-connector w/ the upcoming Spark 2.4 release. 

I already made the change in the TF connector to use ObjectType: 
https://github.com/tensorflow/ecosystem/pull/100. But they need to wait for the 
TF 1.12 release, which might come out in the second half of October. If we don't 
make the final 2.4 release by then, maybe we don't have to fix the 2.4 branch. 
The risk is that other data sources might have similar usage that will break, 
which we don't really know.
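As a hedged illustration (this is not the actual TF-connector patch, which lives in the PR linked above): under the 2.4 behavior described in this issue, a data source that builds `ArrayData` for a `StringType` column can avoid the `ClassCastException` by storing `UTF8String` elements instead of `java.lang.String`, along these lines:

```scala
// Sketch only: shows the 2.4-compatible pattern for string arrays.
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// 2.3-style call that fails in 2.4.0-SNAPSHOT with a ClassCastException:
// ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)

// Workaround: convert to UTF8String before constructing the ArrayData,
// and read elements back out as UTF8String.
val data = ArrayData.toArrayData(Array("a", "b").map(UTF8String.fromString))
val strings = data.toArray[UTF8String](StringType).map(_.toString)
// strings: Array("a", "b")
```

This mirrors the internal-row contract that `GenericArrayData.getUTF8String` assumes; whether external data sources should be required to follow it is exactly the open question in this thread.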

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]






[jira] [Updated] (SPARK-25322) ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25322:
--
Priority: Critical  (was: Blocker)

> ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-25322
> URL: https://issues.apache.org/jira/browse/SPARK-25322
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Commented] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella

2018-09-21 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624124#comment-16624124
 ] 

Xiangrui Meng commented on SPARK-25319:
---

[~WeichenXu123] Could you check the OPEN subtasks and see if there are still 
TODOs left?

> Spark MLlib, GraphX 2.4 QA umbrella
> ---
>
> Key: SPARK-25319
> URL: https://issues.apache.org/jira/browse/SPARK-25319
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Weichen Xu
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.4.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. SparkR is separate.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Assigned] (SPARK-25326) ML, Graph 2.4 QA: Programming guide update and migration guide

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25326:
-

Assignee: (was: Nick Pentreath)

> ML, Graph 2.4 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-25326
> URL: https://issues.apache.org/jira/browse/SPARK-25326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")






[jira] [Assigned] (SPARK-25325) ML, Graph 2.4 QA: Update user guide for new features & APIs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25325:
-

Assignee: (was: Nick Pentreath)

> ML, Graph 2.4 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-25325
> URL: https://issues.apache.org/jira/browse/SPARK-25325
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.






[jira] [Assigned] (SPARK-25327) Update MLlib, GraphX websites for 2.4

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25327:
-

Assignee: (was: Nick Pentreath)

> Update MLlib, GraphX websites for 2.4
> -
>
> Key: SPARK-25327
> URL: https://issues.apache.org/jira/browse/SPARK-25327
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.






[jira] [Resolved] (SPARK-25327) Update MLlib, GraphX websites for 2.4

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25327.
---
Resolution: Won't Do

> Update MLlib, GraphX websites for 2.4
> -
>
> Key: SPARK-25327
> URL: https://issues.apache.org/jira/browse/SPARK-25327
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Nick Pentreath
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.






[jira] [Assigned] (SPARK-25322) ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25322:
-

Assignee: (was: Nick Pentreath)

> ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-25322
> URL: https://issues.apache.org/jira/browse/SPARK-25322
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Commented] (SPARK-25323) ML 2.4 QA: API: Python API coverage

2018-09-21 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624123#comment-16624123
 ] 

Xiangrui Meng commented on SPARK-25323:
---

[~WeichenXu123] Is anyone working on this? I reduced the priority from Blocker 
to Critical because it doesn't block the release.

> ML 2.4 QA: API: Python API coverage
> ---
>
> Key: SPARK-25323
> URL: https://issues.apache.org/jira/browse/SPARK-25323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Updated] (SPARK-25323) ML 2.4 QA: API: Python API coverage

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25323:
--
Priority: Critical  (was: Blocker)

> ML 2.4 QA: API: Python API coverage
> ---
>
> Key: SPARK-25323
> URL: https://issues.apache.org/jira/browse/SPARK-25323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Bryan Cutler
>Priority: Critical
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Assigned] (SPARK-25323) ML 2.4 QA: API: Python API coverage

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25323:
-

Assignee: (was: Bryan Cutler)

> ML 2.4 QA: API: Python API coverage
> ---
>
> Key: SPARK-25323
> URL: https://issues.apache.org/jira/browse/SPARK-25323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Resolved] (SPARK-25324) ML 2.4 QA: API: Java compatibility, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25324.
---
Resolution: Done

> ML 2.4 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-25324
> URL: https://issues.apache.org/jira/browse/SPARK-25324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so we can make this 
> task easier in the future!






[jira] [Commented] (SPARK-25324) ML 2.4 QA: API: Java compatibility, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624122#comment-16624122
 ] 

Xiangrui Meng commented on SPARK-25324:
---

See discussion in https://issues.apache.org/jira/browse/SPARK-25321.

> ML 2.4 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-25324
> URL: https://issues.apache.org/jira/browse/SPARK-25324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so we can make this 
> task easier in the future!






[jira] [Assigned] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25321:
-

Assignee: Weichen Xu  (was: Yanbo Liang)

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Assigned] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25320:
-

Assignee: Weichen Xu  (was: Bago Amirbekian)

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-21 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624116#comment-16624116
 ] 

Xiangrui Meng commented on SPARK-25320:
---

We will keep the "predict()" change because it only breaks source compatibility.

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Bago Amirbekian
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Resolved] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25320.
---
Resolution: Fixed

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-21 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624115#comment-16624115
 ] 

Xiangrui Meng commented on SPARK-25320:
---

We reverted the tree Node incompatible change and one LDA private[ml] 
constructor change because MLeap already uses them and it is not worth the 
cost of making the change.

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Bago Amirbekian
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Resolved] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25321.
---
Resolution: Fixed

Issue resolved by pull request 22510
[https://github.com/apache/spark/pull/22510]

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Reopened] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-25321:
---

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Resolved] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25321.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22492
[https://github.com/apache/spark/pull/22492]

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Comment Edited] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-17 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618495#comment-16618495
 ] 

Xiangrui Meng edited comment on SPARK-25321 at 9/18/18 5:21 AM:


[~WeichenXu123] Could you check whether mleap is compatible with the tree Node 
breaking changes? This line is relevant: 
https://github.com/combust/mleap/blob/master/mleap-runtime/src/main/scala/ml/combust/mleap/bundle/ops/classification/DecisionTreeClassifierOp.scala

If it is hard for MLeap to upgrade, we should revert the change in 2.4.

cc: [~hollinwilkins]


was (Author: mengxr):
[~WeichenXu123] Could you check whether mleap is compatible with the tree Node 
breaking changes? This line is relevant: 
https://github.com/combust/mleap/blob/master/mleap-runtime/src/main/scala/ml/combust/mleap/bundle/ops/classification/DecisionTreeClassifierOp.scala

If it is hard to make MLeap upgrade, we should revert the change.

cc: [~hollinwilkins]

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Commented] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-17 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618495#comment-16618495
 ] 

Xiangrui Meng commented on SPARK-25321:
---

[~WeichenXu123] Could you check whether mleap is compatible with the tree Node 
breaking changes? This line is relevant: 
https://github.com/combust/mleap/blob/master/mleap-runtime/src/main/scala/ml/combust/mleap/bundle/ops/classification/DecisionTreeClassifierOp.scala

If it is hard for MLeap to upgrade, we should revert the change.

cc: [~hollinwilkins]

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-12 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612399#comment-16612399
 ] 

Xiangrui Meng commented on SPARK-25378:
---

Comments from [~vomjom] at https://github.com/tensorflow/ecosystem/pull/100:

{quote}
We currently only do releases along with TensorFlow releases, and the next one 
that'll include this is TF 1.12.
{quote}

This means Spark+TF users cannot migrate to Spark 2.4 until TF 1.12 is 
released. I think we need to decide based on the impact instead of just saying 
"this is not a public API". If it is not public, why didn't we hide it in the 
first place? And as [~cloud_fan] mentioned, it is hard to implement a data 
source without touching those "private" APIs.

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]






[jira] [Updated] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25378:
--
Summary: ArrayData.toArray(StringType) assume UTF8String in 2.4  (was: 
ArrayData.toArray assume UTF8String)

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]






[jira] [Commented] (SPARK-25378) ArrayData.toArray assume UTF8String

2018-09-10 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609467#comment-16609467
 ] 

Xiangrui Meng commented on SPARK-25378:
---

I sent a PR to spark-tensorflow-connector at 
https://github.com/tensorflow/ecosystem/pull/100 to use the suggested method 
from [~hvanhovell]. 

I won't mark this ticket as resolved. The potential issue is that there are 
other data sources relying on this behavior. If that is the case, users won't 
be able to migrate to 2.4 before the data source owners publish new versions. 
If there isn't a simple way to check, maybe we should send a notice to the 
dev@ list.



> ArrayData.toArray assume UTF8String
> ---
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]






[jira] [Commented] (SPARK-25378) ArrayData.toArray assume UTF8String

2018-09-08 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608171#comment-16608171
 ] 

Xiangrui Meng commented on SPARK-25378:
---

Btw, thanks for suggesting the right way to get strings out. I created a PR to 
update spark-tensorflow-connector: 
https://github.com/tensorflow/ecosystem/pull/100.

However, I still view it as a breaking change. It makes packages that used to 
work with 2.3 no longer work with 2.4.
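As an illustration for other data source authors hitting the same cast error, 
here is a minimal migration sketch (an assumption, not necessarily the exact 
change made in the connector PR): wrap JVM Strings as UTF8String before 
building the ArrayData, so that reading them back through StringType matches 
the internal representation expected in 2.4.

{code}
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Build the ArrayData from the internal string type, not java.lang.String.
val data: ArrayData =
  ArrayData.toArrayData(Array("a", "b").map(UTF8String.fromString))

// Read the values back via UTF8String, then convert to JVM Strings.
val strings: Array[String] =
  data.toArray[UTF8String](StringType).map(_.toString)
{code}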

> ArrayData.toArray assume UTF8String
> ---
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]






[jira] [Commented] (SPARK-25378) ArrayData.toArray assume UTF8String

2018-09-08 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608165#comment-16608165
 ] 

Xiangrui Meng commented on SPARK-25378:
---

This is a breaking change anyway. The error comes from 
spark-tensorflow-connector, which worked with 2.3 but no longer works with 2.4.

> ArrayData.toArray assume UTF8String
> ---
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]






[jira] [Updated] (SPARK-25382) Remove ImageSchema.readImages in 3.0

2018-09-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25382:
--
Description: A follow-up task from SPARK-25345. We might need to support 
sampling (SPARK-25383) in order to remove readImages.  (was: A follow-up task 
from SPARK-25345.)

> Remove ImageSchema.readImages in 3.0
> 
>
> Key: SPARK-25382
> URL: https://issues.apache.org/jira/browse/SPARK-25382
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> A follow-up task from SPARK-25345. We might need to support sampling 
> (SPARK-25383) in order to remove readImages.






[jira] [Created] (SPARK-25383) Image data source supports sample pushdown

2018-09-08 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25383:
-

 Summary: Image data source supports sample pushdown
 Key: SPARK-25383
 URL: https://issues.apache.org/jira/browse/SPARK-25383
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Affects Versions: 3.0.0
Reporter: Xiangrui Meng


After SPARK-25349, we should update image data source to support sampling.






[jira] [Updated] (SPARK-25345) Deprecate readImages APIs from ImageSchema

2018-09-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25345:
--
Description: After SPARK-22666, we can deprecate the public APIs in 
ImageSchema (Scala/Python) and remove them in Spark 3.0 (SPARK-25382). So users 
get a unified approach to load images w/ Spark.  (was: After SPARK-22666, we 
can deprecate the public APIs in ImageSchema (Scala/Python) and remove them in 
Spark 3.0 (TODO: create JIRA). So users get a unified approach to load images 
w/ Spark.)

> Deprecate readImages APIs from ImageSchema
> --
>
> Key: SPARK-25345
> URL: https://issues.apache.org/jira/browse/SPARK-25345
> Project: Spark
>  Issue Type: Story
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> After SPARK-22666, we can deprecate the public APIs in ImageSchema 
> (Scala/Python) and remove them in Spark 3.0 (SPARK-25382). So users get a 
> unified approach to load images w/ Spark.






[jira] [Created] (SPARK-25382) Remove ImageSchema.readImages in 3.0

2018-09-08 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25382:
-

 Summary: Remove ImageSchema.readImages in 3.0
 Key: SPARK-25382
 URL: https://issues.apache.org/jira/browse/SPARK-25382
 Project: Spark
  Issue Type: Task
  Components: ML
Affects Versions: 3.0.0
Reporter: Xiangrui Meng


A follow-up task from SPARK-25345.






[jira] [Updated] (SPARK-25345) Deprecate readImages APIs from ImageSchema

2018-09-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25345:
--
Summary: Deprecate readImages APIs from ImageSchema  (was: Deprecate public 
APIs from ImageSchema)

> Deprecate readImages APIs from ImageSchema
> --
>
> Key: SPARK-25345
> URL: https://issues.apache.org/jira/browse/SPARK-25345
> Project: Spark
>  Issue Type: Story
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> After SPARK-22666, we can deprecate the public APIs in ImageSchema 
> (Scala/Python) and remove them in Spark 3.0 (TODO: create JIRA). So users get 
> a unified approach to load images w/ Spark.






[jira] [Assigned] (SPARK-25345) Deprecate public APIs from ImageSchema

2018-09-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25345:
-

Assignee: Weichen Xu

> Deprecate public APIs from ImageSchema
> --
>
> Key: SPARK-25345
> URL: https://issues.apache.org/jira/browse/SPARK-25345
> Project: Spark
>  Issue Type: Story
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> After SPARK-22666, we can deprecate the public APIs in ImageSchema 
> (Scala/Python) and remove them in Spark 3.0 (TODO: create JIRA). So users get 
> a unified approach to load images w/ Spark.






[jira] [Resolved] (SPARK-25345) Deprecate public APIs from ImageSchema

2018-09-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25345.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22349
[https://github.com/apache/spark/pull/22349]

> Deprecate public APIs from ImageSchema
> --
>
> Key: SPARK-25345
> URL: https://issues.apache.org/jira/browse/SPARK-25345
> Project: Spark
>  Issue Type: Story
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
> Fix For: 2.4.0
>
>
> After SPARK-22666, we can deprecate the public APIs in ImageSchema 
> (Scala/Python) and remove them in Spark 3.0 (TODO: create JIRA). So users get 
> a unified approach to load images w/ Spark.






[jira] [Commented] (SPARK-25378) ArrayData.toArray assume UTF8String

2018-09-08 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608092#comment-16608092
 ] 

Xiangrui Meng commented on SPARK-25378:
---

Seems caused by SPARK-23875. cc: [~viirya] [~hvanhovell]

> ArrayData.toArray assume UTF8String
> ---
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]






[jira] [Created] (SPARK-25378) ArrayData.toArray assume UTF8String

2018-09-08 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25378:
-

 Summary: ArrayData.toArray assume UTF8String
 Key: SPARK-25378
 URL: https://issues.apache.org/jira/browse/SPARK-25378
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Xiangrui Meng


The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:

{code}
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.types.StringType

ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)

res0: Array[String] = Array(a, b)
{code}

In 2.4.0-SNAPSHOT, the error is

{code}
java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
  at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
  at 
org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
  at 
org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
  at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
  ... 51 elided
{code}

cc: [~cloud_fan] [~yogeshg]






[jira] [Updated] (SPARK-25376) Scenarios we should handle but missed in 2.4 for barrier execution mode

2018-09-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25376:
--
Summary: Scenarios we should handle but missed in 2.4 for barrier execution 
mode  (was: Scenarios we should handle but missed for barrier execution mode)

> Scenarios we should handle but missed in 2.4 for barrier execution mode
> ---
>
> Key: SPARK-25376
> URL: https://issues.apache.org/jira/browse/SPARK-25376
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> [~irashid] You mentioned that there are a couple of scenarios we should 
> handle in barrier execution mode but didn't in 2.4. Could you elaborate here?
> One scenario we are aware of is that speculation is not supported by barrier 
> mode. Hence a barrier stage might hang in case of hardware issues on one 
> node. I don't have a good proposal here except letting users set a timeout 
> for the barrier stage. I would like to hear your thoughts.
> You also mentioned multi-tenancy issues. Could you say more?
> cc: [~jiangxb1987]






[jira] [Updated] (SPARK-25376) Scenarios we should handle but missed for barrier execution mode

2018-09-08 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25376:
--
Description: 
[~irashid] You mentioned that there are a couple of scenarios we should handle 
in barrier execution mode but didn't in 2.4. Could you elaborate here?

One scenario we are aware of is that speculation is not supported by barrier 
mode. Hence a barrier stage might hang in case of hardware issues on one node. 
I don't have a good proposal here except letting users set a timeout for the 
barrier stage. I would like to hear your thoughts.

You also mentioned multi-tenancy issues. Could you say more?

cc: [~jiangxb1987]

  was:
[~irashid] You mentioned that there are couple scenarios we should handle in 
barrier execution mode but we didn't in 2.4. Could you elaborate here?

cc: [~jiangxb1987]


> Scenarios we should handle but missed for barrier execution mode
> 
>
> Key: SPARK-25376
> URL: https://issues.apache.org/jira/browse/SPARK-25376
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> [~irashid] You mentioned that there are a couple of scenarios we should 
> handle in barrier execution mode but didn't in 2.4. Could you elaborate here?
> One scenario we are aware of is that speculation is not supported by barrier 
> mode. Hence a barrier stage might hang in case of hardware issues on one 
> node. I don't have a good proposal here except letting users set a timeout 
> for the barrier stage. I would like to hear your thoughts.
> You also mentioned multi-tenancy issues. Could you say more?
> cc: [~jiangxb1987]






[jira] [Updated] (SPARK-25376) Scenarios we should handle but missed for barrier execution mode

2018-09-07 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25376:
--
Description: 
[~irashid] You mentioned that there are a couple of scenarios we should handle 
in barrier execution mode but didn't in 2.4. Could you elaborate here?

cc: [~jiangxb1987]

> Scenarios we should handle but missed for barrier execution mode
> 
>
> Key: SPARK-25376
> URL: https://issues.apache.org/jira/browse/SPARK-25376
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> [~irashid] You mentioned that there are a couple of scenarios we should handle 
> in barrier execution mode but didn't in 2.4. Could you elaborate here?
> cc: [~jiangxb1987]






[jira] [Created] (SPARK-25376) Scenarios we should handle but missed for barrier execution mode

2018-09-07 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25376:
-

 Summary: Scenarios we should handle but missed for barrier 
execution mode
 Key: SPARK-25376
 URL: https://issues.apache.org/jira/browse/SPARK-25376
 Project: Spark
  Issue Type: Story
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Xiangrui Meng









[jira] [Updated] (SPARK-25347) Document image data source in doc site

2018-09-05 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25347:
--
Summary: Document image data source in doc site  (was: Document image data 
sources in doc site)

> Document image data source in doc site
> --
>
> Key: SPARK-25347
> URL: https://issues.apache.org/jira/browse/SPARK-25347
> Project: Spark
>  Issue Type: Story
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently, we only have Scala/Java API docs for the image data source. It 
> would be nice to have some documentation on the doc site so that Python/R 
> users can also discover this feature.






[jira] [Updated] (SPARK-25345) Deprecate public APIs from ImageSchema

2018-09-05 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25345:
--
Description: After SPARK-22328, we can deprecate the public APIs in 
ImageSchema (Scala/Python) and remove them in Spark 3.0 (TODO: create JIRA), so 
that users get a unified approach to loading images with Spark.  (was: After 
SPARK-22328, we can deprecate the public APIs in ImageSchema and remove them in 
Spark 3.0 (TODO: create JIRA). So users get a unified approach to load images 
w/ Spark.)

> Deprecate public APIs from ImageSchema
> --
>
> Key: SPARK-25345
> URL: https://issues.apache.org/jira/browse/SPARK-25345
> Project: Spark
>  Issue Type: Story
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> After SPARK-22328, we can deprecate the public APIs in ImageSchema 
> (Scala/Python) and remove them in Spark 3.0 (TODO: create JIRA), so that 
> users get a unified approach to loading images with Spark.






[jira] [Created] (SPARK-25349) Support sample pushdown in Data Source V2

2018-09-05 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25349:
-

 Summary: Support sample pushdown in Data Source V2
 Key: SPARK-25349
 URL: https://issues.apache.org/jira/browse/SPARK-25349
 Project: Spark
  Issue Type: Story
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiangrui Meng


Supporting sample pushdown would help file-based data source implementations 
save significant I/O cost, since a source could decide whether to read a file at all.

 

cc: [~cloud_fan]
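As a toy illustration (not the Data Source V2 API; all names below are made 
up): if the sample fraction is pushed down to the source, the source can pick 
whole files to read up front and skip the I/O for everything else:

```python
import random

def files_to_read(files, fraction, seed=42):
    """With sample(fraction) pushed down, choose whole files to read instead of
    reading every file and discarding rows afterwards (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the chosen plan is reproducible
    return [f for f in files if rng.random() < fraction]

all_files = [f"part-{i:05d}.parquet" for i in range(10)]
print(files_to_read(all_files, fraction=0.3))  # a subset; the rest are never opened
```

A real implementation would also have to account for the statistical difference 
between row-level and file-level sampling.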






[jira] [Created] (SPARK-25348) Data source for binary files

2018-09-05 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25348:
-

 Summary: Data source for binary files
 Key: SPARK-25348
 URL: https://issues.apache.org/jira/browse/SPARK-25348
 Project: Spark
  Issue Type: Story
  Components: ML, SQL
Affects Versions: 3.0.0
Reporter: Xiangrui Meng


It would be useful to have a data source implementation for binary files, which 
can be used to build features to load images, audio, and videos.

Microsoft has an implementation at 
[https://github.com/Azure/mmlspark/tree/master/src/io/binary]. It would be 
great if we could merge it into the Spark main repo.

cc: [~mhamilton] and [~imatiach]






[jira] [Created] (SPARK-25347) Document image data sources in doc site

2018-09-05 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25347:
-

 Summary: Document image data sources in doc site
 Key: SPARK-25347
 URL: https://issues.apache.org/jira/browse/SPARK-25347
 Project: Spark
  Issue Type: Story
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Xiangrui Meng


Currently, we only have Scala/Java API docs for the image data source. It would 
be nice to have some documentation on the doc site so that Python/R users can 
also discover this feature.






[jira] [Updated] (SPARK-25346) Document Spark builtin data sources

2018-09-05 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25346:
--
Summary: Document Spark builtin data sources  (was: Document Spark built-in 
data sources)

> Document Spark builtin data sources
> ---
>
> Key: SPARK-25346
> URL: https://issues.apache.org/jira/browse/SPARK-25346
> Project: Spark
>  Issue Type: Story
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> It would be nice to list the built-in data sources on the doc site, so users 
> know what is available by default. However, I didn't find any in the 2.3.1 docs.
>  
> cc: [~hyukjin.kwon]






[jira] [Updated] (SPARK-25346) Document Spark built-in data sources

2018-09-05 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25346:
--
Summary: Document Spark built-in data sources  (was: Document Spark buit-in 
data sources)

> Document Spark built-in data sources
> 
>
> Key: SPARK-25346
> URL: https://issues.apache.org/jira/browse/SPARK-25346
> Project: Spark
>  Issue Type: Story
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> It would be nice to list the built-in data sources on the doc site, so users 
> know what is available by default. However, I didn't find any in the 2.3.1 docs.
>  
> cc: [~hyukjin.kwon]






[jira] [Created] (SPARK-25346) Document Spark buit-in data sources

2018-09-05 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25346:
-

 Summary: Document Spark buit-in data sources
 Key: SPARK-25346
 URL: https://issues.apache.org/jira/browse/SPARK-25346
 Project: Spark
  Issue Type: Story
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Xiangrui Meng


It would be nice to list the built-in data sources on the doc site, so users 
know what is available by default. However, I didn't find any in the 2.3.1 docs.

 

cc: [~hyukjin.kwon]






[jira] [Created] (SPARK-25345) Deprecate public APIs from ImageSchema

2018-09-05 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25345:
-

 Summary: Deprecate public APIs from ImageSchema
 Key: SPARK-25345
 URL: https://issues.apache.org/jira/browse/SPARK-25345
 Project: Spark
  Issue Type: Story
  Components: ML
Affects Versions: 2.4.0
Reporter: Xiangrui Meng


After SPARK-22328, we can deprecate the public APIs in ImageSchema and remove 
them in Spark 3.0 (TODO: create JIRA), so that users get a unified approach to 
loading images with Spark.






[jira] [Resolved] (SPARK-22666) Spark datasource for image format

2018-09-05 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-22666.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22328
[https://github.com/apache/spark/pull/22328]

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused 
> issues for some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.






[jira] [Resolved] (SPARK-25248) Audit barrier APIs for Spark 2.4

2018-09-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25248.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22240
[https://github.com/apache/spark/pull/22240]

> Audit barrier APIs for Spark 2.4
> 
>
> Key: SPARK-25248
> URL: https://issues.apache.org/jira/browse/SPARK-25248
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
> Fix For: 2.4.0
>
>
> Make a pass over APIs added for barrier execution mode.






[jira] [Assigned] (SPARK-22666) Spark datasource for image format

2018-09-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-22666:
-

Assignee: Weichen Xu

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Assignee: Weichen Xu
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused 
> issues for some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.






[jira] [Resolved] (SPARK-25266) Fix memory leak in Barrier Execution Mode

2018-08-29 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25266.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22258
[https://github.com/apache/spark/pull/22258]

> Fix memory leak in Barrier Execution Mode
> -
>
> Key: SPARK-25266
> URL: https://issues.apache.org/jira/browse/SPARK-25266
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Critical
> Fix For: 2.4.0
>
>
> BarrierCoordinator uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
> in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.
> Once a TimerTask is scheduled, the reference to it is not released until 
> `Timer#purge()` is invoked, even if `TimerTask#cancel()` has been invoked.






[jira] [Created] (SPARK-25248) Audit barrier APIs for Spark 2.4

2018-08-26 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25248:
-

 Summary: Audit barrier APIs for Spark 2.4
 Key: SPARK-25248
 URL: https://issues.apache.org/jira/browse/SPARK-25248
 Project: Spark
  Issue Type: Story
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Make a pass over APIs added for barrier execution mode.






[jira] [Created] (SPARK-25247) Make RDDBarrier configurable

2018-08-26 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25247:
-

 Summary: Make RDDBarrier configurable
 Key: SPARK-25247
 URL: https://issues.apache.org/jira/browse/SPARK-25247
 Project: Spark
  Issue Type: Story
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Xiangrui Meng


Currently, we only offer one method under `RDDBarrier`. Users might want 
better control over a barrier stage, e.g., timeout behavior, failure recovery, 
etc. This JIRA is to discuss which options we should provide under `RDDBarrier`.

 

Note: users can use multiple RDDBarriers in a single barrier stage, so we also 
need to discuss how to merge the options.






[jira] [Resolved] (SPARK-25234) SparkR:::parallelize doesn't handle integer overflow properly

2018-08-24 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25234.
---
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 5
[https://github.com/apache/spark/pull/5]

> SparkR:::parallelize doesn't handle integer overflow properly
> -
>
> Key: SPARK-25234
> URL: https://issues.apache.org/jira/browse/SPARK-25234
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
> Fix For: 2.4.0, 2.3.2
>
>
> parallelize uses integer multiplication, which cannot handle sizes over 
> ~47000. This causes issues with lapply:
>  
> {code:java}
> SparkR:::parallelize(sc, 1:47000, 47000)
> Error in rep(start, end - start) : invalid 'times' argument
> Error in rep(start, end - start) : invalid 'times' argument
> In addition: Warning message:
> In x * length(coll) : NAs produced by integer overflow{code}






[jira] [Assigned] (SPARK-25234) SparkR:::parallelize doesn't handle integer overflow properly

2018-08-24 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25234:
-

Assignee: Xiangrui Meng

> SparkR:::parallelize doesn't handle integer overflow properly
> -
>
> Key: SPARK-25234
> URL: https://issues.apache.org/jira/browse/SPARK-25234
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> parallelize uses integer multiplication, which cannot handle sizes over 
> ~47000. This causes issues with lapply:
>  
> {code:java}
> SparkR:::parallelize(sc, 1:47000, 47000)
> Error in rep(start, end - start) : invalid 'times' argument
> Error in rep(start, end - start) : invalid 'times' argument
> In addition: Warning message:
> In x * length(coll) : NAs produced by integer overflow{code}






[jira] [Updated] (SPARK-25234) SparkR:::parallelize doesn't handle integer overflow properly

2018-08-24 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25234:
--
Description: 
parallelize uses integer multiplication, which cannot handle sizes over 
~47000. This causes issues with lapply:

 
{code:java}
SparkR:::parallelize(sc, 1:47000, 47000)

Error in rep(start, end - start) : invalid 'times' argument
Error in rep(start, end - start) : invalid 'times' argument
In addition: Warning message:
In x * length(coll) : NAs produced by integer overflow{code}

  was:parallelize uses integer multiplication, which cannot handle size over 
~47000.
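The overflow is reproducible outside R: R integers are 32-bit, and a slice 
boundary computed as x * length(coll) exceeds 2^31 - 1 once both factors 
approach 47000 (47000^2 ≈ 2.2e9). A sketch simulating the bug and the fix in 
Python (the slice_start_* helpers are made-up names, not the SparkR internals):

```python
def to_int32(n):
    """Wrap n to a signed 32-bit integer, mimicking R's integer type."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

def slice_start_int32(i, num_slices, length):
    # buggy: i * length wraps around once it exceeds 2^31 - 1
    return to_int32(to_int32(i * length) // num_slices)

def slice_start_wide(i, num_slices, length):
    # fix: compute the boundary in arithmetic that cannot overflow
    return (i * length) // num_slices

print(slice_start_int32(46000, 47000, 47000))  # -45383: overflow garbage
print(slice_start_wide(46000, 47000, 47000))   # 46000, as expected
```

The negative slice boundary is exactly what produces the invalid 'times' 
argument passed to rep() in the error above.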


> SparkR:::parallelize doesn't handle integer overflow properly
> -
>
> Key: SPARK-25234
> URL: https://issues.apache.org/jira/browse/SPARK-25234
> Project: Spark
>  Issue Type: Story
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> parallelize uses integer multiplication, which cannot handle sizes over 
> ~47000. This causes issues with lapply:
>  
> {code:java}
> SparkR:::parallelize(sc, 1:47000, 47000)
> Error in rep(start, end - start) : invalid 'times' argument
> Error in rep(start, end - start) : invalid 'times' argument
> In addition: Warning message:
> In x * length(coll) : NAs produced by integer overflow{code}






[jira] [Created] (SPARK-25234) SparkR:::parallelize doesn't handle integer overflow properly

2018-08-24 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25234:
-

 Summary: SparkR:::parallelize doesn't handle integer overflow 
properly
 Key: SPARK-25234
 URL: https://issues.apache.org/jira/browse/SPARK-25234
 Project: Spark
  Issue Type: Story
  Components: SparkR
Affects Versions: 2.3.1, 2.4.0
Reporter: Xiangrui Meng


parallelize uses integer multiplication, which cannot handle sizes over ~47000.






[jira] [Resolved] (SPARK-25095) Python support for BarrierTaskContext

2018-08-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25095.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22085
[https://github.com/apache/spark/pull/22085]

> Python support for BarrierTaskContext
> -
>
> Key: SPARK-25095
> URL: https://issues.apache.org/jira/browse/SPARK-25095
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Enable calling `BarrierTaskContext.barrier()` from the Python side.






[jira] [Assigned] (SPARK-25095) Python support for BarrierTaskContext

2018-08-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25095:
-

Assignee: Jiang Xingbo

> Python support for BarrierTaskContext
> -
>
> Key: SPARK-25095
> URL: https://issues.apache.org/jira/browse/SPARK-25095
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Enable calling `BarrierTaskContext.barrier()` from the Python side.






[jira] [Assigned] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode

2018-08-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25161:
-

Assignee: Jiang Xingbo

> Fix several bugs in failure handling of barrier execution mode
> --
>
> Key: SPARK-25161
> URL: https://issues.apache.org/jira/browse/SPARK-25161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Fix several bugs in failure handling of barrier execution mode:
> * Mark TaskSet for a barrier stage as zombie when a task attempt fails;
> * Multiple barrier task failures from a single barrier stage should not 
> trigger multiple stage retries;
> * Barrier task failure from a previous failed stage attempt should not 
> trigger stage retry;
> * Fail the job when a task from a barrier ResultStage failed.






[jira] [Resolved] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode

2018-08-21 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25161.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22158
[https://github.com/apache/spark/pull/22158]

> Fix several bugs in failure handling of barrier execution mode
> --
>
> Key: SPARK-25161
> URL: https://issues.apache.org/jira/browse/SPARK-25161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Fix several bugs in failure handling of barrier execution mode:
> * Mark TaskSet for a barrier stage as zombie when a task attempt fails;
> * Multiple barrier task failures from a single barrier stage should not 
> trigger multiple stage retries;
> * Barrier task failure from a previous failed stage attempt should not 
> trigger stage retry;
> * Fail the job when a task from a barrier ResultStage failed.






[jira] [Resolved] (SPARK-24819) Fail fast when no enough slots to launch the barrier stage on job submitted

2018-08-15 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24819.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22001
[https://github.com/apache/spark/pull/22001]

> Fail fast when no enough slots to launch the barrier stage on job submitted
> ---
>
> Key: SPARK-24819
> URL: https://issues.apache.org/jira/browse/SPARK-24819
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Check all the barrier stages on job submission, to see whether any barrier 
> stage requires more slots (to be able to launch all the barrier tasks in the 
> same stage together) than the currently active slots in the cluster. If the 
> job requires more slots than are available (both busy and free), fail the job 
> on submit.
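The check itself can be sketched in a few lines (a hypothetical helper, not 
the Spark scheduler code): compare each barrier stage's task count against the 
cluster's total slot count at submission, and fail immediately instead of 
letting the stage wait forever for slots that cannot exist.

```python
def check_barrier_slots(barrier_stage_tasks, total_slots):
    """Fail fast at job submission if any barrier stage needs more concurrent
    tasks than the cluster has slots (busy + free), since such a stage could
    never launch all of its tasks together."""
    for stage_id, num_tasks in barrier_stage_tasks.items():
        if num_tasks > total_slots:
            raise ValueError(
                f"barrier stage {stage_id} requires {num_tasks} slots, "
                f"but the cluster only has {total_slots}")

check_barrier_slots({0: 8, 1: 4}, total_slots=16)   # fine: each stage fits
# check_barrier_slots({0: 32}, total_slots=16)      # would raise ValueError
```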






[jira] [Assigned] (SPARK-24819) Fail fast when no enough slots to launch the barrier stage on job submitted

2018-08-15 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24819:
-

Assignee: Jiang Xingbo

> Fail fast when no enough slots to launch the barrier stage on job submitted
> ---
>
> Key: SPARK-24819
> URL: https://issues.apache.org/jira/browse/SPARK-24819
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Check all the barrier stages on job submission, to see whether any barrier 
> stage requires more slots (to be able to launch all the barrier tasks in the 
> same stage together) than the currently active slots in the cluster. If the 
> job requires more slots than are available (both busy and free), fail the job 
> on submit.






[jira] [Resolved] (SPARK-25045) Make `RDDBarrier.mapParititions` similar to `RDD.mapPartitions`

2018-08-07 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25045.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22026
[https://github.com/apache/spark/pull/22026]

> Make `RDDBarrier.mapParititions` similar to `RDD.mapPartitions`
> ---
>
> Key: SPARK-25045
> URL: https://issues.apache.org/jira/browse/SPARK-25045
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> The signature of the function passed to `RDDBarrier.mapPartitions()` is 
> different from that of `RDD.mapPartitions`: the latter doesn't take a 
> TaskContext. We should make the function signatures the same to avoid 
> confusion and misuse.
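In Python terms, the unified shape looks roughly like this (a toy stand-in: 
the real PySpark API exposes BarrierTaskContext.get(), but everything else 
below is invented for illustration). The user function takes only the 
iterator, matching RDD.mapPartitions, and fetches the barrier context itself:

```python
class BarrierTaskContext:
    """Toy stand-in: real PySpark exposes BarrierTaskContext.get() inside tasks."""
    _current = None

    @classmethod
    def get(cls):
        return cls._current

def barrier_map_partitions(partition, func):
    # unified signature: func(iterator), same as RDD.mapPartitions
    BarrierTaskContext._current = BarrierTaskContext()
    try:
        return list(func(iter(partition)))
    finally:
        BarrierTaskContext._current = None

def my_task(it):
    ctx = BarrierTaskContext.get()  # fetched inside the task, not passed in
    assert ctx is not None
    return (x * 2 for x in it)

print(barrier_map_partitions([1, 2, 3], my_task))  # [2, 4, 6]
```

Keeping one signature means a function written for mapPartitions works 
unchanged in a barrier stage.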






[jira] [Assigned] (SPARK-25045) Make `RDDBarrier.mapParititions` similar to `RDD.mapPartitions`

2018-08-07 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25045:
-

Assignee: Jiang Xingbo

> Make `RDDBarrier.mapParititions` similar to `RDD.mapPartitions`
> ---
>
> Key: SPARK-25045
> URL: https://issues.apache.org/jira/browse/SPARK-25045
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> The signature of the function passed to `RDDBarrier.mapPartitions()` is 
> different from that of `RDD.mapPartitions`: the latter doesn't take a 
> TaskContext. We should make the function signatures the same to avoid 
> confusion and misuse.






[jira] [Updated] (SPARK-25030) SparkSubmit.doSubmit will not return result if the mainClass submitted creates a Timer()

2018-08-06 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25030:
--
Summary: SparkSubmit.doSubmit will not return result if the mainClass 
submitted creates a Timer()  (was: SparkSubmit will not return result if the 
mainClass submitted creates a Timer())

> SparkSubmit.doSubmit will not return result if the mainClass submitted 
> creates a Timer()
> 
>
> Key: SPARK-25030
> URL: https://issues.apache.org/jira/browse/SPARK-25030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Jiang Xingbo
>Priority: Major
>
> Creating a Timer() in the mainClass submitted to SparkSubmit makes it unable
> to fetch the result; the issue is very easy to reproduce.
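The underlying mechanism can be illustrated with Python's analogous `threading.Timer` (an analogy only, not the Spark code): `java.util.Timer` starts a non-daemon worker thread by default, and a live non-daemon thread keeps the process alive after the main method returns, so the submitting side never observes the client finish.

```python
import threading

# Like java.util.Timer, Python's threading.Timer runs its callback on a
# worker thread that is non-daemon by default. A live non-daemon thread
# prevents the process from exiting after the main function returns --
# the same mechanism that keeps the client JVM alive under SparkSubmit.
t = threading.Timer(3600.0, lambda: None)
non_daemon_by_default = not t.daemon  # True: would block process exit if started
t.cancel()  # never started; cancel releases the timer object
```

In Java, passing `true` to the `Timer(boolean isDaemon)` constructor makes the worker a daemon thread and avoids the hang.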






[jira] [Commented] (SPARK-25030) SparkSubmit will not return result if the mainClass submitted creates a Timer()

2018-08-06 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570951#comment-16570951
 ] 

Xiangrui Meng commented on SPARK-25030:
---

[~jiangxb1987] Could you create a PR to demonstrate the test failures?

[~vanzin] [~jerryshao] Do you know who is the best person to investigate this 
issue?

> SparkSubmit will not return result if the mainClass submitted creates a 
> Timer()
> ---
>
> Key: SPARK-25030
> URL: https://issues.apache.org/jira/browse/SPARK-25030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Jiang Xingbo
>Priority: Major
>
> Creating a Timer() in the mainClass submitted to SparkSubmit makes it unable
> to fetch the result; the issue is very easy to reproduce.






[jira] [Assigned] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled

2018-08-03 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24954:
-

Assignee: Jiang Xingbo

> Fail fast on job submit if run a barrier stage with dynamic resource 
> allocation enabled
> ---
>
> Key: SPARK-24954
> URL: https://issues.apache.org/jira/browse/SPARK-24954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Since we explicitly listed "Support running barrier stage with dynamic 
> resource allocation" as a Non-Goal in the design doc, we shall fail fast on 
> job submit when running a barrier stage with dynamic resource allocation 
> enabled, to avoid some confusing behaviors (see SPARK-24942 for some 
> examples).
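A minimal sketch of such a submit-time guard (hypothetical function name; Spark's actual check lives in the scheduler, and the configuration key shown is the real dynamic-allocation flag):

```python
def check_barrier_with_dynamic_allocation(conf, has_barrier_stage):
    # Hypothetical sketch of the fail-fast guard: dynamic allocation can
    # remove executors mid-stage, which conflicts with barrier semantics
    # (all tasks must run simultaneously), so reject the job up front.
    if has_barrier_stage and conf.get("spark.dynamicAllocation.enabled") == "true":
        raise RuntimeError(
            "Barrier stages are not supported with dynamic resource allocation")
```

Failing at submit time replaces a confusing runtime hang with an immediate, explainable error.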






[jira] [Resolved] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled

2018-08-03 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24954.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21915
[https://github.com/apache/spark/pull/21915]

> Fail fast on job submit if run a barrier stage with dynamic resource 
> allocation enabled
> ---
>
> Key: SPARK-24954
> URL: https://issues.apache.org/jira/browse/SPARK-24954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Since we explicitly listed "Support running barrier stage with dynamic 
> resource allocation" as a Non-Goal in the design doc, we shall fail fast on 
> job submit when running a barrier stage with dynamic resource allocation 
> enabled, to avoid some confusing behaviors (see SPARK-24942 for some 
> examples).






[jira] [Resolved] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24821.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21927
[https://github.com/apache/spark/pull/21927]

> Fail fast when submitted job compute on a subset of all the partitions for a 
> barrier stage
> --
>
> Key: SPARK-24821
> URL: https://issues.apache.org/jira/browse/SPARK-24821
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Detect when SparkContext.runJob() launches a barrier stage with a subset of 
> all the partitions; one example is the `first()` operation.
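The check can be sketched as follows (hypothetical names; in Spark the validation happens at job submission in the scheduler):

```python
def check_barrier_stage_partitions(total_partitions, requested_partitions):
    # Hypothetical sketch of the fail-fast check: a barrier stage launches
    # all tasks together, so a job that computes only a subset of the
    # partitions -- e.g. what rdd.first() submits -- must be rejected at
    # submit time instead of hanging at runtime waiting for tasks that
    # will never be launched.
    if sorted(requested_partitions) != list(range(total_partitions)):
        raise ValueError(
            "barrier stage requires all %d partitions, got %d"
            % (total_partitions, len(requested_partitions)))
```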






[jira] [Assigned] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24821:
-

Assignee: Jiang Xingbo

> Fail fast when submitted job compute on a subset of all the partitions for a 
> barrier stage
> --
>
> Key: SPARK-24821
> URL: https://issues.apache.org/jira/browse/SPARK-24821
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Detect when SparkContext.runJob() launches a barrier stage with a subset of 
> all the partitions; one example is the `first()` operation.






[jira] [Assigned] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24820:
-

Assignee: Jiang Xingbo

> Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
> 
>
> Key: SPARK-24820
> URL: https://issues.apache.org/jira/browse/SPARK-24820
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Detect when SparkContext.runJob() launches a barrier stage that includes a 
> PartitionPruningRDD.






[jira] [Resolved] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24820.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21927
[https://github.com/apache/spark/pull/21927]

> Fail fast when submitted job contains PartitionPruningRDD in a barrier stage
> 
>
> Key: SPARK-24820
> URL: https://issues.apache.org/jira/browse/SPARK-24820
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.4.0
>
>
> Detect when SparkContext.runJob() launches a barrier stage that includes a 
> PartitionPruningRDD.






[jira] [Commented] (SPARK-24719) ClusteringEvaluator supports integer type labels

2018-08-02 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566425#comment-16566425
 ] 

Xiangrui Meng commented on SPARK-24719:
---

[~mgaido] Sorry, it was my bad! The error happened when a user tried 
BisectingKMeans with ClusteringEvaluator and CrossValidator, so I reported the 
issue here. The error actually came from DoubleParam rather than the evaluator. 
I'll do some investigation. Closing this ticket for now. Thanks for taking a 
look!

> ClusteringEvaluator supports integer type labels
> 
>
> Key: SPARK-24719
> URL: https://issues.apache.org/jira/browse/SPARK-24719
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> ClusteringEvaluator should support integer labels because we output integer 
> labels in BisectingKMeans. 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala#L77].
>  We should cast numeric types to double in ClusteringEvaluator.
> [~mgaido] Do you have time to work on the fix?
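The proposed widening amounts to casting any numeric label to double before evaluation; a hypothetical sketch (function name invented for illustration):

```python
def widen_labels(labels):
    # Hypothetical sketch of the proposed fix: accept any numeric label
    # type (BisectingKMeans emits integer cluster labels) and cast to
    # double before the evaluator computes its metric, so integer and
    # double label columns are treated uniformly.
    return [float(x) for x in labels]
```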






[jira] [Resolved] (SPARK-24719) ClusteringEvaluator supports integer type labels

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24719.
---
Resolution: Not A Problem

> ClusteringEvaluator supports integer type labels
> 
>
> Key: SPARK-24719
> URL: https://issues.apache.org/jira/browse/SPARK-24719
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> ClusteringEvaluator should support integer labels because we output integer 
> labels in BisectingKMeans. 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala#L77].
>  We should cast numeric types to double in ClusteringEvaluator.
> [~mgaido] Do you have time to work on the fix?






[jira] [Resolved] (SPARK-24557) ClusteringEvaluator support array input

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24557.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21563
[https://github.com/apache/spark/pull/21563]

> ClusteringEvaluator support array input
> ---
>
> Key: SPARK-24557
> URL: https://issues.apache.org/jira/browse/SPARK-24557
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.4.0
>
>
> Since clustering algorithms already support array input,
> {{ClusteringEvaluator}} should also support it.






[jira] [Assigned] (SPARK-24557) ClusteringEvaluator support array input

2018-08-02 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24557:
-

Assignee: zhengruifeng

> ClusteringEvaluator support array input
> ---
>
> Key: SPARK-24557
> URL: https://issues.apache.org/jira/browse/SPARK-24557
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.4.0
>
>
> Since clustering algorithms already support array input,
> {{ClusteringEvaluator}} should also support it.






[jira] [Resolved] (SPARK-24726) Discuss necessary info and access in barrier mode + Standalone

2018-07-19 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24726.
---
  Resolution: Resolved
Target Version/s: 2.4.0  (was: 3.0.0)

> Discuss necessary info and access in barrier mode + Standalone
> --
>
> Key: SPARK-24726
> URL: https://issues.apache.org/jira/browse/SPARK-24726
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Standalone. For MPI, 
> what we need is password-less SSH access among workers. We might also 
> consider other distributed frameworks, like distributed TensorFlow, H2O, etc.






[jira] [Commented] (SPARK-24726) Discuss necessary info and access in barrier mode + Standalone

2018-07-19 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550050#comment-16550050
 ] 

Xiangrui Meng commented on SPARK-24726:
---

I'm closing this ticket as resolved since, with passwordless SSH on a standalone 
cluster, users should be able to do other things via SSH.

> Discuss necessary info and access in barrier mode + Standalone
> --
>
> Key: SPARK-24726
> URL: https://issues.apache.org/jira/browse/SPARK-24726
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Standalone. For MPI, 
> what we need is password-less SSH access among workers. We might also 
> consider other distributed frameworks, like distributed TensorFlow, H2O, etc.






[jira] [Assigned] (SPARK-24726) Discuss necessary info and access in barrier mode + Standalone

2018-07-19 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24726:
-

Assignee: Xiangrui Meng

> Discuss necessary info and access in barrier mode + Standalone
> --
>
> Key: SPARK-24726
> URL: https://issues.apache.org/jira/browse/SPARK-24726
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Standalone. For MPI, 
> what we need is password-less SSH access among workers. We might also 
> consider other distributed frameworks, like distributed TensorFlow, H2O, etc.






[jira] [Updated] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24723:
--
Description: 
In barrier mode, to run hybrid distributed DL training jobs, we need to provide 
users sufficient info and access so they can set up a hybrid distributed 
training job, e.g., using MPI.

This ticket limits the scope of discussion to Spark + YARN. There were some 
past attempts from the Hadoop community. So we should find someone with good 
knowledge to lead the discussion here.

 

Requirements:
 * understand how to set up YARN to run MPI job as a YARN application
 * figure out how to do it with Spark w/ Barrier

  was:
In barrier mode, to run hybrid distributed DL training jobs, we need to provide 
users sufficient info and access so they can set up a hybrid distributed 
training job, e.g., using MPI.

This ticket limits the scope of discussion to Spark + YARN. There were some 
past attempts from the Hadoop community. So we should find someone with good 
knowledge to lead the discussion here.


> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Saisai Shao
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.
>  
> Requirements:
>  * understand how to set up YARN to run MPI job as a YARN application
>  * figure out how to do it with Spark w/ Barrier






[jira] [Commented] (SPARK-24724) Discuss necessary info and access in barrier mode + Kubernetes

2018-07-19 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550049#comment-16550049
 ] 

Xiangrui Meng commented on SPARK-24724:
---

[~liyinan926] Any updates?

> Discuss necessary info and access in barrier mode + Kubernetes
> --
>
> Key: SPARK-24724
> URL: https://issues.apache.org/jira/browse/SPARK-24724
> Project: Spark
>  Issue Type: Story
>  Components: Kubernetes, ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Yinan Li
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Kubernetes. There were 
> some past and on-going attempts from the Kubenetes community. So we should 
> find someone with good knowledge to lead the discussion here.






[jira] [Assigned] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24723:
-

Assignee: Saisai Shao

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Saisai Shao
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.






[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550048#comment-16550048
 ] 

Xiangrui Meng commented on SPARK-24723:
---

[~jerryshao] Does YARN have a feature that by default configures passwordless 
SSH on all containers (or per application)? If Spark generates the key files in 
barrier mode on YARN, it might break that YARN-provided feature. And do 
containers run sshd by default? If not, which process is responsible for 
starting/terminating the daemon?

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550023#comment-16550023
 ] 

Xiangrui Meng commented on SPARK-24615:
---

[~tgraves] Could you help link some past requests on configurable CPU/memory 
per stage? And you are suggesting making the API generalizable to those 
scenarios, but not including the feature under the scope of this proposal, 
correct?

Btw, how do you like the following API?
{code:java}
rdd.withResources
  .prefer("/gpu/k80", 2) // prefix of resource logical name, amount
  .require("/cpu", 1)
  .require("/memory", 819200)
  .require("/disk", 1){code}
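A hypothetical Python sketch of what a fluent builder behind that API could look like (class and field names invented for illustration; this is not an existing Spark API):

```python
class ResourceRequests:
    # Hypothetical builder mirroring the proposed API: resources are
    # addressed by a logical-name prefix; prefer() records soft
    # requirements and require() records hard ones.
    def __init__(self):
        self.preferred = {}
        self.required = {}

    def prefer(self, name, amount):
        self.preferred[name] = amount
        return self  # fluent chaining, as in rdd.withResources.prefer(...)

    def require(self, name, amount):
        self.required[name] = amount
        return self

reqs = (ResourceRequests()
        .prefer("/gpu/k80", 2)
        .require("/cpu", 1)
        .require("/memory", 819200)
        .require("/disk", 1))
```

Returning `self` from each method is what makes the chained `rdd.withResources.prefer(...).require(...)` style in the proposal possible.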
 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark’s current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators:
>  # There are usually more CPU cores than accelerators on one node, so using 
> CPU cores to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that each node is equipped with CPUs, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Updated] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2018-07-19 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24374:
--
Summary: SPIP: Support Barrier Execution Mode in Apache Spark  (was: SPIP: 
Support Barrier Scheduling in Apache Spark)

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}
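The gang-start semantics in the quoted proposal can be sketched with Python's `threading.Barrier` (an analogy to the scheduling model, not Spark code): no task proceeds past the barrier until every task in the stage has arrived.

```python
import threading

NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)
started_together = []

def task(i):
    # Each "task" blocks at the barrier; none proceeds until all
    # NUM_TASKS have arrived -- the essence of barrier scheduling.
    barrier.wait()
    started_together.append(i)

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If one party never reaches the barrier, the rest block indefinitely, which is why the proposal pairs this model with stage-level fault tolerance (abort and restart the whole stage).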






[jira] [Updated] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-16 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24615:
--
Summary: Accelerator-aware task scheduling for Spark  (was: Accelerator 
aware task scheduling for Spark)

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark’s current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators:
>  # There are usually more CPU cores than accelerators on one node, so using 
> CPU cores to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that each node is equipped with CPUs, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Assigned] (SPARK-24615) Accelerator aware task scheduling for Spark

2018-07-12 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24615:
-

Assignee: Saisai Shao

> Accelerator aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark’s current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators:
>  # There are usually more CPU cores than accelerators on one node, so using 
> CPU cores to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that each node is equipped with CPUs, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Updated] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-05 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24747:
--
Shepherd: Xiangrui Meng

> Make spark.ml.util.Instrumentation class more flexible
> --
>
> Key: SPARK-24747
> URL: https://issues.apache.org/jira/browse/SPARK-24747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
>
> The Instrumentation class (which is an internal private class) is somewhat 
> limited by its current APIs. The class requires that an estimator and 
> dataset be passed to the constructor, which limits how it can be used. 
> Furthermore, the current APIs make it hard to intercept failures and record 
> anything related to those failures.
> The following changes could make the Instrumentation class easier to work 
> with. All these changes are to private APIs and should not be visible to 
> users.
> {code}
> // New no-argument constructor:
> Instrumentation()
> // New api to log previous constructor arguments.
> logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])
> logFailure(e: Throwable): Unit
> // Log success with no arguments
> logSuccess(): Unit
> // Log result model explicitly instead of passing to logSuccess
> logModel(model: Model[_]): Unit
> // On Companion object
> Instrumentation.instrumented[T](body: (Instrumentation => T)): T
> // The above API will allow us to write instrumented methods more clearly and 
> handle logging success and failure automatically:
> def someMethod(...): T = instrumented { instr =>
>   instr.logNamedValue(name, value)
>   // more code here
>   instr.logModel(model)
> }
> {code}
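The `instrumented` wrapper proposed above can be sketched in a language-neutral way: run the body with an instrumentation handle, then log success or failure automatically. The real API is internal Scala; the Python below is only an illustrative analogue with hypothetical names.

```python
# Illustrative analogue of the proposed Instrumentation.instrumented pattern.
class Instrumentation:
    def __init__(self):
        self.events = []  # stand-in for real log output

    def log_named_value(self, name, value):
        self.events.append(f"{name}={value}")

    def log_failure(self, exc):
        self.events.append(f"failure: {exc}")

    def log_success(self):
        self.events.append("success")

def instrumented(body):
    """Run `body(instr)`, logging success or failure automatically."""
    instr = Instrumentation()
    try:
        result = body(instr)
        instr.log_success()
        return result, instr.events
    except Exception as e:  # intercept the failure, record it, re-raise
        instr.log_failure(e)
        raise

def train(instr):
    instr.log_named_value("numFeatures", 10)
    return "model"

result, events = instrumented(train)
print(events)  # ['numFeatures=10', 'success']
```

The point of the design is that callers never forget to log the outcome: success and failure paths are handled once, in the wrapper.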



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-05 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-24747:
-

Assignee: Bago Amirbekian

> Make spark.ml.util.Instrumentation class more flexible
> --
>
> Key: SPARK-24747
> URL: https://issues.apache.org/jira/browse/SPARK-24747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
>
> The Instrumentation class (which is an internal private class) is somewhat 
> limited by its current APIs. The class requires that an estimator and 
> dataset be passed to the constructor, which limits how it can be used. 
> Furthermore, the current APIs make it hard to intercept failures and record 
> anything related to those failures.
> The following changes could make the Instrumentation class easier to work 
> with. All these changes are to private APIs and should not be visible to 
> users.
> {code}
> // New no-argument constructor:
> Instrumentation()
> // New api to log previous constructor arguments.
> logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])
> logFailure(e: Throwable): Unit
> // Log success with no arguments
> logSuccess(): Unit
> // Log result model explicitly instead of passing to logSuccess
> logModel(model: Model[_]): Unit
> // On Companion object
> Instrumentation.instrumented[T](body: (Instrumentation => T)): T
> // The above API will allow us to write instrumented methods more clearly and 
> handle logging success and failure automatically:
> def someMethod(...): T = instrumented { instr =>
>   instr.logNamedValue(name, value)
>   // more code here
>   instr.logModel(model)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-07-03 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531737#comment-16531737
 ] 

Xiangrui Meng commented on SPARK-24579:
---

[~sethah] [~kiszk] [~rxin] Please request comment permissions on the doc. I 
didn't give everyone comment permissions by default, to avoid spam. Once I 
address a comment, it will be reflected in the current version of the doc.

You can also post comments here.

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We have seen efforts from the AI community to implement data solutions for 
> AI frameworks, like tf.data and tf.Transform. However, with 50+ data sources 
> and built-in SQL, DataFrames, and Streaming features, Spark remains the 
> community choice for big data. This is why we saw many efforts to integrate 
> DL/AI frameworks with Spark to leverage its power, for example, the 
> TFRecords data source for Spark, TensorFlowOnSpark, TensorFrames, etc. As 
> part of Project Hydrogen, this SPIP takes a different angle at Spark + AI 
> unification.
> None of these integrations are possible without exchanging data between 
> Spark and external DL/AI frameworks. And performance matters. However, there 
> is no standard way to exchange data, so implementations and performance 
> optimizations are fragmented. For example, TensorFlowOnSpark uses Hadoop 
> InputFormat/OutputFormat for TensorFlow's TFRecords to load and save data 
> and passes the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrame Rows to/from TensorFlow Tensors using TensorFlow's 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and to optimize data conversion from/to 
> this interface. DL/AI frameworks can then leverage Spark to load data from 
> virtually anywhere without spending extra effort building complex data 
> solutions, like reading features from a production data warehouse or 
> streaming model inference. Spark users can use DL/AI frameworks without 
> learning the specific data APIs implemented there. And developers from both 
> sides can work on performance optimizations independently, given that the 
> interface itself doesn't introduce significant overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


