[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-14 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513384#comment-16513384
 ] 

Shivaram Venkataraman commented on SPARK-24359:
---

Yes - that's what I meant [~felixcheung]

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  # *Semantic clarity:* We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  # *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  # *Interoperability with the rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implemented in JVM/Scala.
>  # *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic classes. Here 
> we will 

[jira] [Commented] (SPARK-24566) spark.storage.blockManagerSlaveTimeoutMs default config

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513379#comment-16513379
 ] 

Apache Spark commented on SPARK-24566:
--

User 'xueyumusic' has created a pull request for this issue:
https://github.com/apache/spark/pull/21575

> spark.storage.blockManagerSlaveTimeoutMs default config
> ---
>
> Key: SPARK-24566
> URL: https://issues.apache.org/jira/browse/SPARK-24566
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Priority: Major
>
> As the configuration doc says, "spark.network.timeout" is used in place of 
> "spark.storage.blockManagerSlaveTimeoutMs" when the latter is not configured.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24566) spark.storage.blockManagerSlaveTimeoutMs default config

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24566:


Assignee: (was: Apache Spark)

> spark.storage.blockManagerSlaveTimeoutMs default config
> ---
>
> Key: SPARK-24566
> URL: https://issues.apache.org/jira/browse/SPARK-24566
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Priority: Major
>
> As the configuration doc says, "spark.network.timeout" is used in place of 
> "spark.storage.blockManagerSlaveTimeoutMs" when the latter is not configured.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24566) spark.storage.blockManagerSlaveTimeoutMs default config

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24566:


Assignee: Apache Spark

> spark.storage.blockManagerSlaveTimeoutMs default config
> ---
>
> Key: SPARK-24566
> URL: https://issues.apache.org/jira/browse/SPARK-24566
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Assignee: Apache Spark
>Priority: Major
>
> As the configuration doc says, "spark.network.timeout" is used in place of 
> "spark.storage.blockManagerSlaveTimeoutMs" when the latter is not configured.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24566) spark.storage.blockManagerSlaveTimeoutMs default config

2018-06-14 Thread xueyu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xueyu updated SPARK-24566:
--
External issue URL: https://github.com/apache/spark/pull/21575

> spark.storage.blockManagerSlaveTimeoutMs default config
> ---
>
> Key: SPARK-24566
> URL: https://issues.apache.org/jira/browse/SPARK-24566
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Priority: Major
>
> As the configuration doc says, "spark.network.timeout" is used in place of 
> "spark.storage.blockManagerSlaveTimeoutMs" when the latter is not configured.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR

2018-06-14 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513372#comment-16513372
 ] 

Felix Cheung commented on SPARK-24535:
--

Is this only failing on Windows?

I wonder if this is related to how stdout is redirected in launchScript

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24566) spark.storage.blockManagerSlaveTimeoutMs default config

2018-06-14 Thread xueyu (JIRA)
xueyu created SPARK-24566:
-

 Summary: spark.storage.blockManagerSlaveTimeoutMs default config
 Key: SPARK-24566
 URL: https://issues.apache.org/jira/browse/SPARK-24566
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: xueyu


As the configuration doc says, "spark.network.timeout" is used in place of 
"spark.storage.blockManagerSlaveTimeoutMs" when the latter is not configured.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-14 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513363#comment-16513363
 ] 

Felix Cheung commented on SPARK-24359:
--

[~shivaram] sure - do you mean 2.3.1.1 though? The 2.4.0 release is not out yet

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  # *Semantic clarity:* We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  # *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  # *Interoperability with the rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implemented in JVM/Scala.
>  # *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic 

[jira] [Resolved] (SPARK-24267) explicitly keep DataSourceReader in DataSourceV2Relation

2018-06-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24267.
-
Resolution: Won't Fix

> explicitly keep DataSourceReader in DataSourceV2Relation
> 
>
> Key: SPARK-24267
> URL: https://issues.apache.org/jira/browse/SPARK-24267
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513332#comment-16513332
 ] 

Apache Spark commented on SPARK-24478:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21574

> DataSourceV2 should push filters and projection at physical plan conversion
> ---
>
> Key: SPARK-24478
> URL: https://issues.apache.org/jira/browse/SPARK-24478
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceV2 currently pushes filters and projected columns in the optimized 
> plan, but this requires creating and configuring a reader multiple times and 
> prevents the v2 relation's output from being a fixed argument of the relation 
> case class. Pushing them at physical plan conversion is also much cleaner (see 
> PR [#21262|https://github.com/apache/spark/pull/21262]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion

2018-06-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24478:
---

Assignee: Ryan Blue

> DataSourceV2 should push filters and projection at physical plan conversion
> ---
>
> Key: SPARK-24478
> URL: https://issues.apache.org/jira/browse/SPARK-24478
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceV2 currently pushes filters and projected columns in the optimized 
> plan, but this requires creating and configuring a reader multiple times and 
> prevents the v2 relation's output from being a fixed argument of the relation 
> case class. Pushing them at physical plan conversion is also much cleaner (see 
> PR [#21262|https://github.com/apache/spark/pull/21262]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24478) DataSourceV2 should push filters and projection at physical plan conversion

2018-06-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24478.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21503
[https://github.com/apache/spark/pull/21503]

> DataSourceV2 should push filters and projection at physical plan conversion
> ---
>
> Key: SPARK-24478
> URL: https://issues.apache.org/jira/browse/SPARK-24478
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceV2 currently pushes filters and projected columns in the optimized 
> plan, but this requires creating and configuring a reader multiple times and 
> prevents the v2 relation's output from being a fixed argument of the relation 
> case class. Pushing them at physical plan conversion is also much cleaner (see 
> PR [#21262|https://github.com/apache/spark/pull/21262]).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24560) Fix some getTimeAsMs as getTimeAsSeconds

2018-06-14 Thread xueyu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xueyu updated SPARK-24560:
--
Docs Text:   (was: There are some places using "getTimeAsMs" rather than 
"getTimeAsSeconds". This will return a wrong value when the user specifies a 
value without a time unit.)

> Fix some getTimeAsMs as getTimeAsSeconds
> 
>
> Key: SPARK-24560
> URL: https://issues.apache.org/jira/browse/SPARK-24560
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Priority: Major
>
> There are some places using "getTimeAsMs" rather than "getTimeAsSeconds". 
> This will return a wrong value when the user specifies a value without a time 
> unit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24560) Fix some getTimeAsMs as getTimeAsSeconds

2018-06-14 Thread xueyu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xueyu updated SPARK-24560:
--
Description: There are some places using "getTimeAsMs" rather than 
"getTimeAsSeconds". This will return a wrong value when the user specifies a 
value without a time unit.

> Fix some getTimeAsMs as getTimeAsSeconds
> 
>
> Key: SPARK-24560
> URL: https://issues.apache.org/jira/browse/SPARK-24560
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Priority: Major
>
> There are some places using "getTimeAsMs" rather than "getTimeAsSeconds". 
> This will return a wrong value when the user specifies a value without a time 
> unit.
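
For readers unfamiliar with the SparkConf time getters, a small illustrative 
sketch (using a made-up key, spark.foo.timeout) of why calling the wrong getter 
matters when the configured value carries no unit suffix; this is the mismatch 
the issue describes.

{code:scala}
import org.apache.spark.SparkConf

object TimeUnitMismatch {
  def main(args: Array[String]): Unit = {
    // The user sets a timeout without a unit suffix.
    val conf = new SparkConf().set("spark.foo.timeout", "120")

    // getTimeAsSeconds treats a bare number as seconds -> 120 seconds.
    println(conf.getTimeAsSeconds("spark.foo.timeout"))

    // getTimeAsMs treats a bare number as milliseconds -> 120 milliseconds.
    println(conf.getTimeAsMs("spark.foo.timeout"))

    // So calling getTimeAsMs on a config documented in seconds silently turns
    // an intended 120 seconds into 120 milliseconds.
  }
}
{code}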



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21743) top-most limit should not cause memory leak

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513233#comment-16513233
 ] 

Apache Spark commented on SPARK-21743:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21573

> top-most limit should not cause memory leak
> ---
>
> Key: SPARK-21743
> URL: https://issues.apache.org/jira/browse/SPARK-21743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24565) Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame

2018-06-14 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-24565:
--
Description: 
Currently, the micro-batches in the MicroBatchExecution are not exposed to the 
user through any public API. This was because we did not want to expose the 
micro-batches, so that all the APIs we expose can eventually be supported in 
the Continuous engine. But now that we have a better sense of building a 
ContinuousExecution, I am considering adding APIs which will run only the 
MicroBatchExecution. I have quite a few use cases where exposing the 
micro-batch output as a dataframe is useful.
- Pass the output rows of each batch to a library that is designed only for 
batch jobs (for example, many ML libraries need to collect() while learning).
- Reuse batch data sources for output whose streaming version does not exist 
(e.g. redshift data source).
- Write the output rows to multiple places by writing twice for each batch. 
This is not the most elegant thing to do for multiple-output streaming queries 
but is likely to be better than running two streaming queries processing the 
same data twice.

The proposal is to add a method {{foreachBatch(f: Dataset[T] => Unit)}} to 
Scala/Java/Python {{DataStreamWriter}}.


  was:
Currently, the micro-batches in the MicroBatchExecution are not exposed to the 
user through any public API. This was because we did not want to expose the 
micro-batches, so that all the APIs we expose can eventually be supported in 
the Continuous engine. But now that we have a better sense of building a 
ContinuousExecution, I am considering adding APIs which will run only the 
MicroBatchExecution. I have quite a few use cases where exposing the 
micro-batch output as a dataframe is useful.
- Pass the output rows of each batch to a library that is designed only for 
batch jobs (for example, many ML libraries need to collect() while learning).
- Reuse batch data sources for output whose streaming version does not exist 
(e.g. redshift data source).
- Write the output rows to multiple places by writing twice for each batch. 
This is not the most elegant thing to do for multiple-output streaming queries 
but is likely to be better than running two streaming queries processing the 
same data twice.

The proposal is to add a method {{foreachBatch(f: Dataset[T] => Unit)}} to 
Scala/Java/Python `DataStreamWriter`.



> Add API for in Structured Streaming for exposing output rows of each 
> microbatch as a DataFrame
> --
>
> Key: SPARK-24565
> URL: https://issues.apache.org/jira/browse/SPARK-24565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Currently, the micro-batches in the MicroBatchExecution are not exposed to the 
> user through any public API. This was because we did not want to expose the 
> micro-batches, so that all the APIs we expose can eventually be supported in 
> the Continuous engine. But now that we have a better sense of building a 
> ContinuousExecution, I am considering adding APIs which will run only the 
> MicroBatchExecution. I have quite a few use cases where exposing the 
> micro-batch output as a dataframe is useful.
> - Pass the output rows of each batch to a library that is designed only for 
> batch jobs (for example, many ML libraries need to collect() while learning).
> - Reuse batch data sources for output whose streaming version does not exist 
> (e.g. redshift data source).
> - Write the output rows to multiple places by writing twice for each batch. 
> This is not the most elegant thing to do for multiple-output streaming queries 
> but is likely to be better than running two streaming queries processing the 
> same data twice.
> The proposal is to add a method {{foreachBatch(f: Dataset[T] => Unit)}} to 
> Scala/Java/Python {{DataStreamWriter}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-14 Thread Ricardo Martinelli de Oliveira (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513159#comment-16513159
 ] 

Ricardo Martinelli de Oliveira commented on SPARK-24534:


Guys, I sent a PR as a proposal for this Jira: 
[https://github.com/apache/spark/pull/21572]

Would you mind reviewing it?

> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
>
> As an improvement in the entrypoint.sh script, I'd like to propose that the 
> Spark entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc., without 
> compromising the previous method of configuring the cluster inside a Kubernetes 
> environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-06-14 Thread Matt Cheah (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah resolved SPARK-24248.

   Resolution: Fixed
Fix Version/s: 2.4.0

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
> Fix For: 2.4.0
>
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can in 
> favor of using the Kubernetes cluster as the source of truth for pod status. 
> Maintaining less state in memory lowers the chance that we accidentally miss 
> updating one of these data structures and break the lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24565) Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24565:


Assignee: Tathagata Das  (was: Apache Spark)

> Add API for in Structured Streaming for exposing output rows of each 
> microbatch as a DataFrame
> --
>
> Key: SPARK-24565
> URL: https://issues.apache.org/jira/browse/SPARK-24565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Currently, the micro-batches in the MicroBatchExecution are not exposed to the 
> user through any public API. This was because we did not want to expose the 
> micro-batches, so that all the APIs we expose can eventually be supported in 
> the Continuous engine. But now that we have a better sense of building a 
> ContinuousExecution, I am considering adding APIs which will run only the 
> MicroBatchExecution. I have quite a few use cases where exposing the 
> micro-batch output as a dataframe is useful.
> - Pass the output rows of each batch to a library that is designed only for 
> batch jobs (for example, many ML libraries need to collect() while learning).
> - Reuse batch data sources for output whose streaming version does not exist 
> (e.g. redshift data source).
> - Write the output rows to multiple places by writing twice for each batch. 
> This is not the most elegant thing to do for multiple-output streaming queries 
> but is likely to be better than running two streaming queries processing the 
> same data twice.
> The proposal is to add a method {{foreachBatch(f: Dataset[T] => Unit)}} to 
> Scala/Java/Python `DataStreamWriter`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24565) Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24565:


Assignee: Apache Spark  (was: Tathagata Das)

> Add API for in Structured Streaming for exposing output rows of each 
> microbatch as a DataFrame
> --
>
> Key: SPARK-24565
> URL: https://issues.apache.org/jira/browse/SPARK-24565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the micro-batches in the MicroBatchExecution are not exposed to the 
> user through any public API. This was because we did not want to expose the 
> micro-batches, so that all the APIs we expose can eventually be supported in 
> the Continuous engine. But now that we have a better sense of building a 
> ContinuousExecution, I am considering adding APIs which will run only the 
> MicroBatchExecution. I have quite a few use cases where exposing the 
> micro-batch output as a dataframe is useful.
> - Pass the output rows of each batch to a library that is designed only for 
> batch jobs (for example, many ML libraries need to collect() while learning).
> - Reuse batch data sources for output whose streaming version does not exist 
> (e.g. redshift data source).
> - Write the output rows to multiple places by writing twice for each batch. 
> This is not the most elegant thing to do for multiple-output streaming queries 
> but is likely to be better than running two streaming queries processing the 
> same data twice.
> The proposal is to add a method {{foreachBatch(f: Dataset[T] => Unit)}} to 
> Scala/Java/Python `DataStreamWriter`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24565) Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513110#comment-16513110
 ] 

Apache Spark commented on SPARK-24565:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/21571

> Add API for in Structured Streaming for exposing output rows of each 
> microbatch as a DataFrame
> --
>
> Key: SPARK-24565
> URL: https://issues.apache.org/jira/browse/SPARK-24565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Currently, the micro-batches in the MicroBatchExecution are not exposed to the 
> user through any public API. This was because we did not want to expose the 
> micro-batches, so that all the APIs we expose can eventually be supported in 
> the Continuous engine. But now that we have a better sense of building a 
> ContinuousExecution, I am considering adding APIs which will run only the 
> MicroBatchExecution. I have quite a few use cases where exposing the 
> micro-batch output as a dataframe is useful.
> - Pass the output rows of each batch to a library that is designed only for 
> batch jobs (for example, many ML libraries need to collect() while learning).
> - Reuse batch data sources for output whose streaming version does not exist 
> (e.g. redshift data source).
> - Write the output rows to multiple places by writing twice for each batch. 
> This is not the most elegant thing to do for multiple-output streaming queries 
> but is likely to be better than running two streaming queries processing the 
> same data twice.
> The proposal is to add a method {{foreachBatch(f: Dataset[T] => Unit)}} to 
> Scala/Java/Python `DataStreamWriter`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24565) Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame

2018-06-14 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-24565:
-

 Summary: Add API for in Structured Streaming for exposing output 
rows of each microbatch as a DataFrame
 Key: SPARK-24565
 URL: https://issues.apache.org/jira/browse/SPARK-24565
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Tathagata Das
Assignee: Tathagata Das


Currently, the micro-batches in the MicroBatchExecution are not exposed to the 
user through any public API. This was because we did not want to expose the 
micro-batches, so that all the APIs we expose can eventually be supported in 
the Continuous engine. But now that we have a better sense of building a 
ContinuousExecution, I am considering adding APIs which will run only the 
MicroBatchExecution. I have quite a few use cases where exposing the 
micro-batch output as a dataframe is useful.
- Pass the output rows of each batch to a library that is designed only for 
batch jobs (for example, many ML libraries need to collect() while learning).
- Reuse batch data sources for output whose streaming version does not exist 
(e.g. redshift data source).
- Write the output rows to multiple places by writing twice for each batch. 
This is not the most elegant thing to do for multiple-output streaming queries 
but is likely to be better than running two streaming queries processing the 
same data twice.

The proposal is to add a method {{foreachBatch(f: Dataset[T] => Unit)}} to 
Scala/Java/Python `DataStreamWriter`.
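
To illustrate the proposal, here is a hedged sketch of how a user might call the 
proposed method. {{foreachBatch}} does not exist yet as of this issue, so this 
only compiles once such a method is added; the single-argument signature matches 
the proposal above, and the rate source and sink paths are just placeholders.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

object ForeachBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreachBatch-sketch").getOrCreate()

    // Any streaming source works; "rate" is just a built-in test source.
    val input = spark.readStream.format("rate").load()

    val query = input.writeStream
      // Proposed API: the body runs once per micro-batch, with the batch's
      // output rows exposed as an ordinary (batch) DataFrame.
      .foreachBatch { (batchDf: DataFrame) =>
        // Reuse a plain batch writer for a sink that has no streaming version,
        // and write the same micro-batch to a second place.
        batchDf.write.mode("append").parquet("/tmp/sink-a")
        batchDf.write.mode("append").json("/tmp/sink-b")
      }
      .start()

    query.awaitTermination()
  }
}
{code}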




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24564) Add test suite for RecordBinaryComparator

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513095#comment-16513095
 ] 

Apache Spark commented on SPARK-24564:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/21570

> Add test suite for RecordBinaryComparator
> -
>
> Key: SPARK-24564
> URL: https://issues.apache.org/jira/browse/SPARK-24564
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24564) Add test suite for RecordBinaryComparator

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24564:


Assignee: Apache Spark

> Add test suite for RecordBinaryComparator
> -
>
> Key: SPARK-24564
> URL: https://issues.apache.org/jira/browse/SPARK-24564
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24564) Add test suite for RecordBinaryComparator

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24564:


Assignee: (was: Apache Spark)

> Add test suite for RecordBinaryComparator
> -
>
> Key: SPARK-24564
> URL: https://issues.apache.org/jira/browse/SPARK-24564
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24564) Add test suite for RecordBinaryComparator

2018-06-14 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24564:


 Summary: Add test suite for RecordBinaryComparator
 Key: SPARK-24564
 URL: https://issues.apache.org/jira/browse/SPARK-24564
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Jiang Xingbo






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-06-14 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513064#comment-16513064
 ] 

Yinan Li commented on SPARK-24434:
--

[~skonto] Thanks! Will take a look at the design doc once I'm back from 
vacation.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24319) run-example can not print usage

2018-06-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24319.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21450
[https://github.com/apache/spark/pull/21450]

> run-example can not print usage
> ---
>
> Key: SPARK-24319
> URL: https://issues.apache.org/jira/browse/SPARK-24319
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Minor
> Fix For: 2.4.0
>
>
> Running "bin/run-example" with no args or with "–help" will not print usage 
> and just gives the error
> {noformat}
> $ bin/run-example
> Exception in thread "main" java.lang.IllegalArgumentException: Missing 
> application resource.
>     at 
> org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:181)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:296)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:162)
>     at org.apache.spark.launcher.Main.main(Main.java:86){noformat}
> it looks like there is an env var in the script that shows usage, but it's 
> getting preempted by something else



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24319) run-example can not print usage

2018-06-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24319:
--

Assignee: Gabor Somogyi

> run-example can not print usage
> ---
>
> Key: SPARK-24319
> URL: https://issues.apache.org/jira/browse/SPARK-24319
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 2.4.0
>
>
> Running "bin/run-example" with no args or with "–help" will not print usage 
> and just gives the error
> {noformat}
> $ bin/run-example
> Exception in thread "main" java.lang.IllegalArgumentException: Missing 
> application resource.
>     at 
> org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:181)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:296)
>     at 
> org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:162)
>     at org.apache.spark.launcher.Main.main(Main.java:86){noformat}
> it looks like there is an env var in the script that shows usage, but it's 
> getting preempted by something else



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24559) Some zip files passed with spark-submit --archives causing "invalid CEN header" error

2018-06-14 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513036#comment-16513036
 ] 

Marcelo Vanzin commented on SPARK-24559:


{{\-\-archives}} is completely handled by YARN, so if there's anything to fix 
it will be on YARN's side.

> Some zip files passed with spark-submit --archives causing "invalid CEN 
> header" error
> -
>
> Key: SPARK-24559
> URL: https://issues.apache.org/jira/browse/SPARK-24559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: James Porritt
>Priority: Major
>
> I'm encountering an error when submitting some zip files to spark-submit 
> using --archives that are over 2Gb and have the zip64 flag set.
> {{PYSPARK_PYTHON=./ROOT/myspark/bin/python 
> /usr/hdp/current/spark2-client/bin/spark-submit \}}
> {{ --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ROOT/myspark/bin/python \}}
> {{ --master=yarn \}}
> {{ --deploy-mode=cluster \}}
> {{ --driver-memory=4g \}}
> {{ --archives=myspark.zip#ROOT \}}
> {{ --num-executors=32 \}}
> {{ --packages com.databricks:spark-avro_2.11:4.0.0 \}}
> {{ foo.py}}
> (As a bit of background, I'm trying to prepare files using the trick of 
> zipping a conda environment and passing the zip file via --archives, as per: 
> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html)
> myspark.zip is a zipped conda environment. It was created using Python with 
> the zipfile package. The files are stored without deflation and with the 
> zip64 flag set. foo.py is my application code. This normally works, but if 
> myspark.zip is greater than 2Gb and has the zip64 flag set I get:
> java.util.zip.ZipException: invalid CEN header (bad signature)
> There seems to be much written on the subject, and I was able to write Java 
> code that utilises the java.util.zip library that both does and doesn't 
> encounter this error for one of the problematic zip files.
> Spark compile info:
> {{Welcome to}}
> {{      ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.4.0-91}}
> {{      /_/}}
> {{Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_112}}
> {{Branch HEAD}}
> {{Compiled by user jenkins on 2018-01-04T10:41:05Z}}
> {{Revision a24017869f5450397136ee8b11be818e7cd3facb}}
> {{Url g...@github.com:hortonworks/spark2.git}}
> {{Type --help for more information.}}
> YARN logs on console after above command. I've tried both 
> --deploy-mode=cluster and --deploy-mode=client.
> {{18/06/13 16:00:22 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable}}
> {{18/06/13 16:00:23 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.}}
> {{18/06/13 16:00:23 INFO RMProxy: Connecting to ResourceManager at 
> myhost2.myfirm.com/10.87.11.17:8050}}
> {{18/06/13 16:00:23 INFO Client: Requesting a new application from cluster 
> with 6 NodeManagers}}
> {{18/06/13 16:00:23 INFO Client: Verifying our application has not requested 
> more than the maximum memory capability of the cluster (221184 MB per 
> container)}}
> {{18/06/13 16:00:23 INFO Client: Will allocate AM container, with 18022 MB 
> memory including 1638 MB overhead}}
> {{18/06/13 16:00:23 INFO Client: Setting up container launch context for our 
> AM}}
> {{18/06/13 16:00:23 INFO Client: Setting up the launch environment for our AM 
> container}}
> {{18/06/13 16:00:23 INFO Client: Preparing resources for our AM container}}
> {{18/06/13 16:00:24 INFO Client: Use hdfs cache file as spark.yarn.archive 
> for HDP, 
> hdfsCacheFile:hdfs://myhost.myfirm.com:8020/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz}}
> {{18/06/13 16:00:24 INFO Client: Source and destination file systems are the 
> same. Not copying 
> hdfs://myhost.myfirm.com:8020/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz}}
> {{18/06/13 16:00:24 INFO Client: Uploading resource 
> file:/home/myuser/.ivy2/jars/com.databricks_spark-avro_2.11-4.0.0.jar -> 
> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/com.databri}}
> {{cks_spark-avro_2.11-4.0.0.jar}}
> {{18/06/13 16:00:26 INFO Client: Uploading resource 
> file:/home/myuser/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar -> 
> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.slf4j_slf4j-api-1.}}
> {{7.5.jar}}
> {{18/06/13 16:00:26 INFO Client: Uploading resource 
> file:/home/myuser/.ivy2/jars/org.apache.avro_avro-1.7.6.jar -> 
> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.apache.avro_avro-}}
> {{1.7.6.jar}}
> 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-14 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513008#comment-16513008
 ] 

Hossein Falaki commented on SPARK-24359:


[~shivaram] I like that.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  # *Semantic clarity:* We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  # *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  # *Interoperability with the rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implemented in JVM/Scala.
>  # *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic classes. Here 
> we will list the basic types with a 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-14 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513001#comment-16513001
 ] 

Shivaram Venkataraman commented on SPARK-24359:
---

Sounds good. Thanks [~falaki].

 [~felixcheung], on a related note, maybe we can formalize these 2.4.0.1 
releases for SparkR as well? I.e., where we only have changes in R code and it 
is compatible with 2.4.0 of SparkR (we might need to revisit some of the code 
that figures out the Spark version based on the SparkR version). I can open a 
new JIRA for that?

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-14 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512959#comment-16512959
 ] 

Hossein Falaki commented on SPARK-24359:


Considering that I am volunteering myself to do the housekeeping needed for any 
SparkML maintenance branches, I conclude that we are going to keep this as part 
of the main repository. I expect that we will submit to CRAN only when the 
community feels comfortable about the stability of the new package (following 
an alpha => beta => GA process).

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% 

[jira] [Commented] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-14 Thread Trevor McKay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512949#comment-16512949
 ] 

Trevor McKay commented on SPARK-24534:
--

This is useful in situations like this one for the spark-on-k8s-operator, where 
there is a need to execute the userid preamble but then fall through to another 
command. I think that would be preferable to copy-pasting the boilerplate 
across the imagesphere :)

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/pull/189

> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
>
> As an improvement to the entrypoint.sh script, I'd like to propose that the 
> Spark entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This would allow the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support, etc., without 
> compromising the previous method of configuring the cluster inside a 
> Kubernetes environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24543) Support any DataType as DDL string for from_json's schema

2018-06-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24543.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21550
[https://github.com/apache/spark/pull/21550]

> Support any DataType as DDL string for from_json's schema
> -
>
> Key: SPARK-24543
> URL: https://issues.apache.org/jira/browse/SPARK-24543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, the schema for from_json can be specified as a DataType or as a 
> string in the following formats:
> * in SQL, as a sequence of fields like _INT a, STRING b_
> * in Scala, Python, etc., in JSON format or as in SQL
> The ticket aims to support an arbitrary DataType as a DDL string for 
> from_json. For example:
> {code:sql}
> select from_json('{"a":1, "b":2}', 'map<string, int>')
> {code}
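For illustration, here is a minimal PySpark sketch of the same call, assuming a build that includes this fix; the DataFrame and column names are invented for the example.
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

spark = SparkSession.builder.getOrCreate()

# A JSON column whose values are flat string -> int maps.
df = spark.createDataFrame([('{"a":1, "b":2}',)], ["json"])

# With the fix, a DataType written as a DDL string is accepted as the schema.
df.select(from_json(col("json"), "map<string, int>").alias("parsed")).show()
{code}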



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24543) Support any DataType as DDL string for from_json's schema

2018-06-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24543:
---

Assignee: Maxim Gekk

> Support any DataType as DDL string for from_json's schema
> -
>
> Key: SPARK-24543
> URL: https://issues.apache.org/jira/browse/SPARK-24543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, the schema for from_json can be specified as a DataType or as a 
> string in the following formats:
> * in SQL, as a sequence of fields like _INT a, STRING b_
> * in Scala, Python, etc., in JSON format or as in SQL
> The ticket aims to support an arbitrary DataType as a DDL string for 
> from_json. For example:
> {code:sql}
> select from_json('{"a":1, "b":2}', 'map<string, int>')
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24563.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21569
[https://github.com/apache/spark/pull/21569]

> Allow running PySpark shell without Hive
> 
>
> Key: SPARK-24563
> URL: https://issues.apache.org/jira/browse/SPARK-24563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> A previous commit: 
> [https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]
> disallows running PySpark shell without Hive.
> Per discussion on mailing list, the behavior change is unintended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24563:
--

Assignee: Li Jin

> Allow running PySpark shell without Hive
> 
>
> Key: SPARK-24563
> URL: https://issues.apache.org/jira/browse/SPARK-24563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> A previous commit: 
> [https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]
> disallows running PySpark shell without Hive.
> Per discussion on mailing list, the behavior change is unintended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512893#comment-16512893
 ] 

Apache Spark commented on SPARK-24563:
--

User 'icexelloss' has created a pull request for this issue:
https://github.com/apache/spark/pull/21569

> Allow running PySpark shell without Hive
> 
>
> Key: SPARK-24563
> URL: https://issues.apache.org/jira/browse/SPARK-24563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> A previous commit: 
> [https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]
> disallows running PySpark shell without Hive.
> Per discussion on mailing list, the behavior change is unintended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24563:


Assignee: Apache Spark

> Allow running PySpark shell without Hive
> 
>
> Key: SPARK-24563
> URL: https://issues.apache.org/jira/browse/SPARK-24563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Assignee: Apache Spark
>Priority: Major
>
> A previous commit: 
> [https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]
> disallows running PySpark shell without Hive.
> Per discussion on mailing list, the behavior change is unintended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24563:


Assignee: (was: Apache Spark)

> Allow running PySpark shell without Hive
> 
>
> Key: SPARK-24563
> URL: https://issues.apache.org/jira/browse/SPARK-24563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> A previous commit: 
> [https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]
> disallows running PySpark shell without Hive.
> Per discussion on mailing list, the behavior change is unintended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512844#comment-16512844
 ] 

Li Jin commented on SPARK-24563:


Will submit a PR soon

> Allow running PySpark shell without Hive
> 
>
> Key: SPARK-24563
> URL: https://issues.apache.org/jira/browse/SPARK-24563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> A previous commit: 
> [https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]
> disallows running PySpark shell without Hive.
> Per discussion on mailing list, the behavior change is unintended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-24563:
---
Description: 
A previous commit: 

[https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]

disallows running PySpark shell without Hive.

Per discussion on mailing list, the behavior change is unintended.

  was:
A previous commit: 

[https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]

Disallows running PySpark shell without Hive. Per discussion on mailing list, 
the behavior change is unintended.


> Allow running PySpark shell without Hive
> 
>
> Key: SPARK-24563
> URL: https://issues.apache.org/jira/browse/SPARK-24563
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Li Jin
>Priority: Major
>
> A previous commit: 
> [https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]
> disallows running PySpark shell without Hive.
> Per discussion on mailing list, the behavior change is unintended.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24563) Allow running PySpark shell without Hive

2018-06-14 Thread Li Jin (JIRA)
Li Jin created SPARK-24563:
--

 Summary: Allow running PySpark shell without Hive
 Key: SPARK-24563
 URL: https://issues.apache.org/jira/browse/SPARK-24563
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Li Jin


A previous commit: 

[https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5]

Disallows running PySpark shell without Hive. Per discussion on mailing list, 
the behavior change is unintended.
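A minimal sketch (not the actual shell.py code) of the intended fallback: build a plain SparkSession when Hive support cannot be enabled, instead of failing outright.
{code:python}
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

def create_shell_session():
    """Prefer a Hive-enabled session, but fall back to a plain one."""
    conf = SparkConf()
    if conf.get("spark.sql.catalogImplementation", "hive") == "hive":
        try:
            return SparkSession.builder.enableHiveSupport().getOrCreate()
        except Exception:
            # Hive classes are not on the classpath; run without Hive.
            return SparkSession.builder.getOrCreate()
    return SparkSession.builder.getOrCreate()

spark = create_shell_session()
{code}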



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24562) Allow running same tests with multiple configs in SQLQueryTestSuite

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24562:


Assignee: (was: Apache Spark)

> Allow running same tests with multiple configs in SQLQueryTestSuite
> ---
>
> Key: SPARK-24562
> URL: https://issues.apache.org/jira/browse/SPARK-24562
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> We often need to run the same queries with different configs in order to 
> check their behavior under each condition. In particular, we have 2 cases:
>  - the same queries with different configs should have the same result;
>  - the same queries with different configs should have different results.
> This ticket aims to introduce support for both cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24562) Allow running same tests with multiple configs in SQLQueryTestSuite

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512769#comment-16512769
 ] 

Apache Spark commented on SPARK-24562:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21568

> Allow running same tests with multiple configs in SQLQueryTestSuite
> ---
>
> Key: SPARK-24562
> URL: https://issues.apache.org/jira/browse/SPARK-24562
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> We often need to run the same queries with different configs in order to 
> check their behavior under each condition. In particular, we have 2 cases:
>  - the same queries with different configs should have the same result;
>  - the same queries with different configs should have different results.
> This ticket aims to introduce support for both cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24562) Allow running same tests with multiple configs in SQLQueryTestSuite

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24562:


Assignee: Apache Spark

> Allow running same tests with multiple configs in SQLQueryTestSuite
> ---
>
> Key: SPARK-24562
> URL: https://issues.apache.org/jira/browse/SPARK-24562
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Trivial
>
> We often need to run the same queries with different configs in order to 
> check their behavior under each condition. In particular, we have 2 cases:
>  - the same queries with different configs should have the same result;
>  - the same queries with different configs should have different results.
> This ticket aims to introduce support for both cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24562) Allow running same tests with multiple configs in SQLQueryTestSuite

2018-06-14 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-24562:
---

 Summary: Allow running same tests with multiple configs in 
SQLQueryTestSuite
 Key: SPARK-24562
 URL: https://issues.apache.org/jira/browse/SPARK-24562
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 2.4.0
Reporter: Marco Gaido


We often need to run the same queries with different configs in order to check 
their behavior under each condition. In particular, we have 2 cases:
 - the same queries with different configs should have the same result;
 - the same queries with different configs should have different results.

This ticket aims to introduce support for both cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-14 Thread Rafael Hernandez Murcia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512732#comment-16512732
 ] 

Rafael Hernandez Murcia commented on SPARK-17025:
-

Any news about this? It seems that there's a nice workaround over there: 
[https://stackoverflow.com/questions/41399399/serialize-a-custom-transformer-using-python-to-be-used-within-a-pyspark-ml-pipel]
 but I wouldn't like to keep it as a permanent solution...
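For reference, newer PySpark releases (2.3+) provide DefaultParamsReadable/DefaultParamsWritable mixins in pyspark.ml.util that give a pure-Python Transformer its own save/load without a _to_java bridge; whether that fully covers the pipeline-persistence case in this ticket is a separate question. A minimal sketch with a hypothetical transformer:
{code:python}
from pyspark.ml import Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql.functions import col

class DoubleValue(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    """Hypothetical custom transformer: doubles the 'value' column."""

    def _transform(self, dataset):
        return dataset.withColumn("value", col("value") * 2)

# Standalone save/load of the custom transformer.
t = DoubleValue()
t.write().overwrite().save("/tmp/double_value")
loaded = DoubleValue.load("/tmp/double_value")
{code}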

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in <module>
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24495) SortMergeJoin with duplicate keys wrong results

2018-06-14 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512711#comment-16512711
 ] 

Xiao Li commented on SPARK-24495:
-

We might need to release 2.3.2 since this is a serious bug. The impact is 
large. 

> SortMergeJoin with duplicate keys wrong results
> ---
>
> Key: SPARK-24495
> URL: https://issues.apache.org/jira/browse/SPARK-24495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> To reproduce:
> {code:java}
> // the bug is in SortMergeJoin but the Shuffles are correct. With the default 
> of 200 it might split the data into partitions so small that the SortMergeJoin 
> can no longer return wrong results
> spark.conf.set("spark.sql.shuffle.partitions", "1")
> // disable this, otherwise it would filter results before join, hiding the bug
> spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
> sql("select id as a1 from range(1000)").createOrReplaceTempView("t1")
> sql("select id * 2 as b1, -id as b2 from 
> range(1000)").createOrReplaceTempView("t2")
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> sql("""select b1, a1, b2 FROM t1 INNER JOIN t2 ON b1 = a1 AND b2 = a1""").show
> {code}
> In the results, it's expected that all columns are equal (see join condition).
> But the result is:
> {code:java}
> +---+---+---+
> | b1| a1| b2|
> +---+---+---+
> |  0|  0|  0|
> |  2|  2| -1|
> |  4|  4| -2|
> |  6|  6| -3|
> |  8|  8| -4|
> 
> {code}
> I traced it to {{EnsureRequirements.reorder}} which was introduced by 
> [https://github.com/apache/spark/pull/16985] and 
> [https://github.com/apache/spark/pull/20041]
> It leads to an incorrect plan:
> {code:java}
> == Physical Plan ==
> *(5) Project [b1#735672L, a1#735669L, b2#735673L]
> +- *(5) SortMergeJoin [a1#735669L, a1#735669L], [b1#735672L, b1#735672L], 
> Inner
>:- *(2) Sort [a1#735669L ASC NULLS FIRST, a1#735669L ASC NULLS FIRST], 
> false, 0
>:  +- Exchange hashpartitioning(a1#735669L, a1#735669L, 1)
>: +- *(1) Project [id#735670L AS a1#735669L]
>:+- *(1) Range (0, 1000, step=1, splits=8)
>+- *(4) Sort [b1#735672L ASC NULLS FIRST, b2#735673L ASC NULLS FIRST], 
> false, 0
>   +- Exchange hashpartitioning(b1#735672L, b2#735673L, 1)
>  +- *(3) Project [(id#735674L * 2) AS b1#735672L, -id#735674L AS 
> b2#735673L]
> +- *(3) Range (0, 1000, step=1, splits=8)
> {code}
> The SortMergeJoin keys are wrong: key b2 is missing completely.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24495) SortMergeJoin with duplicate keys wrong results

2018-06-14 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24495.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.4.0
   2.3.2

> SortMergeJoin with duplicate keys wrong results
> ---
>
> Key: SPARK-24495
> URL: https://issues.apache.org/jira/browse/SPARK-24495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>Assignee: Marco Gaido
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> To reproduce:
> {code:java}
> // the bug is in SortMergeJoin but the Shuffles are correct. With the default 
> of 200 it might split the data into partitions so small that the SortMergeJoin 
> can no longer return wrong results
> spark.conf.set("spark.sql.shuffle.partitions", "1")
> // disable this, otherwise it would filter results before join, hiding the bug
> spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
> sql("select id as a1 from range(1000)").createOrReplaceTempView("t1")
> sql("select id * 2 as b1, -id as b2 from 
> range(1000)").createOrReplaceTempView("t2")
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> sql("""select b1, a1, b2 FROM t1 INNER JOIN t2 ON b1 = a1 AND b2 = a1""").show
> {code}
> In the results, it's expected that all columns are equal (see join condition).
> But the result is:
> {code:java}
> +---+---+---+
> | b1| a1| b2|
> +---+---+---+
> |  0|  0|  0|
> |  2|  2| -1|
> |  4|  4| -2|
> |  6|  6| -3|
> |  8|  8| -4|
> 
> {code}
> I traced it to {{EnsureRequirements.reorder}} which was introduced by 
> [https://github.com/apache/spark/pull/16985] and 
> [https://github.com/apache/spark/pull/20041]
> It leads to an incorrect plan:
> {code:java}
> == Physical Plan ==
> *(5) Project [b1#735672L, a1#735669L, b2#735673L]
> +- *(5) SortMergeJoin [a1#735669L, a1#735669L], [b1#735672L, b1#735672L], 
> Inner
>:- *(2) Sort [a1#735669L ASC NULLS FIRST, a1#735669L ASC NULLS FIRST], 
> false, 0
>:  +- Exchange hashpartitioning(a1#735669L, a1#735669L, 1)
>: +- *(1) Project [id#735670L AS a1#735669L]
>:+- *(1) Range (0, 1000, step=1, splits=8)
>+- *(4) Sort [b1#735672L ASC NULLS FIRST, b2#735673L ASC NULLS FIRST], 
> false, 0
>   +- Exchange hashpartitioning(b1#735672L, b2#735673L, 1)
>  +- *(3) Project [(id#735674L * 2) AS b1#735672L, -id#735674L AS 
> b2#735673L]
> +- *(3) Range (0, 1000, step=1, splits=8)
> {code}
> The SortMergeJoin keys are wrong: key b2 is missing completely.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24495) SortMergeJoin with duplicate keys wrong results

2018-06-14 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24495:

Priority: Blocker  (was: Major)

> SortMergeJoin with duplicate keys wrong results
> ---
>
> Key: SPARK-24495
> URL: https://issues.apache.org/jira/browse/SPARK-24495
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> To reproduce:
> {code:java}
> // the bug is in SortMergeJoin but the Shuffles are correct. With the default 
> of 200 it might split the data into partitions so small that the SortMergeJoin 
> can no longer return wrong results
> spark.conf.set("spark.sql.shuffle.partitions", "1")
> // disable this, otherwise it would filter results before join, hiding the bug
> spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
> sql("select id as a1 from range(1000)").createOrReplaceTempView("t1")
> sql("select id * 2 as b1, -id as b2 from 
> range(1000)").createOrReplaceTempView("t2")
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> sql("""select b1, a1, b2 FROM t1 INNER JOIN t2 ON b1 = a1 AND b2 = a1""").show
> {code}
> In the results, it's expected that all columns are equal (see join condition).
> But the result is:
> {code:java}
> +---+---+---+
> | b1| a1| b2|
> +---+---+---+
> |  0|  0|  0|
> |  2|  2| -1|
> |  4|  4| -2|
> |  6|  6| -3|
> |  8|  8| -4|
> 
> {code}
> I traced it to {{EnsureRequirements.reorder}} which was introduced by 
> [https://github.com/apache/spark/pull/16985] and 
> [https://github.com/apache/spark/pull/20041]
> It leads to an incorrect plan:
> {code:java}
> == Physical Plan ==
> *(5) Project [b1#735672L, a1#735669L, b2#735673L]
> +- *(5) SortMergeJoin [a1#735669L, a1#735669L], [b1#735672L, b1#735672L], 
> Inner
>:- *(2) Sort [a1#735669L ASC NULLS FIRST, a1#735669L ASC NULLS FIRST], 
> false, 0
>:  +- Exchange hashpartitioning(a1#735669L, a1#735669L, 1)
>: +- *(1) Project [id#735670L AS a1#735669L]
>:+- *(1) Range (0, 1000, step=1, splits=8)
>+- *(4) Sort [b1#735672L ASC NULLS FIRST, b2#735673L ASC NULLS FIRST], 
> false, 0
>   +- Exchange hashpartitioning(b1#735672L, b2#735673L, 1)
>  +- *(3) Project [(id#735674L * 2) AS b1#735672L, -id#735674L AS 
> b2#735673L]
> +- *(3) Range (0, 1000, step=1, splits=8)
> {code}
> The SortMergeJoin keys are wrong: key b2 is missing completely.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2018-06-14 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512691#comment-16512691
 ] 

Thomas Graves commented on SPARK-22148:
---

ok, just update if you start working on it. thanks.

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
>Priority: Major
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors to the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 
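For concreteness, these are the settings involved in the timeline described above; the timeout values are the defaults the description mentions, and enabling blacklisting and dynamic allocation is shown only for illustration, not as a recommendation.
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.blacklist.enabled", "true")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  # idle executors released
         .config("spark.network.timeout", "120s")                       # heartbeat timeout
         .config("spark.speculation", "false")
         .getOrCreate())
{code}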



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24539) HistoryServer does not display metrics from tasks that complete after stage failure

2018-06-14 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512643#comment-16512643
 ] 

Thomas Graves commented on SPARK-24539:
---

It's possible. I thought that when I checked the history server I was actually 
seeing them aggregated properly, but I don't know if I checked the specific 
task events. Probably all related.

> HistoryServer does not display metrics from tasks that complete after stage 
> failure
> ---
>
> Key: SPARK-24539
> URL: https://issues.apache.org/jira/browse/SPARK-24539
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Imran Rashid
>Priority: Major
>
> I noticed that task metrics for completed tasks with a stage failure do not 
> show up in the new history server.  I have a feeling this is because all of 
> the tasks succeeded *after* the stage had been failed (so they were 
> completions from a "zombie" taskset).  The task metrics (eg. the shuffle read 
> size & shuffle write size) do not show up at all, either in the task table, 
> the executor table, or the overall stage summary metrics.  (they might not 
> show up in the job summary page either, but in the event logs I have, there 
> is another successful stage attempt after this one, and that is the only 
> thing which shows up in the jobs page.)  If you get task details from the api 
> endpoint (eg. 
> http://[host]:[port]/api/v1/applications/[app-id]/stages/[stage-id]/[stage-attempt])
>  then you can see the successful tasks and all the metrics
> Unfortunately the event logs I have are huge and I don't have a small repro 
> handy, but I hope that description is enough to go on.
> I loaded the event logs I have in the SHS from spark 2.2 and they appear fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24553) Job UI redirect causing http 302 error

2018-06-14 Thread Steven Kallman (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Kallman updated SPARK-24553:
---
Description: 
When on the Spark UI (port 4040) jobs or stages tab, the href links for the 
individual jobs or stages are missing a '/' before the '?id'. This causes a 
redirect to the address with a '/', which breaks the use of a reverse proxy.

 

localhost:4040/jobs/job?id=2 --> localhost:4040/jobs/job/?id=2

localhost:4040/stages/stage?id=3&attempt=0 --> 
localhost:4040/stages/stage/?id=3&attempt=0

 

Will submit pull request with proposed fix

  was:
When on spark UI port of 4040 jobs or stages tab, the href links for the 
individual jobs or stages are missing a '/' before the '?id' this causes a 
redirect to the address with a '/' which is breaking the use of a reverse proxy

 

localhost:4040/jobs/job?id=2 --> localhost:4040/jobs/job/?id=2

localhost:4040/stages/stage?id=3&attempt=0 --> 
localhost:4040/stages/stage/?id=3&attempt=0

 

Will submit pull request with proposed fix


> Job UI redirect causing http 302 error
> --
>
> Key: SPARK-24553
> URL: https://issues.apache.org/jira/browse/SPARK-24553
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1, 2.3.0, 2.3.1
>Reporter: Steven Kallman
>Priority: Minor
>
> When on the Spark UI (port 4040) jobs or stages tab, the href links for the 
> individual jobs or stages are missing a '/' before the '?id'. This causes a 
> redirect to the address with a '/', which breaks the use of a reverse 
> proxy.
>  
> localhost:4040/jobs/job?id=2 --> localhost:4040/jobs/job/?id=2
> localhost:4040/stages/stage?id=3&attempt=0 --> 
> localhost:4040/stages/stage/?id=3&attempt=0
>  
> Will submit pull request with proposed fix



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-14 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512519#comment-16512519
 ] 

Thomas Graves commented on SPARK-24552:
---

Sorry, just realized the v2 API is still marked experimental, so downgrading to 
critical.

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Critical
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.
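To make the failure mode concrete, a small hypothetical sketch of an in-place naming scheme like the one described above (the pattern and extension are invented for illustration):
{code:python}
def output_file(partition: int, attempt: int, ext: str = "parquet") -> str:
    # Two stage attempts reuse the same (partition, attempt) pair, so the
    # zombie attempt and the retry map to the same file name.
    return f"part-{partition}-{attempt}.{ext}"

# Writer for the first stage attempt (later a zombie) and the writer for the
# retried stage both see partition=3, attempt=0:
zombie_file = output_file(partition=3, attempt=0)
retry_file = output_file(partition=3, attempt=0)
assert zombie_file == retry_file  # aborting one deletes the other's committed file
{code}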



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-14 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24552:
--
Priority: Critical  (was: Blocker)

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Critical
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2018-06-14 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512510#comment-16512510
 ] 

Imran Rashid commented on SPARK-22148:
--

[~tgraves] we might be able to work on this soon -- a week or two out at least, 
though.  I know you mentioned some interest in looking at this too, so please 
let us know if you want to take it up.

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
>Priority: Major
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors to the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-14 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512504#comment-16512504
 ] 

Thomas Graves commented on SPARK-24552:
---

Note: if this is a correctness bug and can cause data loss/corruption, it needs 
to be a blocker. Changed to blocker; if I am misunderstanding, please update.

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Blocker
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-14 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24552:
--
Priority: Blocker  (was: Major)

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Blocker
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-14 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512500#comment-16512500
 ] 

Thomas Graves commented on SPARK-24552:
---

I agree, I don't think changing the attempt number at this point helps much 
and could cause confusion.  I would rather see a change like this if we do a 
major reworking of the scheduler.

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22239) User-defined window functions with pandas udf (unbounded window)

2018-06-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-22239:
---
Description: 
Window functions are another place where we can benefit from vectorized udfs and 
add another useful function to the pandas_udf suite.

Example usage (preliminary):
{code:java}
w = Window.partitionBy('id').rowsBetween(Window.unboundedPreceding,
                                         Window.unboundedFollowing)

@pandas_udf(DoubleType())
def mean_udf(v):
    return v.mean()

df.withColumn('v_mean', mean_udf(df.v1).over(w))
{code}

  was:
Window functions are another place where we can benefit from vectorized udfs and 
add another useful function to the pandas_udf suite.

Example usage (preliminary):

{code:java}
w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)

@pandas_udf(DoubleType())
def ema(v1):
    return v1.ewm(alpha=0.5).mean().iloc[-1]

df.withColumn('v1_ema', ema(df.v1).over(w))
{code}




> User-defined window functions with pandas udf (unbounded window)
> 
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Window functions are another place where we can benefit from vectorized udfs and 
> add another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').rowsBetween(Window.unboundedPreceding,
>                                          Window.unboundedFollowing)
> @pandas_udf(DoubleType())
> def mean_udf(v):
>     return v.mean()
> df.withColumn('v_mean', mean_udf(df.v1).over(w))
> {code}
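
As a plain-pandas illustration of what the preliminary example above computes 
(made-up data; an unbounded window simply means every row in an 'id' partition 
sees the whole partition):
{code:python}
import pandas as pd

pdf = pd.DataFrame({"id": [1, 1, 2, 2], "v1": [1.0, 2.0, 3.0, 4.0]})
# Equivalent of mean_udf over an unbounded window per 'id' partition.
pdf["v_mean"] = pdf.groupby("id")["v1"].transform("mean")
print(pdf)
{code}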



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24561) User-defined window functions with pandas udf (bounded window)

2018-06-14 Thread Li Jin (JIRA)
Li Jin created SPARK-24561:
--

 Summary: User-defined window functions with pandas udf (bounded 
window)
 Key: SPARK-24561
 URL: https://issues.apache.org/jira/browse/SPARK-24561
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.3.1
Reporter: Li Jin
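
As a rough plain-pandas illustration of the bounded-window semantics this 
sub-task targets (made-up data; a centered 5-element window is analogous to 
rowsBetween(-2, 2) in the Spark Window API):
{code:python}
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
# Each value is averaged with up to two neighbours on each side, i.e. a bounded window.
print(s.rolling(window=5, center=True, min_periods=1).mean())
{code}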






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22239) User-defined window functions with pandas udf (unbounded window)

2018-06-14 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-22239:
---
Summary: User-defined window functions with pandas udf (unbounded window)  
(was: User-defined window functions with pandas udf)

> User-defined window functions with pandas udf (unbounded window)
> 
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Window functions are another place where we can benefit from vectorized udfs and 
> add another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2018-06-14 Thread Matt Mould (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512433#comment-16512433
 ] 

Matt Mould edited comment on SPARK-13587 at 6/14/18 1:21 PM:
-

What is the current status of this ticket please? This 
[article|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]
 suggests that it's done, but it doesn't work for me with the following command.
{code:java}
spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py{code}


was (Author: mattmould):
What is the current status of this ticket please? This 
[article|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]
 suggests that it's done, but the it doesn't work for me with the following 
command.
{code:java}
spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py{code}

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time consuming, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2018-06-14 Thread Matt Mould (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512433#comment-16512433
 ] 

Matt Mould commented on SPARK-13587:


What is the current status of this ticket please? This 
[article|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]
 suggests that it's done, but it doesn't work for me with the following 
command.
{code:java}
spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py{code}

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time consuming, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to the distributed environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24560) Fix some getTimeAsMs as getTimeAsSeconds

2018-06-14 Thread xueyu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xueyu updated SPARK-24560:
--
External issue URL: https://github.com/apache/spark/pull/21567

> Fix some getTimeAsMs as getTimeAsSeconds
> 
>
> Key: SPARK-24560
> URL: https://issues.apache.org/jira/browse/SPARK-24560
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24560) Fix some getTimeAsMs as getTimeAsSeconds

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512330#comment-16512330
 ] 

Apache Spark commented on SPARK-24560:
--

User 'xueyumusic' has created a pull request for this issue:
https://github.com/apache/spark/pull/21567

> Fix some getTimeAsMs as getTimeAsSeconds
> 
>
> Key: SPARK-24560
> URL: https://issues.apache.org/jira/browse/SPARK-24560
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24560) Fix some getTimeAsMs as getTimeAsSeconds

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24560:


Assignee: (was: Apache Spark)

> Fix some getTimeAsMs as getTimeAsSeconds
> 
>
> Key: SPARK-24560
> URL: https://issues.apache.org/jira/browse/SPARK-24560
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24560) Fix some getTimeAsMs as getTimeAsSeconds

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24560:


Assignee: Apache Spark

> Fix some getTimeAsMs as getTimeAsSeconds
> 
>
> Key: SPARK-24560
> URL: https://issues.apache.org/jira/browse/SPARK-24560
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Spark Core
>Affects Versions: 2.3.1
>Reporter: xueyu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24560) Fix some getTimeAsMs as getTimeAsSeconds

2018-06-14 Thread xueyu (JIRA)
xueyu created SPARK-24560:
-

 Summary: Fix some getTimeAsMs as getTimeAsSeconds
 Key: SPARK-24560
 URL: https://issues.apache.org/jira/browse/SPARK-24560
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, Spark Core
Affects Versions: 2.3.1
Reporter: xueyu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18739) Models in pyspark.classification and regression support setXXXCol methods

2018-06-14 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-18739.
--
Resolution: Not A Problem

> Models in pyspark.classification and regression support setXXXCol methods
> -
>
> Key: SPARK-18739
> URL: https://issues.apache.org/jira/browse/SPARK-18739
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Major
>
> Currently, models in pyspark don't support {{setXXXCol}} methods at all.
> I updated the models in {{classification.py}} according to the hierarchy on the 
> Scala side (a simplified sketch follows after this description):
> 1, add {{setFeaturesCol}} and {{setPredictionCol}} in class 
> {{JavaPredictionModel}}
> 2, add {{setRawPredictionCol}} in class {{JavaClassificationModel}}
> 3, create class {{JavaProbabilisticClassificationModel}} inheriting 
> {{JavaClassificationModel}}, and add {{setProbabilityCol}} in it
> 4, {{LogisticRegressionModel}}, {{DecisionTreeClassificationModel}}, 
> {{RandomForestClassificationModel}} and {{NaiveBayesModel}} inherit 
> {{JavaProbabilisticClassificationModel}}
> 5, {{GBTClassificationModel}} and {{MultilayerPerceptronClassificationModel}} 
> inherit {{JavaClassificationModel}}
> 6, {{OneVsRestModel}} inherits {{JavaModel}}, and adds {{setFeaturesCol}} and 
> {{setPredictionCol}} methods.
> With regard to models in clustering and features, I suggest that we first add 
> some abstract classes like {{ClusteringModel}}, 
> {{ProbabilisticClusteringModel}} and {{FeatureModel}} on the Scala side, 
> otherwise we need to manually add setXXXCol methods one by one.
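
A simplified Python sketch of the proposed layering (stand-in classes only; the 
real pyspark.ml models wrap JVM objects, so these setters are illustrative):
{code:python}
class JavaPredictionModel(object):
    def setFeaturesCol(self, value):
        self.featuresCol = value
        return self

    def setPredictionCol(self, value):
        self.predictionCol = value
        return self


class JavaClassificationModel(JavaPredictionModel):
    def setRawPredictionCol(self, value):
        self.rawPredictionCol = value
        return self


class JavaProbabilisticClassificationModel(JavaClassificationModel):
    def setProbabilityCol(self, value):
        self.probabilityCol = value
        return self


# Items 4-5 above: e.g. a LogisticRegressionModel would inherit the probabilistic
# variant, while a GBTClassificationModel would inherit JavaClassificationModel.
class LogisticRegressionModel(JavaProbabilisticClassificationModel):
    pass
{code}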



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.

2018-06-14 Thread sandeep katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep katta updated SPARK-24558:
--
Comment: was deleted

(was: pull request https://github.com/apache/spark/pull/21565)

> Driver prints the wrong info in the log when the executor which holds 
> cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
> 
>
> Key: SPARK-24558
> URL: https://issues.apache.org/jira/browse/SPARK-24558
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: sandeep katta
>Priority: Minor
>
>  
> launch spark-sql
> spark-sql>cache table sample;
> 2018-05-15 14:02:28 INFO ExecutorAllocationManager:54 - New executor 2 has 
> registered (new total is 1)
> 2018-05-15 14:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 
> (TID 0, vm3, executor 2, partition 0, NODE_LOCAL, 7956 bytes)
> 2018-05-15 14:02:28 INFO BlockManagerMasterEndpoint:54 - Registering block 
> manager vm3:53439 with 93.3 MB RAM, BlockManagerId(2, vm3, 53439, None)
> 2018-05-15 14:02:28 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in 
> memory on vm3:53439 (size: 9.5 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in 
> memory on vm3:53439 (size: 33.8 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - 
> Registered executor NettyRpcEndpointRef(spark-client://Executor) 
> (10.18.99.35:44288) with ID 1
> 2018-05-15 14:02:29 INFO ExecutorAllocationManager:54 - New executor 1 has 
> registered (new total is 2)
> 
> ...
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 2
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Removing executor 2 
> because it has been idle for 60 seconds (new desired total will be 1) *//It 
> should be 120 not 60*
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 1
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Removing executor 1 
> because it has been idle for 60 seconds (new desired total will be 0)
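
A minimal plain-Python sketch of the mismatch described above (names and values 
are illustrative, not Spark's actual ExecutorAllocationManager fields): removal 
of an executor holding cached blocks is governed by the cached-executor idle 
timeout, but the log message is built from the plain idle timeout.
{code:python}
executor_idle_timeout_s = 60          # spark.dynamicAllocation.executorIdleTimeout
cached_executor_idle_timeout_s = 120  # spark.dynamicAllocation.cachedExecutorIdleTimeout

def removal_log_message(holds_cached_blocks):
    # The timeout that actually expired for this executor...
    effective_timeout = (cached_executor_idle_timeout_s if holds_cached_blocks
                         else executor_idle_timeout_s)
    # ...but the buggy message always reports the plain idle timeout, which is why
    # the driver log above prints "idle for 60 seconds" instead of 120.
    buggy = "idle for %d seconds" % executor_idle_timeout_s
    fixed = "idle for %d seconds" % effective_timeout
    return buggy, fixed

print(removal_log_message(holds_cached_blocks=True))
{code}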



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24558:


Assignee: Apache Spark

> Driver prints the wrong info in the log when the executor which holds 
> cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
> 
>
> Key: SPARK-24558
> URL: https://issues.apache.org/jira/browse/SPARK-24558
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: sandeep katta
>Assignee: Apache Spark
>Priority: Minor
>
>  
> launch spark-sql
> spark-sql>cache table sample;
> 2018-05-15 14:02:28 INFO ExecutorAllocationManager:54 - New executor 2 has 
> registered (new total is 1)
> 2018-05-15 14:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 
> (TID 0, vm3, executor 2, partition 0, NODE_LOCAL, 7956 bytes)
> 2018-05-15 14:02:28 INFO BlockManagerMasterEndpoint:54 - Registering block 
> manager vm3:53439 with 93.3 MB RAM, BlockManagerId(2, vm3, 53439, None)
> 2018-05-15 14:02:28 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in 
> memory on vm3:53439 (size: 9.5 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in 
> memory on vm3:53439 (size: 33.8 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - 
> Registered executor NettyRpcEndpointRef(spark-client://Executor) 
> (10.18.99.35:44288) with ID 1
> 2018-05-15 14:02:29 INFO ExecutorAllocationManager:54 - New executor 1 has 
> registered (new total is 2)
> 
> ...
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 2
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Removing executor 2 
> because it has been idle for 60 seconds (new desired total will be 1) *//It 
> should be 120 not 60*
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 1
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Removing executor 1 
> because it has been idle for 60 seconds (new desired total will be 0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512301#comment-16512301
 ] 

Apache Spark commented on SPARK-24558:
--

User 'sandeep-katta' has created a pull request for this issue:
https://github.com/apache/spark/pull/21565

> Driver prints the wrong info in the log when the executor which holds 
> cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
> 
>
> Key: SPARK-24558
> URL: https://issues.apache.org/jira/browse/SPARK-24558
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: sandeep katta
>Priority: Minor
>
>  
> launch spark-sql
> spark-sql>cache table sample;
> 2018-05-15 14:02:28 INFO ExecutorAllocationManager:54 - New executor 2 has 
> registered (new total is 1)
> 2018-05-15 14:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 
> (TID 0, vm3, executor 2, partition 0, NODE_LOCAL, 7956 bytes)
> 2018-05-15 14:02:28 INFO BlockManagerMasterEndpoint:54 - Registering block 
> manager vm3:53439 with 93.3 MB RAM, BlockManagerId(2, vm3, 53439, None)
> 2018-05-15 14:02:28 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in 
> memory on vm3:53439 (size: 9.5 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in 
> memory on vm3:53439 (size: 33.8 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - 
> Registered executor NettyRpcEndpointRef(spark-client://Executor) 
> (10.18.99.35:44288) with ID 1
> 2018-05-15 14:02:29 INFO ExecutorAllocationManager:54 - New executor 1 has 
> registered (new total is 2)
> 
> ...
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 2
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Removing executor 2 
> because it has been idle for 60 seconds (new desired total will be 1) *//It 
> should be 120 not 60*
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 1
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Removing executor 1 
> because it has been idle for 60 seconds (new desired total will be 0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24558:


Assignee: (was: Apache Spark)

> Driver prints the wrong info in the log when the executor which holds 
> cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
> 
>
> Key: SPARK-24558
> URL: https://issues.apache.org/jira/browse/SPARK-24558
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: sandeep katta
>Priority: Minor
>
>  
> launch spark-sql
> spark-sql>cache table sample;
> 2018-05-15 14:02:28 INFO ExecutorAllocationManager:54 - New executor 2 has 
> registered (new total is 1)
> 2018-05-15 14:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 
> (TID 0, vm3, executor 2, partition 0, NODE_LOCAL, 7956 bytes)
> 2018-05-15 14:02:28 INFO BlockManagerMasterEndpoint:54 - Registering block 
> manager vm3:53439 with 93.3 MB RAM, BlockManagerId(2, vm3, 53439, None)
> 2018-05-15 14:02:28 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in 
> memory on vm3:53439 (size: 9.5 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in 
> memory on vm3:53439 (size: 33.8 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - 
> Registered executor NettyRpcEndpointRef(spark-client://Executor) 
> (10.18.99.35:44288) with ID 1
> 2018-05-15 14:02:29 INFO ExecutorAllocationManager:54 - New executor 1 has 
> registered (new total is 2)
> 
> ...
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 2
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Removing executor 2 
> because it has been idle for 60 seconds (new desired total will be 1) *//It 
> should be 120 not 60*
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 1
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Removing executor 1 
> because it has been idle for 60 seconds (new desired total will be 0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24559) Some zip files passed with spark-submit --archives causing "invalid CEN header" error

2018-06-14 Thread James Porritt (JIRA)
James Porritt created SPARK-24559:
-

 Summary: Some zip files passed with spark-submit --archives 
causing "invalid CEN header" error
 Key: SPARK-24559
 URL: https://issues.apache.org/jira/browse/SPARK-24559
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.2.0
Reporter: James Porritt


I'm encountering an error when submitting zip files via spark-submit --archives 
that are over 2 GB and have the zip64 flag set.

{{PYSPARK_PYTHON=./ROOT/myspark/bin/python 
/usr/hdp/current/spark2-client/bin/spark-submit \}}
{{ --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ROOT/myspark/bin/python \}}
{{ --master=yarn \}}
{{ --deploy-mode=cluster \}}
{{ --driver-memory=4g \}}
{{ --archives=myspark.zip#ROOT \}}
{{ --num-executors=32 \}}
{{ --packages com.databricks:spark-avro_2.11:4.0.0 \}}
{{ foo.py}}

(As a bit of background, I'm trying to prepare files using the trick of zipping 
a conda environment and passing the zip file via --archives, as per: 
https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html)

myspark.zip is a zipped conda environment. It was created using Python with the 
zipfile package. The files are stored without deflation and with the zip64 flag 
set. foo.py is my application code. This normally works, but if myspark.zip is 
greater than 2 GB and has the zip64 flag set I get:

java.util.zip.ZipException: invalid CEN header (bad signature)
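
For reference, a minimal sketch of how such an archive can be produced with 
Python's zipfile module (the environment directory and archive name are 
placeholders; the key points are ZIP_STORED, i.e. no deflation, and 
allowZip64=True):
{code:python}
import os
import zipfile

env_dir = "myspark"      # placeholder: path to the conda environment
archive = "myspark.zip"  # placeholder: archive later passed via --archives

with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED, allowZip64=True) as zf:
    for root, _, files in os.walk(env_dir):
        for name in files:
            path = os.path.join(root, name)
            # Store every file uncompressed, preserving paths relative to the env root.
            zf.write(path, arcname=os.path.relpath(path, env_dir))
{code}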

There seems to be much written on the subject, and using the java.util.zip 
library I was able to write Java code that reproduces this error for one of the 
problematic zip files, as well as code that avoids it.

Spark compile info:

{{Welcome to}}
{{      ____              __}}
{{     / __/__  ___ _____/ /__}}
{{    _\ \/ _ \/ _ `/ __/  '_/}}
{{   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.4.0-91}}
{{      /_/}}

{{Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_112}}
{{Branch HEAD}}
{{Compiled by user jenkins on 2018-01-04T10:41:05Z}}
{{Revision a24017869f5450397136ee8b11be818e7cd3facb}}
{{Url g...@github.com:hortonworks/spark2.git}}
{{Type --help for more information.}}

YARN logs on the console after the above command. I've tried both 
--deploy-mode=cluster and --deploy-mode=client.

{{18/06/13 16:00:22 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable}}
{{18/06/13 16:00:23 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.}}
{{18/06/13 16:00:23 INFO RMProxy: Connecting to ResourceManager at 
myhost2.myfirm.com/10.87.11.17:8050}}
{{18/06/13 16:00:23 INFO Client: Requesting a new application from cluster with 
6 NodeManagers}}
{{18/06/13 16:00:23 INFO Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (221184 MB per 
container)}}
{{18/06/13 16:00:23 INFO Client: Will allocate AM container, with 18022 MB 
memory including 1638 MB overhead}}
{{18/06/13 16:00:23 INFO Client: Setting up container launch context for our 
AM}}
{{18/06/13 16:00:23 INFO Client: Setting up the launch environment for our AM 
container}}
{{18/06/13 16:00:23 INFO Client: Preparing resources for our AM container}}
{{18/06/13 16:00:24 INFO Client: Use hdfs cache file as spark.yarn.archive for 
HDP, 
hdfsCacheFile:hdfs://myhost.myfirm.com:8020/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz}}
{{18/06/13 16:00:24 INFO Client: Source and destination file systems are the 
same. Not copying 
hdfs://myhost.myfirm.com:8020/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz}}
{{18/06/13 16:00:24 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/com.databricks_spark-avro_2.11-4.0.0.jar -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/com.databri}}
{{cks_spark-avro_2.11-4.0.0.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.slf4j_slf4j-api-1.}}
{{7.5.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.apache.avro_avro-1.7.6.jar -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.apache.avro_avro-}}
{{1.7.6.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar 
-> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org}}
{{.codehaus.jackson_jackson-core-asl-1.9.13.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar 
-> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/o}}

[jira] [Commented] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.

2018-06-14 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512288#comment-16512288
 ] 

sandeep katta commented on SPARK-24558:
---

pull request https://github.com/apache/spark/pull/21565

> Driver prints the wrong info in the log when the executor which holds 
> cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
> 
>
> Key: SPARK-24558
> URL: https://issues.apache.org/jira/browse/SPARK-24558
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: sandeep katta
>Priority: Minor
>
>  
> launch spark-sql
> spark-sql>cache table sample;
> 2018-05-15 14:02:28 INFO ExecutorAllocationManager:54 - New executor 2 has 
> registered (new total is 1)
> 2018-05-15 14:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 
> (TID 0, vm3, executor 2, partition 0, NODE_LOCAL, 7956 bytes)
> 2018-05-15 14:02:28 INFO BlockManagerMasterEndpoint:54 - Registering block 
> manager vm3:53439 with 93.3 MB RAM, BlockManagerId(2, vm3, 53439, None)
> 2018-05-15 14:02:28 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in 
> memory on vm3:53439 (size: 9.5 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in 
> memory on vm3:53439 (size: 33.8 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - 
> Registered executor NettyRpcEndpointRef(spark-client://Executor) 
> (10.18.99.35:44288) with ID 1
> 2018-05-15 14:02:29 INFO ExecutorAllocationManager:54 - New executor 1 has 
> registered (new total is 2)
> 
> ...
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 2
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Removing executor 2 
> because it has been idle for 60 seconds (new desired total will be 1) *//It 
> should be 120 not 60*
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 1
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Removing executor 1 
> because it has been idle for 60 seconds (new desired total will be 0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24327) Verify and normalize a partition column name based on the JDBC resolved schema

2018-06-14 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-24327:
-
Description: We need to modify JDBC datasource code to verify and normalize 
a partition column based on the JDBC resolved schema before building 
{{JDBCRelation}}.

> Verify and normalize a partition column name based on the JDBC resolved schema
> --
>
> Key: SPARK-24327
> URL: https://issues.apache.org/jira/browse/SPARK-24327
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> We need to modify JDBC datasource code to verify and normalize a partition 
> column based on the JDBC resolved schema before building {{JDBCRelation}}.
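
For context, a hedged sketch of the user-facing read options involved (the 
connection URL, table name and bounds are placeholders, and a SparkSession named 
spark is assumed): {{partitionColumn}} is the name that has to be verified and 
normalized against the schema resolved from the database.
{code:python}
# The partitionColumn value ("ID") may differ in case or quoting from the column
# name in the resolved schema (e.g. "id"), which is what the normalization addresses.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "some_table")
      .option("partitionColumn", "ID")
      .option("lowerBound", "1")
      .option("upperBound", "100000")
      .option("numPartitions", "4")
      .load())
{code}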



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24327) Verify and normalize a partition column name based on the JDBC resolved schema

2018-06-14 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-24327:
-
Summary: Verify and normalize a partition column name based on the JDBC 
resolved schema  (was: Add an option to quote a partition column name in 
JDBCRelation)

> Verify and normalize a partition column name based on the JDBC resolved schema
> --
>
> Key: SPARK-24327
> URL: https://issues.apache.org/jira/browse/SPARK-24327
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.

2018-06-14 Thread sandeep katta (JIRA)
sandeep katta created SPARK-24558:
-

 Summary: Driver prints the wrong info in the log when the executor 
which holds cacheBlock is IDLE.Time-out value displayed is not as per 
configuration value.
 Key: SPARK-24558
 URL: https://issues.apache.org/jira/browse/SPARK-24558
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: sandeep katta


 

launch spark-sql

spark-sql>cache table sample;

2018-05-15 14:02:28 INFO ExecutorAllocationManager:54 - New executor 2 has 
registered (new total is 1)
2018-05-15 14:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 
(TID 0, vm3, executor 2, partition 0, NODE_LOCAL, 7956 bytes)
2018-05-15 14:02:28 INFO BlockManagerMasterEndpoint:54 - Registering block 
manager vm3:53439 with 93.3 MB RAM, BlockManagerId(2, vm3, 53439, None)
2018-05-15 14:02:28 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in 
memory on vm3:53439 (size: 9.5 KB, free: 93.3 MB)
2018-05-15 14:02:29 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in 
memory on vm3:53439 (size: 33.8 KB, free: 93.3 MB)
2018-05-15 14:02:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - 
Registered executor NettyRpcEndpointRef(spark-client://Executor) 
(10.18.99.35:44288) with ID 1
2018-05-15 14:02:29 INFO ExecutorAllocationManager:54 - New executor 1 has 
registered (new total is 2)

...
2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Request to remove 
executorIds: 2
2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
executor(s) 2
2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Actual list of 
executor(s) to be killed is 2
2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Removing executor 2 
because it has been idle for 60 seconds (new desired total will be 1) *//It 
should be 120 not 60*
2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Request to remove 
executorIds: 1
2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
executor(s) 1
2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Actual list of 
executor(s) to be killed is 1
2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Removing executor 1 
because it has been idle for 60 seconds (new desired total will be 0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14174) Implement the Mini-Batch KMeans

2018-06-14 Thread zhengruifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512214#comment-16512214
 ] 

zhengruifeng commented on SPARK-14174:
--

[~mlnick] [~mengxr] [~josephkb] Mini-Batch KMeans is much faster than KMeans; 
do you have any plan to include it in MLlib? Thanks

> Implement the Mini-Batch KMeans
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Major
> Attachments: MBKM.xlsx
>
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
> Since MiniBatch-KMeans with fraction=1.0 is not equivalent to KMeans, I made it 
> a new estimator.
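
For a quick feel of the trade-off on a single machine, the scikit-learn 
comparison referenced above can be reproduced as follows (plain scikit-learn, 
not Spark code):
{code:python}
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=3, random_state=0)

full = KMeans(n_clusters=3, random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=0).fit(X)

# Mini-batch k-means converges much faster, and its objective (inertia) is
# typically only slightly worse than the full-batch result.
print(full.inertia_, mini.inertia_)
{code}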



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19422) Cache input data in algorithms

2018-06-14 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-19422.
--
Resolution: Not A Problem

> Cache input data in algorithms
> --
>
> Key: SPARK-19422
> URL: https://issues.apache.org/jira/browse/SPARK-19422
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> Currently, some algorithms cache the input dataset if it is not already cached 
> (i.e., its storage level is {{StorageLevel.NONE}}):
> {{FeedForwardTrainer}}, {{LogisticRegression}}, {{OneVsRest}}, {{KMeans}}, 
> {{AFTSurvivalRegression}}, {{IsotonicRegression}}, {{LinearRegression}} with 
> a non-WLS solver
> It may be reasonable to cache the input for the others:
> {{DecisionTreeClassifier}}, {{GBTClassifier}}, {{RandomForestClassifier}}, 
> {{LinearSVC}}
> {{BisectingKMeans}}, {{GaussianMixture}}, {{LDA}}
> {{DecisionTreeRegressor}}, {{GBTRegressor}}, {{GeneralizedLinearRegression}} 
> with IRLS solver, {{RandomForestRegressor}}
> {{NaiveBayes}} is not included since it only makes one pass over the data.
> {{MultilayerPerceptronClassifier}} is not included since the data is cached 
> in {{FeedForwardTrainer.train}}
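
The pattern under discussion, sketched in PySpark terms (illustrative only; the 
actual logic lives in the Scala estimators): cache the input only if the caller 
has not already cached it, and unpersist it afterwards.
{code:python}
from pyspark.storagelevel import StorageLevel

def fit_with_handled_persistence(dataset, fit_fn):
    # Cache only when the caller did not cache the input themselves.
    handle_persistence = not dataset.is_cached
    if handle_persistence:
        dataset = dataset.persist(StorageLevel.MEMORY_AND_DISK)
    try:
        return fit_fn(dataset)
    finally:
        if handle_persistence:
            dataset.unpersist()
{code}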



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512174#comment-16512174
 ] 

Apache Spark commented on SPARK-24556:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21564

> ReusedExchange should rewrite output partitioning also when child's 
> partitioning is RangePartitioning
> -
>
> Key: SPARK-24556
> URL: https://issues.apache.org/jira/browse/SPARK-24556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: yucai
>Priority: Major
>
> Currently, ReusedExchange rewrites output partitioning if the child's 
> partitioning is HashPartitioning, but it does not do the same when the child's 
> partitioning is RangePartitioning. This can sometimes introduce an extra 
> shuffle, see:
> {code:java}
> val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
> val df1 = df.as("t1")
> val df2 = df.as("t2")
> val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
> t.cache.orderBy($"t2.j").explain
> {code}
> Before fix:
> {code:sql}
> == Physical Plan ==
> *(1) Sort [j#14 ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
>+- InMemoryTableScan [i#5, j#6, i#13, j#14]
>  +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
>+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
>   :- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as...
>   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
>   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 
> 200)
>   :+- LocalTableScan [i#5, j#6]
>   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
>  +- ReusedExchange [i#13, j#14], Exchange 
> rangepartitioning(j#6 ASC NULLS FIRST, 200)
> {code}
> A better plan would avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
> 200)", like:
> {code:sql}
> == Physical Plan ==
> *(1) Sort [j#14 ASC NULLS FIRST], true, 0
> +- InMemoryTableScan [i#5, j#6, i#13, j#14]
>   +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
> +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
>:- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
>:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
>: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
>:+- LocalTableScan [i#5, j#6]
>+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
>   +- ReusedExchange [i#13, j#14], Exchange 
> rangepartitioning(j#6 ASC NULLS FIRST, 200)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24556:


Assignee: Apache Spark

> ReusedExchange should rewrite output partitioning also when child's 
> partitioning is RangePartitioning
> -
>
> Key: SPARK-24556
> URL: https://issues.apache.org/jira/browse/SPARK-24556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: yucai
>Assignee: Apache Spark
>Priority: Major
>
> Currently, ReusedExchange rewrites output partitioning if the child's 
> partitioning is HashPartitioning, but it does not do the same when the child's 
> partitioning is RangePartitioning. This can sometimes introduce an extra 
> shuffle, see:
> {code:java}
> val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
> val df1 = df.as("t1")
> val df2 = df.as("t2")
> val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
> t.cache.orderBy($"t2.j").explain
> {code}
> Before fix:
> {code:sql}
> == Physical Plan ==
> *(1) Sort [j#14 ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
>+- InMemoryTableScan [i#5, j#6, i#13, j#14]
>  +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
>+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
>   :- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as...
>   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
>   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 
> 200)
>   :+- LocalTableScan [i#5, j#6]
>   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
>  +- ReusedExchange [i#13, j#14], Exchange 
> rangepartitioning(j#6 ASC NULLS FIRST, 200)
> {code}
> A better plan would avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
> 200)", like:
> {code:sql}
> == Physical Plan ==
> *(1) Sort [j#14 ASC NULLS FIRST], true, 0
> +- InMemoryTableScan [i#5, j#6, i#13, j#14]
>   +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
> +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
>:- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
>:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
>: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
>:+- LocalTableScan [i#5, j#6]
>+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
>   +- ReusedExchange [i#13, j#14], Exchange 
> rangepartitioning(j#6 ASC NULLS FIRST, 200)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24556:


Assignee: (was: Apache Spark)

> ReusedExchange should rewrite output partitioning also when child's 
> partitioning is RangePartitioning
> -
>
> Key: SPARK-24556
> URL: https://issues.apache.org/jira/browse/SPARK-24556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: yucai
>Priority: Major
>
> Currently, ReusedExchange rewrites output partitioning if the child's 
> partitioning is HashPartitioning, but it does not do the same when the child's 
> partitioning is RangePartitioning. This can sometimes introduce an extra 
> shuffle, see:
> {code:java}
> val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
> val df1 = df.as("t1")
> val df2 = df.as("t2")
> val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
> t.cache.orderBy($"t2.j").explain
> {code}
> Before fix:
> {code:sql}
> == Physical Plan ==
> *(1) Sort [j#14 ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
>+- InMemoryTableScan [i#5, j#6, i#13, j#14]
>  +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
>+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
>   :- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as...
>   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
>   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 
> 200)
>   :+- LocalTableScan [i#5, j#6]
>   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
>  +- ReusedExchange [i#13, j#14], Exchange 
> rangepartitioning(j#6 ASC NULLS FIRST, 200)
> {code}
> A better plan would avoid "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
> 200)", like:
> {code:sql}
> == Physical Plan ==
> *(1) Sort [j#14 ASC NULLS FIRST], true, 0
> +- InMemoryTableScan [i#5, j#6, i#13, j#14]
>   +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
> +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
>:- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
>:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
>: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
>:+- LocalTableScan [i#5, j#6]
>+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
>   +- ReusedExchange [i#13, j#14], Exchange 
> rangepartitioning(j#6 ASC NULLS FIRST, 200)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11107) spark.ml should support more input column types: umbrella

2018-06-14 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512162#comment-16512162
 ] 

Lee Dongjin commented on SPARK-11107:
-

[~josephkb] Excuse me. Is there any reason this issue is still open?

> spark.ml should support more input column types: umbrella
> -
>
> Key: SPARK-11107
> URL: https://issues.apache.org/jira/browse/SPARK-11107
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for expanding the set of data types which spark.ml 
> Pipeline stages can take.  This should not involve breaking APIs, but merely 
> involve slight changes such as supporting all Numeric types instead of just 
> Double.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24557) ClusteringEvaluator support array input

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24557:


Assignee: (was: Apache Spark)

> ClusteringEvaluator support array input
> ---
>
> Key: SPARK-24557
> URL: https://issues.apache.org/jira/browse/SPARK-24557
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> Since clustering algorithms already support array input,
> {{ClusteringEvaluator}} should also support it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24557) ClusteringEvaluator support array input

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24557:


Assignee: Apache Spark

> ClusteringEvaluator support array input
> ---
>
> Key: SPARK-24557
> URL: https://issues.apache.org/jira/browse/SPARK-24557
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> Since clustering algorithms already support array input,
> {{ClusteringEvaluator}} should also support it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24557) ClusteringEvaluator support array input

2018-06-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512158#comment-16512158
 ] 

Apache Spark commented on SPARK-24557:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/21563

> ClusteringEvaluator support array input
> ---
>
> Key: SPARK-24557
> URL: https://issues.apache.org/jira/browse/SPARK-24557
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> Since clustering algorithms already support array input,
> {{ClusteringEvaluator}} should also support it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-14 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512159#comment-16512159
 ] 

Lee Dongjin commented on SPARK-24530:
-

OMG, I am sorry; I misunderstood. The documentation is also broken in my 
environment.

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated the Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from the master docs generated on my local machine. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some of the recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24557) ClusteringEvaluator support array input

2018-06-14 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-24557:


 Summary: ClusteringEvaluator support array input
 Key: SPARK-24557
 URL: https://issues.apache.org/jira/browse/SPARK-24557
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.0
Reporter: zhengruifeng


Since clustering algorithms already support array input,

{{ClusteringEvaluator}} should also support it.
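
For context, a minimal usage sketch (editor's addition, assuming the proposed array support is in place; the data and column names are made up):

{code:scala}
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
import spark.implicits._

// "features" is an Array[Double] column; with this change ClusteringEvaluator
// would accept it directly instead of requiring a Vector column.
val predictions = Seq(
  (Array(0.0, 0.1), 0),
  (Array(0.2, 0.0), 0),
  (Array(9.0, 9.1), 1),
  (Array(9.2, 9.0), 1)
).toDF("features", "prediction")

val silhouette = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .evaluate(predictions)
{code}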



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2018-06-14 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512153#comment-16512153
 ] 

Lee Dongjin commented on SPARK-4591:


[~josephkb] Excuse me. Since SPARK-14376 was resolved recently, I think we should 
resolve this issue as well.

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve critical 
> feature parity for the next release.
> h3. Instructions for 3 subtask types
> *Review tasks*: detailed review of a subpackage to identify feature gaps 
> between spark.mllib and spark.ml.
> * Should be listed as a subtask of this umbrella.
> * Review subtasks cover major algorithm groups.  To pick up a review subtask, 
> please:
> ** Comment that you are working on it.
> ** Compare the public APIs of spark.ml vs. spark.mllib.
> ** Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> ** Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> *Critical tasks*: higher priority missing features which are required for 
> this umbrella JIRA.
> * Should be linked as "requires" links.
> *Other tasks*: lower priority missing features which can be completed after 
> the critical tasks.
> * Should be linked as "contains" links.
> h4. Excluded items
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * Moving linalg to spark.ml: [SPARK-13944]
> * Streaming ML: Requires stabilizing some internal APIs of structured 
> streaming first
> h3. TODO list
> *Critical issues*
> * [SPARK-14501]: Frequent Pattern Mining
> * [SPARK-14709]: linear SVM
> * [SPARK-15784]: Power Iteration Clustering (PIC)
> *Lower priority issues*
> * Missing methods within algorithms (see Issue Links below)
> * evaluation submodule
> * stat submodule (should probably be covered in DataFrames)
> * Developer-facing submodules:
> ** optimization (including [SPARK-17136])
> ** random, rdd
> ** util
> *To be prioritized*
> * single-instance prediction: [SPARK-10413]
> * pmml [SPARK-11171]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning

2018-06-14 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24556:
--
Description: 
Currently, ReusedExchange rewrites its output partitioning when the child's 
partitioning is HashPartitioning, but it does not do the same when the child's 
partitioning is RangePartitioning. This can sometimes introduce an extra shuffle, 
see:
{code:scala}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}
Before fix:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
 +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
   +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as...
  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
  : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
  :+- LocalTableScan [i#5, j#6]
  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
 +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}
A better plan would avoid the extra "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
200)", like:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
  +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
   :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
   :+- LocalTableScan [i#5, j#6]
   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
  +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}

  was:
Currently, ReusedExchange rewrites its output partitioning when the child's 
partitioning is HashPartitioning, but it does not do the same when the child's 
partitioning is RangePartitioning. This can sometimes introduce an extra shuffle, 
see:
{code}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}
Before fix:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
 +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
   +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
  : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
  :+- LocalTableScan [i#5, j#6]
  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
 +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}
A better plan would avoid the extra "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
200)", like:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
  +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
   :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
   :+- LocalTableScan [i#5, j#6]
   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
  +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}


> ReusedExchange should rewrite output partitioning also when child's 
> partitioning is RangePartitioning
> -
>
> Key: SPARK-24556
> URL: https://issues.apache.org/jira/browse/SPARK-24556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: yucai
>Priority: Major
>
> Currently, ReusedExchange would rewrite output partitioning if 

[jira] [Updated] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning

2018-06-14 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-24556:
--
Description: 
Currently, ReusedExchange rewrites its output partitioning when the child's 
partitioning is HashPartitioning, but it does not do the same when the child's 
partitioning is RangePartitioning. This can sometimes introduce an extra shuffle, 
see:
{code}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}
Before fix:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
 +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
   +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
  : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
  :+- LocalTableScan [i#5, j#6]
  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
 +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}
A better plan would avoid the extra "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
200)", like:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
  +- InMemoryRelation [i#5, j#6, i#13, j#14], CachedRDDBuilder...
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
   :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
   :+- LocalTableScan [i#5, j#6]
   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
  +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}

  was:
Currently, ReusedExchange rewrites its output partitioning when the child's 
partitioning is HashPartitioning, but it does not do the same when the child's 
partitioning is RangePartitioning. This can sometimes introduce an extra shuffle, 
see:

{code:scala}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}

Before fix:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
 +- InMemoryRelation [i#5, j#6, i#13, j#14], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
:+- LocalTableScan [i#5, j#6]
+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
   +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS 
FIRST, 200)
,None)
   +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
  : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
  :+- LocalTableScan [i#5, j#6]
  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
 +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}

A better plan would avoid the extra "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
200)", like:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
  +- InMemoryRelation [i#5, j#6, i#13, j#14], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
:+- LocalTableScan [i#5, j#6]
+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
   +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS 
FIRST, 200)
,None)
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  

[jira] [Created] (SPARK-24556) ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning

2018-06-14 Thread yucai (JIRA)
yucai created SPARK-24556:
-

 Summary: ReusedExchange should rewrite output partitioning also 
when child's partitioning is RangePartitioning
 Key: SPARK-24556
 URL: https://issues.apache.org/jira/browse/SPARK-24556
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: yucai


Currently, ReusedExchange rewrites its output partitioning when the child's 
partitioning is HashPartitioning, but it does not do the same when the child's 
partitioning is RangePartitioning. This can sometimes introduce an extra shuffle, 
see:

{code:scala}
val df = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j")
val df1 = df.as("t1")
val df2 = df.as("t2")
val t = df1.orderBy("j").join(df2.orderBy("j"), $"t1.i" === $"t2.i", "right")
t.cache.orderBy($"t2.j").explain
{code}

Before fix:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(j#14 ASC NULLS FIRST, 200)
   +- InMemoryTableScan [i#5, j#6, i#13, j#14]
 +- InMemoryRelation [i#5, j#6, i#13, j#14], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
:+- LocalTableScan [i#5, j#6]
+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
   +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS 
FIRST, 200)
,None)
   +- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
  :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
  :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
  : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
  :+- LocalTableScan [i#5, j#6]
  +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
 +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}

A better plan would avoid the extra "Exchange rangepartitioning(j#14 ASC NULLS FIRST, 
200)", like:
{code:sql}
== Physical Plan ==
*(1) Sort [j#14 ASC NULLS FIRST], true, 0
+- InMemoryTableScan [i#5, j#6, i#13, j#14]
  +- InMemoryRelation [i#5, j#6, i#13, j#14], 
CachedRDDBuilder(true,1,StorageLevel(disk, memory, deserialized, 1 
replicas),*(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] 
as bigint)))
:  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
: +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
:+- LocalTableScan [i#5, j#6]
+- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
   +- ReusedExchange [i#13, j#14], Exchange rangepartitioning(j#6 ASC NULLS 
FIRST, 200)
,None)
+- *(2) BroadcastHashJoin [i#5], [i#13], RightOuter, BuildLeft
   :- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   :  +- *(1) Sort [j#6 ASC NULLS FIRST], true, 0
   : +- Exchange rangepartitioning(j#6 ASC NULLS FIRST, 200)
   :+- LocalTableScan [i#5, j#6]
   +- *(2) Sort [j#14 ASC NULLS FIRST], true, 0
  +- ReusedExchange [i#13, j#14], Exchange 
rangepartitioning(j#6 ASC NULLS FIRST, 200)
{code}
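
For illustration only, a simplified, self-contained sketch of the idea (stand-in types, not Spark's actual internals): when a reused exchange remaps output attributes, {{RangePartitioning}}'s ordering expressions should be rewritten the same way {{HashPartitioning}}'s expressions already are, so the reported partitioning matches and no extra shuffle is planned.

{code:scala}
// Stand-in types for illustration; not Spark's real plan/partitioning classes.
case class Attr(name: String)
sealed trait Partitioning
case class HashPartitioning(exprs: Seq[Attr], numPartitions: Int) extends Partitioning
case class RangePartitioning(ordering: Seq[Attr], numPartitions: Int) extends Partitioning
case object UnknownPartitioning extends Partitioning

// Remap attributes from the original exchange's output to the reused node's output.
def rewritePartitioning(p: Partitioning, attrMap: Map[Attr, Attr]): Partitioning = p match {
  case HashPartitioning(exprs, n) =>
    HashPartitioning(exprs.map(a => attrMap.getOrElse(a, a)), n)
  // The proposed change: rewrite RangePartitioning too, instead of falling back
  // to an unknown partitioning (which is what forces the extra shuffle).
  case RangePartitioning(ordering, n) =>
    RangePartitioning(ordering.map(a => attrMap.getOrElse(a, a)), n)
  case other => other
}
{code}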



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20932) CountVectorizer support handle persistence

2018-06-14 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-20932.
--
Resolution: Not A Problem

> CountVectorizer support handle persistence
> --
>
> Key: SPARK-20932
> URL: https://issues.apache.org/jira/browse/SPARK-20932
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Major
>
> In {{CountVectorizer.fit}}, the RDDs {{input}} and {{wordCounts}} should be 
> unpersisted after the computation.
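
For illustration only (editor's sketch of the general pattern the report refers to, not CountVectorizer's actual code): persist an intermediate RDD that is reused, then unpersist it once the result has been materialized.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
val sc = spark.sparkContext

val input = sc.parallelize(Seq("a b a", "b c")).map(_.split(" ").toSeq)
input.persist(StorageLevel.MEMORY_AND_DISK)      // cached because it is reused

val wordCounts = input.flatMap(identity).map((_, 1L)).reduceByKey(_ + _)
val vocab = wordCounts.collect()                 // triggers the computation

input.unpersist()                                // release the cache once done
{code}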



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22971) OneVsRestModel should use temporary RawPredictionCol

2018-06-14 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-22971.
--
Resolution: Not A Problem

> OneVsRestModel should use temporary RawPredictionCol
> 
>
> Key: SPARK-22971
> URL: https://issues.apache.org/jira/browse/SPARK-22971
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The issue occurs when I transform one DataFrame with two different classification 
> models, first with a {{RandomForestClassificationModel}} and then with a 
> {{OneVsRestModel}}.
> The first transform generates a new column "rawPrediction", which is 
> internally used in {{OneVsRestModel#transform}} and causes the failure.
> {code}
> scala> val df = 
> spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_multiclass_classification_data.txt")
> 18/01/05 17:08:18 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val rf = new RandomForestClassifier()
> rf: org.apache.spark.ml.classification.RandomForestClassifier = 
> rfc_c11b1e1e1f7f
> scala> val rfm = rf.fit(df)
> rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_c11b1e1e1f7f) with 20 trees
> scala> val lr = new LogisticRegression().setMaxIter(1)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_f5a5285eba06
> scala> val ovr = new OneVsRest().setClassifier(lr)
> ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_8f5584190634
> scala> val ovrModel = ovr.fit(df)
> ovrModel: org.apache.spark.ml.classification.OneVsRestModel = 
> oneVsRest_8f5584190634
> scala> val df2 = rfm.setPredictionCol("rfPred").transform(df)
> df2: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 3 
> more fields]
> scala> val df3 = ovrModel.setPredictionCol("ovrPred").transform(df2)
> java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
> already exists.
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:101)
>   at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:91)
>   at 
> org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:43)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassificationModel.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:77)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:904)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:904)
>   at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:104)
>   at 
> org.apache.spark.ml.classification.OneVsRestModel$$anonfun$7.apply(OneVsRest.scala:184)
>   at 
> org.apache.spark.ml.classification.OneVsRestModel$$anonfun$7.apply(OneVsRest.scala:173)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>   at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:173)
>   ... 50 elided
> {code}
> {{OneVsRestModel#transform}} only generates a new prediction column, and 
> should not fail because of other existing columns.
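
As a hedged aside (editor's addition, not from the original report): until the transform uses a temporary column internally, one possible workaround is to give the first model non-default output column names so they cannot clash, continuing the reproduction above.

{code:scala}
// Assumes rfm, ovrModel and df from the reproduction above.
// Renaming the RandomForest model's output columns avoids the clash with the
// default "rawPrediction"/"probability" names used inside OneVsRestModel#transform.
val df2 = rfm
  .setPredictionCol("rfPred")
  .setRawPredictionCol("rfRaw")
  .setProbabilityCol("rfProb")
  .transform(df)

val df3 = ovrModel.setPredictionCol("ovrPred").transform(df2)
{code}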



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24555) logNumExamples in KMeans/BiKM/GMM/AFT/NB

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24555:


Assignee: Apache Spark

> logNumExamples in KMeans/BiKM/GMM/AFT/NB
> 
>
> Key: SPARK-24555
> URL: https://issues.apache.org/jira/browse/SPARK-24555
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> Log numExamples in KMeans/BiKM/GMM/AFT/NB, and 
> avoid an extra pass over the dataset.
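
For illustration only (editor's sketch, not the actual implementation): the example count can be piggy-backed on a pass that already happens, via an accumulator, rather than running a separate count() job.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
val sc = spark.sparkContext

// Count examples as a side effect of the pass that does the real work,
// instead of triggering an extra count() over the dataset.
val numExamples = sc.longAccumulator("numExamples")
val data = sc.parallelize(1 to 1000).map { x =>
  numExamples.add(1L)
  x * 2.0
}
val total = data.sum()   // the single pass; the accumulator is populated here
println(s"numExamples = ${numExamples.value}, total = $total")
{code}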



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24555) logNumExamples in KMeans/BiKM/GMM/AFT/NB

2018-06-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24555:


Assignee: (was: Apache Spark)

> logNumExamples in KMeans/BiKM/GMM/AFT/NB
> 
>
> Key: SPARK-24555
> URL: https://issues.apache.org/jira/browse/SPARK-24555
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> Log numExamples in KMeans/BiKM/GMM/AFT/NB, and 
> avoid an extra pass over the dataset.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


