[jira] [Commented] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160758#comment-16160758
 ] 

Apache Spark commented on SPARK-21972:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/19186

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.
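The proposed behavior can be sketched with a minimal mock. This is plain Python, not Spark's actual API; `Dataset` and `fit_with_persistence` are hypothetical stand-ins that model only the persistence decision described above:

```python
# Minimal mock of the proposed handlePersistence behavior.
# NOT the Spark API: Dataset here only tracks whether it is persisted.

class Dataset:
    def __init__(self):
        self.persisted = False

    def cache(self):
        self.persisted = True

    def unpersist(self):
        self.persisted = False


def fit_with_persistence(data, handle_persistence=True):
    """Train on `data`, caching it only when handle_persistence is true
    and the input is not already persisted (the proposed default)."""
    we_cached = handle_persistence and not data.persisted
    if we_cached:
        data.cache()
    try:
        result = "model"  # stand-in for the actual training loop
    finally:
        # Only release data this call cached itself; never unpersist
        # something the user cached.
        if we_cached:
            data.unpersist()
    return result


# handlePersistence=True (default): uncached input is cached for
# training, then released.
d = Dataset()
fit_with_persistence(d, handle_persistence=True)
assert not d.persisted

# handlePersistence=False: the estimator never touches persistence.
d2 = Dataset()
fit_with_persistence(d2, handle_persistence=False)
assert not d2.persisted
```

Note the asymmetry the default preserves: input the user already cached is left alone, and only internally-added caching is cleaned up afterwards.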



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21972:


Assignee: (was: Apache Spark)

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.







[jira] [Assigned] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21972:


Assignee: Apache Spark

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>Assignee: Apache Spark
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately 
([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents. These 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}}.
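The parent-lineage pitfall of point b) above can be illustrated with a toy model. Again this is plain Python, not Spark's API; `select` and `any_ancestor_persisted` are hypothetical names that model lineage only:

```python
# Toy model of the parent-lineage pitfall (NOT the Spark API).
# A dataset derived from a cached parent reports itself as uncached,
# so a check on the child alone triggers a redundant cache().

class Dataset:
    def __init__(self, parent=None):
        self.parent = parent
        self.persisted = False

    def cache(self):
        self.persisted = True

    def select(self):
        # Derived dataset: a new lineage node, not itself persisted.
        return Dataset(parent=self)

    def any_ancestor_persisted(self):
        # Lineage-aware check: walk up the parent chain.
        node = self
        while node is not None:
            if node.persisted:
                return True
            node = node.parent
        return False


raw = Dataset()
raw.cache()                   # the user caches the parent
features = raw.select()      # the estimator receives a derived dataset

# Naive check (what the ticket describes): looks only at the input
# itself, so the estimator would cache() a second copy.
assert not features.persisted

# Lineage-aware check: the input is already backed by a cached parent.
assert features.any_ancestor_persisted()
```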

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents. These 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}}.


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Commented] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160624#comment-16160624
 ] 

Siddharth Murching commented on SPARK-21972:


Work on this has already begun in the following PR: 
https://github.com/apache/spark/pull/17014

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents. These 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents. These 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}}.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents. These 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}} (algorithms will not 
try to persist uncached input).


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents. These 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Comment Edited] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160624#comment-16160624
 ] 

Siddharth Murching edited comment on SPARK-21972 at 9/11/17 3:46 AM:
-

This issue is being worked on in this PR: 
https://github.com/apache/spark/pull/17014


was (Author: siddharth murching):
Work has already begun on this in this PR: 
https://github.com/apache/spark/pull/17014

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents. These 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents. These 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}} (algorithms will not 
try to persist uncached input).

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}}.


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents. These 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}} (algorithms will 
> not try to persist uncached input).






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}}.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents; these 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance. 
Unfortunately, these algorithms a) check input persistence inaccurately (as 
described in [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) 
and b) check the persistence level of the input dataset but not any of its 
parents; both of these issues can result in unwanted double-caching of input 
data & degraded performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call `cache()` on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents; these 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean `handlePersistence` param 
> (org.apache.spark.ml.param) to the abovementioned estimators so that users 
> can specify whether an ML algorithm should try to cache un-cached input data. 
> `handlePersistence` will be `true` by default, corresponding to existing 
> behavior (always persisting uncached input), but users can achieve 
> finer-grained control over input persistence by setting `handlePersistence` 
> to `false`.






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call `cache()` on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents; these 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean `handlePersistence` param 
> (org.apache.spark.ml.param) to the abovementioned estimators so that users 
> can specify whether an ML algorithm should try to cache un-cached input data. 
> `handlePersistence` will be `true` by default, corresponding to existing 
> behavior (always persisting uncached input), but users can achieve 
> finer-grained control over input persistence by setting `handlePersistence` 
> to `false`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)
Siddharth Murching created SPARK-21972:
--

 Summary: Allow users to control input data persistence in ML 
Estimators via a handlePersistence ml.Param
 Key: SPARK-21972
 URL: https://issues.apache.org/jira/browse/SPARK-21972
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.2.0
Reporter: Siddharth Murching


Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance. 
Unfortunately, these algorithms a) check input persistence inaccurately (as 
described in [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) 
and b) check the persistence level of the input dataset but not any of its 
parents; both of these issues can result in unwanted double-caching of input 
data & degraded performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor

2017-09-10 Thread poseidon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

poseidon updated SPARK-21955:
-
Description: 
While digging into how streams, chunks, and blocks work in the Netty transport 
layer, I found a nasty case.

When an OpenBlocks message is processed, the stream is registered in 
OneForOneStreamManager 
(org.apache.spark.network.server.OneForOneStreamManager#registerStream), which 
fills the StreamState with the app ID and the buffers.

When a ChunkFetchRequest is processed, registerChannel 
(org.apache.spark.network.server.OneForOneStreamManager#registerChannel) fills 
the StreamState with the channel.

In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start, OpenBlocks 
and ChunkFetchRequest arrive in sequence.

From 
spark-1.6.1\network\common\src\main\java\org\apache\spark\network\server\OneForOneStreamManager.java:
  @Override
  public void registerChannel(Channel channel, long streamId) {
    if (streams.containsKey(streamId)) {
      streams.get(streamId).associatedChannel = channel;
    }
  }
This is the only place associatedChannel is set.

  public void connectionTerminated(Channel channel) {
    // Close all streams which have been associated with the channel.
    for (Map.Entry<Long, StreamState> entry : streams.entrySet()) {
      StreamState state = entry.getValue();
      if (state.associatedChannel == channel) {
        streams.remove(entry.getKey());

        // Release all remaining buffers.
        while (state.buffers.hasNext()) {
          state.buffers.next().release();
        }
      }
    }
  }
This is the only place state.buffers is released.

If the network goes down while OpenBlocks is being processed, no 
ChunkFetchRequest message ever arrives, so the channel is never set, and 
leaked buffers accumulate in OneForOneStreamManager.

!screenshot-1.png!

If 
org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel
 is never set then, from searching the code, the StreamState will remain in 
memory forever, which may lead to an OOM in the NodeManager: the only ways to 
release it are the channel closing or someone reading the last piece of the 
block.

We should set the channel in OneForOneStreamManager#registerStream, so the 
buffers can always be released.
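A minimal, self-contained sketch of the suggested fix (toy classes, not the real Spark/Netty types — `Object` stands in for the Netty `Channel` and `Runnable` for `ManagedBuffer#release`): associate the channel at registerStream time, so connectionTerminated can always find the stream and release its buffers, even if no ChunkFetchRequest ever arrived.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the proposed change. Only the eager channel association in
// registerStream differs from the current code's flow.
class StreamManagerSketch {
  static class StreamState {
    final Iterator<Runnable> buffers;  // each Runnable releases one buffer
    Object associatedChannel;
    StreamState(Iterator<Runnable> buffers) { this.buffers = buffers; }
  }

  final AtomicLong nextStreamId = new AtomicLong();
  final ConcurrentMap<Long, StreamState> streams = new ConcurrentHashMap<>();

  // Proposed variant: the channel is known at registration time, so the
  // stream is reachable from connectionTerminated() from the very start.
  long registerStream(Iterator<Runnable> buffers, Object channel) {
    long id = nextStreamId.getAndIncrement();
    StreamState state = new StreamState(buffers);
    state.associatedChannel = channel;  // the fix: set eagerly
    streams.put(id, state);
    return id;
  }

  void connectionTerminated(Object channel) {
    for (Iterator<Map.Entry<Long, StreamState>> it = streams.entrySet().iterator(); it.hasNext(); ) {
      Map.Entry<Long, StreamState> e = it.next();
      if (e.getValue().associatedChannel == channel) {
        it.remove();
        // Release all remaining buffers, as connectionTerminated does today.
        while (e.getValue().buffers.hasNext()) {
          e.getValue().buffers.next().run();
        }
      }
    }
  }
}
```

With this, a connection that dies between OpenBlocks and the first ChunkFetchRequest still cleans up: connectionTerminated finds the stream by its eagerly-set channel and drains the buffers.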


> OneForOneStreamManager may leak memory when network is poor
> ---
>
> Key: SPARK-21955
> URL: https://issues.apache.org/jira/browse/SPARK-21955
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.6.1
> Environment: hdp 2.4.2.0-258 
> spark 1.6 
>Reporter: poseidon
> Attachments: screenshot-1.png
>
>
> just in my way to know how  stream , chunk , block works in netty found some 
> nasty case.
> process OpenBlocks message registerStream Stream in OneForOneStreamManager
> 

[jira] [Updated] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor

2017-09-10 Thread poseidon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

poseidon updated SPARK-21955:
-
Attachment: screenshot-1.png

> OneForOneStreamManager may leak memory when network is poor
> ---
>
> Key: SPARK-21955
> URL: https://issues.apache.org/jira/browse/SPARK-21955
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.6.1
> Environment: hdp 2.4.2.0-258 
> spark 1.6 
>Reporter: poseidon
> Attachments: screenshot-1.png
>
>
> just in my way to know how  stream , chunk , block works in netty found some 
> nasty case.
> process OpenBlocks message registerStream Stream in OneForOneStreamManager
> org.apache.spark.network.server.OneForOneStreamManager#registerStream
> fill with streamState with app & buber 
> process  ChunkFetchRequest registerChannel
> org.apache.spark.network.server.OneForOneStreamManager#registerChannel
> fill with streamState with channel 
> In 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher#start 
> OpenBlocks  -> ChunkFetchRequest   come in sequnce. 
> If network down in OpenBlocks  process, no more ChunkFetchRequest  message 
> then. 
> So, we can see some leaked Buffer in OneForOneStreamManager
> !attachment-name.jpg|thumbnail!
> if 
> org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel
>   is not set, then after search the code , it will remain in memory forever. 
> Because the only way to release it was in channel close , or someone read the 
> last piece of block. 
> OneForOneStreamManager#registerStream we can set channel in this method, just 
> in case of this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21854) Python interface for MLOR summary

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21854:


Assignee: Apache Spark

> Python interface for MLOR summary
> -
>
> Key: SPARK-21854
> URL: https://issues.apache.org/jira/browse/SPARK-21854
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>
> Python interface for MLOR summary



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21971) Too many open files in Spark due to concurrent files being opened

2017-09-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21971:
--
Affects Version/s: 2.2.0

> Too many open files in Spark due to concurrent files being opened
> -
>
> Key: SPARK-21971
> URL: https://issues.apache.org/jira/browse/SPARK-21971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> When running TPC-DS query 67 on a 1 TB dataset on a multi-node cluster, it 
> consistently fails with a "too many open files" exception.
> {noformat}
> O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in 
> 394 ms on machine111.xyz (executor 2) (189/200)
> 17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 
> 844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor 
> 6) (190/200)
> 17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage 
> 844.0 (TID 243904, machine1.xyz, executor 1): 
> java.nio.file.FileSystemException: 
> /grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183:
>  Too many open files
> at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at 
> sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
> at java.nio.channels.FileChannel.open(FileChannel.java:287)
> at java.nio.channels.FileChannel.open(FileChannel.java:335)
> at 
> org.apache.spark.io.NioBufferedFileInputStream.(NioBufferedFileInputStream.java:43)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.(UnsafeSorterSpillReader.java:75)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607)
> at 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169)
> at 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173)
> {noformat}
> Cluster was configured with multiple cores per executor. 
> Window function uses "spark.sql.windowExec.buffer.spill.threshold=4096" which 
> causes large number of spills in larger dataset. With multiple cores per 
> executor, this reproduces easily. 
> {{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for 
> every available spillWriter. {{UnsafeSorterSpillReader}} opens its file in the 
> constructor and only closes it as part of its close() call, which causes the 
> "too many open files" issue.
> Note that this is not a file leak, but rather a matter of how many files are 
> open concurrently at any given time, depending on the dataset being processed.
> One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" 
> so that fewer spill files are generated, but it is hard to determine the 
> sweet spot for all workloads. Another option is to set the file ulimit to 
> "unlimited", but that would not be a good production setting. It would be 
> good to consider reducing the number of concurrent readers created by 
> "UnsafeExternalSorter::getIterator".
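One direction suggested above can be sketched with toy classes (not the real `UnsafeSorterSpillReader`): defer opening each spill file until the merge actually reaches it, so the number of concurrently open files stays constant instead of growing with the number of spills. This is only an illustration, under the assumption that the spills are consumed sequentially.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy illustration: an eager design would construct every Reader up front
// (each "opening its file" immediately), while this lazy design opens each
// one only when its turn comes and closes it before moving on.
// open/maxOpen track how many are open at once.
class LazySpillSketch {
  static int open = 0, maxOpen = 0;

  static class Reader implements AutoCloseable {
    Reader() { open++; maxOpen = Math.max(maxOpen, open); } // "opens the file"
    @Override public void close() { open--; }               // "closes the file"
  }

  // Consume n spills sequentially, constructing each reader on demand.
  static void mergeLazily(int n) {
    List<Supplier<Reader>> spills = new ArrayList<>();
    for (int i = 0; i < n; i++) spills.add(Reader::new);
    for (Supplier<Reader> s : spills) {
      try (Reader r = s.get()) { /* drain this spill */ }
    }
  }
}
```

With 100 spills, this keeps at most one file open at a time, rather than 100 as in the eager-construction scheme described above.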



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21854) Python interface for MLOR summary

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21854:


Assignee: (was: Apache Spark)

> Python interface for MLOR summary
> -
>
> Key: SPARK-21854
> URL: https://issues.apache.org/jira/browse/SPARK-21854
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Python interface for MLOR summary



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21854) Python interface for MLOR summary

2017-09-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160616#comment-16160616
 ] 

Apache Spark commented on SPARK-21854:
--

User 'jmwdpk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19185

> Python interface for MLOR summary
> -
>
> Key: SPARK-21854
> URL: https://issues.apache.org/jira/browse/SPARK-21854
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Python interface for MLOR summary



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor

2017-09-10 Thread poseidon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

poseidon updated SPARK-21955:
-
Description: 
just in my way to know how  stream , chunk , block works in netty found some 
nasty case.

process OpenBlocks message registerStream Stream in OneForOneStreamManager
org.apache.spark.network.server.OneForOneStreamManager#registerStream
fill with streamState with app & buber 

process  ChunkFetchRequest registerChannel
org.apache.spark.network.server.OneForOneStreamManager#registerChannel
fill with streamState with channel 

In 
org.apache.spark.network.shuffle.OneForOneBlockFetcher#start 

OpenBlocks  -> ChunkFetchRequest   come in sequnce. 

If network down in OpenBlocks  process, no more ChunkFetchRequest  message 
then. 

So, we can see some leaked Buffer in OneForOneStreamManager



if 
org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel
  is not set, then after search the code , it will remain in memory forever. 

Because the only way to release it was in channel close , or someone read the 
last piece of block. 


OneForOneStreamManager#registerStream we can set channel in this method, just 
in case of this case.

  was:
just in my way to know how  stream , chunk , block works in netty found some 
nasty case.

process OpenBlocks message registerStream Stream in OneForOneStreamManager
org.apache.spark.network.server.OneForOneStreamManager#registerStream
fill with streamState with app & buber 

process  ChunkFetchRequest registerChannel
org.apache.spark.network.server.OneForOneStreamManager#registerChannel
fill with streamState with channel 

In 
org.apache.spark.network.shuffle.OneForOneBlockFetcher#start 

OpenBlocks  -> ChunkFetchRequest   come in sequnce. 

If network down in OpenBlocks  process, no more ChunkFetchRequest  message 
then. 

So, we can see some leaked Buffer in OneForOneStreamManager

!attachment-name.jpg|thumbnail!

if 
org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel
  is not set, then after search the code , it will remain in memory forever. 

Because the only way to release it was in channel close , or someone read the 
last piece of block. 


OneForOneStreamManager#registerStream we can set channel in this method, just 
in case of this case.


> OneForOneStreamManager may leak memory when network is poor
> ---
>
> Key: SPARK-21955
> URL: https://issues.apache.org/jira/browse/SPARK-21955
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.6.1
> Environment: hdp 2.4.2.0-258 
> spark 1.6 
>Reporter: poseidon
>
> While learning how streams, chunks, and blocks work in Netty, I found a 
> nasty case.
> Processing an OpenBlocks message registers a stream in OneForOneStreamManager:
> org.apache.spark.network.server.OneForOneStreamManager#registerStream
> fills the StreamState with the app and buffers.
> Processing a ChunkFetchRequest registers the channel:
> org.apache.spark.network.server.OneForOneStreamManager#registerChannel
> fills the StreamState with the channel.
> In 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher#start 
> OpenBlocks -> ChunkFetchRequest come in sequence.
> If the network goes down during the OpenBlocks step, no ChunkFetchRequest 
> messages follow.
> So we can see leaked buffers in OneForOneStreamManager.
> If 
> org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel
> is never set, then (after searching the code) the StreamState will remain in 
> memory forever, because the only ways to release it are the channel closing 
> or someone reading the last chunk of the block.
> OneForOneStreamManager#registerStream could set the channel in this method, 
> to guard against exactly this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-09-10 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160550#comment-16160550
 ] 

xinzhang edited comment on SPARK-21067 at 9/11/17 2:02 AM:
---

[~dricard]

Thanks for your reply.
So do we; we also use Parquet. But another problem: when you use SQL like 
"insert overwrite table a partition(pt='2') select", it will also cause the 
Thrift Server to fail. Do you happen to have the same problem?
This only happens with tables that use partitions; "insert overwrite table a 
select" on a Parquet table without partitions works fine.


was (Author: zhangxin0112zx):
[~dricard]

Thanks for your reply.
So do we; we also use Parquet. But another problem: when you use SQL like 
"insert overwrite table a partition(pt='2') select", it will also cause the 
Thrift Server to fail. Do you happen to have the same problem?

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021), which states that 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}".
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment, but since 2.0+ we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> 

[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-09-10 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160550#comment-16160550
 ] 

xinzhang commented on SPARK-21067:
--

[~dricard]

Thanks for your reply.
So do we; we also use Parquet. But another problem: when you use SQL like 
"insert overwrite table a partition(pt='2') select", it will also cause the 
Thrift Server to fail. Do you happen to have the same problem?

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021), which states that 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}".
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment, but since 2.0+ we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> 

[jira] [Assigned] (SPARK-21971) Too many open files in Spark due to concurrent files being opened

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21971:


Assignee: (was: Apache Spark)

> Too many open files in Spark due to concurrent files being opened
> -
>
> Key: SPARK-21971
> URL: https://issues.apache.org/jira/browse/SPARK-21971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> When running Q67 of TPC-DS on a 1 TB dataset on a multi-node cluster, it 
> consistently fails with a "too many open files" exception.
> {noformat}
> O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in 
> 394 ms on machine111.xyz (executor 2) (189/200)
> 17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 
> 844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor 
> 6) (190/200)
> 17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage 
> 844.0 (TID 243904, machine1.xyz, executor 1): 
> java.nio.file.FileSystemException: 
> /grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183:
>  Too many open files
> at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at 
> sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
> at java.nio.channels.FileChannel.open(FileChannel.java:287)
> at java.nio.channels.FileChannel.open(FileChannel.java:335)
> at 
> org.apache.spark.io.NioBufferedFileInputStream.(NioBufferedFileInputStream.java:43)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.(UnsafeSorterSpillReader.java:75)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607)
> at 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169)
> at 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173)
> {noformat}
> Cluster was configured with multiple cores per executor. 
> The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", 
> which causes a large number of spills on larger datasets. With multiple 
> cores per executor, this reproduces easily. 
> {{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for 
> all the available spill writers. {{UnsafeSorterSpillReader}} opens the file 
> in its constructor and closes it later as part of its close() call. This 
> causes the too-many-open-files issue.
> Note that this is not a file leak, but rather a matter of how many files are 
> open concurrently at any given time, depending on the dataset being processed.
> One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" 
> so that fewer spill files are generated, but it is hard to determine the 
> sweet spot for all workloads. Another option is to set the file ulimit to 
> "unlimited", but that would not be a good production setting. It would be 
> good to consider reducing the number of files opened concurrently by 
> {{UnsafeExternalSorter::getIterator}}.






[jira] [Assigned] (SPARK-21971) Too many open files in Spark due to concurrent files being opened

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21971:


Assignee: Apache Spark

> Too many open files in Spark due to concurrent files being opened
> -
>
> Key: SPARK-21971
> URL: https://issues.apache.org/jira/browse/SPARK-21971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Rajesh Balamohan
>Assignee: Apache Spark
>Priority: Minor
>
> When running Q67 of TPC-DS on a 1 TB dataset on a multi-node cluster, it 
> consistently fails with a "too many open files" exception.
> {noformat}
> O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in 
> 394 ms on machine111.xyz (executor 2) (189/200)
> 17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 
> 844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor 
> 6) (190/200)
> 17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage 
> 844.0 (TID 243904, machine1.xyz, executor 1): 
> java.nio.file.FileSystemException: 
> /grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183:
>  Too many open files
> at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at 
> sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
> at java.nio.channels.FileChannel.open(FileChannel.java:287)
> at java.nio.channels.FileChannel.open(FileChannel.java:335)
> at 
> org.apache.spark.io.NioBufferedFileInputStream.(NioBufferedFileInputStream.java:43)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.(UnsafeSorterSpillReader.java:75)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607)
> at 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169)
> at 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173)
> {noformat}
> Cluster was configured with multiple cores per executor. 
> The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", 
> which causes a large number of spills on larger datasets. With multiple 
> cores per executor, this reproduces easily. 
> {{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for 
> all the available spill writers. {{UnsafeSorterSpillReader}} opens the file 
> in its constructor and closes it later as part of its close() call. This 
> causes the too-many-open-files issue.
> Note that this is not a file leak, but rather a matter of how many files are 
> open concurrently at any given time, depending on the dataset being processed.
> One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" 
> so that fewer spill files are generated, but it is hard to determine the 
> sweet spot for all workloads. Another option is to set the file ulimit to 
> "unlimited", but that would not be a good production setting. It would be 
> good to consider reducing the number of files opened concurrently by 
> {{UnsafeExternalSorter::getIterator}}.






[jira] [Commented] (SPARK-21971) Too many open files in Spark due to concurrent files being opened

2017-09-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160547#comment-16160547
 ] 

Apache Spark commented on SPARK-21971:
--

User 'rajeshbalamohan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19184

> Too many open files in Spark due to concurrent files being opened
> -
>
> Key: SPARK-21971
> URL: https://issues.apache.org/jira/browse/SPARK-21971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Rajesh Balamohan
>Priority: Minor
>
> When running Q67 of TPC-DS on a 1 TB dataset on a multi-node cluster, it 
> consistently fails with a "too many open files" exception.
> {noformat}
> O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in 
> 394 ms on machine111.xyz (executor 2) (189/200)
> 17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 
> 844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor 
> 6) (190/200)
> 17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage 
> 844.0 (TID 243904, machine1.xyz, executor 1): 
> java.nio.file.FileSystemException: 
> /grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183:
>  Too many open files
> at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at 
> sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
> at java.nio.channels.FileChannel.open(FileChannel.java:287)
> at java.nio.channels.FileChannel.open(FileChannel.java:335)
> at 
> org.apache.spark.io.NioBufferedFileInputStream.(NioBufferedFileInputStream.java:43)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.(UnsafeSorterSpillReader.java:75)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607)
> at 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169)
> at 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173)
> {noformat}
> Cluster was configured with multiple cores per executor. 
> The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", 
> which causes a large number of spills on larger datasets. With multiple 
> cores per executor, this reproduces easily. 
> {{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for 
> all the available spill writers. {{UnsafeSorterSpillReader}} opens the file 
> in its constructor and closes it later as part of its close() call. This 
> causes the too-many-open-files issue.
> Note that this is not a file leak, but rather a matter of how many files are 
> open concurrently at any given time, depending on the dataset being processed.
> One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" 
> so that fewer spill files are generated, but it is hard to determine the 
> sweet spot for all workloads. Another option is to set the file ulimit to 
> "unlimited", but that would not be a good production setting. It would be 
> good to consider reducing the number of files opened concurrently by 
> {{UnsafeExternalSorter::getIterator}}.






[jira] [Assigned] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField

2017-09-10 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-20098:
---

Assignee: Peter Szalai

> DataType's typeName method returns with 'StructF' in case of StructField
> 
>
> Key: SPARK-20098
> URL: https://issues.apache.org/jira/browse/SPARK-20098
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Peter Szalai
>Assignee: Peter Szalai
> Fix For: 2.2.1, 2.3.0
>
>
> Currently, if you want to get the typeName of a DataType and the DataType is 
> a `StructField`, you get `StructF`. 
> http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html 






[jira] [Created] (SPARK-21971) Too many open files in Spark due to concurrent files being opened

2017-09-10 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-21971:


 Summary: Too many open files in Spark due to concurrent files 
being opened
 Key: SPARK-21971
 URL: https://issues.apache.org/jira/browse/SPARK-21971
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Rajesh Balamohan
Priority: Minor


When running Q67 of TPC-DS on a 1 TB dataset on a multi-node cluster, it 
consistently fails with a "too many open files" exception.

{noformat}
O scheduler.TaskSetManager: Finished task 25.0 in stage 844.0 (TID 243786) in 
394 ms on machine111.xyz (executor 2) (189/200)
17/08/20 10:33:45 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 
844.0 (TID 243932) in 11996 ms on cn116-10.l42scl.hortonworks.com (executor 6) 
(190/200)
17/08/20 10:37:40 WARN scheduler.TaskSetManager: Lost task 144.0 in stage 844.0 
(TID 243904, machine1.xyz, executor 1): java.nio.file.FileSystemException: 
/grid/3/hadoop/yarn/local/usercache/rbalamohan/appcache/application_1490656001509_7207/blockmgr-5180e3f0-f7ed-44bb-affc-8f99f09ba7bc/28/temp_local_690afbf7-172d-4fdb-8492-3e2ebd8d5183:
 Too many open files
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at 
org.apache.spark.io.NioBufferedFileInputStream.(NioBufferedFileInputStream.java:43)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.(UnsafeSorterSpillReader.java:75)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getIterator(UnsafeExternalSorter.java:607)
at 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:169)
at 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.generateIterator(ExternalAppendOnlyUnsafeRowArray.scala:173)
{noformat}

Cluster was configured with multiple cores per executor. 

The window function uses "spark.sql.windowExec.buffer.spill.threshold=4096", 
which causes a large number of spills on larger datasets. With multiple cores 
per executor, this reproduces easily. 

{{UnsafeExternalSorter::getIterator()}} invokes {{spillWriter.getReader}} for 
all the available spill writers. {{UnsafeSorterSpillReader}} opens the file in 
its constructor and closes it later as part of its close() call. This causes 
the too-many-open-files issue.
Note that this is not a file leak, but rather a matter of how many files are 
open concurrently at any given time, depending on the dataset being processed.

One option could be to increase "spark.sql.windowExec.buffer.spill.threshold" 
so that fewer spill files are generated, but it is hard to determine the sweet 
spot for all workloads. Another option is to set the file ulimit to 
"unlimited", but that would not be a good production setting. It would be good 
to consider reducing the number of files opened concurrently by 
{{UnsafeExternalSorter::getIterator}}.
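The mitigation suggested above, bounding how many files are open at once rather than raising the ulimit, can be illustrated with a hedged, self-contained sketch (names hypothetical; this is not the actual UnsafeSorterSpillReader API): defer opening each spill file until it is actually iterated, and close it as soon as it is exhausted.

```java
import java.io.*;
import java.util.*;

public class LazySpillIteratorSketch {
    // Lazily opens `file` on first use; closes it once fully consumed,
    // so at most one descriptor per actively-read iterator is held.
    static Iterator<String> lazyLines(File file) {
        return new Iterator<String>() {
            BufferedReader reader;   // null until first hasNext()
            String next;

            private void openIfNeeded() {
                if (reader == null) {
                    try {
                        reader = new BufferedReader(new FileReader(file));
                        next = reader.readLine();
                    } catch (IOException e) { throw new UncheckedIOException(e); }
                }
            }

            public boolean hasNext() {
                openIfNeeded();
                if (next == null) close();       // exhausted: release the fd
                return next != null;
            }

            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                String line = next;
                try { next = reader.readLine(); }
                catch (IOException e) { throw new UncheckedIOException(e); }
                return line;
            }

            private void close() {
                try { if (reader != null) reader.close(); }
                catch (IOException e) { /* best effort */ }
            }
        };
    }

    public static void main(String[] args) throws IOException {
        // Many "spill files", but only the one being read is open.
        List<Iterator<String>> iters = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            File f = File.createTempFile("spill", ".tmp");
            f.deleteOnExit();
            try (PrintWriter w = new PrintWriter(f)) { w.println("row-" + i); }
            iters.add(lazyLines(f));             // no file descriptor used yet
        }
        int rows = 0;
        for (Iterator<String> it : iters) {
            while (it.hasNext()) { it.next(); rows++; }
        }
        if (rows != 100) throw new AssertionError();
        System.out.println(rows);                // 100
    }
}
```

Eagerly constructed readers (as in the report) would hold all 100 descriptors at once; the lazy variant holds one at a time while still reading every row.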




[jira] [Commented] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances

2017-09-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160491#comment-16160491
 ] 

Apache Spark commented on SPARK-21960:
--

User 'karth295' has created a pull request for this issue:
https://github.com/apache/spark/pull/19183

> Spark Streaming Dynamic Allocation should respect spark.executor.instances
> --
>
> Key: SPARK-21960
> URL: https://issues.apache.org/jira/browse/SPARK-21960
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> This check enforces that spark.executor.instances (aka --num-executors) is 
> either unset or explicitly set to 0. 
> https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207
> If spark.executor.instances is unset, the check is fine, and the property 
> defaults to 2. Spark requests the cluster manager for 2 executors to start 
> with, then adds/removes executors appropriately.
> However, if you explicitly set it to 0, the check also succeeds, but Spark 
> never asks the cluster manager for any executors. When running on YARN, I 
> repeatedly saw:
> {code:java}
> 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> {code}
> I noticed that at least Google Dataproc and Ambari explicitly set 
> spark.executor.instances to a positive number, meaning that to use dynamic 
> allocation, you would have to edit spark-defaults.conf to remove the 
> property. That's obnoxious.
> In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value 
> for --num-executors or --conf spark.executor.instances: 
> https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9
> It is much more reasonable for Streaming DRA to use spark.executor.instances, 
> just like Core DRA. I'll open a pull request to remove the check if there are 
> no objections.
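The behavior described above can be summarized as a hedged spark-defaults.conf sketch (property names are from the issue; the comments are one reading of it, not official documentation):

```properties
# Streaming DRA check (per the issue): spark.executor.instances must be unset
# or 0 for spark.streaming.dynamicAllocation.enabled=true to be accepted.
spark.streaming.dynamicAllocation.enabled   true

# Works: leave spark.executor.instances unset -> defaults to 2 initial
# executors, then Streaming DRA scales up/down.

# Rejected by the check: any positive value (what some distros ship).
# spark.executor.instances                  2

# Passes the check but hangs: Spark never requests any executors, and the
# driver repeats "Initial job has not accepted any resources" (log above).
# spark.executor.instances                  0
```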






[jira] [Assigned] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21960:


Assignee: Apache Spark

> Spark Streaming Dynamic Allocation should respect spark.executor.instances
> --
>
> Key: SPARK-21960
> URL: https://issues.apache.org/jira/browse/SPARK-21960
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Karthik Palaniappan
>Assignee: Apache Spark
>Priority: Minor
>
> This check enforces that spark.executor.instances (aka --num-executors) is 
> either unset or explicitly set to 0. 
> https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207
> If spark.executor.instances is unset, the check is fine, and the property 
> defaults to 2. Spark requests the cluster manager for 2 executors to start 
> with, then adds/removes executors appropriately.
> However, if you explicitly set it to 0, the check also succeeds, but Spark 
> never asks the cluster manager for any executors. When running on YARN, I 
> repeatedly saw:
> {code:java}
> 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> {code}
> I noticed that at least Google Dataproc and Ambari explicitly set 
> spark.executor.instances to a positive number, meaning that to use dynamic 
> allocation, you would have to edit spark-defaults.conf to remove the 
> property. That's obnoxious.
> In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value 
> for --num-executors or --conf spark.executor.instances: 
> https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9
> It is much more reasonable for Streaming DRA to use spark.executor.instances, 
> just like Core DRA. I'll open a pull request to remove the check if there are 
> no objections.
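As a rough illustration of the change discussed above (hypothetical sketch, not Spark's actual `ExecutorAllocationManager` code), the relaxed validation would treat any explicit non-negative `spark.executor.instances` as the initial executor count instead of rejecting or silently ignoring it:

```java
// Hypothetical sketch of the streaming DRA validation discussed above;
// names and defaults are illustrative, not Spark's actual implementation.
public final class StreamingDraCheck {
    /** Returns the initial executor count, or throws if the setting is invalid. */
    static int initialExecutors(Integer executorInstances) {
        if (executorInstances == null) {
            return 2; // default when spark.executor.instances is unset, per the ticket
        }
        if (executorInstances < 0) {
            throw new IllegalArgumentException("spark.executor.instances must be >= 0");
        }
        // Proposed behavior: respect the explicit value, just like Core DRA,
        // rather than requiring it to be unset or 0.
        return executorInstances;
    }

    public static void main(String[] args) {
        System.out.println(initialExecutors(null)); // unset -> default
        System.out.println(initialExecutors(4));    // explicit value respected
    }
}
```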






[jira] [Assigned] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21960:


Assignee: (was: Apache Spark)

> Spark Streaming Dynamic Allocation should respect spark.executor.instances
> --
>
> Key: SPARK-21960
> URL: https://issues.apache.org/jira/browse/SPARK-21960
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> This check enforces that spark.executor.instances (aka --num-executors) is 
> either unset or explicitly set to 0. 
> https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207
> If spark.executor.instances is unset, the check is fine, and the property 
> defaults to 2. Spark requests the cluster manager for 2 executors to start 
> with, then adds/removes executors appropriately.
> However, if you explicitly set it to 0, the check also succeeds, but Spark 
> never asks the cluster manager for any executors. When running on YARN, I 
> repeatedly saw:
> {code:java}
> 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
> Initial job has not accepted any resources; check your cluster UI to ensure 
> that workers are registered and have sufficient resources
> {code}
> I noticed that at least Google Dataproc and Ambari explicitly set 
> spark.executor.instances to a positive number, meaning that to use dynamic 
> allocation, you would have to edit spark-defaults.conf to remove the 
> property. That's obnoxious.
> In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value 
> for --num-executors or --conf spark.executor.instances: 
> https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9
> It is much more reasonable for Streaming DRA to use spark.executor.instances, 
> just like Core DRA. I'll open a pull request to remove the check if there are 
> no objections.






[jira] [Assigned] (SPARK-21970) Do a Project Wide Sweep for Redundant Throws Declarations

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21970:


Assignee: (was: Apache Spark)

> Do a Project Wide Sweep for Redundant Throws Declarations
> -
>
> Key: SPARK-21970
> URL: https://issues.apache.org/jira/browse/SPARK-21970
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Armin Braun
>Priority: Trivial
>  Labels: cleanup
>
> Unfortunately, redundant throws declarations are not caught by Checkstyle and 
> there are quite a few in the current Java codebase.
> In one case, `ShuffleExternalSorter#closeAndGetSpills`, this hides some dead 
> code too.
> I think it's worthwhile to do a sweep for these and remove them.
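For context, a redundant throws declaration looks like the following (a made-up example, not from the Spark codebase): the method declares a checked exception its body can never throw, forcing every caller to handle or propagate it and potentially hiding dead catch blocks.

```java
import java.io.IOException;

public final class RedundantThrows {
    // Redundant: the body cannot throw IOException, yet callers must
    // still handle or propagate it.
    static int parsePort(String s) throws IOException {
        return Integer.parseInt(s);
    }

    // After the sweep: same body, no spurious checked exception.
    static int parsePortClean(String s) {
        return Integer.parseInt(s);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(parsePort("8080"));
        System.out.println(parsePortClean("8080"));
    }
}
```

Because `javac` only rejects *missing* throws clauses, not unnecessary ones, these accumulate silently unless a static-analysis sweep removes them.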






[jira] [Assigned] (SPARK-21970) Do a Project Wide Sweep for Redundant Throws Declarations

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21970:


Assignee: Apache Spark

> Do a Project Wide Sweep for Redundant Throws Declarations
> -
>
> Key: SPARK-21970
> URL: https://issues.apache.org/jira/browse/SPARK-21970
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Armin Braun
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: cleanup
>
> Unfortunately, redundant throws declarations are not caught by Checkstyle and 
> there are quite a few in the current Java codebase.
> In one case, `ShuffleExternalSorter#closeAndGetSpills`, this hides some dead 
> code too.
> I think it's worthwhile to do a sweep for these and remove them.






[jira] [Commented] (SPARK-21970) Do a Project Wide Sweep for Redundant Throws Declarations

2017-09-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160470#comment-16160470
 ] 

Apache Spark commented on SPARK-21970:
--

User 'original-brownbear' has created a pull request for this issue:
https://github.com/apache/spark/pull/19182

> Do a Project Wide Sweep for Redundant Throws Declarations
> -
>
> Key: SPARK-21970
> URL: https://issues.apache.org/jira/browse/SPARK-21970
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Armin Braun
>Priority: Trivial
>  Labels: cleanup
>
> Unfortunately, redundant throws declarations are not caught by Checkstyle and 
> there are quite a few in the current Java codebase.
> In one case, `ShuffleExternalSorter#closeAndGetSpills`, this hides some dead 
> code too.
> I think it's worthwhile to do a sweep for these and remove them.






[jira] [Created] (SPARK-21970) Do a Project Wide Sweep for Redundant Throws Declarations

2017-09-10 Thread Armin Braun (JIRA)
Armin Braun created SPARK-21970:
---

 Summary: Do a Project Wide Sweep for Redundant Throws Declarations
 Key: SPARK-21970
 URL: https://issues.apache.org/jira/browse/SPARK-21970
 Project: Spark
  Issue Type: Bug
  Components: Examples, Spark Core, SQL
Affects Versions: 2.3.0
Reporter: Armin Braun
Priority: Trivial


Unfortunately, redundant throws declarations are not caught by Checkstyle and 
there are quite a few in the current Java codebase.
In one case, `ShuffleExternalSorter#closeAndGetSpills`, this hides some dead code 
too.

I think it's worthwhile to do a sweep for these and remove them.






[jira] [Updated] (SPARK-21969) CommandUtils.updateTableStats should call refreshTable

2017-09-10 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21969:

Description: 
The table is cached so even though statistics are removed, they will still be 
used by the existing sessions.


{code}
spark.range(100).write.saveAsTable("tab1")
sql("analyze table tab1 compute statistics")
sql("explain cost select distinct * from tab1").show(false)
{code}

Produces:
{code}
Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
{code}


{code}
spark.range(100).write.mode("append").saveAsTable("tab1")
sql("explain cost select distinct * from tab1").show(false)
{code}

After appending data, the same stats are used:
{code}
Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
{code}

Manually refreshing the table removes the stats
{code}
spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
sql("explain cost select distinct * from tab1").show(false)
{code}

{code}
Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
{code}

  was:
The table is cached so even though statistics are removed, they will still be 
used by the existing sessions.


{{
spark.range(100).write.saveAsTable("tab1")
sql("analyze table tab1 compute statistics")
sql("explain cost select distinct * from tab1").show(false)
}}

Produces:
{{
Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
}}


{{
spark.range(100).write.mode("append").saveAsTable("tab1")
sql("explain cost select distinct * from tab1").show(false)
}}

After appending data, the same stats are used:
{{
Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
}}

Manually refreshing the table removes the stats
{{
spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
sql("explain cost select distinct * from tab1").show(false)
}}

{{
Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
}}


> CommandUtils.updateTableStats should call refreshTable
> --
>
> Key: SPARK-21969
> URL: https://issues.apache.org/jira/browse/SPARK-21969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>
> The table is cached so even though statistics are removed, they will still be 
> used by the existing sessions.
> {code}
> spark.range(100).write.saveAsTable("tab1")
> sql("analyze table tab1 compute statistics")
> sql("explain cost select distinct * from tab1").show(false)
> {code}
> Produces:
> {code}
> Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
> hints=none)
> {code}
> {code}
> spark.range(100).write.mode("append").saveAsTable("tab1")
> sql("explain cost select distinct * from tab1").show(false)
> {code}
> After appending data, the same stats are used:
> {code}
> Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
> hints=none)
> {code}
> Manually refreshing the table removes the stats
> {code}
> spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
> sql("explain cost select distinct * from tab1").show(false)
> {code}
> {code}
> Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
> {code}
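The failure mode above is the classic stale-cache pattern. A minimal Java analogy (illustrative only, not Spark code) of why removing the stats from the metastore is not enough without also invalidating the cached relation:

```java
import java.util.HashMap;
import java.util.Map;

public final class StaleStatsDemo {
    // Plays the role of the metastore (source of truth).
    static final Map<String, Long> tableRowCounts = new HashMap<>();
    // Plays the role of the per-session cached relation.
    static final Map<String, Long> cachedStats = new HashMap<>();

    static long statsFor(String table) {
        // The cached plan wins, even after the metastore entry changes.
        return cachedStats.computeIfAbsent(table, tableRowCounts::get);
    }

    static void refreshTable(String table) {
        cachedStats.remove(table); // what updateTableStats should trigger
    }

    public static void main(String[] args) {
        tableRowCounts.put("tab1", 100L);
        System.out.println(statsFor("tab1")); // 100
        tableRowCounts.put("tab1", 200L);     // "append" updates the source of truth
        System.out.println(statsFor("tab1")); // still 100: stale
        refreshTable("tab1");
        System.out.println(statsFor("tab1")); // 200 after refresh
    }
}
```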






[jira] [Updated] (SPARK-21969) CommandUtils.updateTableStats should call refreshTable

2017-09-10 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21969:

Description: 
The table is cached so even though statistics are removed, they will still be 
used by the existing sessions.


{{
spark.range(100).write.saveAsTable("tab1")
sql("analyze table tab1 compute statistics")
sql("explain cost select distinct * from tab1").show(false)
}}

Produces:
{{
Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
}}


{{
spark.range(100).write.mode("append").saveAsTable("tab1")
sql("explain cost select distinct * from tab1").show(false)
}}

After appending data, the same stats are used:
{{
Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
}}

Manually refreshing the table removes the stats
{{
spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
sql("explain cost select distinct * from tab1").show(false)
}}

{{
Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
}}

  was:
The table is cached so even though statistics are removed, they will still be 
used by the existing sessions.


{{code}}
spark.range(100).write.saveAsTable("tab1")
sql("analyze table tab1 compute statistics")
sql("explain cost select distinct * from tab1").show(false)
{{code}}

Produces:
{{code}}
Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
{{code}}


{{code}}
spark.range(100).write.mode("append").saveAsTable("tab1")
sql("explain cost select distinct * from tab1").show(false)
{{code}}

After appending data, the same stats are used:
{{code}}
Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
{{code}}

Manually refreshing the table removes the stats
{{code}}
spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
sql("explain cost select distinct * from tab1").show(false)
{{code}}

{{code}}
Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
{{code}}


> CommandUtils.updateTableStats should call refreshTable
> --
>
> Key: SPARK-21969
> URL: https://issues.apache.org/jira/browse/SPARK-21969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>
> The table is cached so even though statistics are removed, they will still be 
> used by the existing sessions.
> {{
> spark.range(100).write.saveAsTable("tab1")
> sql("analyze table tab1 compute statistics")
> sql("explain cost select distinct * from tab1").show(false)
> }}
> Produces:
> {{
> Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
> hints=none)
> }}
> {{
> spark.range(100).write.mode("append").saveAsTable("tab1")
> sql("explain cost select distinct * from tab1").show(false)
> }}
> After appending data, the same stats are used:
> {{
> Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
> hints=none)
> }}
> Manually refreshing the table removes the stats
> {{
> spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
> sql("explain cost select distinct * from tab1").show(false)
> }}
> {{
> Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
> }}






[jira] [Created] (SPARK-21969) CommandUtils.updateTableStats should call refreshTable

2017-09-10 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created SPARK-21969:
---

 Summary: CommandUtils.updateTableStats should call refreshTable
 Key: SPARK-21969
 URL: https://issues.apache.org/jira/browse/SPARK-21969
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Bogdan Raducanu


The table is cached so even though statistics are removed, they will still be 
used by the existing sessions.


{{code}}
spark.range(100).write.saveAsTable("tab1")
sql("analyze table tab1 compute statistics")
sql("explain cost select distinct * from tab1").show(false)
{{code}}

Produces:
{{code}}
Relation[id#103L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
{{code}}


{{code}}
spark.range(100).write.mode("append").saveAsTable("tab1")
sql("explain cost select distinct * from tab1").show(false)
{{code}}

After appending data, the same stats are used:
{{code}}
Relation[id#135L] parquet, Statistics(sizeInBytes=784.0 B, rowCount=100, 
hints=none)
{{code}}

Manually refreshing the table removes the stats
{{code}}
spark.sessionState.catalog.refreshTable(TableIdentifier("tab1"))
sql("explain cost select distinct * from tab1").show(false)
{{code}}

{{code}}
Relation[id#155L] parquet, Statistics(sizeInBytes=1568.0 B, hints=none)
{{code}}






[jira] [Created] (SPARK-21968) Improved KernelDensity support

2017-09-10 Thread Brian (JIRA)
Brian created SPARK-21968:
-

 Summary: Improved KernelDensity support
 Key: SPARK-21968
 URL: https://issues.apache.org/jira/browse/SPARK-21968
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.2.0
Reporter: Brian


Related to SPARK-7753. The KernelDensity API still does not provide a way to 
specify a kernel as described in the 7753 ticket, and requires clients to 
calculate their own optimal bandwidth.

Specifying a kernel could be something like:
def setKernel(kernel: Function2[Double,Double]): KernelDensity.this.type

A few built-in kernel options could be provided for users to pass here, so they 
don't need to implement each kernel themselves. Here are some example kernels:
https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use

Functions could also be provided to compute better bandwidth settings without 
the user needing to calculate them, e.g. the "rule of thumb" and/or 
"solve the equation" bandwidths described here:
https://en.wikipedia.org/wiki/Kernel_density_estimation#Bandwidth_selection
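For illustration, the standard Gaussian kernel and Silverman's rule-of-thumb bandwidth mentioned above can be computed as follows (a standalone sketch; KernelDensity exposes no such API today, and the method names are made up):

```java
public final class BandwidthSketch {
    // Standard Gaussian kernel: K(u) = exp(-u^2 / 2) / sqrt(2 * pi).
    static double gaussianKernel(double u) {
        return Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
    }

    // Silverman's rule of thumb: h = 1.06 * sigma * n^(-1/5),
    // where sigma is the sample standard deviation.
    static double ruleOfThumbBandwidth(double[] samples) {
        int n = samples.length;
        double mean = 0.0;
        for (double x : samples) mean += x;
        mean /= n;
        double var = 0.0;
        for (double x : samples) var += (x - mean) * (x - mean);
        double sigma = Math.sqrt(var / (n - 1));
        return 1.06 * sigma * Math.pow(n, -0.2);
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 2.0, 3.0, 4.0, 5.0};
        System.out.printf("K(0) = %.4f%n", gaussianKernel(0.0)); // ~0.3989
        System.out.printf("h = %.4f%n", ruleOfThumbBandwidth(xs));
    }
}
```

A `setKernel`-style API could accept any such function, with Gaussian as the default, and a bandwidth helper like this could serve as the fallback when the client does not supply one.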







[jira] [Commented] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR

2017-09-10 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160420#comment-16160420
 ] 

Felix Cheung commented on SPARK-20684:
--

I'm making this the primary JIRA for tracking this issue and keeping it open.
Please see the discussion in the PR.


> expose createOrReplaceGlobalTempView/createGlobalTempView and 
> dropGlobalTempView in SparkR
> --
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages within a single Spark application.






[jira] [Updated] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR

2017-09-10 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20684:
-
Summary: expose createOrReplaceGlobalTempView/createGlobalTempView and 
dropGlobalTempView in SparkR  (was: expose createGlobalTempView and 
dropGlobalTempView in SparkR)

> expose createOrReplaceGlobalTempView/createGlobalTempView and 
> dropGlobalTempView in SparkR
> --
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages within a single Spark application.






[jira] [Reopened] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR

2017-09-10 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reopened SPARK-20684:
--

> expose createGlobalTempView and dropGlobalTempView in SparkR
> 
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages within a single Spark application.






[jira] [Commented] (SPARK-18591) Replace hash-based aggregates with sort-based ones if inputs already sorted

2017-09-10 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160388#comment-16160388
 ] 

Takeshi Yamamuro commented on SPARK-18591:
--

Just a heads-up on the discussion in this thread: since we already have 
LogicalPlanVisitor, we could fairly easily realise bottom-up transformation in 
SparkStrategies, as in 
https://github.com/apache/spark/compare/master...maropu:SPARK-18591. I'm not 
sure now is the right time to change the transformation approach (many 
committers and qualified developers are probably busy with Dataset API v2 
reviews and other work), but I think it would be better to modify this in the 
future, because bottom-up transformation lets Catalyst select better physical 
plans based on the conditions of the bottom sub-trees (costs and 
partition/sort conditions); e.g., we could easily fix SPARK-12978 and this 
ticket. cc: [~smilegator]

> Replace hash-based aggregates with sort-based ones if inputs already sorted
> ---
>
> Key: SPARK-18591
> URL: https://issues.apache.org/jira/browse/SPARK-18591
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently uses sort-based aggregates only under limited conditions: the 
> cases where Spark cannot use partial aggregates and hash-based ones.
> However, if the input ordering already satisfies the requirements of 
> sort-based aggregates, sort-based ones can be faster than hash-based ones.
> {code}
> ./bin/spark-shell --conf spark.sql.shuffle.partitions=1
> val df = spark.range(1000).selectExpr("id AS key", "id % 10 AS 
> value").sort($"key").cache
> def timer[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   val t1 = System.nanoTime()
>   println("Elapsed time: " + ((t1 - t0 + 0.0) / 1000000000.0) + "s")
>   result
> }
> timer {
>   df.groupBy("key").count().count
> }
> // codegen'd hash aggregate
> Elapsed time: 7.116962977s
> // non-codegen'd sort aggregate
> Elapsed time: 3.088816662s
> {code}
> If codegen'd sort-based aggregates are supported in SPARK-16844, this would 
> make the performance gap even bigger:
> {code}
> - codegen'd sort aggregate
> Elapsed time: 1.645234684s
> {code} 
> Therefore, it'd be better to use sort-based ones in this case.






[jira] [Assigned] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21907:


Assignee: Apache Spark

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>
> I see an NPE during sorting, with the following stack trace:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> 

[jira] [Assigned] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21907:


Assignee: (was: Apache Spark)

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> I see an NPE during sorting, with the following stack trace:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 

[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160364#comment-16160364
 ] 

Apache Spark commented on SPARK-21907:
--

User 'eyalfa' has created a pull request for this issue:
https://github.com/apache/spark/pull/19181

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>   at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346)
>   at 

[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-10 Thread Eyal Farago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160366#comment-16160366
 ] 

Eyal Farago commented on SPARK-21907:
-

opened PR: https://github.com/apache/spark/pull/19181

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>   at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 

[jira] [Assigned] (SPARK-21967) org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21967:


Assignee: (was: Apache Spark)

> org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at 
> a Time for Better Performance
> --
>
> Key: SPARK-21967
> URL: https://issues.apache.org/jira/browse/SPARK-21967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Armin Braun
>Priority: Minor
>  Labels: performance
>
> org.apache.spark.unsafe.types.UTF8String#compareTo contains the following 
> TODO:
> {code}
> int len = Math.min(numBytes, other.numBytes);
> // TODO: compare 8 bytes as unsigned long
> for (int i = 0; i < len; i ++) {
>   // In UTF-8, the byte should be unsigned, so we should compare them as unsigned int.
> {code}
> The TODO should be resolved by comparing the maximum possible number of 64-bit 
> words in this method, before falling back to unsigned int comparison.
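The "compare them as unsigned int" comment matters because Java's byte type is signed: (byte) 0x80 is -128 and sorts below 0x01 under signed comparison, which is the opposite of UTF-8 byte order. A minimal sketch of the difference (class and method names are hypothetical, not part of Spark):

```java
public class UnsignedByteDemo {
    // Signed comparison: (byte) 0x80 == -128, so it sorts below 0x01.
    static int signedCompare(byte x, byte y) {
        return Byte.compare(x, y);
    }

    // Unsigned comparison: mask to 0..255 first, so 0x80 (128) sorts above 0x01.
    static int unsignedCompare(byte x, byte y) {
        return (x & 0xFF) - (y & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(signedCompare((byte) 0x80, (byte) 0x01));   // negative: wrong for UTF-8
        System.out.println(unsignedCompare((byte) 0x80, (byte) 0x01)); // positive: correct byte order
    }
}
```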



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21967) org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance

2017-09-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160338#comment-16160338
 ] 

Apache Spark commented on SPARK-21967:
--

User 'original-brownbear' has created a pull request for this issue:
https://github.com/apache/spark/pull/19180

> org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at 
> a Time for Better Performance
> --
>
> Key: SPARK-21967
> URL: https://issues.apache.org/jira/browse/SPARK-21967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Armin Braun
>Priority: Minor
>  Labels: performance
>
> org.apache.spark.unsafe.types.UTF8String#compareTo contains the following 
> TODO:
> {code}
> int len = Math.min(numBytes, other.numBytes);
> // TODO: compare 8 bytes as unsigned long
> for (int i = 0; i < len; i ++) {
>   // In UTF-8, the byte should be unsigned, so we should compare them as unsigned int.
> {code}
> The TODO should be resolved by comparing the maximum possible number of 64-bit 
> words in this method, before falling back to unsigned int comparison.






[jira] [Assigned] (SPARK-21967) org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21967:


Assignee: Apache Spark

> org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at 
> a Time for Better Performance
> --
>
> Key: SPARK-21967
> URL: https://issues.apache.org/jira/browse/SPARK-21967
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Armin Braun
>Assignee: Apache Spark
>Priority: Minor
>  Labels: performance
>
> org.apache.spark.unsafe.types.UTF8String#compareTo contains the following 
> TODO:
> {code}
> int len = Math.min(numBytes, other.numBytes);
> // TODO: compare 8 bytes as unsigned long
> for (int i = 0; i < len; i ++) {
>   // In UTF-8, the byte should be unsigned, so we should compare them as unsigned int.
> {code}
> The TODO should be resolved by comparing the maximum possible number of 64-bit 
> words in this method, before falling back to unsigned int comparison.






[jira] [Created] (SPARK-21967) org.apache.spark.unsafe.types.UTF8String#compareTo Should Compare 8 Bytes at a Time for Better Performance

2017-09-10 Thread Armin Braun (JIRA)
Armin Braun created SPARK-21967:
---

 Summary: org.apache.spark.unsafe.types.UTF8String#compareTo Should 
Compare 8 Bytes at a Time for Better Performance
 Key: SPARK-21967
 URL: https://issues.apache.org/jira/browse/SPARK-21967
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Armin Braun
Priority: Minor


org.apache.spark.unsafe.types.UTF8String#compareTo contains the following TODO:

{code}
int len = Math.min(numBytes, other.numBytes);
// TODO: compare 8 bytes as unsigned long
for (int i = 0; i < len; i ++) {
  // In UTF-8, the byte should be unsigned, so we should compare them as unsigned int.
{code}

The TODO should be resolved by comparing the maximum possible number of 64-bit 
words in this method, before falling back to unsigned int comparison.
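The 8-bytes-at-a-time idea can be sketched as below. This is a hedged illustration, not Spark's actual implementation (the real UTF8String reads from Platform/off-heap memory): each 8-byte chunk is packed big-endian into a long so that Long.compareUnsigned agrees with byte-wise unsigned order, with a per-byte unsigned fallback for the tail. Class and method names are hypothetical.

```java
public class WordCompare {

    // Lexicographic unsigned-byte comparison, 8 bytes at a time.
    // Big-endian packing makes unsigned 64-bit order agree with byte order.
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        int i = 0;
        // Fast path: compare whole 64-bit words.
        for (; i + 8 <= len; i += 8) {
            long wa = packBigEndian(a, i);
            long wb = packBigEndian(b, i);
            if (wa != wb) {
                return Long.compareUnsigned(wa, wb);
            }
        }
        // Tail: compare remaining bytes as unsigned ints.
        for (; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // Equal prefix: the shorter array sorts first.
        return a.length - b.length;
    }

    private static long packBigEndian(byte[] x, int offset) {
        long word = 0;
        for (int k = 0; k < 8; k++) {
            word = (word << 8) | (x[offset + k] & 0xFFL);
        }
        return word;
    }
}
```

On little-endian hardware an implementation would typically read a native long and byte-swap it (Long.reverseBytes) rather than assemble the word a byte at a time, but the ordering argument is the same.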






[jira] [Assigned] (SPARK-21966) ResolveMissingReference rule should not ignore the Union operator

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21966:


Assignee: (was: Apache Spark)

> ResolveMissingReference rule should not ignore the Union operator
> -
>
> Key: SPARK-21966
> URL: https://issues.apache.org/jira/browse/SPARK-21966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
>
> The below example will fail.
> {code:java}
> val df1 = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
> val df2 = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 3))).toDF("a", "b")
> val df3 = df1.cube("a").sum("b")
> val df4 = df2.cube("a").sum("b")
> val df5 = df3.union(df4).filter("grouping_id()=0").show()
> {code}
> It will throw an Exception:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`spark_grouping_id`' given input columns: [a, sum(b)];;
> 'Filter ('spark_grouping_id > 0)
> +- Union
>    :- Aggregate [a#17, spark_grouping_id#15], [a#17, sum(cast(b#6 as bigint)) AS sum(b)#14L]
>    :  +- Expand [List(a#5, b#6, a#16, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#17, spark_grouping_id#15]
>    :     +- Project [a#5, b#6, a#5 AS a#16]
>    :        +- Project [_1#0 AS a#5, _2#1 AS b#6]
>    :           +- LocalRelation [_1#0, _2#1]
>    +- Aggregate [a#30, spark_grouping_id#28], [a#30, sum(cast(b#6 as bigint)) AS sum(b)#27L]
>       +- Expand [List(a#5, b#6, a#29, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#30, spark_grouping_id#28]
>          +- Project [a#5, b#6, a#5 AS a#29]
>             +- Project [_1#0 AS a#5, _2#1 AS b#6]
>                +- LocalRelation [_1#0, _2#1]
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
>   at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:72)
>   at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:71)
>   at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:77)
>   at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:77)
>   at 

[jira] [Commented] (SPARK-21966) ResolveMissingReference rule should not ignore the Union operator

2017-09-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160279#comment-16160279
 ] 

Apache Spark commented on SPARK-21966:
--

User 'DonnyZone' has created a pull request for this issue:
https://github.com/apache/spark/pull/19178

> ResolveMissingReference rule should not ignore the Union operator
> -
>
> Key: SPARK-21966
> URL: https://issues.apache.org/jira/browse/SPARK-21966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
>
> The below example will fail.
> {code:java}
> val df1 = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
> val df2 = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 3))).toDF("a", "b")
> val df3 = df1.cube("a").sum("b")
> val df4 = df2.cube("a").sum("b")
> val df5 = df3.union(df4).filter("grouping_id()=0").show()
> {code}
> It will throw an Exception:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`spark_grouping_id`' given input columns: [a, sum(b)];;
> 'Filter ('spark_grouping_id > 0)
> +- Union
>    :- Aggregate [a#17, spark_grouping_id#15], [a#17, sum(cast(b#6 as bigint)) AS sum(b)#14L]
>    :  +- Expand [List(a#5, b#6, a#16, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#17, spark_grouping_id#15]
>    :     +- Project [a#5, b#6, a#5 AS a#16]
>    :        +- Project [_1#0 AS a#5, _2#1 AS b#6]
>    :           +- LocalRelation [_1#0, _2#1]
>    +- Aggregate [a#30, spark_grouping_id#28], [a#30, sum(cast(b#6 as bigint)) AS sum(b)#27L]
>       +- Expand [List(a#5, b#6, a#29, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#30, spark_grouping_id#28]
>          +- Project [a#5, b#6, a#5 AS a#29]
>             +- Project [_1#0 AS a#5, _2#1 AS b#6]
>                +- LocalRelation [_1#0, _2#1]
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
>   at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:72)
>   at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:71)
>   at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:77)
>   at 

[jira] [Assigned] (SPARK-21966) ResolveMissingReference rule should not ignore the Union operator

2017-09-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21966:


Assignee: Apache Spark

> ResolveMissingReference rule should not ignore the Union operator
> -
>
> Key: SPARK-21966
> URL: https://issues.apache.org/jira/browse/SPARK-21966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
>Assignee: Apache Spark
>
> The below example will fail.
> {code:java}
> val df1 = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
> val df2 = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 3))).toDF("a", "b")
> val df3 = df1.cube("a").sum("b")
> val df4 = df2.cube("a").sum("b")
> val df5 = df3.union(df4).filter("grouping_id()=0").show()
> {code}
> It will throw an Exception:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`spark_grouping_id`' given input columns: [a, sum(b)];;
> 'Filter ('spark_grouping_id > 0)
> +- Union
>    :- Aggregate [a#17, spark_grouping_id#15], [a#17, sum(cast(b#6 as bigint)) AS sum(b)#14L]
>    :  +- Expand [List(a#5, b#6, a#16, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#17, spark_grouping_id#15]
>    :     +- Project [a#5, b#6, a#5 AS a#16]
>    :        +- Project [_1#0 AS a#5, _2#1 AS b#6]
>    :           +- LocalRelation [_1#0, _2#1]
>    +- Aggregate [a#30, spark_grouping_id#28], [a#30, sum(cast(b#6 as bigint)) AS sum(b)#27L]
>       +- Expand [List(a#5, b#6, a#29, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#30, spark_grouping_id#28]
>          +- Project [a#5, b#6, a#5 AS a#29]
>             +- Project [_1#0 AS a#5, _2#1 AS b#6]
>                +- LocalRelation [_1#0, _2#1]
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
>   at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:72)
>   at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:71)
>   at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:77)
>   at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:77)
>   at 

[jira] [Created] (SPARK-21966) ResolveMissingReference rule should not ignore the Union operator

2017-09-10 Thread Feng Zhu (JIRA)
Feng Zhu created SPARK-21966:


 Summary: ResolveMissingReference rule should not ignore the Union 
operator
 Key: SPARK-21966
 URL: https://issues.apache.org/jira/browse/SPARK-21966
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.1.0
Reporter: Feng Zhu


The below example will fail.
{code:java}
val df1 = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
val df2 = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 3))).toDF("a", "b")
val df3 = df1.cube("a").sum("b")
val df4 = df2.cube("a").sum("b")
val df5 = df3.union(df4).filter("grouping_id()=0").show()
{code}

It will throw an Exception:

{code:java}
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`spark_grouping_id`' given input columns: [a, sum(b)];;
'Filter ('spark_grouping_id > 0)
+- Union
   :- Aggregate [a#17, spark_grouping_id#15], [a#17, sum(cast(b#6 as bigint)) AS sum(b)#14L]
   :  +- Expand [List(a#5, b#6, a#16, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#17, spark_grouping_id#15]
   :     +- Project [a#5, b#6, a#5 AS a#16]
   :        +- Project [_1#0 AS a#5, _2#1 AS b#6]
   :           +- LocalRelation [_1#0, _2#1]
   +- Aggregate [a#30, spark_grouping_id#28], [a#30, sum(cast(b#6 as bigint)) AS sum(b)#27L]
      +- Expand [List(a#5, b#6, a#29, 0), List(a#5, b#6, null, 1)], [a#5, b#6, a#30, spark_grouping_id#28]
         +- Project [a#5, b#6, a#5 AS a#29]
            +- Project [_1#0 AS a#5, _2#1 AS b#6]
               +- LocalRelation [_1#0, _2#1]

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
at 
org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:72)
at 
org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:71)
at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:77)
at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:77)
at 
org.apache.spark.sql.execution.QueryExecution.&lt;init&gt;(QueryExecution.scala:79)
at 
org.apache.spark.sql.internal.SessionState.executePlan(SessionState.scala:169)
at org.apache.spark.sql.Dataset.&lt;init&gt;(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:58)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2827)
{code}

[jira] [Closed] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR

2017-09-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang closed SPARK-21965.
---
Resolution: Duplicate

> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
> ---
>
> Key: SPARK-21965
> URL: https://issues.apache.org/jira/browse/SPARK-21965
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField

2017-09-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160276#comment-16160276
 ] 

Hyukjin Kwon commented on SPARK-20098:
--

[~jerryshao], I am sorry for asking this again, but would you mind setting the 
role for this user - 
https://issues.apache.org/jira/secure/ViewProfile.jspa?name=szalai1 - and 
assigning this JIRA to them?

> DataType's typeName method returns with 'StructF' in case of StructField
> 
>
> Key: SPARK-20098
> URL: https://issues.apache.org/jira/browse/SPARK-20098
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Peter Szalai
> Fix For: 2.2.1, 2.3.0
>
>
> Currently, if you want to get the name of a DataType and the DataType is a 
> `StructField`, you get `StructF`. 
> http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html 
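
The truncation reported above is consistent with {{typeName}} being derived from the class name by stripping a fixed four-character "Type" suffix, which mangles names that do not end in "Type". A minimal standalone sketch of that failure mode (simplified, not the actual pyspark source):

```python
# Simplified sketch of the reported bug: deriving typeName by
# dropping the last four characters of the class name assumes every
# class name ends in "Type" -- which is wrong for StructField.
class DataType:
    @classmethod
    def typeName(cls):
        # Strip the assumed 4-character "Type" suffix.
        return cls.__name__[:-4]

class StringType(DataType):
    pass

class StructField(DataType):
    pass

print(StringType.typeName())   # String
print(StructField.typeName())  # StructF
```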






[jira] [Resolved] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField

2017-09-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20098.
--
Resolution: Fixed
Fix Version/s: 2.2.1, 2.3.0

Issue resolved by pull request 17435
[https://github.com/apache/spark/pull/17435]

> DataType's typeName method returns with 'StructF' in case of StructField
> 
>
> Key: SPARK-20098
> URL: https://issues.apache.org/jira/browse/SPARK-20098
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Peter Szalai
> Fix For: 2.2.1, 2.3.0
>
>
> Currently, if you want to get the name of a DataType and the DataType is a 
> `StructField`, you get `StructF`. 
> http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html 


