[jira] [Commented] (SPARK-26312) Converting converters in RDDConversions into arrays to improve their access performance

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713576#comment-16713576
 ] 

Apache Spark commented on SPARK-26312:
--

User 'eatoncys' has created a pull request for this issue:
https://github.com/apache/spark/pull/23262

> Converting converters in RDDConversions into arrays to improve their access 
> performance
> ---
>
> Key: SPARK-26312
> URL: https://issues.apache.org/jira/browse/SPARK-26312
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Major
>
> `RDDConversions` gets disproportionately slower as the number of columns in 
> the query increases.
> This PR converts the `converters` in `RDDConversions` into arrays to improve 
> their access performance; previously, `converters` had the type 
> `scala.collection.immutable.::`, which is a subtype of `List`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26312) Converting converters in RDDConversions into arrays to improve their access performance

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713577#comment-16713577
 ] 

Apache Spark commented on SPARK-26312:
--

User 'eatoncys' has created a pull request for this issue:
https://github.com/apache/spark/pull/23262

> Converting converters in RDDConversions into arrays to improve their access 
> performance
> ---
>
> Key: SPARK-26312
> URL: https://issues.apache.org/jira/browse/SPARK-26312
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Major
>
> `RDDConversions` gets disproportionately slower as the number of columns in 
> the query increases.
> This PR converts the `converters` in `RDDConversions` into arrays to improve 
> their access performance; previously, `converters` had the type 
> `scala.collection.immutable.::`, which is a subtype of `List`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26312) Converting converters in RDDConversions into arrays to improve their access performance

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26312:


Assignee: (was: Apache Spark)

> Converting converters in RDDConversions into arrays to improve their access 
> performance
> ---
>
> Key: SPARK-26312
> URL: https://issues.apache.org/jira/browse/SPARK-26312
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Priority: Major
>
> `RDDConversions` gets disproportionately slower as the number of columns in 
> the query increases.
> This PR converts the `converters` in `RDDConversions` into arrays to improve 
> their access performance; previously, `converters` had the type 
> `scala.collection.immutable.::`, which is a subtype of `List`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26312) Converting converters in RDDConversions into arrays to improve their access performance

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26312:


Assignee: Apache Spark

> Converting converters in RDDConversions into arrays to improve their access 
> performance
> ---
>
> Key: SPARK-26312
> URL: https://issues.apache.org/jira/browse/SPARK-26312
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: eaton
>Assignee: Apache Spark
>Priority: Major
>
> `RDDConversions` gets disproportionately slower as the number of columns in 
> the query increases.
> This PR converts the `converters` in `RDDConversions` into arrays to improve 
> their access performance; previously, `converters` had the type 
> `scala.collection.immutable.::`, which is a subtype of `List`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26312) Converting converters in RDDConversions into arrays to improve their access performance

2018-12-07 Thread eaton (JIRA)
eaton created SPARK-26312:
-

 Summary: Converting converters in RDDConversions into arrays to 
improve their access performance
 Key: SPARK-26312
 URL: https://issues.apache.org/jira/browse/SPARK-26312
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: eaton


`RDDConversions` gets disproportionately slower as the number of columns in the 
query increases.
This PR converts the `converters` in `RDDConversions` into arrays to improve 
their access performance; previously, `converters` had the type 
`scala.collection.immutable.::`, which is a subtype of `List`.
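
A minimal sketch of the motivation (not the Spark patch itself), assuming a 
converter is just a function applied per column: indexing into a Scala `List` is 
O(n) per access, while indexing into an `Array` is O(1), so per-column lookups 
over thousands of columns get disproportionately slower with a `List`.

{code:scala}
// Sketch only: contrasts List vs. Array indexed access, not the actual patch.
val numColumns = 3000
val convertersList:  List[Any => Any]  = List.fill(numColumns)(identity[Any] _)
val convertersArray: Array[Any => Any] = convertersList.toArray

// List.apply(i) walks i cons cells (O(n) per lookup); Array(i) is O(1).
val viaList  = (0 until numColumns).map(i => convertersList(i)(i))
val viaArray = (0 until numColumns).map(i => convertersArray(i)(i))
{code}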



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713565#comment-16713565
 ] 

Apache Spark commented on SPARK-26311:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/23260

> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, pointing to the 
> NodeManager web app. Normally this works for both running and finished apps, 
> but there are other approaches to maintaining application logs, such as an 
> external log service, which keeps application log URLs from becoming dead 
> links when the NodeManager is not accessible (node decommissioned, elastic 
> nodes, etc.).
> Spark could provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713568#comment-16713568
 ] 

Apache Spark commented on SPARK-23674:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23261

> Add Spark ML Listener for Tracking ML Pipeline Status
> -
>
> Key: SPARK-23674
> URL: https://issues.apache.org/jira/browse/SPARK-23674
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Mingjie Tang
>Priority: Major
>
> Currently, Spark provides status monitoring for different components of 
> Spark, such as the history server, streaming listener, and SQL listener. 
> The use cases would be (1) a front-end UI that tracks the training 
> convergence rate across iterations, so data scientists can understand how a 
> job converges when training K-means, logistic regression, and other linear 
> models, and (2) tracking the data lineage of the inputs and outputs of the 
> training data. 
> In this proposal, we hope to provide a Spark ML pipeline listener that tracks 
> the status of the ML pipeline, including: 
>  # ML pipeline created and saved 
>  # ML pipeline model created, saved, and loaded 
>  # ML model training status monitoring 
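
A minimal sketch of what such a listener could look like (all names below are 
assumptions for illustration, not an existing Spark API): lifecycle callbacks 
that a front-end UI or lineage tool could subscribe to.

{code:scala}
// Hypothetical interface; a real implementation would likely plug into the
// existing Spark listener/event-bus machinery.
trait MLPipelineListener {
  def onPipelineSaved(pipelineUid: String, path: String): Unit = {}
  def onModelCreated(modelUid: String): Unit = {}
  def onModelSaved(modelUid: String, path: String): Unit = {}
  def onModelLoaded(modelUid: String, path: String): Unit = {}
  def onTrainingProgress(modelUid: String, iteration: Int, objective: Double): Unit = {}
}
{code}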



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23674) Add Spark ML Listener for Tracking ML Pipeline Status

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713567#comment-16713567
 ] 

Apache Spark commented on SPARK-23674:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23261

> Add Spark ML Listener for Tracking ML Pipeline Status
> -
>
> Key: SPARK-23674
> URL: https://issues.apache.org/jira/browse/SPARK-23674
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Mingjie Tang
>Priority: Major
>
> Currently, Spark provides status monitoring for different components of 
> Spark, such as the history server, streaming listener, and SQL listener. 
> The use cases would be (1) a front-end UI that tracks the training 
> convergence rate across iterations, so data scientists can understand how a 
> job converges when training K-means, logistic regression, and other linear 
> models, and (2) tracking the data lineage of the inputs and outputs of the 
> training data. 
> In this proposal, we hope to provide a Spark ML pipeline listener that tracks 
> the status of the ML pipeline, including: 
>  # ML pipeline created and saved 
>  # ML pipeline model created, saved, and loaded 
>  # ML model training status monitoring 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26311:


Assignee: (was: Apache Spark)

> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, pointing to the 
> NodeManager web app. Normally this works for both running and finished apps, 
> but there are other approaches to maintaining application logs, such as an 
> external log service, which keeps application log URLs from becoming dead 
> links when the NodeManager is not accessible (node decommissioned, elastic 
> nodes, etc.).
> Spark could provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26311:


Assignee: Apache Spark

> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, pointing to the 
> NodeManager web app. Normally this works for both running and finished apps, 
> but there are other approaches to maintaining application logs, such as an 
> external log service, which keeps application log URLs from becoming dead 
> links when the NodeManager is not accessible (node decommissioned, elastic 
> nodes, etc.).
> Spark could provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713564#comment-16713564
 ] 

Apache Spark commented on SPARK-26311:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/23260

> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, pointing to the 
> NodeManager web app. Normally this works for both running and finished apps, 
> but there are other approaches to maintaining application logs, such as an 
> external log service, which keeps application log URLs from becoming dead 
> links when the NodeManager is not accessible (node decommissioned, elastic 
> nodes, etc.).
> Spark could provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log service.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-07 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-26311:


 Summary: [YARN] New feature: custom log URL for stdout/stderr
 Key: SPARK-26311
 URL: https://issues.apache.org/jira/browse/SPARK-26311
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.4.0
Reporter: Jungtaek Lim


Spark currently sets static log URLs for YARN applications, pointing to the 
NodeManager web app. Normally this works for both running and finished apps, but 
there are other approaches to maintaining application logs, such as an external 
log service, which keeps application log URLs from becoming dead links when the 
NodeManager is not accessible (node decommissioned, elastic nodes, etc.).

Spark could provide a new configuration for a custom log URL in YARN mode, which 
end users can set to point application logs to an external log service.
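
A hypothetical sketch of the idea (the pattern tokens and example values below 
are assumptions for illustration, not an existing Spark configuration or API): 
the executor log URL would be built from a user-supplied pattern instead of the 
static NodeManager URL.

{code:scala}
// Hypothetical: substitute tokens such as {{APP_ID}} in a user-configured pattern.
def buildLogUrl(pattern: String, tokens: Map[String, String]): String =
  tokens.foldLeft(pattern) { case (url, (key, value)) =>
    url.replace("{{" + key + "}}", value)
  }

val pattern = "https://logs.example.com/{{APP_ID}}/{{CONTAINER_ID}}/{{FILE_NAME}}"
val stderrUrl = buildLogUrl(pattern, Map(
  "APP_ID"       -> "application_1544000000000_0001",
  "CONTAINER_ID" -> "container_1544000000000_0001_01_000002",
  "FILE_NAME"    -> "stderr"))
{code}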



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26224) Results in stackOverFlowError when trying to add 3000 new columns using withColumn function of dataframe.

2018-12-07 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-26224:
-
Component/s: (was: Spark Core)
 SQL

> Results in stackOverFlowError when trying to add 3000 new columns using 
> withColumn function of dataframe.
> -
>
> Key: SPARK-26224
> URL: https://issues.apache.org/jira/browse/SPARK-26224
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: On macbook, used Intellij editor. Ran the above sample 
> code as unit test.
>Reporter: Dorjee Tsering
>Priority: Minor
>
> Reproduction step:
> Run this sample code on your laptop. I am trying to add 3000 new columns to a 
> base dataframe with 1 column.
>  
>  
> {code:java}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.lit
> import org.apache.spark.sql.types.{DataTypes, StructField}
> import spark.implicits._
>
> val newColumnsToBeAdded: Seq[StructField] =
>   for (i <- 1 to 3000) yield StructField("field_" + i, DataTypes.LongType)
> val baseDataFrame: DataFrame = Seq(1).toDF("employee_id")
> val result = newColumnsToBeAdded.foldLeft(baseDataFrame)((df, newColumn) =>
>   df.withColumn(newColumn.name, lit(0)))
> result.show(false)
> {code}
> Ends up with the following stack trace:
> java.lang.StackOverflowError
>  at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
>  at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
>  at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>  at scala.collection.immutable.List.map(List.scala:296)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26215) define reserved keywords after SQL standard

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713540#comment-16713540
 ] 

Apache Spark commented on SPARK-26215:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/23259

> define reserved keywords after SQL standard
> ---
>
> Key: SPARK-26215
> URL: https://issues.apache.org/jira/browse/SPARK-26215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> There are two kinds of SQL keywords: reserved and non-reserved. Reserved 
> keywords can't be used as identifiers.
> In Spark SQL, we are too lenient about non-reserved keywords. A lot of 
> keywords are non-reserved, and this sometimes causes ambiguity (IIRC we hit a 
> problem when improving the INTERVAL syntax).
> I think it would be better to just follow other databases or the SQL standard 
> to define reserved keywords, so that we don't need to think very hard about 
> how to avoid ambiguity.
> For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26215) define reserved keywords after SQL standard

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26215:


Assignee: Apache Spark

> define reserved keywords after SQL standard
> ---
>
> Key: SPARK-26215
> URL: https://issues.apache.org/jira/browse/SPARK-26215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>
> There are two kinds of SQL keywords: reserved and non-reserved. Reserved 
> keywords can't be used as identifiers.
> In Spark SQL, we are too lenient about non-reserved keywords. A lot of 
> keywords are non-reserved, and this sometimes causes ambiguity (IIRC we hit a 
> problem when improving the INTERVAL syntax).
> I think it would be better to just follow other databases or the SQL standard 
> to define reserved keywords, so that we don't need to think very hard about 
> how to avoid ambiguity.
> For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26215) define reserved keywords after SQL standard

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713538#comment-16713538
 ] 

Apache Spark commented on SPARK-26215:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/23259

> define reserved keywords after SQL standard
> ---
>
> Key: SPARK-26215
> URL: https://issues.apache.org/jira/browse/SPARK-26215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> There are two kinds of SQL keywords: reserved and non-reserved. Reserved 
> keywords can't be used as identifiers.
> In Spark SQL, we are too lenient about non-reserved keywords. A lot of 
> keywords are non-reserved, and this sometimes causes ambiguity (IIRC we hit a 
> problem when improving the INTERVAL syntax).
> I think it would be better to just follow other databases or the SQL standard 
> to define reserved keywords, so that we don't need to think very hard about 
> how to avoid ambiguity.
> For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26215) define reserved keywords after SQL standard

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26215:


Assignee: (was: Apache Spark)

> define reserved keywords after SQL standard
> ---
>
> Key: SPARK-26215
> URL: https://issues.apache.org/jira/browse/SPARK-26215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> There are two kinds of SQL keywords: reserved and non-reserved. Reserved 
> keywords can't be used as identifiers.
> In Spark SQL, we are too lenient about non-reserved keywords. A lot of 
> keywords are non-reserved, and this sometimes causes ambiguity (IIRC we hit a 
> problem when improving the INTERVAL syntax).
> I think it would be better to just follow other databases or the SQL standard 
> to define reserved keywords, so that we don't need to think very hard about 
> how to avoid ambiguity.
> For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23375) Optimizer should remove unneeded Sort

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713516#comment-16713516
 ] 

Apache Spark commented on SPARK-23375:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23258

> Optimizer should remove unneeded Sort
> -
>
> Key: SPARK-23375
> URL: https://issues.apache.org/jira/browse/SPARK-23375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> As pointed out in SPARK-23368, there is currently no rule to remove the Sort 
> operator from an already-sorted plan, i.e. if we have a query like:
> {code}
> SELECT b
> FROM (
> SELECT a, b
> FROM table1
> ORDER BY a
> ) t
> ORDER BY a
> {code}
> The sort is actually executed twice, even though it is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23375) Optimizer should remove unneeded Sort

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713514#comment-16713514
 ] 

Apache Spark commented on SPARK-23375:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23258

> Optimizer should remove unneeded Sort
> -
>
> Key: SPARK-23375
> URL: https://issues.apache.org/jira/browse/SPARK-23375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> As pointed out in SPARK-23368, there is currently no rule to remove the Sort 
> operator from an already-sorted plan, i.e. if we have a query like:
> {code}
> SELECT b
> FROM (
> SELECT a, b
> FROM table1
> ORDER BY a
> ) t
> ORDER BY a
> {code}
> The sort is actually executed twice, even though it is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26224) Results in stackOverFlowError when trying to add 3000 new columns using withColumn function of dataframe.

2018-12-07 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713494#comment-16713494
 ] 

Liang-Chi Hsieh commented on SPARK-26224:
-

I think this is not specific to withColumn; withColumn simply adds a projection 
on top of the original dataframe.

I think it happens because you create a very deep query plan, so the analyzer or 
optimizer overflows the stack when traversing down the query plan.

Even if it could traverse such a deep query plan, it would not be efficient to 
do so. I'd recommend not creating such a deep query plan.

This should not be a bug.
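
One common workaround is to add all of the new columns in a single projection 
instead of chaining thousands of withColumn calls, which keeps the query plan 
shallow. A minimal sketch, reusing `baseDataFrame` from the reproduction quoted 
below:

{code:scala}
import org.apache.spark.sql.functions.{col, lit}

// Build one wide projection instead of 3000 nested projections.
val newCols = (1 to 3000).map(i => lit(0L).as("field_" + i))
val result  = baseDataFrame.select(col("*") +: newCols: _*)
result.show(false)
{code}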

> Results in stackOverFlowError when trying to add 3000 new columns using 
> withColumn function of dataframe.
> -
>
> Key: SPARK-26224
> URL: https://issues.apache.org/jira/browse/SPARK-26224
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: On macbook, used Intellij editor. Ran the above sample 
> code as unit test.
>Reporter: Dorjee Tsering
>Priority: Minor
>
> Reproduction step:
> Run this sample code on your laptop. I am trying to add 3000 new columns to a 
> base dataframe with 1 column.
>  
>  
> {code:java}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.lit
> import org.apache.spark.sql.types.{DataTypes, StructField}
> import spark.implicits._
>
> val newColumnsToBeAdded: Seq[StructField] =
>   for (i <- 1 to 3000) yield StructField("field_" + i, DataTypes.LongType)
> val baseDataFrame: DataFrame = Seq(1).toDF("employee_id")
> val result = newColumnsToBeAdded.foldLeft(baseDataFrame)((df, newColumn) =>
>   df.withColumn(newColumn.name, lit(0)))
> result.show(false)
> {code}
> Ends up with the following stack trace:
> java.lang.StackOverflowError
>  at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
>  at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
>  at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>  at scala.collection.immutable.List.map(List.scala:296)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23734) InvalidSchemaException While Saving ALSModel

2018-12-07 Thread Stanley Poon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanley Poon resolved SPARK-23734.
--
   Resolution: Fixed
Fix Version/s: 2.3.1

> InvalidSchemaException While Saving ALSModel
> 
>
> Key: SPARK-23734
> URL: https://issues.apache.org/jira/browse/SPARK-23734
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: macOS 10.13.2
> Scala 2.11.8
> Spark 2.3.0  v2.3.0-rc5 (Feb 22 2018)
>Reporter: Stanley Poon
>Priority: Major
>  Labels: ALS, parquet, persistence
> Fix For: 2.3.1
>
>
> After fitting an ALSModel, we get the following error while saving the model:
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> Exactly the same code ran ok on 2.2.1.
> Same issue also occurs on other ALSModels we have.
> h2. *To reproduce*
> Get ALSExample: 
> [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala]
>  and add the following line to save the model right before "spark.stop".
> {quote}   model.write.overwrite().save("SparkExampleALSModel") 
> {quote}
> h2. Stack Trace
> Exception in thread "main" java.lang.ExceptionInInitializerError
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
> at 
> org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103)
> at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83)
> at com.vitalmove.model.ALSExample.main(ALSExample.scala)
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
> at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<init>(ParquetSchemaConverter.scala:567)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<clinit>(ParquetSchemaConverter.scala)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (SPARK-23734) InvalidSchemaException While Saving ALSModel

2018-12-07 Thread Stanley Poon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713403#comment-16713403
 ] 

Stanley Poon commented on SPARK-23734:
--

Just confirmed the problem is fixed in Spark 2.3.1. The test environment uses 
Scala 2.11.11 and there are no other dependencies. I will close the case.

> InvalidSchemaException While Saving ALSModel
> 
>
> Key: SPARK-23734
> URL: https://issues.apache.org/jira/browse/SPARK-23734
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
> Environment: macOS 10.13.2
> Scala 2.11.8
> Spark 2.3.0  v2.3.0-rc5 (Feb 22 2018)
>Reporter: Stanley Poon
>Priority: Major
>  Labels: ALS, parquet, persistence
>
> After fitting an ALSModel, we get the following error while saving the model:
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> Exactly the same code ran ok on 2.2.1.
> Same issue also occurs on other ALSModels we have.
> h2. *To reproduce*
> Get ALSExample: 
> [https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala]
>  and add the following line to save the model right before "spark.stop".
> {quote}   model.write.overwrite().save("SparkExampleALSModel") 
> {quote}
> h2. Stack Trace
> Exception in thread "main" java.lang.ExceptionInInitializerError
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:444)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:444)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
> at 
> org.apache.spark.ml.recommendation.ALSModel$ALSModelWriter.saveImpl(ALS.scala:510)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103)
> at com.vitalmove.model.ALSExample$.main(ALSExample.scala:83)
> at com.vitalmove.model.ALSExample.main(ALSExample.scala)
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
> not be empty. Parquet does not support empty group without leaves. Empty 
> group: spark_schema
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
> at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.<init>(ParquetSchemaConverter.scala:567)
> at 
> 

[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713388#comment-16713388
 ] 

Dongjoon Hyun commented on SPARK-26282:
---

Great, thanks again for this and email notifications.

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19526) Spark should raise an exception when it tries to read a Hive view but it doesn't have read access on the corresponding table(s)

2018-12-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19526.

Resolution: Cannot Reproduce

> Spark should raise an exception when it tries to read a Hive view but it 
> doesn't have read access on the corresponding table(s)
> ---
>
> Key: SPARK-19526
> URL: https://issues.apache.org/jira/browse/SPARK-19526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.4, 2.0.3, 2.2.0, 2.3.0
>Reporter: Reza Safi
>Priority: Major
>
> Spark sees a Hive view as a set of HDFS "files", so to read anything from a 
> Hive view, Spark needs access to all of the files that belong to the table(s) 
> the view queries. In other words, a Spark user cannot be granted fine-grained 
> permissions at the level of Hive columns or records.
> Consider a Spark job that contains a SQL query reading a Hive view. Currently 
> the job finishes successfully even if the user running it doesn't have proper 
> read access to the tables the view is built on top of; it just returns an 
> empty result set. This can be confusing for users, since the job finishes 
> without any exception or error. 
> Spark should raise an exception such as AccessDenied when it runs a Hive view 
> query and the user doesn't have proper permissions on the tables the view is 
> created on top of. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19526) Spark should raise an exception when it tries to read a Hive view but it doesn't have read access on the corresponding table(s)

2018-12-07 Thread Reza Safi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713368#comment-16713368
 ] 

Reza Safi commented on SPARK-19526:
---

It seems that this can be resolved since we can't reproduce the issue. Spark 
will give an error message if the user doesn't have proper access to the 
underlying table of a view. It won't just return null results. Thank you 
[~attilapiros] and [~vanzin] for verifying this.

> Spark should raise an exception when it tries to read a Hive view but it 
> doesn't have read access on the corresponding table(s)
> ---
>
> Key: SPARK-19526
> URL: https://issues.apache.org/jira/browse/SPARK-19526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.4, 2.0.3, 2.2.0, 2.3.0
>Reporter: Reza Safi
>Priority: Major
>
> Spark sees a Hive view as a set of HDFS "files", so to read anything from a 
> Hive view, Spark needs access to all of the files that belong to the table(s) 
> the view queries. In other words, a Spark user cannot be granted fine-grained 
> permissions at the level of Hive columns or records.
> Consider a Spark job that contains a SQL query reading a Hive view. Currently 
> the job finishes successfully even if the user running it doesn't have proper 
> read access to the tables the view is built on top of; it just returns an 
> empty result set. This can be confusing for users, since the job finishes 
> without any exception or error. 
> Spark should raise an exception such as AccessDenied when it runs a Hive view 
> query and the user doesn't have proper permissions on the tables the view is 
> created on top of. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713353#comment-16713353
 ] 

shane knapp commented on SPARK-26282:
-

test build passed!

 

[https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-maven-hadoop-2.7-java-8.191/1/]

 

deploying this now.

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-07 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-26282.
-
Resolution: Fixed

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713359#comment-16713359
 ] 

shane knapp commented on SPARK-26282:
-

done.  about to email dev@ for a heads-up.

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API

2018-12-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24333.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21465
[https://github.com/apache/spark/pull/21465]

> Add fit with validation set to spark.ml GBT: Python API
> ---
>
> Key: SPARK-24333
> URL: https://issues.apache.org/jira/browse/SPARK-24333
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Python version of API added by [SPARK-7132]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter

2018-12-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26304:
--

Assignee: Gabor Somogyi

> Add default value to spark.kafka.sasl.kerberos.service.name parameter
> -
>
> Key: SPARK-26304
> URL: https://issues.apache.org/jira/browse/SPARK-26304
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> The reasoning behind this:
> * Kafka's configuration guide suggests the same value: 
> https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig
> * It would make things easier for Spark users by requiring less configuration
> * Other streaming engines do the same
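
A minimal sketch of what this means for users (the default value "kafka" is an 
assumption based on the Kafka documentation linked above, not a confirmed Spark 
value): today the parameter has to be set explicitly, whereas a sensible default 
would let most users omit it.

{code:scala}
import org.apache.spark.SparkConf

// Explicit setting that a default would make unnecessary in the common case.
val conf = new SparkConf()
  .set("spark.kafka.sasl.kerberos.service.name", "kafka") // assumed default
{code}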



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter

2018-12-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26304.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23254
[https://github.com/apache/spark/pull/23254]

> Add default value to spark.kafka.sasl.kerberos.service.name parameter
> -
>
> Key: SPARK-26304
> URL: https://issues.apache.org/jira/browse/SPARK-26304
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> The reasoning behind this:
> * Kafka's configuration guide suggests the same value: 
> https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig
> * It would make things easier for Spark users by requiring less configuration
> * Other streaming engines do the same



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API

2018-12-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24333:


Assignee: Huaxin Gao

> Add fit with validation set to spark.ml GBT: Python API
> ---
>
> Key: SPARK-24333
> URL: https://issues.apache.org/jira/browse/SPARK-24333
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Huaxin Gao
>Priority: Major
>
> Python version of API added by [SPARK-7132]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26310) Verification of JSON options

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26310:


Assignee: (was: Apache Spark)

> Verification of JSON options
> 
>
> Key: SPARK-26310
> URL: https://issues.apache.org/jira/browse/SPARK-26310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> For JSON options used only in write, the following exception should be raised 
> if those options are used in read. The same exception should be raised in the 
> opposite case when read option is used in write:
> {code}
> java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull" is 
> not applicable in write.
> {code}
> The verification can be disabled via the SQL config: 
> {code}
> spark.sql.verifyDataSourceOptions
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26310) Verification of JSON options

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713285#comment-16713285
 ] 

Apache Spark commented on SPARK-26310:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23257

> Verification of JSON options
> 
>
> Key: SPARK-26310
> URL: https://issues.apache.org/jira/browse/SPARK-26310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> For JSON options used only in write, the following exception should be raised 
> if those options are used in read. The same exception should be raised in the 
> opposite case when read option is used in write:
> {code}
> java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull" is 
> not applicable in write.
> {code}
> The verification can be disabled via the SQL config: 
> {code}
> spark.sql.verifyDataSourceOptions
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26310) Verification of JSON options

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26310:


Assignee: Apache Spark

> Verification of JSON options
> 
>
> Key: SPARK-26310
> URL: https://issues.apache.org/jira/browse/SPARK-26310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> For JSON options used only in write, the following exception should be raised 
> if those options are used in read. The same exception should be raised in the 
> opposite case when read option is used in write:
> {code}
> java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull" is 
> not applicable in write.
> {code}
> The verification can be disabled via the SQL config: 
> {code}
> spark.sql.verifyDataSourceOptions
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26310) Verification of JSON options

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713283#comment-16713283
 ] 

Apache Spark commented on SPARK-26310:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23257

> Verification of JSON options
> 
>
> Key: SPARK-26310
> URL: https://issues.apache.org/jira/browse/SPARK-26310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> For JSON options used only in write, the following exception should be raised 
> if those options are used in read. The same exception should be raised in the 
> opposite case when read option is used in write:
> {code}
> java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull" is 
> not applicable in write.
> {code}
> The verification can be disabled via the SQL config: 
> {code}
> spark.sql.verifyDataSourceOptions
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25696) The storage memory displayed on spark Application UI is incorrect.

2018-12-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25696:
-

   Docs Text: In Spark 3.0, the web UI and log statements now 
consistently report units in KiB, MiB, etc. (i.e. multiples of 1024) rather 
than KB and MB (i.e. multiples of 1000). For example, 1024000 bytes is now 
displayed as 1000 KiB rather than 1024 KB.
Assignee: hantiantian
Target Version/s: 3.0.0
  Labels: release-notes  (was: )
 Component/s: Web UI
  Issue Type: Improvement  (was: Bug)

(I'm marking this as much more of an improvement than a fix, as I believe the 
displays were correct, just in inconsistent units. There were a few log 
statements that were incorrect, but nothing functional, it appears.)

> The storage memory displayed on spark Application UI is incorrect.
> --
>
> Key: SPARK-25696
> URL: https://issues.apache.org/jira/browse/SPARK-25696
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
>  Labels: release-notes
>
> In the reported heartbeat information, the unit of the memory data is bytes, 
> which is converted by the formatBytes() function in the utils.js file before 
> being displayed in the interface. The base used for the unit conversion in 
> the formatBytes function is 1000, but it should be 1024.
> function formatBytes(bytes, type) {
>   if (type !== 'display') return bytes;
>   if (bytes == 0) return '0.0 B';
>   var k = 1000;
>   var dm = 1;
>   var sizes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'];
>   var i = Math.floor(Math.log(bytes) / Math.log(k));
>   return parseFloat((bytes / Math.pow(k, i)).toFixed(dm)) + ' ' + sizes[i];
> }
>  
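
As a small editor's sketch of the base-1000 vs base-1024 difference referred 
to in the release-note text above (the 1024000-byte figure is taken from that 
note):

{code:scala}
val bytes = 1024000L

// Base-1000 ("decimal") kilobytes, the base used by the old formatBytes():
val kb  = bytes / 1000.0   // 1024.0, rendered as "1024.0 KB"

// Base-1024 ("binary") kibibytes, the base after the fix:
val kib = bytes / 1024.0   // 1000.0, rendered as "1000.0 KiB"
{code}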



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26310) Verification of JSON options

2018-12-07 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26310:
--

 Summary: Verification of JSON options
 Key: SPARK-26310
 URL: https://issues.apache.org/jira/browse/SPARK-26310
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


For JSON options used only in write, the following exception should be raised 
if those options are used in read. The same exception should be raised in the 
opposite case when read option is used in write:
{code}
java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull" is not 
applicable in write.
{code}

The verification can be disabled via the SQL config: 
{code}
spark.sql.verifyDataSourceOptions
{code}
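
As an editor's sketch of the behaviour being proposed (the exception and the 
spark.sql.verifyDataSourceOptions config come from this ticket and do not 
exist in released Spark; paths are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-option-check-sketch")
  .master("local[*]").getOrCreate()

val df = spark.range(10).toDF("id")

// "dropFieldIfAllNull" is a read-side JSON option; under the proposed check,
// passing it on write would fail with:
//   java.lang.IllegalArgumentException: The JSON option "dropFieldIfAllNull"
//   is not applicable in write.
df.write.option("dropFieldIfAllNull", "true").json("/tmp/json-option-sketch")

// Proposed escape hatch to turn the verification off:
spark.conf.set("spark.sql.verifyDataSourceOptions", "false")
{code}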



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26196) Total tasks message in the stage is incorrect, when there are failed or killed tasks

2018-12-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26196.
---
   Resolution: Fixed
 Assignee: shahid
Fix Version/s: 3.0.0

Resolved by https://github.com/apache/spark/pull/23160

> Total tasks message in the stage is incorrect, when there are failed or 
> killed tasks
> 
>
> Key: SPARK-26196
> URL: https://issues.apache.org/jira/browse/SPARK-26196
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: shahid
>Assignee: shahid
>Priority: Major
> Fix For: 3.0.0
>
>
> Total tasks message in the stage page is incorrect when there are failed or 
> killed tasks.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26281) Duration column of task table should be executor run time instead of real duration

2018-12-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26281.
---
   Resolution: Fixed
 Assignee: shahid
Fix Version/s: 3.0.0

Resolved by https://github.com/apache/spark/pull/23160

> Duration column of task table should be executor run time instead of real 
> duration
> --
>
> Key: SPARK-26281
> URL: https://issues.apache.org/jira/browse/SPARK-26281
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: shahid
>Priority: Major
> Fix For: 3.0.0
>
>
> In PR https://github.com/apache/spark/pull/23081/ , the duration column is 
> changed to executor run time. The behavior is consistent with the summary 
> metrics table and previous Spark version.
> However, after PR https://github.com/apache/spark/pull/21688, the issue can 
> be reproduced again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26309) Verification of Data source options

2018-12-07 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26309:
--

 Summary: Verification of Data source options
 Key: SPARK-26309
 URL: https://issues.apache.org/jira/browse/SPARK-26309
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, the applicability of datasource options passed to DataFrameReader 
and DataFrameWriter is not fully checked. For example, if an option is only 
valid for write, it is silently ignored on read. Such behavior of the built-in 
datasources usually confuses users. This ticket aims to implement additional 
verification of datasource options and to detect option misuse. 
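
One concrete flavour of the problem, as an illustrative editor's sketch (the 
choice of option and the paths are the editor's, not taken from the ticket):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("datasource-option-sketch")
  .master("local[*]").getOrCreate()

// "compression" is a write-side option for the JSON datasource.
spark.range(5).toDF("id").write
  .option("compression", "gzip")
  .json("/tmp/datasource-option-sketch")

// Passing the same option on read is accepted and silently ignored today,
// which is exactly the kind of misuse this ticket proposes to detect.
val back = spark.read.option("compression", "gzip").json("/tmp/datasource-option-sketch")
back.show()
{code}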



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26281) Duration column of task table should be executor run time instead of real duration

2018-12-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26281:
--
  Priority: Minor  (was: Major)
Issue Type: Bug  (was: Improvement)

> Duration column of task table should be executor run time instead of real 
> duration
> --
>
> Key: SPARK-26281
> URL: https://issues.apache.org/jira/browse/SPARK-26281
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: shahid
>Priority: Minor
> Fix For: 3.0.0
>
>
> In PR https://github.com/apache/spark/pull/23081/ , the duration column is 
> changed to executor run time. The behavior is consistent with the summary 
> metrics table and previous Spark version.
> However, after PR https://github.com/apache/spark/pull/21688, the issue can 
> be reproduced again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25299) Use remote storage for persisting shuffle data

2018-12-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25299:
--

Assignee: (was: Marcelo Vanzin)

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25299) Use remote storage for persisting shuffle data

2018-12-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25299:
--

Assignee: Marcelo Vanzin

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Assignee: Marcelo Vanzin
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26294) Delete Unnecessary If statement

2018-12-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26294.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23247
[https://github.com/apache/spark/pull/23247]

> Delete Unnecessary If statement
> ---
>
> Key: SPARK-26294
> URL: https://issues.apache.org/jira/browse/SPARK-26294
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wangjiaochun
>Assignee: wangjiaochun
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Delete an unnecessary if statement: the branch can never execute when 
> records is less than or equal to zero, because that code path is only 
> reached when records is greater than zero.
>  
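
The actual code is in the linked pull request; purely as an editor's sketch of 
the general pattern being described (the method and variable names here are 
hypothetical, not Spark's):

{code:scala}
// Hypothetical shape of the change: the outer guard already ensures
// records > 0, so the inner check can never be true and can be deleted.
def maybeSpill(records: Long): Unit = {
  if (records > 0) {
    if (records <= 0) {   // unreachable branch, removed by the change
      return
    }
    // ... actual work ...
  }
}
{code}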



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26294) Delete Unnecessary If statement

2018-12-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26294:
-

Assignee: wangjiaochun

> Delete Unnecessary If statement
> ---
>
> Key: SPARK-26294
> URL: https://issues.apache.org/jira/browse/SPARK-26294
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wangjiaochun
>Assignee: wangjiaochun
>Priority: Trivial
>
> Delete an unnecessary if statement: the branch can never execute when 
> records is less than or equal to zero, because that code path is only 
> reached when records is greater than zero.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24207) PrefixSpan: R API

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713242#comment-16713242
 ] 

Apache Spark commented on SPARK-24207:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23256

> PrefixSpan: R API
> -
>
> Key: SPARK-24207
> URL: https://issues.apache.org/jira/browse/SPARK-24207
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite

2018-12-07 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713231#comment-16713231
 ] 

Gabor Somogyi commented on SPARK-26306:
---

I tested it on my local machine in a loop and it never appeared.

> Flaky test: org.apache.spark.util.collection.SorterSuite
> 
>
> Key: SPARK-26306
> URL: https://issues.apache.org/jira/browse/SPARK-26306
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In PR builder the following issue appeared:
> {code:java}
> [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 
> seconds, 225 milliseconds)
> [info]   java.lang.OutOfMemoryError: Java heap space
> [info]   at 
> org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56)
> [info]   at 
> org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43)
> [info]   at 
> org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70)
> [info]   at 
> org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown
>  Source)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
> [info]   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown 
> Source)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> [info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
> [info]   at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown 
> Source)
> [info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown 
> Source)
> [info]   at scala.collection.immutable.List.foreach(List.scala:388)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
> [info]   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
> [info]   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
> [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
> [info]   at org.scalatest.Suite.run(Suite.scala:1147)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1129)
> [error] Uncaught exception when running 
> org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: 
> Java heap space
> sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56)
>   at 
> org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43)
>   at 
> org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70)
>   at 
> org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown 
> Source)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at 

[jira] [Created] (SPARK-26308) Large BigDecimal value is converted to null when passed into a UDF

2018-12-07 Thread Jay Pranavamurthi (JIRA)
Jay Pranavamurthi created SPARK-26308:
-

 Summary: Large BigDecimal value is converted to null when passed 
into a UDF
 Key: SPARK-26308
 URL: https://issues.apache.org/jira/browse/SPARK-26308
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Jay Pranavamurthi


We are loading a Hive table into a Spark DataFrame. The Hive table has a 
decimal(30, 0) column with values greater than Long.MAX_VALUE. The DataFrame 
loads correctly.

We then use a UDF to convert the decimal type to a String value. For decimal 
values < Long.MAX_VALUE, this works fine, but when the decimal value > 
Long.MAX_VALUE, the input to the UDF is a *null*.

Hive table schema and data:
{code:java}
create table decimal_test (col1 decimal(30, 0), col2 decimal(10, 0), col3 int, 
col4 string);
insert into decimal_test values(20110002456556, 123456789, 10, 'test1');
{code}
 

Execution in spark-shell:

_(Note that the first column in the final output is null; it should have been 
"20110002456556")_
{code:java}
scala> val df1 = spark.sqlContext.sql("select * from decimal_test")
df1: org.apache.spark.sql.DataFrame = [col1: decimal(30,0), col2: decimal(10,0) 
... 2 more fields]

scala> df1.show
+------------+---------+----+-----+
|        col1|     col2|col3| col4|
+------------+---------+----+-----+
|201100024...|123456789|  10|test1|
+------------+---------+----+-----+


scala> val decimalToString = (value: java.math.BigDecimal) => if (value == 
null) null else { value.toBigInteger().toString }
decimalToString: java.math.BigDecimal => String = 

scala> val udf1 = org.apache.spark.sql.functions.udf(decimalToString)
udf1: org.apache.spark.sql.expressions.UserDefinedFunction = 
UserDefinedFunction(,StringType,Some(List(DecimalType(38,18

scala> val df2 = df1.withColumn("col1", udf1(df1.col("col1")))
df2: org.apache.spark.sql.DataFrame = [col1: string, col2: decimal(10,0) ... 2 
more fields]

scala> df2.show
+----+---------+----+-----+
|col1|     col2|col3| col4|
+----+---------+----+-----+
|null|123456789|  10|test1|
+----+---------+----+-----+
{code}
Oddly, this works if we change the "decimalToString" UDF to take an "Any" 
instead of a "java.math.BigDecimal":
{code:java}
scala> val decimalToString = (value: Any) => if (value == null) null else { if 
(value.isInstanceOf[java.math.BigDecimal]) 
value.asInstanceOf[java.math.BigDecimal].toBigInteger().toString else null }
decimalToString: Any => String = 

scala> val udf1 = org.apache.spark.sql.functions.udf(decimalToString)
udf1: org.apache.spark.sql.expressions.UserDefinedFunction = 
UserDefinedFunction(,StringType,None)

scala> val df2 = df1.withColumn("col1", udf1(df1.col("col1")))
df2: org.apache.spark.sql.DataFrame = [col1: string, col2: decimal(10,0) ... 2 
more fields]

scala> df2.show
+------------+---------+----+-----+
|        col1|     col2|col3| col4|
+------------+---------+----+-----+
|201100024...|123456789|  10|test1|
+------------+---------+----+-----+
{code}
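
A side note from the editor, not part of the original report: when the goal is 
only a decimal-to-string conversion, a plain column cast avoids the UDF 
input-conversion path that loses the value here.

{code:scala}
// Editor's sketch; df1 is the DataFrame loaded from decimal_test above.
val df3 = df1.withColumn("col1", df1.col("col1").cast("string"))
df3.show()   // col1 retains its full decimal value
{code}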



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24243) Expose exceptions from InProcessAppHandle

2018-12-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24243:
--

Assignee: Sahil Takiar

> Expose exceptions from InProcessAppHandle
> -
>
> Key: SPARK-24243
> URL: https://issues.apache.org/jira/browse/SPARK-24243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Fix For: 3.0.0
>
>
> {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread, any 
> exceptions thrown are logged and then the state is set to {{FAILED}}. It 
> would be nice to expose the {{Throwable}} object  to the application rather 
> than logging it and dropping it. Applications may want to manipulate the 
> underlying {{Throwable}} / control its logging at a finer granularity. For 
> example, the app might want to call 
> {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to 
> the app users.
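
For context, an editor's sketch of how an application might consume the 
surfaced failure through the in-process launcher API (the main class is a 
placeholder, and the getError()-style accessor is assumed from the linked pull 
request rather than verified here):

{code:scala}
import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

val handle = new InProcessLauncher()
  .setMainClass("com.example.Main")   // hypothetical application entry point
  .setMaster("local[*]")
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit = {
      if (h.getState == SparkAppHandle.State.FAILED) {
        // With this change the Throwable is surfaced to the caller instead of
        // only being logged before the state flips to FAILED.
        h.getError.ifPresent(t => println(s"Root cause: ${t.getMessage}"))
      }
    }
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })
{code}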



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24243) Expose exceptions from InProcessAppHandle

2018-12-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24243.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23221
[https://github.com/apache/spark/pull/23221]

> Expose exceptions from InProcessAppHandle
> -
>
> Key: SPARK-24243
> URL: https://issues.apache.org/jira/browse/SPARK-24243
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Fix For: 3.0.0
>
>
> {{InProcessAppHandle}} runs {{SparkSubmit}} in a dedicated thread, any 
> exceptions thrown are logged and then the state is set to {{FAILED}}. It 
> would be nice to expose the {{Throwable}} object  to the application rather 
> than logging it and dropping it. Applications may want to manipulate the 
> underlying {{Throwable}} / control its logging at a finer granularity. For 
> example, the app might want to call 
> {{Throwables.getRootCause(throwable).getMessage()}} and expose the message to 
> the app users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713159#comment-16713159
 ] 

Dongjoon Hyun commented on SPARK-26282:
---

Thank you for sharing, [~shaneknapp]!

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713142#comment-16713142
 ] 

shane knapp commented on SPARK-26282:
-

btw, all of the compile and lint jobs have been running on java 8 191 for the 
past couple of days, and are happy and green:

[https://amplab.cs.berkeley.edu/jenkins/label/ubuntu/]

 

 

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26307) Fix CTAS when INSERT a partitioned table using Hive serde

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713140#comment-16713140
 ] 

Apache Spark commented on SPARK-26307:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/23255

> Fix CTAS when INSERT a partitioned table using Hive serde
> -
>
> Key: SPARK-26307
> URL: https://issues.apache.org/jira/browse/SPARK-26307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> {code:java}
> withTable("hive_test") {
>   withSQLConf(
>   "hive.exec.dynamic.partition.mode" -> "nonstrict") {
> val df = Seq(("a", 100)).toDF("part", "id")
> df.write.format("hive").partitionBy("part")
>   .mode("overwrite").saveAsTable("hive_test")
> df.write.format("hive").partitionBy("part")
>   .mode("append").saveAsTable("hive_test")
>   }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26307) Fix CTAS when INSERT a partitioned table using Hive serde

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26307:


Assignee: Apache Spark  (was: Xiao Li)

> Fix CTAS when INSERT a partitioned table using Hive serde
> -
>
> Key: SPARK-26307
> URL: https://issues.apache.org/jira/browse/SPARK-26307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> withTable("hive_test") {
>   withSQLConf(
>   "hive.exec.dynamic.partition.mode" -> "nonstrict") {
> val df = Seq(("a", 100)).toDF("part", "id")
> df.write.format("hive").partitionBy("part")
>   .mode("overwrite").saveAsTable("hive_test")
> df.write.format("hive").partitionBy("part")
>   .mode("append").saveAsTable("hive_test")
>   }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26307) Fix CTAS when INSERT a partitioned table using Hive serde

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26307:


Assignee: Xiao Li  (was: Apache Spark)

> Fix CTAS when INSERT a partitioned table using Hive serde
> -
>
> Key: SPARK-26307
> URL: https://issues.apache.org/jira/browse/SPARK-26307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> {code:java}
> withTable("hive_test") {
>   withSQLConf(
>   "hive.exec.dynamic.partition.mode" -> "nonstrict") {
> val df = Seq(("a", 100)).toDF("part", "id")
> df.write.format("hive").partitionBy("part")
>   .mode("overwrite").saveAsTable("hive_test")
> df.write.format("hive").partitionBy("part")
>   .mode("append").saveAsTable("hive_test")
>   }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26267) Kafka source may reprocess data

2018-12-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26267:
-
Priority: Blocker  (was: Major)

> Kafka source may reprocess data
> ---
>
> Key: SPARK-26267
> URL: https://issues.apache.org/jira/browse/SPARK-26267
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness
>
> Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it 
> may get an earliest offset, and then it will reprocess messages that have 
> been processed when it gets the correct latest offset in the next batch.
> This usually happens when restarting a streaming query.
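
Not a fix for the underlying KAFKA-7703 race, but as an editor's sketch of a 
common defensive pattern against re-delivered records in the meantime (broker, 
topic and the eventId/eventTime fields are hypothetical, and spark is an 
existing SparkSession):

{code:scala}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val schema = new StructType()
  .add("eventId", StringType)
  .add("eventTime", TimestampType)

val deduped = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("e"))
  .select("e.*")
  // Drop records that may be re-processed if an offset range is read twice.
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("eventId", "eventTime")
{code}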



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26267) Kafka source may reprocess data

2018-12-07 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26267:
-
Labels: correctness  (was: )

> Kafka source may reprocess data
> ---
>
> Key: SPARK-26267
> URL: https://issues.apache.org/jira/browse/SPARK-26267
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>  Labels: correctness
>
> Due to KAFKA-7703, when the Kafka source tries to get the latest offset, it 
> may get an earliest offset, and then it will reprocess messages that have 
> been processed when it gets the correct latest offset in the next batch.
> This usually happens when restarting a streaming query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713134#comment-16713134
 ] 

shane knapp commented on SPARK-26282:
-

test build now running:

https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.4-test-maven-hadoop-2.7-java-8.191/1

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26307) Fix CTAS when INSERT a partitioned table using Hive serde

2018-12-07 Thread Xiao Li (JIRA)
Xiao Li created SPARK-26307:
---

 Summary: Fix CTAS when INSERT a partitioned table using Hive serde
 Key: SPARK-26307
 URL: https://issues.apache.org/jira/browse/SPARK-26307
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.2
Reporter: Xiao Li
Assignee: Xiao Li


{code:java}
withTable("hive_test") {
  withSQLConf(
  "hive.exec.dynamic.partition.mode" -> "nonstrict") {
val df = Seq(("a", 100)).toDF("part", "id")
df.write.format("hive").partitionBy("part")
  .mode("overwrite").saveAsTable("hive_test")
df.write.format("hive").partitionBy("part")
  .mode("append").saveAsTable("hive_test")
  }
}{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713118#comment-16713118
 ] 

shane knapp commented on SPARK-26282:
-

i'm waiting on someone from databricks to merge.  i pinged the PR and it should 
hopefully happen today.

after the new years, i am planning on moving these configs to the spark repo.

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713093#comment-16713093
 ] 

Dongjoon Hyun commented on SPARK-26282:
---

Hi, [~shaneknapp]. Is there any update on your PR to update the jenkins job?

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running

2018-12-07 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-26283:
---
Priority: Major  (was: Minor)

> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running
> -
>
> Key: SPARK-26283
> URL: https://issues.apache.org/jira/browse/SPARK-26283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> When zstd compression is enabled, an in-progress application in the history 
> server app UI shows a finished job as running.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-12-07 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16713030#comment-16713030
 ] 

Mihaly Toth commented on SPARK-25331:
-

I have closed my PR. I guess it should be documented that we expect users to 
read only the files whose names are written to the manifest files.

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and the resulting task sets 
> complete, but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in the data being written twice.
> In other words, if the driver fails after the executors start processing the 
> job, the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # verify the file output - according to the {{Sink.addBatch}} documentation, 
> the RDD should be written only once
> I have created a wip PR with a unit test:
> https://github.com/apache/spark/pull/22331
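
A short editor's illustration of the point in the comment above, assuming an 
existing SparkSession named spark: readers that go through Spark's file source 
only see files recorded in the sink's _spark_metadata log, while raw directory 
listings may also pick up files from batches whose commit never completed 
("/data/out" is a hypothetical FileStreamSink output directory).

{code:scala}
// Batch-reading a FileStreamSink output directory through Spark honours the
// _spark_metadata manifest, so uncommitted or duplicated files are not visible.
val committedOnly = spark.read.parquet("/data/out")

// By contrast, listing the directory directly (Hadoop FileSystem APIs, external
// query engines, etc.) can surface data files that were never committed.
{code}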



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26305) Breakthrough the memory limitation of broadcast join

2018-12-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712991#comment-16712991
 ] 

Dongjoon Hyun edited comment on SPARK-26305 at 12/7/18 3:43 PM:


+1 for the issue. I'll take a look when the design doc is given.


was (Author: dongjoon):
+1 for the idea.

> Breakthrough the memory limitation of broadcast join
> 
>
> Key: SPARK-26305
> URL: https://issues.apache.org/jira/browse/SPARK-26305
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Lantao Jin
>Priority: Major
>
> If a join between a big table and a small one hits a data-skew issue, we 
> usually use a broadcast hint in SQL to resolve it. However, the current 
> broadcast join has many limitations. The primary restriction is memory: the 
> small table being broadcast must fit entirely in memory on the driver and 
> executor side. Although it will spill to disk when memory is insufficient, it 
> still causes OOM when the small table is not absolutely small, only 
> relatively small. In our company, we have many real big-data SQL analysis 
> jobs that join and shuffle hundreds of terabytes of data. For example, the 
> large table is 100TB and the small one is 10,000 times smaller, but still 
> 10GB. In this case, the broadcast join cannot finish, since the small table 
> is still larger than expected. And if the join is skewed, the sort-merge join 
> always fails.
> Hive has a skew join hint which triggers a two-stage job to handle the skewed 
> keys and the normal keys separately. I guess Databricks Runtime has a similar 
> implementation. However, the skew join hint requires SQL users to know the 
> data in their tables like their own children: they must know which key is 
> skewed in a join. That is very hard, since the data changes day by day and 
> the join key is not fixed across queries. Users have to set a huge partition 
> number to try their luck.
> So, do we have a simple, crude and efficient way to resolve this? Back to the 
> limitation: if the broadcast table did not need to fit in memory, in other 
> words, if the driver/executors stored the broadcast table on disk only, the 
> problem mentioned above could be resolved.
> A new hint like BROADCAST_DISK, or an additional parameter on the original 
> BROADCAST hint, would be introduced to cover this case. The original 
> broadcast behavior won't be changed.
> I will offer a design doc if you have the same feeling about it.
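
For reference, an editor's sketch of today's broadcast hint (the BROADCAST_DISK 
variant above is only a proposal and does not exist; spark is an existing 
SparkSession and the tables are toy stand-ins):

{code:scala}
import org.apache.spark.sql.functions.broadcast

val largeDf = spark.range(0L, 1000000L).toDF("key")
val smallDf = spark.range(0L, 100L).toDF("key")

// Existing behaviour: the hinted relation is built in memory on the driver
// and shipped to every executor.
val joined = largeDf.join(broadcast(smallDf), Seq("key"))

// Equivalent SQL form:
//   SELECT /*+ BROADCAST(s) */ * FROM large l JOIN small s ON l.key = s.key
{code}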



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26305) Breakthrough the memory limitation of broadcast join

2018-12-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712991#comment-16712991
 ] 

Dongjoon Hyun commented on SPARK-26305:
---

+1 for the idea.

> Breakthrough the memory limitation of broadcast join
> 
>
> Key: SPARK-26305
> URL: https://issues.apache.org/jira/browse/SPARK-26305
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Lantao Jin
>Priority: Major
>
> If a join between a big table and a small one hits a data-skew issue, we 
> usually use a broadcast hint in SQL to resolve it. However, the current 
> broadcast join has many limitations. The primary restriction is memory: the 
> small table being broadcast must fit entirely in memory on the driver and 
> executor side. Although it will spill to disk when memory is insufficient, it 
> still causes OOM when the small table is not absolutely small, only 
> relatively small. In our company, we have many real big-data SQL analysis 
> jobs that join and shuffle hundreds of terabytes of data. For example, the 
> large table is 100TB and the small one is 10,000 times smaller, but still 
> 10GB. In this case, the broadcast join cannot finish, since the small table 
> is still larger than expected. And if the join is skewed, the sort-merge join 
> always fails.
> Hive has a skew join hint which triggers a two-stage job to handle the skewed 
> keys and the normal keys separately. I guess Databricks Runtime has a similar 
> implementation. However, the skew join hint requires SQL users to know the 
> data in their tables like their own children: they must know which key is 
> skewed in a join. That is very hard, since the data changes day by day and 
> the join key is not fixed across queries. Users have to set a huge partition 
> number to try their luck.
> So, do we have a simple, crude and efficient way to resolve this? Back to the 
> limitation: if the broadcast table did not need to fit in memory, in other 
> words, if the driver/executors stored the broadcast table on disk only, the 
> problem mentioned above could be resolved.
> A new hint like BROADCAST_DISK, or an additional parameter on the original 
> BROADCAST hint, would be introduced to cover this case. The original 
> broadcast behavior won't be changed.
> I will offer a design doc if you have the same feeling about it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26305) Breakthrough the memory limitation of broadcast join

2018-12-07 Thread Lantao Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-26305:
---
Description: 
If a join between a big table and a small one hits a data-skew issue, we 
usually use a broadcast hint in SQL to resolve it. However, the current 
broadcast join has many limitations. The primary restriction is memory: the 
small table being broadcast must fit entirely in memory on the driver and 
executor side. Although it will spill to disk when memory is insufficient, it 
still causes OOM when the small table is not absolutely small, only relatively 
small. In our company, we have many real big-data SQL analysis jobs that join 
and shuffle hundreds of terabytes of data. For example, the large table is 
100TB and the small one is 10,000 times smaller, but still 10GB. In this case, 
the broadcast join cannot finish, since the small table is still larger than 
expected. And if the join is skewed, the sort-merge join always fails.

Hive has a skew join hint which triggers a two-stage job to handle the skewed 
keys and the normal keys separately. I guess Databricks Runtime has a similar 
implementation. However, the skew join hint requires SQL users to know the 
data in their tables like their own children: they must know which key is 
skewed in a join. That is very hard, since the data changes day by day and the 
join key is not fixed across queries. Users have to set a huge partition 
number to try their luck.

So, do we have a simple, crude and efficient way to resolve this? Back to the 
limitation: if the broadcast table did not need to fit in memory, in other 
words, if the driver/executors stored the broadcast table on disk only, the 
problem mentioned above could be resolved.

A new hint like BROADCAST_DISK, or an additional parameter on the original 
BROADCAST hint, would be introduced to cover this case. The original broadcast 
behavior won't be changed.

I will offer a design doc if you have the same feeling about it.

  was:
If a join between a big table and a small one faces a data skew issue, we 
usually use a broadcast hint in SQL to resolve it. However, the current 
broadcast join has many limitations. The primary restriction is memory: the 
small table which is broadcasted must be fully loaded into memory on the 
driver/executor side. Although it will spill to disk when memory is 
insufficient, it still causes OOM if the small table is not absolutely small, 
only relatively small. In our company, we have many real big-data SQL analysis 
jobs which handle joins and shuffles over dozens to hundreds of terabytes. For 
example, the large table is 100TB and the small one is 10,000 times smaller, 
still 10GB. In this case, the broadcast join can't finish since the small 
table is still larger than expected, and if the join is skewed, the sort-merge 
join always fails.

Hive has a skew join hint which can trigger a two-stage task to handle the 
skewed keys and normal keys separately. I guess Databricks Runtime has a 
similar implementation. However, the skew join hint requires SQL users to know 
the data in their tables intimately: they must know which key is skewed in a 
join. That is very hard to know, since the data changes day by day and the 
join key isn't fixed across queries. Users end up setting a huge partition 
number and trying their luck.

So, is there a simple, blunt and efficient way to resolve it? Going back to 
the limitation: if the broadcasted table did not need to be held in memory, in 
other words, if the driver/executor stored the broadcasted table on disk only, 
the problem mentioned above could be resolved.

I will offer a design doc if others feel the same way.


> Breakthrough the memory limitation of broadcast join
> 
>
> Key: SPARK-26305
> URL: https://issues.apache.org/jira/browse/SPARK-26305
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Lantao Jin
>Priority: Major
>
> If a join between a big table and a small one faces a data skew issue, we 
> usually use a broadcast hint in SQL to resolve it. However, the current 
> broadcast join has many limitations. The primary restriction is memory: the 
> small table which is broadcasted must be fully loaded into memory on the 
> driver/executor side. Although it will spill to disk when memory is 
> insufficient, it still causes OOM if the small table is not absolutely small, 
> only relatively small. In our company, we have many real big-data SQL 
> analysis jobs which handle joins and shuffles over dozens to hundreds of 
> terabytes. For example, the large table is 100TB and the small one is 10,000 
> times smaller, still 10GB. In this case, the broadcast join couldn't be 
> finished since the small one is still larger than expected. If the join is 

[jira] [Commented] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite

2018-12-07 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712975#comment-16712975
 ] 

Gabor Somogyi commented on SPARK-26306:
---

No idea, I've seen it only in the PR builder and thought I'd file it to help others.

> Flaky test: org.apache.spark.util.collection.SorterSuite
> 
>
> Key: SPARK-26306
> URL: https://issues.apache.org/jira/browse/SPARK-26306
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In PR builder the following issue appeared:
> {code:java}
> [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 
> seconds, 225 milliseconds)
> [info]   java.lang.OutOfMemoryError: Java heap space
> [info]   at 
> org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56)
> [info]   at 
> org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43)
> [info]   at 
> org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70)
> [info]   at 
> org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown
>  Source)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
> [info]   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown 
> Source)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> [info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
> [info]   at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown 
> Source)
> [info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown 
> Source)
> [info]   at scala.collection.immutable.List.foreach(List.scala:388)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
> [info]   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
> [info]   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
> [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
> [info]   at org.scalatest.Suite.run(Suite.scala:1147)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1129)
> [error] Uncaught exception when running 
> org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: 
> Java heap space
> sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56)
>   at 
> org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43)
>   at 
> org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70)
>   at 
> org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown 
> Source)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at 

[jira] [Commented] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite

2018-12-07 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712970#comment-16712970
 ] 

Liang-Chi Hsieh commented on SPARK-26306:
-

Besides the above build, is there any other build where this test fails?

> Flaky test: org.apache.spark.util.collection.SorterSuite
> 
>
> Key: SPARK-26306
> URL: https://issues.apache.org/jira/browse/SPARK-26306
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In PR builder the following issue appeared:
> {code:java}
> [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 
> seconds, 225 milliseconds)
> [info]   java.lang.OutOfMemoryError: Java heap space
> [info]   at 
> org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56)
> [info]   at 
> org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43)
> [info]   at 
> org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70)
> [info]   at 
> org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown
>  Source)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
> [info]   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown 
> Source)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> [info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
> [info]   at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown 
> Source)
> [info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown 
> Source)
> [info]   at scala.collection.immutable.List.foreach(List.scala:388)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
> [info]   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
> [info]   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
> [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
> [info]   at org.scalatest.Suite.run(Suite.scala:1147)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1129)
> [error] Uncaught exception when running 
> org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: 
> Java heap space
> sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56)
>   at 
> org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43)
>   at 
> org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70)
>   at 
> org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown 
> Source)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at 

[jira] [Updated] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite

2018-12-07 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26306:
--
Component/s: (was: Spark Core)
 Tests

> Flaky test: org.apache.spark.util.collection.SorterSuite
> 
>
> Key: SPARK-26306
> URL: https://issues.apache.org/jira/browse/SPARK-26306
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In PR builder the following issue appeared:
> {code:java}
> [info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 
> seconds, 225 milliseconds)
> [info]   java.lang.OutOfMemoryError: Java heap space
> [info]   at 
> org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56)
> [info]   at 
> org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43)
> [info]   at 
> org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70)
> [info]   at 
> org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown
>  Source)
> [info]   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info]   at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
> [info]   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown 
> Source)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> [info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> [info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> [info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
> [info]   at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown 
> Source)
> [info]   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown 
> Source)
> [info]   at scala.collection.immutable.List.foreach(List.scala:388)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
> [info]   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
> [info]   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
> [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
> [info]   at org.scalatest.Suite.run(Suite.scala:1147)
> [info]   at org.scalatest.Suite.run$(Suite.scala:1129)
> [error] Uncaught exception when running 
> org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: 
> Java heap space
> sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56)
>   at 
> org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43)
>   at 
> org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70)
>   at 
> org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown 
> Source)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> 

[jira] [Created] (SPARK-26306) Flaky test: org.apache.spark.util.collection.SorterSuite

2018-12-07 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-26306:
-

 Summary: Flaky test: org.apache.spark.util.collection.SorterSuite
 Key: SPARK-26306
 URL: https://issues.apache.org/jira/browse/SPARK-26306
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


In PR builder the following issue appeared:

{code:java}
[info] org.apache.spark.util.collection.SorterSuite *** ABORTED *** (3 seconds, 
225 milliseconds)
[info]   java.lang.OutOfMemoryError: Java heap space
[info]   at 
org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56)
[info]   at 
org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43)
[info]   at 
org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70)
[info]   at 
org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown
 Source)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
[info]   at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
[info]   at 
org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
[info]   at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown 
Source)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
[info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
[info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
[info]   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
[info]   at 
org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
[info]   at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown 
Source)
[info]   at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
[info]   at org.scalatest.SuperEngine$$Lambda$129/1905082148.apply(Unknown 
Source)
[info]   at scala.collection.immutable.List.foreach(List.scala:388)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
[info]   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
[info]   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
[info]   at org.scalatest.Suite.run(Suite.scala:1147)
[info]   at org.scalatest.Suite.run$(Suite.scala:1129)
[error] Uncaught exception when running 
org.apache.spark.util.collection.SorterSuite: java.lang.OutOfMemoryError: Java 
heap space
sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.util.collection.TestTimSort.createArray(TestTimSort.java:56)
at 
org.apache.spark.util.collection.TestTimSort.getTimSortBugTestSet(TestTimSort.java:43)
at 
org.apache.spark.util.collection.SorterSuite.$anonfun$new$8(SorterSuite.scala:70)
at 
org.apache.spark.util.collection.SorterSuite$$Lambda$11365/360747485.apply$mcV$sp(Unknown
 Source)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike$$Lambda$132/1886906768.apply(Unknown 
Source)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
at 
org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike$$Lambda$128/398936629.apply(Unknown 
Source)
at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)

[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator

2018-12-07 Thread qian han (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712905#comment-16712905
 ] 

qian han commented on SPARK-26265:
--

Okay

> deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
> --
>
> Key: SPARK-26265
> URL: https://issues.apache.org/jira/browse/SPARK-26265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: qian han
>Priority: Major
>
> The application is running on a cluster with 72000 cores and 182000G mem.
> Environment:
> |spark.dynamicAllocation.minExecutors|5|
> |spark.dynamicAllocation.initialExecutors|30|
> |spark.dynamicAllocation.maxExecutors|400|
> |spark.executor.cores|4|
> |spark.executor.memory|20g|
>  
>   
> Stage description:
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364)
>  org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422) 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357) 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:193)
>  
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  java.lang.reflect.Method.invoke(Method.java:498) 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
>  org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) 
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) 
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>  
> jstack information as follow:
> Found one Java-level deadlock: = 
> "Thread-ScriptTransformation-Feed": waiting to lock monitor 
> 0x00e0cb18 (object 0x0002f1641538, a 
> org.apache.spark.memory.TaskMemoryManager), which is held by "Executor task 
> launch worker for task 18899" "Executor task launch worker for task 18899": 
> waiting to lock monitor 0x00e09788 (object 0x000302faa3b0, a 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator), which is held by 
> "Thread-ScriptTransformation-Feed" Java stack information for the threads 
> listed above: === 
> "Thread-ScriptTransformation-Feed": at 
> org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332)
>  - waiting to lock <0x0002f1641538> (a 
> org.apache.spark.memory.TaskMemoryManager) at 
> org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) at 
> org.apache.spark.unsafe.map.BytesToBytesMap.access$300(BytesToBytesMap.java:66)
>  at 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:274)
>  - locked <0x000302faa3b0> (a 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator) at 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.next(BytesToBytesMap.java:313)
>  at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap$1.next(UnsafeFixedWidthAggregationMap.java:173)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformationExec.scala:281)
>  at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:270)
>  at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:270)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1995) at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformationExec.scala:270)
>  "Executor task launch worker for task 18899": at 
> org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.spill(BytesToBytesMap.java:345)
>  - waiting to lock <0x000302faa3b0> (a 
> 
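The quoted jstack shows a classic lock-ordering deadlock: each thread holds one 
monitor and waits for the one held by the other. A self-contained illustration of 
the same pattern follows; the objects are stand-ins, not Spark's actual 
TaskMemoryManager/BytesToBytesMap code.

{code:scala}
// Minimal illustration of the lock-ordering deadlock reported above.
object DeadlockSketch {
  private val memoryManagerLock = new Object // plays the TaskMemoryManager role
  private val mapIteratorLock   = new Object // plays the MapIterator role

  private def worker(first: AnyRef, second: AnyRef, name: String): Thread =
    new Thread(new Runnable {
      def run(): Unit = first.synchronized {
        Thread.sleep(100) // give the other thread time to take its lock
        second.synchronized { println(s"$name acquired both locks") }
      }
    }, name)

  def main(args: Array[String]): Unit = {
    // Feed thread: holds the iterator lock, then wants the memory manager lock
    // (like MapIterator.advanceToNextPage -> TaskMemoryManager.freePage).
    worker(mapIteratorLock, memoryManagerLock, "Thread-ScriptTransformation-Feed").start()
    // Task thread: holds the memory manager lock, then wants the iterator lock
    // (like a spill request -> MapIterator.spill). Opposite order => deadlock.
    worker(memoryManagerLock, mapIteratorLock, "Executor task launch worker").start()
  }
}
{code}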

[jira] [Commented] (SPARK-25401) Reorder the required ordering to match the table's output ordering for bucket join

2018-12-07 Thread Wang, Gang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712873#comment-16712873
 ] 

Wang, Gang commented on SPARK-25401:


Yeah. I think so. 

And please make sure the outputOrdering of SortMergeJoin is aligned with the 
reordered keys.

> Reorder the required ordering to match the table's output ordering for bucket 
> join
> --
>
> Key: SPARK-25401
> URL: https://issues.apache.org/jira/browse/SPARK-25401
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wang, Gang
>Priority: Major
>
> Currently, we check whether a SortExec is needed between an operator and its 
> child operator in the method orderingSatisfies, and orderingSatisfies 
> requires the SortOrders to be in exactly the same order.
> However, take the following case into consideration.
>  * Table a is bucketed by (a1, a2), sorted by (a2, a1), and the number of 
> buckets is 200.
>  * Table b is bucketed by (b1, b2), sorted by (b2, b1), and the number of 
> buckets is 200.
>  * Table a joins table b on (a1=b1, a2=b2)
> In this case, if the join is a sort merge join, the query planner won't add 
> an exchange on either side, but a sort will be added on both sides. Actually, 
> the sort is also unnecessary, since within the same bucket, e.g. bucket 1 of 
> table a and bucket 1 of table b, (a1=b1, a2=b2) is equivalent to (a2=b2, a1=b1).
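A small Scala sketch of this scenario, assuming a SparkSession `spark` and source 
DataFrames `dfA`/`dfB` with the columns above (an illustration only, not code from 
the codebase):

{code:scala}
// Recreate the described layout: bucketed by (a1, a2) but sorted by (a2, a1).
dfA.write.bucketBy(200, "a1", "a2").sortBy("a2", "a1").saveAsTable("a")
dfB.write.bucketBy(200, "b1", "b2").sortBy("b2", "b1").saveAsTable("b")

// Same bucket count on both sides, so no Exchange is planned, but the plan
// still shows a SortExec on each side of the sort-merge join even though the
// per-bucket sort order (a2, a1) is equivalent for the join keys (a1, a2).
spark.sql(
  """SELECT *
    |FROM a JOIN b
    |ON a.a1 = b.b1 AND a.a2 = b.b2""".stripMargin).explain()
{code}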



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26305) Breakthrough the memory limitation of broadcast join

2018-12-07 Thread Lantao Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712855#comment-16712855
 ] 

Lantao Jin commented on SPARK-26305:


CC [~jiangxb1987]  [~cloud_fan] [~dongjoon] [~hyukjin.kwon], thoughts?

> Breakthrough the memory limitation of broadcast join
> 
>
> Key: SPARK-26305
> URL: https://issues.apache.org/jira/browse/SPARK-26305
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Lantao Jin
>Priority: Major
>
> If a join between a big table and a small one faces a data skew issue, we 
> usually use a broadcast hint in SQL to resolve it. However, the current 
> broadcast join has many limitations. The primary restriction is memory: the 
> small table which is broadcasted must be fully loaded into memory on the 
> driver/executor side. Although it will spill to disk when memory is 
> insufficient, it still causes OOM if the small table is not absolutely small, 
> only relatively small. In our company, we have many real big-data SQL 
> analysis jobs which handle joins and shuffles over dozens to hundreds of 
> terabytes. For example, the large table is 100TB and the small one is 10,000 
> times smaller, still 10GB. In this case, the broadcast join can't finish 
> since the small table is still larger than expected, and if the join is 
> skewed, the sort-merge join always fails.
> Hive has a skew join hint which can trigger a two-stage task to handle the 
> skewed keys and normal keys separately. I guess Databricks Runtime has a 
> similar implementation. However, the skew join hint requires SQL users to 
> know the data in their tables intimately: they must know which key is skewed 
> in a join. That is very hard to know, since the data changes day by day and 
> the join key isn't fixed across queries. Users end up setting a huge 
> partition number and trying their luck.
> So, is there a simple, blunt and efficient way to resolve it? Going back to 
> the limitation: if the broadcasted table did not need to be held in memory, 
> in other words, if the driver/executor stored the broadcasted table on disk 
> only, the problem mentioned above could be resolved.
> I will offer a design doc if others feel the same way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26305) Breakthrough the memory limitation of broadcast join

2018-12-07 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-26305:
--

 Summary: Breakthrough the memory limitation of broadcast join
 Key: SPARK-26305
 URL: https://issues.apache.org/jira/browse/SPARK-26305
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Lantao Jin


If a join between a big table and a small one faces a data skew issue, we 
usually use a broadcast hint in SQL to resolve it. However, the current 
broadcast join has many limitations. The primary restriction is memory: the 
small table which is broadcasted must be fully loaded into memory on the 
driver/executor side. Although it will spill to disk when memory is 
insufficient, it still causes OOM if the small table is not absolutely small, 
only relatively small. In our company, we have many real big-data SQL analysis 
jobs which handle joins and shuffles over dozens to hundreds of terabytes. For 
example, the large table is 100TB and the small one is 10,000 times smaller, 
still 10GB. In this case, the broadcast join can't finish since the small 
table is still larger than expected, and if the join is skewed, the sort-merge 
join always fails.

Hive has a skew join hint which can trigger a two-stage task to handle the 
skewed keys and normal keys separately. I guess Databricks Runtime has a 
similar implementation. However, the skew join hint requires SQL users to know 
the data in their tables intimately: they must know which key is skewed in a 
join. That is very hard to know, since the data changes day by day and the 
join key isn't fixed across queries. Users end up setting a huge partition 
number and trying their luck.

So, is there a simple, blunt and efficient way to resolve it? Going back to 
the limitation: if the broadcasted table did not need to be held in memory, in 
other words, if the driver/executor stored the broadcasted table on disk only, 
the problem mentioned above could be resolved.

I will offer a design doc if others feel the same way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26254) Move delegation token providers into a separate project

2018-12-07 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712843#comment-16712843
 ] 

Steve Loughran commented on SPARK-26254:


Maybe ask the Kafka people for opinions; [~jkreps] can probably nominate someone.

bq. token-providers provided dependency to kafka-sql project => It's kinda' 
weird but at the moment looks the least problematic

probably makes sense then

> Move delegation token providers into a separate project
> ---
>
> Key: SPARK-26254
> URL: https://issues.apache.org/jira/browse/SPARK-26254
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> There was a discussion in 
> [PR#22598|https://github.com/apache/spark/pull/22598] that there are several 
> provided dependencies inside core project which shouldn't be there (for ex. 
> hive and kafka). This jira is to solve this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26266) Update to Scala 2.12.8

2018-12-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26266:
--
Docs Text: Use Spark with the latest maintenance release of Java, for 
security and bug fixes, and to ensure compatibility with Scala.
   Labels: release-notes  (was: )

> Update to Scala 2.12.8
> --
>
> Key: SPARK-26266
> URL: https://issues.apache.org/jira/browse/SPARK-26266
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Yuming Wang
>Priority: Minor
>  Labels: release-notes
>
> [~yumwang] notes that Scala 2.12.8 is out and fixes two minor issues:
> Don't reject views with result types which are TypeVars (#7295)
> Don't emit static forwarders (which simplify the use of methods in top-level 
> objects from Java) for bridge methods (#7469)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25401) Reorder the required ordering to match the table's output ordering for bucket join

2018-12-07 Thread David Vrba (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712350#comment-16712350
 ] 

David Vrba edited comment on SPARK-25401 at 12/7/18 10:59 AM:
--

I was looking at it and I believe that in the class EnsureRequirements we could 
reorder the join predicates for SortMergeJoin once more - just before we check 
whether the child outputOrdering satisfies the requiredOrdering - and align the 
predicate keys with the child outputOrdering. In that case it would not add the 
unnecessary SortExec, and it would not add an unnecessary Exchange either, 
because the Exchange is handled before that point.

 

What do you guys think? Is it a good approach? (Please be patient with me, this 
is my first Jira on Spark)


was (Author: vrbad):
I was looking at it and i believe that it the class EnsureRequirements we could 
reorder the join predicates for SortMergeJoin once more - just before we check 
if child outputOrdering satisfies the requiredOrdering - and we can align the 
predicate keys with the child outputOrdering. In such case it is not going to 
add the unnecessary SortExec and also it is not going to add unnecessary 
Exchange either, because Exchange is handled before.

 

What do you guys think? Is it a good approach? (Please be patient with me, this 
is my first Jira on Spark)
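A rough Scala sketch of the reordering idea described in the comment above. This is 
an illustration only, not Spark's actual EnsureRequirements code; the type parameter 
E stands in for Catalyst's Expression class, and equality is assumed to mean 
semantic equality of a join key and an ordering expression.

{code:scala}
// Illustration of reordering join keys to match the child's output ordering,
// so orderingSatisfies can succeed without an extra SortExec.
def reorderJoinKeys[E](
    leftKeys: Seq[E],
    rightKeys: Seq[E],
    childOutputOrdering: Seq[E]): (Seq[E], Seq[E]) = {
  val keyPairs = leftKeys.zip(rightKeys)
  // Walk the child's output ordering and pick the matching join key pair.
  val reordered = childOutputOrdering.flatMap { orderExpr =>
    keyPairs.find { case (l, _) => l == orderExpr }
  }
  // Only use the reordered keys when every ordering expression matched a key;
  // otherwise fall back and let EnsureRequirements insert the sort as today.
  if (reordered.size == keyPairs.size) (reordered.map(_._1), reordered.map(_._2))
  else (leftKeys, rightKeys)
}

// For the bucketed-table example in this ticket: leftKeys = Seq("a1", "a2") and
// childOutputOrdering = Seq("a2", "a1") would yield the keys in (a2, a1) order.
{code}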

> Reorder the required ordering to match the table's output ordering for bucket 
> join
> --
>
> Key: SPARK-25401
> URL: https://issues.apache.org/jira/browse/SPARK-25401
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wang, Gang
>Priority: Major
>
> Currently, we check whether a SortExec is needed between an operator and its 
> child operator in the method orderingSatisfies, and orderingSatisfies 
> requires the SortOrders to be in exactly the same order.
> However, take the following case into consideration.
>  * Table a is bucketed by (a1, a2), sorted by (a2, a1), and the number of 
> buckets is 200.
>  * Table b is bucketed by (b1, b2), sorted by (b2, b1), and the number of 
> buckets is 200.
>  * Table a joins table b on (a1=b1, a2=b2)
> In this case, if the join is a sort merge join, the query planner won't add 
> an exchange on either side, but a sort will be added on both sides. Actually, 
> the sort is also unnecessary, since within the same bucket, e.g. bucket 1 of 
> table a and bucket 1 of table b, (a1=b1, a2=b2) is equivalent to (a2=b2, a1=b1).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26304:


Assignee: (was: Apache Spark)

> Add default value to spark.kafka.sasl.kerberos.service.name parameter
> -
>
> Key: SPARK-26304
> URL: https://issues.apache.org/jira/browse/SPARK-26304
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> The reasoning behind this:
> * Kafka's configuration guide suggests the same value: 
> https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig
> * It would make things easier for Spark users by requiring less configuration
> * Other streaming engines do the same



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712639#comment-16712639
 ] 

Apache Spark commented on SPARK-26304:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/23254

> Add default value to spark.kafka.sasl.kerberos.service.name parameter
> -
>
> Key: SPARK-26304
> URL: https://issues.apache.org/jira/browse/SPARK-26304
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> The reasoning behind this:
> * Kafka's configuration guide suggests the same value: 
> https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig
> * It would make things easier for Spark users by requiring less configuration
> * Other streaming engines do the same



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26304:


Assignee: Apache Spark

> Add default value to spark.kafka.sasl.kerberos.service.name parameter
> -
>
> Key: SPARK-26304
> URL: https://issues.apache.org/jira/browse/SPARK-26304
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>
> The reasoning behind this:
> * Kafka's configuration guide suggests the same value: 
> https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig
> * It would make things easier for Spark users by requiring less configuration
> * Other streaming engines do the same



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26304) Add default value to spark.kafka.sasl.kerberos.service.name parameter

2018-12-07 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-26304:
-

 Summary: Add default value to 
spark.kafka.sasl.kerberos.service.name parameter
 Key: SPARK-26304
 URL: https://issues.apache.org/jira/browse/SPARK-26304
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


The reasoning behind this:
* Kafka's configuration guide suggests the same value: 
https://kafka.apache.org/documentation/#security_sasl_kerberos_brokerconfig
* It would make things easier for Spark users by requiring less configuration
* Other streaming engines do the same
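As a small sketch of what this would mean for users; the default value "kafka" is 
an assumption taken from Kafka's own SASL/Kerberos broker guide linked above, not 
a setting that exists yet.

{code:scala}
// Sketch only: shows the setting the proposal would make optional.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-sasl-service-name-sketch")
  // Today this must be set explicitly for Kerberos-secured Kafka:
  .config("spark.kafka.sasl.kerberos.service.name", "kafka")
  .getOrCreate()

// With the proposed default in place, omitting the .config(...) line above
// would behave the same way for the common case.
{code}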




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26290) [K8s] Driver Pods no mounted volumes on submissions from older spark versions

2018-12-07 Thread Martin Buchleitner (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Buchleitner updated SPARK-26290:
---
Environment: 
Kubernetes: 1.10.6
Container: Spark 2.4.0 

Spark containers are built from the archive served by 
[www.apache.org/dist/spark/|http://www.apache.org/dist/spark/] 

Submissions are done by older Spark versions integrated e.g. in Livy

  was:
Kubernetes 1.10.6
Spark 2.4.0

Spark containers are built from the archive served by 
[www.apache.org/dist/spark/|http://www.apache.org/dist/spark/] 

Description: 
I want to use the volume feature to mount an existing PVC as a read-only volume 
into both the driver and the executors. 

The executor gets the PVC mounted, but the driver is missing the mount 
{code:java}
/opt/spark/bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--conf spark.app.name=spark-pi \
--conf spark.executor.instances=4 \
--conf spark.kubernetes.namespace=spark-demo \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.container.image=kube-spark:2.4.0 \
--conf spark.master=k8s://https:// \
--conf 
spark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.path=/srv \
--conf 
spark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.readOnly=true 
\
--conf 
spark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.options.claimName=nfs-pvc
 \
--conf 
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/srv \
--conf 
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=true
 \
--conf 
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=nfs-pvc
 \
/srv/spark-examples_2.11-2.4.0.jar
{code}
When I use the jar included in the container
{code:java}
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
{code}
the call works and I can inspect the pod descriptions to verify the behavior

*Driver description*
{code:java}
Name: spark-pi-1544018157391-driver
[...]
Containers:
  spark-kubernetes-driver:
Container ID:   
docker://3a31d867c140183247cb296e13a8b35d03835f7657dd7e625c59083024e51e28
Image:  kube-spark:2.4.0
Image ID:   [...]
Port:   
Host Port:  
State:  Terminated
  Reason:   Completed
  Exit Code:0
  Started:  Wed, 05 Dec 2018 14:55:59 +0100
  Finished: Wed, 05 Dec 2018 14:56:08 +0100
Ready:  False
Restart Count:  0
Limits:
  memory:  1408Mi
Requests:
  cpu: 1
  memory:  1Gi
Environment:
  SPARK_DRIVER_MEMORY:1g
  SPARK_DRIVER_CLASS: org.apache.spark.examples.SparkPi
  SPARK_DRIVER_ARGS:
  SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
  SPARK_MOUNTED_CLASSPATH:
/opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
  SPARK_JAVA_OPT_1:   
-Dspark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/srv
  SPARK_JAVA_OPT_3:   -Dspark.app.name=spark-pi
  SPARK_JAVA_OPT_4:   
-Dspark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.path=/srv
  SPARK_JAVA_OPT_5:   -Dspark.submit.deployMode=cluster
  SPARK_JAVA_OPT_6:   -Dspark.driver.blockManager.port=7079
  SPARK_JAVA_OPT_7:   
-Dspark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.mount.readOnly=true
  SPARK_JAVA_OPT_8:   
-Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
  SPARK_JAVA_OPT_9:   
-Dspark.driver.host=spark-pi-1544018157391-driver-svc.spark-demo.svc.cluster.local
  SPARK_JAVA_OPT_10:  
-Dspark.kubernetes.driver.pod.name=spark-pi-1544018157391-driver
  SPARK_JAVA_OPT_11:  
-Dspark.kubernetes.driver.volumes.persistentVolumeClaim.ddata.options.claimName=nfs-pvc
  SPARK_JAVA_OPT_12:  
-Dspark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=true
  SPARK_JAVA_OPT_13:  -Dspark.driver.port=7078
  SPARK_JAVA_OPT_14:  
-Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
  SPARK_JAVA_OPT_15:  
-Dspark.kubernetes.executor.podNamePrefix=spark-pi-1544018157391
  SPARK_JAVA_OPT_16:  -Dspark.local.dir=/tmp/spark-local
  SPARK_JAVA_OPT_17:  -Dspark.master=k8s://https://
  SPARK_JAVA_OPT_18:  
-Dspark.app.id=spark-89420bd5fa8948c3aa9d14a4eb6ecfca
  SPARK_JAVA_OPT_19:  -Dspark.kubernetes.namespace=spark-demo
  SPARK_JAVA_OPT_21:  -Dspark.executor.instances=4
  SPARK_JAVA_OPT_22:  
-Dspark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=nfs-pvc
  SPARK_JAVA_OPT_23:  
-Dspark.kubernetes.container.image=kube-spark:2.4.0
  SPARK_JAVA_OPT_24:  

[jira] [Assigned] (SPARK-26303) Return partial results for bad JSON records

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26303:


Assignee: (was: Apache Spark)

> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all nulls 
> for a malformed JSON string in PERMISSIVE mode when the specified schema has 
> a struct type. All nulls are returned even if some of the fields were parsed 
> and converted to the desired types successfully. This ticket aims to solve 
> the problem by returning the already parsed fields. The corrupt record column 
> specified via the JSON option `columnNameOfCorruptRecord` or the SQL config 
> should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> the expected output of `from_json` in PERMISSIVE mode is:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26303) Return partial results for bad JSON records

2018-12-07 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26303:
--

 Summary: Return partial results for bad JSON records
 Key: SPARK-26303
 URL: https://issues.apache.org/jira/browse/SPARK-26303
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, the JSON datasource and JSON functions return a row with all nulls 
for a malformed JSON string in PERMISSIVE mode when the specified schema has a 
struct type. All nulls are returned even if some of the fields were parsed and 
converted to the desired types successfully. This ticket aims to solve the 
problem by returning the already parsed fields. The corrupt record column 
specified via the JSON option `columnNameOfCorruptRecord` or the SQL config 
should contain the whole original JSON string. 

For example, if the input has one JSON string:
{code:json}
{"a":0.1,"b":{},"c":"def"}
{code}
and the specified schema is:
{code:sql}
a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
{code}
the expected output of `from_json` in PERMISSIVE mode is:
{code}
+---+----+---+--------------------------+
|a  |b   |c  |_corrupt_record           |
+---+----+---+--------------------------+
|0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
+---+----+---+--------------------------+
{code}
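A minimal Scala sketch of how this could be exercised with `from_json`. The array 
element type used below (int) is an assumption, since the element type of `b` was 
lost in the issue text.

{code:scala}
// Sketch of exercising the behavior described above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("json-partial-results").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("a", DoubleType),
  StructField("b", ArrayType(IntegerType)),   // element type is an assumption
  StructField("c", StringType),
  StructField("_corrupt_record", StringType)))

val df = Seq("""{"a":0.1,"b":{},"c":"def"}""").toDF("json")

val parsed = df.select(
  from_json(col("json"), schema,
    Map("mode" -> "PERMISSIVE",
        "columnNameOfCorruptRecord" -> "_corrupt_record")).as("parsed"))

// Before the change: the malformed record comes back as all nulls.
// With the proposed change: a = 0.1 and c = "def" are kept, b stays null,
// and _corrupt_record holds the whole original JSON string.
parsed.select("parsed.*").show(false)
{code}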



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712603#comment-16712603
 ] 

Apache Spark commented on SPARK-26303:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23253

> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all nulls 
> for a malformed JSON string in PERMISSIVE mode when the specified schema has 
> a struct type. All nulls are returned even if some of the fields were parsed 
> and converted to the desired types successfully. This ticket aims to solve 
> the problem by returning the already parsed fields. The corrupt record column 
> specified via the JSON option `columnNameOfCorruptRecord` or the SQL config 
> should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> the expected output of `from_json` in PERMISSIVE mode is:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26303) Return partial results for bad JSON records

2018-12-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26303:


Assignee: Apache Spark

> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all nulls 
> for a malformed JSON string in PERMISSIVE mode when the specified schema has 
> a struct type. All nulls are returned even if some of the fields were parsed 
> and converted to the desired types successfully. This ticket aims to solve 
> the problem by returning the already parsed fields. The corrupt record column 
> specified via the JSON option `columnNameOfCorruptRecord` or the SQL config 
> should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> the expected output of `from_json` in PERMISSIVE mode is:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records

2018-12-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712600#comment-16712600
 ] 

Apache Spark commented on SPARK-26303:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23253

> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all nulls 
> for a malformed JSON string in PERMISSIVE mode when the specified schema has 
> a struct type. All nulls are returned even if some of the fields were parsed 
> and converted to the desired types successfully. This ticket aims to solve 
> the problem by returning the already parsed fields. The corrupt record column 
> specified via the JSON option `columnNameOfCorruptRecord` or the SQL config 
> should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> the expected output of `from_json` in PERMISSIVE mode is:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26302) retainedBatches configuration can cause memory leak

2018-12-07 Thread Behroz Sikander (JIRA)
Behroz Sikander created SPARK-26302:
---

 Summary: retainedBatches configuration can cause memory leak
 Key: SPARK-26302
 URL: https://issues.apache.org/jira/browse/SPARK-26302
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Behroz Sikander
 Attachments: heap_dump_detail.png

The documentation for configuration "spark.streaming.ui.retainedBatches" says

"How many batches the Spark Streaming UI and status APIs remember before 
garbage collecting"

The default for this configuration is 1000.
From our experience, the documentation is incomplete, and we found that out the 
hard way.

The size of a single BatchUIData is around 750KB. Increasing this value to 
something like 5000 increases the total size to ~4GB.

If your driver heap is not big enough, the job starts to slow down, suffers 
frequent GCs and long scheduling delays. Once the heap is full, the job 
cannot be recovered.

A note of caution should be added to the documentation to let users know the 
impact of this seemingly harmless configuration property.
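
For context, a minimal sketch of where this property is set (the app name, master, and 
batch interval below are placeholders, not values taken from the report):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each completed batch keeps a BatchUIData entry on the driver (~750 KB according to
// the report above), so the retained UI history alone can occupy several GB of heap
// if this value is raised far beyond the default of 1000.
val conf = new SparkConf()
  .setAppName("retained-batches-sketch")              // placeholder app name
  .setMaster("local[2]")                              // local master, for illustration only
  .set("spark.streaming.ui.retainedBatches", "1000")  // the default; raise with care

val ssc = new StreamingContext(conf, Seconds(10))     // placeholder batch interval
{code}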




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26302) retainedBatches configuration can cause memory leak

2018-12-07 Thread Behroz Sikander (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712559#comment-16712559
 ] 

Behroz Sikander commented on SPARK-26302:
-

I am willing to do a PR for the documentation once someone gives the go-ahead.

> retainedBatches configuration can cause memory leak
> ---
>
> Key: SPARK-26302
> URL: https://issues.apache.org/jira/browse/SPARK-26302
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Behroz Sikander
>Priority: Minor
> Attachments: heap_dump_detail.png
>
>
> The documentation for configuration "spark.streaming.ui.retainedBatches" says
> "How many batches the Spark Streaming UI and status APIs remember before 
> garbage collecting"
> The default for this configuration is 1000.
> From our experience, the documentation is incomplete, and we found that out the 
> hard way.
> The size of a single BatchUIData is around 750KB. Increasing this value to 
> something like 5000 increases the total size to ~4GB.
> If your driver heap is not big enough, the job starts to slow down, suffers 
> frequent GCs and long scheduling delays. Once the heap is full, the job 
> cannot be recovered.
> A note of caution should be added to the documentation to let users know the 
> impact of this seemingly harmless configuration property.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26302) retainedBatches configuration can cause memory leak

2018-12-07 Thread Behroz Sikander (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Behroz Sikander updated SPARK-26302:

Attachment: heap_dump_detail.png

> retainedBatches configuration can cause memory leak
> ---
>
> Key: SPARK-26302
> URL: https://issues.apache.org/jira/browse/SPARK-26302
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Behroz Sikander
>Priority: Minor
> Attachments: heap_dump_detail.png
>
>
> The documentation for configuration "spark.streaming.ui.retainedBatches" says
> "How many batches the Spark Streaming UI and status APIs remember before 
> garbage collecting"
> The default for this configuration is 1000.
> From our experience, the documentation is incomplete, and we found that out the 
> hard way.
> The size of a single BatchUIData is around 750KB. Increasing this value to 
> something like 5000 increases the total size to ~4GB.
> If your driver heap is not big enough, the job starts to slow down, suffers 
> frequent GCs and long scheduling delays. Once the heap is full, the job 
> cannot be recovered.
> A note of caution should be added to the documentation to let users know the 
> impact of this seemingly harmless configuration property.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26295) [K8S] serviceAccountName is not set in client mode

2018-12-07 Thread Adrian Tanase (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Tanase updated SPARK-26295:
--
Description: 
When deploying Spark apps in client mode (in my case from inside the driver 
pod), one can't specify the service account in accordance with the docs 
([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]).

The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is 
most likely added in cluster mode only, which would be consistent with 
{{spark.kubernetes.authenticate.driver}} being the cluster mode prefix.

We should either inject the service account specified by this property in the 
client mode pods, or specify an equivalent config: 
{{spark.kubernetes.authenticate.serviceAccountName}}

 This is the exception:
{noformat}
Message: Forbidden!Configured service account doesn't have access. Service 
account may have been revoked. pods "..." is forbidden: User 
"system:serviceaccount:mynamespace:default" cannot get pods in the namespace 
"mynamespace"{noformat}
The expectation was to see the user {{mynamespace:spark}} based on my submit 
command.

My current workaround is to create a clusterrolebinding with edit rights for 
the mynamespace:default account.

  was:
When deploying Spark apps in client mode (in my case from inside the driver 
pod), one can't specify the service account in accordance with the docs 
([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]).

The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is 
most likely added in cluster mode only, which would be consistent with 
spark.kubernetes.authenticate.driver being the cluster mode prefix.

We should either inject the service account specified by this property in the 
client mode pods, or specify an equivalent config: 
spark.kubernetes.authenticate.serviceAccountName

 This is the exception:
{noformat}
Message: Forbidden!Configured service account doesn't have access. Service 
account may have been revoked. pods "..." is forbidden: User 
"system:serviceaccount:mynamespace:default" cannot get pods in the namespace 
"mynamespace"{noformat}
The expectation was to see the user `mynamespace:spark` based on my submit 
command.

My current workaround is to create a clusterrolebinding with edit rights for 
the mynamespace:default account.


> [K8S] serviceAccountName is not set in client mode
> --
>
> Key: SPARK-26295
> URL: https://issues.apache.org/jira/browse/SPARK-26295
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Adrian Tanase
>Priority: Major
>
> When deploying Spark apps in client mode (in my case from inside the driver 
> pod), one can't specify the service account in accordance with the docs 
> ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]).
> The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is 
> most likely added in cluster mode only, which would be consistent with 
> {{spark.kubernetes.authenticate.driver}} being the cluster mode prefix.
> We should either inject the service account specified by this property in the 
> client mode pods, or specify an equivalent config: 
> {{spark.kubernetes.authenticate.serviceAccountName}}
>  This is the exception:
> {noformat}
> Message: Forbidden!Configured service account doesn't have access. Service 
> account may have been revoked. pods "..." is forbidden: User 
> "system:serviceaccount:mynamespace:default" cannot get pods in the namespace 
> "mynamespace"{noformat}
> The expectation was to see the user {{mynamespace:spark}} based on my submit 
> command.
> My current workaround is to create a clusterrolebinding with edit rights for 
> the mynamespace:default account.
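
For reference, a hedged sketch of the two paths discussed above; the API server address, 
binding name, and service account name are examples, not values confirmed by the ticket:

{noformat}
# Cluster mode honors the documented property:
spark-submit \
  --master k8s://https://<api-server> \
  --deploy-mode cluster \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  ...

# Workaround described above for client mode: bind the default service account
# to the built-in "edit" cluster role (the binding name is arbitrary).
kubectl create clusterrolebinding spark-default-edit \
  --clusterrole=edit \
  --serviceaccount=mynamespace:default
{noformat}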



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26295) [K8S] serviceAccountName is not set in client mode

2018-12-07 Thread Adrian Tanase (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712493#comment-16712493
 ] 

Adrian Tanase commented on SPARK-26295:
---

[~vanzin] I'm not sure how that applies. I'd be happy to give it a shot, except 
that, as I also commented on that PR, I can't see how kubectl and the kube 
context are relevant in client mode, where spark-submit is being called 
from Docker, inside the driver pod.

If there is code that propagates the kube context along this path, I'm not 
aware of it and would love to see some documentation:
{noformat}
laptop with kubectl and context > k apply -f spark-driver-client-mode.yaml -> 
deployment starts 1 instance of driver pod in arbitrary namespace -> spark 
submit from start.sh inside the docker container -> ... {noformat}
Also, I'd rather not make any assumptions about "implicit" configuration that 
may vary from computer to computer. Ideally the YAML templates are 
self-sufficient (including config maps, env vars, etc.) and, apart from your 
cluster credentials, you shouldn't need anything else on your machine.

Thanks for looking at the issue.

> [K8S] serviceAccountName is not set in client mode
> --
>
> Key: SPARK-26295
> URL: https://issues.apache.org/jira/browse/SPARK-26295
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Adrian Tanase
>Priority: Major
>
> When deploying Spark apps in client mode (in my case from inside the driver 
> pod), one can't specify the service account in accordance with the docs 
> ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]).
> The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is 
> most likely added in cluster mode only, which would be consistent with 
> spark.kubernetes.authenticate.driver being the cluster mode prefix.
> We should either inject the service account specified by this property in the 
> client mode pods, or specify an equivalent config: 
> spark.kubernetes.authenticate.serviceAccountName
>  This is the exception:
> {noformat}
> Message: Forbidden!Configured service account doesn't have access. Service 
> account may have been revoked. pods "..." is forbidden: User 
> "system:serviceaccount:mynamespace:default" cannot get pods in the namespace 
> "mynamespace"{noformat}
> The expectation was to see the user `mynamespace:spark` based on my submit 
> command.
> My current workaround is to create a clusterrolebinding with edit rights for 
> the mynamespace:default account.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org