[jira] [Assigned] (SPARK-31317) Add withField method to Column class

2020-07-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31317:
---

Assignee: fqaiser94

> Add withField method to Column class
> 
>
> Key: SPARK-31317
> URL: https://issues.apache.org/jira/browse/SPARK-31317
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: fqaiser94
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32219) Add SHOW CACHED TABLES Command

2020-07-07 Thread ulysses you (Jira)
ulysses you created SPARK-32219:
---

 Summary: Add SHOW CACHED TABLES Command
 Key: SPARK-32219
 URL: https://issues.apache.org/jira/browse/SPARK-32219
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32159:
--
   Fix Version/s: (was: 3.0.1)
Target Version/s: 3.0.1

> New udaf(Aggregator) has an integration bug with UnresolvedMapObjects 
> serialization
> ---
>
> Key: SPARK-32159
> URL: https://issues.apache.org/jira/browse/SPARK-32159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
>
> The new user-defined aggregator feature (SPARK-27296), based on calling 
> 'functions.udaf(aggregator)', works fine when the aggregator input type is 
> atomic, e.g. 'Aggregator[Double, _, _]'; however, if the input type is an 
> array, like 'Aggregator[Array[Double], _, _]', it trips over the following:
> {code:java}
> /**
>  * When constructing [[MapObjects]], the element type must be given, which may not be available
>  * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by
>  * [[MapObjects]] during analysis after the input data is resolved.
>  * Note that, ideally we should not serialize and send unresolved expressions to executors, but
>  * users may accidentally do this (e.g. mistakenly reference an encoder instance when implementing
>  * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is
>  * not serializable. Then even if users mistakenly reference an unresolved expression and serialize
>  * it, it's just a performance issue (more network traffic), and will not fail.
>  */
> case class UnresolvedMapObjects(
>     @transient function: Expression => Expression,
>     child: Expression,
>     customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable {
>   override lazy val resolved = false
>   override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse {
>     throw new UnsupportedOperationException("not resolved")
>   }
> }
> {code}
> *The '@transient' causes 'function' to be deserialized as 'null' on the executors, which triggers a null-pointer exception here when it tries to call 'function(loopVar)':*
> {code:java}
> object MapObjects {
>   def apply(
>       function: Expression => Expression,
>       inputData: Expression,
>       elementType: DataType,
>       elementNullable: Boolean = true,
>       customCollectionCls: Option[Class[_]] = None): MapObjects = {
>     val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
>     MapObjects(loopVar, function(loopVar), inputData, customCollectionCls)
>   }
> }
> {code}
> *I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)' whenever 'function' is null, but I need a second opinion from Catalyst developers on what a robust fix should be.*
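
A minimal reproduction sketch of the report above, assuming Spark 3.0's 'functions.udaf' registration path; the aggregator, view, and column names are illustrative, not taken from the ticket:

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions

// Sums the first element of every input array.
object FirstElemSum extends Aggregator[Array[Double], Double, Double] {
  def zero: Double = 0.0
  def reduce(b: Double, a: Array[Double]): Double = b + a.headOption.getOrElse(0.0)
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(r: Double): Double = r
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Registering and calling the UDAF on an Array[Double] column goes through
// UnresolvedMapObjects; per the report this fails with a NullPointerException,
// while an atomic input type such as Double works fine.
spark.udf.register("firstElemSum", functions.udaf(FirstElemSum))
Seq(Array(1.0, 2.0), Array(3.0)).toDF("xs").createOrReplaceTempView("t")
spark.sql("SELECT firstElemSum(xs) FROM t").show()
{code}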



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32205) Writing timestamp in mysql gets fails

2020-07-07 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153131#comment-17153131
 ] 

JinxinTang edited comment on SPARK-32205 at 7/8/20, 5:10 AM:
-

[~nileshr.patil] This does seem to be an issue. We can insert a timestamp into a 
DATETIME column that is predefined in MySQL; the table cannot currently be auto-created by Spark.


was (Author: jinxintang):
[~nileshr.patil] Seems a issue, we can insert timestamp to datetime column 
predefined in mysql. The table could not auto create by spark currently.
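
A minimal sketch of the workaround mentioned in the comment above: write into a MySQL table whose DATETIME column was created beforehand. The connection details, table, and column names are placeholders, and the commented-out createTableColumnTypes option is only relevant when Spark is allowed to create the table itself:

{code:scala}
// Pre-created on the MySQL side, outside Spark:
//   CREATE TABLE events (id INT, ts DATETIME);
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// A value outside MySQL's TIMESTAMP range but inside its DATETIME range.
val df = Seq((1, Timestamp.valueOf("1950-01-01 00:00:00"))).toDF("id", "ts")

df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://host:3306/db")
  .option("dbtable", "events")
  .option("user", "user")
  .option("password", "secret")
  // If Spark creates the table, this asks for DATETIME instead of the default
  // TIMESTAMP mapping:
  // .option("createTableColumnTypes", "ts DATETIME")
  .mode("append")
  .save()
{code}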

> Writing timestamp in mysql gets fails 
> --
>
> Key: SPARK-32205
> URL: https://issues.apache.org/jira/browse/SPARK-32205
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
>
> When we write to MySQL with a TIMESTAMP column, it supports only the range 
> '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07'. MySQL has a DATETIME 
> datatype with a '1000-01-01 00:00:00' to '9999-12-31 23:59:59' range.
> How can the Spark timestamp datatype be mapped to the MySQL DATETIME datatype 
> in order to use the wider supported range?
> [https://dev.mysql.com/doc/refman/5.7/en/datetime.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32218) spark-ml must support one hot encoded output labels for classification

2020-07-07 Thread Raghuvarran V H (Jira)
Raghuvarran V H created SPARK-32218:
---

 Summary: spark-ml must support one hot encoded output labels for 
classification
 Key: SPARK-32218
 URL: https://issues.apache.org/jira/browse/SPARK-32218
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.0
Reporter: Raghuvarran V H


In any classification algorithm, for target labels that have no ordinal 
relationship, it is advised to one hot encode the target labels. Refer here:

[https://stackoverflow.com/questions/51384911/one-hot-encoding-of-output-labels/53291690#53291690]

[https://www.linkedin.com/pulse/why-using-one-hot-encoding-classifier-training-adwin-jahn/]

spark-ml does not support one-hot-encoded target labels. When I try, I get 
the error below:

IllegalArgumentException: u'requirement failed: Column label_ohe must be of 
type numeric but was actually of type 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'

So it would be nice if OHE were supported for target labels.
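
For contrast, a minimal sketch of what Spark ML expects today: a single numeric label column, typically produced by StringIndexer, rather than a one-hot-encoded vector. The 'training' DataFrame and its column names are assumed inputs:

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

def fitWithIndexedLabel(training: DataFrame) = {
  // Map string classes to a numeric label column instead of a one-hot vector.
  val indexed = new StringIndexer()
    .setInputCol("label_str")
    .setOutputCol("label_idx")
    .fit(training)
    .transform(training)

  new LogisticRegression()
    .setFeaturesCol("features")
    .setLabelCol("label_idx")   // numeric label column, as currently required
    .fit(indexed)
}
{code}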



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32213) saveAsTable deletes all files in path

2020-07-07 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153200#comment-17153200
 ] 

angerszhu edited comment on SPARK-32213 at 7/8/20, 3:13 AM:


Seems you hit the problem I mentioned in https://issues.apache.org/jira/browse/SPARK-28551.

was (Author: angerszhuuu):
Seems you hit the problem I mentioned in [SPARK-25290|https://github.com/apache/spark/pull/25290/files].

> saveAsTable deletes all files in path
> -
>
> Key: SPARK-32213
> URL: https://issues.apache.org/jira/browse/SPARK-32213
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuval Rochman
>Priority: Major
>
> The problem is presented in the following link:
> [https://stackoverflow.com/questions/62782637/saveastable-can-delete-all-my-files-in-desktop?noredirect=1#comment111026138_62782637]
> Apparently, without any warning, all files on the desktop were deleted after 
> writing a file.
> There is no warning in PySpark that the "path" parameter can cause this problem. 
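
A minimal sketch of the call pattern described in the linked question; the path and table name are hypothetical. With mode("overwrite") and an explicit path, saveAsTable treats that directory as the table location and clears it before writing, which appears to be how unrelated files get lost:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.range(10).toDF("id")

df.write
  .mode("overwrite")
  .option("path", "/Users/someone/Desktop")   // existing directory holding unrelated files
  .saveAsTable("my_table")                    // overwrite clears the location first
{code}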



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32213) saveAsTable deletes all files in path

2020-07-07 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153200#comment-17153200
 ] 

angerszhu commented on SPARK-32213:
---

Seems you hit the problem I mentioned in [SPARK-25290|https://github.com/apache/spark/pull/25290/files].

> saveAsTable deletes all files in path
> -
>
> Key: SPARK-32213
> URL: https://issues.apache.org/jira/browse/SPARK-32213
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Yuval Rochman
>Priority: Major
>
> The problem is presented in the following link:
> [https://stackoverflow.com/questions/62782637/saveastable-can-delete-all-my-files-in-desktop?noredirect=1#comment111026138_62782637]
> Apparently, without any warning, all files on the desktop were deleted after 
> writing a file.
> There is no warning in PySpark that the "path" parameter can cause this problem. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations

2020-07-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153180#comment-17153180
 ] 

Dongjoon Hyun commented on SPARK-32163:
---

This landed on branch-3.0 via https://github.com/apache/spark/pull/29027 .

> Nested pruning should still work for nested column extractors of attributes 
> with cosmetic variations
> 
>
> Key: SPARK-32163
> URL: https://issues.apache.org/jira/browse/SPARK-32163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> If the expressions extracting nested fields have cosmetic variations such as a 
> qualifier difference, nested column pruning currently cannot work well.
> For example, two attributes that are semantically the same are referenced in 
> a query, but their nested column extractors are treated differently 
> when we deal with nested column pruning.
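
A hedged illustration of such a cosmetic variation; the schema, path, and alias below are made up for the example. Both column references resolve to the same nested field, but one carries a qualifier and the other does not:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Assumed schema: item struct<id: bigint, name: string>, plus other columns.
val df = spark.read.parquet("/path/to/data").as("t")

df.where($"t.item.id" > 0)   // qualified extractor: t.item.id
  .select($"item.id")        // unqualified extractor for the same attribute
  .explain()
{code}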



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32163:
--
Fix Version/s: 3.0.1

> Nested pruning should still work for nested column extractors of attributes 
> with cosmetic variations
> 
>
> Key: SPARK-32163
> URL: https://issues.apache.org/jira/browse/SPARK-32163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> If the expressions extracting nested fields have cosmetic variations such as a 
> qualifier difference, nested column pruning currently cannot work well.
> For example, two attributes that are semantically the same are referenced in 
> a query, but their nested column extractors are treated differently 
> when we deal with nested column pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30703) Add a documentation page for ANSI mode

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153177#comment-17153177
 ] 

Apache Spark commented on SPARK-30703:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29033

> Add a documentation page for ANSI mode
> --
>
> Key: SPARK-30703
> URL: https://issues.apache.org/jira/browse/SPARK-30703
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> ANSI mode is introduced in Spark 3.0. We need to clearly document the 
> behavior difference when spark.sql.ansi.enabled is on and off. 
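
One example of the kind of behavior difference such a page would cover, based on Spark 3.0's ANSI mode; the full set of affected operations is exactly what the documentation needs to spell out:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show()   // default: invalid cast yields NULL

spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('abc' AS INT)").show()   // ANSI mode: throws at runtime
{code}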



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20680) Spark-sql do not support for void column datatype of view

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20680:


Assignee: (was: Apache Spark)

> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark2.0.x, the behaviour to read this view is normal:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type 
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20680) Spark-sql do not support for void column datatype of view

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-20680.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28833
[https://github.com/apache/spark/pull/28833]

> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.1.0
>
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark2.0.x, the behaviour to read this view is normal:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type 
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20680) Spark-sql do not support for void column datatype of view

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-20680:
-

Assignee: Lantao Jin

> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark2.0.x, the behaviour to read this view is normal:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type 
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20680) Spark-sql do not support for void column datatype of view

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20680:


Assignee: Apache Spark

> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark2.0.x, the behaviour to read this view is normal:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type 
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20680) Spark-sql do not support for void column datatype of view

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20680:
--
Affects Version/s: 2.4.6
   3.0.0

> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark2.0.x, the behaviour to read this view is normal:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type 
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20680) Spark-sql do not support for void column datatype of view

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-20680:
---

> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Lantao Jin
>Priority: Major
>  Labels: bulk-closed
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark2.0.x, the behaviour to read this view is normal:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type 
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20680) Spark-sql do not support for void column datatype of view

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20680:
--
Labels:   (was: bulk-closed)

> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Lantao Jin
>Priority: Major
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark2.0.x, the behaviour to read this view is normal:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark2.1.x, it failed with SparkException: Cannot recognize hive type 
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2020-07-07 Thread AidenZhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153170#comment-17153170
 ] 

AidenZhang commented on SPARK-29038:


Hi [~cltlfcjin], thanks for your reply.

Our company is about to implement materialized views in Spark SQL. We plan to 
extend Catalyst to support query rewrite, replacing table scans with a 
materialized view where applicable. The data of a materialized view is stored 
on HDFS, and its structure information is stored in the Hive metastore; our 
plan is to implement materialized view management for Spark SQL on top of Hive. 
There are two people on our team now. Could you please estimate how long it 
would take to implement this feature?

 

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31760) Simplification Based on Containment

2020-07-07 Thread Nikita Glashenko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153161#comment-17153161
 ] 

Nikita Glashenko commented on SPARK-31760:
--

Hi, [~yumwang]. I'd like to work on this.

> Simplification Based on Containment
> ---
>
> Key: SPARK-31760
> URL: https://issues.apache.org/jira/browse/SPARK-31760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: starter
>
> https://docs.teradata.com/reader/Ws7YT1jvRK2vEr1LpVURug/V~FCwD9BL7gY4ac3WwHInw



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32217) Track whether the worker is also being decommissioned along with an executor

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32217:


Assignee: Apache Spark

> Track whether the worker is also being decommissioned along with an executor
> 
>
> Key: SPARK-32217
> URL: https://issues.apache.org/jira/browse/SPARK-32217
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Devesh Agrawal
>Assignee: Apache Spark
>Priority: Major
>
> When an executor is decommissioned, we would like to know if its shuffle data 
> is truly going to be lost. In the case of external shuffle service, this 
> means knowing that the worker (or the node that the executor is on) is also 
> going to be lost. 
>  
> ( I don't think we need to worry about disaggregated remote shuffle storage 
> at present since those are only used in a couple of web companies – but when 
> there is remote shuffle then yes the shuffle won't be lost with a 
> decommissioned executor )
>  
> We know for sure that a worker is being decommissioned when the Master is 
> asked to decommission a worker. In case of other schedulers:
>  * Yarn support for decommissioning isn't implemented yet. But the idea would 
> be for Yarn preemption to not mark the worker as being lost, while machine-level 
> decommissioning (like for kernel upgrades) would mark it as such.
>  * K8s isn't quite working with external shuffle service as yet, so when the 
> executor is lost, the worker isn't quite lost with it. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32217) Track whether the worker is also being decommissioned along with an executor

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32217:


Assignee: (was: Apache Spark)

> Track whether the worker is also being decommissioned along with an executor
> 
>
> Key: SPARK-32217
> URL: https://issues.apache.org/jira/browse/SPARK-32217
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Devesh Agrawal
>Priority: Major
>
> When an executor is decommissioned, we would like to know if its shuffle data 
> is truly going to be lost. In the case of external shuffle service, this 
> means knowing that the worker (or the node that the executor is on) is also 
> going to be lost. 
>  
> ( I don't think we need to worry about disaggregated remote shuffle storage 
> at present since those are only used in a couple of web companies – but when 
> there is remote shuffle then yes the shuffle won't be lost with a 
> decommissioned executor )
>  
> We know for sure that a worker is being decommissioned when the Master is 
> asked to decommission a worker. In case of other schedulers:
>  * Yarn support for decommissioning isn't implemented yet. But the idea would 
> be for Yarn preemption to not mark the worker as being lost, while machine-level 
> decommissioning (like for kernel upgrades) would mark it as such.
>  * K8s isn't quite working with external shuffle service as yet, so when the 
> executor is lost, the worker isn't quite lost with it. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32217) Track whether the worker is also being decommissioned along with an executor

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153157#comment-17153157
 ] 

Apache Spark commented on SPARK-32217:
--

User 'agrawaldevesh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29032

> Track whether the worker is also being decommissioned along with an executor
> 
>
> Key: SPARK-32217
> URL: https://issues.apache.org/jira/browse/SPARK-32217
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Devesh Agrawal
>Priority: Major
>
> When an executor is decommissioned, we would like to know if its shuffle data 
> is truly going to be lost. In the case of external shuffle service, this 
> means knowing that the worker (or the node that the executor is on) is also 
> going to be lost. 
>  
> ( I don't think we need to worry about disaggregated remote shuffle storage 
> at present since those are only used in a couple of web companies – but when 
> there is remote shuffle then yes the shuffle won't be lost with a 
> decommissioned executor )
>  
> We know for sure that a worker is being decommissioned when the Master is 
> asked to decommission a worker. In case of other schedulers:
>  * Yarn support for decommissioning isn't implemented yet. But the idea would 
> be for Yarn preemption to not mark the worker as being lost, while machine-level 
> decommissioning (like for kernel upgrades) would mark it as such.
>  * K8s isn't quite working with external shuffle service as yet, so when the 
> executor is lost, the worker isn't quite lost with it. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32217) Track whether the worker is also being decommissioned along with an executor

2020-07-07 Thread Devesh Agrawal (Jira)
Devesh Agrawal created SPARK-32217:
--

 Summary: Track whether the worker is also being decommissioned 
along with an executor
 Key: SPARK-32217
 URL: https://issues.apache.org/jira/browse/SPARK-32217
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Devesh Agrawal


When an executor is decommissioned, we would like to know if its shuffle data 
is truly going to be lost. In the case of external shuffle service, this means 
knowing that the worker (or the node that the executor is on) is also going to 
be lost. 

 

( I don't think we need to worry about disaggregated remote shuffle storage at 
present since those are only used in a couple of web companies – but when there 
is remote shuffle then yes the shuffle won't be lost with a decommissioned 
executor )

 

We know for sure that a worker is being decommissioned when the Master is asked 
to decommission a worker. In case of other schedulers:
 * Yarn support for decommissioning isn't implemented yet. But the idea would 
be for Yarn preemption to not mark the worker as being lost, while machine-level 
decommissioning (like for kernel upgrades) would mark it as such.
 * K8s isn't quite working with external shuffle service as yet, so when the 
executor is lost, the worker isn't quite lost with it. 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32205) Writing timestamp in mysql gets fails

2020-07-07 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153131#comment-17153131
 ] 

JinxinTang edited comment on SPARK-32205 at 7/8/20, 12:41 AM:
--

[~nileshr.patil] This seems to be an issue. We can insert a timestamp into a DATETIME column 
predefined in MySQL; the table cannot currently be auto-created by Spark.


was (Author: jinxintang):
[~nileshr.patil] Seems a issue, we can insert timestamp to datetime column 
predefined in mysql. The table should not auto create by spark currently.

> Writing timestamp in mysql gets fails 
> --
>
> Key: SPARK-32205
> URL: https://issues.apache.org/jira/browse/SPARK-32205
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
>
> When we write to MySQL with a TIMESTAMP column, it supports only the range 
> '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07'. MySQL has a DATETIME 
> datatype with a '1000-01-01 00:00:00' to '9999-12-31 23:59:59' range.
> How can the Spark timestamp datatype be mapped to the MySQL DATETIME datatype 
> in order to use the wider supported range?
> [https://dev.mysql.com/doc/refman/5.7/en/datetime.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly

2020-07-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32057.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28912
[https://github.com/apache/spark/pull/28912]

> SparkExecuteStatementOperation does not set CANCELED state correctly 
> -
>
> Key: SPARK-32057
> URL: https://issues.apache.org/jira/browse/SPARK-32057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Smesseim
>Assignee: Ali Smesseim
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> https://github.com/apache/spark/pull/28671 changed the way cleanup is done in 
> SparkExecuteStatementOperation. In cancel(), cleanup (killing jobs) used to be 
> done after setting the state to CANCELED. Now the order is reversed: jobs are 
> killed first, causing an exception to be thrown inside execute(), so the status 
> of the operation becomes ERROR before being set to CANCELED.
> cc [~juliuszsompolski]
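
A self-contained sketch of the ordering point above; the class and members below are stand-ins for illustration, not Spark's actual thrift-server types:

{code:scala}
object OperationState extends Enumeration { val RUNNING, CANCELED, ERROR = Value }

class StatementOperationSketch(killJobs: () => Unit) {
  @volatile private var state: OperationState.Value = OperationState.RUNNING
  def currentState: OperationState.Value = state
  def setState(s: OperationState.Value): Unit = { state = s }

  def cancel(): Unit = {
    // Set CANCELED before killing the jobs, so the exception the killed jobs raise
    // inside execute() is not recorded as ERROR over an already-cancelled statement.
    setState(OperationState.CANCELED)
    killJobs()
  }
}
{code}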



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32057) SparkExecuteStatementOperation does not set CANCELED state correctly

2020-07-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32057:


Assignee: Ali Smesseim

> SparkExecuteStatementOperation does not set CANCELED state correctly 
> -
>
> Key: SPARK-32057
> URL: https://issues.apache.org/jira/browse/SPARK-32057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Smesseim
>Assignee: Ali Smesseim
>Priority: Major
>
> https://github.com/apache/spark/pull/28671 changed the way cleanup is done in 
> SparkExecuteStatementOperation. In cancel(), cleanup (killing jobs) used to be 
> done after setting the state to CANCELED. Now the order is reversed: jobs are 
> killed first, causing an exception to be thrown inside execute(), so the status 
> of the operation becomes ERROR before being set to CANCELED.
> cc [~juliuszsompolski]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32216) Remove redundant ProjectExec

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32216:


Assignee: (was: Apache Spark)

> Remove redundant ProjectExec
> 
>
> Key: SPARK-32216
> URL: https://issues.apache.org/jira/browse/SPARK-32216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently a Spark executed plan can have redundant `ProjectExec` nodes. For 
> example: 
> After Filter: 
> {code:java}
> == Physical Plan ==
> *(1) Project [a#14L, b#15L, c#16, key#17] 
> +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
>    +- *(1) ColumnarToRow
>       +- FileScan parquet [a#14L,b#15L,c#16,key#17] {code}
> The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is 
> exactly the same as filter's output.
> Before Aggregate:
> {code:java}
> == Physical Plan ==
> *(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], 
> output=[sum_a#39L, key#17, last_b#41L])
> +- Exchange hashpartitioning(key#17, 5), true, [id=#77]
>+- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), 
> partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
>   +- *(1) Project [key#17, a#14L, b#15L]
>  +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
> +- *(1) ColumnarToRow
>+- FileScan parquet [a#14L,b#15L,key#17] {code}
> The `Project [key#17, a#14L, b#15L]` is redundant because hash aggregate 
> doesn't require child plan's output to be in a specific order.
>  
> In general, a project is redundant when
>  # It has the same output attributes and order as its child's output when 
> ordering of these attributes is required.
>  # It has the same output attributes as its child's output when attribute 
> output ordering is not required.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32216) Remove redundant ProjectExec

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153136#comment-17153136
 ] 

Apache Spark commented on SPARK-32216:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/29031

> Remove redundant ProjectExec
> 
>
> Key: SPARK-32216
> URL: https://issues.apache.org/jira/browse/SPARK-32216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently a Spark executed plan can have redundant `ProjectExec` nodes. For 
> example: 
> After Filter: 
> {code:java}
> == Physical Plan ==
> *(1) Project [a#14L, b#15L, c#16, key#17] 
> +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
>    +- *(1) ColumnarToRow
>       +- FileScan parquet [a#14L,b#15L,c#16,key#17] {code}
> The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is 
> exactly the same as filter's output.
> Before Aggregate:
> {code:java}
> == Physical Plan ==
> *(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], 
> output=[sum_a#39L, key#17, last_b#41L])
> +- Exchange hashpartitioning(key#17, 5), true, [id=#77]
>+- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), 
> partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
>   +- *(1) Project [key#17, a#14L, b#15L]
>  +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
> +- *(1) ColumnarToRow
>+- FileScan parquet [a#14L,b#15L,key#17] {code}
> The `Project [key#17, a#14L, b#15L]` is redundant because hash aggregate 
> doesn't require child plan's output to be in a specific order.
>  
> In general, a project is redundant when
>  # It has the same output attributes and order as its child's output when 
> ordering of these attributes is required.
>  # It has the same output attributes as its child's output when attribute 
> output ordering is not required.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32216) Remove redundant ProjectExec

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32216:


Assignee: Apache Spark

> Remove redundant ProjectExec
> 
>
> Key: SPARK-32216
> URL: https://issues.apache.org/jira/browse/SPARK-32216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently a Spark executed plan can have redundant `ProjectExec` nodes. For 
> example: 
> After Filter: 
> {code:java}
> == Physical Plan ==
> *(1) Project [a#14L, b#15L, c#16, key#17] 
> +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
>    +- *(1) ColumnarToRow
>       +- FileScan parquet [a#14L,b#15L,c#16,key#17] {code}
> The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is 
> exactly the same as filter's output.
> Before Aggregate:
> {code:java}
> == Physical Plan ==
> *(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], 
> output=[sum_a#39L, key#17, last_b#41L])
> +- Exchange hashpartitioning(key#17, 5), true, [id=#77]
>+- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), 
> partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
>   +- *(1) Project [key#17, a#14L, b#15L]
>  +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
> +- *(1) ColumnarToRow
>+- FileScan parquet [a#14L,b#15L,key#17] {code}
> The `Project [key#17, a#14L, b#15L]` is redundant because hash aggregate 
> doesn't require child plan's output to be in a specific order.
>  
> In general, a project is redundant when
>  # It has the same output attributes and order as its child's output when 
> ordering of these attributes is required.
>  # It has the same output attributes as its child's output when attribute 
> output ordering is not required.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32216) Remove redundant ProjectExec

2020-07-07 Thread Allison Wang (Jira)
Allison Wang created SPARK-32216:


 Summary: Remove redundant ProjectExec
 Key: SPARK-32216
 URL: https://issues.apache.org/jira/browse/SPARK-32216
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Allison Wang


Currently a Spark executed plan can have redundant `ProjectExec` nodes. For 
example: 

After Filter: 
{code:java}
== Physical Plan ==
*(1) Project [a#14L, b#15L, c#16, key#17] 
+- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
   +- *(1) ColumnarToRow
      +- FileScan parquet [a#14L,b#15L,c#16,key#17] {code}
The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is 
exactly the same as filter's output.

Before Aggregate:
{code:java}
== Physical Plan ==
*(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], 
output=[sum_a#39L, key#17, last_b#41L])
+- Exchange hashpartitioning(key#17, 5), true, [id=#77]
   +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), 
partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
  +- *(1) Project [key#17, a#14L, b#15L]
 +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
+- *(1) ColumnarToRow
   +- FileScan parquet [a#14L,b#15L,key#17] {code}
The `Project [key#17, a#14L, b#15L]` is redundant because hash aggregate 
doesn't require child plan's output to be in a specific order.

 

In general, a project is redundant when
 # It has the same output attributes and order as its child's output when 
ordering of these attributes is required.
 # It has the same output attributes as its child's output when attribute 
output ordering is not required.
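
A minimal sketch of such a cleanup, written as a physical-plan rule (the object name is made up, and only the strict case 1 above, identical attributes in identical order, is handled; case 2 would additionally need to know whether the parent cares about attribute ordering):
{code:scala}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ProjectExec, SparkPlan}

// Sketch only: drop a ProjectExec that neither renames, reorders, nor prunes
// any attribute of its child (case 1 above).
object RemoveRedundantProjectsSketch extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan.transform {
    case p @ ProjectExec(_, child) if p.output == child.output => child
  }
}
{code}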

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32205) Writing timestamp in mysql gets fails

2020-07-07 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153131#comment-17153131
 ] 

JinxinTang edited comment on SPARK-32205 at 7/7/20, 11:31 PM:
--

[~nileshr.patil] This seems to be an issue. We can insert a timestamp into a DATETIME 
column predefined in MySQL. The table should not be auto-created by Spark currently.


was (Author: jinxintang):
[~nileshr.patil] This seems to be an issue. We can insert a timestamp into a DATETIME 
column predefined in MySQL, but Spark and MySQL have different timestamp ranges. The 
table should not be auto-created by Spark currently.

> Writing timestamp in mysql gets fails 
> --
>
> Key: SPARK-32205
> URL: https://issues.apache.org/jira/browse/SPARK-32205
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
>
> When we are writing to MySQL with a TIMESTAMP column, it supports only the range 
> '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC. MySQL has a DATETIME 
> datatype which has the range '1000-01-01 00:00:00' to '9999-12-31 23:59:59'.
> How can the Spark timestamp datatype be mapped to the MySQL DATETIME datatype in 
> order to use this larger supported range?
> [https://dev.mysql.com/doc/refman/5.7/en/datetime.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32205) Writing timestamp in mysql gets fails

2020-07-07 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153131#comment-17153131
 ] 

JinxinTang commented on SPARK-32205:


[~nileshr.patil] This seems to be an issue. We can insert a timestamp into a DATETIME 
column predefined in MySQL, but Spark and MySQL have different timestamp ranges. The 
table should not be auto-created by Spark currently.
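
For reference, a minimal sketch of one way to get a DATETIME column from the Spark side, using the JDBC writer's {{createTableColumnTypes}} option (it only takes effect when Spark creates the table; the URL, table name, credentials and column names below are placeholders):
{code:scala}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mysql-datetime-write").getOrCreate()
import spark.implicits._

// A value outside MySQL's TIMESTAMP range but inside the DATETIME range.
val df = Seq((1, java.sql.Timestamp.valueOf("2045-01-01 00:00:00"))).toDF("id", "ts")

val props = new Properties()
props.put("user", "root")
props.put("password", "secret")

df.write
  // Create the ts column as DATETIME instead of the default TIMESTAMP mapping.
  .option("createTableColumnTypes", "ts DATETIME")
  .mode("append")
  .jdbc("jdbc:mysql://localhost:3306/testdb", "events", props)
{code}
Alternatively, as noted above, the table can be created in MySQL first with a DATETIME column and Spark can simply append to it.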

> Writing timestamp in mysql gets fails 
> --
>
> Key: SPARK-32205
> URL: https://issues.apache.org/jira/browse/SPARK-32205
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
>
> When we are writing to MySQL with a TIMESTAMP column, it supports only the range 
> '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC. MySQL has a DATETIME 
> datatype which has the range '1000-01-01 00:00:00' to '9999-12-31 23:59:59'.
> How can the Spark timestamp datatype be mapped to the MySQL DATETIME datatype in 
> order to use this larger supported range?
> [https://dev.mysql.com/doc/refman/5.7/en/datetime.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153127#comment-17153127
 ] 

Apache Spark commented on SPARK-32215:
--

User 'agrawaldevesh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29015

> Expose end point on Master so that it can be informed about decommissioned 
> workers out of band
> --
>
> Key: SPARK-32215
> URL: https://issues.apache.org/jira/browse/SPARK-32215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
> Environment: Standalone Scheduler 
>Reporter: Devesh Agrawal
>Priority: Major
> Fix For: 3.1.0
>
>
> The use case here is to allow some external entity that has made a 
> decommissioning decision to inform the Master (in the case of Standalone 
> scheduling mode).
> The current decommissioning is triggered by the Worker getting a SIGPWR
>  (out of band possibly by some cleanup hook), which then informs the Master
>  about it. This approach may not be feasible in some environments that cannot
>  trigger a clean up hook on the Worker.
> Add a new post endpoint {{/workers/kill}} on the MasterWebUI that allows an
>  external agent to inform the master about all the nodes being decommissioned 
> in
>  bulk. The workers are identified by either their {{host:port}} or just the 
> host
>  – in which case all workers on the host would be decommissioned.
> This API is merely a new entry point into the existing decommissioning
>  logic. It does not change how the decommissioning request is handled in
>  its core.
> The path /workers/kill is so chosen to be consistent with the other endpoint 
> names on the MasterWebUI. 
> Since this is a sensitive operation, this API will be disabled by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153126#comment-17153126
 ] 

Apache Spark commented on SPARK-32215:
--

User 'agrawaldevesh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29015

> Expose end point on Master so that it can be informed about decommissioned 
> workers out of band
> --
>
> Key: SPARK-32215
> URL: https://issues.apache.org/jira/browse/SPARK-32215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
> Environment: Standalone Scheduler 
>Reporter: Devesh Agrawal
>Priority: Major
> Fix For: 3.1.0
>
>
> The use case here is to allow some external entity that has made a 
> decommissioning decision to inform the Master (in the case of Standalone 
> scheduling mode).
> The current decommissioning is triggered by the Worker getting a SIGPWR
>  (out of band possibly by some cleanup hook), which then informs the Master
>  about it. This approach may not be feasible in some environments that cannot
>  trigger a clean up hook on the Worker.
> Add a new post endpoint {{/workers/kill}} on the MasterWebUI that allows an
>  external agent to inform the master about all the nodes being decommissioned 
> in
>  bulk. The workers are identified by either their {{host:port}} or just the 
> host
>  – in which case all workers on the host would be decommissioned.
> This API is merely a new entry point into the existing decommissioning
>  logic. It does not change how the decommissioning request is handled in
>  its core.
> The path /workers/kill is so chosen to be consistent with the other endpoint 
> names on the MasterWebUI. 
> Since this is a sensitive operation, this API will be disabled by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32215:


Assignee: Apache Spark

> Expose end point on Master so that it can be informed about decommissioned 
> workers out of band
> --
>
> Key: SPARK-32215
> URL: https://issues.apache.org/jira/browse/SPARK-32215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
> Environment: Standalone Scheduler 
>Reporter: Devesh Agrawal
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> The use case here is to allow some external entity that has made a 
> decommissioning decision to inform the Master (in the case of Standalone 
> scheduling mode).
> The current decommissioning is triggered by the Worker getting a SIGPWR
>  (out of band possibly by some cleanup hook), which then informs the Master
>  about it. This approach may not be feasible in some environments that cannot
>  trigger a clean up hook on the Worker.
> Add a new post endpoint {{/workers/kill}} on the MasterWebUI that allows an
>  external agent to inform the master about all the nodes being decommissioned 
> in
>  bulk. The workers are identified by either their {{host:port}} or just the 
> host
>  – in which case all workers on the host would be decommissioned.
> This API is merely a new entry point into the existing decommissioning
>  logic. It does not change how the decommissioning request is handled in
>  its core.
> The path /workers/kill is so chosen to be consistent with the other endpoint 
> names on the MasterWebUI. 
> Since this is a sensitive operation, this API will be disabled by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32215:


Assignee: (was: Apache Spark)

> Expose end point on Master so that it can be informed about decommissioned 
> workers out of band
> --
>
> Key: SPARK-32215
> URL: https://issues.apache.org/jira/browse/SPARK-32215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
> Environment: Standalone Scheduler 
>Reporter: Devesh Agrawal
>Priority: Major
> Fix For: 3.1.0
>
>
> The use case here is to allow some external entity that has made a 
> decommissioning decision to inform the Master (in the case of Standalone 
> scheduling mode).
> The current decommissioning is triggered by the Worker getting a SIGPWR
>  (out of band possibly by some cleanup hook), which then informs the Master
>  about it. This approach may not be feasible in some environments that cannot
>  trigger a clean up hook on the Worker.
> Add a new post endpoint {{/workers/kill}} on the MasterWebUI that allows an
>  external agent to inform the master about all the nodes being decommissioned 
> in
>  bulk. The workers are identified by either their {{host:port}} or just the 
> host
>  – in which case all workers on the host would be decommissioned.
> This API is merely a new entry point into the existing decommissioning
>  logic. It does not change how the decommissioning request is handled in
>  its core.
> The path /workers/kill is so chosen to be consistent with the other endpoint 
> names on the MasterWebUI. 
> Since this is a sensitive operation, this API will be disabled by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band

2020-07-07 Thread Devesh Agrawal (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devesh Agrawal updated SPARK-32215:
---
Description: 
The use case here is to allow some external entity that has made a 
decommissioning decision to inform the Master (in the case of Standalone scheduling 
mode).

The current decommissioning is triggered by the Worker getting a SIGPWR
 (out of band possibly by some cleanup hook), which then informs the Master
 about it. This approach may not be feasible in some environments that cannot
 trigger a clean up hook on the Worker.

Add a new post endpoint {{/workers/kill}} on the MasterWebUI that allows an
 external agent to inform the master about all the nodes being decommissioned in
 bulk. The workers are identified by either their {{host:port}} or just the host
 – in which case all workers on the host would be decommissioned.

This API is merely a new entry point into the existing decommissioning
 logic. It does not change how the decommissioning request is handled in
 its core.

The path /workers/kill is so chosen to be consistent with the other endpoint 
names on the MasterWebUI. 

Since this is a sensitive operation, this API will be disabled by default.

  was:
The use case here is to allow some external entity that has made a 
decommissioning decision to inform the Master (in the case of Standalone scheduling 
mode).

The current decommissioning is triggered by the Worker getting a SIGPWR
(out of band possibly by some cleanup hook), which then informs the Master
about it. This approach may not be feasible in some environments that cannot
trigger a clean up hook on the Worker.

Add a new post endpoint {{/workers/kill}} on the MasterWebUI that allows an
external agent to inform the master about all the nodes being decommissioned in
bulk. The workers are identified by either their {{host:port}} or just the host
-- in which case all workers on the host would be decommissioned.

This API is merely a new entry point into the existing decommissioning
logic. It does not change how the decommissioning request is handled in
its core.

The path /workers/kill is so chosen to be consistent with the other endpoint 
names on the MasterWebUI. 


> Expose end point on Master so that it can be informed about decommissioned 
> workers out of band
> --
>
> Key: SPARK-32215
> URL: https://issues.apache.org/jira/browse/SPARK-32215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
> Environment: Standalone Scheduler 
>Reporter: Devesh Agrawal
>Priority: Major
> Fix For: 3.1.0
>
>
> The use case here is to allow some external entity that has made a 
> decommissioning decision to inform the Master (in the case of Standalone 
> scheduling mode).
> The current decommissioning is triggered by the Worker getting a SIGPWR
>  (out of band possibly by some cleanup hook), which then informs the Master
>  about it. This approach may not be feasible in some environments that cannot
>  trigger a clean up hook on the Worker.
> Add a new post endpoint {{/workers/kill}} on the MasterWebUI that allows an
>  external agent to inform the master about all the nodes being decommissioned 
> in
>  bulk. The workers are identified by either their {{host:port}} or just the 
> host
>  – in which case all workers on the host would be decommissioned.
> This API is merely a new entry point into the existing decommissioning
>  logic. It does not change how the decommissioning request is handled in
>  its core.
> The path /workers/kill is so chosen to be consistent with the other endpoint 
> names on the MasterWebUI. 
> Since this is a sensitive operation, this API will be disabled by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32215) Expose end point on Master so that it can be informed about decommissioned workers out of band

2020-07-07 Thread Devesh Agrawal (Jira)
Devesh Agrawal created SPARK-32215:
--

 Summary: Expose end point on Master so that it can be informed 
about decommissioned workers out of band
 Key: SPARK-32215
 URL: https://issues.apache.org/jira/browse/SPARK-32215
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
 Environment: Standalone Scheduler 
Reporter: Devesh Agrawal
 Fix For: 3.1.0


The use case here is to allow some external entity that has made a 
decommissioning decision to inform the Master (in the case of Standalone scheduling 
mode).

The current decommissioning is triggered by the Worker getting a SIGPWR
(out of band possibly by some cleanup hook), which then informs the Master
about it. This approach may not be feasible in some environments that cannot
trigger a clean up hook on the Worker.

Add a new post endpoint {{/workers/kill}} on the MasterWebUI that allows an
external agent to inform the master about all the nodes being decommissioned in
bulk. The workers are identified by either their {{host:port}} or just the host
-- in which case all workers on the host would be decommissioned.

This API is merely a new entry point into the existing decommissioning
logic. It does not change how the decommissioning request is handled in
its core.

The path /workers/kill is so chosen to be consistent with the other endpoint 
names on the MasterWebUI. 
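
A hedged sketch of how an external agent might call the endpoint (the form parameter name {{host}} and the master URL are assumptions for illustration, not taken from this issue; requires JDK 11+ for java.net.http):
{code:scala}
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object DecommissionWorkers {
  def main(args: Array[String]): Unit = {
    val client = HttpClient.newHttpClient()
    // Decommission every worker on the given host; the parameter name is hypothetical.
    val request = HttpRequest.newBuilder(URI.create("http://spark-master.example.com:8080/workers/kill"))
      .header("Content-Type", "application/x-www-form-urlencoded")
      .POST(HttpRequest.BodyPublishers.ofString("host=worker1.example.com"))
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(s"Master responded with HTTP ${response.statusCode()}")
  }
}
{code}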



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32093) Add hadoop-ozone-filesystem jar to ozone profile

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153123#comment-17153123
 ] 

Apache Spark commented on SPARK-32093:
--

User 'bharatviswa504' has created a pull request for this issue:
https://github.com/apache/spark/pull/29030

> Add hadoop-ozone-filesystem jar to ozone profile
> 
>
> Key: SPARK-32093
> URL: https://issues.apache.org/jira/browse/SPARK-32093
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Bharat Viswanadham
>Priority: Major
>
>  This Jira is to include Ozone filesystem jar with a new profile "ozone" in 
> mvn.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32093) Add hadoop-ozone-filesystem jar to ozone profile

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32093:


Assignee: Apache Spark

> Add hadoop-ozone-filesystem jar to ozone profile
> 
>
> Key: SPARK-32093
> URL: https://issues.apache.org/jira/browse/SPARK-32093
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Bharat Viswanadham
>Assignee: Apache Spark
>Priority: Major
>
>  This Jira is to include Ozone filesystem jar with a new profile "ozone" in 
> mvn.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32093) Add hadoop-ozone-filesystem jar to ozone profile

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32093:


Assignee: (was: Apache Spark)

> Add hadoop-ozone-filesystem jar to ozone profile
> 
>
> Key: SPARK-32093
> URL: https://issues.apache.org/jira/browse/SPARK-32093
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Bharat Viswanadham
>Priority: Major
>
>  This Jira is to include Ozone filesystem jar with a new profile "ozone" in 
> mvn.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization

2020-07-07 Thread Erik Erlandson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-32159:
---
Fix Version/s: 3.0.1

> New udaf(Aggregator) has an integration bug with UnresolvedMapObjects 
> serialization
> ---
>
> Key: SPARK-32159
> URL: https://issues.apache.org/jira/browse/SPARK-32159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Priority: Major
> Fix For: 3.0.1
>
>
> The new user defined aggregator feature (SPARK-27296) based on calling 
> 'functions.udaf(aggregator)' works fine when the aggregator input type is 
> atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an 
> array, like 'Aggregator[Array[Double], _, _]',  it is tripping over the 
> following:
> /**
>  * When constructing [[MapObjects]], the element type must be given, which 
> may not be available
>  * before analysis. This class acts like a placeholder for [[MapObjects]], 
> and will be replaced by
>  * [[MapObjects]] during analysis after the input data is resolved.
>  * Note that, ideally we should not serialize and send unresolved expressions 
> to executors, but
>  * users may accidentally do this(e.g. mistakenly reference an encoder 
> instance when implementing
>  * Aggregator). Here we mark `function` as transient because it may reference 
> scala Type, which is
>  * not serializable. Then even users mistakenly reference unresolved 
> expression and serialize it,
>  * it's just a performance issue(more network traffic), and will not fail.
>  */
>  case class UnresolvedMapObjects(
>  {color:#de350b}@transient function: Expression => Expression{color},
>  child: Expression,
>  customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with 
> Unevaluable {
>  override lazy val resolved = false
> override def dataType: DataType = 
> customCollectionCls.map(ObjectType.apply).getOrElse
> { throw new UnsupportedOperationException("not resolved") }
> }
>  
> *The '@transient' is causing the function to be unpacked as 'null' over on 
> the executors, and it is causing a null-pointer exception here, when it tries 
> to do 'function(loopVar)'*
> object MapObjects {
>  def apply(
>  function: Expression => Expression,
>  inputData: Expression,
>  elementType: DataType,
>  elementNullable: Boolean = true,
>  customCollectionCls: Option[Class[_]] = None): MapObjects =
> { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) 
> MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, 
> customCollectionCls) }
> }
> *I believe it may be possible to just use 'loopVar' instead of 
> 'function(loopVar)', whenever 'function' is null, but need second opinion 
> from catalyst developers on what a robust fix should be*
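
A hypothetical sketch of that suggestion (not a committed fix), guarding the call in the apply method quoted above so that a null `function` falls back to the loop variable itself (Catalyst imports omitted since this mirrors the snippet above):
{code:scala}
def apply(
    function: Expression => Expression,
    inputData: Expression,
    elementType: DataType,
    elementNullable: Boolean = true,
    customCollectionCls: Option[Class[_]] = None): MapObjects = {
  val loopVar = LambdaVariable("MapObject", elementType, elementNullable)
  // If `function` was lost because it is @transient, keep the element as-is.
  val lambdaBody = if (function != null) function(loopVar) else loopVar
  MapObjects(loopVar, lambdaBody, inputData, customCollectionCls)
}
{code}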



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32214) The type conversion function generated in makeFromJava for "other" type uses a wrong variable.

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32214:


Assignee: Kousuke Saruta  (was: Apache Spark)

> The type conversion function generated in makeFromJava for "other"  type uses 
> a wrong variable.
> ---
>
> Key: SPARK-32214
> URL: https://issues.apache.org/jira/browse/SPARK-32214
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> `makeFromJava` in `EvaluatePython` creates a type conversion function for some 
> Java/Scala types.
> For `other` type, the parameter of the type conversion function is named 
> `obj` but `other` is mistakenly used rather than `obj` in the function body.
> {code:java}
> case other => (obj: Any) => nullSafeConvert(other)(PartialFunction.empty) 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32214) The type conversion function generated in makeFromJava for "other" type uses a wrong variable.

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153118#comment-17153118
 ] 

Apache Spark commented on SPARK-32214:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29029

> The type conversion function generated in makeFromJava for "other"  type uses 
> a wrong variable.
> ---
>
> Key: SPARK-32214
> URL: https://issues.apache.org/jira/browse/SPARK-32214
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> `makeFromJava` in `EvaluatePython` creates a type conversion function for some 
> Java/Scala types.
> For `other` type, the parameter of the type conversion function is named 
> `obj` but `other` is mistakenly used rather than `obj` in the function body.
> {code:java}
> case other => (obj: Any) => nullSafeConvert(other)(PartialFunction.empty) 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32214) The type conversion function generated in makeFromJava for "other" type uses a wrong variable.

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153117#comment-17153117
 ] 

Apache Spark commented on SPARK-32214:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29029

> The type conversion function generated in makeFromJava for "other"  type uses 
> a wrong variable.
> ---
>
> Key: SPARK-32214
> URL: https://issues.apache.org/jira/browse/SPARK-32214
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> `makeFromJava` in `EvaluatePython` creates a type conversion function for some 
> Java/Scala types.
> For `other` type, the parameter of the type conversion function is named 
> `obj` but `other` is mistakenly used rather than `obj` in the function body.
> {code:java}
> case other => (obj: Any) => nullSafeConvert(other)(PartialFunction.empty) 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32214) The type conversion function generated in makeFromJava for "other" type uses a wrong variable.

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32214:


Assignee: Apache Spark  (was: Kousuke Saruta)

> The type conversion function generated in makeFromJava for "other"  type uses 
> a wrong variable.
> ---
>
> Key: SPARK-32214
> URL: https://issues.apache.org/jira/browse/SPARK-32214
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> `makeFromJava` in `EvaluatePython` creates a type conversion function for some 
> Java/Scala types.
> For `other` type, the parameter of the type conversion function is named 
> `obj` but `other` is mistakenly used rather than `obj` in the function body.
> {code:java}
> case other => (obj: Any) => nullSafeConvert(other)(PartialFunction.empty) 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32214) The type conversion function generated in makeFromJava for "other" type uses a wrong variable.

2020-07-07 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-32214:
--

 Summary: The type conversion function generated in makeFromJava 
for "other"  type uses a wrong variable.
 Key: SPARK-32214
 URL: https://issues.apache.org/jira/browse/SPARK-32214
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 2.4.6, 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


`makeFromJava` in `EvaluatePython` creates a type conversion function for some 
Java/Scala types.

For `other` type, the parameter of the type conversion function is named `obj` 
but `other` is mistakenly used rather than `obj` in the function body.
{code:java}
case other => (obj: Any) => nullSafeConvert(other)(PartialFunction.empty) {code}
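Presumably the fix is simply to reference the lambda parameter in the body instead, i.e. something like:
{code:scala}
case other => (obj: Any) => nullSafeConvert(obj)(PartialFunction.empty)
{code}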



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32213) saveAsTable deletes all files in path

2020-07-07 Thread Yuval Rochman (Jira)
Yuval Rochman created SPARK-32213:
-

 Summary: saveAsTable deletes all files in path
 Key: SPARK-32213
 URL: https://issues.apache.org/jira/browse/SPARK-32213
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Yuval Rochman


The problem is presented in the following link:

[https://stackoverflow.com/questions/62782637/saveastable-can-delete-all-my-files-in-desktop?noredirect=1#comment111026138_62782637]

Apparently, without any warning, all files on the desktop were deleted after 
writing a file.
There is no warning in PySpark that the "path" parameter can cause this problem. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31625) Unregister application from YARN resource manager outside the shutdown hook

2020-07-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31625.
--
Resolution: Not A Problem

> Unregister application from YARN resource manager outside the shutdown hook
> ---
>
> Key: SPARK-31625
> URL: https://issues.apache.org/jira/browse/SPARK-31625
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> Currently, an application is unregistered from YARN resource manager as a 
> shutdown hook. In the scenario where the shutdown hook does not run (e.g., 
> timeouts, etc.), the application is not unregistered, resulting in YARN 
> resubmitting the application even if it succeeded.
> For example, you could see the following on the driver log:
> {code:java}
> 20/04/30 06:20:29 INFO SparkContext: Successfully stopped SparkContext
> 20/04/30 06:20:29 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 20/04/30 06:20:59 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, 
> java.util.concurrent.TimeoutException
> java.util.concurrent.TimeoutException
>   at java.util.concurrent.FutureTask.get(FutureTask.java:205)
>   at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> {code}
> On the YARN RM side:
> {code:java}
> 2020-04-30 06:21:25,083 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1588227360159_0001_01_01 Container Transitioned from RUNNING to 
> COMPLETED
> 2020-04-30 06:21:25,085 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Updating application attempt appattempt_1588227360159_0001_01 with final 
> state: FAILED, and exit status: 0
> 2020-04-30 06:21:25,085 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1588227360159_0001_01 State change from RUNNING to 
> FINAL_SAVING on event = CONTAINER_FINISHED
> {code}
> You see that the final state of the application becomes FAILED since the 
> container is finished before the application is unregistered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32212) RDD.takeOrdered can choose to merge intermediate results in executor or driver

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153060#comment-17153060
 ] 

Apache Spark commented on SPARK-32212:
--

User 'izchen' has created a pull request for this issue:
https://github.com/apache/spark/pull/29028

> RDD.takeOrdered can choose to merge intermediate results in executor or driver
> --
>
> Key: SPARK-32212
> URL: https://issues.apache.org/jira/browse/SPARK-32212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Chen Zhang
>Priority: Major
>
> In the list of issues, I saw some discussions about exceeding the memory 
> limit of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit 
> xx)_. I think that the implementation of _RDD.takeOrdered_ can be improved.
> In the original code implementation of _RDD.takeOrdered_, the QuickSelect 
> algorithm in guava is used in the executor process to calculate the local 
> TopK results of each RDD partition. These intermediate results are packaged 
> into java.util.PriorityQueue and returned to the driver process. In the 
> driver process, these intermediate results are merged to get the global TopK 
> results.
> The problem with this implementation is that if the intermediate results are 
> too large and there are too many partitions, the intermediate results may 
> accumulate in the memory of the driver process, causing excessive memory pressure.
> We can use an optional config to determine whether the intermediate 
> results (local TopK) of each partition in _RDD.takeOrdered_ are merged in the 
> driver process or in the executor process. If set to true, the merge happens in 
> the driver process (via util.PriorityQueue), which gives a shorter wait for the 
> result. But if the intermediate results are too large and there are too many 
> partitions, they may accumulate in the memory of the driver process, causing 
> excessive memory pressure. If set to false, the merge happens in an executor 
> process (via guava.QuickSelect); intermediate results will not accumulate in 
> driver memory, but runtimes will be longer.
> something like:
> _(org.apache.spark.rdd.RDD)_
> {code:scala}
>   def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
> if (num == 0 || partitions.length == 0) {
>   Array.empty
> } else {
>   if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
> val mapRDDs = mapPartitions { items =>
>   // Priority keeps the largest elements, so let's reverse the 
> ordering.
>   val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
>   queue ++= collectionUtils.takeOrdered(items, num)(ord)
>   Iterator.single(queue)
> }
> mapRDDs.reduce { (queue1, queue2) =>
>   queue1 ++= queue2
>   queue1
> }.toArray.sorted(ord)
>   } else {
> mapPartitions { items =>
>   collectionUtils.takeOrdered(items, num)(ord)
> }.repartition(1).mapPartitions { items =>
>   collectionUtils.takeOrdered(items, num)(ord)
> }.collect()
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32212) RDD.takeOrdered can choose to merge intermediate results in executor or driver

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32212:


Assignee: (was: Apache Spark)

> RDD.takeOrdered can choose to merge intermediate results in executor or driver
> --
>
> Key: SPARK-32212
> URL: https://issues.apache.org/jira/browse/SPARK-32212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Chen Zhang
>Priority: Major
>
> In the list of issues, I saw some discussions about exceeding the memory 
> limit of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit 
> xx)_. I think that the implementation of _RDD.takeOrdered_ can be improved.
> In the original code implementation of _RDD.takeOrdered_, the QuickSelect 
> algorithm in guava is used in the executor process to calculate the local 
> TopK results of each RDD partition. These intermediate results are packaged 
> into java.util.PriorityQueue and returned to the driver process. In the 
> driver process, these intermediate results are merged to get the global TopK 
> results.
> The problem with this implementation is that if the intermediate results are 
> too large and there are too many partitions, the intermediate results may 
> accumulate in the memory of the driver process, causing excessive memory pressure.
> We can use an optional config to determine whether the intermediate 
> results (local TopK) of each partition in _RDD.takeOrdered_ are merged in the 
> driver process or in the executor process. If set to true, the merge happens in 
> the driver process (via util.PriorityQueue), which gives a shorter wait for the 
> result. But if the intermediate results are too large and there are too many 
> partitions, they may accumulate in the memory of the driver process, causing 
> excessive memory pressure. If set to false, the merge happens in an executor 
> process (via guava.QuickSelect); intermediate results will not accumulate in 
> driver memory, but runtimes will be longer.
> something like:
> _(org.apache.spark.rdd.RDD)_
> {code:scala}
>   def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
> if (num == 0 || partitions.length == 0) {
>   Array.empty
> } else {
>   if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
> val mapRDDs = mapPartitions { items =>
>   // Priority keeps the largest elements, so let's reverse the 
> ordering.
>   val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
>   queue ++= collectionUtils.takeOrdered(items, num)(ord)
>   Iterator.single(queue)
> }
> mapRDDs.reduce { (queue1, queue2) =>
>   queue1 ++= queue2
>   queue1
> }.toArray.sorted(ord)
>   } else {
> mapPartitions { items =>
>   collectionUtils.takeOrdered(items, num)(ord)
> }.repartition(1).mapPartitions { items =>
>   collectionUtils.takeOrdered(items, num)(ord)
> }.collect()
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32212) RDD.takeOrdered can choose to merge intermediate results in executor or driver

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153059#comment-17153059
 ] 

Apache Spark commented on SPARK-32212:
--

User 'izchen' has created a pull request for this issue:
https://github.com/apache/spark/pull/29028

> RDD.takeOrdered can choose to merge intermediate results in executor or driver
> --
>
> Key: SPARK-32212
> URL: https://issues.apache.org/jira/browse/SPARK-32212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Chen Zhang
>Priority: Major
>
> In the list of issues, I saw some discussions about exceeding the memory 
> limit of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit 
> xx)_. I think that the implementation of _RDD.takeOrdered_ can be improved.
> In the original code implementation of _RDD.takeOrdered_, the QuickSelect 
> algorithm in guava is used in the executor process to calculate the local 
> TopK results of each RDD partition. These intermediate results are packaged 
> into java.util.PriorityQueue and returned to the driver process. In the 
> driver process, these intermediate results are merged to get the global TopK 
> results.
> The problem with this implementation is that if the intermediate results are 
> too large and there are too many partitions, the intermediate results may 
> accumulate in the memory of the driver process, causing excessive memory pressure.
> We can use an optional config to determine whether the intermediate 
> results (local TopK) of each partition in _RDD.takeOrdered_ are merged in the 
> driver process or in the executor process. If set to true, the merge happens in 
> the driver process (via util.PriorityQueue), which gives a shorter wait for the 
> result. But if the intermediate results are too large and there are too many 
> partitions, they may accumulate in the memory of the driver process, causing 
> excessive memory pressure. If set to false, the merge happens in an executor 
> process (via guava.QuickSelect); intermediate results will not accumulate in 
> driver memory, but runtimes will be longer.
> something like:
> _(org.apache.spark.rdd.RDD)_
> {code:scala}
>   def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
> if (num == 0 || partitions.length == 0) {
>   Array.empty
> } else {
>   if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
> val mapRDDs = mapPartitions { items =>
>   // Priority keeps the largest elements, so let's reverse the 
> ordering.
>   val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
>   queue ++= collectionUtils.takeOrdered(items, num)(ord)
>   Iterator.single(queue)
> }
> mapRDDs.reduce { (queue1, queue2) =>
>   queue1 ++= queue2
>   queue1
> }.toArray.sorted(ord)
>   } else {
> mapPartitions { items =>
>   collectionUtils.takeOrdered(items, num)(ord)
> }.repartition(1).mapPartitions { items =>
>   collectionUtils.takeOrdered(items, num)(ord)
> }.collect()
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32212) RDD.takeOrdered can choose to merge intermediate results in executor or driver

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32212:


Assignee: Apache Spark

> RDD.takeOrdered can choose to merge intermediate results in executor or driver
> --
>
> Key: SPARK-32212
> URL: https://issues.apache.org/jira/browse/SPARK-32212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Chen Zhang
>Assignee: Apache Spark
>Priority: Major
>
> In the list of issues, I saw some discussions about exceeding the memory 
> limit of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit 
> xx)_. I think that the implementation of _RDD.takeOrdered_ can be improved.
> In the original code implementation of _RDD.takeOrdered_, the QuickSelect 
> algorithm in guava is used in the executor process to calculate the local 
> TopK results of each RDD partition. These intermediate results are packaged 
> into java.util.PriorityQueue and returned to the driver process. In the 
> driver process, these intermediate results are merged to get the global TopK 
> results.
> The problem with this implementation is that if the intermediate results are 
> too large and there are too many partitions, the intermediate results may 
> accumulate in the memory of the driver process, causing excessive memory pressure.
> We can use an optional config to determine whether the intermediate 
> results (local TopK) of each partition in _RDD.takeOrdered_ are merged in the 
> driver process or in the executor process. If set to true, the merge happens in 
> the driver process (via util.PriorityQueue), which gives a shorter wait for the 
> result. But if the intermediate results are too large and there are too many 
> partitions, they may accumulate in the memory of the driver process, causing 
> excessive memory pressure. If set to false, the merge happens in an executor 
> process (via guava.QuickSelect); intermediate results will not accumulate in 
> driver memory, but runtimes will be longer.
> something like:
> _(org.apache.spark.rdd.RDD)_
> {code:scala}
>   def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
> if (num == 0 || partitions.length == 0) {
>   Array.empty
> } else {
>   if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
> val mapRDDs = mapPartitions { items =>
>   // Priority keeps the largest elements, so let's reverse the 
> ordering.
>   val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
>   queue ++= collectionUtils.takeOrdered(items, num)(ord)
>   Iterator.single(queue)
> }
> mapRDDs.reduce { (queue1, queue2) =>
>   queue1 ++= queue2
>   queue1
> }.toArray.sorted(ord)
>   } else {
> mapPartitions { items =>
>   collectionUtils.takeOrdered(items, num)(ord)
> }.repartition(1).mapPartitions { items =>
>   collectionUtils.takeOrdered(items, num)(ord)
> }.collect()
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32212) RDD.takeOrdered can choose to merge intermediate results in executor or driver

2020-07-07 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated SPARK-32212:
---
Summary: RDD.takeOrdered can choose to merge intermediate results in 
executor or driver  (was: RDD.takeOrdered merge intermediate results can be 
configured in driver or executor)

> RDD.takeOrdered can choose to merge intermediate results in executor or driver
> --
>
> Key: SPARK-32212
> URL: https://issues.apache.org/jira/browse/SPARK-32212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Chen Zhang
>Priority: Major
>
> In the list of issues, I saw some discussions about exceeding the memory 
> limit of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit 
> xx)_. I think that the implementation of _RDD.takeOrdered_ can be improved.
> In the original code implementation of _RDD.takeOrdered_, the QuickSelect 
> algorithm in guava is used in the executor process to calculate the local 
> TopK results of each RDD partition. These intermediate results are packaged 
> into java.util.PriorityQueue and returned to the driver process. In the 
> driver process, these intermediate results are merged to get the global TopK 
> results.
> The problem with this implementation is that if the intermediate results are 
> too large and there are too many partitions, the intermediate results may 
> accumulate in the memory of the driver process, causing excessive memory pressure.
> We can use an optional config to determine whether the intermediate 
> results (local TopK) of each partition in _RDD.takeOrdered_ are merged in the 
> driver process or in the executor process. If set to true, the merge happens in 
> the driver process (via util.PriorityQueue), which gives a shorter wait for the 
> result. But if the intermediate results are too large and there are too many 
> partitions, they may accumulate in the memory of the driver process, causing 
> excessive memory pressure. If set to false, the merge happens in an executor 
> process (via guava.QuickSelect); intermediate results will not accumulate in 
> driver memory, but runtimes will be longer.
> something like:
> _(org.apache.spark.rdd.RDD)_
> {code:scala}
>   def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
> if (num == 0 || partitions.length == 0) {
>   Array.empty
> } else {
>   if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
> val mapRDDs = mapPartitions { items =>
>   // Priority keeps the largest elements, so let's reverse the 
> ordering.
>   val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
>   queue ++= collectionUtils.takeOrdered(items, num)(ord)
>   Iterator.single(queue)
> }
> mapRDDs.reduce { (queue1, queue2) =>
>   queue1 ++= queue2
>   queue1
> }.toArray.sorted(ord)
>   } else {
> mapPartitions { items =>
>   collectionUtils.takeOrdered(items, num)(ord)
> }.repartition(1).mapPartitions { items =>
>   collectionUtils.takeOrdered(items, num)(ord)
> }.collect()
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152977#comment-17152977
 ] 

Apache Spark commented on SPARK-32163:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29027

> Nested pruning should still work for nested column extractors of attributes 
> with cosmetic variations
> 
>
> Key: SPARK-32163
> URL: https://issues.apache.org/jira/browse/SPARK-32163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> If the expressions extracting nested fields have cosmetic variations like 
> qualifier difference, currently nested column pruning cannot work well.
> For example, two attributes which are semantically the same are referred to in 
> a query, but their nested column extractors are treated differently 
> when we deal with nested column pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32163.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28988
[https://github.com/apache/spark/pull/28988]

> Nested pruning should still work for nested column extractors of attributes 
> with cosmetic variations
> 
>
> Key: SPARK-32163
> URL: https://issues.apache.org/jira/browse/SPARK-32163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> If the expressions extracting nested fields have cosmetic variations like 
> qualifier difference, currently nested column pruning cannot work well.
> For example, two attributes which are semantically the same are referred to in 
> a query, but their nested column extractors are treated differently 
> when we deal with nested column pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32212) RDD.takeOrdered merge intermediate results can be configured in driver or executor

2020-07-07 Thread Chen Zhang (Jira)
Chen Zhang created SPARK-32212:
--

 Summary: RDD.takeOrdered merge intermediate results can be 
configured in driver or executor
 Key: SPARK-32212
 URL: https://issues.apache.org/jira/browse/SPARK-32212
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Chen Zhang


In the list of issues, I saw some discussions about exceeding the memory limit 
of the driver when using _RDD.takeOrdered_ or _SQL(order by xx limit xx)_. I 
think that the implementation of _RDD.takeOrdered_ can be improved.

In the original code implementation of _RDD.takeOrdered_, the QuickSelect 
algorithm in guava is used in the executor process to calculate the local TopK 
results of each RDD partition. These intermediate results are packaged into 
java.util.PriorityQueue and returned to the driver process. In the driver 
process, these intermediate results are merged to get the global TopK results.

The problem with this implementation is that if the intermediate results are 
too large and there are too many partitions, the intermediate results may 
accumulate in the memory of the driver process, causing excessive memory pressure.

We can use an optional config to determine whether the intermediate 
results (local TopK) of each partition in _RDD.takeOrdered_ are merged in the 
driver process or in the executor process. If set to true, the merge happens in the 
driver process (via util.PriorityQueue), which gives a shorter wait for the result. 
But if the intermediate results are too large and there are too many partitions, 
they may accumulate in the memory of the driver process, causing excessive memory 
pressure. If set to false, the merge happens in an executor process (via 
guava.QuickSelect); intermediate results will not accumulate in driver memory, but 
runtimes will be longer.

something like:
_(org.apache.spark.rdd.RDD)_
{code:scala}
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
if (num == 0 || partitions.length == 0) {
  Array.empty
} else {
  if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
val mapRDDs = mapPartitions { items =>
  // Priority keeps the largest elements, so let's reverse the ordering.
  val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
  queue ++= collectionUtils.takeOrdered(items, num)(ord)
  Iterator.single(queue)
}
mapRDDs.reduce { (queue1, queue2) =>
  queue1 ++= queue2
  queue1
}.toArray.sorted(ord)
  } else {
mapPartitions { items =>
  collectionUtils.takeOrdered(items, num)(ord)
}.repartition(1).mapPartitions { items =>
  collectionUtils.takeOrdered(items, num)(ord)
}.collect()
  }
}
  }
{code}
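
A usage sketch for the proposed switch. The config name is part of this proposal, not an existing Spark setting, and since SparkContext configs are fixed at startup it would have to be supplied when the session is built or via --conf.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("takeOrdered-demo")
  // Hypothetical config from this proposal: merge per-partition TopK on an
  // executor instead of in the driver.
  .config("spark.rdd.takeOrdered.mergeInDriver", "false")
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 200)

// Largest 10 values; with the flag above, the merge of the 200 partial results
// would no longer accumulate in driver memory.
val top10 = rdd.takeOrdered(10)(Ordering[Int].reverse)
{code}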



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32174) toPandas attempted Arrow optimization but has reached an error and can not continue

2020-07-07 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152958#comment-17152958
 ] 

Bryan Cutler commented on SPARK-32174:
--

From the stacktrace, it looks like you are using JDK9 or above, for which Arrow 
(really Netty) needs the JVM system property 
{{io.netty.tryReflectionSetAccessible}} set to true, see 
https://issues.apache.org/jira/browse/SPARK-29923 and the release notes. 
Could you confirm whether this solves your issue?
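
For reference, a minimal sketch of wiring that property in (shown in Scala; the same confs apply from PySpark, and the master URL is taken from the report below). The executors pick it up from the session config, while a driver JVM that is already running, e.g. spark-shell or a notebook, needs it supplied at launch via spark-defaults.conf or --conf spark.driver.extraJavaOptions=... .

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://10.0.1.40:7077")
  // Required by Arrow/Netty on JDK9+ (see SPARK-29923).
  .config("spark.executor.extraJavaOptions",
    "-Dio.netty.tryReflectionSetAccessible=true")
  .config("spark.sql.execution.arrow.pyspark.enabled", "true")
  .getOrCreate()
{code}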

> toPandas attempted Arrow optimization but has reached an error and can not 
> continue
> ---
>
> Key: SPARK-32174
> URL: https://issues.apache.org/jira/browse/SPARK-32174
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, PySpark
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, running in *stand-alone* mode
>Reporter: Ramin Hazegh
>Priority: Major
>
> h4. Converting a dataframe to Panda data frame using toPandas() fails.
>  
> *Spark 3.0.0 Running in stand-alone mode* using docker containers based on 
> jupyter docker stack here:
> [https://github.com/jupyter/docker-stacks/blob/master/pyspark-notebook/Dockerfile]
>  
> $ conda list | grep arrow
>  *arrow-cpp 0.17.1* py38h1234567_5_cpu conda-forge
>  *pyarrow 0.17.1* py38h1234567_5_cpu conda-forge
> $ conda list | grep pandas
>  *pandas 1.0.5* py38hcb8c335_0 conda-forge
>  
> *To reproduce:*
> {code:java}
> import numpy as np
> import pandas as pd
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("spark://10.0.1.40:7077") \
> .config("spark.sql.execution.arrow.enabled", "true") \
> .appName('test_arrow') \
> .getOrCreate()
> 
> # Generate a pandas DataFrame
> pdf = pd.DataFrame(np.random.rand(100, 3))
> # Create a Spark DataFrame from a pandas DataFrame using Arrow
> df = spark.createDataFrame(pdf)
> # Convert the Spark DataFrame back to a pandas DataFrame using Arrow
> result_pdf = df.select("*").toPandas()
> {code}
>  
> ==
> /usr/local/spark/python/pyspark/sql/pandas/conversion.py:134: UserWarning: 
> toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>  An error occurred while calling o55.getResult.
>  : org.apache.spark.SparkException: Exception thrown in awaitResult: 
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
>  at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:88)
>  at 
> org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:84)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>  at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>  at py4j.Gateway.invoke(Gateway.java:282)
>  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>  at py4j.commands.CallCommand.execute(CallCommand.java:79)
>  at py4j.GatewayConnection.run(GatewayConnection.java:238)
>  at java.base/java.lang.Thread.run(Thread.java:834)
>  Caused by: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 14 in stage 0.0 failed 4 times, most recent failure: Lost task 
> 14.3 in stage 0.0 (TID 31, 10.0.1.43, executor 0): 
> java.lang.UnsupportedOperationException: sun.misc.Unsafe or 
> java.nio.DirectByteBuffer.(long, int) not available
>  at 
> io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:490)
>  at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
>  at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
>  at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
>  at org.apache.arrow.vector.ipc.ReadChannel.readFully(ReadChannel.java:81)
>  at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessageBody(MessageSerializer.java:696)
>  at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:344)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$.loadBatch(ArrowConverters.scala:189)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.nextBatch(ArrowConverters.scala:165)
>  at 
> org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$2.(ArrowConverters.scala:144)
>  

[jira] [Updated] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32163:
--
Description: 
If the expressions extracting nested fields have cosmetic variations like 
qualifier difference, currently nested column pruning cannot work well.

For example, two attributes which are semantically the same, are referred in a 
query, but the nested column extractors of them are treated differently when we 
deal with nested column pruning.






  was:
Note that this is just an optimization issue and not a regression. The newly 
introduced optimizer doesn't optimize this corner case.

If the expressions extracting nested fields have cosmetic variations like 
qualifier difference, currently nested column pruning cannot work well.

For example, two attributes which are semantically the same, are referred in a 
query, but the nested column extractors of them are treated differently when we 
deal with nested column pruning.







> Nested pruning should still work for nested column extractors of attributes 
> with cosmetic variations
> 
>
> Key: SPARK-32163
> URL: https://issues.apache.org/jira/browse/SPARK-32163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> If the expressions extracting nested fields have cosmetic variations like 
> qualifier difference, currently nested column pruning cannot work well.
> For example, two attributes which are semantically the same, are referred in 
> a query, but the nested column extractors of them are treated differently 
> when we deal with nested column pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31317) Add withField method to Column class

2020-07-07 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152894#comment-17152894
 ] 

fqaiser94 commented on SPARK-31317:
---

Done. 

> Add withField method to Column class
> 
>
> Key: SPARK-31317
> URL: https://issues.apache.org/jira/browse/SPARK-31317
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32211) Pin mariadb-plugin-gssapi-server version to fix MariaDBKrbIntegrationSuite

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32211:
--
Summary: Pin mariadb-plugin-gssapi-server version to fix 
MariaDBKrbIntegrationSuite  (was: MariaDBKrbIntegrationSuite fails because of 
unwanted server upgrade)

> Pin mariadb-plugin-gssapi-server version to fix MariaDBKrbIntegrationSuite
> --
>
> Key: SPARK-32211
> URL: https://issues.apache.org/jira/browse/SPARK-32211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>
> The test fails with the following error:
> {code:java}
> 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' 
> user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally 
> without authentication)
> {code}
> This is because the docker image contains MariaDB version 
> 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and 
> mariadb-plugin-gssapi-server installation triggered unwanted database upgrade 
> inside the docker image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32211:
-

Assignee: Gabor Somogyi

> MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
> ---
>
> Key: SPARK-32211
> URL: https://issues.apache.org/jira/browse/SPARK-32211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>
> The test fails with the following error:
> {code:java}
> 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' 
> user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally 
> without authentication)
> {code}
> This is because the docker image contains MariaDB version 
> 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and 
> mariadb-plugin-gssapi-server installation triggered unwanted database upgrade 
> inside the docker image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32211.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29025
[https://github.com/apache/spark/pull/29025]

> MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
> ---
>
> Key: SPARK-32211
> URL: https://issues.apache.org/jira/browse/SPARK-32211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>
> The test fails with the following error:
> {code:java}
> 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' 
> user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally 
> without authentication)
> {code}
> This is because the docker image contains MariaDB version 
> 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and 
> mariadb-plugin-gssapi-server installation triggered unwanted database upgrade 
> inside the docker image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31317) Add withField method to Column class

2020-07-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31317.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27066
[https://github.com/apache/spark/pull/27066]

> Add withField method to Column class
> 
>
> Key: SPARK-31317
> URL: https://issues.apache.org/jira/browse/SPARK-31317
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152871#comment-17152871
 ] 

Apache Spark commented on SPARK-32018:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29026

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Priority: Major
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it back throws an ArithmeticException. The exception is thrown when 
> calling {{getDecimal}} on an UnsafeRow with an input decimal whose precision is 
> greater than the requested precision. Setting the value of the overflowed 
> decimal to null when writing into UnsafeRow should fix this issue.
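
A small sketch that illustrates only the overflow condition, not the UnsafeRow write path itself; it uses Spark's internal Decimal class, so treat it as illustrative.

{code:scala}
import org.apache.spark.sql.types.Decimal

// decimal(38,18) leaves room for 20 integral digits, so a 21-digit value
// cannot be represented. changePrecision reports that with `false` rather than
// throwing; the fix proposed here is to write null in that case, so a later
// getDecimal(ordinal, precision, scale) call never hits ArithmeticException.
val d = Decimal(BigDecimal("123456789012345678901")) // 21 integral digits
println(d.changePrecision(38, 18))                   // false: value does not fit
{code}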



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32018:


Assignee: (was: Apache Spark)

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Priority: Major
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it back throws an ArithmeticException. The exception is thrown when 
> calling {{getDecimal}} on an UnsafeRow with an input decimal whose precision is 
> greater than the requested precision. Setting the value of the overflowed 
> decimal to null when writing into UnsafeRow should fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32018:


Assignee: Apache Spark

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it back throws an ArithmeticException. The exception is thrown when 
> calling {{getDecimal}} on an UnsafeRow with an input decimal whose precision is 
> greater than the requested precision. Setting the value of the overflowed 
> decimal to null when writing into UnsafeRow should fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152869#comment-17152869
 ] 

Apache Spark commented on SPARK-32018:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29026

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Priority: Major
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it back throws an ArithmeticException. The exception is thrown when 
> calling {{getDecimal}} on an UnsafeRow with an input decimal whose precision is 
> greater than the requested precision. Setting the value of the overflowed 
> decimal to null when writing into UnsafeRow should fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152867#comment-17152867
 ] 

Apache Spark commented on SPARK-28067:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29026

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Mark Sirek
>Assignee: Sunitha Kambhampati
>Priority: Critical
>  Labels: correctness
> Fix For: 3.1.0
>
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |    4000.00|
> +-----------+
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |       null|
> +-----------+
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values should 
> also return null, indicating overflow, but it doesn't.  This may mislead the 
> user to think a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an 
> exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
> exceeds max precision 38
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32163:
--
Description: 
Note that this is just an optimization issue and not a regression. The newly 
introduced optimizer doesn't optimize this corner case.

If the expressions extracting nested fields have cosmetic variations like 
qualifier difference, currently nested column pruning cannot work well.

For example, two attributes which are semantically the same, are referred in a 
query, but the nested column extractors of them are treated differently when we 
deal with nested column pruning.






  was:
If the expressions extracting nested fields have cosmetic variations like 
qualifier difference, currently nested column pruning cannot work well.

For example, two attributes which are semantically the same, are referred in a 
query, but the nested column extractors of them are treated differently when we 
deal with nested column pruning.







> Nested pruning should still work for nested column extractors of attributes 
> with cosmetic variations
> 
>
> Key: SPARK-32163
> URL: https://issues.apache.org/jira/browse/SPARK-32163
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Note that this is just an optimization issue and not a regression. The newly 
> introduced optimizer doesn't optimize this corner case.
> If the expressions extracting nested fields have cosmetic variations like 
> qualifier difference, currently nested column pruning cannot work well.
> For example, two attributes which are semantically the same, are referred in 
> a query, but the nested column extractors of them are treated differently 
> when we deal with nested column pruning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32164) GeneralizedLinearRegressionSummary optimization

2020-07-07 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-32164.

Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28990
[https://github.com/apache/spark/pull/28990]

> GeneralizedLinearRegressionSummary optimization
> ---
>
> Key: SPARK-32164
> URL: https://issues.apache.org/jira/browse/SPARK-32164
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.1.0
>
>
> compute several statistics on single pass
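
A generic single-pass sketch of the idea (not the actual GLR summary code): fold several statistics in one traversal instead of one pass per statistic.

{code:scala}
val xs = Seq(1.0, 2.0, 3.0, 4.0)

// Count, sum and sum of squares in a single fold.
val (n, sum, sumSq) = xs.foldLeft((0L, 0.0, 0.0)) {
  case ((c, s, sq), x) => (c + 1, s + x, sq + x * x)
}

val mean = sum / n
val variance = sumSq / n - mean * mean   // population variance
{code}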



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32164) GeneralizedLinearRegressionSummary optimization

2020-07-07 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao reassigned SPARK-32164:
--

Assignee: zhengruifeng

> GeneralizedLinearRegressionSummary optimization
> ---
>
> Key: SPARK-32164
> URL: https://issues.apache.org/jira/browse/SPARK-32164
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> compute several statistics on single pass



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30985) Propagate SPARK_CONF_DIR files to driver and exec pods.

2020-07-07 Thread Prashant Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-30985:

Description: 
SPARK_CONF_DIR hosts configuration files like, 
 1) spark-defaults.conf - containing all the spark properties.
 2) log4j.properties - Logger configuration.
 3) spark-env.sh - Environment variables to be setup at driver and executor.
 4) core-site.xml - Hadoop related configuration.
 5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
 6) metrics.properties - Spark metrics.
 7) Any user specific - library or framework specific configuration file.

Traditionally, SPARK_CONF_DIR has been the home of all user-specific 
configuration files.

So this feature will let the user-specific configuration files be mounted on 
the driver and executor pods' SPARK_CONF_DIR.

Please review the attached design doc for more details.

 

[https://docs.google.com/document/d/1DUmNqMz5ky55yfegdh4e_CeItM_nqtrglFqFxsTxeeA/edit?usp=sharing]

 

  was:
SPARK_CONF_DIR hosts configuration files like, 
 1) spark-defaults.conf - containing all the spark properties.
 2) log4j.properties - Logger configuration.
 3) spark-env.sh - Environment variables to be setup at driver and executor.
 4) core-site.xml - Hadoop related configuration.
 5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
 6) metrics.properties - Spark metrics.
 7) Any user specific - library or framework specific configuration file.

Traditionally, SPARK_CONF_DIR has been the home to all user specific 
configuration files.

So this feature, will let the user specific configuration files be mounted on 
the driver and executor pods' SPARK_CONF_DIR.

Please review the attached design doc, for more details.

 

https://drive.google.com/file/d/1p6gaJyOJdlB1rosJDFner3bj5VekTCJ3/view?usp=sharing


> Propagate SPARK_CONF_DIR files to driver and exec pods.
> ---
>
> Key: SPARK-30985
> URL: https://issues.apache.org/jira/browse/SPARK-30985
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Priority: Major
>
> SPARK_CONF_DIR hosts configuration files like, 
>  1) spark-defaults.conf - containing all the spark properties.
>  2) log4j.properties - Logger configuration.
>  3) spark-env.sh - Environment variables to be setup at driver and executor.
>  4) core-site.xml - Hadoop related configuration.
>  5) fairscheduler.xml - Spark's fair scheduling policy at the job level.
>  6) metrics.properties - Spark metrics.
>  7) Any user specific - library or framework specific configuration file.
> Traditionally, SPARK_CONF_DIR has been the home of all user-specific 
> configuration files.
> So this feature will let the user-specific configuration files be mounted on 
> the driver and executor pods' SPARK_CONF_DIR.
> Please review the attached design doc for more details.
>  
> [https://docs.google.com/document/d/1DUmNqMz5ky55yfegdh4e_CeItM_nqtrglFqFxsTxeeA/edit?usp=sharing]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32047) Add provider disable possibility just like in delegation token provider

2020-07-07 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152784#comment-17152784
 ] 

Gabor Somogyi commented on SPARK-32047:
---

Started to work on this.

> Add provider disable possibility just like in delegation token provider
> ---
>
> Key: SPARK-32047
> URL: https://issues.apache.org/jira/browse/SPARK-32047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> There is an enable flag in the delegation token provider area, 
> "spark.security.credentials.%s.enabled".
> It would be good to add something similar to the JDBC secure connection 
> provider area, because this would make embedded providers interchangeable (an 
> embedded provider can be turned off and another provider with a different name 
> can be registered). This makes sense only if we create an API for the secure 
> JDBC connection provider.
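
For context, a sketch of the existing delegation-token pattern; the spark.security.credentials.<service>.enabled configs are real, while the JDBC analogue proposed here would follow the same shape and any concrete name for it is hypothetical until the API lands.

{code:scala}
import org.apache.spark.SparkConf

// Disable the built-in HBase delegation token provider so a custom one can be
// plugged in instead -- the interchangeability this ticket wants for JDBC too.
val conf = new SparkConf()
  .set("spark.security.credentials.hbase.enabled", "false")
{code}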



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32209) Re-use GetTimestamp in ParseToDate

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32209.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28999
[https://github.com/apache/spark/pull/28999]

> Re-use GetTimestamp in ParseToDate
> --
>
> Key: SPARK-32209
> URL: https://issues.apache.org/jira/browse/SPARK-32209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
>
> Replace the combination of the expressions SecondsToTimestamp and UnixTimestamp 
> by GetTimestamp in ParseToDate. This will allow eliminating unnecessary parsing 
> overhead in the chain: string -> timestamp -> long (seconds) -> timestamp -> 
> date. After the changes, the chain will be: string -> timestamp -> date.
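
The user-facing entry point for ParseToDate is to_date with a format string; a minimal example of the path whose internal expression chain this change shortens (assumes a spark-shell style session).

{code:scala}
import org.apache.spark.sql.functions.{col, to_date}
import spark.implicits._

val df = Seq("2020-07-07 12:34:56").toDF("s")
df.select(to_date(col("s"), "yyyy-MM-dd HH:mm:ss").as("d")).show()
// +----------+
// |         d|
// +----------+
// |2020-07-07|
// +----------+
{code}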



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32209) Re-use GetTimestamp in ParseToDate

2020-07-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32209:
-

Assignee: Maxim Gekk

> Re-use GetTimestamp in ParseToDate
> --
>
> Key: SPARK-32209
> URL: https://issues.apache.org/jira/browse/SPARK-32209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Replace the combination of the expressions SecondsToTimestamp and UnixTimestamp 
> by GetTimestamp in ParseToDate. This will allow eliminating unnecessary parsing 
> overhead in the chain: string -> timestamp -> long (seconds) -> timestamp -> 
> date. After the changes, the chain will be: string -> timestamp -> date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade

2020-07-07 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-32211:
--
Description: 
The test fails with the following error:
{code:java}
2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' user: 
'unauthenticated' host: '172.17.0.1' (This connection closed normally without 
authentication)
{code}
This is because the docker image contains MariaDB version 
1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and 
mariadb-plugin-gssapi-server installation triggered unwanted database upgrade 
inside the docker image.

  was:
The test fails with the following error:
{code:java}
2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' user: 
'unauthenticated' host: '172.17.0.1' (This connection closed normally without 
authentication)
{code}
This is because the docker image contains MariaDB version 
1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and 
mariadb-plugin-gssapi-server installation triggered database upgrade inside the 
docker image.


> MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
> ---
>
> Key: SPARK-32211
> URL: https://issues.apache.org/jira/browse/SPARK-32211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> The test fails with the following error:
> {code:java}
> 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' 
> user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally 
> without authentication)
> {code}
> This is because the docker image contains MariaDB version 
> 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and 
> mariadb-plugin-gssapi-server installation triggered unwanted database upgrade 
> inside the docker image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152763#comment-17152763
 ] 

Apache Spark commented on SPARK-32211:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/29025

> MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
> ---
>
> Key: SPARK-32211
> URL: https://issues.apache.org/jira/browse/SPARK-32211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> The test fails with the following error:
> {code:java}
> 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' 
> user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally 
> without authentication)
> {code}
> This is because the docker image contains MariaDB version 
> 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and 
> mariadb-plugin-gssapi-server installation triggered database upgrade inside 
> the docker image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32211:


Assignee: (was: Apache Spark)

> MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
> ---
>
> Key: SPARK-32211
> URL: https://issues.apache.org/jira/browse/SPARK-32211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> The test fails with the following error:
> {code:java}
> 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' 
> user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally 
> without authentication)
> {code}
> This is because the docker image contains MariaDB version 
> 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and 
> mariadb-plugin-gssapi-server installation triggered database upgrade inside 
> the docker image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32211:


Assignee: Apache Spark

> MariaDBKrbIntegrationSuite fails because of unwanted server upgrade
> ---
>
> Key: SPARK-32211
> URL: https://issues.apache.org/jira/browse/SPARK-32211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>
> The test fails with the following error:
> {code:java}
> 2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' 
> user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally 
> without authentication)
> {code}
> This is because the docker image contains MariaDB version 
> 1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and 
> mariadb-plugin-gssapi-server installation triggered database upgrade inside 
> the docker image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32211) MariaDBKrbIntegrationSuite fails because of unwanted server upgrade

2020-07-07 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-32211:
-

 Summary: MariaDBKrbIntegrationSuite fails because of unwanted 
server upgrade
 Key: SPARK-32211
 URL: https://issues.apache.org/jira/browse/SPARK-32211
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Gabor Somogyi


The test fails with the following error:
{code:java}
2020-07-07 12:43:13 8 [Warning] Aborted connection 8 to db: 'unconnected' user: 
'unauthenticated' host: '172.17.0.1' (This connection closed normally without 
authentication)
{code}
This is because the docker image contains MariaDB version 
1:10.4.12+maria~bionic but 1:10.4.13+maria~bionic came out and 
mariadb-plugin-gssapi-server installation triggered database upgrade inside the 
docker image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32210) Failed to serialize large MapStatuses

2020-07-07 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-32210:
---

 Summary: Failed to serialize large MapStatuses
 Key: SPARK-32210
 URL: https://issues.apache.org/jira/browse/SPARK-32210
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.4
Reporter: Yuming Wang


Driver side exception:
{noformat}
20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] 
spark.MapOutputTrackerMaster:91 :
java.lang.NegativeArraySizeException
at 
org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984)
at 
org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228)
at 
org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222)
at 
org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222)
at 
org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72)
at 
org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] 
spark.MapOutputTrackerMaster:91 :
java.lang.NegativeArraySizeException
at 
org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984)
at 
org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228)
at 
org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222)
at 
org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222)
at 
org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72)
at 
org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] 
spark.MapOutputTrackerMaster:91 :
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31975) Throw user facing error when use WindowFunction directly

2020-07-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31975:
---

Assignee: ulysses you

> Throw user facing error when use WindowFunction directly
> 
>
> Key: SPARK-31975
> URL: https://issues.apache.org/jira/browse/SPARK-31975
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31975) Throw user facing error when use WindowFunction directly

2020-07-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31975.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28808
[https://github.com/apache/spark/pull/28808]

> Throw user facing error when use WindowFunction directly
> 
>
> Key: SPARK-31975
> URL: https://issues.apache.org/jira/browse/SPARK-31975
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View

2020-07-07 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152738#comment-17152738
 ] 

Lantao Jin edited comment on SPARK-29038 at 7/7/20, 1:14 PM:
-

Hi [~AidenZhang], our focus for MV in recent months has been on two parts. One is 
optimization of the rewrite algorithm, such as forbidding count-distinct 
post-aggregation and avoiding unnecessary rewrites when doing relation 
replacement. The other is bug fixes in MV refresh: we use a Spark listener to 
deliver the metastore events that trigger a refresh. Some parts depend on a 
third-party system, so maybe only the interfaces are available in community 
Spark. I haven't done the partial/incremental refresh since it's not a blocker 
for us. I am not sure whether the community is still interested in the feature, 
but we are moving our existing implementation to Spark 3.0 now.


was (Author: cltlfcjin):
Hi [~AidenZhang], my focusings of MV in recent months are two parts. One is the 
rewrite algothim optimization. Such as forbidding count distict post 
aggregation, avoid unnecessary rewrite when do relation replacement. Another is 
bugfix in MV refresh. Use a Spark listener to deliver the metastore events to 
refresh. Some parts depends on third part system. So maybe only interfaces are 
available in community Spark. I don't do the partial/incremental refresh since 
it's not a blocker for us. I am not sure the community are still interested the 
feature, but we are moving existing implementation to Spark3.0 now.

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> A materialized view is an important approach in a DBMS to cache data in order 
> to accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible and can be configured freely 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cached data according to changes in the detail 
> source tables, simplifying the user's work. When a user submits a query, the 
> Spark optimizer rewrites the execution plan based on the available 
> materialized views to determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2020-07-07 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152738#comment-17152738
 ] 

Lantao Jin commented on SPARK-29038:


Hi [~AidenZhang], my focus for MV in recent months has been on two parts. One is 
optimization of the rewrite algorithm, such as forbidding count-distinct 
post-aggregation and avoiding unnecessary rewrites when doing relation 
replacement. The other is bug fixes in MV refresh: we use a Spark listener to 
deliver the metastore events that trigger a refresh. Some parts depend on a 
third-party system, so maybe only the interfaces are available in community 
Spark. I haven't done the partial/incremental refresh since it's not a blocker 
for us. I am not sure whether the community is still interested in the feature, 
but we are moving our existing implementation to Spark 3.0 now.

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> A materialized view is an important approach in a DBMS to cache data in order 
> to accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible and can be configured freely 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cached data according to changes in the detail 
> source tables, simplifying the user's work. When a user submits a query, the 
> Spark optimizer rewrites the execution plan based on the available 
> materialized views to determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32001:


Assignee: Apache Spark

> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>
> Adding an embedded provider for every possible database would generate a high 
> maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must 
> be loaded independently, just like delegation token providers.
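
Purely as a sketch of what such an API could look like, not the interface that will actually be merged; all names here are hypothetical.

{code:scala}
import java.sql.{Connection, Driver}

// Hypothetical pluggable provider, loaded via ServiceLoader just like the
// delegation token providers are.
trait JdbcConnectionProvider {
  /** Whether this provider can authenticate against the given JDBC options. */
  def canHandle(driver: Driver, options: Map[String, String]): Boolean

  /** Open a (Kerberos-)authenticated connection for the given options. */
  def getConnection(driver: Driver, options: Map[String, String]): Connection
}
{code}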



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32001:


Assignee: (was: Apache Spark)

> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Adding an embedded provider for every possible database would generate a high 
> maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must 
> be loaded independently, just like delegation token providers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152725#comment-17152725
 ] 

Apache Spark commented on SPARK-32001:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/29024

> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> Adding an embedded provider for every possible database would generate a high 
> maintenance cost on the Spark side.
> Instead, an API can be introduced which would allow further providers to be 
> implemented independently.
> One important requirement I suggest is: JDBC connection providers must 
> be loaded independently, just like delegation token providers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32205) Writing timestamp in mysql gets fails

2020-07-07 Thread Nilesh Patil (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152683#comment-17152683
 ] 

Nilesh Patil commented on SPARK-32205:
--

[~JinxinTang] In MySQL there is a DATETIME datatype that accepts values up to 
*9999-12-31 23:59:59*.

Using Spark, is it possible to write to a DATETIME column?
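
For what it's worth, one possible way to get a DATETIME column when Spark creates the target table is a custom JdbcDialect; a minimal sketch, registered before the write and not tested against every MySQL/connector version.

{code:scala}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, TimestampType}

// Map Spark's TimestampType to MySQL DATETIME instead of TIMESTAMP when Spark
// issues the CREATE TABLE for a JDBC write.
object MySqlDatetimeDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case TimestampType => Some(JdbcType("DATETIME", Types.TIMESTAMP))
    case _             => None
  }
}

JdbcDialects.registerDialect(MySqlDatetimeDialect)
{code}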

> Writing timestamp in mysql gets fails 
> --
>
> Key: SPARK-32205
> URL: https://issues.apache.org/jira/browse/SPARK-32205
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
>
> When we write to a MySQL TIMESTAMP column it supports only the range 
> '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC. MySQL has a DATETIME 
> datatype whose range is '1000-01-01 00:00:00' to '9999-12-31 23:59:59'.
> How can the Spark timestamp datatype be mapped to the MySQL DATETIME datatype 
> in order to use the wider supported range?
> [https://dev.mysql.com/doc/refman/5.7/en/datetime.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32205) Writing timestamp in mysql gets fails

2020-07-07 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152679#comment-17152679
 ] 

JinxinTang edited comment on SPARK-32205 at 7/7/20, 11:44 AM:
--

[~nileshr.patil]

Sure, this problem seems to be on the MySQL side, because MySQL cannot accept 
*9999-12-31 23:59:59* as a TIMESTAMP value. We can insert it directly from a 
MySQL client to make sure.


was (Author: jinxintang):
Sure, this problem seems in mysql side, because we cannot insert *-12-31 
23:59:59* to mysql timestamp type.

> Writing timestamp in mysql gets fails 
> --
>
> Key: SPARK-32205
> URL: https://issues.apache.org/jira/browse/SPARK-32205
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
>
> When we write to a MySQL TIMESTAMP column it supports only the range 
> '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC. MySQL has a DATETIME 
> datatype whose range is '1000-01-01 00:00:00' to '9999-12-31 23:59:59'.
> How can the Spark timestamp datatype be mapped to the MySQL DATETIME datatype 
> in order to use the wider supported range?
> [https://dev.mysql.com/doc/refman/5.7/en/datetime.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32205) Writing timestamp in mysql gets fails

2020-07-07 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152679#comment-17152679
 ] 

JinxinTang commented on SPARK-32205:


Sure, this problem seems to be on the MySQL side, because we cannot insert 
*9999-12-31 23:59:59* into a MySQL TIMESTAMP column.

> Writing timestamp in mysql gets fails 
> --
>
> Key: SPARK-32205
> URL: https://issues.apache.org/jira/browse/SPARK-32205
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.4
>Reporter: Nilesh Patil
>Priority: Major
>
> When we write to a MySQL TIMESTAMP column it supports only the range 
> '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC. MySQL has a DATETIME 
> datatype whose range is '1000-01-01 00:00:00' to '9999-12-31 23:59:59'.
> How can the Spark timestamp datatype be mapped to the MySQL DATETIME datatype 
> in order to use the wider supported range?
> [https://dev.mysql.com/doc/refman/5.7/en/datetime.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31635) Spark SQL Sort fails when sorting big data points

2020-07-07 Thread George George (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152674#comment-17152674
 ] 

George George commented on SPARK-31635:
---

Hello [~Chen Zhang], 

Thanks a lot for getting back on this. 

I would agree with you that it is an improvement. However, because it fails when 
using the DataFrame API and there is no documentation on it, I thought it was a 
bug.

Your suggestion sounds really good to me, and I think it's good to give the user 
the opportunity to configure this. Basically, the user can then decide whether 
to wait a little longer for the result or to put more pressure on the driver.

I could also try to submit a PR, but I would need a bit more time for it. Just 
let me know whether you would rather wait for my PR or do it yourself.

Best,

George

 

> Spark SQL Sort fails when sorting big data points
> -
>
> Key: SPARK-31635
> URL: https://issues.apache.org/jira/browse/SPARK-31635
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: George George
>Priority: Major
>
>  Please have a look at the example below: 
> {code:java}
> case class Point(x:Double, y:Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize((1L to 100L).map(a => 
> Nested(a,Seq.fill[Point](25)(Point(1,2, 100)
> test.toDF().as[Nested].sort("a").take(1)
> {code}
>  *Sorting* big data objects using *Spark Dataframe* fails with the following 
> exception: 
> {code:java}
> 2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized 
> results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize 
> (100.0 MB)
> [Stage 0:==> (12 + 3) / 
> 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total 
> size of serialized results of 13 tasks (100.1 MB) is bigger than 
> spark.driver.maxResu
> {code}
> However, using the *RDD API* works and no exception is thrown: 
> {code:java}
> case class Point(x:Double, y:Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize((1L to 100L).map(a => 
> Nested(a,Seq.fill[Point](25)(Point(1,2, 100)
> test.sortBy(_.a).take(1)
> {code}
> For both code snippets we started the spark shell with exactly the same 
> arguments:
> {code:java}
> spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
> {code}
> Even if we increase spark.driver.maxResultSize, the executors still get 
> killed for our use case. The interesting thing is that when using the RDD API 
> directly the problem is not there. *It looks like there is a bug in DataFrame 
> sort because it is shuffling too much data to the driver?* 
> Note: this is a small example and I reduced spark.driver.maxResultSize to 
> a smaller value, but in our application I've tried setting it to 8GB and, as 
> mentioned above, the job was still killed. 
>  






[jira] [Assigned] (SPARK-32209) Re-use GetTimestamp in ParseToDate

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32209:


Assignee: Apache Spark

> Re-use GetTimestamp in ParseToDate
> --
>
> Key: SPARK-32209
> URL: https://issues.apache.org/jira/browse/SPARK-32209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Replace the combination of the SecondsToTimestamp and UnixTimestamp expressions 
> with GetTimestamp in ParseToDate. This will make it possible to eliminate the 
> unnecessary parsing overhead of string -> timestamp -> long (seconds) -> 
> timestamp -> date. After the changes, the chain will be: string -> timestamp -> date.






[jira] [Assigned] (SPARK-32209) Re-use GetTimestamp in ParseToDate

2020-07-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32209:


Assignee: (was: Apache Spark)

> Re-use GetTimestamp in ParseToDate
> --
>
> Key: SPARK-32209
> URL: https://issues.apache.org/jira/browse/SPARK-32209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Replace the combination of the SecondsToTimestamp and UnixTimestamp expressions 
> with GetTimestamp in ParseToDate. This will make it possible to eliminate the 
> unnecessary parsing overhead of string -> timestamp -> long (seconds) -> 
> timestamp -> date. After the changes, the chain will be: string -> timestamp -> date.






[jira] [Commented] (SPARK-32209) Re-use GetTimestamp in ParseToDate

2020-07-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152641#comment-17152641
 ] 

Apache Spark commented on SPARK-32209:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28999

> Re-use GetTimestamp in ParseToDate
> --
>
> Key: SPARK-32209
> URL: https://issues.apache.org/jira/browse/SPARK-32209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Replace the combination of the SecondsToTimestamp and UnixTimestamp expressions 
> with GetTimestamp in ParseToDate. This will make it possible to eliminate the 
> unnecessary parsing overhead of string -> timestamp -> long (seconds) -> 
> timestamp -> date. After the changes, the chain will be: string -> timestamp -> date.






[jira] [Created] (SPARK-32209) Re-use GetTimestamp in ParseToDate

2020-07-07 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-32209:
--

 Summary: Re-use GetTimestamp in ParseToDate
 Key: SPARK-32209
 URL: https://issues.apache.org/jira/browse/SPARK-32209
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Replace the combination of the SecondsToTimestamp and UnixTimestamp expressions with 
GetTimestamp in ParseToDate. This will make it possible to eliminate the unnecessary 
parsing overhead of string -> timestamp -> long (seconds) -> timestamp -> date. After 
the changes, the chain will be: string -> timestamp -> date.
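
For context, a minimal sketch (not part of the issue itself) of a user-facing query 
that is resolved through ParseToDate with an explicit pattern and therefore benefits 
from the shorter chain; the sample data and pattern below are assumptions:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder().appName("parse-to-date").getOrCreate()
import spark.implicits._

val df = Seq("2020-07-07 12:34:56").toDF("s")

// to_date with a pattern is resolved through ParseToDate; after this change it
// should parse the string once (string -> timestamp -> date) instead of going
// through an intermediate number of seconds.
df.select(to_date(col("s"), "yyyy-MM-dd HH:mm:ss").as("d")).show()
{code}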






[jira] [Comment Edited] (SPARK-31635) Spark SQL Sort fails when sorting big data points

2020-07-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147653#comment-17147653
 ] 

Chen Zhang edited comment on SPARK-31635 at 7/7/20, 10:14 AM:
--

In fact, the RDD API corresponding to _DF.sort().take()_ is _RDD.takeOrdered()_.

The execution logic of _RDD.sortBy().take()_ is reservoir sampling plus global 
bucket sorting, and the requested number of elements is returned after the global 
sorting result is obtained. All major computation is performed in the executor 
processes.

The execution logic of _RDD.takeOrdered()_ is to compute the TOP-K in each RDD 
partition in the executor process (by QuickSelect), and then return each TOP-K 
result to the driver process for merging (by PriorityQueue).

To get the same result, the second method, based on QuickSelect/PriorityQueue, 
obviously has better performance.

I think the implementation of _RDD.takeOrdered()_ can be improved by using a 
configurable option to decide whether the TOP-K merge step occurs in the driver 
process or in the executor process. If it occurs in the driver process, it can 
reduce the time spent waiting for the computation. If it occurs in the executor 
process, it can reduce the memory pressure on the driver process.

Something like the following, in the org.apache.spark.rdd.RDD class:
{code:scala}
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
if (num == 0 || partitions.length == 0) {
  Array.empty
} else {
  if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
val mapRDDs = mapPartitions { items =>
  // Priority keeps the largest elements, so let's reverse the ordering.
  val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
  queue ++= collectionUtils.takeOrdered(items, num)(ord)
  Iterator.single(queue)
}
mapRDDs.reduce { (queue1, queue2) =>
  queue1 ++= queue2
  queue1
}.toArray.sorted(ord)
  } else {
mapPartitions { items =>
  collectionUtils.takeOrdered(items, num)(ord)
}.repartition(1).mapPartitions { items =>
  collectionUtils.takeOrdered(items, num)(ord)
}.collect()
  }
}
  }
{code}
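
For illustration only, a usage sketch of how the merge location could be toggled 
under this proposal, and of the _takeOrdered_ path described above. The configuration 
key is the hypothetical one from the sketch; it does not exist in any released Spark 
version, and the sample data is made up:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("takeOrdered-merge-location")
  // Hypothetical key from the proposal above:
  // false => merge the per-partition TOP-K on an executor instead of the driver.
  .config("spark.rdd.takeOrdered.mergeInDriver", "false")
  .getOrCreate()

case class Nested(a: Long, b: Seq[Double])

val rdd = spark.sparkContext.parallelize(
  (1L to 1000L).map(a => Nested(a, Seq(1.0, 2.0))))

// Equivalent of DF.sort("a").take(1), but via the TOP-K path described above,
// which avoids shipping whole sorted partitions to the driver.
val smallest: Array[Nested] = rdd.takeOrdered(1)(Ordering.by(_.a))
{code}
Whether the merge runs on the driver or on an executor is exactly the trade-off 
discussed above: less waiting time versus less driver memory pressure.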


was (Author: chen zhang):
In fact, the RDD API corresponding to _DF.sort().take()_ is _RDD.takeOrdered()_.

The execution logic of _RDD.sortBy().take()_ is reservoir sampling plus global 
bucket sorting, and the requested number of elements is returned after the global 
sorting result is obtained. All major computation is performed in the executor 
processes.

The execution logic of _RDD.takeOrdered()_ is to compute the TOP-K in each RDD 
partition in the executor process (by QuickSelect), and then return each TOP-K 
result to the driver process for merging (by PriorityQueue).

To get the same result, the second method, based on PriorityQueue, obviously has 
better performance.

I think the implementation of _RDD.takeOrdered()_ can be improved by using a 
configurable option to decide whether the TOP-K merge step occurs in the driver 
process or in the executor process. If it occurs in the driver process, it can 
reduce the time spent waiting for the computation. If it occurs in the executor 
process, it can reduce the memory pressure on the driver process.

Something like the following, in the org.apache.spark.rdd.RDD class:
{code:scala}
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
if (num == 0 || partitions.length == 0) {
  Array.empty
} else {
  if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
val mapRDDs = mapPartitions { items =>
  // Priority keeps the largest elements, so let's reverse the ordering.
  val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
  queue ++= collectionUtils.takeOrdered(items, num)(ord)
  Iterator.single(queue)
}
mapRDDs.reduce { (queue1, queue2) =>
  queue1 ++= queue2
  queue1
}.toArray.sorted(ord)
  } else {
mapPartitions { items =>
  collectionUtils.takeOrdered(items, num)(ord)
}.repartition(1).mapPartitions { items =>
  collectionUtils.takeOrdered(items, num)(ord)
}.collect()
  }
}
  }
{code}

> Spark SQL Sort fails when sorting big data points
> -
>
> Key: SPARK-31635
> URL: https://issues.apache.org/jira/browse/SPARK-31635
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: George George
>Priority: Major
>
>  Please have a look at the example below: 
> {code:java}
> case class Point(x:Double, y:Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize((1L to 100L).map(a => 
> 

[jira] [Commented] (SPARK-31635) Spark SQL Sort fails when sorting big data points

2020-07-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152618#comment-17152618
 ] 

Chen Zhang commented on SPARK-31635:


This problem is not a bug, but I think the code implementation is worth improving.


I will create a new Improvement issue to discuss this and try to submit a PR.

> Spark SQL Sort fails when sorting big data points
> -
>
> Key: SPARK-31635
> URL: https://issues.apache.org/jira/browse/SPARK-31635
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: George George
>Priority: Major
>
>  Please have a look at the example below: 
> {code:java}
> case class Point(x:Double, y:Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize((1L to 100L).map(a => 
> Nested(a,Seq.fill[Point](25)(Point(1,2, 100)
> test.toDF().as[Nested].sort("a").take(1)
> {code}
>  *Sorting* big data objects using *Spark Dataframe* fails with the following 
> exception: 
> {code:java}
> 2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized 
> results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize 
> (100.0 MB)
> [Stage 0:==> (12 + 3) / 
> 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total 
> size of serialized results of 13 tasks (100.1 MB) is bigger than 
> spark.driver.maxResu
> {code}
> However, using the *RDD API* works and no exception is thrown: 
> {code:java}
> case class Point(x:Double, y:Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize((1L to 100L).map(a => 
> Nested(a,Seq.fill[Point](25)(Point(1,2, 100)
> test.sortBy(_.a).take(1)
> {code}
> For both code snippets we started the spark shell with exactly the same 
> arguments:
> {code:java}
> spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
> {code}
> Even if we increase spark.driver.maxResultSize, the executors still get 
> killed for our use case. The interesting thing is that when using the RDD API 
> directly the problem is not there. *It looks like there is a bug in DataFrame 
> sort because it is shuffling too much data to the driver?* 
> Note: this is a small example and I reduced spark.driver.maxResultSize to 
> a smaller value, but in our application I've tried setting it to 8GB and, as 
> mentioned above, the job was still killed. 
>  






[jira] [Comment Edited] (SPARK-31635) Spark SQL Sort fails when sorting big data points

2020-07-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147653#comment-17147653
 ] 

Chen Zhang edited comment on SPARK-31635 at 7/7/20, 10:12 AM:
--

In fact, the RDD API corresponding to _DF.sort().take()_ is _RDD.takeOrdered()_.

The execution logic of _RDD.sortBy().take()_ is reservoir sampling plus global 
bucket sorting, and the requested number of elements is returned after the global 
sorting result is obtained. All major computation is performed in the executor 
processes.

The execution logic of _RDD.takeOrdered()_ is to compute the TOP-K in each RDD 
partition in the executor process (by QuickSelect), and then return each TOP-K 
result to the driver process for merging (by PriorityQueue).

To get the same result, the second method, based on PriorityQueue, obviously has 
better performance.

I think the implementation of _RDD.takeOrdered()_ can be improved by using a 
configurable option to decide whether the TOP-K merge step occurs in the driver 
process or in the executor process. If it occurs in the driver process, it can 
reduce the time spent waiting for the computation. If it occurs in the executor 
process, it can reduce the memory pressure on the driver process.

Something like the following, in the org.apache.spark.rdd.RDD class:
{code:scala}
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
if (num == 0 || partitions.length == 0) {
  Array.empty
} else {
  if (conf.getBoolean("spark.rdd.takeOrdered.mergeInDriver", true)) {
val mapRDDs = mapPartitions { items =>
  // Priority keeps the largest elements, so let's reverse the ordering.
  val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
  queue ++= collectionUtils.takeOrdered(items, num)(ord)
  Iterator.single(queue)
}
mapRDDs.reduce { (queue1, queue2) =>
  queue1 ++= queue2
  queue1
}.toArray.sorted(ord)
  } else {
mapPartitions { items =>
  collectionUtils.takeOrdered(items, num)(ord)
}.repartition(1).mapPartitions { items =>
  collectionUtils.takeOrdered(items, num)(ord)
}.collect()
  }
}
  }
{code}


was (Author: chen zhang):
In fact, the RDD API corresponding to _DF.sort().take()_ is _RDD.takeOrdered()_.

The execution logic of _RDD.sortBy().take()_ is reservoir sampling plus global 
bucket sorting, and the requested number of elements is returned after the global 
sorting result is obtained. All major computation is performed in the executor 
processes.

The execution logic of _RDD.takeOrdered()_ is to compute the TOP-K (by PriorityQueue) 
in each RDD partition in the executor process, and then return each TOP-K result 
to the driver process for merging.

To get the same result, the second method, based on PriorityQueue, obviously has 
better performance.

I think the implementation of _RDD.takeOrdered()_ can be improved by using a 
configurable option to decide whether the TOP-K merge step occurs in the driver 
process or in the executor process. If it occurs in the driver process, it can 
reduce the time spent waiting for the computation. If it occurs in the executor 
process, it can reduce the memory pressure on the driver process.

Something like the following, in the org.apache.spark.rdd.RDD class:
{code:scala}
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
if (num == 0 || partitions.length == 0) {
  Array.empty
} else {
  if (conf.getBoolean("spark.rdd.take.ordered.driver.merge", true)) {
val mapRDDs = mapPartitions { items =>
  // Priority keeps the largest elements, so let's reverse the ordering.
  val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
  queue ++= collectionUtils.takeOrdered(items, num)(ord)
  Iterator.single(queue)
}
mapRDDs.reduce { (queue1, queue2) =>
  queue1 ++= queue2
  queue1
}.toArray.sorted(ord)
  } else {
mapPartitions { items =>
  collectionUtils.takeOrdered(items, num)(ord)
}.repartition(1).mapPartitions { items =>
  collectionUtils.takeOrdered(items, num)(ord)
}.collect()
  }
}
  }
{code}

> Spark SQL Sort fails when sorting big data points
> -
>
> Key: SPARK-31635
> URL: https://issues.apache.org/jira/browse/SPARK-31635
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: George George
>Priority: Major
>
>  Please have a look at the example below: 
> {code:java}
> case class Point(x:Double, y:Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize((1L to 100L).map(a => 
> Nested(a,Seq.fill[Point](25)(Point(1,2, 100)
> 
